Week 2
(June 01, 2025 – June 08, 2025)
Meeting 1
Date: June 04, 2025
Attendees:
Summary:
- Presented my findings and architectural decisions, and received input from the mentors.
Document:
0. Recap (already agreed — not for re-discussion)
- Goal: merge N compliance reports of one format → one authoritative report of the same format, with per-field traceability and traceable edits.
- Scope = same-format only. Same format in → same format out. No cross-format conversion (that's translation, already solved elsewhere). SPDX is handled as two independent pipelines: SPDX 2 merges with SPDX 2, SPDX 3 merges with SPDX 3 — never mixed.
- Formats: ReadMeOSS · CycloneDX · SPDX 2 · SPDX 3 · CLIXML · DEP5 (6 independent pipelines).
- Build shape: standalone Python project (Apache-2.0); later wired into FOSSology via a thin GPL-2.0 subprocess shim agent. (Standalone unlocks the Apache-2.0 Python SBOM ecosystem that an in-tree GPL-2.0 agent could not link.)
1. Architecture: a 4-stage pipeline around one canonical IR
INGEST → CANONICAL IR → MERGE → RENDER
- INGEST (format → IR): fed by fossology-python (REST pull).
- CANONICAL IR: one in-memory graph (nodes + relationships).
- MERGE (N IRs → 1 IR, deep, per-field provenance): emits the provenance map + edit log (RFC-6902 patches).
- RENDER (IR → format): same format out, + .docx / quality score.
Everything hinges on the canonical IR: ingest each format into one shape, merge once (format-agnostic), then render back. Provenance and the edit log live on the IR, not on any one format. The IR choice is the single most important technical decision (see §8 Q1).
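To make the four stages concrete, here is a minimal sketch of an IR-centred pipeline. The `Node` and `IR` classes, the `name@version` node id, and the "first input to set a field wins" rule are all assumptions made for illustration; none of them is a decision recorded in this document.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One package-like entry in the hypothetical IR graph."""
    node_id: str
    fields: dict
    # field name -> id of the input document that supplied the value
    provenance: dict = field(default_factory=dict)


@dataclass
class IR:
    """In-memory graph: nodes plus (source, relation, target) edges."""
    nodes: dict = field(default_factory=dict)
    relationships: set = field(default_factory=set)


def ingest(doc_id, packages, edges=()):
    """Stage 1: turn one parsed report into an IR (format-specific in reality)."""
    ir = IR()
    for pkg in packages:
        node_id = f"{pkg['name']}@{pkg['version']}"  # assumed identity rule
        ir.nodes[node_id] = Node(node_id, dict(pkg), {k: doc_id for k in pkg})
    ir.relationships.update(edges)
    return ir


def merge(irs):
    """Stage 3: N IRs -> 1 IR; the first input to set a field wins here."""
    out = IR()
    for ir in irs:
        for node_id, node in ir.nodes.items():
            target = out.nodes.setdefault(node_id, Node(node_id, {}))
            for key, value in node.fields.items():
                if key not in target.fields:
                    target.fields[key] = value
                    target.provenance[key] = node.provenance[key]
        out.relationships |= ir.relationships
    return out


def render(ir):
    """Stage 4: IR -> plain records (a real renderer would emit the input format)."""
    return [ir.nodes[node_id].fields for node_id in sorted(ir.nodes)]
```

Because provenance is stored on the IR node rather than in any one format's data model, the same `merge` would serve every format that has an `ingest` and a `render`.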
2. How tools plug in — three integration modes
The whole research was framed around how a tool can be consumed, because that drives both licensing and packaging:
| Mode | Mechanism | When we use it | Cost |
|---|---|---|---|
| A — Library import | pip install + import | Python libs (Apache/MIT/BSD) — read/write/merge in-process | Cleanest; full data fidelity |
| B — Subprocess | subprocess.run([...]) | Best-in-class tools written in Go/.NET (no Python lib) | Sidecar binary to package; JSON across the boundary |
| HTTP | requests | Remote services (enrichment) | Network dependency; no linking |
Because we're standalone Apache-2.0, licensing no longer forces Mode B — Mode B is now used only when the best tool isn't available in Python (Go/.NET), never to dodge a licence.
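As an illustration of Mode B, the sketch below passes JSON to an external process on stdin and parses JSON from its stdout. The helper name `run_sidecar` is hypothetical, and the "tool" is a Python one-liner standing in for a real Go/.NET binary, whose actual command-line flags are deliberately not shown.

```python
import json
import subprocess
import sys


def run_sidecar(argv, payload, timeout=60):
    """Mode B: send JSON on stdin to an external tool, parse JSON from stdout."""
    proc = subprocess.run(
        argv,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return json.loads(proc.stdout)


# Stand-in "tool": echoes its JSON input and adds a marker field.
# A packaged sidecar binary would take its place in a real deployment.
ECHO_TOOL = [
    sys.executable,
    "-c",
    "import json, sys; d = json.load(sys.stdin); d['seen'] = True; print(json.dumps(d))",
]

result = run_sidecar(ECHO_TOOL, {"components": 3})
```

Keeping the boundary to one function makes the packaging cost visible: everything that crosses it has to survive a JSON round trip, which is the fidelity trade-off noted in the table.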
3. Tool landscape — researched, by role
39 tools surveyed; below are the ones that earn a place, grouped by the job they do. Legend: ✅ use · 🟡 optional · ❌ rejected · Mode A/B/HTTP.
3.1 Canonical IR & core read/write (Mode A, Python)
| Tool | Lic | Use | Role |
|---|---|---|---|
| lib4sbom | Apache-2.0 | ✅ | Single Python abstraction over SPDX 2.3 + CDX 1.4–1.5 → IR backbone candidate |
| spdx-tools (official) | Apache-2.0 | ✅ | SPDX 2.2/2.3 read/write + validation |
| cyclonedx-python-lib (official) | Apache-2.0 | ✅ | CDX 1.x read/write/validate |
| spdx-python-model | Apache-2.0 | ✅ | SPDX 3.0.1 bindings (the realistic SPDX-3 entry point) |
Tension: lib4sbom is the cleanest single IR but only covers SPDX 2.3 + CDX 1.4–1.5 — not SPDX 3, and none of the FOSSology-native text/XML formats (ReadMeOSS, CLIXML, DEP5, .docx). So lib4sbom alone can't be the universal IR. → §8 Q1.
3.2 Merge primitives (the actual "combine" logic)
| Tool | Lang/Lic | Mode | Use | Merge semantics it brings |
|---|---|---|---|---|
| SPDXMerge (Philips) | Py / MIT | A | ✅ | SPDX 2.3 deep vs shallow (externalDocumentRefs) baseline |
| spdx3merge (JPEWdev) | Py / MIT | A | ✅ | Only viable SPDX 3 merger — but v0.0.3, early (risk → §8 Q5) |
| sbomasm (Interlynk) | Go / Apache | B | 🟡 | hierarchical / flat / augment / overwrite assembly modes |
| cyclonedx-cli (spec authors) | .NET / Apache | B | 🟡 | authoritative CDX hierarchical merge w/ parent metadata |
| sbommerge (Harrison) | Py / Apache | — | ❌ | 2-input only; study the per-package conflict rule, don't depend |
3.3 Diff / edit / quality / enrichment
| Tool | Lang/Lic | Mode | Use | Role |
|---|---|---|---|---|
| jsonpatch | Py / BSD | A | ✅ | RFC-6902 = the edit-history primitive. Every UI edit = one replayable patch |
| sbomdiff (Harrison) | Py / Apache | A | ✅ | per-package diff (SPDX+CDX); extend for ReadMeOSS/CLIXML; drives 3-way merge |
| sbomqs (Interlynk) | Go / Apache | B | ✅ | post-merge quality score vs NTIA/BSI/FSCT/OpenChain (CRA-readiness badge) |
| sbom-utility | Go / Apache | B | 🟡 | RFC-6902 patch + SQL-like query + validate (CI gate) |
| ClearlyDefined | HTTP / CC0 | HTTP | ✅ | fill license/copyright gaps post-merge |
3.4 Connectors (ecosystem integration)
| Tool | Lic | Use | Role |
|---|---|---|---|
| fossology-python | MIT | ✅ | Primary ingest — pull all report formats via REST (incl. SPDX 3, CLIXML) |
| sw360python | MIT | ✅ | Primary SW360 push connector |
| CaPyCLI | MIT | 🟡 | reference for SW360 release-attachment / clearing-report patterns |
3.5 Output / rendering helpers
| Tool | Lic | Use | Role |
|---|---|---|---|
| python-docx | MIT | ✅ | re-render Unified Report .docx from IR (port of unifiedreport template) |
| lxml | BSD | ✅ | CLIXML + CDX-XML round-trip (stdlib etree too weak for the XPath/XSLT we need) |
| siemens-standard-bom | MIT | 🟡 | optional Siemens/SW360-flavoured CDX output |
3.6 Studied as reference, not dependencies
protobom/bomctl (Go — borrow the graph-IR + cache-DB idea, reach via CLI only) ·
sbom-manager (file catalogue, wrong layer) · ORT (Freemarker NOTICE templates — idea only) ·
Dependency-Track/GUAC/Trustify (aggregate inventories, not documents — possible future
sinks) · sbomify, Microsoft sbom-tool (competing platforms — UX/aggregation reference) ·
Syft/Trivy (generate, we aggregate).
4. Per-format strategy (tool + approach per format)
This is where the abstract IR meets reality — each format needs a different merge approach:
| Format | Read | Merge approach | Write | Hardest part |
|---|---|---|---|---|
| SPDX 2.3 | spdx-tools / lib4sbom | SPDXMerge (deep/shallow) | spdx-tools | dedup key across docs (§8 Q3) |
| SPDX 3 | spdx-python-model | spdx3merge (extend) | spdx-python-model | tool maturity (v0.0.3) |
| CycloneDX | cyclonedx-python-lib | our own in IR; optional sbomasm/cyclonedx-cli | cyclonedx-python-lib | flat vs hierarchical default |
| ReadMeOSS | custom parser | build new — section-aware text merge, de-dup license blocks by SPDX-id | text | no tool exists; it's NOTICE text |
| CLIXML | lxml | build new — XML merge preserving <audit> + source attribution | lxml | recovering schema from src/clixml/ |
| DEP5 | custom parser | build new — paragraph union, Files: glob overlap detection | text | overlap conflicts |
| Unified Report | (from IR) | re-render, never byte-merge .docx | python-docx | ~16-section template port (1–2 wks) |
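For the DEP5 row, one possible way to detect `Files:` overlap conflicts is to evaluate every input's paragraphs against a concrete file list and report the paths where the inputs disagree. The sketch assumes each input is already parsed into `(patterns, licence)` pairs, and it approximates DEP5 globbing with `fnmatchcase` (which also treats `[` specially, unlike DEP5). Function names are hypothetical.

```python
from fnmatch import fnmatchcase


def license_for(path, paragraphs):
    """Return the licence of the last paragraph whose Files patterns match."""
    found = None
    for patterns, licence in paragraphs:
        if any(fnmatchcase(path, pattern) for pattern in patterns):
            found = licence  # in DEP5 the last matching paragraph applies
    return found


def overlap_conflicts(paths, inputs):
    """Map path -> {input id: licence} for paths where the inputs disagree."""
    conflicts = {}
    for path in paths:
        verdicts = {doc: license_for(path, paras) for doc, paras in inputs.items()}
        # Inputs that say nothing about a path do not count as disagreement.
        verdicts = {doc: lic for doc, lic in verdicts.items() if lic is not None}
        if len(set(verdicts.values())) > 1:
            conflicts[path] = verdicts
    return conflicts
```

Comparing resolved licences per file, rather than comparing glob strings, sidesteps the harder question of whether two patterns can ever match the same path.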
Cross-cutting merge-semantics decisions (need a default each):
- Deep vs shallow merge (inline everything vs. keep externalDocumentRefs).
- Flat vs hierarchical vs augment (CDX/sbomasm modes: flat union? product-rollup parent? enrich-primary?).
- Package identity / dedup key — purl? name+version? SPDX-id? — decides what counts as "the same package."
- Field-conflict policy — same package, different license across inputs: last-writer / union / flag-for-review?
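The last two decisions above can be sketched together: a dedup key decides which records are grouped, and a conflict policy decides what happens when grouped records disagree. The key order (purl first, then name+version) and the policy names `"last-writer"`, `"union"` and `"flag"` are assumptions for illustration, not defaults that were agreed.

```python
def identity_key(pkg):
    """Prefer purl; fall back to lower-cased name plus version."""
    if pkg.get("purl"):
        return ("purl", pkg["purl"])
    return ("name-version", pkg["name"].lower(), pkg.get("version", ""))


def merge_packages(docs, policy="flag"):
    """Group packages by identity key, then resolve each field under one policy.

    Field values are assumed to be hashable (e.g. strings).
    """
    grouped = {}
    for doc in docs:
        for pkg in doc:
            grouped.setdefault(identity_key(pkg), []).append(pkg)

    merged, review = [], []
    for key, group in grouped.items():
        out = {}
        for name in dict.fromkeys(f for pkg in group for f in pkg):
            values = [pkg[name] for pkg in group if name in pkg]
            distinct = list(dict.fromkeys(values))  # de-duplicate, keep order
            if len(distinct) == 1 or policy == "last-writer":
                out[name] = values[-1]
            elif policy == "union":
                out[name] = distinct
            else:  # "flag": keep all candidates and mark the field for review
                out[name] = distinct
                review.append((key, name))
        merged.append(out)
    return merged, review
```

Note the limitation this exposes: a record with a purl and a record for the same package without one land under different keys, which is one reason the dedup key needs an explicit decision.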
5. The genuinely new work (no tool provides it)
These are the differentiators — the reason this is a project and not a shell script:
- Per-field provenance map: {json-pointer → source-doc-id}. No merger emits this. This is what powers the "source ribbon" UI.
- Three-way merge: base ⊕ upstreamNew ⊕ userEdits → newAggregate; the SBOM equivalent of git merge. Built on jsonpatch + sbomdiff. Nobody does this for SBOMs.
- Replayable edit log: every correction stored as an RFC-6902 patch, attributed to a user; aggregate = merge(inputs) ⊕ apply(patches). Survives re-ingest of changed inputs.
- The three text/XML mergers (ReadMeOSS, CLIXML, DEP5) and the .docx re-renderer.
- Aggregation View UI — provenance ribbon, inline edit → patch, conflict resolution, quality badge.
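The provenance map and the replayable edit log from the list above can be sketched as follows. To stay dependency-free, the example uses a small hand-written subset of RFC 6902 (`add`, `replace`, `remove`, no `-` array index, no existence checks) as a stand-in for the jsonpatch library. The document shape (`{"packages": {name: fields}}`) and all function names are assumptions.

```python
import copy


def _escape(token):
    """Escape a key for use inside an RFC 6901 JSON Pointer."""
    return token.replace("~", "~0").replace("/", "~1")


def _walk(doc, pointer):
    """Resolve a JSON Pointer to (parent container, final key)."""
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer.split("/")[1:]]
    for token in tokens[:-1]:
        doc = doc[int(token)] if isinstance(doc, list) else doc[token]
    last = tokens[-1]
    return doc, (int(last) if isinstance(doc, list) else last)


def apply_patch(doc, patch):
    """Apply add/replace/remove operations to a copy of doc."""
    doc = copy.deepcopy(doc)
    for op in patch:
        parent, key = _walk(doc, op["path"])
        if op["op"] == "remove":
            del parent[key]
        elif op["op"] == "add" and isinstance(parent, list):
            parent.insert(key, op["value"])
        elif op["op"] in ("add", "replace"):
            parent[key] = op["value"]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return doc


def merge_with_provenance(inputs):
    """Union package fields; record which input supplied each field."""
    aggregate, provenance = {"packages": {}}, {}
    for doc_id, doc in inputs.items():
        for name, fields in doc["packages"].items():
            target = aggregate["packages"].setdefault(name, {})
            for fname, value in fields.items():
                if fname not in target:  # first input to supply a field wins
                    target[fname] = value
                    provenance[f"/packages/{_escape(name)}/{_escape(fname)}"] = doc_id
    return aggregate, provenance


def rebuild(inputs, edit_log):
    """aggregate = merge(inputs) followed by replaying every stored patch."""
    aggregate, provenance = merge_with_provenance(inputs)
    for entry in edit_log:
        aggregate = apply_patch(aggregate, entry["patch"])
        for op in entry["patch"]:
            if op["op"] == "remove":
                provenance.pop(op["path"], None)
            else:
                provenance[op["path"]] = f"user:{entry['user']}"
    return aggregate, provenance
```

Since the log stores patches rather than a finished document, calling `rebuild` again after an input changes re-applies the same corrections on top of the fresh merge.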
6. Quality & validation strategy
- sbomqs after every merge → store score with the aggregate, surface as a CRA/NTIA/BSI badge. Guards against "merge silently degraded the SBOM."
- Golden-file tests per format — N known-good FOSSology reports → merge → snapshot output, + assert sbomqs threshold.
- Optional sbom-utility validate as a CI gate.
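A golden-file check along these lines could look like the sketch below. It assumes the merged output is JSON-serialisable and that the quality score has already been obtained elsewhere and is passed in as a number; invoking sbomqs itself is not shown. Recording a snapshot automatically when none exists is a design choice of this sketch, and `check_golden` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path


def canonical(doc):
    """Serialise deterministically so snapshots do not churn on key order."""
    return json.dumps(doc, sort_keys=True, indent=2) + "\n"


def check_golden(merged, golden_path, score, min_score, update=False):
    """Return a list of failures; an empty list means the case passes."""
    golden_path = Path(golden_path)
    text = canonical(merged)
    if update or not golden_path.exists():
        golden_path.write_text(text, encoding="utf-8")  # (re)record snapshot
    failures = []
    if golden_path.read_text(encoding="utf-8") != text:
        failures.append("output differs from snapshot")
    if score < min_score:
        failures.append(f"quality score {score} below threshold {min_score}")
    return failures


# Demo: record a snapshot, re-check it, then simulate a regression.
with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "case1.json"
    first = check_golden({"b": 1, "a": 2}, golden, score=8.0, min_score=7.0)
    same = check_golden({"a": 2, "b": 1}, golden, score=8.0, min_score=7.0)
    drift = check_golden({"a": 3, "b": 1}, golden, score=6.0, min_score=7.0)
```

Returning both failures together, instead of stopping at the first, shows in one run whether a merge changed the output, degraded the score, or both.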
7. Phased plan (technical milestones)
| Phase | Deliverable | Key deps |
|---|---|---|
| 0 | Repo bootstrap (Apache-2.0, CI, REUSE) | — |
| 1 | Core: SPDX 2.3 + CDX merge, N→1, deep, per-field provenance | lib4sbom, spdx-tools, cyclonedx-python-lib, SPDXMerge, jsonpatch |
| 2 | FOSSology-native: ReadMeOSS, CLIXML, DEP5, Unified .docx | lxml, python-docx |
| 3 | SPDX 3 | spdx-python-model, spdx3merge |
| 4 | Cache + replayable edit log | SQLite/SQLAlchemy |
| 5 | Three-way merge + conflict surfacing | sbomdiff (extended) |
| 6 | Web UI (ribbon, inline edit, sbomqs badge) | FastAPI + React |
| 7 | FOSSology shim agent (src/reportaggregator/) | — |
| 8 | SW360 push (release attachment) | sw360python |
| 9 | Hardening (CI quality gate, ClearlyDefined enrichment) | — |
Discussion Regarding the Presented Architecture
- This proposed approach was dropped after feedback from Kaushl. He pointed out that the architecture had been designed assuming we would merge multiple reports of different formats and generate output in any supported format. However, our actual use case involves merging multiple reports of the same format and producing output in that same format, making the proposed design unnecessarily complex.
- Kaushl recommended adopting a simpler pipeline-based architecture following the flow: Ingestion → Loading → Merging → Rendering. This design would leverage custom adapters, mappings, and parsers, eliminating the need for an intermediate model while better aligning with the project requirements.
- During the evaluation of open-source tooling, I initially focused on tools with licenses that are not ideal for integration with FOSSology and SW360. Kaushl advised prioritizing tools released under the MIT License to ensure better compatibility and compliance with the FOSSology ecosystem.
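The simpler flow suggested in the discussion above (Ingestion → Loading → Merging → Rendering, with per-format adapters and no intermediate model) might be shaped roughly like this. The `FormatAdapter` protocol, the registry, and the toy "linelist" format are hypothetical illustrations, not the design that was agreed.

```python
from typing import Protocol


class FormatAdapter(Protocol):
    """Each format supplies its own load, merge and render steps."""

    def load(self, raw: str) -> list: ...
    def merge(self, loaded: list) -> list: ...
    def render(self, merged: list) -> str: ...


class LineListAdapter:
    """Toy stand-in format: one 'name licence' pair per line."""

    def load(self, raw):
        return [tuple(line.split(maxsplit=1)) for line in raw.splitlines() if line.strip()]

    def merge(self, loaded):
        seen = {}
        for report in loaded:
            for name, licence in report:
                seen.setdefault(name, licence)  # first report to list a name wins
        return sorted(seen.items())

    def render(self, merged):
        return "".join(f"{name} {licence}\n" for name, licence in merged)


# One adapter per supported format; same format in, same format out.
ADAPTERS = {"linelist": LineListAdapter()}


def aggregate(fmt, raw_reports):
    """Run loading, merging and rendering for already-ingested reports."""
    adapter = ADAPTERS[fmt]
    loaded = [adapter.load(raw) for raw in raw_reports]
    return adapter.render(adapter.merge(loaded))
```

Each adapter keeps data in its format's native structures, so nothing is lost to a shared model; the cost is that merge logic is written once per format.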
Next Steps
- Explore the approach Kaushl suggested and look into open-source tools under the MIT License that can be integrated with FOSSology for this project.