
Week 2

(June 01, 2025 – June 08, 2025)

Meeting 1

Date: June 04, 2025
Attendees:

Summary:

  • Presented my research findings and architectural decisions, and received input from the mentors.

Document:

0. Recap (already agreed — not for re-discussion)

  • Goal: merge N compliance reports of one format → one authoritative report of the same format, with per-field traceability and traceable edits.
  • Scope = same-format only. Same format in → same format out. No cross-format conversion (that's translation, already solved elsewhere). SPDX is handled as two independent pipelines: SPDX 2 merges with SPDX 2, SPDX 3 merges with SPDX 3 — never mixed.
  • Formats: ReadMeOSS · CycloneDX · SPDX 2 · SPDX 3 · CLIXML · DEP5 (6 independent pipelines).
  • Build shape: standalone Python project (Apache-2.0); later wired into FOSSology via a thin GPL-2.0 subprocess shim agent. (Standalone unlocks the Apache-2.0 Python SBOM ecosystem that an in-tree GPL-2.0 agent could not link.)

1. Architecture: a 4-stage pipeline around one canonical IR

  INGEST            →   CANONICAL IR            →   MERGE                   →   RENDER
  (format → IR)         (one in-memory graph:       (N IRs → 1 IR, deep,        (IR → format)
                         nodes + relationships)      per-field provenance)
    ▲                                                 │                           │
    │                                                 ▼                           ▼
  fossology-python      provenance map + edit log (RFC-6902 patches)            same format out
  (REST pull) ───────────────────────────────────────────────────────────────►  + .docx / quality score

Everything hinges on the canonical IR: ingest each format into one shape, merge once (format-agnostic), then render back. Provenance and the edit log live on the IR, not on any one format. The IR choice is the single most important technical decision (see §8 Q1).
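
To make the four stages concrete, here is a minimal sketch of the flow, assuming a toy graph-shaped IR. The names `IRDocument`, `ingest`, `merge`, and `render` are hypothetical and not part of any surveyed tool; the first-writer-wins rule is only a placeholder for a real merge policy.

```python
from dataclasses import dataclass, field


@dataclass
class IRDocument:
    """Hypothetical canonical IR: nodes keyed by id, plus typed relationships."""
    source_id: str
    nodes: dict[str, dict] = field(default_factory=dict)
    relationships: list[tuple[str, str, str]] = field(default_factory=list)


def ingest(source_id: str, packages: list[dict]) -> IRDocument:
    """INGEST stand-in: turn already-parsed package records into an IR."""
    doc = IRDocument(source_id=source_id)
    for pkg in packages:
        node_id = f"{pkg['name']}@{pkg['version']}"
        doc.nodes[node_id] = dict(pkg)
        doc.relationships.append((source_id, "DESCRIBES", node_id))
    return doc


def merge(docs: list[IRDocument]) -> IRDocument:
    """MERGE stand-in: N IRs -> 1 IR; the first input to set a field wins."""
    merged = IRDocument(source_id="aggregate")
    for doc in docs:
        for node_id, fields in doc.nodes.items():
            target = merged.nodes.setdefault(node_id, {})
            for key, value in fields.items():
                target.setdefault(key, value)
        merged.relationships.extend(doc.relationships)
    return merged


def render(doc: IRDocument) -> str:
    """RENDER stand-in: emit one 'id: license' line per node."""
    return "\n".join(
        f"{node_id}: {fields.get('license', 'NOASSERTION')}"
        for node_id, fields in sorted(doc.nodes.items())
    )


a = ingest("report-a", [{"name": "zlib", "version": "1.3", "license": "Zlib"}])
b = ingest("report-b", [
    {"name": "zlib", "version": "1.3"},
    {"name": "curl", "version": "8.0", "license": "curl"},
])
merged = merge([a, b])
out = render(merged)
```

Because the merge step only ever sees `IRDocument` objects, it stays format-agnostic; everything format-specific lives in the ingest and render stages.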


2. How tools plug in — three integration modes

The whole research was framed around how a tool can be consumed, because that drives both licensing and packaging:

| Mode | Mechanism | When we use it | Cost |
|---|---|---|---|
| A: Library import | pip install + import | Python libs (Apache/MIT/BSD): read/write/merge in-process | Cleanest; full data fidelity |
| B: Subprocess | subprocess.run([...]) | Best-in-class tools written in Go/.NET (no Python lib) | Sidecar binary to package; JSON across the boundary |
| HTTP | requests | Remote services (enrichment) | Network dependency; no linking |

Because we're standalone Apache-2.0, licensing no longer forces Mode B — Mode B is now used only when the best tool isn't available in Python (Go/.NET), never to dodge a licence.
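
As an illustration of Mode B, the sketch below passes JSON to a child process and parses the JSON it prints. The sidecar here is a stand-in (a Python one-liner run through `sys.executable`), not a real Go/.NET tool, and `run_sidecar` is a hypothetical helper; a real integration would substitute the tool's actual command line.

```python
import json
import subprocess
import sys

# Stand-in for a sidecar binary: reads JSON on stdin, writes JSON on stdout.
# A real Go/.NET tool would replace this command list.
FAKE_TOOL = [
    sys.executable, "-c",
    "import json, sys; doc = json.load(sys.stdin); "
    "json.dump({'count': len(doc['components'])}, sys.stdout)",
]


def run_sidecar(cmd: list[str], payload: dict, timeout: float = 30.0) -> dict:
    """Mode B: send JSON to a subprocess and parse the JSON it prints.

    check=True raises CalledProcessError on a non-zero exit code, so a
    failing tool cannot silently produce an empty result.
    """
    proc = subprocess.run(
        cmd,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout)


result = run_sidecar(FAKE_TOOL, {"components": [{"name": "zlib"}, {"name": "curl"}]})
```

Keeping the boundary as plain JSON means the calling code never links against the tool, at the cost of packaging the binary alongside the project.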


3. Tool landscape — researched, by role

39 tools surveyed; below are the ones that earn a place, grouped by the job they do. Legend: ✅ use · 🟡 optional · ❌ rejected · Mode A/B/HTTP.

3.1 Canonical IR & core read/write (Mode A, Python)

| Tool | Lic | Use | Role |
|---|---|---|---|
| lib4sbom | Apache-2.0 | | Single Python abstraction over SPDX 2.3 + CDX 1.4–1.5 → IR backbone candidate |
| spdx-tools (official) | Apache-2.0 | | SPDX 2.2/2.3 read/write + validation |
| cyclonedx-python-lib (official) | Apache-2.0 | | CDX 1.x read/write/validate |
| spdx-python-model | Apache-2.0 | | SPDX 3.0.1 bindings (the realistic SPDX-3 entry point) |

Tension: lib4sbom is the cleanest single IR, but it covers only SPDX 2.3 + CDX 1.4–1.5. It does not cover SPDX 3 or any of the FOSSology-native text/XML formats (ReadMeOSS, CLIXML, DEP5, .docx), so lib4sbom alone can't be the universal IR. → §8 Q1.

3.2 Merge primitives (the actual "combine" logic)

| Tool | Lang/Lic | Mode | Use | Merge semantics it brings |
|---|---|---|---|---|
| SPDXMerge (Philips) | Py / MIT | A | | SPDX 2.3 deep vs shallow (externalDocumentRefs) baseline |
| spdx3merge (JPEWdev) | Py / MIT | A | | Only viable SPDX 3 merger, but v0.0.3, early (risk → §8 Q5) |
| sbomasm (Interlynk) | Go / Apache | B | 🟡 | hierarchical / flat / augment / overwrite assembly modes |
| cyclonedx-cli (spec authors) | .NET / Apache | B | 🟡 | authoritative CDX hierarchical merge w/ parent metadata |
| sbommerge (Harrison) | Py / Apache | | | 2-input only; study the per-package conflict rule, don't depend |

3.3 Diff / edit / quality / enrichment

| Tool | Lang/Lic | Mode | Use | Role |
|---|---|---|---|---|
| jsonpatch | Py / BSD | A | | RFC-6902 = the edit-history primitive. Every UI edit = one replayable patch |
| sbomdiff (Harrison) | Py / Apache | A | | per-package diff (SPDX+CDX); extend for ReadMeOSS/CLIXML; drives 3-way merge |
| sbomqs (Interlynk) | Go / Apache | B | | post-merge quality score vs NTIA/BSI/FSCT/OpenChain (CRA-readiness badge) |
| sbom-utility | Go / Apache | B | 🟡 | RFC-6902 patch + SQL-like query + validate (CI gate) |
| ClearlyDefined | HTTP / CC0 | HTTP | | fill license/copyright gaps post-merge |

3.4 Connectors (ecosystem integration)

| Tool | Lic | Use | Role |
|---|---|---|---|
| fossology-python | MIT | | Primary ingest: pull all report formats via REST (incl. SPDX 3, CLIXML) |
| sw360python | MIT | | Primary SW360 push connector |
| CaPyCLI | MIT | 🟡 | reference for SW360 release-attachment / clearing-report patterns |

3.5 Output / rendering helpers

| Tool | Lic | Use | Role |
|---|---|---|---|
| python-docx | MIT | | re-render Unified Report .docx from IR (port of unifiedreport template) |
| lxml | BSD | | CLIXML + CDX-XML round-trip (stdlib etree too weak for the XPath/XSLT we need) |
| siemens-standard-bom | MIT | 🟡 | optional Siemens/SW360-flavoured CDX output |

3.6 Studied as reference, not dependencies

protobom/bomctl (Go — borrow the graph-IR + cache-DB idea, reach via CLI only) · sbom-manager (file catalogue, wrong layer) · ORT (Freemarker NOTICE templates — idea only) · Dependency-Track/GUAC/Trustify (aggregate inventories, not documents — possible future sinks) · sbomify, Microsoft sbom-tool (competing platforms — UX/aggregation reference) · Syft/Trivy (generate, we aggregate).


4. Per-format strategy (tool + approach per format)

This is where the abstract IR meets reality — each format needs a different merge approach:

| Format | Read | Merge approach | Write | Hardest part |
|---|---|---|---|---|
| SPDX 2.3 | spdx-tools / lib4sbom | SPDXMerge (deep/shallow) | spdx-tools | dedup key across docs (§8 Q3) |
| SPDX 3 | spdx-python-model | spdx3merge (extend) | spdx-python-model | tool maturity (v0.0.3) |
| CycloneDX | cyclonedx-python-lib | our own in IR; optional sbomasm/cyclonedx-cli | cyclonedx-python-lib | flat vs hierarchical default |
| ReadMeOSS | custom parser | build new: section-aware text merge, de-dup license blocks by SPDX-id | text | no tool exists; it's NOTICE text |
| CLIXML | lxml | build new: XML merge preserving <audit> + source attribution | lxml | recovering schema from src/clixml/ |
| DEP5 | custom parser | build new: paragraph union, Files: glob overlap detection | text | overlap conflicts |
| Unified Report | (from IR) | re-render, never byte-merge .docx | python-docx | ~16-section template port (1–2 wks) |
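
The DEP5 row's "Files: glob overlap detection" could start from something like the sketch below. It approximates DEP5 wildcards with `fnmatch.fnmatchcase` (where `*` also matches `/`, as in DEP5, but `[...]` is treated specially, which DEP5 does not do). The function name and the paragraph ids are hypothetical. Within a single DEP5 file the last matching paragraph wins, so the result only lists candidates for review after paragraphs from several reports are combined.

```python
from fnmatch import fnmatchcase


def overlapping_paths(paragraphs: dict[str, list[str]], paths: list[str]) -> dict[str, list[str]]:
    """Return {path: [paragraph ids]} for every path matched by more than one paragraph."""
    overlaps = {}
    for path in paths:
        owners = [
            paragraph_id
            for paragraph_id, patterns in paragraphs.items()
            if any(fnmatchcase(path, pattern) for pattern in patterns)
        ]
        if len(owners) > 1:
            overlaps[path] = owners
    return overlaps


# Files: patterns per paragraph, keyed by a hypothetical "<report>#<index>" id.
paragraphs = {
    "report-a#1": ["*"],
    "report-a#2": ["src/*"],
    "report-b#1": ["src/vendor/*"],
}
paths = ["README", "src/main.c", "src/vendor/zlib/inflate.c"]
conflicts = overlapping_paths(paragraphs, paths)
```

Checking against a concrete file list keeps the sketch simple; deciding overlap between two patterns without a file list is a harder problem and is not attempted here.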

Cross-cutting merge-semantics decisions (need a default each):

  • Deep vs shallow merge (inline everything vs. keep externalDocumentRefs).
  • Flat vs hierarchical vs augment (CDX/sbomasm modes — flat union? product-rollup parent? enrich-primary?).
  • Package identity / dedup key — purl? name+version? SPDX-id? — decides what counts as "the same package."
  • Field-conflict policy — same package, different license across inputs: last-writer / union / flag-for-review?
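
The last two decisions can be sketched together. The example below assumes packages are plain dicts, uses purl with a name+version fallback as the identity key, and implements the three conflict policies named above for a single `license` field. All function names and the record shape are hypothetical.

```python
def identity_key(pkg: dict) -> str:
    """Dedup key: prefer purl, fall back to lower-cased name@version."""
    purl = pkg.get("purl")
    if purl:
        return purl
    return f"{pkg['name'].lower()}@{pkg.get('version', '')}"


def resolve_conflict(values: list, policy: str = "flag") -> dict:
    """Apply one field-conflict policy to the values seen across inputs."""
    distinct = list(dict.fromkeys(values))  # de-duplicate, keep input order
    if len(distinct) <= 1:
        return {"value": distinct[0] if distinct else None, "needs_review": False}
    if policy == "last-writer":
        return {"value": values[-1], "needs_review": False}
    if policy == "union":
        return {"value": distinct, "needs_review": False}
    if policy == "flag":
        return {"value": None, "candidates": distinct, "needs_review": True}
    raise ValueError(f"unknown policy: {policy}")


def merge_packages(reports: list[list[dict]], policy: str = "flag") -> dict:
    """Group license values by identity key, then resolve each group."""
    grouped = {}
    for report in reports:
        for pkg in report:
            grouped.setdefault(identity_key(pkg), []).append(pkg["license"])
    return {key: resolve_conflict(licenses, policy) for key, licenses in grouped.items()}


reports = [
    [{"name": "Zlib", "version": "1.3", "license": "Zlib"}],
    [{"name": "zlib", "version": "1.3", "license": "MIT"}],
]
flagged = merge_packages(reports, policy="flag")
unioned = merge_packages(reports, policy="union")
```

The choice of key directly changes the outcome: without the lower-casing fallback, the two records above would count as different packages and no conflict would be reported.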

5. The genuinely new work (no tool provides it)

These are the differentiators — the reason this is a project and not a shell script:

  1. Per-field provenance map: {json-pointer → source-doc-id}. No merger emits this. This is what powers the "source ribbon" UI.
  2. Three-way merge: base ⊕ upstreamNew ⊕ userEdits → newAggregate; the SBOM equivalent of git merge. Built on jsonpatch + sbomdiff. Nobody does this for SBOMs.
  3. Replayable edit log — every correction stored as an RFC-6902 patch, attributed to a user; aggregate = merge(inputs) ⊕ apply(patches). Survives re-ingest of changed inputs.
  4. The three text/XML mergers (ReadMeOSS, CLIXML, DEP5) and the .docx re-renderer.
  5. Aggregation View UI — provenance ribbon, inline edit → patch, conflict resolution, quality badge.
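
Items 1 and 3 can be illustrated with a stdlib-only sketch, assuming documents are nested dicts. It builds the {json-pointer → source-doc-id} map during a first-source-wins merge, then replays an edit log of RFC-6902-shaped operations. The `replay` function handles only `add` and `replace` and skips the existence checks RFC 6902 requires; the plan above would use the jsonpatch library instead. All names are hypothetical.

```python
import copy


def escape(token: str) -> str:
    """Escape one JSON Pointer token (RFC 6901)."""
    return token.replace("~", "~0").replace("/", "~1")


def flatten(doc: dict, prefix: str = ""):
    """Yield (json_pointer, leaf_value) pairs for nested dicts."""
    for key, value in doc.items():
        pointer = f"{prefix}/{escape(key)}"
        if isinstance(value, dict):
            yield from flatten(value, pointer)
        else:
            yield pointer, value


def set_pointer(doc: dict, pointer: str, value) -> None:
    """Set the value at a JSON Pointer, creating parent objects as needed."""
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer.split("/")[1:]]
    node = doc
    for token in tokens[:-1]:
        node = node.setdefault(token, {})
    node[tokens[-1]] = value


def merge_with_provenance(inputs: dict[str, dict]):
    """Merge inputs field by field; record which source supplied each field."""
    merged, provenance = {}, {}
    for source_id, doc in inputs.items():
        for pointer, value in flatten(doc):
            if pointer not in provenance:  # first source wins in this sketch
                set_pointer(merged, pointer, value)
                provenance[pointer] = source_id
    return merged, provenance


def replay(doc: dict, edit_log: list[dict]) -> dict:
    """aggregate = merge(inputs) + apply(patches); 'add'/'replace' ops only."""
    result = copy.deepcopy(doc)
    for entry in edit_log:
        for op in entry["patch"]:
            if op["op"] not in ("add", "replace"):
                raise NotImplementedError(op["op"])
            set_pointer(result, op["path"], op["value"])
    return result


inputs = {
    "report-a": {"packages": {"zlib@1.3": {"license": "Zlib"}}},
    "report-b": {"packages": {"zlib@1.3": {"license": "NOASSERTION", "supplier": "zlib.net"}}},
}
merged, provenance = merge_with_provenance(inputs)
edit_log = [{
    "user": "alice",
    "patch": [{"op": "replace", "path": "/packages/zlib@1.3/supplier", "value": "Example Org"}],
}]
aggregate = replay(merged, edit_log)
```

Since the edit log is stored separately from the merged document, the same patches can be replayed after the inputs change and the merge is rerun.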

6. Quality & validation strategy

  • sbomqs after every merge → store score with the aggregate, surface as a CRA/NTIA/BSI badge. Guards against "merge silently degraded the SBOM."
  • Golden-file tests per format — N known-good FOSSology reports → merge → snapshot output, + assert sbomqs threshold.
  • Optional sbom-utility validate as a CI gate.
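
A golden-file check along these lines might look like the sketch below, assuming the merge output can be serialised to JSON. `check_golden` is a hypothetical helper, and the quality-score threshold is left out because it depends on how sbomqs is invoked.

```python
import json
import tempfile
from pathlib import Path


def check_golden(actual: dict, golden_path: Path, update: bool = False) -> bool:
    """Compare a merge result with its stored snapshot.

    The snapshot is written when it is missing or when update=True;
    otherwise the function reports whether the output still matches.
    """
    rendered = json.dumps(actual, indent=2, sort_keys=True) + "\n"
    if update or not golden_path.exists():
        golden_path.write_text(rendered, encoding="utf-8")
        return True
    return golden_path.read_text(encoding="utf-8") == rendered


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "spdx2_merge.golden.json"
    first = check_golden({"packages": ["zlib"]}, golden)            # creates the snapshot
    unchanged = check_golden({"packages": ["zlib"]}, golden)        # matches it
    drifted = check_golden({"packages": ["zlib", "curl"]}, golden)  # output changed
```

Serialising with `sort_keys=True` keeps the snapshot stable when dict ordering changes but the content does not.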

7. Phased plan (technical milestones)

| Phase | Deliverable | Key deps |
|---|---|---|
| 0 | Repo bootstrap (Apache-2.0, CI, REUSE) | |
| 1 | Core: SPDX 2.3 + CDX merge, N→1, deep, per-field provenance | lib4sbom, spdx-tools, cyclonedx-python-lib, SPDXMerge, jsonpatch |
| 2 | FOSSology-native: ReadMeOSS, CLIXML, DEP5, Unified .docx | lxml, python-docx |
| 3 | SPDX 3 | spdx-python-model, spdx3merge |
| 4 | Cache + replayable edit log | SQLite/SQLAlchemy |
| 5 | Three-way merge + conflict surfacing | sbomdiff (extended) |
| 6 | Web UI (ribbon, inline edit, sbomqs badge) | FastAPI + React |
| 7 | FOSSology shim agent (src/reportaggregator/) | |
| 8 | SW360 push (release attachment) | sw360python |
| 9 | Hardening (CI quality gate, ClearlyDefined enrichment) | |

Discussion Regarding the Presented Architecture

  • This proposed approach was dropped after feedback from Kaushl. He pointed out that the architecture had been designed assuming we would merge multiple reports of different formats and generate output in any supported format. However, our actual use case involves merging multiple reports of the same format and producing output in that same format, making the proposed design unnecessarily complex.
  • Kaushl recommended adopting a simpler pipeline-based architecture following the flow: Ingestion → Loading → Merging → Rendering. This design would leverage custom adapters, mappings, and parsers, eliminating the need for an intermediate model while better aligning with the project requirements.
  • During the evaluation of open-source tooling, I initially focused on tools with licenses that are not ideal for integration with FOSSology and SW360. Kaushl advised prioritizing tools released under the MIT License to ensure better compatibility and compliance with the FOSSology ecosystem.
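
One possible reading of the suggested flow is sketched below: each format gets its own pipeline object that loads, merges, and renders its native model, with no shared intermediate representation. The class names and the toy "name: license" format are hypothetical, and ingestion (fetching the raw reports) is assumed to have already happened.

```python
from abc import ABC, abstractmethod


class FormatPipeline(ABC):
    """One pipeline per format; each works on its own native model."""

    @abstractmethod
    def load(self, raw: str):
        """Parse one raw report into the format's native model."""

    @abstractmethod
    def merge(self, models: list):
        """Combine N native models into one."""

    @abstractmethod
    def render(self, model) -> str:
        """Serialise the merged model back to the same format."""

    def run(self, raw_reports: list[str]) -> str:
        return self.render(self.merge([self.load(raw) for raw in raw_reports]))


class LineListPipeline(FormatPipeline):
    """Toy format for illustration: one 'name: license' entry per line."""

    def load(self, raw: str) -> dict:
        return dict(line.split(": ", 1) for line in raw.splitlines() if line.strip())

    def merge(self, models: list) -> dict:
        merged = {}
        for model in models:
            for name, license_id in model.items():
                merged.setdefault(name, license_id)  # first report wins here
        return merged

    def render(self, model: dict) -> str:
        return "\n".join(f"{name}: {license_id}" for name, license_id in sorted(model.items()))


# Registry keyed by format name; same format in, same format out.
PIPELINES = {"toy": LineListPipeline()}


def aggregate(fmt: str, raw_reports: list[str]) -> str:
    return PIPELINES[fmt].run(raw_reports)


combined = aggregate("toy", ["zlib: Zlib\n", "curl: curl\nzlib: NOASSERTION\n"])
```

Adding a format then means adding one class to the registry, and no pipeline needs to know about any other format.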

Next Steps

  • Explore the approach Kaushl suggested and look into MIT-licensed open source tools that can be integrated with FOSSology for this project.