Week 3
(June 09, 2025 – June 16, 2025)
Meeting 1
Date: June 11, 2025
Attendees:
Summary:
- Presented the findings and the architectural design I created, based on the input Kaushl gave in the last meeting.
- I wasn't able to find many open source projects under the MIT license that would be helpful to integrate into this project.
Document:
How the Report Aggregator Architecture Works
Think of this as a smart combiner for compliance reports. You give it several reports FOSSology already generated (same format), and it produces one merged report - not by gluing files together, but by understanding what each report contains and folding duplicates intelligently.
The problem it solves
Organizations often have many separate compliance reports:
- One per upload
- One per project
- One per team scan
They want one authoritative report that:
- Contains everything from all inputs
- Does not list the same component twice if it appeared in multiple reports
- Shows where each piece of data came from
- Lets humans fix mistakes without losing that traceability
FOSSology today can combine some formats at generation time (uploadsAdd), but that is mostly concatenation, not a true merge. For CycloneDX, uploadsAdd does not merge content at all; it only changes the filename. This tool fixes that after reports exist, by reading files from disk.
The big idea: two specialists + one brain
The architecture avoids building one giant "universal data model" for every format. Instead:
| Piece | Role | Analogy |
|---|---|---|
| Adapters (one per format) | Read and write one format | Translators who speak SPDX, CycloneDX, DEP5, etc. |
| Mapping files (.toml) | Tell the engine where fields live and how to compare them | A recipe card per format |
| Merge engine (one, shared) | Dedup, merge, fix IDs, track provenance | The chef who follows the recipe |
Important: data stays in its native shape (JSON dict, XML tree, text stanzas). The engine does not convert everything into one internal graph format.
The pipeline, step by step
Read files → Parse (adapter) → Merge (engine) → Write (adapter) → Output + provenance.json
1. Ingestion
You point the tool at N report files, e.g.:
report-aggregator merge report-a.json report-b.json -o merged.json
Rules:
- Same format only - SPDX with SPDX, CycloneDX with CycloneDX
- SPDX 2 never mixes with SPDX 3
- No live FOSSology database calls in v1 - just files
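As a rough sketch, the command line above could be parsed with argparse from the stdlib. The `build_parser` name and the `--output` long option are assumptions for illustration; only the `merge` subcommand and `-o` appear in these notes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser layout: one "merge" subcommand taking N input files.
    parser = argparse.ArgumentParser(prog="report-aggregator")
    sub = parser.add_subparsers(dest="command", required=True)
    merge = sub.add_parser("merge", help="merge N reports of the same format")
    merge.add_argument("inputs", nargs="+", help="report files to merge")
    merge.add_argument("-o", "--output", required=True, help="merged report path")
    return parser


# Parse the example invocation from the notes instead of sys.argv.
args = build_parser().parse_args(
    ["merge", "report-a.json", "report-b.json", "-o", "merged.json"]
)
print(args.inputs, args.output)
```

The same-format rule would be checked after parsing, once each file's format is known; that check is left out here.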
2. Loading (adapter .load())
Each format has its own adapter that parses bytes into a native structure:
- CycloneDX → Python dict from JSON
- SPDX tag-value → parsed blocks
- DEP5 / ReadMeOSS → custom text structures
Because FOSSology output is template-driven and predictable, parsers only need to handle FOSSology's shape, not every possible variant of the spec.
3. Input normalization (adapter .entries())
Before merging, each report is broken into a flat list of mergeable items:
- SPDX: packages and files (one input file may already contain multiple packages from uploadsAdd)
- CycloneDX: upload (metadata.component) and files (components[])
- DEP5: license stanzas (Files: blocks)
- ReadMeOSS: license blocks in MAIN / OTHER / ACKNOWLEDGEMENTS
The engine never assumes "one input file = one package."
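A minimal sketch of what the two adapter methods from steps 2 and 3 could look like for CycloneDX JSON. The `Adapter` protocol, the `CycloneDXAdapter` class name, and the `("upload" | "file", entry)` tuple shape are assumptions; only `.load()`, `.entries()`, `metadata.component` and `components[]` come from the design.

```python
import json
from typing import Any, Iterator, Protocol


class Adapter(Protocol):
    """Hypothetical interface; the design only names load() and entries()."""

    def load(self, raw: bytes) -> Any: ...
    def entries(self, doc: Any) -> Iterator[tuple[str, Any]]: ...


class CycloneDXAdapter:
    def load(self, raw: bytes) -> dict:
        # The document stays a plain dict; nothing is converted to a shared model.
        return json.loads(raw)

    def entries(self, doc: dict) -> Iterator[tuple[str, dict]]:
        # The upload lives in metadata.component, the files in components[].
        upload = doc.get("metadata", {}).get("component")
        if upload is not None:
            yield ("upload", upload)
        for component in doc.get("components", []):
            yield ("file", component)


raw = b'{"metadata": {"component": {"name": "appA"}}, "components": [{"name": "src/png.c"}]}'
adapter = CycloneDXAdapter()
items = list(adapter.entries(adapter.load(raw)))
```

Here `items` holds one upload entry followed by one file entry, which is the flat list the engine would consume.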
4. Merging (the engine - 6 steps)
Step 1 - Collect
Pull all entries from all N reports.
Step 2 - Group by identity
Ask: "Is this the same real-world thing as that?"
- Same upload/package → same SHA1 checksum
- Same file → same file SHA1
- Same license text → same md5(normalized text)
If two entries share an identity key, they go in the same bucket and get merged into one.
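The bucketing in Step 2 can be sketched as below. The `group_by_identity` name and the `(source, kind, digest, payload)` tuple layout are assumptions made for the example; the digest is assumed to be precomputed by the adapter.

```python
from collections import defaultdict


def group_by_identity(entries):
    """entries: iterable of (source_report, kind, identity_hash, payload)."""
    buckets = defaultdict(list)
    for source, kind, digest, payload in entries:
        # Kind is part of the key so a file and a package never share a bucket.
        buckets[(kind, digest)].append((source, payload))
    return dict(buckets)


entries = [
    ("report-a.json", "file", "abc", {"path": "src/png.c"}),
    ("report-b.json", "file", "abc", {"path": "src/png.c"}),
    ("report-b.json", "file", "def", {"path": "src/zlib.c"}),
]
buckets = group_by_identity(entries)
```

Two buckets result: `("file", "abc")` with contributions from both reports, and `("file", "def")` with one.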
Step 3 - Merge fields in each bucket
Per the mapping file:
- Union fields (e.g. licenses): combine - if one report says MIT and another says Zlib, output has both
- Conflict fields (e.g. copyright, package name): if they disagree, keep the first value but flag the conflict in the sidecar
- Everything else: first input wins, with provenance recorded
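A sketch of the three field rules applied to one bucket. The function name `merge_bucket`, the flat entry dicts, and the keys of the conflict record are assumptions; provenance recording is left out to keep the example short.

```python
def merge_bucket(bucket, union_fields, conflict_fields):
    """bucket: list of (source, entry_dict) in input order."""
    merged, conflicts = {}, []
    for source, entry in bucket:
        for field, value in entry.items():
            if field not in merged:
                # First occurrence: copy lists so inputs are not mutated.
                merged[field] = list(value) if field in union_fields else value
            elif field in union_fields:
                # Union: append values not seen yet, keeping input order.
                merged[field] += [v for v in value if v not in merged[field]]
            elif field in conflict_fields and value != merged[field]:
                # Conflict: keep the first value, flag the disagreement.
                conflicts.append(
                    {"field": field, "kept": merged[field],
                     "rejected": value, "source": source}
                )
            # Any other field: the first value already stored wins.
    return merged, conflicts


bucket = [
    ("report-a.json", {"licenses": ["MIT"], "copyright": "2019 A", "path": "src/png.c"}),
    ("report-b.json", {"licenses": ["Zlib"], "copyright": "2020 B", "path": "src/png.c"}),
]
merged, conflicts = merge_bucket(bucket, {"licenses"}, {"copyright"})
```

With this input, `merged["licenses"]` contains both MIT and Zlib, the copyright from report A is kept, and one conflict is flagged for report B.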
Step 4 - Fix IDs and references (graph formats only)
SPDX and CycloneDX use local IDs like SPDXRef-upload10 or bom-ref: "10-3". Different reports can reuse the same IDs for different things.
The engine builds a remap table and rewrites:
- SPDX IDs
- Relationship links (DESCRIBES, CONTAINS, etc.)
- CycloneDX bom-ref values (for uniqueness)
Text formats (DEP5, ReadMeOSS) skip this - they have no ID graph.
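One way the remap table could work, sketched on a simplified SPDX-like structure. The `remap_ids` name, the `-r<index>` suffix scheme, and the `{"ids": ..., "relationships": ...}` input shape are assumptions; deduplicated entries (which would share one new ID) are not handled here.

```python
def remap_ids(docs):
    merged_ids, merged_rels = [], []
    for index, doc in enumerate(docs):
        # One remap table per input document: old local ID -> unique ID.
        remap = {old: f"{old}-r{index}" for old in doc["ids"]}
        merged_ids.extend(remap.values())
        for source, rel_type, target in doc["relationships"]:
            # Rewrite both ends; IDs not in the table pass through unchanged.
            merged_rels.append(
                (remap.get(source, source), rel_type, remap.get(target, target))
            )
    return merged_ids, merged_rels


docs = [
    {"ids": ["SPDXRef-upload10", "SPDXRef-item1"],
     "relationships": [("SPDXRef-DOCUMENT", "DESCRIBES", "SPDXRef-upload10"),
                       ("SPDXRef-upload10", "CONTAINS", "SPDXRef-item1")]},
    {"ids": ["SPDXRef-upload10"],
     "relationships": [("SPDXRef-DOCUMENT", "DESCRIBES", "SPDXRef-upload10")]},
]
ids, rels = remap_ids(docs)
```

Both reports used SPDXRef-upload10, but after the rewrite every ID is unique and each relationship points at the ID from its own report.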
Step 5 - Record provenance
For every merged field, record which input reports contributed. Written to merged.provenance.json.
Step 6 - Assemble and render
Build one output document (one header, merged entries, fixed references) and serialize back to the same format.
Two kinds of formats (this drives behavior)
Category A - Document graph formats
SPDX 2/3, CycloneDX, CLIXML (still to be researched)
These are structured graphs: entities (packages, files) linked by IDs and relationships.
Merge is semantic:
- Dedup by checksum
- Rewire references
- Dedup license text blocks
- Regenerate document metadata (e.g. new documentNamespace for SPDX)
Category B - Stanza/section formats
DEP5, ReadMeOSS
These are text organized in paragraphs/sections, not a graph.
Merge is content-based:
- Group stanzas by license expression or text hash
- Union file lists under the same license
- Flag overlaps when the same file path gets different licenses
- No ID rewiring
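The stanza rules above can be sketched as follows. The `merge_stanzas` name and the `(license_expression, [file globs])` input shape are assumptions; a real DEP5 adapter would produce these pairs from the Files: blocks.

```python
from collections import defaultdict


def merge_stanzas(stanzas):
    """stanzas: list of (license_expression, [file globs]) from all inputs."""
    files_by_license = defaultdict(set)
    licenses_by_path = defaultdict(set)
    for license_expr, globs in stanzas:
        # Union the file lists under the same license.
        files_by_license[license_expr].update(globs)
        for glob in globs:
            licenses_by_path[glob].add(license_expr)
    merged = {lic: sorted(globs) for lic, globs in files_by_license.items()}
    # An overlap is a path that ended up under more than one license.
    overlaps = {
        path: sorted(lics)
        for path, lics in licenses_by_path.items()
        if len(lics) > 1
    }
    return merged, overlaps


stanzas = [
    ("MIT", ["src/png.c"]),
    ("Zlib", ["src/png.c", "src/zlib.c"]),
    ("MIT", ["src/util.c"]),
]
merged, overlaps = merge_stanzas(stanzas)
```

The two MIT stanzas collapse into one, and src/png.c is flagged because it appears under both MIT and Zlib.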
Graph formats ──► dedup + rewire IDs + relationships
Stanza formats ──► dedup by text/content + section merge
both ──► provenance sidecar
How "sameness" is decided (identity rules)
This is one of the most important design choices.
| What | How we know it's the same | Why |
|---|---|---|
| Package / upload | SHA1 (fallback MD5, SHA256) | FOSSology stores upload hash - filename is unreliable |
| File | File SHA1 | Content hash is definitive |
| License text | md5(normalized text) | LicenseRef-fossology-xyz names differ across FOSSology instances |
| DEP5 stanza | md5(license + sorted file globs) | Groups "these files under this license" |
| ReadMeOSS block | md5(normalized text) | Same idea as license text |
Not used for identity: package name, version, purl - FOSSology often leaves these empty ("NA").
License text normalization before hashing:
- Normalize line endings to \n
- Strip trailing whitespace per line
- For ReadMeOSS: strip OSSelot prefix if present
- Do not strip blank lines inside license text (may matter legally)
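A minimal sketch of the normalization and hashing using hashlib. The function names are assumptions, and the OSSelot prefix rule is omitted because the exact prefix is not specified in these notes.

```python
import hashlib


def normalize_license_text(text: str) -> str:
    # Normalize CRLF and lone CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Strip trailing whitespace per line; blank lines are kept on purpose.
    return "\n".join(line.rstrip() for line in text.split("\n"))


def license_text_key(text: str) -> str:
    normalized = normalize_license_text(text).encode("utf-8")
    # MD5 serves as an identity key here, not as a security measure.
    return hashlib.md5(normalized, usedforsecurity=False).hexdigest()


windows_copy = "Permission is hereby granted  \r\n\r\nfree of charge\r\n"
unix_copy = "Permission is hereby granted\n\nfree of charge\n"
no_blank_line = "Permission is hereby granted\nfree of charge\n"

same = license_text_key(windows_copy) == license_text_key(unix_copy)
different = license_text_key(unix_copy) != license_text_key(no_blank_line)
```

The first two texts differ only in line endings and trailing spaces, so they hash to the same key; removing the blank line changes the key, matching the rule that blank lines inside license text are preserved.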
Concrete example: two CycloneDX reports
Report A (upload appA, sha1 aaa):
- 1 file: src/png.c, sha1 abc, license MIT
Report B (upload appB, sha1 bbb):
- src/png.c, sha1 abc, license Zlib (same file as A)
- src/zlib.c, sha1 def, license Zlib
What the engine does:
- Upload tier: aaa and bbb are different → keep both as type:library entries
- File tier: abc appears in A and B → one merged file entry with licenses [MIT, Zlib]
- File def: only in B → kept as-is
- Provenance: records that abc's Zlib license came from report B
- Output: one JSON with 2 library entries + 2 file entries + provenance.json
That is merge, not concatenation. Concatenation would give you two png.c entries.
Mapping files: the engine's instruction manual
Each format has a TOML file like mappings/spdx2.toml:
package_identity = ["checksum.SHA1", "checksum.MD5", "checksum.SHA256"]
union_fields = ["hasExtractedLicensingInfos"]
conflict_fields = ["PackageName", "PackageCopyrightText"]
The engine reads this and knows:
- Where packages live
- What field decides "same package"
- Which fields to combine vs which to treat as conflicts
Adding a new format = new adapter + new mapping file. The core engine stays unchanged.
Provenance sidecar: the audit trail
Every merge produces two outputs:
- merged-report (SPDX, JSON, text, etc.)
- merged-report.provenance.json
The sidecar contains:
- inputs - which files were merged
- field_provenance - e.g. /packages/0/licenseConcluded came from report-A and report-B
- conflicts - fields that disagreed, what each input said, what was chosen
- edits - human corrections applied later
This powers the future UI: "show me which report contributed this license."
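A possible shape for the sidecar, sketched with a small recorder class. The four top-level keys come from the list above; the `ProvenanceLog` class, its method names, and the inner layout of each record are assumptions.

```python
import json
from collections import defaultdict


class ProvenanceLog:
    def __init__(self, inputs):
        self.inputs = list(inputs)
        self.field_provenance = defaultdict(list)
        self.conflicts = []

    def record(self, pointer, source):
        # Remember each contributing report once, in first-seen order.
        if source not in self.field_provenance[pointer]:
            self.field_provenance[pointer].append(source)

    def conflict(self, pointer, values_by_source, chosen):
        self.conflicts.append(
            {"path": pointer, "values": values_by_source, "chosen": chosen}
        )

    def to_json(self):
        return json.dumps({
            "inputs": self.inputs,
            "field_provenance": dict(self.field_provenance),
            "conflicts": self.conflicts,
            "edits": [],  # filled in later by the edit layer
        }, indent=2)


log = ProvenanceLog(["report-a.json", "report-b.json"])
log.record("/packages/0/licenseConcluded", "report-a.json")
log.record("/packages/0/licenseConcluded", "report-b.json")
log.conflict("/packages/0/name",
             {"report-a.json": "appA", "report-b.json": "appB"}, chosen="appA")
sidecar = json.loads(log.to_json())
```

A UI query like "which report contributed this license" then becomes a lookup in `sidecar["field_provenance"]` by field path.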
Edit layer: corrections that survive re-merge
Users will fix wrong data. Edits are stored as RFC 6902 JSON Patch operations (similar to Git-style change records):
aggregate = merge(all inputs) + apply(user patches)
Patches target the internal structure (before rendering to text/JSON), so:
- Re-running merge after an input changes does not wipe manual fixes
- Patches stay valid when the report is re-serialized
Each edit records who, when, and what changed.
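To illustrate why stored patches survive a re-merge, here is a small stdlib-only sketch that applies a subset of RFC 6902 (only "replace" and "remove") to a merged structure. The helper names, the sample data, and the extra "author" key on the patch record are assumptions; a full implementation would cover the remaining operations.

```python
import copy


def _resolve_parent(doc, pointer):
    # Decode JSON Pointer segments (RFC 6901): ~1 is "/", ~0 is "~".
    parts = [p.replace("~1", "/").replace("~0", "~") for p in pointer.split("/")[1:]]
    target = doc
    for part in parts[:-1]:
        target = target[int(part)] if isinstance(target, list) else target[part]
    return target, parts[-1]


def apply_patches(doc, patches):
    """Apply "replace" and "remove" operations to a copy of doc."""
    result = copy.deepcopy(doc)
    for patch in patches:
        parent, key = _resolve_parent(result, patch["path"])
        if isinstance(parent, list):
            key = int(key)
        parent[key]  # raises if the target does not exist
        if patch["op"] == "replace":
            parent[key] = patch["value"]
        elif patch["op"] == "remove":
            del parent[key]
        else:
            raise ValueError(f"unsupported op: {patch['op']}")
    return result


patches = [{"op": "replace", "path": "/packages/0/name",
            "value": "libpng", "author": "reviewer"}]

merged_v1 = {"packages": [{"name": "NA", "licenseConcluded": "MIT"}]}
# Same merge re-run after an input report changed.
merged_v2 = {"packages": [{"name": "NA", "licenseConcluded": "MIT AND Zlib"}]}

fixed_v1 = apply_patches(merged_v1, patches)
fixed_v2 = apply_patches(merged_v2, patches)
```

The manual name fix is present in both results, while the second one also picks up the changed license from the re-merge.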
How this relates to FOSSology uploadsAdd today
| Aspect | uploadsAdd (built into FOSSology) | Report Aggregator (new tool) |
|---|---|---|
| When | While generating from DB | After reports exist on disk |
| How | Mostly append blocks | Identity-based dedup |
| CycloneDX | Broken - only changes filename | Actually merges uploads + files |
| Provenance | None | Full sidecar |
| Conflicts | Silent | Flagged for review |
They complement each other. You can merge individual per-upload reports, or reports that were already partially combined by uploadsAdd.
What gets built (project structure)
report-aggregator/
├── mappings/                 # TOML recipes per format
├── src/report_aggregator/
│   ├── engine/               # Shared merge logic
│   ├── adapters/             # One per format (spdx2, cyclonedx, dep5, ...)
│   └── cli.py                # Command-line entry point
└── tests/golden/             # Real FOSSology outputs for regression tests
MIT-only: the code relies mostly on the Python stdlib (json, xml.etree, tomllib). SPDX tag-value, DEP5, and ReadMeOSS get custom parsers because existing libraries are GPL/Apache/BSD.
v1 scope (what ships first)
| In v1 | Deferred |
|---|---|
| CycloneDX 1.4 JSON | CLIXML (phase 4) |
| SPDX 2 tag-value | SPDX RDF/XML, CSV |
| DEP5 | Hierarchical CDX parent root |
| ReadMeOSS | Unified Report .docx |
| SPDX 3 JSON (phase 2b) | |
One-sentence summary
Adapters parse FOSSology reports into native structures, a shared engine merges them by content hash (not filename), fixes ID collisions in graph formats, records every decision in a provenance sidecar, and adapters write back one clean report of the same format.
Discussion Regarding the Presented Architecture
- The mentors liked this architectural design and agreed that I should start the initial implementation of the project.
- Any further enhancements will be made while the implementation is going on.
Next Steps
- Start the initial implementation of the project.
- Create a repository on GitHub and push the code there.
- Ownership of this repository will be transferred to FOSSology in the future.