Week 3

(June 09, 2025 – June 16, 2025)

Meeting 1

Date: June 11, 2025
Attendees:

Summary:

Presented my findings and the architectural design I created based on the input Kaushl gave in the last meeting.
I wasn't able to find many open-source projects under the MIT license that would be useful to integrate into this project.

Document:

How the Report Aggregator Architecture Works

Think of this as a smart combiner for compliance reports. You give it several reports FOSSology already generated (same format), and it produces one merged report - not by gluing files together, but by understanding what each report contains and folding duplicates intelligently.

The problem it solves

Organizations often have many separate compliance reports:

One per upload
One per project
One per team scan

They want one authoritative report that:

Contains everything from all inputs
Does not list the same component twice if it appeared in multiple reports
Shows where each piece of data came from
Lets humans fix mistakes without losing that traceability

FOSSology today can combine some formats at generation time (uploadsAdd), but that is mostly concatenation, not a true merge. For CycloneDX, uploadsAdd does not merge content at all; it only changes the filename. This tool fixes that after the reports exist, by reading files from disk.

The big idea: two specialists + one brain

The architecture avoids building one giant "universal data model" for every format. Instead:

| Piece | Role | Analogy |
| --- | --- | --- |
| Adapters (one per format) | Read and write one format | Translators who speak SPDX, CycloneDX, DEP5, etc. |
| Mapping files (`.toml`) | Tell the engine where fields live and how to compare them | A recipe card per format |
| Merge engine (one, shared) | Dedup, merge, fix IDs, track provenance | The chef who follows the recipe |

Important: data stays in its native shape (JSON dict, XML tree, text stanzas). The engine does not convert everything into one internal graph format.

The pipeline, step by step

Read files → Parse (adapter) → Merge (engine) → Write (adapter) → Output + provenance.json

1. Ingestion

You point the tool at N report files, e.g.:

report-aggregator merge report-a.json report-b.json -o merged.json

Rules:

Same format only - SPDX with SPDX, CycloneDX with CycloneDX
SPDX 2 never mixes with SPDX 3
No live FOSSology database calls in v1 - just files
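
As a rough sketch of how the same-format rule could be enforced before any merging starts, the snippet below guesses a coarse format label per input and refuses mixed sets. The function names and the detection heuristic are assumptions for illustration: it only looks at the `bomFormat` key of CycloneDX JSON and the `SPDXVersion:` line of SPDX tag-value, and does not cover DEP5 or ReadMeOSS.

```python
import json


def detect_format(text: str) -> str:
    """Return a coarse format label for one report (heuristic, hypothetical)."""
    stripped = text.lstrip()
    if stripped.startswith("{"):
        doc = json.loads(stripped)
        if doc.get("bomFormat") == "CycloneDX":
            return "cyclonedx"
        raise ValueError("unrecognised JSON report")
    for line in stripped.splitlines():
        if line.startswith("SPDXVersion:"):
            version = line.split(":", 1)[1].strip()  # e.g. "SPDX-2.2"
            if version.startswith("SPDX-"):
                # Keep only the major version so SPDX 2 and SPDX 3 never mix.
                return "spdx" + version[len("SPDX-"):].split(".", 1)[0]
    raise ValueError("unrecognised report format")


def check_same_format(texts: list[str]) -> str:
    """Return the single shared format, or raise if the inputs are mixed."""
    formats = {detect_format(t) for t in texts}
    if len(formats) != 1:
        raise ValueError(f"cannot merge mixed formats: {sorted(formats)}")
    return formats.pop()
```

Failing early here keeps the engine itself free of cross-format special cases.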

2. Loading (adapter `.load()`)

Each format has its own adapter that parses bytes into a native structure:

CycloneDX → Python dict from JSON
SPDX tag-value → parsed blocks
DEP5 / ReadMeOSS → custom text structures

Because FOSSology output is template-driven and predictable, parsers only need to handle FOSSology's shape, not every possible variant of the spec.

3. Input normalization (adapter `.entries()`)

Before merging, each report is broken into a flat list of mergeable items:

SPDX: packages and files (one input file may already contain multiple packages from uploadsAdd)
CycloneDX: upload (metadata.component) and files (components[])
DEP5: license stanzas (Files: blocks)
ReadMeOSS: license blocks in MAIN / OTHER / ACKNOWLEDGEMENTS

The engine never assumes "one input file = one package."
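
A minimal sketch of what the adapter contract might look like for CycloneDX, assuming `.load()` takes raw bytes and `.entries()` yields `(kind, item)` pairs. The `Adapter` protocol, the class name, and the `"upload"` / `"file"` labels are illustrative assumptions, not a settled interface.

```python
import json
from typing import Any, Iterator, Protocol


class Adapter(Protocol):
    """Hypothetical contract every format adapter would satisfy."""

    def load(self, data: bytes) -> Any: ...

    def entries(self, doc: Any) -> Iterator[tuple[str, dict]]: ...


class CycloneDXAdapter:
    def load(self, data: bytes) -> dict:
        # The native structure is simply the parsed JSON dict.
        return json.loads(data)

    def entries(self, doc: dict) -> Iterator[tuple[str, dict]]:
        # Upload tier: metadata.component, if present.
        upload = doc.get("metadata", {}).get("component")
        if upload is not None:
            yield ("upload", upload)
        # File tier: every item in components[].
        for component in doc.get("components", []):
            yield ("file", component)
```

Because `.entries()` yields items one by one, an input that already holds several packages simply produces several entries, which is how the "one input file = one package" assumption is avoided.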

4. Merging (the engine - 6 steps)

Step 1 - Collect
Pull all entries from all N reports.

Step 2 - Group by identity
Ask: "Is this the same real-world thing as that?"

Same upload/package → same SHA1 checksum
Same file → same file SHA1
Same license text → same md5(normalized text)

If two entries share an identity key, they go in the same bucket and get merged into one.
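
The bucketing idea can be sketched in a few lines. This assumes each entry arrives as a `(source, entry)` pair and that the caller supplies a `key_of` function returning the identity key; both are illustrative choices rather than the engine's actual signature.

```python
from collections import defaultdict


def group_by_identity(entries, key_of):
    """Bucket (source, entry) pairs by identity key, keeping first-seen order."""
    buckets = defaultdict(list)
    for source, entry in entries:
        buckets[key_of(entry)].append((source, entry))
    return dict(buckets)


# Usage: two reports that both contain the file with sha1 "abc".
entries = [
    ("report-a", {"path": "src/png.c", "sha1": "abc"}),
    ("report-b", {"path": "src/png.c", "sha1": "abc"}),
    ("report-b", {"path": "src/zlib.c", "sha1": "def"}),
]
buckets = group_by_identity(entries, key_of=lambda e: e["sha1"])
```

Each bucket with more than one member is a duplicate that the next step folds into a single entry.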

Step 3 - Merge fields in each bucket
Per the mapping file:

Union fields (e.g. licenses): combine - if one report says MIT and another says Zlib, output has both
Conflict fields (e.g. copyright, package name): if they disagree, keep the first value but flag the conflict in the sidecar
Everything else: first input wins, with provenance recorded
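
One way the three field policies could look in code, as a sketch only: `merge_bucket` and the shape of the conflict record are assumptions, union fields are assumed to hold lists, and provenance tracking is left out to keep the example short.

```python
def merge_bucket(bucket, union_fields, conflict_fields):
    """Merge a list of (source, entry) pairs into one entry plus a conflict list."""
    merged, conflicts = {}, []
    for source, entry in bucket:
        for field, value in entry.items():
            if field in union_fields:
                # Union: keep every distinct item, in first-seen order.
                current = merged.setdefault(field, [])
                for item in value:
                    if item not in current:
                        current.append(item)
            elif field not in merged:
                # First input wins for any field not seen yet.
                merged[field] = value
            elif field in conflict_fields and merged[field] != value:
                # Conflict: keep the first value, but flag the disagreement.
                conflicts.append({
                    "field": field,
                    "kept": merged[field],
                    "rejected": value,
                    "source": source,
                })
    return merged, conflicts
```

Disagreements on fields that are neither union nor conflict fields are dropped silently here, which matches "first input wins".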

Step 4 - Fix IDs and references (graph formats only)
SPDX and CycloneDX use local IDs like SPDXRef-upload10 or bom-ref: "10-3". Different reports can reuse the same IDs for different things.

The engine builds a remap table and rewrites:

SPDX IDs
Relationship links (DESCRIBES, CONTAINS, etc.)
CycloneDX bom-ref values (for uniqueness)

Text formats (DEP5, ReadMeOSS) skip this - they have no ID graph.
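
A small sketch of the remap table for SPDX-style IDs. The `-r<index>` suffix scheme and both function names are hypothetical; the real engine would also need to map deduplicated entities from different reports onto one shared new ID, which this sketch ignores.

```python
def build_remap(reports):
    """reports: list of (report_name, [local_id, ...]).

    Returns {(report_name, old_id): new_id} with a per-report suffix so that
    identical local IDs from different reports no longer collide.
    """
    remap = {}
    for index, (name, ids) in enumerate(reports):
        for old_id in ids:
            remap[(name, old_id)] = f"{old_id}-r{index}"
    return remap


def rewrite_relationships(name, relationships, remap):
    """Rewrite (source_id, kind, target_id) triples from one report."""
    return [
        (remap[(name, src)], kind, remap[(name, dst)])
        for src, kind, dst in relationships
    ]
```

Keying the table by `(report, old_id)` rather than by `old_id` alone is what lets two reports reuse `SPDXRef-upload10` for different things.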

Step 5 - Record provenance
For every merged field, record which input reports contributed. Written to merged.provenance.json.

Step 6 - Assemble and render
Build one output document (one header, merged entries, fixed references) and serialize back to the same format.

Two kinds of formats (this drives behavior)

Category A - Document graph formats

SPDX 2/3, CycloneDX, CLIXML (still to be researched)

These are structured graphs: entities (packages, files) linked by IDs and relationships.

Merge is semantic:

Dedup by checksum
Rewire references
Dedup license text blocks
Regenerate document metadata (e.g. new documentNamespace for SPDX)

Category B - Stanza/section formats

DEP5, ReadMeOSS

These are text organized in paragraphs/sections, not a graph.

Merge is content-based:

Group stanzas by license expression or text hash
Union file lists under the same license
Flag overlaps when the same file path gets different licenses
No ID rewiring
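
For the stanza formats, the union-and-flag behavior might look roughly like this. It assumes each stanza has already been reduced to a `(license, [file globs])` pair and compares paths as plain strings, without any glob expansion; `merge_stanzas` is an illustrative name.

```python
def merge_stanzas(stanzas):
    """stanzas: list of (license, [file globs]).

    Returns (files_by_license, overlaps), where overlaps maps a path to the
    sorted list of different licenses it was listed under.
    """
    files_by_license = {}
    licenses_by_path = {}
    for license_id, paths in stanzas:
        bucket = files_by_license.setdefault(license_id, [])
        for path in paths:
            if path not in bucket:
                bucket.append(path)  # union of file lists per license
            licenses_by_path.setdefault(path, set()).add(license_id)
    overlaps = {
        path: sorted(licenses)
        for path, licenses in licenses_by_path.items()
        if len(licenses) > 1
    }
    return files_by_license, overlaps
```

Overlaps are reported rather than resolved, since deciding which license is right for a path is a human call.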

Graph formats ──► dedup + rewire IDs + relationships

Stanza formats ──► dedup by text/content + section merge

both ──► provenance sidecar

How "sameness" is decided (identity rules)

This is one of the most important design choices.

| What | How we know it's the same | Why |
| --- | --- | --- |
| Package / upload | SHA1 (fallback MD5, SHA256) | FOSSology stores upload hash - filename is unreliable |
| File | File SHA1 | Content hash is definitive |
| License text | `md5(normalized text)` | `LicenseRef-fossology-xyz` names differ across FOSSology instances |
| DEP5 stanza | `md5(license + sorted file globs)` | Groups "these files under this license" |
| ReadMeOSS block | `md5(normalized text)` | Same idea as license text |

Not used for identity: package name, version, purl - FOSSology often leaves these empty ("NA").
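
Two of these rules sketched in code, under stated assumptions: checksums are assumed to arrive as a dict such as `{"SHA1": "..."}`, and the DEP5 key is assumed to join the license and the sorted globs with newlines before hashing (the exact byte layout is not fixed by the design above). Function names are illustrative.

```python
import hashlib

# Priority order from the identity rules: SHA1 first, then MD5, then SHA256.
CHECKSUM_PRIORITY = ("SHA1", "MD5", "SHA256")


def package_identity(checksums):
    """Return an identity key like 'SHA1:<hex>', or None if no checksum exists."""
    for algo in CHECKSUM_PRIORITY:
        value = checksums.get(algo)
        if value:
            return f"{algo}:{value.lower()}"
    return None


def dep5_stanza_identity(license_expr, globs):
    """md5 over the license plus sorted file globs, so glob order is irrelevant."""
    payload = license_expr + "\n" + "\n".join(sorted(globs))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

One caveat of the fallback approach: a package that carries only an MD5 in one report and only a SHA1 in another would not be recognized as the same thing.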

License text normalization before hashing:

Normalize line endings to \n
Strip trailing whitespace per line
For ReadMeOSS: strip OSSelot prefix if present
Do not strip blank lines inside license text (may matter legally)
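
The normalization rules translate fairly directly into code. This sketch covers line endings and trailing whitespace only; the OSSelot prefix step is left out because its exact shape is not specified here, and the function names are assumptions.

```python
import hashlib


def normalize_license_text(text):
    """Unify line endings and strip trailing whitespace, keeping blank lines."""
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return "\n".join(line.rstrip() for line in unified.split("\n"))


def license_text_identity(text):
    """Identity key for a license text: md5 of the normalized text."""
    return hashlib.md5(normalize_license_text(text).encode("utf-8")).hexdigest()
```

Two copies of a license that differ only in Windows line endings or trailing spaces hash to the same key, while removing a blank line changes the key.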

Concrete example: two CycloneDX reports

Report A (upload appA, sha1 aaa):

1 file: src/png.c, sha1 abc, license MIT

Report B (upload appB, sha1 bbb):

src/png.c, sha1 abc, license Zlib (same file as A)
src/zlib.c, sha1 def, license Zlib

What the engine does:

Upload tier: aaa and bbb are different → keep both as type:library entries
File tier: abc appears in A and B → one merged file entry with licenses [MIT, Zlib]
File def: only in B → kept as-is
Provenance: records that abc's Zlib license came from report B
Output: one JSON with 2 library entries + 2 file entries + provenance.json

That is merge, not concatenation. Concatenation would give you two png.c entries.
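
The example above can be replayed in a few lines. The report dicts below are a simplified stand-in, not the real CycloneDX schema, and `merge_reports` is a hypothetical helper that folds only the upload and file tiers.

```python
report_a = {
    "upload": {"name": "appA", "sha1": "aaa"},
    "files": [{"path": "src/png.c", "sha1": "abc", "licenses": ["MIT"]}],
}
report_b = {
    "upload": {"name": "appB", "sha1": "bbb"},
    "files": [
        {"path": "src/png.c", "sha1": "abc", "licenses": ["Zlib"]},
        {"path": "src/zlib.c", "sha1": "def", "licenses": ["Zlib"]},
    ],
}


def merge_reports(reports):
    """reports: {name: report}. Returns (uploads, files, provenance)."""
    uploads, files, provenance = {}, {}, {}
    for name, report in reports.items():
        # Upload tier: one entry per distinct upload checksum.
        uploads.setdefault(report["upload"]["sha1"], report["upload"])
        for f in report["files"]:
            # File tier: one entry per distinct file checksum, licenses unioned.
            merged = files.setdefault(
                f["sha1"], {"path": f["path"], "sha1": f["sha1"], "licenses": []}
            )
            for lic in f["licenses"]:
                if lic not in merged["licenses"]:
                    merged["licenses"].append(lic)
                provenance.setdefault((f["sha1"], lic), []).append(name)
    return list(uploads.values()), list(files.values()), provenance


uploads, files, provenance = merge_reports({"A": report_a, "B": report_b})
```

With these inputs the result holds two upload entries and two file entries, and `provenance[("abc", "Zlib")]` names report B as the contributor.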

Mapping files: the engine's instruction manual

Each format has a TOML file like mappings/spdx2.toml:

package_identity = ["checksum.SHA1", "checksum.MD5", "checksum.SHA256"]

union_fields = ["hasExtractedLicensingInfos"]

conflict_fields = ["PackageName", "PackageCopyrightText"]

The engine reads this and knows:

Where packages live
What field decides "same package"
Which fields to combine vs which to treat as conflicts

Adding a new format = new adapter + new mapping file. The core engine stays unchanged.

Provenance sidecar: the audit trail

Every merge produces two outputs:

merged-report (SPDX, JSON, text, etc.)
merged-report.provenance.json

The sidecar contains:

inputs - which files were merged
field_provenance - e.g. /packages/0/licenseConcluded came from report-A and report-B
conflicts - fields that disagreed, what each input said, what was chosen
edits - human corrections applied later

This powers the future UI: "show me which report contributed this license."
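
To make the sidecar concrete, here is one possible shape. The four top-level keys come from the list above; everything inside them (JSON Pointer paths as keys, lists of input names as values) and the helper names are assumptions for illustration.

```python
import json


def build_sidecar(inputs, field_provenance, conflicts, edits=()):
    """Assemble the provenance sidecar as a plain, JSON-serializable dict."""
    return {
        "inputs": list(inputs),
        "field_provenance": {path: list(srcs) for path, srcs in field_provenance.items()},
        "conflicts": list(conflicts),
        "edits": list(edits),
    }


def contributors(sidecar, pointer):
    """Answer 'which reports contributed this field?' for one field path."""
    return sidecar["field_provenance"].get(pointer, [])


sidecar = build_sidecar(
    inputs=["report-A", "report-B"],
    field_provenance={"/packages/0/licenseConcluded": ["report-A", "report-B"]},
    conflicts=[],
)
sidecar_text = json.dumps(sidecar, indent=2)  # what would go into the .provenance.json file
```

A lookup like `contributors(sidecar, "/packages/0/licenseConcluded")` is the kind of query the UI would run.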

Edit layer: corrections that survive re-merge

Users will fix wrong data. Edits are stored as RFC 6902 JSON Patch operations (small change records, similar in spirit to Git diffs):

aggregate = merge(all inputs) + apply(user patches)

Patches target the internal structure (before rendering to text/JSON), so:

Re-running merge after an input changes does not wipe manual fixes
Patches stay valid when the report is re-serialized

Each edit records who, when, and what changed.
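
The `merge + apply` idea can be sketched with a deliberately tiny patch applier. It handles only the `replace` and `remove` operations of RFC 6902 and does not support the empty pointer (whole document); a real implementation would cover the full operation set. `apply_patch` and `_resolve_parent` are illustrative names.

```python
import copy


def _resolve_parent(doc, pointer):
    """Walk a JSON Pointer and return (parent container, final token)."""
    # Unescape per RFC 6901: "~1" means "/", "~0" means "~".
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer.split("/")[1:]]
    parent = doc
    for token in tokens[:-1]:
        parent = parent[int(token)] if isinstance(parent, list) else parent[token]
    return parent, tokens[-1]


def apply_patch(doc, patch):
    """Apply 'replace' / 'remove' operations to a copy of doc and return it."""
    result = copy.deepcopy(doc)
    for op in patch:
        parent, key = _resolve_parent(result, op["path"])
        index = int(key) if isinstance(parent, list) else key
        if op["op"] == "replace":
            if isinstance(parent, dict) and index not in parent:
                raise KeyError(key)  # replace requires an existing target
            parent[index] = op["value"]
        elif op["op"] == "remove":
            del parent[index]
        else:
            raise NotImplementedError(op["op"])
    return result


# Usage: a stored correction is re-applied on top of a fresh merge result.
merged = {"packages": [{"licenseConcluded": "NOASSERTION"}]}
patches = [{"op": "replace", "path": "/packages/0/licenseConcluded", "value": "MIT"}]
aggregate = apply_patch(merged, patches)
```

Because the patch is applied to a copy, the raw merge result stays untouched and can be regenerated at any time without losing the stored edits.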

How this relates to FOSSology `uploadsAdd` today

| Aspect | `uploadsAdd` (built into FOSSology) | Report Aggregator (new tool) |
| --- | --- | --- |
| When | While generating from DB | After reports exist on disk |
| How | Mostly append blocks | Identity-based dedup |
| CycloneDX | Broken - only changes filename | Actually merges uploads + files |
| Provenance | None | Full sidecar |
| Conflicts | Silent | Flagged for review |

They complement each other. You can merge individual per-upload reports, or reports that were already partially combined by uploadsAdd.

What gets built (project structure)

    report-aggregator/
    ├── mappings/                 # TOML recipes per format
    ├── src/report_aggregator/
    │   ├── engine/               # Shared merge logic
    │   ├── adapters/             # One per format (spdx2, cyclonedx, dep5, ...)
    │   └── cli.py                # Command-line entry point
    └── tests/golden/             # Real FOSSology outputs for regression tests

MIT-only: mostly Python stdlib (json, xml.etree, tomllib). Custom parsers for SPDX tag-value, DEP5, ReadMeOSS because existing libraries are GPL/Apache/BSD.

v1 scope (what ships first)

| In v1 | Deferred |
| --- | --- |
| CycloneDX 1.4 JSON | CLIXML (phase 4) |
| SPDX 2 tag-value | SPDX RDF/XML, CSV |
| DEP5 | Hierarchical CDX parent root |
| ReadMeOSS | Unified Report `.docx` |
| SPDX 3 JSON (phase 2b) | |

One-sentence summary

Adapters parse FOSSology reports into native structures, a shared engine merges them by content hash (not filename), fixes ID collisions in graph formats, records every decision in a provenance sidecar, and adapters write back one clean report of the same format.

Discussion Regarding the Presented Architecture

The mentors liked the architectural design and agreed that I should start the initial implementation of the project.
Any further enhancements will be made as the implementation progresses.

Next Steps

Start the initial implementation of the project.
Create a repository on GitHub and push the code there.
Ownership of this repository will be transferred to FOSSology in the future.
