
Week 3

(June 09, 2025 – June 16, 2025)

Meeting 1

Date: June 11, 2025
Attendees:

Summary:

  • Presented the findings and the architectural design I created, based on the inputs Kaushl gave in the last meeting.
  • I was not able to find many open-source projects under the MIT license that would be useful to integrate into this project.

Document:

How the Report Aggregator Architecture Works

Think of this as a smart combiner for compliance reports. You give it several reports FOSSology already generated (same format), and it produces one merged report - not by gluing files together, but by understanding what each report contains and folding duplicates intelligently.


The problem it solves

Organizations often have many separate compliance reports:

  • One per upload
  • One per project
  • One per team scan

They want one authoritative report that:

  1. Contains everything from all inputs
  2. Does not list the same component twice if it appeared in multiple reports
  3. Shows where each piece of data came from
  4. Lets humans fix mistakes without losing that traceability

FOSSology today can combine some formats at generation time (uploadsAdd), but that is mostly concatenation, not a true merge. For CycloneDX, uploadsAdd does not merge content at all; it only changes the output filename. This tool fixes that after the reports exist, by reading files from disk.


The big idea: two specialists + one brain

The architecture avoids building one giant "universal data model" for every format. Instead:

| Piece | Role | Analogy |
| --- | --- | --- |
| Adapters (one per format) | Read and write one format | Translators who speak SPDX, CycloneDX, DEP5, etc. |
| Mapping files (.toml) | Tell the engine where fields live and how to compare them | A recipe card per format |
| Merge engine (one, shared) | Dedup, merge, fix IDs, track provenance | The chef who follows the recipe |

Important: data stays in its native shape (JSON dict, XML tree, text stanzas). The engine does not convert everything into one internal graph format.


The pipeline, step by step

Read files → Parse (adapter) → Merge (engine) → Write (adapter) → Output + provenance.json

1. Ingestion

You point the tool at N report files, e.g.:

report-aggregator merge report-a.json report-b.json -o merged.json

Rules:

  • Same format only - SPDX with SPDX, CycloneDX with CycloneDX
  • SPDX 2 never mixes with SPDX 3
  • No live FOSSology database calls in v1 - just files

2. Loading (adapter .load())

Each format has its own adapter that parses bytes into a native structure:

  • CycloneDX → Python dict from JSON
  • SPDX tag-value → parsed blocks
  • DEP5 / ReadMeOSS → custom text structures

Because FOSSology output is template-driven and predictable, parsers only need to handle FOSSology's shape, not every possible variant of the spec.

3. Input normalization (adapter .entries())

Before merging, each report is broken into a flat list of mergeable items:

  • SPDX: packages and files (one input file may already contain multiple packages from uploadsAdd)
  • CycloneDX: upload (metadata.component) and files (components[])
  • DEP5: license stanzas (Files: blocks)
  • ReadMeOSS: license blocks in MAIN / OTHER / ACKNOWLEDGEMENTS

The engine never assumes "one input file = one package."
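A minimal sketch of what the loading and normalization steps could look like for CycloneDX. The class name CycloneDXAdapter and the ("upload" / "file", dict) tuple shape are assumptions made for illustration; only the load() and entries() method names and the metadata.component / components[] locations come from the design above.

```python
import json
from typing import Any, Iterator


class CycloneDXAdapter:
    """Hypothetical adapter for FOSSology-generated CycloneDX JSON."""

    def load(self, raw: bytes) -> dict[str, Any]:
        # Parse bytes into the native structure (a plain dict); no conversion
        # into a shared internal model happens here.
        return json.loads(raw)

    def entries(self, doc: dict[str, Any]) -> Iterator[tuple[str, dict[str, Any]]]:
        # The upload lives in metadata.component, the files in components[].
        upload = doc.get("metadata", {}).get("component")
        if upload is not None:
            yield ("upload", upload)
        for component in doc.get("components", []):
            yield ("file", component)


raw = b'{"metadata": {"component": {"name": "appA"}}, "components": [{"name": "src/png.c"}]}'
adapter = CycloneDXAdapter()
items = list(adapter.entries(adapter.load(raw)))
```

The engine would then only ever see the flat list in items, regardless of how many uploads or files a single input file contained.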

4. Merging (the engine - 6 steps)

Step 1 - Collect
Pull all entries from all N reports.

Step 2 - Group by identity
Ask: "Is this the same real-world thing as that?"

  • Same upload/package → same SHA1 checksum
  • Same file → same file SHA1
  • Same license text → same md5(normalized text)

If two entries share an identity key, they go in the same bucket and get merged into one.
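The bucketing in Step 2 can be sketched as follows. The function name group_by_identity, the (source, entry) tuple shape, and the handling of entries without any identity key (each kept in its own bucket) are assumptions, not part of the agreed design.

```python
def group_by_identity(entries, key_fn):
    """Group (source, entry) pairs into buckets keyed by identity.

    key_fn returns the identity key (for example a SHA1 string) or None.
    Entries without a key are never merged with anything else.
    """
    buckets = {}
    for index, (source, entry) in enumerate(entries):
        key = key_fn(entry)
        if key is None:
            # Assumption: no identity means "unique", so give it a private bucket.
            key = ("unkeyed", index)
        buckets.setdefault(key, []).append((source, entry))
    return buckets


entries = [
    ("report-a.json", {"sha1": "abc"}),
    ("report-b.json", {"sha1": "abc"}),
    ("report-b.json", {"sha1": "def"}),
    ("report-b.json", {}),
]
buckets = group_by_identity(entries, lambda entry: entry.get("sha1"))
```

Because Python dicts preserve insertion order, the buckets come out in the order the identities were first seen, which keeps the merged output stable across runs.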

Step 3 - Merge fields in each bucket
Per the mapping file:

  • Union fields (e.g. licenses): combine - if one report says MIT and another says Zlib, output has both
  • Conflict fields (e.g. copyright, package name): if they disagree, keep the first value but flag the conflict in the sidecar
  • Everything else: first input wins, with provenance recorded
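As a rough illustration of the three field rules, here is a sketch of merging one bucket. The function name merge_bucket and the shape of the conflict records are hypothetical, and the sketch assumes union fields hold lists.

```python
def merge_bucket(bucket, union_fields, conflict_fields):
    """Merge the (source, entry) pairs of one identity bucket into one entry."""
    merged = {}
    conflicts = []
    for source, entry in bucket:
        for field, value in entry.items():
            if field not in merged:
                # First input wins; union fields are copied so inputs stay untouched.
                merged[field] = list(value) if field in union_fields else value
            elif field in union_fields:
                # Union: add only values that are not present yet.
                for item in value:
                    if item not in merged[field]:
                        merged[field].append(item)
            elif field in conflict_fields and value != merged[field]:
                # Conflict: keep the first value, but record the disagreement.
                conflicts.append(
                    {"field": field, "kept": merged[field], "rejected": value, "source": source}
                )
    return merged, conflicts


bucket = [
    ("report-a.json", {"licenses": ["MIT"], "copyright": "X"}),
    ("report-b.json", {"licenses": ["Zlib"], "copyright": "Y", "name": "png.c"}),
]
merged, conflicts = merge_bucket(bucket, {"licenses"}, {"copyright"})
```

In this sketch, fields that are neither union nor conflict fields silently keep the first value; recording their provenance is left to Step 5.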

Step 4 - Fix IDs and references (graph formats only)
SPDX and CycloneDX use local IDs like SPDXRef-upload10 or bom-ref: "10-3". Different reports can reuse the same IDs for different things.

The engine builds a remap table and rewrites:

  • SPDX IDs
  • Relationship links (DESCRIBES, CONTAINS, etc.)
  • CycloneDX bom-ref values (for uniqueness)

Text formats (DEP5, ReadMeOSS) skip this - they have no ID graph.
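One possible shape for the remap table, shown for SPDX-style IDs. The function names, the (report_index, local_id, identity_key) input shape, and the SPDXRef-merged-N naming scheme are all assumptions for illustration; the design only states that a remap table exists and that IDs and relationship links are rewritten.

```python
def build_remap(entries):
    """Map (report_index, local_id) to a new document-wide ID.

    Entries that share an identity key share one new ID, so references to a
    deduplicated entity from different reports end up at the same target.
    """
    new_id_for_key = {}
    remap = {}
    for report_index, local_id, identity_key in entries:
        if identity_key not in new_id_for_key:
            new_id_for_key[identity_key] = f"SPDXRef-merged-{len(new_id_for_key)}"
        remap[(report_index, local_id)] = new_id_for_key[identity_key]
    return remap


def rewrite_relationships(report_index, relationships, remap):
    """Rewrite (source_id, type, target_id) triples of one input report."""
    return [
        (remap[(report_index, src)], rel_type, remap[(report_index, dst)])
        for src, rel_type, dst in relationships
    ]


# Both reports reuse the same local IDs for different uploads.
entries = [
    (0, "SPDXRef-upload10", "aaa"),
    (0, "SPDXRef-item1", "abc"),
    (1, "SPDXRef-upload10", "bbb"),
    (1, "SPDXRef-item1", "abc"),
]
remap = build_remap(entries)
links = rewrite_relationships(1, [("SPDXRef-upload10", "CONTAINS", "SPDXRef-item1")], remap)
```

Keying the table on the report index as well as the local ID is what keeps two different SPDXRef-upload10 entries apart.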

Step 5 - Record provenance
For every merged field, record which input reports contributed. Written to merged.provenance.json.

Step 6 - Assemble and render
Build one output document (one header, merged entries, fixed references) and serialize back to the same format.


Two kinds of formats (this drives behavior)

Category A - Document graph formats

SPDX 2/3, CycloneDX, CLIXML (CLIXML still needs to be researched)

These are structured graphs: entities (packages, files) linked by IDs and relationships.

Merge is semantic:

  • Dedup by checksum
  • Rewire references
  • Dedup license text blocks
  • Regenerate document metadata (e.g. new documentNamespace for SPDX)

Category B - Stanza/section formats

DEP5, ReadMeOSS

These are text organized in paragraphs/sections, not a graph.

Merge is content-based:

  • Group stanzas by license expression or text hash
  • Union file lists under the same license
  • Flag overlaps when the same file path gets different licenses
  • No ID rewiring

Graph formats ──► dedup + rewire IDs + relationships

Stanza formats ──► dedup by text/content + section merge

both ──► provenance sidecar
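The Category B rules can be sketched like this for DEP5-style stanzas. The function name merge_stanzas and the (source, license, file globs) tuple shape are hypothetical, and grouping here uses the license expression only, not a text hash.

```python
def merge_stanzas(stanzas):
    """Union file lists per license and flag paths claimed by several licenses.

    stanzas: iterable of (source, license_expression, [file globs]).
    """
    files_by_license = {}
    licenses_by_path = {}
    for _source, license_expr, globs in stanzas:
        bucket = files_by_license.setdefault(license_expr, set())
        for glob in globs:
            bucket.add(glob)
            licenses_by_path.setdefault(glob, set()).add(license_expr)
    merged = {lic: sorted(files) for lic, files in files_by_license.items()}
    # An overlap is a path that appears under more than one license.
    overlaps = {
        path: sorted(licenses)
        for path, licenses in licenses_by_path.items()
        if len(licenses) > 1
    }
    return merged, overlaps


stanzas = [
    ("report-a", "MIT", ["src/a.c", "src/png.c"]),
    ("report-b", "MIT", ["src/b.c"]),
    ("report-b", "Zlib", ["src/png.c"]),
]
merged, overlaps = merge_stanzas(stanzas)
```

Overlaps are only reported, not resolved, which matches the idea of flagging them for human review.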


How "sameness" is decided (identity rules)

This is one of the most important design choices.

| What | How we know it's the same | Why |
| --- | --- | --- |
| Package / upload | SHA1 (fallback MD5, SHA256) | FOSSology stores upload hash - filename is unreliable |
| File | File SHA1 | Content hash is definitive |
| License text | md5(normalized text) | LicenseRef-fossology-xyz names differ across FOSSology instances |
| DEP5 stanza | md5(license + sorted file globs) | Groups "these files under this license" |
| ReadMeOSS block | md5(normalized text) | Same idea as license text |

Not used for identity: package name, version, purl - FOSSology often leaves these empty ("NA").

License text normalization before hashing:

  1. Normalize line endings to \n
  2. Strip trailing whitespace per line
  3. For ReadMeOSS: strip OSSelot prefix if present
  4. Do not strip blank lines inside license text (may matter legally)
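A small sketch of rules 1, 2, and 4 together with the md5 key. The function names are hypothetical, and the OSSelot prefix handling (rule 3) is left out because its exact form is not specified here.

```python
import hashlib


def normalize_license_text(text: str) -> str:
    # Rule 1: normalize CRLF and lone CR line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rule 2: strip trailing whitespace per line.
    # Rule 4: blank lines are kept, because split/join preserves empty lines.
    return "\n".join(line.rstrip() for line in text.split("\n"))


def license_text_key(text: str) -> str:
    normalized = normalize_license_text(text).encode("utf-8")
    # MD5 is used as a content fingerprint only, not for security.
    return hashlib.md5(normalized, usedforsecurity=False).hexdigest()


windows_copy = "MIT License  \r\n\r\nPermission is hereby granted\t"
unix_copy = "MIT License\n\nPermission is hereby granted"
same = license_text_key(windows_copy) == license_text_key(unix_copy)
```

Two copies that differ only in line endings or trailing whitespace get the same key, while a copy with a blank line removed does not.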

Concrete example: two CycloneDX reports

Report A (upload appA, sha1 aaa):

  • 1 file: src/png.c, sha1 abc, license MIT

Report B (upload appB, sha1 bbb):

  • src/png.c, sha1 abc, license Zlib (same file as A)
  • src/zlib.c, sha1 def, license Zlib

What the engine does:

  1. Upload tier: aaa and bbb are different → keep both as type:library entries
  2. File tier: abc appears in A and B → one merged file entry with licenses [MIT, Zlib]
  3. File def: only in B → kept as-is
  4. Provenance: records that abc's Zlib license came from report B
  5. Output: one JSON with 2 library entries + 2 file entries + provenance.json

That is merge, not concatenation. Concatenation would give you two png.c entries.
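The file tier of this example can be reproduced with a few lines of Python. This is a simplified sketch: the entries use a flat "sha1" field and plain license strings instead of real CycloneDX hashes and licenses structures, and merge_files is a hypothetical helper.

```python
def merge_files(reports):
    """Merge file entries from several reports by SHA1, tracking license sources.

    reports: list of (report_name, [file entries]).
    """
    merged = {}      # sha1 -> merged file entry
    provenance = {}  # sha1 -> {license: [report names that stated it]}
    for report_name, files in reports:
        for entry in files:
            key = entry["sha1"]
            target = merged.setdefault(
                key, {"name": entry["name"], "sha1": key, "licenses": []}
            )
            for lic in entry["licenses"]:
                if lic not in target["licenses"]:
                    target["licenses"].append(lic)
                provenance.setdefault(key, {}).setdefault(lic, []).append(report_name)
    return list(merged.values()), provenance


report_a = ("report-a.json", [{"name": "src/png.c", "sha1": "abc", "licenses": ["MIT"]}])
report_b = (
    "report-b.json",
    [
        {"name": "src/png.c", "sha1": "abc", "licenses": ["Zlib"]},
        {"name": "src/zlib.c", "sha1": "def", "licenses": ["Zlib"]},
    ],
)
files, provenance = merge_files([report_a, report_b])
```

The result holds two file entries rather than three, and provenance["abc"] shows that Zlib was contributed by report B.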


Mapping files: the engine's instruction manual

Each format has a TOML file like mappings/spdx2.toml:

package_identity = ["checksum.SHA1", "checksum.MD5", "checksum.SHA256"]

union_fields = ["hasExtractedLicensingInfos"]

conflict_fields = ["PackageName", "PackageCopyrightText"]

The engine reads this and knows:

  • Where packages live
  • What field decides "same package"
  • Which fields to combine vs which to treat as conflicts

Adding a new format = new adapter + new mapping file. The core engine stays unchanged.


Provenance sidecar: the audit trail

Every merge produces two outputs:

  1. merged-report (SPDX, JSON, text, etc.)
  2. merged-report.provenance.json

The sidecar contains:

  • inputs - which files were merged
  • field_provenance - e.g. /packages/0/licenseConcluded came from report-A and report-B
  • conflicts - fields that disagreed, what each input said, what was chosen
  • edits - human corrections applied later

This powers the future UI: "show me which report contributed this license."
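To make the sidecar more concrete, here is a sketch that assembles one and derives its filename. The four top-level keys come from the list above; the value shapes, the function names, and the use of JSON Pointer strings as field_provenance keys are assumptions.

```python
import json
from pathlib import Path


def sidecar_path(output_path: str) -> Path:
    # merged.json -> merged.provenance.json; merged-report -> merged-report.provenance.json
    return Path(output_path).with_suffix(".provenance.json")


def build_sidecar(inputs, field_provenance, conflicts, edits=()):
    return {
        "inputs": list(inputs),
        # Field path -> sorted list of input reports that contributed to it.
        "field_provenance": {
            pointer: sorted(sources) for pointer, sources in field_provenance.items()
        },
        "conflicts": list(conflicts),
        "edits": list(edits),
    }


sidecar = build_sidecar(
    inputs=["report-a.json", "report-b.json"],
    field_provenance={"/packages/0/licenseConcluded": {"report-b.json", "report-a.json"}},
    conflicts=[],
)
serialized = json.dumps(sidecar, indent=2)
```

A UI could then answer "which report contributed this license" with a single lookup in field_provenance.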


Edit layer: corrections that survive re-merge

Users will fix wrong data. Edits are stored as RFC 6902 JSON Patch operations (small change records, similar in spirit to Git diffs):

aggregate = merge(all inputs) + apply(user patches)

Patches target the internal structure (before rendering to text/JSON), so:

  • Re-running merge after an input changes does not wipe manual fixes
  • Patches stay valid when the report is re-serialized

Each edit records who, when, and what changed.
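The "merge, then apply patches" idea can be sketched with a tiny subset of RFC 6902. This hand-written apply_patch supports only the add, replace, and remove operations and is an illustration, not a complete JSON Patch implementation; the function names are hypothetical.

```python
import copy


def _resolve_parent(doc, pointer):
    """Walk an RFC 6901 JSON Pointer and return (parent container, last token)."""
    if not pointer.startswith("/"):
        raise ValueError("only non-root JSON Pointers are supported in this sketch")
    tokens = [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]
    target = doc
    for token in tokens[:-1]:
        target = target[int(token)] if isinstance(target, list) else target[token]
    return target, tokens[-1]


def apply_patch(doc, patch):
    """Return a patched deep copy of doc; the input is left unchanged."""
    result = copy.deepcopy(doc)
    for op in patch:
        parent, last = _resolve_parent(result, op["path"])
        kind = op["op"]
        if kind == "add" and isinstance(parent, list):
            if last == "-":
                parent.append(op["value"])  # "-" means end of array
            else:
                parent.insert(int(last), op["value"])
        elif kind == "add":
            parent[last] = op["value"]
        elif kind == "replace":
            index = int(last) if isinstance(parent, list) else last
            parent[index]  # raises if the target does not exist
            parent[index] = op["value"]
        elif kind == "remove":
            del parent[int(last) if isinstance(parent, list) else last]
        else:
            raise NotImplementedError(kind)
    return result


merged = {"packages": [{"PackageName": "NA", "licenses": ["MIT"]}]}
patches = [
    {"op": "replace", "path": "/packages/0/PackageName", "value": "libpng"},
    {"op": "add", "path": "/packages/0/licenses/-", "value": "Zlib"},
]
aggregate = apply_patch(merged, patches)
```

Because the patches live outside the merge result, re-running the merge and applying the same patches again reproduces the corrected aggregate.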


How this relates to FOSSology uploadsAdd today

| Aspect | uploadsAdd (built into FOSSology) | Report Aggregator (new tool) |
| --- | --- | --- |
| When | While generating from DB | After reports exist on disk |
| How | Mostly append blocks | Identity-based dedup |
| CycloneDX | Broken - only changes filename | Actually merges uploads + files |
| Provenance | None | Full sidecar |
| Conflicts | Silent | Flagged for review |

They complement each other. You can merge individual per-upload reports, or reports that were already partially combined by uploadsAdd.


What gets built (project structure)

report-aggregator/
├── mappings/                 # TOML recipes per format
├── src/report_aggregator/
│   ├── engine/               # Shared merge logic
│   ├── adapters/             # One per format (spdx2, cyclonedx, dep5, ...)
│   └── cli.py                # Command-line entry point
└── tests/golden/             # Real FOSSology outputs for regression tests

MIT-only: the code relies mostly on the Python standard library (json, xml.etree, tomllib). SPDX tag-value, DEP5, and ReadMeOSS get custom parsers because the existing libraries are GPL/Apache/BSD.


v1 scope (what ships first)

| In v1 | Deferred |
| --- | --- |
| CycloneDX 1.4 JSON | CLIXML (phase 4) |
| SPDX 2 tag-value | SPDX RDF/XML, CSV |
| DEP5 | Hierarchical CDX parent root |
| ReadMeOSS | Unified Report .docx |
SPDX 3 JSON (phase 2b)

One-sentence summary

Adapters parse FOSSology reports into native structures, a shared engine merges them by content hash (not filename), fixes ID collisions in graph formats, records every decision in a provenance sidecar, and adapters write back one clean report of the same format.

Discussion Regarding the Presented Architecture

  • The mentors liked this architectural design and agreed that I should start the initial implementation of the project.
  • Any further enhancements will be made during the implementation.

Next Steps

  • Start the initial implementation of the project.
  • Create a repository on GitHub and push the code there.
  • Ownership of this repository will be transferred to FOSSology in the future.