Week 1

(June 2, 2025 - June 9, 2025)

Meeting 1

(June 4, 2025)

Shared updates on implementing a keyword-based prefiltering mechanism similar to the Nomos scanner.
The goal of the approach is to reduce the candidate license set before passing it to the Atarashi similarity-based agents.
Discussed the limitations of keyword-based models and explored the need to move toward ML-based pre-filtering.
Talked about how this new KeywordAgent integrates with Atarashi’s architecture and potential enhancements going forward.

Implemented a new KeywordAgent which performs keyword-based filtering before running Atarashi's similarity-based scanners.
Created a set of keywords that are likely to appear in license texts, to act as early indicators.
Used GPT-4o to help generate a broad list of licenses and associated keyword groups.
Configured the agent to mark a license as a candidate when more than 75% of its keywords are found in a file’s content.
Forwarded positively matched files to Atarashi’s similarity-based agents, such as:
- TfIdfAgent
- DamerauLevenshteinDistance
- WordFrequencySimilarity
- NgramSimilarity
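
The 75% keyword rule described above can be sketched roughly as follows. This is a minimal illustration, not the actual KeywordAgent code: the `KEYWORDS` map, the `normalize` helper, and the `keyword_candidates` function are hypothetical names, and the keyword groups shown are made-up examples rather than the generated list.

```python
import re

# Hypothetical keyword groups for illustration only; the real list is larger.
KEYWORDS = {
    "MIT": [
        "permission is hereby granted",
        "free of charge",
        "without restriction",
        "merchantability",
    ],
    "Apache-2.0": [
        "apache license",
        "version 2.0",
        "licensed under",
        "without warranties",
    ],
}


def normalize(text):
    # Lowercase and collapse whitespace so phrases match across line breaks.
    return re.sub(r"\s+", " ", text.lower())


def keyword_candidates(text, keyword_map=KEYWORDS, threshold=0.75):
    """Return licenses whose matched keyword fraction is strictly above threshold."""
    content = normalize(text)
    candidates = []
    for license_name, keywords in keyword_map.items():
        if not keywords:
            continue  # avoid division by zero for empty groups
        hits = sum(1 for kw in keywords if kw in content)
        if hits / len(keywords) > threshold:
            candidates.append(license_name)
    return candidates
```

Only files for which this returns a non-empty candidate list would be passed on to the similarity-based agents, with the candidate list narrowing the set of licenses they compare against. Note that with a strict "more than 75%" rule and four keywords per group, all four must match.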

The keyword list is still static, so it must be updated manually as new licenses appear.
Since this is rule-based, it cannot generalize well to unseen licenses or variations in text.
Identified this phase as exploratory; will now begin transitioning toward an ML-based prefiltering model for robustness and better generalization.

Start prototyping an ML-based model for prefiltering using the Minerva dataset.
Set up the Minerva dataset locally, run the augmentation steps, and create the database from scratch.