Week 1
(June 2, 2025 - June 9, 2025)
Meeting 1
(June 4, 2025)
Attendees
Discussions
- Shared updates on implementing a keyword-based prefiltering mechanism similar to the Nomos scanner.
- The goal of the approach is to reduce the candidate license set before passing it to the Atarashi similarity-based agents.
- Discussed the limitations of keyword-based models and explored the need to move toward ML-based pre-filtering.
- Talked about how this new KeywordAgent integrates with Atarashi’s architecture and potential enhancements going forward.
Updates
- Implemented a new
KeywordAgent
which performs keyword-based filtering before running Atarashi's similarity-based scanners. - Created a keyword set that likely appears in licenses to act as early indicators.
- Used GPT-4o to help generate a broad list of licenses and associated keyword groups.
- Integrated the agent to mark a license candidate when more than 75% of keywords are found in a file’s content.
- Forwarded positively matched files to Atarashi’s agents like:
TfIdfAgent
DamerauLevenshteinDistance
WordFrequencySimilarity
NgramSimilarity
Problems Identified
- The keyword list is still static — it must be updated manually as new licenses appear.
- Since this is rule-based, it cannot generalize well to unseen licenses or variations in text.
- Identified this phase as exploratory; will now begin transitioning toward an ML-based prefiltering model for robustness and better generalization.
Planning for next week
- Start prototyping an ML-based model for prefiltering using the Minerva dataset.
- Setup the Minerva Dataset locally, run the augmentation steps and create the database from scratch.