Week 1

(June 2, 2025 - June 9, 2025)

Meeting 1

(June 4, 2025)

Attendees

Discussions

  • Shared updates on implementing a keyword-based prefiltering mechanism similar to the Nomos scanner.
  • The goal of the approach is to reduce the candidate license set before passing it to the Atarashi similarity-based agents.
  • Discussed the limitations of keyword-based models and explored the need to move toward ML-based pre-filtering.
  • Talked about how this new KeywordAgent integrates with Atarashi’s architecture and potential enhancements going forward.

Updates

  • Implemented a new KeywordAgent which performs keyword-based filtering before running Atarashi's similarity-based scanners.
  • Created a set of keywords that are likely to appear in license texts, to act as early indicators.
  • Used GPT-4o to help generate a broad list of licenses and associated keyword groups.
  • Configured the agent to mark a license as a candidate when more than 75% of its keywords are found in a file’s content.
  • Forwarded positively matched files to Atarashi’s similarity agents, such as:
    • TfIdfAgent
    • DamerauLevenshteinDistance
    • WordFrequencySimilarity
    • NgramSimilarity
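
The 75% keyword rule above can be illustrated with a minimal sketch. This is not the actual KeywordAgent code: the keyword groups, the `tokenize` and `candidate_licenses` helpers, and the sample text are all hypothetical, and the sketch assumes that each license has its own keyword group and that keywords are matched as whole lowercase tokens.

```python
import re

# Hypothetical keyword groups for illustration only; the real list
# used by the KeywordAgent is not reproduced in these notes.
LICENSE_KEYWORDS = {
    "MIT": ["permission", "hereby", "granted", "software",
            "sublicense", "merchantability", "noninfringement", "copyright"],
    "Apache-2.0": ["apache", "licensor", "contributor", "derivative",
                   "version", "conditions", "works", "license"],
}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def candidate_licenses(text, keyword_groups=LICENSE_KEYWORDS, threshold=0.75):
    """Return the licenses whose keyword hit ratio is strictly above threshold."""
    tokens = tokenize(text)
    candidates = []
    for license_id, keywords in keyword_groups.items():
        if not keywords:
            continue  # skip empty groups to avoid dividing by zero
        hits = sum(1 for keyword in keywords if keyword in tokens)
        if hits / len(keywords) > threshold:
            candidates.append(license_id)
    return candidates


# Shortened, MIT-like sample text (assumed for the example).
SAMPLE = (
    "Permission is hereby granted, free of charge, to any person obtaining "
    "a copy of this software, including the rights to use, copy, and "
    "sublicense it. The software is provided without warranty of any kind, "
    "including MERCHANTABILITY and NONINFRINGEMENT. "
    "Copyright holders are not liable."
)

# Only the shortlisted licenses would then be handed to a similarity agent.
shortlist = candidate_licenses(SAMPLE)
```

In this sketch a file with an empty shortlist would simply not be forwarded, which is where the reduction in work for the similarity-based agents would come from.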

Problems Identified

  • The keyword list is still static: it must be updated manually as new licenses appear.
  • Since this is rule-based, it cannot generalize well to unseen licenses or variations in text.
  • Identified this phase as exploratory; will now begin transitioning toward an ML-based prefiltering model for robustness and better generalization.

Planning for next week

  • Start prototyping an ML-based model for prefiltering using the Minerva dataset.
  • Set up the Minerva dataset locally, run the augmentation steps, and create the database from scratch.