
Introduction

Author

Rajul Jha

Contact info

Project title

Enhancing Atarashi License Scanner

What's the project about?

Atarashi is a modern, information-retrieval-based license scanner integrated into the FOSSology ecosystem. It uses statistical techniques such as TF-IDF, cosine similarity, Damerau-Levenshtein distance, and N-gram distance to identify licenses in source code files. Atarashi currently reaches an accuracy of around 80%; this project aims to improve both the accuracy and the robustness of its predictions.

The main objectives of this project include:

  • Adding a keyword-based pre-filtering mechanism to improve match precision and cut the time agents spend on redundant scanning.
  • Enhancing the existing classifier with better similarity metrics and model tuning.
  • Incorporating fallback logic to handle ambiguous or low-confidence license predictions.
  • Utilizing the Minerva license dataset to train and evaluate the model more effectively.
  • Ensuring seamless integration of improvements into the existing open pull request #1634.

What should be done?

Integrating a keyword-based pre-filtering model

  • Develop a pre-filtering module that leverages a configurable keyword list.
  • This filter narrows the set of candidate licenses so that the classifier can focus on the most likely matches.
  • Document the keyword matching logic and make the keywords configurable.
  • Move towards an ML-based approach for keyword pre-filtering.
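
The keyword pre-filter described above could look roughly like the following. This is a minimal sketch, not Atarashi's actual implementation: the `KEYWORDS` map, the `prefilter` function, and the `min_hits` parameter are all hypothetical names chosen for illustration, and the keyword phrases are examples rather than a vetted list.

```python
import re

# Hypothetical keyword map: license short name -> phrases that hint at it.
# These entries are illustrative, not Atarashi's actual configuration.
KEYWORDS = {
    "MIT": ["permission is hereby granted", "mit license"],
    "GPL-2.0": ["gnu general public license", "version 2"],
    "Apache-2.0": ["apache license", "version 2.0"],
}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def prefilter(text, keywords=KEYWORDS, min_hits=1):
    """Return candidate license names, ordered by number of keyword hits.

    An empty list means no keyword matched, so the caller may decide to
    skip the more expensive similarity scan for this file.
    """
    haystack = normalize(text)
    hits = {}
    for name, phrases in keywords.items():
        count = sum(1 for phrase in phrases if phrase in haystack)
        if count >= min_hits:
            hits[name] = count
    # Most hits first; ties are broken alphabetically for stable output.
    return sorted(hits, key=lambda name: (-hits[name], name))
```

Keeping the keyword map as plain data means it can later be loaded from a configuration file, and the same function signature could wrap an ML-based filter without changing callers.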

Improving the classifier

  • Analyze the current classifier’s performance using Minerva as a benchmark.
  • Explore enhancements to the similarity metrics, or a switch to more robust statistical models.
  • Retrain and validate the model with improved datasets and parameters.
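
As a reference point for the similarity work, the sketch below ranks reference license texts against an input using TF-IDF weights and cosine similarity, written with the standard library only. It is an assumption-laden illustration, not Atarashi's code: the function names, the tokenizer, and the smoothed IDF formula are choices made here for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    """Compute a smoothed inverse document frequency for each term."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + freq)) + 1.0 for term, freq in df.items()}


def tfidf(tokens, idf):
    """Build a sparse TF-IDF vector; terms unseen in the references are dropped."""
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(text, references):
    """Return (license_name, score) pairs sorted by descending similarity.

    `references` maps a license name to its reference text (hypothetical input).
    """
    names = list(references)
    docs = [tokenize(references[name]) for name in names]
    idf = build_idf(docs)
    query = tfidf(tokenize(text), idf)
    scores = [(name, cosine(query, tfidf(doc, idf))) for name, doc in zip(names, docs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

A baseline like this makes it easier to measure whether a different metric or model actually improves on the current approach when both are run against the same benchmark.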

Fallback mechanism for ambiguous predictions

  • Define thresholds for low-confidence matches.
  • When confidence falls below the threshold, apply a secondary mechanism such as a fuzzy-match fallback or a keyword-only fallback.
  • Clearly log fallback occurrences for later analysis.
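
The threshold-and-fallback flow above can be sketched as follows. Everything here is hypothetical: the `classify_with_fallback` name, the default threshold of 0.6, the logger name, and the convention that each classifier is a callable returning a `(license_name, score)` pair are assumptions for illustration only.

```python
import logging

# Hypothetical logger name; a real agent would use its own logging setup.
logger = logging.getLogger("atarashi.fallback")


def classify_with_fallback(text, primary, fallback, threshold=0.6):
    """Run the primary classifier and fall back when confidence is low.

    `primary` and `fallback` are callables that take the text and return
    a (license_name, score) pair. The third element of the result records
    which path produced the answer.
    """
    name, score = primary(text)
    if score >= threshold:
        return name, score, "primary"

    # Log every fallback so that low-confidence cases can be analyzed later.
    logger.info(
        "fallback used: primary=%s score=%.2f threshold=%.2f",
        name, score, threshold,
    )
    fb_name, fb_score = fallback(text)
    if fb_name is None:
        # The fallback found nothing; keep the primary guess but flag it.
        return name, score, "primary-low-confidence"
    return fb_name, fb_score, "fallback"
```

Returning the source of the decision alongside the result keeps low-confidence predictions visible to downstream consumers instead of hiding them behind a single label.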

Utilize Minerva dataset for training and evaluation

  • Integrate the Minerva dataset into the Atarashi pipeline for model refinement.
  • Apply data pre-processing and augmentation where necessary.
  • Compare performance with and without Minerva enhancement.
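
A with/without comparison could be measured with a small harness like the one below. It is a sketch under stated assumptions: `accuracy` and `compare` are hypothetical helpers, each classifier is assumed to be a callable that maps text to a license name, and the labeled samples are assumed to be `(text, expected_license)` pairs; the actual format of the Minerva dataset is not modeled here.

```python
def accuracy(classifier, samples):
    """Fraction of samples for which the classifier returns the expected label.

    `samples` is an iterable of (text, expected_license) pairs.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    correct = sum(1 for text, expected in samples if classifier(text) == expected)
    return correct / len(samples)


def compare(baseline, candidate, samples):
    """Score two classifiers on the same samples and report the difference."""
    samples = list(samples)  # materialize once so both runs see identical data
    base = accuracy(baseline, samples)
    cand = accuracy(candidate, samples)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

Evaluating both variants on the same held-out samples keeps the comparison fair; samples used for training should not be reused for this measurement.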

Seamless integration with FOSSology pull request

  • All changes must be backward compatible and align with the architecture in PR #1634.
  • Create an Atarashi wrapper for FOSSology and introduce it as a FOSSology agent.
  • Write tests that cover the new pre-filtering, classification, and fallback code paths.