Introduction
Author
Contact info
Project title
Enhancing Atarashi License Scanner
What's the project about?
Atarashi is a modern, information-retrieval-based license scanner integrated into the FOSSology ecosystem. It uses statistical techniques such as TF-IDF, cosine similarity, Damerau-Levenshtein distance, and N-gram distance to identify licenses in source code files. Atarashi currently reaches an accuracy of around 80%; this project aims to improve both the accuracy and the robustness of its predictions.
The main objectives of this project include:
- Adding a keyword-based pre-filtering mechanism to improve match precision and reduce the time agents spend on redundant scanning.
- Enhancing the existing classifier with better similarity metrics and model tuning.
- Incorporating fallback logic to handle ambiguous or low-confidence license predictions.
- Utilizing the Minerva license dataset to train and evaluate the model more effectively.
- Ensuring seamless integration of improvements into the existing open pull request #1634.
What should be done?
Integrating a keyword-based pre-filtering model
- Develop a pre-filtering module that leverages a configurable keyword list.
- Use this filter to narrow the set of candidate licenses before classification.
- Document the keyword matching logic and make the keywords configurable.
- Move towards an ML-based approach for keyword pre-filtering.
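The keyword pre-filter described above could look roughly like the following. This is a minimal sketch, not Atarashi's actual code: the `KEYWORDS` mapping, its phrases, and the `normalize` and `prefilter` helpers are hypothetical names chosen for illustration, and a real keyword list would be loaded from a configuration file.

```python
import re

# Hypothetical keyword list: license short name -> phrases that hint at it.
# In the proposed module this would be loaded from a configurable file.
KEYWORDS = {
    "MIT": ["permission is hereby granted", "mit license"],
    "Apache-2.0": ["apache license", "www.apache.org/licenses"],
    "GPL-2.0": ["gnu general public license"],
}


def normalize(text):
    """Lowercase the text and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def prefilter(text, keywords=KEYWORDS):
    """Return the licenses whose keyword list has at least one hit in text.

    An empty result means no keyword matched, so the caller can either
    skip the file or fall back to comparing against every license.
    """
    haystack = normalize(text)
    return [
        name
        for name, phrases in keywords.items()
        if any(phrase in haystack for phrase in phrases)
    ]
```

The result is only a candidate list: the classifier still makes the final decision, but it compares the file against fewer reference texts.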
Improving the classifier
- Analyze the current classifier’s performance using Minerva as a benchmark.
- Explore enhancing the similarity metrics or switching to more robust statistical models.
- Retrain and validate the model with improved datasets and parameters.
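As a reference point when comparing similarity metrics, TF-IDF weighting with cosine similarity can be sketched in plain Python as follows. This is an illustrative baseline under simple assumptions (whitespace tokenization, smoothed IDF), not the implementation used by Atarashi; all function names here are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def build_idf(documents):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; terms unseen in the corpus are ignored."""
    counts = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(text, references):
    """Return (license, score) pairs sorted by descending similarity."""
    idf = build_idf(list(references.values()))
    query = tfidf_vector(text, idf)
    scores = [(name, cosine(query, tfidf_vector(ref, idf)))
              for name, ref in references.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because `rank` returns a score for every reference text, the top score can serve as the confidence value that a fallback threshold is checked against.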
Fallback mechanism for ambiguous predictions
- Define thresholds for low-confidence matches.
- When confidence falls below the threshold, apply a secondary mechanism such as a fuzzy-match fallback or a keyword-only fallback.
- Clearly log fallback occurrences for later analysis.
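The threshold, fallback, and logging steps above could be combined along these lines. This is a sketch under stated assumptions: the threshold value, the logger name, and the `predict_with_fallback` function are hypothetical, and `primary` and `fallback` stand in for whatever classifier and secondary matcher the project ends up using.

```python
import logging

logger = logging.getLogger("atarashi.fallback")  # hypothetical logger name

# Assumed value for illustration; the real threshold would be tuned on data.
CONFIDENCE_THRESHOLD = 0.30


def predict_with_fallback(text, primary, fallback, threshold=CONFIDENCE_THRESHOLD):
    """Run primary; if its confidence is below threshold, log and use fallback.

    Both callables take the text and return (license_name_or_None, score).
    The third element of the result records which path produced the answer.
    """
    name, score = primary(text)
    if name is not None and score >= threshold:
        return name, score, "primary"
    # Record every fallback so low-confidence cases can be analyzed later.
    logger.info("fallback used: primary=%r score=%.3f threshold=%.3f",
                name, score, threshold)
    fb_name, fb_score = fallback(text)
    return fb_name, fb_score, "fallback"
```

Returning the path taken alongside the prediction makes it easy to count fallback occurrences during evaluation without parsing logs.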
Utilize Minerva dataset for training and evaluation
- Integrate the Minerva dataset into the Atarashi pipeline for model refinement.
- Apply data pre-processing and augmentation where necessary.
- Compare performance with and without Minerva enhancement.
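The comparison in the last step can be as simple as scoring two predictors on the same labeled samples. The sketch below assumes the samples are available as `(text, expected_license)` pairs; how Minerva data is loaded into that shape is left open, and the `accuracy` and `compare` helpers are hypothetical.

```python
def accuracy(predict, samples):
    """Fraction of (text, expected_license) samples that predict gets right."""
    if not samples:
        return 0.0
    correct = sum(1 for text, expected in samples if predict(text) == expected)
    return correct / len(samples)


def compare(baseline, candidate, samples):
    """Accuracy of both predictors on the same samples, plus the difference."""
    base = accuracy(baseline, samples)
    cand = accuracy(candidate, samples)
    return {"baseline": base, "candidate": cand, "delta": cand - base}
```

Evaluating both variants on an identical held-out sample set keeps the comparison fair; samples used for training should not be reused here.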
Seamless integration with FOSSology pull request
- All changes must be backward compatible and align with the architecture in PR #1634.
- Create an Atarashi wrapper for FOSSology and introduce it as a FOSSology agent.
- Write tests that cover the new pre-filtering, classification, and fallback code paths.