Introduction
Author
Contact info
Project title
Enhancing Nirjas & Atarashi for Accurate, Scalable License Intelligence
What's the project about?
This project improves FOSSology's license intelligence pipeline in two connected parts:
- Nirjas revamp
  - Move from regex-heavy extraction to Tree-sitter based parsing for robust, language-aware comment extraction.
  - Build a scalable multi-language extraction flow using language packs and compatibility checks.
- Atarashi upgrade
  - Move from broad full-file matching to a signal-driven retrieval + classification approach.
  - Support confidence-aware predictions and a safer UNKNOWN/abstain behavior for weak evidence.
The data and model work is backed by an upgraded Minerva-style dataset pipeline, with balanced task-specific datasets for both Nirjas and Atarashi.
What should be done?
1) Tree-sitter based Nirjas extraction pipeline
- Generalize the existing Tree-sitter proof-of-concept into a reusable extraction architecture.
- Build language-pack management and parser compatibility validation.
- Maintain a normalized extraction contract for downstream consumers.
- Add fallback behavior for unsupported/incompatible parsers.
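One way to picture the normalized contract and the fallback path is the minimal sketch below. It is an illustration under stated assumptions, not the Nirjas API: the `Comment` record, the `PARSER_EXTRACTORS` registry, and `extract_comments` are hypothetical names, and the fallback handles only `#`-style line comments (it would also match a `#` inside a string literal, which a parser-backed extractor avoids).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Comment:
    """Hypothetical normalized record handed to downstream consumers."""
    text: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    source: str      # which extractor produced the record


Extractor = Callable[[str], List[Comment]]

# Hypothetical registry mapping a language name to a parser-backed extractor.
# A Tree-sitter extractor would be registered here once its language pack
# has passed the compatibility check.
PARSER_EXTRACTORS: Dict[str, Extractor] = {}

_HASH_COMMENT = re.compile(r"#(.*)$")


def regex_fallback(source: str) -> List[Comment]:
    """Naive fallback: treat everything after '#' on a line as a comment."""
    comments = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = _HASH_COMMENT.search(line)
        if match:
            comments.append(
                Comment(match.group(1).strip(), lineno, lineno, "regex")
            )
    return comments


def extract_comments(source: str, language: str) -> List[Comment]:
    """Use the registered parser if there is one, otherwise fall back."""
    extractor = PARSER_EXTRACTORS.get(language)
    if extractor is not None:
        try:
            return extractor(source)
        except Exception:
            # Sketch only: a real implementation would catch the specific
            # errors raised by an unsupported or incompatible parser.
            pass
    return regex_fallback(source)
```

Because both paths return the same `Comment` shape, a consumer such as a classifier would not need to know which extractor ran; the `source` field is there only for debugging and evaluation.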
2) Dataset generation for Nirjas and Atarashi
- Use a unified data pipeline over multiple sources (e.g. license corpora + non-license comments).
- Build train/validation/test splits for:
  - Nirjas (license_comment vs non_license_comment)
  - Atarashi (license_id multiclass)
- Include hard negatives, augmentation, and near-dedup for better generalization.
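The near-dedup and split steps above could look roughly like the following sketch. It is a simplification under assumptions: the function names, the word-shingle size, and the 0.8 Jaccard threshold are illustrative choices rather than values from the project, and the pairwise comparison is quadratic, so a real pipeline would more likely use MinHash/LSH.

```python
import hashlib
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int = 3) -> Set[Shingle]:
    """Lower-cased word k-grams; short texts become a single shingle."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_dedup(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if it is not too similar to any text already kept."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for text in texts:
        current = shingles(text)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


def assign_split(text: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministic split: the same text always lands in the same split."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"
```

Running deduplication before the split matters here: if near-duplicates survive, copies of the same license header can end up in both train and test and inflate held-out scores.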
3) Nirjas classifier
- Select strong teacher embedding models using benchmark results together with task-specific evaluations.
- Distill to lightweight model2vec representations.
- Train a binary classifier optimized for high-throughput CPU inference.
4) Atarashi model path selection
- Implement and compare two tracks:
  - Distilled model2vec retrieval/classification path.
  - Fine-tuned + quantized embedding baseline for CPU-first deployment.
- Select the production path based on quality/latency trade-offs measured on held-out data.
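The selection step can be made explicit with a small decision rule. The sketch below is one possible rule, not the project's chosen criterion: `TrackResult`, `select_track`, the use of macro-F1 and p95 latency as the two metrics, and the `min_f1_gain` tolerance are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrackResult:
    """Hypothetical summary of one candidate track on held-out data."""
    name: str
    macro_f1: float        # quality metric (assumed)
    p95_latency_ms: float  # CPU latency per fragment (assumed)


def select_track(
    results: List[TrackResult],
    max_latency_ms: float,
    min_f1_gain: float = 0.01,
) -> Optional[TrackResult]:
    """Pick the fastest track whose quality is close to the best eligible one.

    Tracks over the latency budget are discarded first. Among the rest,
    a slower track wins only if it improves macro-F1 by more than
    min_f1_gain over the faster alternatives.
    """
    eligible = [r for r in results if r.p95_latency_ms <= max_latency_ms]
    if not eligible:
        return None
    best = max(eligible, key=lambda r: r.macro_f1)
    close = [r for r in eligible if best.macro_f1 - r.macro_f1 <= min_f1_gain]
    return min(close, key=lambda r: r.p95_latency_ms)
```

Writing the rule down before running the comparison keeps the choice between the two tracks reproducible and makes the latency budget an explicit input instead of a judgment made after seeing the numbers.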
5) Confidence-aware output and integration
- Add confidence scoring and UNKNOWN behavior in low-evidence cases.
- Integrate Nirjas → Atarashi flow end-to-end.
- Add tests, documentation, and reproducible evaluation steps.
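The UNKNOWN behavior can be illustrated with a simple thresholding rule over per-license scores. This is a hedged sketch, not the Atarashi output format: `predict_with_abstain`, the softmax-based confidence, and the two default thresholds are assumptions, and softmax scores are not calibrated probabilities unless the model is calibrated separately.

```python
import math
from typing import Dict, Tuple

UNKNOWN = "UNKNOWN"


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over raw per-label scores."""
    peak = max(scores.values())
    exps = {label: math.exp(value - peak) for label, value in scores.items()}
    total = sum(exps.values())
    return {label: value / total for label, value in exps.items()}


def predict_with_abstain(
    scores: Dict[str, float],
    min_confidence: float = 0.7,
    min_margin: float = 0.2,
) -> Tuple[str, float]:
    """Return (label, confidence), or (UNKNOWN, confidence) on weak evidence.

    Abstains when the top label is not confident enough, or when it does
    not lead the runner-up by a clear margin.
    """
    if not scores:
        return UNKNOWN, 0.0
    probs = softmax(scores)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    top_label, top_prob = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_prob < min_confidence or top_prob - runner_up < min_margin:
        return UNKNOWN, top_prob
    return top_label, top_prob
```

Returning the confidence alongside UNKNOWN lets a compliance reviewer see how close the model was to a decision, and the thresholds would be tuned on held-out data rather than fixed at the defaults shown.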
Expected outcomes
- More accurate and robust extraction of license-relevant comments from real-world source files.
- Faster and more reliable license prediction on high-signal fragments.
- Better scalability and maintainability for FOSSology's license detection workflows.
- Clearer model outputs for compliance workflows with confidence and abstention support.