Introduction
Author
Contact info
Project title
Data Pipelining For Safaa
What's the project about?
Safaa currently provides a solid framework for handling copyright notices, focusing on identifying and reducing false positives and on streamlining the decluttering step that removes unnecessary content. Key features of Safaa include:
- Model Flexibility
- Integration with scikit-learn
- spaCy Integration
- Preprocessing Tools
However, in the Safaa project today the training data is curated manually, and most of the surrounding workflow is manual as well. This project will concentrate on creating an automated pipeline, optionally using LLMs or deep learning techniques where they improve accuracy.
A central part of this is writing scripts that automatically copy copyright data (a group's data or selected users' data) from a FOSSology instance to train the models.
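As a starting point, fetching could be a small script that reads the copyright findings straight from the FOSSology PostgreSQL database. The sketch below assumes a local instance with the default `fossology`/`fossy` credentials and a `copyright` table holding a `content` column; these are assumptions to verify against the actual server, and the group/user scoping mentioned above is left out.

```python
# Minimal sketch: pull raw copyright statements from a local FOSSology
# PostgreSQL database. Connection parameters and the table/column names
# ("copyright", "content") are assumptions to check against the instance.
import csv

import psycopg2  # pip install psycopg2-binary


def fetch_copyright_rows(limit=None):
    conn = psycopg2.connect(
        host="localhost",    # FOSSology DB host (assumed local instance)
        dbname="fossology",  # default database name (assumption)
        user="fossy",        # default DB user (assumption)
        password="fossy",    # default DB password (assumption)
    )
    query = "SELECT content FROM copyright WHERE content IS NOT NULL"
    if limit:
        query += f" LIMIT {int(limit)}"
    with conn, conn.cursor() as cur:
        cur.execute(query)
        rows = [r[0] for r in cur.fetchall()]
    conn.close()
    return rows


if __name__ == "__main__":
    # Dump the fetched statements to a CSV file for later preprocessing.
    with open("copyrights_raw.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["content"])
        for content in fetch_copyright_rows(limit=10000):
            writer.writerow([content])
```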
What should be done?
Here are the key tasks planned for the project:
- Create scripts to fetch copyright data from the FOSSology server's copyright table (localhost).
- Clean and preprocess the fetched copyright data (utilizing the prewritten preprocessing functions).
- The preprocessed data should contain a label and the cleaned text.
- Split the data into training/validation/test sets.
- Train the false-positive model as well as the declutter model (utilizing the prewritten training functions).
- Evaluate the models (precision, recall, etc.); a rough sketch of the split/train/evaluate stage follows this list.
- Version and release the models.
- The pipeline should work on both GitLab and GitHub.
- It should support a manual trigger.
- It should also be able to run as a cron job; see the entry-point sketch after this list.
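The split, training, and evaluation tasks could look roughly like the following sketch. It stands in for Safaa's prewritten preprocessing and training functions with a plain scikit-learn TF-IDF + logistic regression placeholder; the labelled CSV layout and column names (`text`, `label`) are assumptions, not the project's final design.

```python
# Rough sketch of the split/train/evaluate stage using plain scikit-learn.
# In the real pipeline, Safaa's prewritten preprocessing and training
# functions would replace the TF-IDF + LogisticRegression placeholder;
# the CSV layout ("text", "label" columns) is an assumption.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Labelled data produced by the fetch + preprocessing steps (assumed format).
df = pd.read_csv("copyrights_labelled.csv")  # columns: text, label

# Split into train/validation/test (60/20/20), stratified by label.
train_df, rest_df = train_test_split(
    df, test_size=0.4, random_state=42, stratify=df["label"])
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, random_state=42, stratify=rest_df["label"])

# Placeholder false-positive classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_df["text"], train_df["label"])

# Evaluate: precision, recall, F1 on the held-out sets.
print("validation:\n", classification_report(val_df["label"], model.predict(val_df["text"])))
print("test:\n", classification_report(test_df["label"], model.predict(test_df["text"])))
```

Versioning and release could then amount to saving the fitted model with a version tag (for example via joblib.dump) and attaching it to a GitLab/GitHub release.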
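For the trigger requirements, one option is a single pipeline entry point that can be started by hand, from a GitLab/GitHub CI job, or from cron. The stage names and the run_pipeline.py filename below are hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical pipeline entry point (run_pipeline.py). The same script can be
# triggered manually, from a GitLab/GitHub CI job, or from cron, e.g.:
#   0 2 * * 0  /usr/bin/python3 /opt/safaa/run_pipeline.py --stages all
import argparse

STAGES = ["fetch", "preprocess", "train", "evaluate", "release"]


def run_stage(name: str) -> None:
    # Placeholder: each stage would call the corresponding script/function
    # (fetch from FOSSology, preprocess, train, evaluate, version + release).
    print(f"running stage: {name}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Safaa data pipeline runner (sketch)")
    parser.add_argument("--stages", default="all",
                        help="comma-separated stages to run, or 'all'")
    args = parser.parse_args()
    selected = STAGES if args.stages == "all" else args.stages.split(",")
    for stage in selected:
        run_stage(stage)


if __name__ == "__main__":
    main()
```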