Introduction
Author
Contact info
Project title
Data Pipelining For Safaa
What's the project about?
Safaa currently provides a solid framework for handling copyright notices, focusing on identifying and reducing false positives and on streamlining the decluttering step that removes unnecessary content. Key features of Safaa include:
- Model Flexibility
- Integration with scikit-learn
- spaCy Integration
- Preprocessing Tools
However, in the Safaa project today the training data is curated manually, and most of the surrounding workflow is manual as well. This project will concentrate on creating an automated pipeline, optionally using LLMs or deep learning techniques where they improve accuracy.
A central part of this is writing scripts that automatically copy copyright data (a group's data or selected users' data) from a FOSSology instance to train the models.
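As a starting point, fetching could be a small script that reads the copyright findings straight from the FOSSology PostgreSQL database. The sketch below assumes a local instance with the default `fossology`/`fossy` credentials and a `copyright` table holding a `content` column; these are assumptions to verify against the actual server, and the group/user scoping mentioned above is left out.

```python
# Minimal sketch: pull raw copyright statements from a local FOSSology
# PostgreSQL database. Connection parameters and the table/column names
# ("copyright", "content") are assumptions to check against the instance.
import csv

import psycopg2  # pip install psycopg2-binary


def fetch_copyright_rows(limit=None):
    conn = psycopg2.connect(
        host="localhost",    # FOSSology DB host (assumed local instance)
        dbname="fossology",  # default database name (assumption)
        user="fossy",        # default DB user (assumption)
        password="fossy",    # default DB password (assumption)
    )
    query = "SELECT content FROM copyright WHERE content IS NOT NULL"
    if limit:
        query += f" LIMIT {int(limit)}"
    with conn, conn.cursor() as cur:
        cur.execute(query)
        rows = [r[0] for r in cur.fetchall()]
    conn.close()
    return rows


if __name__ == "__main__":
    # Dump the fetched statements to a CSV file for later preprocessing.
    with open("copyrights_raw.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["content"])
        for content in fetch_copyright_rows(limit=10000):
            writer.writerow([content])
```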
What should be done?
Here are the key tasks planned for the project:
- Create scripts to fetch copyright data from the FOSSology server's copyright table (localhost).
- Clean and preprocess the fetched copyright data (utilizing the prewritten preprocessing functions).
- The preprocessed data should contain a label and the cleaned text.
- Split the data into training/validation/test sets.
- Train the false-positive model as well as the declutter model (utilizing the prewritten training functions).
- Evaluate the models (precision, recall, etc.); a rough sketch of the split/train/evaluate stage follows this list.
- Version and release the models.
- The pipeline should work on both GitLab and GitHub.
- It should support a manual trigger.
- It should also be able to run as a cron job; see the entry-point sketch after this list.
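The split, training, and evaluation tasks could look roughly like the following sketch. It stands in for Safaa's prewritten preprocessing and training functions with a plain scikit-learn TF-IDF + logistic regression placeholder; the labelled CSV layout and column names (`text`, `label`) are assumptions, not the project's final design.

```python
# Rough sketch of the split/train/evaluate stage using plain scikit-learn.
# In the real pipeline, Safaa's prewritten preprocessing and training
# functions would replace the TF-IDF + LogisticRegression placeholder;
# the CSV layout ("text", "label" columns) is an assumption.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Labelled data produced by the fetch + preprocessing steps (assumed format).
df = pd.read_csv("copyrights_labelled.csv")  # columns: text, label

# Split into train/validation/test (60/20/20), stratified by label.
train_df, rest_df = train_test_split(
    df, test_size=0.4, random_state=42, stratify=df["label"])
val_df, test_df = train_test_split(
    rest_df, test_size=0.5, random_state=42, stratify=rest_df["label"])

# Placeholder false-positive classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_df["text"], train_df["label"])

# Evaluate: precision, recall, F1 on the held-out sets.
print("validation:\n", classification_report(val_df["label"], model.predict(val_df["text"])))
print("test:\n", classification_report(test_df["label"], model.predict(test_df["text"])))
```

Versioning and release could then amount to saving the fitted model with a version tag (for example via joblib.dump) and attaching it to a GitLab/GitHub release.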
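For the trigger requirements, one option is a single pipeline entry point that can be started by hand, from a GitLab/GitHub CI job, or from cron. The stage names and the run_pipeline.py filename below are hypothetical.

```python
#!/usr/bin/env python3
# Hypothetical pipeline entry point (run_pipeline.py). The same script can be
# triggered manually, from a GitLab/GitHub CI job, or from cron, e.g.:
#   0 2 * * 0  /usr/bin/python3 /opt/safaa/run_pipeline.py --stages all
import argparse

STAGES = ["fetch", "preprocess", "train", "evaluate", "release"]


def run_stage(name: str) -> None:
    # Placeholder: each stage would call the corresponding script/function
    # (fetch from FOSSology, preprocess, train, evaluate, version + release).
    print(f"running stage: {name}")


def main() -> None:
    parser = argparse.ArgumentParser(description="Safaa data pipeline runner (sketch)")
    parser.add_argument("--stages", default="all",
                        help="comma-separated stages to run, or 'all'")
    args = parser.parse_args()
    selected = STAGES if args.stages == "all" else args.stages.split(",")
    for stage in selected:
        run_stage(stage)


if __name__ == "__main__":
    main()
```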