
WEEK 6

(July 4, 2024)

Attendees:

Discussion:

  1. Completed the Python script for fetching copyright contents from the database, incorporating Gaurav's recommendation to also retrieve user-modified contents. The updated script now collects copyrights and stores them in a CSV file with four columns:
original_content, original_is_enabled, edited_content, modified_is_enabled

You can find the updated script here; a simplified sketch of the export step is shown below.
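For illustration only, here is a minimal sketch of that export step. The table and column names in the query, the join on a copyright_event table, and the use of sqlite3 are assumptions made for the sketch, not the actual script or database schema.

# Minimal sketch of exporting original and user-edited copyright contents to CSV.
# Table/column names and the sqlite3 driver are assumptions for illustration.
import csv
import sqlite3  # the real script may use a different database driver

def export_copyrights(db_path: str, out_csv: str) -> None:
    """Fetch original and user-edited copyright contents and write them to a CSV file."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    # Hypothetical query joining original findings with user edits.
    cur.execute(
        """
        SELECT c.content, c.is_enabled, e.content, e.is_enabled
        FROM copyright AS c
        LEFT JOIN copyright_event AS e ON e.copyright_id = c.id
        """
    )
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(
            ["original_content", "original_is_enabled",
             "edited_content", "modified_is_enabled"]
        )
        writer.writerows(cur.fetchall())
    conn.close()

if __name__ == "__main__":
    export_copyrights("copyrights.db", "copyrights.csv")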

  2. Worked on automating the model training process: once 500 new entries accumulate in the database, the Safaa model should be retrained. I explored GitHub Actions and wrote a YAML workflow to check the number of new entries and trigger the retraining script when the threshold is met. However, due to connection issues between GitHub Actions and the locally hosted database, I consulted the mentors. They suggested triggering retraining when a new copyright file is uploaded to the repository instead. This task will continue in the coming week, and updates will be provided at the next meeting. A sketch of the threshold check follows this item.
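As a rough illustration of the original threshold idea (run locally or from a workflow), the sketch below counts database entries and compares the growth against the 500-entry threshold. The table name, the state file used to remember the last retraining point, and the sqlite3 driver are all assumptions for the sketch.

# Minimal sketch of the retraining threshold check; names are illustrative only.
import sqlite3

THRESHOLD = 500          # retrain once this many new entries have accumulated
STATE_FILE = "last_count.txt"

def count_entries(db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    (total,) = conn.execute("SELECT COUNT(*) FROM copyright").fetchone()
    conn.close()
    return total

def should_retrain(db_path: str) -> bool:
    total = count_entries(db_path)
    try:
        with open(STATE_FILE) as f:
            last = int(f.read().strip())
    except FileNotFoundError:
        last = 0
    if total - last >= THRESHOLD:
        with open(STATE_FILE, "w") as f:
            f.write(str(total))   # remember the count at the last retraining
        return True
    return False

if __name__ == "__main__":
    if should_retrain("copyrights.db"):
        print("Threshold reached: trigger model retraining")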

  3. Explored incremental learning in Safaa. Currently, Safaa uses Scikit-learn's SVM implementation, which does not support incremental learning and must be retrained from scratch. I therefore switched to Scikit-learn's SGDClassifier, which can be updated incrementally via partial_fit. I computed its classification metrics and found the results comparable to the SVM's. As per the mentors' suggestion, I will create a PR showing the results from both SVM and SGD. You can find my implementation for SVM here, for SGD here, and the comparison between them below. The dataset used for the implementation can be found here. A minimal sketch of the comparison is included after the results.

The results are as follows:

SGD Classifier:

class    precision    recall    f1-score    support
0        0.99         0.99      0.99        2878
1        0.96         0.98      0.97        1016

SVM Classifier:

class    precision    recall    f1-score    support
0        0.99         0.99      0.99        2878
1        0.96         0.97      0.97        1016
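The sketch below illustrates the batch-versus-incremental comparison under stated assumptions: the CSV column names ("content", "label"), the HashingVectorizer, and LinearSVC as the batch baseline are illustrative choices, not Safaa's actual pipeline.

# Minimal sketch: batch linear SVM vs. incrementally updatable SGDClassifier.
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("copyrights.csv")          # assumed columns: "content", "label"
X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["label"], test_size=0.25, random_state=42
)

# HashingVectorizer is stateless, so it works for both batch and streaming use.
vec = HashingVectorizer(n_features=2**18, alternate_sign=False)
Xtr, Xte = vec.transform(X_train), vec.transform(X_test)

# Batch baseline: a linear SVM that must be refit from scratch on new data.
svm = LinearSVC().fit(Xtr, y_train)
print(classification_report(y_test, svm.predict(Xte)))

# Incremental model: SGDClassifier with hinge loss behaves like a linear SVM
# but supports partial_fit, so new batches can update the existing model.
sgd = SGDClassifier(loss="hinge", random_state=42)
sgd.partial_fit(Xtr, y_train, classes=sorted(df["label"].unique()))
print(classification_report(y_test, sgd.predict(Xte)))

# Later, when a new batch (X_new, y_new) arrives:
# sgd.partial_fit(vec.transform(X_new), y_new)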
  4. Started working on creating a Python library for the NER-POS tagging task. I experimented with the Stanford NER Tagger; you can find my work here, and a minimal sketch of the tagging step is shown below. However, I plan to explore fine-tuning BERT or GPT for this task in the coming weeks.
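For reference, here is a minimal sketch of tagging a copyright string with the Stanford NER Tagger through NLTK. The jar and model paths are assumptions that depend on the local installation, and Java plus the NLTK "punkt" tokenizer data must be available.

# Minimal sketch of NER tagging with the Stanford NER Tagger via NLTK.
# Requires Java, the Stanford NER distribution, and nltk.download("punkt").
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

ST_JAR = "/opt/stanford-ner/stanford-ner.jar"                                   # assumed path
ST_MODEL = "/opt/stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz"  # assumed path

tagger = StanfordNERTagger(ST_MODEL, ST_JAR, encoding="utf-8")

text = "Copyright (c) 2024 John Doe, Example Corp."
tokens = word_tokenize(text)
print(tagger.tag(tokens))   # e.g. [('Copyright', 'O'), ..., ('John', 'PERSON'), ...]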

Subsequent Steps

  • Address the GitHub Actions issue by triggering retraining when a new copyright file is uploaded to the repository. Implement everything locally first, then write the final YAML workflow file to test it on GitHub Actions.
  • Explore and implement fine-tuning of BERT or GPT for the NER-POS tagging task; a rough sketch of a possible setup is included below.
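As a hedged sketch of what fine-tuning BERT for token classification could look like with Hugging Face Transformers: the label set, toy examples, and training arguments below are placeholders, not the project's actual data or final configuration.

# Placeholder sketch of BERT fine-tuning for token classification.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-AUTHOR", "I-AUTHOR", "B-DATE"]      # placeholder tag set
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels),
    id2label=dict(enumerate(labels)), label2id=label2id,
)

# Toy sentences, pre-split into words, with one tag per word.
sentences = [["Copyright", "2024", "Jane", "Doe"]]
word_tags = [["O", "B-DATE", "B-AUTHOR", "I-AUTHOR"]]

def encode(words, tags):
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    # Give every sub-word the tag of its word; special tokens get -100
    # so they are ignored by the loss.
    enc["labels"] = [
        -100 if wid is None else label2id[tags[wid]]
        for wid in enc.word_ids()
    ]
    return enc

train_data = [encode(w, t) for w, t in zip(sentences, word_tags)]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-bert", num_train_epochs=1,
                           per_device_train_batch_size=1, report_to="none"),
    train_dataset=train_data,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()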