Week 12

(August 16, 2023)

Attendees:

Started the week by testing the performance of different embedding methods in conjunction with my new preprocessing function.
With GloVe, around 1.24% of copyrights and 1.95% of false positives were misclassified.
Despite variations in preprocessing parameters, GloVe lagged considerably behind the best model I've developed using TF-IDF, with almost a tenfold difference.

Conducted an analysis to determine the proportion of words in the datasets recognized by GloVe:
- Embeddings found for 60.68% of vocab
- Embeddings found for 91.12% of all text
Copyright statements consist predominantly of names, dates, and organizations, and GloVe was not trained specifically on this kind of data. With embeddings missing for roughly 40% of the vocabulary, its subpar performance compared to TF-IDF is easier to understand.
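The two coverage figures above measure different things: the share of distinct words that have an embedding, and the share of all word occurrences that do. A minimal sketch of how they could be computed, assuming whitespace tokenization and a set of known embedding words (the toy documents and the `glove_vocab` set are hypothetical, not the actual dataset):

```python
from collections import Counter


def embedding_coverage(documents, embedding_vocab):
    """Return (vocab_coverage, token_coverage) as fractions in [0, 1]."""
    # Count every token occurrence across all documents.
    counts = Counter(token for doc in documents for token in doc.split())
    if not counts:
        return 0.0, 0.0
    # Distinct words that have an embedding.
    known_types = [t for t in counts if t in embedding_vocab]
    vocab_coverage = len(known_types) / len(counts)
    # Occurrences of those words relative to all occurrences.
    token_coverage = sum(counts[t] for t in known_types) / sum(counts.values())
    return vocab_coverage, token_coverage


# Hypothetical toy data for illustration only.
docs = ["copyright date siemens ag", "copyright date gaurav mishra"]
glove_vocab = {"copyright", "date", "ag"}

vocab_cov, token_cov = embedding_coverage(docs, glove_vocab)
print(f"Embeddings found for {vocab_cov:.2%} of vocab")
print(f"Embeddings found for {token_cov:.2%} of all text")
```

Frequent generic words are usually covered, so token coverage tends to be much higher than vocabulary coverage; the uncovered remainder is where rare names and organizations end up.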

Experimental trials with FastText embeddings did not lead to significant performance improvements, even with different preprocessing.

The current best model misclassifies 0.16% of copyrights and 0.48% of false positives.
Applying a stricter confidence threshold of 0.99 changes these to 0.04% and 3.17%, respectively: fewer misclassified copyrights, at the cost of more misclassified false positives.
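The notes do not spell out how the threshold is applied. One reading that matches the direction of the reported numbers is that a row is labeled a false positive only when the model is at least 0.99 confident, and is kept as a copyright otherwise. A minimal sketch under that assumption (the function names, labels, and toy probabilities are hypothetical):

```python
def apply_threshold(p_false_positive, threshold=0.99):
    """Label a row 'false_positive' only if the model is confident enough."""
    return [
        "false_positive" if p >= threshold else "copyright"
        for p in p_false_positive
    ]


def misclassification_rates(y_true, y_pred):
    """Per-class share of rows whose predicted label differs from the truth."""
    rates = {}
    for label in ("copyright", "false_positive"):
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        rates[label] = sum(p != label for p in preds) / len(preds) if preds else 0.0
    return rates


# Hypothetical toy data: predicted probability of the false-positive class.
y_true = ["copyright", "copyright", "false_positive", "false_positive"]
probs = [0.02, 0.60, 0.995, 0.90]

print(misclassification_rates(y_true, apply_threshold(probs, 0.5)))
print(misclassification_rates(y_true, apply_threshold(probs, 0.99)))
```

On the toy data, raising the threshold removes the misclassified copyright but misclassifies one more false positive, the same kind of trade-off as in the figures above.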

Started testing Named Entity Recognition (NER) models as a way to replace copyright holder entities in the text.
Due to recurring mentions of numerous copyright holders across different files and dataset rows, there's a concern about the model's generalization capability. The idea is to use NER to replace these mentions with generic tags for persons and organizations.
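The replacement idea can be sketched as a function that takes character spans from any NER model and swaps PERSON and ORG mentions for generic tags. This is only an illustration: the tag strings, the helper names, and the `en_core_web_sm` model name are assumptions, and the spans in the example are written by hand rather than produced by a model. With spaCy, the spans could come from `doc.ents` via `start_char`, `end_char`, and `label_`.

```python
# Hypothetical generic tags; the actual tag format is not fixed by these notes.
TAGS = {"PERSON": "person", "ORG": "organization"}


def replace_entities(text, spans, tags=TAGS):
    """Replace (start, end, label) character spans with generic tags."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        # Skip labels we do not replace, and spans overlapping a previous one.
        if label not in tags or start < cursor:
            continue
        out.append(text[cursor:start])
        out.append(tags[label])
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def spacy_spans(text, model_name="en_core_web_sm"):
    """Get entity spans from spaCy (model name is an assumption)."""
    import spacy  # imported lazily so the rest runs without spaCy installed

    nlp = spacy.load(model_name)
    return [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]


def span_of(text, phrase, label):
    """Helper for the hand-written example spans below."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


text = "copyright 2023 siemens ag author gaurav mishra"
spans = [
    span_of(text, "siemens ag", "ORG"),
    span_of(text, "gaurav mishra", "PERSON"),
]
print(replace_entities(text, spans))
```

Keeping the replacement separate from the model makes it easy to compare different pretrained NER models on the same text, since each only needs to supply the list of spans.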

Conducted initial tests with the compact spaCy English model due to space limitations.
Preliminary results were not very promising:
- ] ] copyrightsymbol ] date [siemens (ORG) ag
- ] ] copyrightsymbol ] date [siemens (ORG) ag ] author [gaurav (PERSON) mishra ] email
- ] copyright ] copyrightsymbol ] date ] date [free (ORG) software foundation inc franklin street [fifth (ORDINAL) ] floor [boston (ORG) ma date date ] usa
The model could recognize some entities, but significant refinement is needed to improve its reliability in detecting PERSON and ORG entities.

Plan to explore other pretrained NER models that might be suitable for the task at hand.