Week 12
(August 16, 2023)
Attendees:
Updates:
1. Embedding Methods Testing:
- Started the week by testing the performance of different embedding methods in conjunction with my new preprocessing function.
- Using GloVe, around 1.24% of copyrights and 1.95% of false positives were misclassified.
- Despite variations in preprocessing parameters, GloVe's performance lagged considerably behind the best model I've developed using TF-IDF, with almost a tenfold difference.
2. GloVe Embedding Analysis:
- Conducted an analysis to determine the proportion of words in the datasets recognized by GloVe:
Embeddings found for 60.68% of vocab
Embeddings found for 91.12% of all text
- Copyrights predominantly contain elements like names, dates, and organizations, and GloVe was not specifically trained on this kind of data. This makes its subpar performance compared to TF-IDF easier to understand.
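The two coverage figures above measure different things: the first counts distinct words, the second counts every token occurrence. A minimal sketch of how such numbers could be computed is shown here. The function name `embedding_coverage`, the whitespace tokenization, and the toy data are assumptions for illustration and are not taken from the actual analysis code.

```python
from collections import Counter


def embedding_coverage(texts, known_words):
    """Return (vocab_coverage, token_coverage) as fractions in [0, 1].

    texts: iterable of already preprocessed strings.
    known_words: set of words that have an embedding (hypothetical stand-in
    for the GloVe vocabulary).
    """
    # Count every whitespace-separated token across all texts.
    counts = Counter(token for text in texts for token in text.split())
    if not counts:
        return 0.0, 0.0
    found = [word for word in counts if word in known_words]
    # Share of distinct words that have an embedding.
    vocab_coverage = len(found) / len(counts)
    # Share of all token occurrences that have an embedding.
    token_coverage = sum(counts[w] for w in found) / sum(counts.values())
    return vocab_coverage, token_coverage


# Toy data standing in for the preprocessed dataset and the GloVe vocabulary.
texts = ["copyright date siemens ag", "copyright date gaurav mishra"]
known_words = {"copyright", "date"}

vocab_cov, token_cov = embedding_coverage(texts, known_words)
print(f"Embeddings found for {vocab_cov:.2%} of vocab")
print(f"Embeddings found for {token_cov:.2%} of all text")
```

In the toy data, frequent generic words are covered while names are not, which is the same pattern that gives a high share of all text but a much lower share of the vocabulary.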
3. FastText Experiments:
- Experimental trials with FastText embeddings did not lead to significant performance improvements, even with different preprocessing.
4. Performance Benchmarks:
- The current best model misclassifies 0.16% of copyrights and 0.48% of false positives.
- Applying a stricter confidence threshold of 0.99 changes these numbers to 0.04% and 3.17%, respectively.
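To illustrate how a confidence threshold can shift errors between the two classes, here is a minimal sketch. It assumes the threshold is applied to the predicted probability of the false-positive class, so a row is only discarded as a false positive when the model is very sure. The notes do not state which class the threshold applies to, so this direction, the function names, and the toy probabilities are all assumptions.

```python
def apply_threshold(p_false_positive, threshold=0.99):
    """Label a row 'false_positive' only if its probability reaches the threshold."""
    return [
        "false_positive" if p >= threshold else "copyright"
        for p in p_false_positive
    ]


def misclassification_rates(y_true, y_pred):
    """Share of each true class that received the other label."""
    rates = {}
    for label in ("copyright", "false_positive"):
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        wrong = sum(1 for p in preds if p != label)
        rates[label] = wrong / len(preds) if preds else 0.0
    return rates


# Toy labels and probabilities, not results from the real model.
y_true = ["copyright", "copyright", "false_positive", "false_positive"]
p_false_positive = [0.10, 0.70, 0.95, 0.995]

default_rates = misclassification_rates(y_true, apply_threshold(p_false_positive, 0.5))
strict_rates = misclassification_rates(y_true, apply_threshold(p_false_positive, 0.99))
print("threshold 0.50:", default_rates)
print("threshold 0.99:", strict_rates)
```

Under this assumption, raising the threshold lowers the share of misclassified copyrights while raising the share of misclassified false positives.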
5. Exploratory Testing of NER Models:
- Started testing Named Entity Recognition (NER) models as a way to replace copyright holder mentions in the text.
- Due to recurring mentions of numerous copyright holders across different files and dataset rows, there's a concern about the model's generalization capability. The idea is to use NER to replace these mentions with generic tags for persons and organizations.
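The replacement idea can be sketched independently of any particular NER model: given character spans and labels, swap each PERSON or ORG mention for a generic tag. The helper names `replace_entities` and `span_of`, the lowercase tag format, and the hand-written spans are assumptions for illustration, not the project's actual code.

```python
def replace_entities(text, spans, keep=("PERSON", "ORG")):
    """Replace labeled character spans with a generic lowercase tag.

    spans: iterable of (start_char, end_char, label) tuples.
    """
    pieces, cursor = [], 0
    for start, end, label in sorted(spans):
        if label not in keep or start < cursor:
            continue  # skip other labels and spans overlapping a replaced one
        pieces.append(text[cursor:start])
        pieces.append(label.lower())
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def span_of(text, phrase, label):
    """Hypothetical helper: build a span for the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


text = "copyright 2023 siemens ag author gaurav mishra"
spans = [
    span_of(text, "siemens ag", "ORG"),
    span_of(text, "gaurav mishra", "PERSON"),
]
replaced = replace_entities(text, spans)
print(replaced)
```

With spaCy, the spans could be built from a processed document as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`, so the quality of the replacement depends entirely on what the loaded model recognizes.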
6. Trials with Compact spaCy Model:
- Conducted initial tests with the compact spaCy English model due to space limitations.
- Preliminary results were not very promising:
] ] copyrightsymbol ] date [siemens (ORG) ag
] ] copyrightsymbol ] date [siemens (ORG) ag ] author [gaurav (PERSON) mishra ] email
] copyright ] copyrightsymbol ] date ] date [free (ORG) software foundation inc franklin street [fifth (ORDINAL) ] floor [boston (ORG) ma date date ] usa
- The model could recognize some entities, but significant refinement is needed to improve its reliability in detecting PERSON and ORG entities.
Conclusion and Future Plans:
NER Model Exploration
- Plan to explore other pretrained NER models that might be suitable for the task at hand.