Skip to main content

Week 12

(August,16,2023)

Attendees:

Updates:

1. Embedding Methods Testing:

  • Started the week by testing the performance of different embedding methods in conjunction with my new preprocessing function.
  • Using GloVe, achieved an accuracy with around 1.24% misclassified copyrights and 1.95% misclassified false positives.
  • Despite variations in preprocessing parameters, GloVe's performance lagged considerably behind the best model I've developed using TF-IDF — almost a tenfold difference.

2. GloVe Embedding Analysis:

  • Conducted an analysis to determine the proportion of words in the datasets recognized by GloVe:
    • Embeddings found for 60.68% of vocab
    • Embeddings found for 91.12% of all text
  • Given that copyrights predominantly contain elements like names, dates, and organizations, the subpar performance of GloVe — not specifically trained on this data — in comparison to TF-IDF became clearer.

3. FastText Experiments:

  • Experimental trials with FastText embeddings did not lead to significant performance improvements, even with different preprocessing.

4. Performance Benchmarks:

  • Current best performance indicates 0.16% misclassifications for copyrights and 0.48% for false positive misclassifications.
  • These numbers can be reduced further to 0.04% and 3.17%, respectively, by applying a stricter confidence threshold of 0.99.

5. Exploratory Testing of NER Models:

  • Initiated testing of Named Entity Recognition (NER) models to potentially replace the copyright holder entity.
  • Due to recurring mentions of numerous copyright holders across different files and dataset rows, there's a concern about the model's generalization capability. The idea is to use NER to replace these mentions with generic tags for persons and organizations.

6. Trials with Compact spaCy Model:

  • Conducted initial tests with the compact spaCy English model due to space limitations.
  • Preliminary results were not very promising:
    • ] ] copyrightsymbol ] date [siemens (ORG) ag
    • ] ] copyrightsymbol ] date [siemens (ORG) ag ] author [gaurav (PERSON) mishra ] email
    • ] copyright ] copyrightsymbol ] date ] date [free (ORG) software foundation inc franklin street [fifth (ORDINAL) ] floor [boston (ORG) ma date date ] usa
  • The model could recognize some entities, but significant refinement is needed to improve its reliability in detecting PERSON and ORG entities.

Conclusion and Future Plans:

NER Model Exploration

  • Plan to explore other pretrained NER models that might be suitable for the task at hand.