Week 16

(September 13, 2023)

Attendees:

Started writing the code to clean the conll2003 dataset, as mentioned in week 14:
- Merged PER and ORG entities.
- Discarded LOC and MISC entities since they are not pertinent to my requirements.
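
The two cleaning steps above could be sketched roughly as follows. This is only an illustration, not the actual cleaning script: it assumes the tags are IOB-style strings such as "B-PER", that the merged label is named ENT, and that "discarding" means relabeling LOC and MISC tokens as "O". The function name remap_tags is hypothetical.

```python
# Assumption: PER and ORG collapse into one label; LOC and MISC become "O".
MERGED = {"PER", "ORG"}
DROPPED = {"LOC", "MISC"}


def remap_tags(tags, merged_label="ENT"):
    """Map CoNLL-2003 style IOB tags (e.g. "B-PER") to the reduced scheme."""
    out = []
    for tag in tags:
        if tag == "O":
            out.append("O")
            continue
        # "B-PER" -> prefix "B", entity type "PER"
        prefix, _, etype = tag.partition("-")
        if etype in MERGED:
            out.append(f"{prefix}-{merged_label}")
        elif etype in DROPPED:
            out.append("O")
        else:
            raise ValueError(f"unexpected tag: {tag}")
    return out


example = ["B-PER", "I-PER", "O", "B-LOC", "B-ORG"]
print(remap_tags(example))
```

Keeping the B-/I- prefix preserves entity boundaries, so a person name directly followed by an organization name still yields two separate ENT spans.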

Conducted another round of fine-tuning using 750 examples from my dataset and assessed the NER model's performance within my preprocessing function.
- Noticed a slight dip in performance due to obfuscation of repetitive copyright holder names in the dataset.

Labeled an additional 750 examples, for a total of slightly over 1,500, and fine-tuned the primary model on this data.
- Although the model performed well, it occasionally labeled text in non-copyright sentences as ENT (the copyright holder entity), which could increase false positives.
- Below are some detection results on unseen examples taken from the dataset of the feature extraction paper (detected entities are highlighted):
  1. Copyright (C) 2017 DENX Software Engineering
  2. Copyright (C) IBM Corporation 2016
  3. Copyright (c) 2000-2005 Vojtech Pavlik <vojtech@suse.cz>
  4. Copyright (c) 2009, Microsoft Corporation.
  5. Copyright (C) ST-Ericsson 2010 - 2013 (Entity missed)
  6. Copyright (c) 2012 Steffen Trumtrar <s.trumtrar@pengutronix.de>, Pengutronix
  7. Copyright 2008 GE Intelligent Platforms Embedded Systems, Inc.
- The model detected the majority of entities, missing fewer than 5%.
- Adopted semi-supervised training: used the preceding model to label the entire dataset, then trained a new model on those labels. This refined model, now in use, missed under 1% of the copyright holder entities in the same test set.
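
The semi-supervised step in the last bullet can be pictured as one round of self-training. The sketch below is a minimal outline under assumptions, not the training code used here: predict stands for the previously fine-tuned model's tagging function and train for whatever fine-tuning routine is used; both are hypothetical placeholders passed in as callables, as are the names pseudo_label and self_train.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
Tags = List[str]
Example = Tuple[Tokens, Tags]


def pseudo_label(sentences: List[Tokens],
                 predict: Callable[[Tokens], Tags]) -> List[Example]:
    """Tag every sentence with the existing model's predictions."""
    labeled = []
    for tokens in sentences:
        tags = predict(tokens)
        if len(tags) != len(tokens):
            raise ValueError("predict must return one tag per token")
        labeled.append((tokens, tags))
    return labeled


def self_train(sentences: List[Tokens],
               predict: Callable[[Tokens], Tags],
               train: Callable[[List[Example]], object]):
    """One round: label the full dataset with the old model, then train on it."""
    return train(pseudo_label(sentences, predict))
```

Whether the hand-labeled examples are kept alongside the model-generated labels is a separate design choice that this outline leaves open.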

Next step: start the decluttering procedure, which will be similar to the copyright holder entity detection process.