Skip to main content

Week 16

(September,13,2023)

Attendees:

Updates:

1. Dataset Cleanup:

  • Initiated code to clean the conll2003 dataset as mentioned in week 14:
    • Merged PER and ORG entities.
    • Discarded LOC and MISC entities since they are not pertinent to my requirements.

2. Fine-tuning and Testing:

  • Conducted another round of fine-tuning using 750 examples from my dataset and assessed the NER model's performance within my preprocessing function.
    • Noticed a slight dip in performance due to obfuscation of repetitive copyright holder names in the dataset.
  • Labeled an additional 750 examples, totaling slightly over 1500, and fine-tuned the primary model with this data.
    • The model, while proficient, occasionally mislabeled non-copyright sentences as ENT (the copyright holder entity), potentially increasing false positives.
    • Below are some detection results using the dataset from the feature extraction paper to test on unseen examples (detected entities are highlighted):
      1. Copyright (C) 2017 DENX Software Engineering
      2. Copyright (C) IBM Corporation 2016
      3. Copyright (c) 2000-2005 Vojtech Pavlik <vojtech@suse.cz>
      4. Copyright (c) 2009, Microsoft Corporation.
      5. Copyright (C) ST-Ericsson 2010 - 2013 (Entity missed)
      6. Copyright (c) 2012 Steffen Trumtrar <s.trumtrar@pengutronix.de>, Pengutronix
      7. Copyright 2008 GE Intelligent Platforms Embedded Systems, Inc.
    • The model detected the majority of entities, missing less than 5%.
    • Adopted semi-supervised training by using the preceding model to label the entire dataset and trained on it. This refined model, now in use, missed under 1% of the copyright holder entities in the same test set.

Conclusion and Future Plans:

1. Fossology Integration:

  • Aim to integrate the false positive copyright detection code into Fossology.

2. Decluttering Process:

  • Initiate the decluttering procedure, which will bear similarities to the copyright holder entity detection process.