Week 16
(September 13, 2023)
Attendees:
Updates:
1. Dataset Cleanup:
- Initiated code to clean the conll2003 dataset as mentioned in week 14:
  - Merged PER and ORG entities.
  - Discarded LOC and MISC entities since they are not pertinent to my requirements.
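The cleanup step can be sketched roughly as follows. This is a minimal illustration, not the actual cleanup code: it assumes the tags are IOB-style strings such as B-PER and I-ORG, and it assumes the merged label is named ENT (the copyright holder entity mentioned in these notes). The names MERGED_LABELS, TARGET_LABEL, remap_tag, and remap_sentence are hypothetical.

```python
from typing import List

# Assumption: PER and ORG are merged into a single label called ENT.
MERGED_LABELS = {"PER", "ORG"}
TARGET_LABEL = "ENT"


def remap_tag(tag: str) -> str:
    """Map one IOB tag to the reduced tag set."""
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    if label in MERGED_LABELS:
        # Keep the B-/I- prefix, swap the entity type.
        return f"{prefix}-{TARGET_LABEL}"
    # LOC, MISC, and anything else are discarded, i.e. treated as outside.
    return "O"


def remap_sentence(tags: List[str]) -> List[str]:
    """Apply the remapping to every tag of one sentence."""
    return [remap_tag(tag) for tag in tags]


if __name__ == "__main__":
    print(remap_sentence(["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-MISC"]))
```

Mapping discarded types to O keeps the sentences themselves in the dataset, so the model still sees location and miscellaneous mentions as negative examples.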
2. Fine-tuning and Testing:
- Conducted another round of fine-tuning using 750 examples from my dataset and assessed the NER model's performance within my preprocessing function.
- Noticed a slight dip in performance due to obfuscation of repetitive copyright holder names in the dataset.
- Labeled an additional 750 examples, totaling slightly over 1500, and fine-tuned the primary model with this data.
- The model, while proficient, occasionally mislabeled non-copyright sentences as ENT (the copyright holder entity), potentially increasing false positives.
- Below are some detection results using the dataset from the feature extraction paper to test on unseen examples (detected entities are shown in square brackets):
  - Copyright (C) 2017 [DENX Software Engineering]
  - Copyright (C) [IBM Corporation] 2016
  - Copyright (c) 2000-2005 [Vojtech Pavlik] <vojtech@suse.cz>
  - Copyright (c) 2009, [Microsoft Corporation].
  - Copyright (C) ST-Ericsson 2010 - 2013 (Entity missed)
  - Copyright (c) 2012 [Steffen Trumtrar] <s.trumtrar@pengutronix.de>, Pengutronix
  - Copyright 2008 [GE Intelligent Platforms Embedded Systems], Inc.
  - Copyright (C) 2017
- The model detected the majority of entities, missing fewer than 5% of them.
- Adopted semi-supervised training: used the preceding model to label the entire dataset, then trained on those labels. This refined model, now in use, missed under 1% of the copyright holder entities in the same test set.
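The semi-supervised step described above is a form of pseudo-labeling, which might look roughly like the sketch below. It is only an illustration under stated assumptions: predict stands in for the fine-tuned NER model, toy_predict and KNOWN_HOLDERS are hypothetical placeholders, and the min_confidence filter is an optional extra that the notes do not mention (the default of 0.0 keeps every sentence, matching "label the entire dataset"). Retraining on the returned pairs is left out.

```python
from typing import Callable, List, Tuple

Tokens = List[str]
Tags = List[str]
# A predictor returns per-token tags plus one confidence score per sentence.
Predictor = Callable[[Tokens], Tuple[Tags, float]]


def pseudo_label(
    predict: Predictor,
    sentences: List[Tokens],
    min_confidence: float = 0.0,
) -> List[Tuple[Tokens, Tags]]:
    """Label unlabeled sentences with an existing model's predictions."""
    labeled = []
    for tokens in sentences:
        tags, confidence = predict(tokens)
        # Assumption: optionally drop low-confidence predictions.
        if confidence >= min_confidence:
            labeled.append((tokens, tags))
    return labeled


# Hypothetical stand-in for the fine-tuned model, for demonstration only.
KNOWN_HOLDERS = {"Microsoft", "Corporation"}


def toy_predict(tokens: Tokens) -> Tuple[Tags, float]:
    tags = ["ENT" if token in KNOWN_HOLDERS else "O" for token in tokens]
    confidence = 0.9 if "ENT" in tags else 0.4
    return tags, confidence


if __name__ == "__main__":
    unlabeled = [
        ["Copyright", "2009", "Microsoft", "Corporation"],
        ["All", "rights", "reserved"],
    ]
    for tokens, tags in pseudo_label(toy_predict, unlabeled):
        print(list(zip(tokens, tags)))
```

The pairs returned by pseudo_label would then be used as training data for the next round of fine-tuning.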
Conclusion and Future Plans:
1. Fossology Integration:
- Aim to integrate the false positive copyright detection code into Fossology.
2. Decluttering Process:
- Initiate the decluttering procedure, which will bear similarities to the copyright holder entity detection process.