Week 13
(August 23, 2023)
Attendees:
Updates:
1. Exploring Potential NER Models:
- Explored various NER models suitable for integration into Fossology, focusing on those no larger than about 40 megabytes.
- Narrowed down the candidates based on this size constraint.
2. License Compatibility & Model Selection:
- Many potential models were excluded because their licenses are incompatible with Fossology's GNU General Public License v2.0.
- Identified BERT variants, specifically "tiny BERT" (around 18 megabytes) and "Mobile BERT", as feasible options.
- Discovered a tiny BERT model pretrained on the conll2003 dataset; however, the model had no associated license.
3. Testing the Tiny BERT Model:
- Tested the model provisionally, assuming that if it performed well, I could train a similar one from scratch.
- The model's primary classification targets were organizations and persons.
- Sample performance indicators:
  - copyright (c) date, date hewlett-packard development company, l.p
  - copyright (c) date - date, date siemens ag
  - copyright (c) date siemens ag author: daniele fognini
  - copyright (c) date siemens ag author: j. najjar
  - copyright (c) date, date siemens ag author: daniele fognini, anupam.ghosh@siemens.com
  - copyright (c) date,
- The tiny BERT model's perceived performance seemed superior to the SpaCy model's, though the enhanced entity visualization may have influenced this impression (a rough test-harness sketch follows below).
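For reference, a minimal sketch of how such a test could look with the Hugging Face transformers pipeline. The checkpoint name below is a placeholder, since the exact tiny BERT model tested here is not named in these notes:

```python
from transformers import pipeline

# Placeholder checkpoint: the actual tiny BERT NER model tested in this
# meeting is not recorded in these notes.
MODEL_ID = "example/bert-tiny-conll2003-ner"

# Token-classification pipeline; "simple" aggregation merges word pieces
# back into whole entity spans before reporting them.
ner = pipeline("token-classification", model=MODEL_ID,
               aggregation_strategy="simple")

sample = "copyright (c) date siemens ag author: daniele fognini"
for entity in ner(sample):
    print(entity["entity_group"], entity["word"],
          round(float(entity["score"]), 3))
# A well-performing model should tag "siemens ag" as ORG and
# "daniele fognini" as PER.
```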
4. Integration and Preprocessing Considerations:
- Considered how best to integrate the model into my preprocessing function.
- Experimented with various entity replacement strategies (see the sketch after this list):
  - Replacing person entities with PERSON offered a minor performance boost.
  - Substituting organization entities with ORG slightly degraded performance.
  - Employing both replacements was still suboptimal compared to the initial approach.
- These results suggest that as NER performance improves, the model will rely more on contextual cues than on mere memorization of copyright holder names and organizations.
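As a sketch of the replacement strategies above, assuming the `ner` pipeline from the previous snippet (the helper below is hypothetical, not the actual preprocessing function):

```python
# Maps NER tags to placeholder strings; drop the "ORG" entry to get the
# person-only variant that performed best in these experiments.
LABELS = {"PER": "PERSON", "ORG": "ORG"}

def replace_entities(text, ner, labels=LABELS):
    """Replace recognized entity spans with placeholder tokens."""
    # Substitute from the end of the string so that earlier character
    # offsets stay valid after each replacement.
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        placeholder = labels.get(ent["entity_group"])
        if placeholder is not None:
            text = text[:ent["start"]] + placeholder + text[ent["end"]:]
    return text

# replace_entities("copyright (c) date siemens ag author: daniele fognini", ner)
# -> "copyright (c) date ORG author: PERSON"
```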
5. Language Detection Model:
- Identified a promising language detection model developed by Facebook (see the sketch below).
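Assuming the model in question is fastText's lid.176 language-identification model published by Facebook Research (my reading; the notes do not name it explicitly), usage would be roughly:

```python
import fasttext

# Assumption: "the language detection model developed by Facebook" refers
# to fastText's lid.176.bin, available from
# https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

# predict() returns labels of the form "__label__<ISO code>" with scores.
labels, scores = model.predict("copyright (c) 2014 siemens ag", k=1)
print(labels[0], float(scores[0]))  # e.g. __label__en 0.9...
```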
Conclusion and further plans:
1. Training a Custom Tiny BERT Model
- Initiate training of a custom tiny BERT model from scratch to address the licensing concerns with existing pre-trained models (a condensed training sketch follows these notes).
- Explore more modern NER datasets and train the model on them for better performance.
2. Domain-Specific Dataset Training
- Investigate the feasibility of creating a domain-specific dataset for our project.
- This would involve labeling a subset of the current copyrights dataset.
- Fine-tune or train the model on this specialized dataset to enhance its relevance and accuracy for our application.
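Finally, a condensed sketch of what plan 1 could look like with the Hugging Face Trainer on conll2003. The base checkpoint prajjwal1/bert-tiny (roughly 17 MB) is an assumption, just one publicly available tiny BERT; training truly from scratch would replace from_pretrained with a fresh BertConfig:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

# Assumed base checkpoint; any similarly sized BERT works the same way.
BASE = "prajjwal1/bert-tiny"

dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names
tokenizer = AutoTokenizer.from_pretrained(BASE)

def tokenize_and_align(batch):
    """Tokenize pre-split words and align NER tags with sub-tokens."""
    tokenized = tokenizer(batch["tokens"], truncation=True,
                          is_split_into_words=True)
    all_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous, labels = None, []
        for wid in tokenized.word_ids(batch_index=i):
            if wid is None:
                labels.append(-100)       # special tokens: ignored by the loss
            elif wid != previous:
                labels.append(tags[wid])  # label only a word's first sub-token
            else:
                labels.append(-100)
            previous = wid
        all_labels.append(labels)
    tokenized["labels"] = all_labels
    return tokenized

tokenized = dataset.map(tokenize_and_align, batched=True,
                        remove_columns=dataset["train"].column_names)

model = AutoModelForTokenClassification.from_pretrained(
    BASE,
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={name: i for i, name in enumerate(label_names)},
)

trainer = Trainer(
    model=model,
    args=TrainingArguments("tiny-bert-conll2003-ner",
                           learning_rate=5e-5,
                           num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```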