
Week 13

(August 23, 2023)

Attendees:

Updates:

1. Exploring Potential NER Models:

  • Explored various NER models suitable for integration into Fossology, focusing on those under a size budget of roughly 40 MB.
  • Narrowed the selection down based on this size constraint (a screening sketch follows below).
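As a rough illustration of the size screening, the sketch below filters token-classification models on the Hugging Face Hub by the total size of their weight files. Only the 40 MB budget comes from the notes; the Hub query itself, and treating `.bin`/`.safetensors` files as the weights, are assumptions about how the screening could be done.

```python
from huggingface_hub import HfApi

MAX_BYTES = 40 * 1024 * 1024  # ~40 MB budget from the notes

api = HfApi()
candidates = []
for model in api.list_models(task="token-classification", limit=200):
    info = api.model_info(model.id, files_metadata=True)
    # Sum only the weight files; sibling sizes can be None for some repos.
    total = sum(
        s.size or 0
        for s in info.siblings
        if s.rfilename.endswith((".bin", ".safetensors"))
    )
    if 0 < total <= MAX_BYTES:
        candidates.append((model.id, total))

for model_id, size in sorted(candidates, key=lambda c: c[1]):
    print(f"{model_id}: {size / 1e6:.1f} MB")
```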

2. License Compatibility & Model Selection:

  • Many candidate models were excluded because their licenses are incompatible with Fossology's GNU General Public License v2.0.
  • Identified BERT variants, specifically "tiny BERT" (around 18 MB) and "Mobile BERT," as feasible options.
  • Discovered a "tiny BERT" model pretrained on the CoNLL-2003 dataset; however, the model had no associated license.

3. Testing the Tiny BERT Model:

  • Tested the model provisionally, on the assumption that if it performed well, a similar one could be trained from scratch.
  • The model's primary entity types of interest were organizations and persons.
  • Sample copyright statements used to gauge performance:
    • copyright (c) date, date hewlett - packard development company, l. p
    • copyright (c) date - date, date siemens ag
    • copyright (c) date siemens ag author: daniele fognini
    • copyright (c) date siemens ag author: j. najjar
    • copyright (c) date, date siemens ag author: daniele fognini, anupam. ghosh@siemens.com
  • The tiny BERT model's perceived performance seemed superior to the spaCy model, though the improved entity visualization may have influenced that impression (a usage sketch follows below).
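
For reference, a minimal sketch of exercising such a model with the `transformers` token-classification pipeline; the model ID is a placeholder for the unlicensed CoNLL-2003 checkpoint, and the comment describes the expected tagging, not a recorded result.

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="path/to/tinybert-conll2003",  # placeholder for the unlicensed checkpoint
    aggregation_strategy="simple",       # merge word pieces into whole entities
)

text = "copyright (c) date siemens ag author: daniele fognini"
for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
# A CoNLL-2003 model should tag "siemens ag" as ORG and
# "daniele fognini" as PER on an example like this.
```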

4. Integration and Preprocessing Considerations:

  • Considered how best to integrate the model into the preprocessing function.
  • Experimented with various entity replacement strategies (a sketch follows this list):
    • Replacing person entities with a PERSON token gave a minor performance boost.
    • Substituting organization entities with an ORG token slightly degraded performance.
    • Applying both replacements together was still worse than the baseline approach.
  • These results suggest that as NER quality improves, the model will rely more on contextual cues than on memorized copyright holder names and organizations.
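
A minimal sketch of the replacement strategies, assuming `ner` is the pipeline from the previous sketch; the function name and flags are illustrative, not the actual preprocessing code.

```python
def replace_entities(text, ner, replace_person=True, replace_org=False):
    """Swap NER spans for fixed tokens before classification."""
    # Process spans right-to-left so earlier character offsets stay valid.
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] == "PER" and replace_person:
            text = text[:ent["start"]] + "PERSON" + text[ent["end"]:]
        elif ent["entity_group"] == "ORG" and replace_org:
            text = text[:ent["start"]] + "ORG" + text[ent["end"]:]
    return text

print(replace_entities(
    "copyright (c) date siemens ag author: daniele fognini", ner))
```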

5. Language Detection Model:

  • Identified a promising language detection model developed by Facebook (a usage sketch follows below).
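
If the model in question is fastText's lid.176 language-identification checkpoint (an assumption; the notes only say it was developed by Facebook), usage is essentially a one-liner:

```python
import fasttext

model = fasttext.load_model("lid.176.ftz")  # compressed variant, ~1 MB
labels, scores = model.predict("copyright (c) 2023 siemens ag", k=1)
print(labels[0], scores[0])  # e.g. "__label__en" plus a confidence score
```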

Conclusion and Further Plans:

1. Training a Custom Tiny BERT Model

  • Initiate training of a custom 'tiny BERT' model from scratch to address the licensing concerns with existing pre-trained models (a sketch follows below).
  • Explore modern NER datasets and train the model on them for better performance.
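
A sketch of what "from scratch" could look like with `transformers`: instantiate a randomly initialized BERT at the ~4M-parameter "bert-tiny" dimensions, so no upstream checkpoint license attaches to the weights, and fit it on CoNLL-2003. The dataset and tokenizer choices here are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, BertConfig, BertForTokenClassification

dataset = load_dataset("conll2003")
label_names = dataset["train"].features["ner_tags"].feature.names

# Randomly initialized model at "bert-tiny" dimensions (~4M parameters).
config = BertConfig(
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=512,
    num_labels=len(label_names),
)
model = BertForTokenClassification(config)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# ...then tokenize with word-to-token label alignment and train with
# transformers.Trainer, following the standard token-classification recipe.
```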

2. Domain-Specific Dataset Training

  • Investigate the feasibility of creating a domain-specific dataset for the project by labeling a subset of the current copyrights dataset (an example record is sketched below).
  • Fine-tune or train the model on this specialized dataset to improve its relevance and accuracy for copyright extraction.
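
For the labeling step, a single record in such a dataset might look like the following; the BIO scheme and field names mirror CoNLL-2003 conventions and are assumptions, not a settled format.

```python
from datasets import Dataset

# Hypothetical labeled example drawn from the copyrights data,
# tagged with CoNLL-style BIO labels for ORG and PER spans.
example = {
    "tokens": ["copyright", "(c)", "2014", "siemens", "ag",
               "author:", "daniele", "fognini"],
    "ner_tags": ["O", "O", "O", "B-ORG", "I-ORG",
                 "O", "B-PER", "I-PER"],
}

ds = Dataset.from_list([example])  # ready for tokenization and fine-tuning
```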