Week 14

(August 30, 2023)

Attendees:

Updates:

1. Revisiting SpaCy NER:

  • Opted to retest the SpaCy NER for several reasons:
    • Earlier attempts lacked proper visualization, making it hard to assess performance on my dataset.
    • Training a SpaCy model is straightforward, with well-documented commands. The main steps are:
      • Dataset labeling: this is the time-intensive step; I used the visual annotation tool doccano.
      • Data transformation: converting the labeled data into SpaCy's binary training format is simple (see the sketch after this list).
    • I had run into difficulties writing the training code for the tiny BERT model.
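A minimal conversion sketch, assuming doccano's JSONL export format (one object per line with a "text" field and "label" = [start_char, end_char, label] triples); the file names are placeholders:

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # blank English pipeline; only the tokenizer is needed
db = DocBin()

with open("doccano_export.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = []
        for start, end, label in record["label"]:
            # Snap character offsets to token boundaries; misaligned spans return None
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                spans.append(span)
        doc.ents = spans
        db.add(doc)

db.to_disk("train.spacy")  # binary format consumed by `spacy train`
```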

2. Insights on SpaCy's NER Model:

  • SpaCy's pretrained NER model is trained on the OntoNotes 5 dataset. Released in late 2013, it covers 18 entity types, in contrast to the four in the conll2003 dataset (a quick way to inspect the label set is sketched below).
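A quick inspection sketch, assuming a pretrained English pipeline such as en_core_web_sm has been downloaded:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# The NER component reports its OntoNotes label set
# (PERSON, ORG, GPE, DATE, MONEY, ...)
print(nlp.get_pipe("ner").labels)
```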

3. SpaCy vs. Tiny BERT:

  • For a fair comparison, I trained the SpaCy model from scratch on the conll2003 dataset:
    • Tiny BERT achieved an F1 score of 0.8177, while SpaCy reached 0.8182 — nearly identical performance.
    • NER entity visualization is straightforward via SpaCy's displacy module (see the snippet after this list).
    • I chose SpaCy for its ease of use, simple training and visualization, and smaller model size compared to tiny BERT.
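A minimal displacy sketch; the example sentence and model name are placeholders (displacy.render displays inline in notebooks, while displacy.serve starts a local server elsewhere):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # or the model trained on conll2003
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

# Highlights the recognized entity spans in the text
displacy.render(doc, style="ent")
```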

4. Refining Entity Recognition:

  • Realized that distinguishing between PER and ORG entities is unnecessary: my primary goal is identifying copyright-holder entities, which can be either people or organizations. I decided to merge the two labels for future training (a sketch of the remapping follows).
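A sketch of the label remapping, assuming conll2003-style (start_char, end_char, label) span annotations; the merged label name HOLDER is an assumption:

```python
# Collapse PER and ORG into a single label and drop the rest (LOC, MISC)
LABEL_MAP = {"PER": "HOLDER", "ORG": "HOLDER"}

def remap(entities):
    """entities: iterable of (start_char, end_char, label) tuples."""
    return [
        (start, end, LABEL_MAP[label])
        for start, end, label in entities
        if label in LABEL_MAP
    ]
```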

5. Labeling and Fine-tuning:

  • Labeled 750 examples from my dataset using doccano.
  • Fine-tuned the SpaCy model (previously trained on conll2003) on this data; see the training sketch below.
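One way to kick off the training run from Python, as documented for SpaCy v3; the config and .spacy paths are placeholders (a config can be generated with `python -m spacy init config config.cfg --pipeline ner`):

```python
from spacy.cli.train import train

train(
    "config.cfg",
    output_path="output",
    overrides={
        "paths.train": "train.spacy",  # e.g. the 750 doccano-labeled examples
        "paths.dev": "dev.spacy",
    },
)
```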

6. Process Optimization:

  • Continually working to refine the process; I will present NER-labeled sentences in the next update.

Conclusion and Future Plans:

1. Enhancing the NER Labeling and Training:

  • Merge the PER and ORG entity labels from the conll2003 dataset during training, and ignore the other entity types, as they're not relevant to my goals.
  • Increase the number of labeled samples from the copyrights dataset to build a larger dataset for training and refinement.