Week 14
(August 30, 2023)
Attendees:
Updates:
1. Revisiting SpaCy NER:
- Opted to retest the SpaCy NER for several reasons:
- Earlier attempts lacked proper visualization, making it hard to assess performance on my dataset.
- Training a SpaCy model is simplified with well-documented commands:
- Dataset Labeling: This is a time-intensive step. I utilized visual annotation tools like doccano.
- Data Transformation: Converting datasets into a SpaCy-compatible format is straightforward (see the sketch after this list).
- Encountered difficulties while writing the training code for the tiny BERT model.
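Below is a minimal sketch of the data-transformation step (spaCy v3); the example sentence, entity spans, and file paths are illustrative placeholders, not drawn from the actual dataset:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# Each record: raw text plus (start_char, end_char, label) entity spans.
examples = [("Copyright 2023 Acme Corp", [(15, 24, "ORG")])]

for text, spans in examples:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
    doc.ents = [e for e in ents if e is not None]  # skip spans that miss token boundaries
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")

# Training then comes down to two documented CLI commands:
#   python -m spacy init config config.cfg --lang en --pipeline ner
#   python -m spacy train config.cfg --output ./output \
#       --paths.train ./train.spacy --paths.dev ./dev.spacy
```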
2. Insights on SpaCy's NER Model:
- SpaCy's NER model is trained on the OntoNotes 5 dataset. Released in late 2013, it features 18 entity types, in contrast to the four (PER, ORG, LOC, MISC) in the conll2003 dataset.
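The label inventory is easy to verify from the pretrained pipeline itself; a quick check, assuming en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained on OntoNotes 5
print(nlp.get_pipe("ner").labels)
# 18 types (PERSON, ORG, GPE, DATE, MONEY, WORK_OF_ART, ...),
# versus only PER, ORG, LOC, MISC in conll2003.
```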
3. SpaCy vs. Tiny BERT:
- For a fair comparison, I trained the SpaCy model from scratch on the conll2003 dataset:
- Tiny BERT achieved an F1 score of 0.8177, while SpaCy reached 0.8182 — nearly identical performance.
- NER entity visualization in SpaCy is straightforward via the displacy module (see the sketch after this list).
- Chose SpaCy due to its ease of use, training, and visualization, and its smaller model size compared to tiny BERT.
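A minimal visualization sketch; the input text is illustrative:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # or the model trained on conll2003
doc = nlp("Copyright 2023 Acme Corp. All rights reserved.")

# style="ent" highlights named entities; serve() starts a local web server.
# In a notebook, displacy.render(doc, style="ent", jupyter=True) works instead.
displacy.serve(doc, style="ent")
```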
4. Refining Entity Recognition:
- Realized that distinguishing between PER and ORG entities is non-essential: my primary goal is identifying copyright-holder entities, so I decided to merge the two labels for future training.
5. Labeling and Fine-tuning:
- Labeled 750 examples from my dataset using doccano.
- Fine-tuned the SpaCy model trained on conll2003 with this data (see the conversion sketch after this list).
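A minimal sketch of turning a doccano export into spaCy fine-tuning data, assuming doccano's usual sequence-labeling JSONL shape, i.e. {"text": ..., "label": [[start, end, "LABEL"], ...]}; the file names are placeholders:

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

with open("doccano_export.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = [
            doc.char_span(start, end, label=label, alignment_mode="contract")
            for start, end, label in record["label"]
        ]
        doc.ents = [s for s in spans if s is not None]
        doc_bin.add(doc)

doc_bin.to_disk("./finetune.spacy")

# Fine-tuning can then start from the conll2003-trained pipeline by sourcing
# its NER component in the training config, e.g.:
#   [components.ner]
#   source = "./output/model-best"   # path placeholder
```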
6. Process Optimization:
- Continually working to enhance the process. Will present NER-labeled sentences in the coming update.
Conclusion and Future Plans:
1. Enhancing the NER Labeling and Training:
- Merge the PER and ORG entities from the conll2003 dataset during training and ignore the other entities, as they're not relevant to my goals (see the sketch below).
- Increase the labeled samples from the copyrights dataset to generate a more extensive dataset for training and refinement.
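A minimal sketch of the planned label remapping over CoNLL-style IOB tags; the merged label name HOLDER is my own placeholder:

```python
def merge_labels(iob_tags):
    """Map PER and ORG to a single HOLDER label; drop all other entity types."""
    merged = []
    for tag in iob_tags:
        if tag.endswith(("-PER", "-ORG")):
            prefix = tag.split("-")[0]        # keep the B-/I- prefix
            merged.append(f"{prefix}-HOLDER")
        else:
            merged.append("O")                # LOC, MISC, etc. are ignored
    return merged

print(merge_labels(["B-PER", "I-PER", "O", "B-ORG", "B-LOC"]))
# ['B-HOLDER', 'I-HOLDER', 'O', 'B-HOLDER', 'O']
```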