Week 14
(August 30, 2023)
Attendees:
Updates:
1. Revisiting SpaCy NER:
- Opted to retest the SpaCy NER for several reasons:
- Earlier attempts lacked proper visualization, making it hard to assess performance on my dataset.
- Training a SpaCy model is simplified with well-documented commands:
- Dataset Labeling: This is a time-intensive step. I utilized visual annotation tools like doccano.
- Data Transformation: Converting datasets into a SpaCy-compatible format is straightforward (see the sketch after this list).
- Encountered difficulties while writing the training code for the tiny BERT model.
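Below is a minimal sketch of the data-transformation step (spaCy v3); the example sentence, entity spans, and file paths are illustrative placeholders, not drawn from the actual dataset:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

# Each record: raw text plus (start_char, end_char, label) entity spans.
examples = [("Copyright 2023 Acme Corp", [(15, 24, "ORG")])]

for text, spans in examples:
    doc = nlp.make_doc(text)
    ents = [doc.char_span(start, end, label=label) for start, end, label in spans]
    doc.ents = [e for e in ents if e is not None]  # skip spans that miss token boundaries
    doc_bin.add(doc)

doc_bin.to_disk("./train.spacy")

# Training then comes down to two documented CLI commands:
#   python -m spacy init config config.cfg --lang en --pipeline ner
#   python -m spacy train config.cfg --output ./output \
#       --paths.train ./train.spacy --paths.dev ./dev.spacy
```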
2. Insights on SpaCy's NER Model:
- SpaCy's NER model is trained on the OntoNotes 5 dataset. Released in late 2013, it features 18 entity types, in contrast to the four (PER, ORG, LOC, MISC) in the conll2003 dataset.
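The label inventory is easy to verify from the pretrained pipeline itself; a quick check, assuming en_core_web_sm is installed:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # pretrained on OntoNotes 5
print(nlp.get_pipe("ner").labels)
# 18 types (PERSON, ORG, GPE, DATE, MONEY, WORK_OF_ART, ...),
# versus only PER, ORG, LOC, MISC in conll2003.
```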
3. SpaCy vs. Tiny BERT:
- For a fair comparison, I trained the SpaCy model from scratch on the conll2003 dataset:
- Tiny BERT achieved an F1 score of 0.8177, while SpaCy reached 0.8182 — nearly identical performance.
- NER entity visualization in SpaCy is straightforward via the displacy module (see the sketch after this list).
- Chose SpaCy due to its ease of use, training, and visualization, and its smaller model size compared to tiny BERT.
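A minimal visualization sketch; the input text is illustrative:

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # or the model trained on conll2003
doc = nlp("Copyright 2023 Acme Corp. All rights reserved.")

# style="ent" highlights named entities; serve() starts a local web server.
# In a notebook, displacy.render(doc, style="ent", jupyter=True) works instead.
displacy.serve(doc, style="ent")
```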
4. Refining Entity Recognition:
- Realized that distinguishing between PER and ORG entities is non-essential: my primary goal is identifying copyright-holder entities, so I decided to merge the two labels for future training.
5. Labeling and Fine-tuning:
- Labeled 750 examples from my dataset using doccano.
- Fine-tuned the SpaCy model trained on conll2003 with this data (see the conversion sketch after this list).
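A minimal sketch of turning a doccano export into spaCy fine-tuning data, assuming doccano's usual sequence-labeling JSONL shape, i.e. {"text": ..., "label": [[start, end, "LABEL"], ...]}; the file names are placeholders:

```python
import json

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
doc_bin = DocBin()

with open("doccano_export.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        doc = nlp.make_doc(record["text"])
        spans = [
            doc.char_span(start, end, label=label, alignment_mode="contract")
            for start, end, label in record["label"]
        ]
        doc.ents = [s for s in spans if s is not None]
        doc_bin.add(doc)

doc_bin.to_disk("./finetune.spacy")

# Fine-tuning can then start from the conll2003-trained pipeline by sourcing
# its NER component in the training config, e.g.:
#   [components.ner]
#   source = "./output/model-best"   # path placeholder
```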
6. Process Optimization:
- Continually working to enhance the process. Will present NER-labeled sentences in the coming update.
Conclusion and Future Plans:
1. Enhancing the NER Labeling and Training:
- Merge the PER and ORG entities from the conll2003 dataset during training and ignore the other entities, as they're not relevant to my goals (see the sketch below).
- Increase the labeled samples from the copyrights dataset to generate a more extensive dataset for training and refinement.
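A minimal sketch of the planned label remapping over CoNLL-style IOB tags; the merged label name HOLDER is my own placeholder:

```python
def merge_labels(iob_tags):
    """Map PER and ORG to a single HOLDER label; drop all other entity types."""
    merged = []
    for tag in iob_tags:
        if tag.endswith(("-PER", "-ORG")):
            prefix = tag.split("-")[0]        # keep the B-/I- prefix
            merged.append(f"{prefix}-HOLDER")
        else:
            merged.append("O")                # LOC, MISC, etc. are ignored
    return merged

print(merge_labels(["B-PER", "I-PER", "O", "B-ORG", "B-LOC"]))
# ['B-HOLDER', 'I-HOLDER', 'O', 'B-HOLDER', 'O']
```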