
WEEK 5

(June 27, 2024)

Attendees:

Discussion:

  • Reviewed use cases to evaluate the current model's preprocessing output and discussed the findings. The corner-case checks can be found here, and my findings are documented in this Word file.

  • I researched various NER taggers and began building a dedicated library for NER-POS tagging, starting with Stanford's NER tagger. In parallel, I investigated the following toolkits:

    • SpaCy
    • NLTK
    • Flair
    • Stanford’s CoreNLP
    • AllenNLP
    • Apache OpenNLP

I tested Stanford’s CoreNLP on sample text taken from the internet to evaluate its model. My experimentation code can be found here.
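As a rough illustration of what the NER-POS library has to do, the sketch below merges the output of a POS tagger and an NER tagger into one sequence. It assumes both taggers return a list of (token, tag) pairs over the same tokenization, as NLTK-style taggers do. The function name `merge_ner_pos` is hypothetical and is not taken from my experimentation code.

```python
def merge_ner_pos(pos_tagged, ner_tagged):
    """Combine per-token POS and NER tags into (token, pos, ner) triples.

    Both inputs are assumed to be lists of (token, tag) pairs produced
    from the same tokenization.
    """
    if len(pos_tagged) != len(ner_tagged):
        raise ValueError("POS and NER sequences differ in length")

    merged = []
    for (pos_token, pos), (ner_token, ner) in zip(pos_tagged, ner_tagged):
        # Guard against the two taggers having tokenized the text differently.
        if pos_token != ner_token:
            raise ValueError(f"token mismatch: {pos_token!r} vs {ner_token!r}")
        merged.append((pos_token, pos, ner))
    return merged


# Hand-written tagger outputs, for illustration only.
pos_tagged = [("Copyright", "NN"), ("Siemens", "NNP"), ("AG", "NNP")]
ner_tagged = [("Copyright", "O"), ("Siemens", "ORGANIZATION"), ("AG", "ORGANIZATION")]
triples = merge_ner_pos(pos_tagged, ner_tagged)
```

Checking the tokens pair by pair keeps a tokenization mismatch between the two taggers from silently shifting the tags.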

  • I wrote a Python script using Psycopg to connect to the Fossology database and retrieve copyright contents. During this process, I encountered and addressed several issues. You can find the script here.
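
The retrieval step can be sketched roughly as follows. This is a minimal illustration, not the script linked above: the table name `copyright`, the columns `content` and `type`, the filter value `statement`, and the connection defaults are all assumptions that should be checked against the actual FOSSology schema and installation.

```python
from contextlib import closing

# Assumed table and column names; verify against the actual FOSSology schema.
QUERY = (
    "SELECT content FROM copyright "
    "WHERE type = %s AND content IS NOT NULL "
    "LIMIT %s"
)


def connect(dbname="fossology", user="fossy", password="fossy",
            host="localhost", port=5432):
    """Open a PostgreSQL connection. The defaults are assumptions."""
    # Imported lazily so fetch_copyright_contents can be exercised
    # without psycopg2 installed.
    import psycopg2
    return psycopg2.connect(dbname=dbname, user=user, password=password,
                            host=host, port=port)


def fetch_copyright_contents(conn, kind="statement", limit=100):
    """Return non-empty copyright strings of the given kind, stripped."""
    with closing(conn.cursor()) as cur:
        # Values are passed as parameters rather than formatted into the SQL.
        cur.execute(QUERY, (kind, limit))
        rows = cur.fetchall()
    return [row[0].strip() for row in rows if row[0] and row[0].strip()]
```

A typical call would be `fetch_copyright_contents(connect(), limit=50)`. Because the function takes the connection as an argument, it can be tested with a stub connection instead of a live database.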

Subsequent Steps

  • The mentors emphasized the critical importance of completing the framework before progressing further. Therefore, my plan for the next week is to focus on getting the framework ready.
  • For the NER tagging task, I plan to fine-tune BERT to perform both NER and POS tagging. Multitask models built on fine-tuned language models often perform better on related downstream tasks, since the tasks share a common representation.
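
One preparatory step for this plan is aligning word-level labels with BERT's subword tokens, once for each task. The sketch below assumes a `word_ids` list of the kind returned by a Hugging Face fast tokenizer (`None` for special tokens, otherwise the index of the source word). It labels only the first subword of each word and masks the rest with -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The function name, label maps, and example data are hypothetical.

```python
IGNORE_INDEX = -100  # skipped by PyTorch's CrossEntropyLoss by default


def align_labels(word_ids, word_labels, label_to_id):
    """Map word-level labels onto subword positions.

    Special tokens and non-initial subwords receive IGNORE_INDEX so that
    each word contributes to the loss exactly once.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(label_to_id[word_labels[word_id]])
        previous = word_id
    return aligned


# Illustrative data: three words, the second split into three subwords.
word_ids = [None, 0, 1, 1, 1, 2, None]
ner_labels = ["O", "B-ORG", "O"]
pos_labels = ["NN", "NNP", "NNS"]
ner_map = {"O": 0, "B-ORG": 1}
pos_map = {"NN": 0, "NNP": 1, "NNS": 2}

# The same alignment is reused for both tasks, giving one target per head.
ner_targets = align_labels(word_ids, ner_labels, ner_map)
pos_targets = align_labels(word_ids, pos_labels, pos_map)
```

In a multitask setup, the two target sequences would feed two classification heads on top of a shared encoder, with the per-task losses summed.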