WEEK 2
(June 6, 2024)
Attendees:
Discussion:
- I presented a detailed plan via PowerPoint outlining my approach to the project. This included an explanation of Safaa's current functionality and its key features, along with identified areas for enhancing pipeline efficiency.
- I initiated a discussion comparing TF-IDF with BERT/GPT embeddings. Because our dataset is relatively small (~20k), I proposed transitioning to a transformer model only if we scale up the dataset.
- Additionally, I recommended replacing the current SVM model with either a transformer model or an LSTM for improved performance.
- I also suggested initiating parallel work on developing a Python library for the NER-POS tagging task.
- To help me structure the pipeline steps, Kaushal presented his envisioned pipeline for the project. His insights helped me identify and understand critical aspects of the pipeline's development and implementation.
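To make the TF-IDF side of the comparison above concrete, here is a minimal sketch of how TF-IDF weights are computed. It is an illustration only, not Safaa's implementation: the function name `tfidf`, the whitespace tokenizer, the smoothed IDF formula, and the dummy statements are all assumptions, and in practice a library vectorizer would likely be used instead.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term frequency and the smoothed IDF
    log((1 + N) / (1 + df)) + 1, without normalization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


# Hypothetical dummy statements, not taken from the real dataset.
samples = [
    "copyright 2024 example corp",
    "copyright 2019 jane doe",
    "this file is free software",
]
vectors = tfidf(samples)
```

Here a term that appears in several statements ("copyright") receives a lower weight than a rarer one ("corp"). Each term is also an independent dimension, so related words such as "corp" and "corporation" share nothing. That limitation is what embedding models are meant to address.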
Engagements:
Further discussions covered aspects of model development, validation, testing, monitoring, and ongoing maintenance. I participated in the general meeting, providing comprehensive updates on project progress.
Subsequent Steps:
- I was assigned the task of writing a script to connect to the Fossology database and retrieve copyright contents. I was instructed to first test the script on my locally hosted Fossology instance using dummy data. Once the script is validated, I will use it to fetch actual data.
- Additionally, I was tasked with creating an MVP to substantiate my proposal for using embedding models and transformers instead of TF-IDF and SVM. This prototype aims to verify their effectiveness and potential for significant improvements.
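The retrieval script from the first step could start from a sketch like the one below. It is a sketch under stated assumptions, not the final script: the table name `copyright`, the column name `content`, and the helper names `fetch_copyright_contents` and `make_dummy_db` are hypothetical and should be checked against the actual Fossology schema. The function accepts any Python DB-API connection, so an in-memory SQLite database stands in for the dummy-data test here. Against a real instance, a PostgreSQL connection would be passed instead.

```python
import sqlite3


def fetch_copyright_contents(conn, limit=None):
    """Return distinct non-empty copyright strings from a DB-API connection.

    Assumes a table named `copyright` with a text column `content`;
    verify both against the Fossology schema before use.
    """
    query = (
        "SELECT DISTINCT content FROM copyright "
        "WHERE content IS NOT NULL AND content <> '' "
        "ORDER BY content"
    )
    if limit is not None:
        # int() guards against anything but a plain number being appended.
        query += f" LIMIT {int(limit)}"
    cur = conn.cursor()
    try:
        cur.execute(query)
        return [row[0] for row in cur.fetchall()]
    finally:
        cur.close()


def make_dummy_db():
    """Build an in-memory stand-in database with dummy rows (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE copyright (content TEXT)")
    conn.executemany(
        "INSERT INTO copyright (content) VALUES (?)",
        [
            ("Copyright 2024 Example Corp",),
            ("Copyright 2019 Jane Doe",),
            ("Copyright 2024 Example Corp",),  # duplicate, should be collapsed
            ("",),                             # empty, should be skipped
            (None,),                           # NULL, should be skipped
        ],
    )
    conn.commit()
    return conn


dummy = make_dummy_db()
contents = fetch_copyright_contents(dummy)
dummy.close()
```

Keeping the query logic separate from the connection setup means the same function can be exercised on dummy data first and then pointed at the real database without changes.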