Week 10
(August 2, 2023)
Attendees:
Updates:
Preprocessing Function Creation
- I devised a preprocessing function to test different text manipulations:
- Convert all text to lowercase.
- Replace (c), (C), and © with COPYRIGHT_SYMBOL.
- Tokenize text using the word_tokenize function from the NLTK library.
- Remove punctuation.
- Exclude stopwords.
- Lemmatize the text.
- Experiment with various combinations of the above steps.
Vectorization Methods
- Results using TF-IDF outperformed those from Bag-of-Words (BoW).
- While the GloVe embeddings led to a 1-2% improvement, they still lagged behind TF-IDF.
- FastText yielded a modest performance boost compared to GloVe but remained suboptimal.
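The gap between BoW and TF-IDF is consistent with how the two schemes weight terms. A toy comparison (the corpus and term choices are illustrative only) shows TF-IDF downweighting a term that appears in every document:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# "file" occurs in every document, "acme" in only one.
corpus = [
    "copyright acme file",
    "open file",
    "read file",
    "write file",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

b, t = bow.vocabulary_, tfidf.vocabulary_
# Raw counts treat both terms equally in the first document ...
print(X_bow[0, b["acme"]], X_bow[0, b["file"]])       # 1 1
# ... while TF-IDF gives the rare, discriminative term more weight.
print(X_tfidf[0, t["acme"]] > X_tfidf[0, t["file"]])  # True
```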
Hyperparameter Tuning
- In addition to manually fine-tuning the parameters, I tried applying GridSearch to the SVM and FastText parameters, but the combinatorial explosion of the parameter space made an exhaustive search infeasible.
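For reference, a deliberately constrained grid keeps the search tractable, since every added parameter value multiplies the number of fits (the toy corpus and grid values below are placeholders, not the real dataset or search space):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; the real run used the copyright dataset.
texts = [
    "copyright 2023 acme corp", "c 1999 widget inc all rights reserved",
    "copyright symbol 2020 foo llc", "copyright 2018 bar ltd",
    "copyright 2001 baz gmbh", "c 2022 qux co",
    "open the file for reading", "parse the configuration file",
    "return the list of tokens", "write results to disk",
    "compute the checksum of the input", "log a warning message",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
# 2 x 3 = 6 candidates x 3 folds = 18 fits; a full grid over every
# preprocessing, vectorizer, and SVM knob multiplies far beyond that.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_)
```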
Confidence Thresholding with predict_proba
- I tested several confidence thresholds (0.999, 0.99, 0.95) and found that 0.99 generally gave the best trade-off.
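The thresholding itself is straightforward: accept a prediction only when the top class probability clears the cutoff, and route everything else to a fallback (e.g. human review). A sketch assuming a scikit-learn SVC with probability=True; the toy data is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder corpus; the real run used the copyright dataset.
texts = [
    "copyright 2023 acme corp", "c 1999 widget inc all rights reserved",
    "copyright symbol 2020 foo llc", "copyright 2018 bar ltd",
    "copyright 2001 baz gmbh", "c 2022 qux co",
    "open the file for reading", "parse the configuration file",
    "return the list of tokens", "write results to disk",
    "compute the checksum of the input", "log a warning message",
]
labels = [1] * 6 + [0] * 6

# probability=True enables Platt-scaled predict_proba on the SVC.
model = make_pipeline(TfidfVectorizer(), SVC(probability=True, random_state=0))
model.fit(texts, labels)

proba = model.predict_proba(texts)
for threshold in (0.999, 0.99, 0.95):
    confident = proba.max(axis=1) >= threshold
    # Predictions below the threshold would be deferred, not forced.
    print(f"{threshold}: {int(confident.sum())}/{len(texts)} confident")
```

Raising the threshold trades recall on class 1 for fewer false positives on class 0, which is exactly the pattern in the numbers below.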
Model Performance Without Threshold
- Number of misclassifications in class 0: 145 out of 16079 (approx. 0.9% misclassified)
- Number of misclassifications in class 1: 81 out of 5691 (approx. 1.42% misclassified)
Performance with 0.999 Threshold
- Number of misclassifications in class 0: 6 out of 16079 (approx. 0.04% misclassified)
- Number of misclassifications in class 1: 4072 out of 5691 (approx. 71.55% misclassified)
Performance with 0.99 Threshold
- Number of misclassifications in class 0: 27 out of 16079 (approx. 0.17% misclassified)
- Number of misclassifications in class 1: 721 out of 5691 (approx. 12.67% misclassified)
Performance with 0.95 Threshold
- Number of misclassifications in class 0: 41 out of 16079 (approx. 0.25% misclassified)
- Number of misclassifications in class 1: 387 out of 5691 (approx. 6.8% misclassified)
Choice of Threshold
- Ultimately, we settled on the 0.99 threshold. By further enhancing model performance, we aim to reduce the error rate to around or below 0.1%, which equates to roughly 1 misclassification per 1000 actual copyrights.
Conclusion and Further Plans:
TF-IDF Performance
- Focus on amplifying the TF-IDF's effectiveness:
- Varying the TF-IDF parameters (e.g., n-gram range, document-frequency cutoffs) may yield further gains.
- The preprocessing function can be further refined and tailored to our copyright classification objective.
RNN Model Exploration
- Intend to assess the performance of an RNN model combined with the improved preprocessing function.
GitHub Repository
- Transition from using gists to a full-fledged GitHub repository for enhanced documentation.
Language Detection
- Work on devising a language detection mechanism to address rows in languages other than English, aiming to further optimize classification.