Week 10
(August 2, 2023)
Attendees:
Updates:
Preprocessing Function Creation
- I devised a preprocessing function to test different text manipulations:
- Convert all text to lowercase.
- Replace (c), (C), and © with COPYRIGHT_SYMBOL.
- Tokenize text using the word_tokenize function from the NLTK library.
- Remove punctuation.
- Exclude stopwords.
- Lemmatize the text.
- Experiment with various combinations of the above steps.
Vectorization Methods
- Results using TF-IDF outperformed those from Bag-of-Words (BoW).
- While the GloVe embeddings led to a 1-2% improvement, they still lagged behind TF-IDF.
- FastText yielded a modest performance boost compared to GloVe but remained suboptimal.
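The gap between BoW and TF-IDF is consistent with how the two schemes weight terms. A toy comparison (the corpus and term choices are illustrative only) shows TF-IDF downweighting a term that appears in every document:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# "file" occurs in every document, "acme" in only one.
corpus = [
    "copyright acme file",
    "open file",
    "read file",
    "write file",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)

b, t = bow.vocabulary_, tfidf.vocabulary_
# Raw counts treat both terms equally in the first document ...
print(X_bow[0, b["acme"]], X_bow[0, b["file"]])       # 1 1
# ... while TF-IDF gives the rare, discriminative term more weight.
print(X_tfidf[0, t["acme"]] > X_tfidf[0, t["file"]])  # True
```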
Hyperparameter Tuning
- In addition to manually fine-tuning the parameters, I tried applying GridSearch to the SVM and FastText parameters, but the combinatorial explosion of the parameter space made an exhaustive search infeasible.
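For reference, a deliberately constrained grid keeps the search tractable, since every added parameter value multiplies the number of fits (the toy corpus and grid values below are placeholders, not the real dataset or search space):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder corpus; the real run used the copyright dataset.
texts = [
    "copyright 2023 acme corp", "c 1999 widget inc all rights reserved",
    "copyright symbol 2020 foo llc", "copyright 2018 bar ltd",
    "copyright 2001 baz gmbh", "c 2022 qux co",
    "open the file for reading", "parse the configuration file",
    "return the list of tokens", "write results to disk",
    "compute the checksum of the input", "log a warning message",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
# 2 x 3 = 6 candidates x 3 folds = 18 fits; a full grid over every
# preprocessing, vectorizer, and SVM knob multiplies far beyond that.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1, 10],
}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(texts, labels)
print(search.best_params_)
```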
Confidence Thresholding with predict_proba
- I tested several confidence thresholds (0.999, 0.99, 0.95) and found that 0.99 generally gave the best trade-off.
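The thresholding itself is straightforward: accept a prediction only when the top class probability clears the cutoff, and route everything else to a fallback (e.g. human review). A sketch assuming a scikit-learn SVC with probability=True; the toy data is illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder corpus; the real run used the copyright dataset.
texts = [
    "copyright 2023 acme corp", "c 1999 widget inc all rights reserved",
    "copyright symbol 2020 foo llc", "copyright 2018 bar ltd",
    "copyright 2001 baz gmbh", "c 2022 qux co",
    "open the file for reading", "parse the configuration file",
    "return the list of tokens", "write results to disk",
    "compute the checksum of the input", "log a warning message",
]
labels = [1] * 6 + [0] * 6

# probability=True enables Platt-scaled predict_proba on the SVC.
model = make_pipeline(TfidfVectorizer(), SVC(probability=True, random_state=0))
model.fit(texts, labels)

proba = model.predict_proba(texts)
for threshold in (0.999, 0.99, 0.95):
    confident = proba.max(axis=1) >= threshold
    # Predictions below the threshold would be deferred, not forced.
    print(f"{threshold}: {int(confident.sum())}/{len(texts)} confident")
```

Raising the threshold trades recall on class 1 for fewer false positives on class 0, which is exactly the pattern in the numbers below.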
Model Performance Without Threshold
- Number of misclassifications in class 0: 145 out of 16079 (approx. 0.9% misclassified)
- Number of misclassifications in class 1: 81 out of 5691 (approx. 1.42% misclassified)
Performance with 0.999 Threshold
- Number of misclassifications in class 0: 6 out of 16079 (approx. 0.04% misclassified)
- Number of misclassifications in class 1: 4072 out of 5691 (approx. 71.55% misclassified)
Performance with 0.99 Threshold
- Number of misclassifications in class 0: 27 out of 16079 (approx. 0.17% misclassified)
- Number of misclassifications in class 1: 721 out of 5691 (approx. 12.67% misclassified)
Performance with 0.95 Threshold
- Number of misclassifications in class 0: 41 out of 16079 (approx. 0.25% misclassified)
- Number of misclassifications in class 1: 387 out of 5691 (approx. 6.8% misclassified)
Choice of Threshold
- Ultimately, we settled on the 0.99 threshold. By further enhancing model performance, we aim to reduce the error rate to around or below 0.1%, which equates to roughly 1 misclassification per 1000 actual copyrights.
Conclusion and Further Plans:
TF-IDF Performance
- Focus on amplifying the TF-IDF's effectiveness:
- Varying the TF-IDF parameters (e.g., n-gram range, document-frequency cutoffs) may yield further gains.
- The preprocessing function can be further refined and tailored to our copyright classification objective.
RNN Model Exploration
- Intend to assess the performance of an RNN model combined with the improved preprocessing function.
GitHub Repository
- Transition from using gists to a full-fledged GitHub repository for enhanced documentation.
Language Detection
- Work on devising a language detection mechanism to address rows in languages other than English, aiming to further optimize classification.