Week 10

(August 2, 2023)

Attendees:

Updates:

Preprocessing Function Creation

  • I devised a preprocessing function to test different text manipulations:
    • Convert all text to lowercase.
    • Replace (c), (C), and © with COPYRIGHT_SYMBOL.
    • Tokenize text using the word_tokenize function from the NLTK library.
    • Remove punctuation.
    • Exclude stopwords.
    • Lemmatize the text.
    • Experiment with various combinations of the above steps.
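
A minimal sketch of such a function, with each step behind a flag so that combinations can be compared. The function name, the flag names, and the injectable `tokenize`, `stopwords` and `lemmatize` arguments are assumptions made for illustration, not the code used in the project. The defaults fall back to whitespace splitting so the sketch runs without NLTK data.

```python
import re
import string

# Matches (c), (C) and the © character.
COPYRIGHT_RE = re.compile(r"\(c\)|©", re.IGNORECASE)
PUNCT = set(string.punctuation)


def preprocess(text, tokenize=str.split, stopwords=frozenset(),
               lemmatize=None, lowercase=True, replace_copyright=True,
               remove_punct=True):
    """Apply the enabled steps in the order listed in the notes."""
    if lowercase:
        text = text.lower()
    if replace_copyright:
        # Pad with spaces so the placeholder becomes its own token.
        text = COPYRIGHT_RE.sub(" COPYRIGHT_SYMBOL ", text)
    tokens = tokenize(text)
    if remove_punct:
        # Drop tokens made up only of punctuation characters.
        tokens = [t for t in tokens if not all(ch in PUNCT for ch in t)]
    if stopwords:
        tokens = [t for t in tokens if t not in stopwords]
    if lemmatize is not None:
        tokens = [lemmatize(t) for t in tokens]
    return tokens


print(preprocess("(C) 2023 The Foo Authors .", stopwords={"the"}))
```

To match the steps above more closely, one could pass `nltk.word_tokenize` as `tokenize`, `set(nltk.corpus.stopwords.words("english"))` as `stopwords`, and `nltk.stem.WordNetLemmatizer().lemmatize` as `lemmatize`; these require the corresponding NLTK data packages to be downloaded first.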

Vectorization Methods

  • Results using TF-IDF outperformed those from Bag-of-Words (BoW).
  • While the GloVe embeddings led to a 1-2% improvement, they still lagged behind TF-IDF.
  • FastText yielded a modest performance boost compared to GloVe but remained suboptimal.
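
To illustrate how the two count-based representations differ, here is a small pure-Python sketch on a made-up corpus. BoW keeps raw counts, while TF-IDF down-weights terms that occur in many documents. The smoothed IDF formula and L2 normalisation used here are one common variant (to my knowledge the same one scikit-learn applies by default); the corpus and function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus, made up for illustration only.
docs = [
    "copyright 2023 foo authors",
    "copyright notice for bar",
    "this file is part of foo",
]
tokenized = [d.split() for d in docs]


def bow(tokens):
    """Bag-of-Words: raw term counts."""
    return Counter(tokens)


def tfidf(tokens, corpus):
    """TF-IDF with smoothed IDF, L2-normalised per document."""
    n = len(corpus)
    weights = {}
    for term, tf in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log((1 + n) / (1 + df)) + 1
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


first = tokenized[0]
print(bow(first))               # every term has count 1
print(tfidf(first, tokenized))  # rarer terms get larger weights
```

In the first document all four terms have the same BoW count, but TF-IDF gives "2023" and "authors" (each seen in one document) more weight than "copyright" and "foo" (each seen in two).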

Hyperparameter Tuning

  • In addition to manually tuning the parameters, I tried applying GridSearch to the SVM and FastText parameters. The combinatorial explosion of the parameter space made an exhaustive search infeasible.
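
A rough sketch of why the search grows so quickly. The grid below is hypothetical (the notes do not record which values were tried), and the `svm__`/`fasttext__` key names are illustrative labels only. The point is that the number of model fits is the product of the option counts, multiplied again by the number of cross-validation folds.

```python
from itertools import product
from math import prod

# Hypothetical grid; the actual values tried are not recorded in the notes.
param_grid = {
    "svm__C": [0.1, 1, 10, 100],
    "svm__kernel": ["linear", "rbf"],
    "svm__gamma": ["scale", 0.01, 0.1],
    "fasttext__dim": [50, 100, 300],
    "fasttext__epoch": [5, 10, 25],
    "fasttext__lr": [0.05, 0.1, 0.5],
    "fasttext__wordNgrams": [1, 2, 3],
}

# One candidate per element of the Cartesian product of all option lists.
n_combinations = prod(len(values) for values in param_grid.values())
cv_folds = 5  # assumed number of cross-validation folds
n_fits = n_combinations * cv_folds

# The first candidate an exhaustive search would try.
first_candidate = dict(zip(param_grid, next(product(*param_grid.values()))))

print(n_combinations, n_fits)
print(first_candidate)
```

Even this modest grid needs 1944 candidates and 9720 fits with 5 folds. If each fit involves training embeddings, a randomized search over a sample of candidates may be a more practical alternative.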

Confidence Thresholding with predict_proba

  • I tested several confidence thresholds (0.999, 0.99, 0.95) and found that 0.99 generally gave the best trade-off between the two classes.
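
A minimal sketch of the thresholding rule. It assumes class 1 is predicted only when its probability reaches the threshold and class 0 is predicted otherwise; that direction is inferred from the reported numbers (class 0 errors fall and class 1 errors rise as the threshold increases) rather than stated in the notes. The helper name and the probabilities are hypothetical.

```python
def predict_with_threshold(proba_class1, threshold=0.99):
    """Return 1 only when P(class 1) >= threshold, otherwise 0."""
    return [1 if p >= threshold else 0 for p in proba_class1]


# Hypothetical class-1 probabilities, e.g. clf.predict_proba(X)[:, 1]
# for a fitted scikit-learn classifier whose classes_ are [0, 1].
proba = [0.9995, 0.993, 0.97, 0.60, 0.02]

for t in (0.999, 0.99, 0.95):
    print(t, predict_with_threshold(proba, t))
```

A stricter threshold turns more borderline class 1 predictions into class 0, which is the trade-off visible in the per-threshold results.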

Model Performance Without Threshold

  • Number of misclassifications in class 0: 145 out of 16079 (approx. 0.9% misclassified)
  • Number of misclassifications in class 1: 81 out of 5691 (approx. 1.42% misclassified)

Performance with 0.999 Threshold

  • Number of misclassifications in class 0: 6 out of 16079 (approx. 0.04% misclassified)
  • Number of misclassifications in class 1: 4072 out of 5691 (approx. 71.55% misclassified)

Performance with 0.99 Threshold

  • Number of misclassifications in class 0: 27 out of 16079 (approx. 0.17% misclassified)
  • Number of misclassifications in class 1: 721 out of 5691 (approx. 12.67% misclassified)

Performance with 0.95 Threshold

  • Number of misclassifications in class 0: 41 out of 16079 (approx. 0.25% misclassified)
  • Number of misclassifications in class 1: 387 out of 5691 (approx. 6.8% misclassified)
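
The percentages above can be recomputed from the reported counts with a short script. The counts are taken from these notes; the dictionary layout and function name are just one way to organise them.

```python
# (misclassified, total) per class, as reported above.
results = {
    "no threshold": {0: (145, 16079), 1: (81, 5691)},
    "0.999": {0: (6, 16079), 1: (4072, 5691)},
    "0.99": {0: (27, 16079), 1: (721, 5691)},
    "0.95": {0: (41, 16079), 1: (387, 5691)},
}


def rate(errors, total):
    """Misclassification rate in percent, rounded to two decimals."""
    return round(100 * errors / total, 2)


rates = {
    name: {cls: rate(*counts) for cls, counts in per_class.items()}
    for name, per_class in results.items()
}

for name, per_class in rates.items():
    print(f"{name}: class 0 = {per_class[0]}%, class 1 = {per_class[1]}%")
```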

Choice of Threshold

  • We settled on the 0.99 threshold. With further improvements to the model, we aim to bring the error rate down to about 0.1% or lower, i.e. roughly 1 misclassification per 1,000 actual copyrights.

Conclusion and Further Plans:

TF-IDF Performance

  • Focus on improving TF-IDF performance:
    • Explore different TF-IDF parameters, which look promising for further gains.
    • Refine the preprocessing function so that it is tailored to our copyright classification task.

RNN Model Exploration

  • Evaluate an RNN model combined with the improved preprocessing function.

GitHub Repository

  • Move from gists to a full GitHub repository for better documentation.

Language Detection

  • Devise a language detection mechanism to handle rows written in languages other than English, with the aim of further improving classification.