
Week 11

(August 09, 2023)

Attendees:

Updates:

Datasets & Findings:

  • Dataset Corrections: This week began with a detailed inspection of the datasets, which led to the correction of several errors. The corrected datasets and the current model's predictions have been updated in this spreadsheet.

  • Inconsistencies Addressed: I found that rows in other languages were handled inconsistently across datasets. For consistency, all such records are now treated as copyrights (Class 0), to be revisited manually later.

  • Annotation Mistakes: Model predictions surfaced errors in the dataset annotations. These errors have been fixed, and the updates are available in the aforementioned spreadsheet.

  • Dataset Merging: Given the presence of different languages across datasets, I decided to consolidate all datasets for training, setting aside 20% for testing (a merge-and-split sketch follows this list). The new dataset comprises:

    • Class 0 (copyrights): 75.23% (16377 rows)
    • Class 1: 24.77% (5393 rows)
    • Total rows: 21770
  • Additional Dataset: Gaurav has provided an additional dataset comprising 26188 unique rows. I've yet to label this dataset.
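
A minimal sketch of the merge-and-split step, assuming the individual datasets are CSV files with hypothetical `text`/`label` columns (the actual file names and schema are not recorded in this report):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file names; the actual datasets are tracked in the project spreadsheet.
files = ["dataset_a.csv", "dataset_b.csv", "dataset_c.csv"]

# Concatenate everything into one frame and drop exact duplicates.
merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
merged = merged.drop_duplicates()

# Hold out 20% for testing; stratifying preserves the ~75/25 class balance.
train_df, test_df = train_test_split(
    merged, test_size=0.20, stratify=merged["label"], random_state=42
)
print(train_df["label"].value_counts(normalize=True))
```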

Model Performance:

  • TF-IDF Vectorizer: The model performed well using the TF-IDF vectorizer alone, without additional preprocessing (a sketch of the full pipeline follows this list):

    • Class 0 misclassifications: 0.32% (52 out of 16377)
    • Class 1 misclassifications: 0.61% (33 out of 5393)
  • Preprocessing Enhancements: I devised a preprocessing function that replaces digits, copyright symbols, emails, and more. This reduced Class 0 misclassifications, at a slight cost to Class 1:

    • Class 0: 0.26% (43 out of 16377)
    • Class 1: 0.82% (44 out of 5393)
  • TF-IDF Parameter Tweaking: Further fine-tuning of TF-IDF parameters allowed the model to achieve:

    • Class 0 misclassifications: 0.16% (27 out of 16377)
    • Class 1 misclassifications: 0.54% (29 out of 5393)
  • Thresholding at 0.99: Applying a probability threshold of 0.99 sharply reduced Class 0 misclassifications, at the cost of more Class 1 errors:

    • Class 0 misclassifications: 0.03% (5 out of 16377)
    • Class 1 misclassifications: 4.6% (248 out of 5393)
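
The exact preprocessing function, tuned TF-IDF parameters, and classifier are not recorded in this report, so the sketch below is only one plausible version of the pipeline: the regex replacements follow the description above, the TF-IDF settings and LogisticRegression are illustrative assumptions, and the thresholding reads the 0.99 cutoff as "predict Class 1 only when the model is at least 99% confident", which matches the direction of the reported shift (Class 0 errors down, Class 1 errors up).

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def preprocess(text: str) -> str:
    """Replace variable tokens (emails, copyright marks, digits) with placeholders."""
    text = text.lower()
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", " EMAIL ", text)  # email addresses
    text = re.sub(r"©|\(c\)|copyright", " COPYRIGHT ", text)     # copyright symbols
    text = re.sub(r"\d+", " DIGIT ", text)                       # digit runs
    return re.sub(r"\s+", " ", text).strip()

# Illustrative TF-IDF settings; the actual tuned parameters are not listed above.
vectorizer = TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 2),
                             sublinear_tf=True)
clf = LogisticRegression(max_iter=1000)

# Tiny toy corpus so the sketch runs end to end; real training uses the merged split.
train_texts = [
    "Copyright (c) 2019 Acme Corp. All rights reserved.",
    "© 2021 Jane Doe <jane@example.com>",
    "def parse(path): return open(path).read()",
    "This file is distributed under the MIT license.",
]
train_labels = [0, 0, 1, 1]  # 0 = copyright, 1 = other
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

def predict_with_threshold(texts, threshold=0.99):
    """Predict Class 1 only when P(Class 1) >= threshold; otherwise Class 0."""
    probs = clf.predict_proba(vectorizer.transform(texts))[:, 1]
    return (probs >= threshold).astype(int)

print(predict_with_threshold(["Copyright (c) 2020 Example Industries"]))
```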

External Datasets Testing:

  • Fossology-provided-2 dataset: Initial results on this dataset indicated:

    • Class 0 misclassifications: 0.46% (27 out of 5808)
    • However, after manual inspection, only 12 were genuine misclassifications.
  • Dataset from Paper: I tested the model on the dataset from this paper. The results were:

    • Class 0 misclassifications: 0.09% (2 out of 2146)
    • Class 1 misclassifications: 1.32% (2 out of 151)
    • Notably, on manual inspection, the two Class 1 misclassifications turned out to be correct predictions by our model, i.e., the dataset's labels were wrong.

Feature Extraction & LDA:

  • Feature Extraction from Paper: Implementing the paper's feature extraction method yielded the following results:

    • Class 0 misclassifications: 2.91% (477 out of 16377)
    • Class 1 misclassifications: 6.93% (374 out of 5393)
  • LDA Analysis: Using LDA, I identified the 20 most frequent words in each class, offering insights for potential feature-extraction enhancements (a sketch follows).
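
A sketch of the LDA step, assuming a single-topic model fitted per class, whose top-weighted words approximate that class's most frequent terms; the per-class text lists are hypothetical names:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_words_for_class(texts, n_words=20):
    """Fit a single-topic LDA on one class's rows and return its top words."""
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=1, random_state=42)
    lda.fit(counts)
    vocab = vec.get_feature_names_out()
    weights = lda.components_[0]
    return [vocab[i] for i in weights.argsort()[::-1][:n_words]]

# Hypothetical usage: class0_texts / class1_texts are lists of row strings.
# print(top_words_for_class(class0_texts))
```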

Language Detection:

  • cld3 Limitation: Although cld3 proved efficient, its Apache License 2.0 is incompatible with Fossology's GNU General Public License v2.0.

  • spaCy's Model: I also tried a spaCy-based language detection model, but many English rows were misclassified as non-English and vice versa (see the sketch below).
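
For illustration, here is how such a row-level check might look, using the langdetect package as a stand-in detector (its license would also need review for Fossology compatibility); the confidence cutoff is an illustrative assumption:

```python
from langdetect import DetectorFactory, detect_langs
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

def is_english(text: str, min_prob: float = 0.90) -> bool:
    """True when the top detected language is English with high confidence."""
    try:
        best = detect_langs(text)[0]  # results are sorted by probability
    except LangDetectException:       # empty or undecidable input
        return False
    return best.lang == "en" and best.prob >= min_prob

# Hypothetical usage: route low-confidence / non-English rows to manual review.
rows = ["Copyright (c) 2021 Acme Corp.", "Tous droits réservés 2021"]
print([r for r in rows if not is_english(r)])
```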

GitHub Repository:

Conclusion & Future Plans:

Language Detection

  • Investigate more efficient language detection methods.

Preprocessing Improvements

  • Enhance preprocessing by using NER for name and organization replacements (a sketch follows).
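
A minimal sketch of what that NER replacement could look like with spaCy's small English model; the placeholder tokens are illustrative assumptions:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def replace_entities(text: str) -> str:
    """Swap detected person and organization names for placeholder tokens."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "ORG"):
            out.append(text[last:ent.start_char])
            out.append("NAME" if ent.label_ == "PERSON" else "ORGANIZATION")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(replace_entities("Copyright (c) 2020 John Doe, Acme Corporation."))
```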

Feature Extraction

  • Delve deeper into feature extraction techniques.

Documentation

  • Clean up my documentation.
  • Clean up and update my GitHub repository.