Week 11
(August,09,2023)
Attendees:
Updates:
Datasets & Findings:
-
Dataset Corrections: This week commenced with a detailed inspection of datasets which led to the rectification of various errors. The corrected datasets and predictions from the current model have been updated in this spreadsheet.
-
Inconsistencies Addressed: I found that the treatment of separate language rows varied across datasets. To maintain consistency, all such records have been treated as copyrights, requiring manual intervention later.
-
Annotating Mistakes: Through model predictions, I detected errors in dataset annotations. These errors have been fixed and the updates can be found in the aforementioned spreadsheet.
-
Dataset Merging: Given the presence of different languages across datasets, I decided to consolidate all datasets for training, setting aside 20% for testing. The new dataset comprises:
- Class 0 (copyrights): 75.22% (16377 rows)
- Class 1: 24.77% (5393 rows)
- Total rows: 21770
-
Additional Dataset: Gaurav has provided an additional dataset comprising 26188 unique rows. I've yet to label this dataset.
Model Performance:
-
TF-IDF Vectorizer: The model achieved significant results using the TF-IDF vectorizer, without additional preprocessing:
- Class 0 misclassifications: 0.32% (52 out of 16377)
- Class 1 misclassifications: 0.61% (33 out of 5393)
-
Preprocessing Enhancements: I devised a preprocessing function which improved the model's performance. These enhancements include replacing digits, copyright symbols, emails, and more. This approach reduced the misclassifications:
- Class 0: 0.26% (43 out of 16377)
- Class 1: 0.82% (44 out of 5393)
-
TF-IDF Parameter Tweaking: Further fine-tuning of TF-IDF parameters allowed the model to achieve:
- Class 0 misclassifications: 0.16% (27 out of 16377)
- Class 1 misclassifications: 0.54% (29 out of 5393)
-
Thresholding at 0.99: Applying a threshold of 0.99 rendered impressive results:
- Class 0 misclassifications: 0.03% (5 out of 16377)
- Class 1 misclassifications: 4.6% (248 out of 5393)
External Datasets Testing:
-
Fossology-provided-2 dataset: Initial results on this dataset indicated:
- Class 0 misclassifications: 0.46% (27 out of 5808)
- However, after manual inspection, only 12 were genuine misclassifications.
-
Dataset from Paper: I tested the model on the dataset from this paper. The results were:
- Class 0 misclassifications: 0.09% (2 out of 2146)
- Class 1 misclassifications: 1.32% (2 out of 151)
- Notably, the two misclassifications in class 1 were found to be correctly predicted by our model.
Feature Extraction & LDA:
-
Feature Extraction from Paper: Implementing the paper's feature extraction method yielded the following results:
- Class 0 misclassifications: 2.91% (477 out of 16377)
- Class 1 misclassifications: 6.93% (374 out of 5393)
-
LDA Analysis: Leveraging LDA, I identified the 20 most frequent words in each class, offering insights for potential feature extraction enhancements.
Language Detection:
-
cld3 Limitation: Although
cld3
proved efficient, itsApache License 2.0
is incompatible with Fossology'sGNU General Public License v2.0
. -
spaCy's Model: Despite utilizing spaCy's language detection model, many English rows were misclassified as non-English and vice versa.
GitHub Repository:
- I've established a GitHub repository to store all project files.
Conclusion & Future Plans:
Language Detection
- Investigate more efficient language detection methods.
Preprocessing Improvements
- Enhance preprocessing by using NER for name and organization replacements.
Feature Extraction
- Delve deeper into feature extraction techniques.
Documentation
- Cleanup my documentation
- Cleanup and update my GitHub repository.