
Week 11

(August 09, 2023)

Attendees:

Updates:

Datasets & Findings:

  • Dataset Corrections: This week began with a detailed inspection of the datasets, which led to the correction of several errors. The corrected datasets and the current model's predictions have been updated in this spreadsheet.

  • Inconsistencies Addressed: I found that rows in other languages were handled inconsistently across datasets. For consistency, all such records are now treated as copyrights (Class 0), to be revisited manually later.

  • Annotation Mistakes: Model predictions surfaced errors in the dataset annotations. These errors have been fixed, and the updates are available in the aforementioned spreadsheet.

  • Dataset Merging: Given the presence of different languages across datasets, I decided to consolidate all datasets for training, setting aside 20% for testing (a merge-and-split sketch follows this list). The new dataset comprises:

    • Class 0 (copyrights): 75.23% (16377 rows)
    • Class 1: 24.77% (5393 rows)
    • Total rows: 21770
  • Additional Dataset: Gaurav has provided an additional dataset comprising 26188 unique rows. I've yet to label this dataset.
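
A minimal sketch of the merge-and-split step, assuming the individual datasets are CSV files with hypothetical `text`/`label` columns (the actual file names and schema are not recorded in this report):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file names; the actual datasets are tracked in the project spreadsheet.
files = ["dataset_a.csv", "dataset_b.csv", "dataset_c.csv"]

# Concatenate everything into one frame and drop exact duplicates.
merged = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
merged = merged.drop_duplicates()

# Hold out 20% for testing; stratifying preserves the ~75/25 class balance.
train_df, test_df = train_test_split(
    merged, test_size=0.20, stratify=merged["label"], random_state=42
)
print(train_df["label"].value_counts(normalize=True))
```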

Model Performance:

  • TF-IDF Vectorizer: The model performed well using the TF-IDF vectorizer alone, without additional preprocessing (a sketch of the full pipeline follows this list):

    • Class 0 misclassifications: 0.32% (52 out of 16377)
    • Class 1 misclassifications: 0.61% (33 out of 5393)
  • Preprocessing Enhancements: I devised a preprocessing function that replaces digits, copyright symbols, emails, and more. This reduced Class 0 misclassifications, at a slight cost to Class 1:

    • Class 0: 0.26% (43 out of 16377)
    • Class 1: 0.82% (44 out of 5393)
  • TF-IDF Parameter Tweaking: Further fine-tuning of TF-IDF parameters allowed the model to achieve:

    • Class 0 misclassifications: 0.16% (27 out of 16377)
    • Class 1 misclassifications: 0.54% (29 out of 5393)
  • Thresholding at 0.99: Applying a probability threshold of 0.99 sharply reduced Class 0 misclassifications, at the cost of more Class 1 errors:

    • Class 0 misclassifications: 0.03% (5 out of 16377)
    • Class 1 misclassifications: 4.6% (248 out of 5393)
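
The exact preprocessing function, tuned TF-IDF parameters, and classifier are not recorded in this report, so the sketch below is only one plausible version of the pipeline: the regex replacements follow the description above, the TF-IDF settings and LogisticRegression are illustrative assumptions, and the thresholding reads the 0.99 cutoff as "predict Class 1 only when the model is at least 99% confident", which matches the direction of the reported shift (Class 0 errors down, Class 1 errors up).

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def preprocess(text: str) -> str:
    """Replace variable tokens (emails, copyright marks, digits) with placeholders."""
    text = text.lower()
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", " EMAIL ", text)  # email addresses
    text = re.sub(r"©|\(c\)|copyright", " COPYRIGHT ", text)     # copyright symbols
    text = re.sub(r"\d+", " DIGIT ", text)                       # digit runs
    return re.sub(r"\s+", " ", text).strip()

# Illustrative TF-IDF settings; the actual tuned parameters are not listed above.
vectorizer = TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 2),
                             sublinear_tf=True)
clf = LogisticRegression(max_iter=1000)

# Tiny toy corpus so the sketch runs end to end; real training uses the merged split.
train_texts = [
    "Copyright (c) 2019 Acme Corp. All rights reserved.",
    "© 2021 Jane Doe <jane@example.com>",
    "def parse(path): return open(path).read()",
    "This file is distributed under the MIT license.",
]
train_labels = [0, 0, 1, 1]  # 0 = copyright, 1 = other
clf.fit(vectorizer.fit_transform(train_texts), train_labels)

def predict_with_threshold(texts, threshold=0.99):
    """Predict Class 1 only when P(Class 1) >= threshold; otherwise Class 0."""
    probs = clf.predict_proba(vectorizer.transform(texts))[:, 1]
    return (probs >= threshold).astype(int)

print(predict_with_threshold(["Copyright (c) 2020 Example Industries"]))
```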

External Datasets Testing:

  • Fossology-provided-2 dataset: Initial results on this dataset indicated:

    • Class 0 misclassifications: 0.46% (27 out of 5808)
    • However, after manual inspection, only 12 were genuine misclassifications.
  • Dataset from Paper: I tested the model on the dataset from this paper. The results were:

    • Class 0 misclassifications: 0.09% (2 out of 2146)
    • Class 1 misclassifications: 1.32% (2 out of 151)
    • Notably, on manual inspection, the two Class 1 misclassifications turned out to be correct predictions by our model, i.e., the dataset's labels were wrong.

Feature Extraction & LDA:

  • Feature Extraction from Paper: Implementing the paper's feature extraction method yielded the following results:

    • Class 0 misclassifications: 2.91% (477 out of 16377)
    • Class 1 misclassifications: 6.93% (374 out of 5393)
  • LDA Analysis: Using LDA, I identified the 20 most frequent words in each class, offering insights for potential feature-extraction enhancements (a sketch follows).
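
A sketch of the LDA step, assuming a single-topic model fitted per class, whose top-weighted words approximate that class's most frequent terms; the per-class text lists are hypothetical names:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def top_words_for_class(texts, n_words=20):
    """Fit a single-topic LDA on one class's rows and return its top words."""
    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=1, random_state=42)
    lda.fit(counts)
    vocab = vec.get_feature_names_out()
    weights = lda.components_[0]
    return [vocab[i] for i in weights.argsort()[::-1][:n_words]]

# Hypothetical usage: class0_texts / class1_texts are lists of row strings.
# print(top_words_for_class(class0_texts))
```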

Language Detection:

  • cld3 Limitation: Although cld3 proved efficient, its Apache License 2.0 is incompatible with Fossology's GNU General Public License v2.0.

  • spaCy's Model: I also tried a spaCy-based language detection model, but many English rows were misclassified as non-English and vice versa (see the sketch below).
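
For illustration, here is how such a row-level check might look, using the langdetect package as a stand-in detector (its license would also need review for Fossology compatibility); the confidence cutoff is an illustrative assumption:

```python
from langdetect import DetectorFactory, detect_langs
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic without a fixed seed

def is_english(text: str, min_prob: float = 0.90) -> bool:
    """True when the top detected language is English with high confidence."""
    try:
        best = detect_langs(text)[0]  # results are sorted by probability
    except LangDetectException:       # empty or undecidable input
        return False
    return best.lang == "en" and best.prob >= min_prob

# Hypothetical usage: route low-confidence / non-English rows to manual review.
rows = ["Copyright (c) 2021 Acme Corp.", "Tous droits réservés 2021"]
print([r for r in rows if not is_english(r)])
```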

GitHub Repository:

Conclusion & Future Plans:

Language Detection

  • Investigate more efficient language detection methods.

Preprocessing Improvements

  • Enhance preprocessing by using NER for name and organization replacements (a sketch follows).
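
A minimal sketch of what that NER replacement could look like with spaCy's small English model; the placeholder tokens are illustrative assumptions:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def replace_entities(text: str) -> str:
    """Swap detected person and organization names for placeholder tokens."""
    doc = nlp(text)
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ in ("PERSON", "ORG"):
            out.append(text[last:ent.start_char])
            out.append("NAME" if ent.label_ == "PERSON" else "ORGANIZATION")
            last = ent.end_char
    out.append(text[last:])
    return "".join(out)

print(replace_entities("Copyright (c) 2020 John Doe, Acme Corporation."))
```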

Feature Extraction

  • Delve deeper into feature extraction techniques.

Documentation

  • Clean up my documentation.
  • Clean up and update my GitHub repository.