Week 9
(July 26, 2023)
Attendees:
Updates:
SVM Testing on Vectorization Algorithms and Pre-trained Word Embeddings
- Vectorizers and Embeddings Tested (see the evaluation sketch after this list):
  - Bag of Words (BoW)
  - Term Frequency-Inverse Document Frequency (TF-IDF)
  - GloVe (averaging word vectors for each sentence)
  - FastText
  - Sentence Transformers
  - Word2Vec
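As a reference for how these comparisons can be wired up, here is a minimal sketch of a vectorizer-plus-SVM evaluation loop using scikit-learn. `load_fossology_dataset` is a hypothetical placeholder for the real data loader, and the loop only covers the two count-based vectorizers; the actual test harness may differ.

```python
# Minimal sketch of the vectorizer comparison, assuming scikit-learn.
# load_fossology_dataset is a hypothetical placeholder, not real project code.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts, labels = load_fossology_dataset()  # hypothetical loader
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

for name, vectorizer in [("BoW", CountVectorizer()),
                         ("TF-IDF", TfidfVectorizer())]:
    model = make_pipeline(vectorizer, LinearSVC())
    model.fit(X_train, y_train)
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```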
Results from Vectorization and Embeddings
- BoW and TF-IDF yielded the most promising results in terms of accuracy.
- GloVe embeddings were tested at four dimensionalities: 50, 100, 200, and 300. Even the best-performing 300-dimensional embeddings underperformed TF-IDF by around 4% for both classes 0 and 1 (a sentence-averaging sketch follows this list).
- FastText's pre-trained embeddings (sourced from Wikipedia) were larger than 7 GB, making them impractical to load. I therefore trained the embedder from scratch on our dataset; the resulting model still performed slightly below TF-IDF.
- Other embedders lagged even further in performance.
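The sentence-level GloVe averaging mentioned above can be sketched as follows, assuming the embeddings are fetched through gensim's downloader (the project may instead load raw GloVe text files); sentences with no in-vocabulary tokens fall back to a zero vector.

```python
# Sentence vectors by averaging GloVe word embeddings (a sketch).
# Assumes gensim's downloader; loading raw GloVe files works equally well.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")  # 300-dimensional vectors

def sentence_vector(sentence: str) -> np.ndarray:
    """Average the GloVe vectors of in-vocabulary tokens; zeros if none."""
    tokens = [t for t in sentence.lower().split() if t in glove]
    if not tokens:
        return np.zeros(glove.vector_size)
    return np.mean([glove[t] for t in tokens], axis=0)

# These fixed-length vectors replace the TF-IDF features as SVM input.
features = np.vstack([sentence_vector(s)
                      for s in ["copyright 2019 acme corp", "return 0;"]])
```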
TF-IDF Model Performance
Precision
| Dataset | Class 0 | Class 1 |
|:-----|---------:|---------:|
| 0 | 0.991262 | 0.967086 |
| 1 | 0.97284 | 0.703488 |
| 2 | 0.945312 | 0.892562 |
| 3 | 0.991701 | 0.911765 |
| 4 | 0.995004 | 0.974809 |
| Mean | 0.979224 | 0.889942 |
Recall
| Dataset | Class 0 | Class 1 |
|:-----|---------:|---------:|
| 0 | 0.988153 | 0.975586 |
| 1 | 0.885393 | 0.916667 |
| 2 | 0.902985 | 0.93913 |
| 3 | 0.980312 | 0.96124 |
| 4 | 0.990982 | 0.985943 |
| Mean | 0.949565 | 0.955713 |
F1-score
| Dataset | Class 0 | Class 1 |
|:-----|---------:|---------:|
| 0 | 0.989705 | 0.971317 |
| 1 | 0.927059 | 0.796053 |
| 2 | 0.923664 | 0.915254 |
| 3 | 0.985974 | 0.935849 |
| 4 | 0.992989 | 0.980344 |
| Mean | 0.963878 | 0.919764 |
Datasets Explained
- 0 corresponds to the test split (20% of the Fossology dataset), with training performed on the remaining 80%.
- 1 is the Kubernetes dataset.
- 2 is the Tensorflow dataset.
- 3 is the Fossology-provided-dataset-1.
- 4 is a merged set of all the above datasets, including the training data.
Why TF-IDF and BoW Outperformed
- The dataset size may not be large enough to realize the benefits of more advanced embeddings.
- Copyright classification differs from conventional text classification due to the presence of code snippets and other unique features.
- The absence of text preprocessing in the current iteration might be a limiting factor.
SVM's `predict_proba` method
- Discussions with Anupam led to a consensus on continuing the tests with SVM, leveraging its `predict_proba` method. This method returns the probability associated with each SVM prediction, offering insight into the model's confidence. A threshold can be set on this confidence to potentially enhance recall, even at the cost of some precision (see the sketch below).
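A minimal sketch of the thresholding idea follows. The synthetic features stand in for the real TF-IDF matrix, and the 0.3 threshold is illustrative rather than a tuned value.

```python
# Sketch of confidence thresholding via SVC's predict_proba.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in features; in practice these are TF-IDF vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (X[:, 0] > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear", probability=True)  # enables Platt-scaled probabilities
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]  # confidence that a sample is class 1
threshold = 0.3                          # below 0.5: trades precision for recall
y_pred = (proba >= threshold).astype(int)
```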
Problems and Solutions
Problem 1
- Classification reports were overly verbose, consuming excess space, and included redundant information.
Solution 1
- Developed a function that condenses the report for each dataset, rounding each metric to a consistent number of decimal places.
- The function computes the average precision, recall, and F1-score, providing a comprehensive yet concise view of model performance across datasets, irrespective of their sizes. A sketch follows.
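One possible shape for such a helper, assuming scikit-learn and pandas; the function name, dataset-dictionary layout, and six-decimal rounding are assumptions, not the project's actual code.

```python
# Condensed per-dataset report with an unweighted mean row (a sketch).
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def condensed_report(datasets, clf):
    """One row per dataset with class-0/1 precision, recall, and F1."""
    rows = {}
    for name, (X, y) in datasets.items():
        p, r, f, _ = precision_recall_fscore_support(y, clf.predict(X),
                                                     labels=[0, 1])
        rows[name] = [*p, *r, *f]
    df = pd.DataFrame(rows, index=["prec_0", "prec_1", "rec_0", "rec_1",
                                   "f1_0", "f1_1"]).T.round(6)
    df.loc["Mean"] = df.mean()  # unweighted mean, irrespective of dataset size
    return df
```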
Conclusion and Further Plans:
Text Preprocessing
- Aim to evaluate the efficacy of each vectorization method after text preprocessing (an illustrative pass is sketched below).
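For reference, a trivially simple preprocessing pass of the kind that could be evaluated; the actual steps (and whether they help at all, given the code snippets in the data) are still to be decided.

```python
# Illustrative preprocessing pass; the steps to evaluate are still TBD.
import re

def preprocess(text: str) -> str:
    """Lowercase, strip punctuation and digits, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()
```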
SVM `predict_proba` method
- Assess the performance of the `predict_proba` method within the SVM framework.