
Week 9

(July 26, 2023)

Attendees:

Updates:

SVM Testing on Vectorization Algorithms and Pre-trained Word Embeddings

  • Vectorizers and Embeddings Tested:
    • Bag of Words (BoW)
    • Term Frequency - Inverse Document Frequency (TF-IDF)
    • GloVe (averaging word vectors for each sentence)
    • FastText
    • Sentence Transformers
    • Word2Vec
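
For reference, here is a minimal sketch of two of these setups, assuming scikit-learn and gensim; the training data is a placeholder, and `SVC` with default parameters stands in for whichever SVM variant was actually used. It shows the BoW/TF-IDF pipelines and the GloVe sentence-averaging step:

```python
import numpy as np
import gensim.downloader as api
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder training data; the real experiments used the Fossology dataset.
train_texts = ["copyright (c) 2023 example corp", "def foo(): return 1"]
train_labels = [1, 0]

# BoW and TF-IDF pipelines feeding an SVM (SVC with defaults is an assumption).
bow_clf = make_pipeline(CountVectorizer(), SVC())
tfidf_clf = make_pipeline(TfidfVectorizer(), SVC())
tfidf_clf.fit(train_texts, train_labels)

# GloVe: represent a sentence by the mean of its tokens' word vectors.
glove = api.load("glove-wiki-gigaword-300")  # 300-d vectors; downloads on first use

def sentence_vector(sentence, vectors, dim=300):
    """Average the word vectors of all in-vocabulary tokens."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)
```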

Results from Vectorization and Embeddings

  • BoW and TF-IDF yielded the most promising results in terms of accuracy.
  • GloVe embeddings were tested across four dimensions: 50, 100, 200, and 300. The best-performing 300-dimensional embeddings still underperformed TF-IDF by around 4% for both classes 0 and 1.
  • FastText's pre-trained embeddings (sourced from Wikipedia) were larger than 7 GB, making them impractical to load. Hence, I trained the embedder from scratch on our dataset, which performed slightly worse than GloVe.
  • Other embedders lagged even further in performance.

TF-IDF Model Performance

Precision
| Dataset | Class 0 | Class 1 |
|:-----|---------:|---------:|
| 0 | 0.991262 | 0.967086 |
| 1 | 0.97284 | 0.703488 |
| 2 | 0.945312 | 0.892562 |
| 3 | 0.991701 | 0.911765 |
| 4 | 0.995004 | 0.974809 |
| Mean | 0.979224 | 0.889942 |

Recall
| Dataset | Class 0 | Class 1 |
|:-----|---------:|---------:|
| 0 | 0.988153 | 0.975586 |
| 1 | 0.885393 | 0.916667 |
| 2 | 0.902985 | 0.93913 |
| 3 | 0.980312 | 0.96124 |
| 4 | 0.990982 | 0.985943 |
| Mean | 0.949565 | 0.955713 |

F1-score
| Dataset | Class 0 | Class 1 |
|:-----|---------:|---------:|
| 0 | 0.989705 | 0.971317 |
| 1 | 0.927059 | 0.796053 |
| 2 | 0.923664 | 0.915254 |
| 3 | 0.985974 | 0.935849 |
| 4 | 0.992989 | 0.980344 |
| Mean | 0.963878 | 0.919764 |

Datasets Explained

  • 0 corresponds to the test dataset (20% of the Fossology dataset), with training performed on the remaining 80%.
  • 1 represents the Kubernetes dataset.
  • 2 stands for the Tensorflow dataset.
  • 3 is the Fossology-provided-dataset-1.
  • 4 comprises a merged set of all aforementioned datasets, including the training data.

Why TF-IDF and BoW Outperformed

  1. The dataset size may not be large enough to realize the benefits of more advanced embeddings.
  2. Copyright classification differs from conventional text classification due to the presence of code snippets and other unique features.
  3. The absence of text preprocessing in the current iteration might be a limiting factor.

SVM's predict_proba method

  • Discussions with Anupam led to a consensus on continuing the tests with SVM, leveraging its predict_proba method. This method provides the probability associated with each SVM prediction, offering insight into the model's confidence. A threshold can be set on this confidence to potentially enhance recall, even at the cost of some precision.
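
A minimal sketch of the thresholding idea, using scikit-learn's SVC (whose predict_proba requires probability=True, enabling Platt-scaled probability estimates); the toy data and the 0.6 cutoff are illustrative assumptions, not values from the notes:

```python
import numpy as np
from sklearn.svm import SVC

# Toy placeholder data standing in for the vectorized copyright samples.
X_train = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9], [1.0]]
y_train = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_test = [[0.15], [0.55], [0.95]]

clf = SVC(probability=True)  # required for predict_proba
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)          # shape: (n_samples, n_classes)
confidence = proba.max(axis=1)             # confidence of each prediction
pred = clf.classes_[proba.argmax(axis=1)]

# Illustrative 0.6 cutoff (an assumption, not a value from the notes):
# low-confidence predictions are pushed to the positive class, trading
# precision for recall.
pred = np.where(confidence < 0.6, 1, pred)
print(pred)
```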

Problems and Solutions

Problem 1

  • Classification reports were overly verbose, consuming excess space and including redundant information.

Solution 1

  • Developed a function to streamline the report for each dataset, displaying metrics with more than the default two decimal places of precision.
  • This function computes the average precision, recall, and F1-scores, providing a comprehensive yet concise view of model performance across datasets, irrespective of their sizes.
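
A sketch of what such a summarizing function might look like, built on scikit-learn's precision_recall_fscore_support; the function name and dataset mapping are assumptions, not the actual code from the repository:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

def summarize(datasets, clf):
    """Print one compact metrics row per dataset plus an unweighted mean row.

    `datasets` maps a dataset id (0-4 above) to an (X, y) pair. This is a
    sketch of the idea, not the exact function from the repository.
    """
    per_dataset = []
    for name, (X, y) in datasets.items():
        p, r, f1, _ = precision_recall_fscore_support(
            y, clf.predict(X), labels=[0, 1], zero_division=0
        )
        per_dataset.append((p, r, f1))
        print(f"{name}: precision={np.round(p, 6)} "
              f"recall={np.round(r, 6)} f1={np.round(f1, 6)}")
    # Unweighted mean across datasets, matching the "Mean" rows above.
    mean_p, mean_r, mean_f = (np.mean([m[i] for m in per_dataset], axis=0)
                              for i in range(3))
    print(f"Mean: precision={np.round(mean_p, 6)} "
          f"recall={np.round(mean_r, 6)} f1={np.round(mean_f, 6)}")
```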

Conclusion and Further Plans:

Text Preprocessing

  • Aim to evaluate the efficacy of each vectorization method after applying text preprocessing; a possible baseline is sketched below.
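
As an illustration only, a baseline preprocessing pass might look like the following; the specific steps (lowercasing, stripping e-mails, URLs, and punctuation) are assumptions, since the notes do not yet fix which steps will be evaluated:

```python
import re

def preprocess(text):
    """Assumed baseline cleanup; the notes only state that preprocessing
    will be evaluated, not which steps it will include."""
    text = text.lower()
    text = re.sub(r"\S+@\S+", " ", text)        # drop e-mail addresses
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # drop punctuation/symbols
    return re.sub(r"\s+", " ", text).strip()
```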

predict_proba SVM method

  • Assess the performance of the predict_proba method within the SVM framework.