Skip to main content

Week 8

(July,19,2023)

Attendees:

Updates:

Fossology Dataset Cleaning

  • The Fossology dataset is now cleared. The clearing results showcase:
    • Green: Copyright
    • Red: False positive
    • Orange: Unsure (consulted mentors)
    • Blue: Non-English texts
    • Gray: Invalid copyrights (e.g., Copyright (c) _____).
  • The final dataset comprises ~20,000 unique strings. Approximately 75% are true copyright notices, and the rest are false positives. This is reduced from an initial ~43,000 rows in the original Fossology dataset.
  • The code is available here. Key findings include:
    • For classical machine learning techniques, SVMs, random forests, and Naive Bayes classifiers were assessed. Random forest outperformed the others.

    • The results of the random forest model are as follows:

      Fossology Dataset (Test Set)

      |               | precision | recall | f1-score | support |
      |---------------|-----------|--------|----------|---------|
      | 0 | 0.99 | 0.98 | 0.99 | 2870 |
      | 1 | 0.95 | 0.97 | 0.96 | 1024 |

      Tensorflow Dataset

      |               | precision | recall | f1-score | support |
      |---------------|-----------|--------|----------|---------|
      | 0 | 1.00 | 0.98 | 0.99 | 14865 |
      | 1 | 0.88 | 0.99 | 0.93 | 1632 |

      Kubernetes Dataset

      |               | precision | recall | f1-score | support |
      |---------------|-----------|--------|----------|---------|
      | 0 | 1.00 | 1.00 | 1.00 | 25786 |
      | 1 | 0.87 | 1.00 | 0.93 | 156 |

Model Performance and Future Directions

  • After discussions with mentors Kaushl and Gaurav, it was decided that recall should be prioritized. While DistilBert was explored, its performance was suboptimal compared to random forests. De-cluttering will likely be approached via Named Entity Recognition (NER). Additionally, Gaurav provided a new dataset of 10,000 copyrights, available here, that will need minor editing before use.

Dataset Creation Problems and Solutions

Problem 1:

Creating a dataset manually is repetitive, time-consuming, and error-prone. Especially with over 20,000 unique rows filled with code and potential copyrights, mislabeling is easy.

Solution 1:

In Google Sheets, conditional formatting rules were implemented to color each row based on its label. This visual cue greatly assisted in the labeling process, speeding up the workflow, and reducing potential errors.

Conclusion and Further Plans:

Exploration

  • Delve deeper into various classifiers and text vectorization methods.

Deep Learning

  • Analyze the efficiency of deep learning models in contrast to traditional machine learning models.

Generalization

  • Ensure all techniques employed perform robustly and generalize well to unseen data.