Skip to main content

Meeting 4

(June 20,2024)

Attendees:

Discussion:

Linux License Dataset

Created a larger, more comprehensive dataset of licenses from Linux sources. This dataset includes line-level annotations to indicate which lines contain license-relevant information.

Continued refining the semantic search implementation:

  • Explored Alternative Models: Tested other sentence transformer models and potential alternatives, but determined that the current model (all-mpnet-base-v2) was not a bottleneck.

Performance Metrics

Introduced two key metrics to evaluate semantic search effectiveness:

  • Accuracy: Measures if at least one relevant line is found per file (binary: 0 or 100%). The current model achieved 96% accuracy on the Linux dataset.
  • Coverage: Calculates the percentage of relevant lines found within each file. The average coverage across files was 80%.

Conclusions and Next Steps

  • Give semantic search one more week and focus on improving license matching accuracy.