Meeting 4
(June 20,2024)
Attendees:
Discussion:
Linux License Dataset
Created a larger, more comprehensive dataset of licenses from Linux sources. This dataset includes line-level annotations to indicate which lines contain license-relevant information.
Refined Semantic Search
Continued refining the semantic search implementation:
- Explored Alternative Models: Tested other sentence transformer models and potential alternatives, but determined that the current model (all-mpnet-base-v2) was not a bottleneck.
Performance Metrics
Introduced two key metrics to evaluate semantic search effectiveness:
- Accuracy: Measures if at least one relevant line is found per file (binary: 0 or 100%). The current model achieved 96% accuracy on the Linux dataset.
- Coverage: Calculates the percentage of relevant lines found within each file. The average coverage across files was 80%.
Conclusions and Next Steps
- Give semantic search one more week and focus on improving license matching accuracy.