Meeting 4

(June 20, 2024)

Attendees:

Kaushlendra Pratap
Gaurav Mishra
Shaheem Azmal M MD
Ayush Bhardwaj
Avinal Kumar

Discussion:

Linux License Dataset

Created a larger, more comprehensive dataset of licenses from Linux sources. This dataset includes line-level annotations to indicate which lines contain license-relevant information.

Refined Semantic Search

Continued refining the semantic search implementation:

Explored Alternative Models: Tested other sentence transformer models and potential alternatives, but determined that the current model (all-mpnet-base-v2) was not a bottleneck.
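
For illustration only, the line-level retrieval step can be sketched as cosine-similarity ranking over line embeddings. The notes do not describe the actual implementation, so everything here is an assumption: the function names (cosine, top_k_lines), the k and threshold parameters, and the idea of ranking each line against a single query vector are all hypothetical. In practice the vectors would presumably come from the sentence-transformers library, e.g. SentenceTransformer("all-mpnet-base-v2").encode(lines); toy vectors are used below so the sketch runs without downloading a model.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_lines(
    query_vec: Sequence[float],
    line_vecs: List[Sequence[float]],
    k: int = 3,
    threshold: float = 0.0,
) -> List[Tuple[int, float]]:
    """Return up to k (line_index, score) pairs, best match first.

    Hypothetical helper: k and threshold are illustrative knobs,
    not values taken from the meeting notes.
    """
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(line_vecs)]
    scored = [pair for pair in scored if pair[1] >= threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" standing in for real model output.
query = [1.0, 0.0]
lines = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
best = top_k_lines(query, lines, k=2)
```

With real embeddings, the returned indices would point at the lines of a source file that are most similar to a license-related query.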

Performance Metrics

Introduced two key metrics to evaluate semantic search effectiveness:

Accuracy: Measures whether at least one relevant line is found in each file (binary per file: 0% or 100%). The current model achieved 96% accuracy on the Linux dataset.
Coverage: Calculates the percentage of relevant lines found within each file. The average coverage across files was 80%.
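
A minimal sketch of how the two metrics above could be computed, assuming the annotations and the search results are each available as a mapping from file name to a set of line numbers. That data layout, the function names, and the choice to skip files without any annotated lines are assumptions made for this example, not details from the meeting.

```python
from typing import Dict, Set, Tuple


def file_accuracy(relevant: Set[int], retrieved: Set[int]) -> float:
    """100.0 if at least one relevant line was retrieved, otherwise 0.0."""
    return 100.0 if relevant & retrieved else 0.0


def file_coverage(relevant: Set[int], retrieved: Set[int]) -> float:
    """Percentage of the relevant lines that were retrieved."""
    return 100.0 * len(relevant & retrieved) / len(relevant)


def evaluate(
    annotations: Dict[str, Set[int]],
    results: Dict[str, Set[int]],
) -> Tuple[float, float]:
    """Average accuracy and coverage over all annotated files.

    Assumption: files with no annotated relevant lines are skipped,
    and a file missing from `results` counts as retrieving nothing.
    """
    accuracies = []
    coverages = []
    for name, relevant in annotations.items():
        if not relevant:
            continue
        retrieved = results.get(name, set())
        accuracies.append(file_accuracy(relevant, retrieved))
        coverages.append(file_coverage(relevant, retrieved))
    if not accuracies:
        return 0.0, 0.0
    return sum(accuracies) / len(accuracies), sum(coverages) / len(coverages)


# Made-up example data, not taken from the Linux dataset.
annotations = {"a.c": {1, 2, 3, 4}, "b.c": {10, 11}}
results = {"a.c": {1, 2, 9}, "b.c": {5}}
accuracy, coverage = evaluate(annotations, results)
```

In this made-up example, a.c has two of its four relevant lines retrieved and b.c has none, so the averages come out to 50% accuracy and 25% coverage.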

Conclusions and Next Steps

Give semantic search one more week and focus on improving license matching accuracy.
