Meeting 5

(June 27, 2024)

Attendees:

Discussion:

Improved Semantic Search Performance

  • Accuracy: Increased to 98.7% as reported in the meeting; the remaining gap to 100% is attributed to labeling errors.
  • Coverage: Improved to over 93% (potentially closer to 96% once labeling issues are accounted for).

License Matching Exploration

  • Hierarchical Approach: Attempted a two-step semantic search approach:
    1. Use existing code to identify top K similar lines.
    2. Perform line-by-line semantic search against all SPDX license text lines.
  • Performance Issues: This approach proved computationally expensive, taking hours to process even small samples.
  • Fuzzy Matching (Post-Meeting): Explored the fuzzywuzzy library (based on Levenshtein distance) for the second step of license matching, yielding significantly improved results.
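
The fuzzy second step described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it uses Python's standard-library difflib.SequenceMatcher as a stand-in for fuzzywuzzy so that it runs without extra dependencies. The license lines, the normalize, similarity, and match_licenses helpers, and the threshold of 85 are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical, heavily abbreviated reference lines; real SPDX license
# texts contain many more lines per license.
LICENSE_LINES = {
    "MIT": [
        "Permission is hereby granted, free of charge, to any person obtaining a copy",
        'THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND',
    ],
    "Apache-2.0": [
        'Licensed under the Apache License, Version 2.0 (the "License")',
        "you may not use this file except in compliance with the License",
    ],
}


def normalize(line):
    """Lowercase and collapse whitespace so formatting differences matter less."""
    return " ".join(line.lower().split())


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy ratio)."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def match_licenses(candidate_lines, threshold=85):
    """Map each license ID to its best score over all candidate lines.

    candidate_lines would be the top K lines from the first step.
    The threshold is an assumed value, not one from the meeting.
    """
    best = {}
    for cand in candidate_lines:
        for license_id, ref_lines in LICENSE_LINES.items():
            score = max(similarity(cand, ref) for ref in ref_lines)
            if score >= threshold and score > best.get(license_id, 0):
                best[license_id] = score
    return best


candidates = [
    '# Licensed under the Apache License, Version 2.0 (the "License");',
    "def main():",
]
print(match_licenses(candidates))
```

Because only the top K candidate lines are compared against the license lines, the number of comparisons stays small, which is one plausible reason this step is cheaper than embedding-based line-by-line search.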

License Matching Metrics

  • Predicted License Accuracy: 68% (This indicates the percentage of files where at least one license was correctly matched.)
  • Predicted Licenses Covered: 62.5% (This measures the percentage of all licenses within the explored files that were correctly identified and matched.)
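
To make the two definitions concrete, here is a small sketch of how they could be computed. The function name license_metrics and the sample data are illustrative assumptions, not the project's evaluation code or data. It also assumes the denominator for accuracy is all labeled files, and the denominator for coverage is all ground-truth licenses across those files.

```python
def license_metrics(truth, predicted):
    """Compute (accuracy, covered) from per-file sets of license IDs.

    accuracy: share of files where at least one true license was predicted.
    covered:  share of all true licenses (across files) that were predicted.
    """
    files_with_hit = 0
    licenses_total = 0
    licenses_found = 0
    for path, true_ids in truth.items():
        hits = true_ids & predicted.get(path, set())
        if hits:
            files_with_hit += 1
        licenses_total += len(true_ids)
        licenses_found += len(hits)
    accuracy = files_with_hit / len(truth) if truth else 0.0
    covered = licenses_found / licenses_total if licenses_total else 0.0
    return accuracy, covered


# Made-up example data, for illustration only.
truth = {
    "a.py": {"MIT"},
    "b.py": {"Apache-2.0", "MIT"},
    "c.py": {"GPL-3.0-only"},
}
predicted = {
    "a.py": {"MIT"},
    "b.py": {"Apache-2.0"},
    "c.py": {"BSD-3-Clause"},
}
accuracy, covered = license_metrics(truth, predicted)
print(f"accuracy={accuracy:.1%}, covered={covered:.1%}")
```

In this toy example two of three files have at least one correct match, while only two of the four labeled licenses are found, which shows how the coverage figure can sit below the accuracy figure when files carry multiple licenses.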

Key Findings

  • Performance Breakthrough with Fuzzy Matching: Switching to fuzzy matching with fuzzywuzzy significantly enhanced license matching accuracy and coverage compared to the initial hierarchical semantic search approach.

Conclusions and Next Steps

  • Integrate Semantic Search with LLMs: Begin combining semantic search results with LLM analysis to achieve a more comprehensive and accurate license identification solution.
  • Refine Fuzzy Matching: Continue exploring and refining fuzzy matching parameters to further improve license matching performance.
  • Analyze Combined Performance: Establish metrics and analyze the effectiveness of the integrated semantic search and LLM approach.