Skip to main content

Meeting 3

(June 13,2024)

Attendees:

Discussion:

Semantic Search Exploration

Began implementing semantic search to improve license identification accuracy. Explored various techniques:

  • Sentence Transformers: Utilized the high-performing all-mpnet-base-v2 model.
  • Bag-of-Words (BoW) & TF-IDF: Examined a simpler model for comparison

License Text Dataset

Incorporated license names and SPDX IDs from the SPDX GitHub repository into the project. The end result is a csv file with license names, ids, and license text available for use in semantic search.

Semantic Search Approaches

  • file-embedding: Embedded all license texts together, but this proved too coarse for granular analysis.
  • license-embedding: Embedded each license text individually. This showed promise in identifying license-relevant lines but struggled with accurate license matching.
  • line-embedding: Embedded each line of each license separately, offering potential for finer-grained matching but at a higher computational cost.

Code Example:

create_license_dataset('extras/license_information/details')
df = client.temp_function(pd.read_csv('extras/lamma3-8b-pytorch-main-sampled.csv'))
file_idx = 0
results = get_top_similar_license_lines(\
df.loc[file_idx, 'file_comments'],
'extras/license_information/license_dataset.txt',
# model='bow',
top_k=5,
embedding_approach='license-embedding'
)
results
)

Output Example:

[(40,
0.6339692,
' File distributed under the Zero Clause BSD (0BSD) license. Copyright Contributors to the pythoncapi_compat project.',
'License Name: CNRI Python License'),
(0,
0.5076868,
'Header file providing new C API functions to old Python versions.',
'N/A'),
(1,
0.5062,
'SPDX-License-Identifier: 0BSD',
'License Name: Xdebug License v 1.03'),
(41,
0.4910386,
' Homepage: https://github.com/python/pythoncapi_compat',
'License Name: CNRI Python License'),
(46,
0.47866815,
' bpo-43753 added Py_Is(), Py_IsNone(), Py_IsTrue() and Py_IsFalse() to Python 3.10.0b1.',
'License Name: CNRI Python License')]

The output is a list of tuples where each tuple contains:

  1. The index of the line in question
  2. The similarity score that led to it being chosen as a top similar line
  3. The actual text of the line
  4. The name of the license that the line was matched to

Key Findings

  • Semantic Search Progress: Successfully implemented semantic search to identify potentially license-relevant lines within code files.
  • License Matching Challenges: While line identification improved, accurately matching lines to the correct license remains a challenge. The current approach often mismatches lines to unrelated licenses.
  • Metrics Needed: Currently lack quantitative metrics (e.g., accuracy) to assess the effectiveness of different semantic search approaches and embedding techniques.

Additional Notes

  • The TF-IDF proved very poor at both finding license relevant lines and matching them to the correct license.
  • The BoW was not as accurate as Sentence Transformers at finding the license relevant lines, but it was still very good for its size.
  • The get_top_similar_license_lines function automatically saved the embeddings for each approach and loads them if they already existed on disk.

Conclusions and Next Steps

  • Refine License Matching: Continue exploring and refining semantic search techniques to improve license matching accuracy.
  • Experiment with Models: Investigate alternative embedding models or fine-tuning existing models to better capture license-specific semantics. As model size is not a hindrance, no need to explore suboptimal strategies such as BoW and TF-IDF.
  • Establish Evaluation Metrics: Develop metrics to quantitatively measure the performance of license identification and matching, enabling objective comparison of different approaches.