Meeting 3
(June 13, 2024)
Attendees:
Discussion:
Semantic Search Exploration
Began implementing semantic search to improve license identification accuracy. Explored various techniques:
- Sentence Transformers: Utilized the high-performing all-mpnet-base-v2 model.
- Bag-of-Words (BoW) & TF-IDF: Examined simpler models for comparison.
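To make the comparison baselines concrete, here is a minimal standard-library sketch of BoW and TF-IDF scoring with cosine similarity. It is not the project's implementation: the whitespace tokenizer, the smoothed IDF formula, and the toy license strings are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bow_vector(text):
    """Bag-of-words: raw term counts."""
    return Counter(tokenize(text))


def tfidf_vectors(docs):
    """TF-IDF with a smoothed IDF: idf(t) = ln((1 + N) / (1 + df(t))) + 1."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: c * idf[t] for t, c in Counter(tokens).items()} for tokens in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy strings for illustration only; they are not real license texts.
licenses = [
    "permission to use copy modify and distribute this software",
    "redistribution and use in source and binary forms",
]
query = "permission to use and distribute this software is granted"

bow_scores = [cosine(bow_vector(query), bow_vector(lic)) for lic in licenses]
vectors = tfidf_vectors(licenses + [query])
tfidf_scores = [cosine(vectors[-1], v) for v in vectors[:-1]]
```

Both baselines only reward exact token overlap, which is one plausible reason a sentence-embedding model can do better on paraphrased license notices.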
License Text Dataset
Incorporated license names and SPDX IDs from the SPDX GitHub repository into the project. The end result is a CSV file containing license names, IDs, and license texts, available for use in semantic search.
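As a rough sketch of what such a dataset file could look like, the snippet below writes and reads license records with Python's csv module. The column names (license_id, license_name, license_text) and the placeholder texts are assumptions; the notes do not specify the actual schema.

```python
import csv
import io

# Assumed column names; the project's real schema may differ.
FIELDS = ["license_id", "license_name", "license_text"]

# Placeholder texts stand in for the full license texts from the SPDX repository.
records = [
    {"license_id": "0BSD", "license_name": "BSD Zero Clause License",
     "license_text": "placeholder line one\nplaceholder line two"},
    {"license_id": "MIT", "license_name": "MIT License",
     "license_text": "placeholder text, with a comma"},
]


def write_license_csv(rows, handle):
    """Write license records as CSV; multi-line texts are quoted automatically."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def read_license_csv(handle):
    """Read license records back as a list of dicts."""
    return list(csv.DictReader(handle))


# Round-trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
write_license_csv(records, buffer)
buffer.seek(0)
loaded = read_license_csv(buffer)
```

When writing to a real file, opening it with newline="" keeps embedded line breaks in license texts intact.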
Semantic Search Approaches
- file-embedding: Embedded all license texts together, but this proved too coarse for granular analysis.
- license-embedding: Embedded each license text individually. This showed promise in identifying license-relevant lines but struggled with accurate license matching.
- line-embedding: Embedded each line of each license separately, offering potential for finer-grained matching but at a higher computational cost.
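The three approaches differ mainly in what counts as one unit to embed. The sketch below shows one possible reading of that split; build_units is a hypothetical helper, not a function from the project, and the toy texts are placeholders.

```python
def build_units(licenses, approach):
    """Return (label, text) pairs to embed for the given approach.

    licenses: dict mapping a license name to its full text.
    """
    if approach == "file-embedding":
        # One unit for the whole corpus: cheapest, but no per-license signal.
        return [("all licenses", "\n".join(licenses.values()))]
    if approach == "license-embedding":
        # One unit per license.
        return list(licenses.items())
    if approach == "line-embedding":
        # One unit per non-empty line, labeled with the license it came from.
        return [
            (name, line.strip())
            for name, text in licenses.items()
            for line in text.splitlines()
            if line.strip()
        ]
    raise ValueError(f"unknown embedding approach: {approach!r}")


# Toy texts for illustration only.
toy_licenses = {"License A": "line one\n\nline two", "License B": "only line"}
line_units = build_units(toy_licenses, "line-embedding")
```

The number of units, and therefore the number of embeddings to compute and compare against, grows from one, to one per license, to one per line, which matches the higher computational cost noted for line-embedding.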
Code Example:
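import pandas as pd

# create_license_dataset, client, and get_top_similar_license_lines are
# project helpers defined elsewhere in the repository.

# Build the license dataset from the SPDX license details.
create_license_dataset('extras/license_information/details')

# Load the sampled files and pass them through the client.
df = client.temp_function(pd.read_csv('extras/lamma3-8b-pytorch-main-sampled.csv'))

# Find the lines of the first file's comments most similar to the license data.
file_idx = 0
results = get_top_similar_license_lines(
    df.loc[file_idx, 'file_comments'],
    'extras/license_information/license_dataset.txt',
    # model='bow',
    top_k=5,
    embedding_approach='license-embedding'
)
results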
Output Example:
[(40,
0.6339692,
' File distributed under the Zero Clause BSD (0BSD) license. Copyright Contributors to the pythoncapi_compat project.',
'License Name: CNRI Python License'),
(0,
0.5076868,
'Header file providing new C API functions to old Python versions.',
'N/A'),
(1,
0.5062,
'SPDX-License-Identifier: 0BSD',
'License Name: Xdebug License v 1.03'),
(41,
0.4910386,
' Homepage: https://github.com/python/pythoncapi_compat',
'License Name: CNRI Python License'),
(46,
0.47866815,
' bpo-43753 added Py_Is(), Py_IsNone(), Py_IsTrue() and Py_IsFalse() to Python 3.10.0b1.',
'License Name: CNRI Python License')]
The output is a list of tuples where each tuple contains:
- The index of the line in question
- The similarity score that led to it being chosen as a top similar line
- The actual text of the line
- The name of the license that the line was matched to
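For downstream processing, the four-element tuples can be wrapped in a named structure. This is a small sketch, not part of the project: LineMatch and parse_results are hypothetical names, and treating 'N/A' as "no license matched" is an assumption based on the sample output.

```python
from typing import NamedTuple, Optional

PREFIX = "License Name: "


class LineMatch(NamedTuple):
    line_index: int     # index of the line in the file's comments
    score: float        # similarity score
    line_text: str      # actual text of the line
    license_label: str  # raw label, e.g. 'License Name: ...' or 'N/A'

    @property
    def license_name(self) -> Optional[str]:
        """Label without its prefix, or None when nothing was matched (assumed meaning of 'N/A')."""
        if self.license_label == "N/A":
            return None
        if self.license_label.startswith(PREFIX):
            return self.license_label[len(PREFIX):]
        return self.license_label


def parse_results(raw_results):
    """Convert raw (index, score, text, label) tuples into LineMatch records."""
    return [LineMatch(*item) for item in raw_results]


# Two entries copied from the output example above.
sample = [
    (0, 0.5076868, 'Header file providing new C API functions to old Python versions.', 'N/A'),
    (1, 0.5062, 'SPDX-License-Identifier: 0BSD', 'License Name: Xdebug License v 1.03'),
]
matches = parse_results(sample)
```

Named fields make mismatches easier to spot, such as the SPDX-License-Identifier: 0BSD line being labeled with the Xdebug license.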
Key Findings
- Semantic Search Progress: Successfully implemented semantic search to identify potentially license-relevant lines within code files.
- License Matching Challenges: While line identification improved, accurately matching lines to the correct license remains a challenge. The current approach often mismatches lines to unrelated licenses.
- Metrics Needed: We currently lack quantitative metrics (e.g., accuracy) to assess the effectiveness of the different semantic search approaches and embedding techniques.
Additional Notes
- TF-IDF performed very poorly at both finding license-relevant lines and matching them to the correct license.
- BoW was less accurate than Sentence Transformers at finding license-relevant lines, but it performed well for its size.
- The get_top_similar_license_lines function automatically saves the embeddings for each approach and loads them if they already exist on disk.
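The save-and-reload behavior can be pictured as a simple load-or-compute cache. The sketch below is an assumption about how such caching might work, not the actual code inside get_top_similar_license_lines; the helper name, the one-file-per-approach layout, and the use of pickle are all illustrative.

```python
import pickle
from pathlib import Path


def load_or_compute_embeddings(cache_dir, approach, compute_fn):
    """Return cached embeddings for an approach, computing and saving them on a miss.

    cache_dir:  directory holding one cache file per approach (assumed layout)
    approach:   e.g. 'license-embedding'
    compute_fn: zero-argument callable that produces the embeddings
    """
    cache_path = Path(cache_dir) / f"{approach}.pkl"
    if cache_path.exists():
        # Cache hit: skip the expensive embedding step.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)

    embeddings = compute_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as handle:
        pickle.dump(embeddings, handle)
    return embeddings
```

A cache like this goes stale if the license dataset or the embedding model changes, so including the model name in the file name may be worth considering.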
Conclusions and Next Steps
- Refine License Matching: Continue exploring and refining semantic search techniques to improve license matching accuracy.
- Experiment with Models: Investigate alternative embedding models or fine-tune existing models to better capture license-specific semantics. Because model size is not a constraint, there is no need to explore suboptimal strategies such as BoW and TF-IDF any further.
- Establish Evaluation Metrics: Develop metrics to quantitatively measure the performance of license identification and matching, enabling objective comparison of different approaches.
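One candidate metric is top-k accuracy over a hand-labeled set of files. The sketch below assumes each file has a single correct SPDX ID and a ranked list of predicted IDs; the function name and the toy data are illustrative, and no such labeled set exists in the project yet.

```python
def top_k_accuracy(predictions, gold, k=1):
    """Fraction of files whose correct license appears in the top k predictions.

    predictions: one ranked list of predicted license IDs per file (best first)
    gold:        the correct license ID for each file
    """
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for ranked, truth in zip(predictions, gold) if truth in ranked[:k])
    return hits / len(gold)


# Toy data for illustration only; these are not project results.
toy_predictions = [
    ["0BSD", "MIT"],
    ["MIT", "Apache-2.0"],
    ["GPL-3.0-only", "MIT"],
]
toy_gold = ["0BSD", "Apache-2.0", "BSD-3-Clause"]

top1 = top_k_accuracy(toy_predictions, toy_gold, k=1)
top2 = top_k_accuracy(toy_predictions, toy_gold, k=2)
```

Reporting the same metric for each embedding approach and model would allow the objective comparison the notes call for.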