Meeting 3

(June 13,2024)

Attendees:

Discussion:

Semantic Search Exploration

Began implementing semantic search to improve license identification accuracy. Explored various techniques:

Sentence Transformers: Utilized the high-performing all-mpnet-base-v2 model.
Bag-of-Words (BoW) & TF-IDF: Examined a simpler model for comparison

License Text Dataset

Incorporated license names and SPDX IDs from the SPDX GitHub repository into the project. The end result is a csv file with license names, ids, and license text available for use in semantic search.

Semantic Search Approaches

file-embedding: Embedded all license texts together, but this proved too coarse for granular analysis.
license-embedding: Embedded each license text individually. This showed promise in identifying license-relevant lines but struggled with accurate license matching.
line-embedding: Embedded each line of each license separately, offering potential for finer-grained matching but at a higher computational cost.

Code Example:

create_license_dataset('extras/license_information/details')
df = client.temp_function(pd.read_csv('extras/lamma3-8b-pytorch-main-sampled.csv'))
file_idx = 0
results = get_top_similar_license_lines(\
                df.loc[file_idx, 'file_comments'],
                'extras/license_information/license_dataset.txt',
                # model='bow',
                top_k=5,
                embedding_approach='license-embedding'
            )
results
)

Output Example:

[(40,
  0.6339692,
  ' File distributed under the Zero Clause BSD (0BSD) license. Copyright Contributors to the pythoncapi_compat project.',
  'License Name: CNRI Python License'),
 (0,
  0.5076868,
  'Header file providing new C API functions to old Python versions.',
  'N/A'),
 (1,
  0.5062,
  'SPDX-License-Identifier: 0BSD',
  'License Name: Xdebug License v 1.03'),
 (41,
  0.4910386,
  ' Homepage: https://github.com/python/pythoncapi_compat',
  'License Name: CNRI Python License'),
 (46,
  0.47866815,
  ' bpo-43753 added Py_Is(), Py_IsNone(), Py_IsTrue() and Py_IsFalse() to Python 3.10.0b1.',
  'License Name: CNRI Python License')]

The output is a list of tuples where each tuple contains:

The index of the line in question
The similarity score that led to it being chosen as a top similar line
The actual text of the line
The name of the license that the line was matched to

Key Findings

Semantic Search Progress: Successfully implemented semantic search to identify potentially license-relevant lines within code files.
License Matching Challenges: While line identification improved, accurately matching lines to the correct license remains a challenge. The current approach often mismatches lines to unrelated licenses.
Metrics Needed: Currently lack quantitative metrics (e.g., accuracy) to assess the effectiveness of different semantic search approaches and embedding techniques.

Additional Notes

The TF-IDF proved very poor at both finding license relevant lines and matching them to the correct license.
The BoW was not as accurate as Sentence Transformers at finding the license relevant lines, but it was still very good for its size.
The get_top_similar_license_lines function automatically saved the embeddings for each approach and loads them if they already existed on disk.

Conclusions and Next Steps

Refine License Matching: Continue exploring and refining semantic search techniques to improve license matching accuracy.
Experiment with Models: Investigate alternative embedding models or fine-tuning existing models to better capture license-specific semantics. As model size is not a hindrance, no need to explore suboptimal strategies such as BoW and TF-IDF.
Establish Evaluation Metrics: Develop metrics to quantitatively measure the performance of license identification and matching, enabling objective comparison of different approaches.

Attendees:​

Discussion:​

Semantic Search Exploration​

License Text Dataset​

Semantic Search Approaches​

Code Example:​

Output Example:​

Key Findings​

Additional Notes​

Conclusions and Next Steps​