Skip to main content

Meeting 2

(June 6,2024)

Attendees:

Discussion:

Code Implementation Details

Completed core functionality for automated LLM comparison on datasets. Now able to efficiently process and compare LLM outputs across different models.

Key Functions:

  • process_dataset: This function is the heart of the LLM comparison process. It:

    1. Iterates through the dataset (in this case, pytorch-main.csv).
    2. Sends each data point (license text) to the specified LLM API.
    3. Handles API requests and responses, including retries for robustness.
    4. Logs progress at specified intervals (log_every=5 means every 5 rows)
    5. We can easily swap the model=Models.MISTRAL_7b in the process_dataset function call with other models from the Models enum to run the same experiment on a different model.
    client = LLMClient()
    df = pd.read_csv('extras/pytorch-main.csv')
    sampled_data_mistral_7b = client.process_dataset(df, df_path='pytorch-main.csv',
    model=Models.MISTRAL_7b,
    prompt_function=prompt_function,
    parser=license_parser,
    extra_file_path='extras',
    log_every=5,
    )

  • prompt_function: This function creates the prompt template instructing the LLM on how to identify and extract license information.

    def prompt_function(text):
    return f"""
    [Task]
    Carefully analyze the provided text to determine if it contains any software licenses.

    [Guidelines]
    1. **License Identification:** If a license is found, clearly state its name and its corresponding SPDX identifier (e.g., MIT License, SPDX-License-Identifier: MIT).
    2. **Evidence Extraction:** For each identified license, extract the specific text snippet(s) from the provided text that confirm its presence. Include surrounding context if it helps clarify the license's applicability.
    3. **No License Scenario:** If no license is detected in the text, explicitly state "No software license found."
    4. **Response Format:** Provide the results in the following format:
    * **Licenses = [list of identified licenses]**
    * **SPDX-IDs = [list of corresponding SPDX identifiers]**

    If no licenses are found, both lists should be empty:

    * **Licenses = []**
    * **SPDX-IDs = []**

    [Text]
    {text}
    """
  • Log Output Example:

    2024-06-06 12:37:58.034 | INFO     | helpers.llm_client:process_dataset:173 - Processing index: 0
    2024-06-06 12:38:07.307 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 5
    2024-06-06 12:38:14.613 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 10
    2024-06-06 12:38:42.489 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 15
    2024-06-06 12:38:50.538 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 20
    2024-06-06 12:39:13.782 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 25
    2024-06-06 12:39:23.271 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 30

Results Showcase

scan results_xllama_responsemistral_responsegemma_response
0BSD BSDLicenses = [Zero Clause BSD (0BSD)]\n**SPD...Licenses = [Zero Clause BSD (0BSD)]\n**SPD...Licenses = [ 'Zero Clause BSD (0BSD)' ]\n*...
Apache-2.0Licenses = [Apache License, Version 2.0]\n...Licenses = [Apache License, Version 2.0]\n...Licenses = [Apache License, Version 2.0]\n...
Apache-2.0Licenses = [Apache-2.0]\n**SPDX-IDs = [Apa...Licenses = [Apache-2.0]\n**SPDX-IDs = [Apa...Licenses = [Apache-2.0]\n**SPDX-IDs = [Apa...
Apache-2.0 BSD-3-Clause BSL-1.0**Licenses = [BSD License, Apache License, Boo...**Licenses = [BSD License, Apache License, Boo...Licenses:\n\n- Licenses = [Early BSD Licen...
Apache-possibilityLicenses = [Apache License]\n**SPDX-IDs = ...Licenses = [Apache License]\n**SPDX-IDs = ...Licenses = [Apache License]\n**SPDX-IDs = ...
BSDLicenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n\n**SPDX-IDs = [...
BSDLicenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n\n**SPDX-IDs = [...
BSDLicenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n**SPDX-IDs = [BS...
BSDLicenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\n**Expl...
BSDLicenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n\n**SPDX-IDs = [...
BSD-3-ClauseLicenses = [Apache License 2.0]\n**SPDX-ID...Licenses = [Apache License 2.0]\n**SPDX-ID...Licenses = [MIT License]\n\n**SPDX-IDs = [...
BSD-3-ClauseLicenses = [Apache License 2.0]\n**SPDX-ID...Licenses = [Apache License 2.0]\n**SPDX-ID...Licenses = []\nSPDX-IDs = []\n\n**Conc...
BSD-3-ClauseLicenses = [Apache License 2.0]\n**SPDX-ID...Licenses = [Apache License 2.0]\n**SPDX-ID...Licenses = [MIT License]\n**SPDX-IDs = [SP...
BSD-possibilityLicenses = [MIT License]\n**SPDX-IDs = [MI...Licenses = [MIT License]\n**SPDX-IDs = [MI...Licenses:\n\n- Licenses: [MIT License]...
BSD-styleLicenses = [BSD License]\n**SPDX-IDs = [SP...Licenses = [BSD License]\n**SPDX-IDs = [SP...Licenses = [BSD License]\n\n**SPDX-IDs = [...
BSD-styleLicenses = [BSD-style license]\n**SPDX-IDs...Licenses = [BSD-style license]\n**SPDX-IDs...Licenses = [BSD-style license]\n**SPDX-IDs...
BSD-styleLicenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD License]\n**SPDX-IDs = [BS...Licenses = [BSD-style license]\n\n**SPDX-I...
BSD-styleLicenses = [BSD-style license]\n**SPDX-IDs...Licenses = [BSD-style license]\n**SPDX-IDs...Licenses = [BSD-style license]\n**SPDX-IDs...
BSD-styleLicenses = [BSD-style license]\n**SPDX-IDs...Licenses = [BSD-style license]\n**SPDX-IDs...Licenses = [BSD License]\n**SPDX-IDs = [SP...
MITLicenses = [MIT License]\n**SPDX-IDs = [SP...Licenses = [MIT License]\n**SPDX-IDs = [SP...Licenses:\n\n```\n- MIT License\n- SPDX-Li...
MITLicenses = [MIT License]\n**SPDX-IDs = [MI...Licenses = [MIT License]\n**SPDX-IDs = [MI...Licenses = [MIT License]\n**SPDX-IDs = [SP...
MITLicenses = [MIT License]\n**SPDX-IDs = [MI...Licenses = [MIT License]\n**SPDX-IDs = [MI...Licenses:\n\n- MIT License\n\n**SPDX-IDs:*...
MITLicenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\n**Note...
NCSA**Licenses = [University of Illinois Open Sour...**Licenses = [University of Illinois Open Sour...**Licenses = [University of Illinois Open Sour...
No_license_foundLicenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\n**Expl...
No_license_foundLicenses = ["MIT License"]\n**SPDX-IDs = [...Licenses = ["MIT License"]\n**SPDX-IDs = [...Licenses = []\nSPDX-IDs = []\n\n**No s...
No_license_foundLicenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\nNo sof...
No_license_foundLicenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\nNo sof...Licenses = []\nSPDX-IDs = []\n\n**No s...

Key Findings

  • Decent Accuracy but Challenges with Complexity: LLMs demonstrated reasonably good accuracy in identifying single, well-known licenses. However, they encountered difficulties when dealing with files containing multiple licenses or licenses with variations.
  • Consistent LLM Errors: We observed similar mistakes across different LLMs.

Next Steps

  • Anupam suggested utilizing semantic search to pinpoint license-relevant code snippets and match them to potential licenses based on similarity scores.