Meeting 2
(June 6,2024)
Attendees:
Discussion:
Code Implementation Details
Completed core functionality for automated LLM comparison on datasets. Now able to efficiently process and compare LLM outputs across different models.
Key Functions:
-
process_dataset
: This function is the heart of the LLM comparison process. It:- Iterates through the dataset (in this case,
pytorch-main.csv
). - Sends each data point (license text) to the specified LLM API.
- Handles API requests and responses, including retries for robustness.
- Logs progress at specified intervals (
log_every=5
means every 5 rows) - We can easily swap the
model=Models.MISTRAL_7b
in the process_dataset function call with other models from the Models enum to run the same experiment on a different model.
client = LLMClient()
df = pd.read_csv('extras/pytorch-main.csv')
sampled_data_mistral_7b = client.process_dataset(df, df_path='pytorch-main.csv',
model=Models.MISTRAL_7b,
prompt_function=prompt_function,
parser=license_parser,
extra_file_path='extras',
log_every=5,
) - Iterates through the dataset (in this case,
-
prompt_function
: This function creates the prompt template instructing the LLM on how to identify and extract license information.def prompt_function(text):
return f"""
[Task]
Carefully analyze the provided text to determine if it contains any software licenses.
[Guidelines]
1. **License Identification:** If a license is found, clearly state its name and its corresponding SPDX identifier (e.g., MIT License, SPDX-License-Identifier: MIT).
2. **Evidence Extraction:** For each identified license, extract the specific text snippet(s) from the provided text that confirm its presence. Include surrounding context if it helps clarify the license's applicability.
3. **No License Scenario:** If no license is detected in the text, explicitly state "No software license found."
4. **Response Format:** Provide the results in the following format:
* **Licenses = [list of identified licenses]**
* **SPDX-IDs = [list of corresponding SPDX identifiers]**
If no licenses are found, both lists should be empty:
* **Licenses = []**
* **SPDX-IDs = []**
[Text]
{text}
""" -
Log Output Example:
2024-06-06 12:37:58.034 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 0
2024-06-06 12:38:07.307 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 5
2024-06-06 12:38:14.613 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 10
2024-06-06 12:38:42.489 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 15
2024-06-06 12:38:50.538 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 20
2024-06-06 12:39:13.782 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 25
2024-06-06 12:39:23.271 | INFO | helpers.llm_client:process_dataset:173 - Processing index: 30
Results Showcase
scan results_x | llama_response | mistral_response | gemma_response |
---|---|---|---|
0BSD BSD | Licenses = [Zero Clause BSD (0BSD)]\n**SPD... | Licenses = [Zero Clause BSD (0BSD)]\n**SPD... | Licenses = [ 'Zero Clause BSD (0BSD)' ]\n*... |
Apache-2.0 | Licenses = [Apache License, Version 2.0]\n... | Licenses = [Apache License, Version 2.0]\n... | Licenses = [Apache License, Version 2.0]\n... |
Apache-2.0 | Licenses = [Apache-2.0]\n**SPDX-IDs = [Apa... | Licenses = [Apache-2.0]\n**SPDX-IDs = [Apa... | Licenses = [Apache-2.0]\n**SPDX-IDs = [Apa... |
Apache-2.0 BSD-3-Clause BSL-1.0 | **Licenses = [BSD License, Apache License, Boo... | **Licenses = [BSD License, Apache License, Boo... | Licenses:\n\n- Licenses = [Early BSD Licen... |
Apache-possibility | Licenses = [Apache License]\n**SPDX-IDs = ... | Licenses = [Apache License]\n**SPDX-IDs = ... | Licenses = [Apache License]\n**SPDX-IDs = ... |
BSD | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n\n**SPDX-IDs = [... |
BSD | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n\n**SPDX-IDs = [... |
BSD | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n**SPDX-IDs = [BS... |
BSD | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\n**Expl... |
BSD | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n\n**SPDX-IDs = [... |
BSD-3-Clause | Licenses = [Apache License 2.0]\n**SPDX-ID... | Licenses = [Apache License 2.0]\n**SPDX-ID... | Licenses = [MIT License]\n\n**SPDX-IDs = [... |
BSD-3-Clause | Licenses = [Apache License 2.0]\n**SPDX-ID... | Licenses = [Apache License 2.0]\n**SPDX-ID... | Licenses = []\nSPDX-IDs = []\n\n**Conc... |
BSD-3-Clause | Licenses = [Apache License 2.0]\n**SPDX-ID... | Licenses = [Apache License 2.0]\n**SPDX-ID... | Licenses = [MIT License]\n**SPDX-IDs = [SP... |
BSD-possibility | Licenses = [MIT License]\n**SPDX-IDs = [MI... | Licenses = [MIT License]\n**SPDX-IDs = [MI... | Licenses:\n\n- Licenses: [MIT License]... |
BSD-style | Licenses = [BSD License]\n**SPDX-IDs = [SP... | Licenses = [BSD License]\n**SPDX-IDs = [SP... | Licenses = [BSD License]\n\n**SPDX-IDs = [... |
BSD-style | Licenses = [BSD-style license]\n**SPDX-IDs... | Licenses = [BSD-style license]\n**SPDX-IDs... | Licenses = [BSD-style license]\n**SPDX-IDs... |
BSD-style | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD License]\n**SPDX-IDs = [BS... | Licenses = [BSD-style license]\n\n**SPDX-I... |
BSD-style | Licenses = [BSD-style license]\n**SPDX-IDs... | Licenses = [BSD-style license]\n**SPDX-IDs... | Licenses = [BSD-style license]\n**SPDX-IDs... |
BSD-style | Licenses = [BSD-style license]\n**SPDX-IDs... | Licenses = [BSD-style license]\n**SPDX-IDs... | Licenses = [BSD License]\n**SPDX-IDs = [SP... |
MIT | Licenses = [MIT License]\n**SPDX-IDs = [SP... | Licenses = [MIT License]\n**SPDX-IDs = [SP... | Licenses:\n\n```\n- MIT License\n- SPDX-Li... |
MIT | Licenses = [MIT License]\n**SPDX-IDs = [MI... | Licenses = [MIT License]\n**SPDX-IDs = [MI... | Licenses = [MIT License]\n**SPDX-IDs = [SP... |
MIT | Licenses = [MIT License]\n**SPDX-IDs = [MI... | Licenses = [MIT License]\n**SPDX-IDs = [MI... | Licenses:\n\n- MIT License\n\n**SPDX-IDs:*... |
MIT | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\n**Note... |
NCSA | **Licenses = [University of Illinois Open Sour... | **Licenses = [University of Illinois Open Sour... | **Licenses = [University of Illinois Open Sour... |
No_license_found | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\n**Expl... |
No_license_found | Licenses = ["MIT License"]\n**SPDX-IDs = [... | Licenses = ["MIT License"]\n**SPDX-IDs = [... | Licenses = []\nSPDX-IDs = []\n\n**No s... |
No_license_found | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\nNo sof... |
No_license_found | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\nNo sof... | Licenses = []\nSPDX-IDs = []\n\n**No s... |
Key Findings
- Decent Accuracy but Challenges with Complexity: LLMs demonstrated reasonably good accuracy in identifying single, well-known licenses. However, they encountered difficulties when dealing with files containing multiple licenses or licenses with variations.
- Consistent LLM Errors: We observed similar mistakes across different LLMs.
Next Steps
- Anupam suggested utilizing semantic search to pinpoint license-relevant code snippets and match them to potential licenses based on similarity scores.