Skip to main content

Meeting 6

(July 4,2024)

Attendees:

Discussion:

Integration of Semantic Search with LLMs

  • Initial Attempt

    1. Prompt: The initial prompt focused on providing text and metadata to the LLM for license identification.
    2. Issues: The LLM attempted to match all provided lines to a license, even when many lines were clearly irrelevant to licensing.
  • Initial Prompt

[Task]
You are provided with text extracted from a file, along with potential license matches identified by a semantic search tool.
Your task is to carefully analyze the provided text and metadata to determine the actual software license(s) present in the original file.
Out of the 10 provided lines, not all matches will be correct or relevant, so focus on the most relevant lines in your analysis.

[Metadata Explanation]
The metadata provided for each line is a tuple containing four elements:
* **Line:** The actual line of text extracted from the file.
* **Potential License Match:** The name of a license that the semantic search tool believes the line might belong to.
* **License ID:** The SPDX identifier of the potential license match.
* **Matched License Text:** The specific text within the potential license that the line was matched to.

[Guidelines]
1. **License Identification:** If a license is found, clearly state its name and its corresponding SPDX identifier (e.g., MIT License, SPDX-License-Identifier: MIT). If multiple licenses are found, list them all.
2. **Evidence and Reasoning (Focus on Relevance and Clarity):**
* For each identified license, extract the specific text snippet(s) from the provided text that confirm its presence. Include surrounding context if it helps clarify the license's applicability. Prioritize the most relevant lines of text.
* Explain why the identified license is the most likely match, taking into account the potential license matches and the matched license text provided in the metadata.
* Only consider matches that are clear and obviously correct. The semantic search tool will always attempt to match lines to licenses, but these matches are not always accurate.
3. **Override Semantic Search:** If the semantic search tool's suggested match seems incorrect, feel free to disregard it and rely on your own knowledge and analysis to determine the correct license. Provide a clear explanation of why you chose a different license.
4. **Exclude Irrelevant Information:**
* Disregard copyright notices and statements and lines of code as they do not indicate the software license.
* Focus only on text that is found in licenses or clearly identifies licenses.
5. **No License Scenario:** If no license is detected in the text, explicitly state "No software license found."
6. **Ambiguity:** If the license cannot be confidently determined due to ambiguity or conflicting information, clearly state this and provide an explanation.
7. **Response Format:** Provide the results in the following format:
* **Licenses = [list of identified licenses]**
* **SPDX-IDs = [list of corresponding SPDX identifiers]**

If no licenses are found, both lists should be empty:
* **Licenses = []**
* **SPDX-IDs = []**

[Text and Metadata]
  • Outcome: The LLM tried too hard to relate irrelevant lines to licenses, resulting in many false positives.

Revised Approach

  • Second Attempt

    • Prompt: Changed the task to identify relevant lines before determining licenses.
    • Issues: Reduced the number of irrelevant lines identified, but the problem of false positives persisted.
  • Second Prompt

[Task]
From the following tuples, select those that are relevant to software licensing and ignore the rest.
A relevant tuple is a tuple that contains a line of text that is relevant and can be used to identify a license.

[Tuples]
Each tuple consists of three elements:
1. **Line:** The actual line of text extracted from the file. This is the element you need to evaluate for relevance to software licensing.
2. **Potential License Match:** The name of a license that the semantic search tool suggests the line might belong to (provided for reference).
3. **License ID:** The SPDX identifier of the potential license match (provided for reference).

[Guidelines]
1. **Select License-Specific Lines:** Choose only lines that:
* Explicitly mention license terms
* Directly quote from known license texts
* Include specific license references or titles.

2. **Ignore Irrelevant Lines:**
* Disregard lines that do not explicitly mention license terms.
* Ignore copyright notices, code snippets, comments, and general documentation.
* Ignore code documentation lines that seem to be documenting code or just general instructions or information.
* Do not select lines that are general descriptions, code, or comments unrelated to license terms.

3. **No License:** If no license is found, state "No software license found."
4. **Ambiguity:** If uncertain, explain the ambiguity.
5. **Response Format:**
* **Relevant Lines = [list of relevant lines]**
* **Licenses = [list of identified licenses from relevant lines]**
* **SPDX-IDs = [list of corresponding SPDX identifiers from relevant lines]**

[Text and Metadata]
  • Outcome: The LLM still included irrelevant lines in its output, indicating a persistent issue with following the prompt guidelines.

Key Findings

  • Performance Issues: Despite detailed prompts, the LLMs struggled to correctly identify relevant lines and accurately match licenses.

  • RAG Exploration: Suggested by Kaushl, Retrieval-Augmented Generation (RAG) may provide a more robust solution to improve accuracy in license identification.

Conclusions and Next Steps

  • Improve Semantic Search: Continue refining the semantic search approach for better initial filtering of potential license lines.

  • RAG Implementation: Investigate and implement RAG to enhance the LLM's ability to accurately identify relevant lines and match licenses.

  • Further Prompt Engineering: Experiment with additional prompt variations to improve LLM performance.

  • Performance Metrics: Establish metrics to evaluate the effectiveness of the integrated approach and analyze the results for further improvements.