Meeting 7
(July 11,2024)
Attendees:
Discussion:
Improved Semantic Search Algorithm
-
Presentation of Improved Algorithm
-
Approach: Implemented chunking by dividing code comments and paragraphs into multiline strings, starting new chunks at empty lines.
-
Challenges: Original method of extracting comments resulted in long, merged lines, necessitating a rework to handle multiline comments effectively.
-
-
Chunking Example
/*
This is a multiline comment
that should be considered as one
single chunk for better accuracy.
This is a separate chunk.
*/
// This is a single line comment
// This is still a single line comment
// This is also still a single line comment and not a chunk with the previous comment
License Matching Performance
-
Initial Results
-
Accuracy: The chunking approach significantly improved license matching accuracy but struggled with extremely similar licenses (e.g., 0BSD and ISC).
-
Problem: Minor differences between very similar licenses led to occasional misidentifications.
-
-
License Text Examples
0BSD License Text:
Copyright (C) YEAR by AUTHOR EMAIL
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted.
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
ISC License Text:
Copyright (c) 2004-2010 by Internet Systems Consortium, Inc. ("ISC")
Copyright (c) 1995-2003 by Internet Software Consortium
Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.
THE SOFTWARE IS PROVIDED "AS IS" AND ISC DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL ISC BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
- Enhancements
-
Chunk Merging: Attempted merging potential chunks to improve license identification by providing more comprehensive text for comparison.
-
Combined Line and Chunk Matching: Implemented both line and chunk matching to enhance accuracy, though it increased processing time due to the greater number of combinations.
-
Results
-
Metrics: Predicted License Accuracy: 93.33% Predicted Licenses Covered: 84.0%
-
Notes: Approximately 5% of the remaining 7.6% were files referring to a different license file, not containing the license text directly.
-
-
Matching Output Example:
[(100.0,
' THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.',
'BSD Zero Clause License',
'0BSD',
'THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.',
[('BSD Zero Clause License', 100.0),
('BSD Zero Clause License', 100.0),
('ISC License', 97.0),
('ISC License', 97.0),
('Mackerras 3-Clause License', 93.0)]),
(100.0,
' Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.',
'curl License',
'curl',
'Permission to use, copy, modify, and distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.',
[('curl License', 100.0),
('curl License', 100.0),
('pkgconf License', 99.0),
('ISC License', 99.0),
('ISC License', 99.0)]),
(99.0,
' Permission to use, copy, modify, and distribute this software for any',
'OAR License',
'OAR',
'Permission to use, copy, modify, and distribute this software for any',
[('Historical Permission Notice and Disclaimer - Fenneberg-Livingston variant',
99.0),
('David M. Gay dtoa License', 99.0),
('OAR License', 99.0),
('pkgconf License', 97.0),
('SGI OpenGL License', 96.0)]),
(99.0,
' copyright notice and this permission notice appear in all copies.',
'pkgconf License',
'pkgconf',
'copyright notice and this permission notice appear in all copies.',
[('pkgconf License', 99.0),
('Historical Permission Notice and Disclaimer - documentation variant',
99.0),
('CMU Mach - no notices-in-documentation variant', 96.0),
('ISC Veillard variant', 87.0),
('Historical Permission Notice and Disclaimer - Pbmplus variant', 86.0)]),
.... (many more)
Key Findings
-
Enhanced Accuracy: The combination of chunking and line matching improved overall accuracy and coverage.
-
Increased Processing Time: The dual approach led to longer search times due to the increased number of combinations.
Conclusions and Next Steps
-
Evaluate on Nomos Test Dataset:
-
Dataset Links:
-
Objective: Assess the performance of the semantic search and LLM integration on the provided test datasets.
-
-
Limit Line/Chunk Matching: To address the issue of excessive matches, limit the line/chunk matching to optimize search time and accuracy.
-
Additional Tasks
-
Acknowledgement from Notice Files: Begin work on identifying and acknowledging licenses from notice files.
-
Obligations: Convert identified licenses to obligations, detailing the requirements and conditions of each license.
-