Meeting 13
(August 22,2024)
Attendees:
Discussion:
Atarashi Integration and PR
-
Code Ready for PR:
-
Progress: Presented the latest results to the mentors. The Atarashi integration progressed well, and I created a PR after awaiting feedback from Kaushlendra.
-
Code Execution: The code allows for semantic search on a file by using the following command:
atarashi -a SemanticSearch /path/to/file.c
- Example output:
{
"file": "/home/jimbo/Desktop/GSoC24/atarashi/atarashi/license/licenseDownloader.py",
"results": [
{
"description": "",
"shortname": "GPL-2.0-only",
"sim_score": 91.0,
"sim_type": "SemanticSearch-LVD"
}
]
}
- Example output:
-
Example Comments in the File:
Copyright 2018 Aman Jain (amanjain5221@gmail.com)
SPDX-License-Identifier: GPL-2.0
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
version 2 as published by the Free Software Foundation.
-
LLM License Text Identification
-
Evaluation on Nomos Test Dataset:
-
Approach: I evaluated Gemma 2 9b on approximately 200 samples from the Nomos Test dataset provided by Shaheem.
-
Results: The LLM was generally effective at extracting continuous license text sections, but had issues with edge cases where the license was interspersed with other content (such as HTML tags or commentary).
-
Interesting Snippets:
-
Example 1: License text intermixed with HTML, where the LLM partially extracted the license correctly but missed sections due to the interruptions caused by tags.
File Comments:
<HTML>
<HEAD>
<TITLE>Copyright and Licensing Information for ACE, TAO, CIAO, DAnCE, and CoSMIC</TITLE>
<BODY text = "#000000"
link="#000fff"
vlink="#ff0f0f"
bgcolor="#ffffff">
<HR>
<H3>Copyright and Licensing Information for ACE<sup><font
size=-2>(TM)</font></sup>, TAO<sup><font
size=-2>(TM)</font></sup>, CIAO<sup><font
size=-2>(TM)</font></sup>, DAnCE<sup><font
size=-2>(TM)</font></sup>, and
CoSMIC<sup><font
size=-2>(TM)</font></sup></H3>
<A HREF="http://www.cs.wustl.edu/~schmidt/ACE.html">ACE</a><sup><font
size=-2>(TM)</font></sup>, <A
HREF="http://www.cs.wustl.edu/~schmidt/TAO.html">TAO</A><sup><font
size=-2>(TM)</font></sup>, <A
HREF="http://www.dre.vanderbilt.edu/CIAO/">CIAO</A><sup><font
size=-2>(TM)</font></sup>, DAnCE><sup><font size=-2>(TM)</font></sup>,
and
<A HREF="http://www.dre.vanderbilt.edu/cosmic/">CoSMIC</A><sup><font
size=-2>(TM)</font></sup> (henceforth referred to as "DOC software")
are copyrighted by <A
HREF="http://www.dre.vanderbilt.edu/~schmidt/">Douglas C. Schmidt</A>
and his <a
HREF="http://www.cs.wustl.edu/~schmidt/ACE-members.html">research
group</a> at <A HREF="http://www.wustl.edu/">Washington
University</A>, <A HREF="http://www.uci.edu">University of California,
Irvine</A>, and <A HREF="http://www.vanderbilt.edu">Vanderbilt
University</A>, Copyright (c) 1993-2009, all rights reserved. Since
DOC software is open-source, freely available software, you are free
to use, modify, copy, and distribute--perpetually and irrevocably--the
DOC software source code and object code produced from the source, as
well as copy and distribute modified versions of this software. You
must, however, include this copyright statement along with any code
built using DOC software that you release. No copyright statement
needs to be provided if you just ship binary executables of your
software products. <P>
You can use DOC software in commercial and/or binary software releases
and are under no obligation to redistribute any of your source code
that is built using DOC software. Note, however, that you may not
misappropriate the DOC software code, such as copyrighting it yourself
or claiming authorship of the DOC software code, in a way that will
prevent DOC software from being distributed freely using an
open-source development model. You needn't inform anyone that you're
using DOC software in your software, though we encourage you to let <A
HREF="mailto:doc_group@cs.wustl.edu">us</a> know so we can promote
your project in the <A
HREF="http://www.cs.wustl.edu/~schmidt/ACE-users.html">DOC software
success stories</A>. <P>
The <A HREF="http://www.cs.wustl.edu/~schmidt/ACE.html">ACE</A>, <A
HREF="http://www.cs.wustl.edu/~schmidt/TAO.html">TAO</A>, <A
HREF="http://www.dre.vanderbilt.edu/CIAO/">CIAO</A>, <A
HREF="http://www.dre.vanderbilt.edu/~schmidt/DOC_ROOT/DAnCE/">DAnCE</A>,
and <A HREF="http://www.dre.vanderbilt.edu/cosmic/">CoSMIC</A> web
sites are maintained by the <A
HREF="http://www.dre.vanderbilt.edu/">DOC Group</A> at the <A
HREF="http://www.isis.vanderbilt.edu/">Institute for Software
Integrated Systems</A> (ISIS) and the <A
HREF="http://www.cs.wustl.edu/~schmidt/doc-center.html">Center for
Distributed Object Computing</A> of Washington University, St. Louis
for the development of open-source software as part of the open-source
software community.
Submissions are provided by the submitter ``as is'' with no warranties
whatsoever, including any warranty of merchantability, noninfringement
of third party intellectual property, or fitness for any particular
purpose. In no event shall the submitter be liable for any direct,
indirect, special, exemplary, punitive, or consequential damages,
including without limitation, lost profits, even if advised of the
possibility of such damages. Likewise, DOC software is provided as is
with no warranties of any kind, including the warranties of design,
merchantability, and fitness for a particular purpose,
noninfringement, or arising from a course of dealing, usage or trade
practice. Washington University, UC Irvine, Vanderbilt University,
their employees, and students shall have no liability with respect to
the infringement of copyrights, trade secrets or any patents by DOC
software or any part thereof. Moreover, in no event will Washington
University, UC Irvine, or Vanderbilt University, their employees, or
students be liable for any lost revenue or profits or other special,
indirect and consequential damages. <P>
DOC software is provided with no support and without any obligation on
the part of Washington University, UC Irvine, Vanderbilt University,
their employees, or students to assist in its use, correction,
modification, or enhancement. A <A
HREF="http://www.cs.wustl.edu/~schmidt/commercial-support.html">number
of companies</A> around the world provide commercial support for DOC
software, however.
DOC software is Y2K-compliant, as long as the underlying OS platform
is Y2K-compliant. Likewise, DOC software is compliant with the new US
daylight savings rule passed by Congress as "The Energy Policy Act of
2005," which established new daylight savings times (DST) rules for
the United States that expand DST as of March 2007. Since DOC
software obtains time/date and calendaring information from operating
systems users will not be affected by the new DST rules as long as
they upgrade their operating systems accordingly. <P>
The names ACE</a><sup><font size=-2>(TM)</font></sup>,
TAO</a><sup><font size=-2>(TM)</font></sup>, CIAO</a><sup><font
size=-2>(TM)</font></sup>, DAnCE</a><sup><font
size=-2>(TM)</font></sup>, CoSMIC</a><sup><font
size=-2>(TM)</font></sup>, Washington University, UC Irvine, and
Vanderbilt University, may not be used to endorse or promote products
or services derived from this source without express written
permission from Washington University, UC Irvine, or Vanderbilt
University. This license grants no permission to call products or
services derived from this source ACE</a><sup><font
size=-2>(TM)</font></sup>, TAO</a><sup><font
size=-2>(TM)</font></sup>, CIAO</a><sup><font
size=-2>(TM)</font></sup>, DAnCE</a><sup><font
size=-2>(TM)</font></sup>, or CoSMIC</a><sup><font
size=-2>(TM)</font></sup>, nor does it grant permission for the name
Washington University, UC Irvine, or Vanderbilt University to appear
in their names. <P>
If you have any suggestions, additions, comments, or questions, please
let <A HREF="mailto:d.schmidt@vanderbilt.edu">me</a> know. <P>
<A HREF="http://www.dre.vanderbilt.edu/~schmidt/">Douglas C. Schmidt</A><P>
<HR><P>
Back to the <A HREF="http://www.cs.wustl.edu/ACE.html">ACE</A> home page.
</BODY>
</HTML>LLM Response:
Copyright (c) 1993-2009, all rights reserved. Since
DOC software is open-source, freely available software, you are free
to use, modify, copy, and distribute--perpetually and irrevocably--the
DOC software source code and object code produced from the source, as
well as copy and distribute modified versions of this software. You
must, however, include this copyright statement along with any code
built using DOC software that you release. No copyright statement
needs to be provided if you just ship binary executables of your
software products.
You can use DOC software in commercial and/or binary software releases
and are under no obligation to redistribute any of your source code
that is built using DOC software. Note, however, that you may not
misappropriate the DOC software code, such as copyrighting it yourself
or claiming authorship of the DOC software code, in a way that will
prevent DOC software from being distributed freely using an
open-source development model. You needn't inform anyone that you're
using DOC software in your software, though we encourage you to let <A
HREF="mailto:doc_group@cs.wustl.edu">us</a> know so we can promote
your project in the <A
HREF="http://www.cs.wustl.edu/~schmidt/ACE-users.html">DOC software
success stories</A>. -
Example 2: The LLM extracted the license correctly even though the license information was embedded in a package description. This was deemed correct despite being less obvious.
File Comments:
# rpm-specific
set(CPACK_RPM_PACKAGE_SUMMARY "${CPACK_PACKAGE_NAME}")
set(CPACK_RPM_PACKAGE_DESCRIPTION "${PACKAGE_LONG_DESCRIPTION}")
set(CPACK_RPM_PACKAGE_LICENSE "AGPL v.3.0")LLM Response:
set(CPACK_RPM_PACKAGE_LICENSE "AGPL v.3.0")
-
Example 3: The LLM stopped extracting once it encountered commentary between license statements, which might indicate a limitation in detecting when a license resumes after an interruption.
File Comments:
Distribution and modification of the the URW Garamond fonts are subject to the
Aladdin Free Public License <http://www.artifex.com/downloads/doc/Public.htm>
Copyright (c) 2000 URW++ Design & Development
The font GaramondNo8 Italic has been modified, as compared with the version
from the original distribution site <http://www.artifex.com/downloads/>;
credits to Ralf Stubner for fixing the broken quotedblleft character.
The TeX font metrics, virtual fonts and font definition files for URW Garamond,
i.e., the files in the archive ugm.zip, may be distributed and/or modified
under the conditions of the LaTeX Project Public License, either version 1.3 of
this license or (at your option) any later version. The latest version of this
license is in http://www.latex-project.org/lppl.txt and version 1.3 or later is
part of all distributions of LaTeX version 2003/12/01 or later.
Copyright (c) 2005 Walter Schmidt
Various aspects of the font were subsequently improved and (contagiously)
released under the Aladdin Free Public License.
Copyright (c) 2009 Gael Varoquaux
Copyright (c) 2010 Rogerio Theodoro de Brito
Copyright (c) 2010 Khaled Hosny
Finally, these fonts were converted by Gaetan Bisson to TrueType format using
FontForge, for use in an X environment. No additional restriction applies.
Files provided "as is", with no warranties.LLM Response:
Distribution and modification of the the URW Garamond fonts are subject to the
Aladdin Free Public License <http://www.artifex.com/downloads/doc/Public.htm>
Copyright (c) 2000 URW++ Design & Development
-
-
Observations: The LLM handles straightforward, continuous license text effectively, but struggles when the license text is broken up by other information or interspersed with unrelated comments. Overall this can be handled with additional prompt engineering
-
Conclusions and Next Steps
- Documentation and Code Cleanup: Focus on documenting the entire project and cleaning up the code as the GSoC final evaluation approaches.