Skip to main content

Meeting 13

(August 22,2024)

Attendees:

Discussion:

Atarashi Integration and PR

  • Code Ready for PR:

    1. Progress: Presented the latest results to the mentors. The Atarashi integration progressed well, and I created a PR after awaiting feedback from Kaushlendra.

    2. Code Execution: The code allows for semantic search on a file by using the following command:

      atarashi -a SemanticSearch /path/to/file.c
      • Example output:
        {
        "file": "/home/jimbo/Desktop/GSoC24/atarashi/atarashi/license/licenseDownloader.py",
        "results": [
        {
        "description": "",
        "shortname": "GPL-2.0-only",
        "sim_score": 91.0,
        "sim_type": "SemanticSearch-LVD"
        }
        ]
        }
    3. Example Comments in the File:

      Copyright 2018 Aman Jain (amanjain5221@gmail.com)

      SPDX-License-Identifier: GPL-2.0

      This program is free software; you can redistribute it and/or
      modify it under the terms of the GNU General Public License
      version 2 as published by the Free Software Foundation.

LLM License Text Identification

  • Evaluation on Nomos Test Dataset:

    1. Approach: I evaluated Gemma 2 9b on approximately 200 samples from the Nomos Test dataset provided by Shaheem.

    2. Results: The LLM was generally effective at extracting continuous license text sections, but had issues with edge cases where the license was interspersed with other content (such as HTML tags or commentary).

    3. Interesting Snippets:

      • Example 1: License text intermixed with HTML, where the LLM partially extracted the license correctly but missed sections due to the interruptions caused by tags.

        File Comments:

        <HTML>

        <HEAD>
        <TITLE>Copyright and Licensing Information for ACE, TAO, CIAO, DAnCE, and CoSMIC</TITLE>

        <BODY text = "#000000"
        link="#000fff"
        vlink="#ff0f0f"
        bgcolor="#ffffff">

        <HR>

        <H3>Copyright and Licensing Information for ACE<sup><font
        size=-2>(TM)</font></sup>, TAO<sup><font
        size=-2>(TM)</font></sup>, CIAO<sup><font
        size=-2>(TM)</font></sup>, DAnCE<sup><font
        size=-2>(TM)</font></sup>, and
        CoSMIC<sup><font
        size=-2>(TM)</font></sup></H3>

        <A HREF="http://www.cs.wustl.edu/~schmidt/ACE.html">ACE</a><sup><font
        size=-2>(TM)</font></sup>, <A
        HREF="http://www.cs.wustl.edu/~schmidt/TAO.html">TAO</A><sup><font
        size=-2>(TM)</font></sup>, <A
        HREF="http://www.dre.vanderbilt.edu/CIAO/">CIAO</A><sup><font
        size=-2>(TM)</font></sup>, DAnCE><sup><font size=-2>(TM)</font></sup>,
        and
        <A HREF="http://www.dre.vanderbilt.edu/cosmic/">CoSMIC</A><sup><font
        size=-2>(TM)</font></sup> (henceforth referred to as "DOC software")
        are copyrighted by <A
        HREF="http://www.dre.vanderbilt.edu/~schmidt/">Douglas C. Schmidt</A>
        and his <a
        HREF="http://www.cs.wustl.edu/~schmidt/ACE-members.html">research
        group</a> at <A HREF="http://www.wustl.edu/">Washington
        University</A>, <A HREF="http://www.uci.edu">University of California,
        Irvine</A>, and <A HREF="http://www.vanderbilt.edu">Vanderbilt
        University</A>, Copyright (c) 1993-2009, all rights reserved. Since
        DOC software is open-source, freely available software, you are free
        to use, modify, copy, and distribute--perpetually and irrevocably--the
        DOC software source code and object code produced from the source, as
        well as copy and distribute modified versions of this software. You
        must, however, include this copyright statement along with any code
        built using DOC software that you release. No copyright statement
        needs to be provided if you just ship binary executables of your
        software products. <P>

        You can use DOC software in commercial and/or binary software releases
        and are under no obligation to redistribute any of your source code
        that is built using DOC software. Note, however, that you may not
        misappropriate the DOC software code, such as copyrighting it yourself
        or claiming authorship of the DOC software code, in a way that will
        prevent DOC software from being distributed freely using an
        open-source development model. You needn't inform anyone that you're
        using DOC software in your software, though we encourage you to let <A
        HREF="mailto:doc_group@cs.wustl.edu">us</a> know so we can promote
        your project in the <A
        HREF="http://www.cs.wustl.edu/~schmidt/ACE-users.html">DOC software
        success stories</A>. <P>

        The <A HREF="http://www.cs.wustl.edu/~schmidt/ACE.html">ACE</A>, <A
        HREF="http://www.cs.wustl.edu/~schmidt/TAO.html">TAO</A>, <A
        HREF="http://www.dre.vanderbilt.edu/CIAO/">CIAO</A>, <A
        HREF="http://www.dre.vanderbilt.edu/~schmidt/DOC_ROOT/DAnCE/">DAnCE</A>,
        and <A HREF="http://www.dre.vanderbilt.edu/cosmic/">CoSMIC</A> web
        sites are maintained by the <A
        HREF="http://www.dre.vanderbilt.edu/">DOC Group</A> at the <A
        HREF="http://www.isis.vanderbilt.edu/">Institute for Software
        Integrated Systems</A> (ISIS) and the <A
        HREF="http://www.cs.wustl.edu/~schmidt/doc-center.html">Center for
        Distributed Object Computing</A> of Washington University, St. Louis
        for the development of open-source software as part of the open-source
        software community.

        Submissions are provided by the submitter ``as is'' with no warranties
        whatsoever, including any warranty of merchantability, noninfringement
        of third party intellectual property, or fitness for any particular
        purpose. In no event shall the submitter be liable for any direct,
        indirect, special, exemplary, punitive, or consequential damages,
        including without limitation, lost profits, even if advised of the
        possibility of such damages. Likewise, DOC software is provided as is
        with no warranties of any kind, including the warranties of design,
        merchantability, and fitness for a particular purpose,
        noninfringement, or arising from a course of dealing, usage or trade
        practice. Washington University, UC Irvine, Vanderbilt University,
        their employees, and students shall have no liability with respect to
        the infringement of copyrights, trade secrets or any patents by DOC
        software or any part thereof. Moreover, in no event will Washington
        University, UC Irvine, or Vanderbilt University, their employees, or
        students be liable for any lost revenue or profits or other special,
        indirect and consequential damages. <P>

        DOC software is provided with no support and without any obligation on
        the part of Washington University, UC Irvine, Vanderbilt University,
        their employees, or students to assist in its use, correction,
        modification, or enhancement. A <A
        HREF="http://www.cs.wustl.edu/~schmidt/commercial-support.html">number
        of companies</A> around the world provide commercial support for DOC
        software, however.

        DOC software is Y2K-compliant, as long as the underlying OS platform
        is Y2K-compliant. Likewise, DOC software is compliant with the new US
        daylight savings rule passed by Congress as "The Energy Policy Act of
        2005," which established new daylight savings times (DST) rules for
        the United States that expand DST as of March 2007. Since DOC
        software obtains time/date and calendaring information from operating
        systems users will not be affected by the new DST rules as long as
        they upgrade their operating systems accordingly. <P>

        The names ACE</a><sup><font size=-2>(TM)</font></sup>,
        TAO</a><sup><font size=-2>(TM)</font></sup>, CIAO</a><sup><font
        size=-2>(TM)</font></sup>, DAnCE</a><sup><font
        size=-2>(TM)</font></sup>, CoSMIC</a><sup><font
        size=-2>(TM)</font></sup>, Washington University, UC Irvine, and
        Vanderbilt University, may not be used to endorse or promote products
        or services derived from this source without express written
        permission from Washington University, UC Irvine, or Vanderbilt
        University. This license grants no permission to call products or
        services derived from this source ACE</a><sup><font
        size=-2>(TM)</font></sup>, TAO</a><sup><font
        size=-2>(TM)</font></sup>, CIAO</a><sup><font
        size=-2>(TM)</font></sup>, DAnCE</a><sup><font
        size=-2>(TM)</font></sup>, or CoSMIC</a><sup><font
        size=-2>(TM)</font></sup>, nor does it grant permission for the name
        Washington University, UC Irvine, or Vanderbilt University to appear
        in their names. <P>

        If you have any suggestions, additions, comments, or questions, please
        let <A HREF="mailto:d.schmidt@vanderbilt.edu">me</a> know. <P>

        <A HREF="http://www.dre.vanderbilt.edu/~schmidt/">Douglas C. Schmidt</A><P>

        <HR><P>
        Back to the <A HREF="http://www.cs.wustl.edu/ACE.html">ACE</A> home page.

        </BODY>
        </HTML>

        LLM Response:

        Copyright (c) 1993-2009, all rights reserved.  Since
        DOC software is open-source, freely available software, you are free
        to use, modify, copy, and distribute--perpetually and irrevocably--the
        DOC software source code and object code produced from the source, as
        well as copy and distribute modified versions of this software. You
        must, however, include this copyright statement along with any code
        built using DOC software that you release. No copyright statement
        needs to be provided if you just ship binary executables of your
        software products.

        You can use DOC software in commercial and/or binary software releases
        and are under no obligation to redistribute any of your source code
        that is built using DOC software. Note, however, that you may not
        misappropriate the DOC software code, such as copyrighting it yourself
        or claiming authorship of the DOC software code, in a way that will
        prevent DOC software from being distributed freely using an
        open-source development model. You needn't inform anyone that you're
        using DOC software in your software, though we encourage you to let <A
        HREF="mailto:doc_group@cs.wustl.edu">us</a> know so we can promote
        your project in the <A
        HREF="http://www.cs.wustl.edu/~schmidt/ACE-users.html">DOC software
        success stories</A>.
      • Example 2: The LLM extracted the license correctly even though the license information was embedded in a package description. This was deemed correct despite being less obvious.

        File Comments:

        # rpm-specific
        set(CPACK_RPM_PACKAGE_SUMMARY "${CPACK_PACKAGE_NAME}")
        set(CPACK_RPM_PACKAGE_DESCRIPTION "${PACKAGE_LONG_DESCRIPTION}")
        set(CPACK_RPM_PACKAGE_LICENSE "AGPL v.3.0")

        LLM Response:

        set(CPACK_RPM_PACKAGE_LICENSE "AGPL v.3.0")
      • Example 3: The LLM stopped extracting once it encountered commentary between license statements, which might indicate a limitation in detecting when a license resumes after an interruption.

        File Comments:

        Distribution and modification of the the URW Garamond fonts are subject to the
        Aladdin Free Public License <http://www.artifex.com/downloads/doc/Public.htm>

        Copyright (c) 2000 URW++ Design & Development

        The font GaramondNo8 Italic has been modified, as compared with the version
        from the original distribution site <http://www.artifex.com/downloads/>;
        credits to Ralf Stubner for fixing the broken quotedblleft character.


        The TeX font metrics, virtual fonts and font definition files for URW Garamond,
        i.e., the files in the archive ugm.zip, may be distributed and/or modified
        under the conditions of the LaTeX Project Public License, either version 1.3 of
        this license or (at your option) any later version. The latest version of this
        license is in http://www.latex-project.org/lppl.txt and version 1.3 or later is
        part of all distributions of LaTeX version 2003/12/01 or later.

        Copyright (c) 2005 Walter Schmidt


        Various aspects of the font were subsequently improved and (contagiously)
        released under the Aladdin Free Public License.

        Copyright (c) 2009 Gael Varoquaux
        Copyright (c) 2010 Rogerio Theodoro de Brito
        Copyright (c) 2010 Khaled Hosny


        Finally, these fonts were converted by Gaetan Bisson to TrueType format using
        FontForge, for use in an X environment. No additional restriction applies.

        Files provided "as is", with no warranties.

        LLM Response:

        Distribution and modification of the the URW Garamond fonts are subject to the
        Aladdin Free Public License <http://www.artifex.com/downloads/doc/Public.htm>

        Copyright (c) 2000 URW++ Design & Development
    4. Observations: The LLM handles straightforward, continuous license text effectively, but struggles when the license text is broken up by other information or interspersed with unrelated comments. Overall this can be handled with additional prompt engineering

Conclusions and Next Steps

  • Documentation and Code Cleanup: Focus on documenting the entire project and cleaning up the code as the GSoC final evaluation approaches.