Skip to main content

Week 21

(October,18,2023)

Attendees:

Updates:

1. Re-evaluation of the Existing Model:

  • Upon a thorough review of the previously developed decluttering model, I identified a significant issue in the approach I adopted. Specifically, the semi-supervised learning technique utilized earlier had not been applied with adequate scrutiny to the dataset. As a result, the dataset contained an excessive number of inaccurately labeled examples, adversely affecting the model's performance.

2. Data Labeling and Refinement:

  • To rectify the identified discrepancies, I undertook the task of labeling a new dataset comprising 4,000 diverse examples. This process was assisted by the model to ensure the accuracy of labels. The objective was to establish a robust dataset, devoid of labeling errors, which could be reliably used to gauge the model's performance.

3. Optimization Strategy:

  • During this labeling phase, I adopted a systematic strategy to mitigate the recurrence of previously observed issues, particularly the repetitive copyright statements. Consequently, this dataset, though numbering 4,000 examples, effectively offers the richness of approximately 6,000 to 7,000 samples when benchmarked against the former labeling methodology.

4. Putting the Model to Test:

  • I decided to evaluate the refined model on new datasets - copyrights from ansible, cassandra, and vscode repositories:
  • Ansible: Here, the results were mixed. While the model performed reasonably in some cases, it exhibited challenges in accurately identifying GNU license instances:
    1. 'Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>' Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
    2. Copyright 2019 Ansible Project GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt)
    3. (c) 2014, James Tanner <tanner.jc@gmail.com>
    4. (c) 2017 Ansible Project GNU General Public License v3.0+ (see COPYING or https://www.gnu.org/licenses/gpl-3.0.txt) from __future__ import (absolute_import, division, print_function) metaclass__ = type
    5. (c) 2013, bleader Written by bleader <bleader@ratonland.org> Based on pkgin module written by Shaun Zinck that was based on pacman module written by Afterburn <https://github.com/afterburn> that was based on apt module written by Matthew Williams <matthew@flowrout>
  • Cassandra: Again, the model demonstrated varied performance. While it succeeded in some instances, it missed out on others, particularly the ones with repeated patterns:
    1. (c) 2005, 2014 jQuery Foundation, Inc. | jquery.org/license */
    2. (c) Steven Levithan <stevenlevithan.com> MIT License
    3. Copyright 2005-2008 The Android Open Source Project This product includes software developed as part of The Android Open Source Project (http://source.android.com).
    4. Copyright © 2020 Jeff Carpenter, Eben Hewitt. All rights reserved. Used with permission._
    5. Copyright &amp;copy; 2009- The Apache Software Foundation " useexternalfile="yes" encoding="UTF-8" failonerror="false" maxmemory="256m" additionalparam="${jdk11plus-javadoc-exports}"> filesets/> javadoc> fail message="javadoc failed"> condition>
    6. © 2018 DataStax", "", "\n", "\0", "\0\0", "\001", "0", "0\0", "00", "1") forEach(stringConsumer)
    7. copyright to Philip Koopman , which he licenses under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0)
    8. Copyright (c) 1998 Hewlett-Packard CompanydescsRGB IEC61966-2.1sRGB IEC61966-2.1XYZ óQ ÌXYZ XYZ o¢8õXYZ b·ÚXYZ $ ¶ÏdescIEC http://www.iec.chIEC
  • VScode: A similar trend was observed here. Some instances were accurately identified, whereas others were overlooked:
    1. Copyright (c) Microsoft Corporation. All rights reserved. Licensed under the MIT License. See License.txt in the project root for license information.
    2. Copyright (c) textmate-diff.tmbundle project authors",
    3. copyrightCopyright Apple Inc., 2016Èô(FtEXticc:descriptionDisplay ¸IEND®B
    4. Copyright (c) 2002-2020 K.Kosako <kkosako0@gmail.com> ", All rights reserved.",
    5. Copyright (C) Microsoft Corporation. All rights reserved. ± t Copyright (C) Microsoft Corporation. All rights reserved. lÿü ÿü C=------------------------------------------------------------- ±
    6. Copyright (c) 2011 Fabrice Bellard The original design remains. The terminal itself has been extended to include xterm CSI codes, among other features .
    7. Copyright © 2015 W3C® (MIT, ERCIM, Keio, Beihang) . This software or document includes material copied ", from or derived from HTML 5.1 W3C Working Draft (http://www.w3.org/TR/2015/WD-html51-20151008/.)",
  • Feedback Session: After showcasing these outcomes to Kaushlendra, he articulated that the model would greatly benefit from an even more expansive dataset. A corpus larger than the current 4,000 examples is essential for the model to effectively generalize across diverse variations.

Conclusion and further plans:

1. Decluttering Improvements

  • Improve the decluttering model as much as I can while working on the documentation

2. Documentation

  • Work on finilzating the weekly documentation as GSoC is coming to an end.
  • Start working on the GSoC final report.