Skip to main content

Week 18

(September,27,2023)

Attendees:

Updates:

1. Decluttering Strategies:

  • I have considered two distinct strategies for decluttering using NER (Named Entity Recognition):
    • Simpler Approach: Identifying the entire copyright as a single entity.
    • Detailed Labeling: Recognizing and labeling individual components within the copyright. This includes symbols like (c), (C), and ©, the term copyright, the copyright holder's name, the year/date, among other constituents. Although this method requires more extensive labeling, it promises potential benefits in accuracy and granularity.
  • I opted for the simpler approach and proceeded with manually labeling 600 instances via doccano. Subsequently, a rudimentary spaCy model was trained on this labeled data.

2. Model Testing:

  • Here are some samples tested with the developed model, where the highlighted parts denote detected copyrights:
    1. Copyright (c) 1997-2000 PHP Development Team (See Credits file) |\n"); ibase_blob_add($bl_h, "+----------------------------------------------------------------------+\n"); ibase_blob_add($bl_h, "| This program is free software; you can redistribute it and/or modify |\n"); ibase_blob_add($bl_h,
    2. copyright 1996 by SPI
    3. Copyright (c) 2004-2011, The Dojo Foundation All Rights Reserved. Available via Academic Free License >= 2.1 OR the modified BSD license. see: http://dojotoolkit.org/license for details
    4. Copyright (C) 2003-2004 Lawrence E. Rosen. All rights reserved. Permission is hereby granted to copy and distribute this license without modification. This license may not be modified without
    5. Copyright 2004-2018 H2 Group. Multiple-Licensed under the MPL 2.0, and the EPL 1.0 (http://h2database.com/html/license.html). Initial Developer: H2 Group
  • Overall, the model displays adeptness in detecting the copyrights and filtering out the clutter, with some notable exceptions, like the fifth example.

3. Integration Efforts:

  • With Gaurav's assistance during our recent meeting, we managed to pinpoint some integration issues. After overcoming them, the integrated feature was activated, although it ran at a significantly diminished speed. The reason for this reduced efficiency is yet to be determined.

Conclusion and Further Plans:

1. Model Enhancement:

  • The immediate plan is to supplement our dataset with additional labeled data points. With this augmented dataset, the aim is to further improve and refine the declutter model.