Week 18
(September,27,2023)
Attendees:
Updates:
1. Decluttering Strategies:
- I have considered two distinct strategies for decluttering using NER (Named Entity Recognition):
- Simpler Approach: Identifying the entire copyright as a single entity.
- Detailed Labeling: Recognizing and labeling individual components within the copyright. This includes symbols like
(c)
,(C)
, and©
, the termcopyright
, the copyright holder's name, the year/date, among other constituents. Although this method requires more extensive labeling, it promises potential benefits in accuracy and granularity.
- I opted for the simpler approach and proceeded with manually labeling 600 instances via doccano. Subsequently, a rudimentary spaCy model was trained on this labeled data.
2. Model Testing:
- Here are some samples tested with the developed model, where the highlighted parts denote detected copyrights:
Copyright (c) 1997-2000 PHP Development Team (See Credits file)
|\n"); ibase_blob_add($bl_h, "+----------------------------------------------------------------------+\n"); ibase_blob_add($bl_h, "| This program is free software; you can redistribute it and/or modify |\n"); ibase_blob_add($bl_h,copyright 1996 by SPI
Copyright (c) 2004-2011, The Dojo Foundation
All Rights Reserved. Available via Academic Free License >= 2.1 OR the modified BSD license. see: http://dojotoolkit.org/license for detailsCopyright (C) 2003-2004 Lawrence E. Rosen.
All rights reserved. Permission is hereby granted to copy and distribute this license without modification. This license may not be modified withoutCopyright 2004-2018 H2 Group. Multiple-Licensed under the MPL 2.0, and the EPL 1.0
(http://h2database.com/html/license.html). Initial Developer: H2 Group
- Overall, the model displays adeptness in detecting the copyrights and filtering out the clutter, with some notable exceptions, like the fifth example.
3. Integration Efforts:
- With Gaurav's assistance during our recent meeting, we managed to pinpoint some integration issues. After overcoming them, the integrated feature was activated, although it ran at a significantly diminished speed. The reason for this reduced efficiency is yet to be determined.
Conclusion and Further Plans:
1. Model Enhancement:
- The immediate plan is to supplement our dataset with additional labeled data points. With this augmented dataset, the aim is to further improve and refine the declutter model.