Skip to main content

Coding Week 8 Meeting

Attendees

  • Gaurav Mishra
  • Vasudev
  • Ayush Bharadwaj
  • Shreya Singh
  • Kaushlendra Pratap Singh
  • Omar AbdelSamea

Discussions

  • Provided with more edge cases to retaliate.

  • Discussion whether a good model will result in good extraction and results.

  • Discussion over calculating the PRECISION of the algorithm using the predefined accuracies.

  • Different models for NER was discussed, like "en_core_web_md", "en_core_web_lg" & "en_core_web_trf"

  • Discussions over the process of clutter removal from the copyrights. E.g.

    Original Copyright: Copyright (c) 2018 Kaushlendra Pratap. Distributed under the MIT license.........

    After Clutter Removal: Copyright (c) 2018 Kaushlendra Pratap.

  • A REGEX based approach was discussed to extract the exact copyright statements out and decided to keep it as a last resort.

  • An approach using the heuristic approach was taken as two pointers to go from one NER extracted entity till the exact thing is acquired.

  • After long discussions over approaches and the edge cases that should be encountered, the final approach included the STRING MANIPULATION.

  • The edge cases decided was:

    1. ['ORG'] encountered
    2. ['PERSON'] encountered
    3. ['ORG] and ['PERSON'] both were encountered
    4. Hard code string value if "All rights reserved" or "Distributed under the licenses" was present
    5. Finally the regex approach for the left out cases.
  • The time complexity of the algorithm was under the close watch

Week 8 Progress

  • The edge cases provided were screened through the algorithm and results were improved.

  • Several models were tried including all the en_core_web(md, lg and trf).

  • The accuracy was not impacted as such using the heavy models.

  • The most accuracy was given by "en_core_web_trf", i.e. 89.06 from 89.02 which is a minor change but on the other, the implementation time for 100 thousand copyright statements was 1.2 hours.

  • Performance over accuracy was chosen. (Stayed with "en_core_web_sm")

  • The accuracy of 100 thousand copyrights were stretched to 90 - 91% approx.

  • Final accuracy variable was created with the formula of PRECISION: TP+TN/FP+TP+TN+FN

  • The PRECISION was calculated to be: 94.07% (for 100 thousand copyright statements)

  • The clutter removal flag was created and developed in such a way that,

    USER: Provides FLAG => 1

    IF COPYRIGHT is True:

    IF FLAG is True:

    CALL FUNC() -> Remove clutter

  • TWO edge cases of clutter removal were covered, i.e. the hardcoded string-based and ['ORG'] was present.

  • Wiki has been Updated

Conclusion and Further Plans

Different level of clutter removal should be implemented.