Coding Week 8 Meeting
Attendees
- Gaurav Mishra
- Vasudev
- Ayush Bharadwaj
- Shreya Singh
- Kaushlendra Pratap Singh
- Omar AbdelSamea
Discussions
-
Provided with more edge cases to retaliate.
-
Discussion whether a good model will result in good extraction and results.
-
Discussion over calculating the PRECISION of the algorithm using the predefined accuracies.
-
Different models for NER was discussed, like "en_core_web_md", "en_core_web_lg" & "en_core_web_trf"
-
Discussions over the process of clutter removal from the copyrights. E.g.
Original Copyright:
Copyright (c) 2018 Kaushlendra Pratap. Distributed under the MIT license.........
After Clutter Removal:
Copyright (c) 2018 Kaushlendra Pratap.
-
A REGEX based approach was discussed to extract the exact copyright statements out and decided to keep it as a last resort.
-
An approach using the heuristic approach was taken as two pointers to go from one NER extracted entity till the exact thing is acquired.
-
After long discussions over approaches and the edge cases that should be encountered, the final approach included the STRING MANIPULATION.
-
The edge cases decided was:
- ['ORG'] encountered
- ['PERSON'] encountered
- ['ORG] and ['PERSON'] both were encountered
- Hard code string value if "All rights reserved" or "Distributed under the licenses" was present
- Finally the regex approach for the left out cases.
-
The time complexity of the algorithm was under the close watch
Week 8 Progress
-
The edge cases provided were screened through the algorithm and results were improved.
-
Several models were tried including all the en_core_web(md, lg and trf).
-
The accuracy was not impacted as such using the heavy models.
-
The most accuracy was given by "en_core_web_trf", i.e. 89.06 from 89.02 which is a minor change but on the other, the implementation time for 100 thousand copyright statements was 1.2 hours.
-
Performance over accuracy was chosen. (Stayed with "en_core_web_sm")
-
The accuracy of 100 thousand copyrights were stretched to 90 - 91% approx.
-
Final accuracy variable was created with the formula of PRECISION:
TP+TN/FP+TP+TN+FN
-
The PRECISION was calculated to be: 94.07% (for 100 thousand copyright statements)
-
The clutter removal flag was created and developed in such a way that,
USER: Provides FLAG => 1
IF COPYRIGHT is True:
IF FLAG is True:
CALL FUNC() -> Remove clutter
-
TWO edge cases of clutter removal were covered, i.e. the hardcoded string-based and ['ORG'] was present.
-
Wiki has been Updated
Conclusion and Further Plans
Different level of clutter removal should be implemented.