Coding Weeks 4 & 5 Meeting

Checking results manually and understanding the edge cases.
Implementing the edge cases, e.g. (c) --> copyright, and making the date mandatory.
Going through the different manually checked copyright CSVs and the final CSV provided by Michael.
Traversing the CSV provided by Michael and running the algorithm over it.
Implementing a performance score that measures how well the algorithm detects copyrights.

All the results from the different CSVs were reviewed and a few edge cases were identified: org/person names in a different language affect the results, (c) was not recognized as copyright and was excluded, and [Date] needs to be a required entity for copyright recognition.
Solution: (c) is now replaced with the string "copyright" before the text is fed to the algorithm, and a [Date] check has been added inside each of the individual checks, acting as a final check before a result is called a hit.
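The two fixes above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the function names, the stand-in `keyword_detector`, and the assumption that a "date" means a four-digit year starting with 19 or 20 are all hypothetical.

```python
import re

# Assumption: only the literal "(c)" marker (any case) is rewritten.
C_MARKER = re.compile(r"\(c\)", re.IGNORECASE)
# Assumption: a date is approximated by a four-digit year (19xx or 20xx).
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def normalize_marker(text):
    """Replace "(c)" with the string "copyright" before detection."""
    return C_MARKER.sub("copyright", text)


def has_date(text):
    """Return True if the text contains something that looks like a year."""
    return YEAR.search(text) is not None


def keyword_detector(text):
    """Hypothetical stand-in for the real detection checks."""
    return "copyright" in text.lower()


def is_hit(text, detector=keyword_detector):
    """Count a hit only if the detector fires AND a date is present."""
    normalized = normalize_marker(text)
    return detector(normalized) and has_date(normalized)
```

With this sketch, `is_hit("(c) 2019 Example Org")` is a hit, while `is_hit("(c) Example Org")` is rejected because the date check fails.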
The CSV provided by Michael contained 13 lakh+ (1.3 million+) entries, too many to traverse in a single run (the Jupyter server crashed after running continuously for 10 hours).
Divided the data into chunks of 10,000 entries; each chunk will be traversed and the results checked across all of them.
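One possible way to process the file in chunks of 10,000 without loading everything at once is sketched below. It uses only the standard library; the function name `iter_chunks` and the assumption that the CSV has a header row are hypothetical, and the real notebook may do this differently.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_size=10_000):
    """Yield lists of at most chunk_size rows from a CSV file.

    Assumption: the first line of the file is a header row, so each
    row is returned as a dict keyed by column name.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        while True:
            # islice pulls the next chunk_size rows without reading the rest.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk
```

Each chunk can then be run through the algorithm and its results stored before the next chunk is read, so a crash only loses the chunk in progress.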
The performance score was calculated as (hit score / number of copyrights in the list) * 100, which came out to 82.65%.
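The formula above is simple enough to write down directly. The sketch below is illustrative only: the function name is hypothetical, and the numbers in the usage note are made up rather than the counts behind the 82.65% figure.

```python
def performance_score(hit_score, total_copyrights):
    """Return (hit_score / total_copyrights) * 100 as a percentage."""
    if total_copyrights <= 0:
        raise ValueError("total_copyrights must be positive")
    return hit_score / total_copyrights * 100
```

For example, 3 hits out of 4 copyrights in the list gives `performance_score(3, 4)`, i.e. 75.0.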
The wiki has been updated.

Understanding the edge cases and calculating the accuracy score over True Positives.