Coding Week 4&5 Meeting
Attendees
- Anupam Ghosh
- Gaurav Mishra
- Vasudev
- Ayush Bharadwaj
- Shreya Singh
- Kaushlendra Pratap Singh
- Omar AbdelSamea
Discussions
- Checking results manually and understanding the edge cases.
- Implementation of the edge cases like (c) --> copyright, Date needed to be mandatory.
- Go through different manually checked copyright CSV provided and The final CSV provided by Michael.
- Traversing the CSV provided by Michael and Implementing the algorithm over it.
- Implementing a performance score with which the algorithms performance to detect the copyrights is been calculated.
Week 4&5 Progress
- All the results from the different CSVs were traversed and few edge cases were predicted: Org/person name in a different language is impacting, (c) was not been predicted as copyright and it was excluded, [Date] needed to be an important entity for copyright recognition.
- Solution was: (c) --> has been changed to "copyright" string and then it was fed to the algorithm, [Date] check has been implemented inside all the checks which individually helps as a final check before calling it a hit.
- CSV provided by Michael contained 13lakh+ datasets that were not ideal to traverse through all of it once (Jupyter server crashed after continuous 10 hours running).
- Divided the datasets into chunks of 10,000 and will traverse through it and check the ideal results on all over it.
- Performance score was calculated {hitscore/No.of copyrights in list}*100, which came out as 82.65%
- Wiki has been Updated
Conclusion and Further Plans
Understanding the edge cases and calculating the accuracy score over True Positives.