Created a labeled result set for FOSSology's repo. The copyrights are color-coded as
True Positive (green), False Positive (red), different language (blue), not an actual copyright (grey), and confusing (orange).
Used this data (14k positive and 5k negative samples) to train classifiers. Started with tf-idf features and trained SVM, Random Forest, and Naive Bayes models.
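
A minimal sketch of what the tf-idf plus classifier setup could look like, assuming scikit-learn. The toy strings, the `build_models` helper, and the specific estimators (`LinearSVC` as the SVM, `MultinomialNB` as the Naive Bayes variant) are illustrative assumptions, not the actual training code or data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the labeled set.
# 1 = true copyright statement, 0 = false positive.
texts = [
    "Copyright (C) 2017 Siemens AG",
    "Copyright 2008-2014 Hewlett-Packard Development Company",
    "(c) 2019 Jane Doe <jane@example.com>",
    "Copyright (c) 2020 Example Corp. All rights reserved",
    "copyright notice must be retained in all copies",
    "the copyright holder be liable for any damages",
    "copyright_statement = get_text(row)",
    "check whether the copyright field is empty",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def build_models():
    """Return one tf-idf pipeline per classifier family (hypothetical helper)."""
    return {
        "svm": make_pipeline(TfidfVectorizer(), LinearSVC()),
        "random_forest": make_pipeline(
            TfidfVectorizer(),
            RandomForestClassifier(n_estimators=50, random_state=0),
        ),
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    }


models = build_models()
for name, model in models.items():
    model.fit(texts, labels)  # each pipeline fits its own tf-idf vocabulary
    print(name, model.predict(["Copyright (C) 2021 Example Corp"]))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the tf-idf vocabulary tied to the model that was trained on it, so raw strings can be passed straight to `predict`.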
NB can be set to require a certain level of confidence before classifying a string.
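
One way the confidence requirement could be sketched, assuming scikit-learn's `MultinomialNB` and its `predict_proba` output. The `classify_with_threshold` helper, the 0.8 default, and the toy strings are hypothetical and only illustrate the idea of leaving low-confidence strings unclassified.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; 1 = true copyright, 0 = false positive.
texts = [
    "Copyright (C) 2017 Siemens AG",
    "Copyright 2008-2014 Hewlett-Packard Development Company",
    "(c) 2019 Jane Doe <jane@example.com>",
    "copyright notice must be retained in all copies",
    "the copyright holder be liable for any damages",
    "check whether the copyright field is empty",
]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)


def classify_with_threshold(model, strings, threshold=0.8):
    """Return the predicted label per string, or None when the most
    probable class does not reach the confidence threshold."""
    results = []
    for probs in model.predict_proba(strings):
        best = probs.argmax()
        if probs[best] >= threshold:
            results.append(int(model.classes_[best]))
        else:
            results.append(None)  # not confident enough to classify
    return results


print(classify_with_threshold(model, ["Copyright (C) 2021 Example Corp",
                                      "copyright field"]))
```

Strings that come back as `None` could then be routed to manual review instead of being forced into one class.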
Results are very good, with >95% accuracy. The aim is higher recall when identifying positive copyrights.
Tested out BERT, but it is slow and not very performant given the amount of data.