Coding Week-4 Meeting
Attendees
- Gaurav Mishra
- Anupam Ghosh
- Michael C. Jaeger
- Shaheem Azmal
- Ayush Bhardwaj
- Vasudev Maduri
- Omar Mohamed
- Kaushlendra Pratap
- Shreya Singh
Discussions
- Validated the generated files through Nomos using the terminal; the n-gram scripts generated around 300,000 files and the Markov scripts generated 134,000 files (see the parallel validation sketch after this list).
- Added multiprocessing to the text-file generation scripts to speed up the process.
- Using different text augmentation libraries such as AugLy to reduce bias in the dataset; compared results on one of the licenses.
- AugLy's simulate_typos and ReplaceSimilarUnicodeChars will be used for further text augmentation (see the AugLy sketch after this list).
- After Nomos validation, the files will be segregated into different folders.
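
As referenced above, a minimal sketch of validating the generated files through the Nomos terminal scanner on multiple cores; the `nomossa` binary name, its output format, and the folder path are assumptions rather than the exact scripts used.

```python
# Minimal sketch: validate the generated license files with the nomos CLI in
# parallel. Assumes a `nomossa` binary on PATH that prints the detected
# license(s) for a file; adjust the binary name and parsing to your setup.
import subprocess
from multiprocessing import Pool
from pathlib import Path

GENERATED_DIR = Path("generated/markov")  # hypothetical folder of generated files

def scan_file(path: Path) -> tuple[str, str]:
    """Run nomos on one file and return (filename, raw scanner output)."""
    result = subprocess.run(
        ["nomossa", str(path)], capture_output=True, text=True, check=False
    )
    return path.name, result.stdout.strip()

if __name__ == "__main__":
    files = sorted(GENERATED_DIR.glob("*.txt"))
    # Use multiple cores, since the dataset has hundreds of thousands of files.
    with Pool() as pool:
        for name, output in pool.imap_unordered(scan_file, files, chunksize=64):
            print(name, "->", output)
```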
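
A minimal sketch of the AugLy augmentations named above, assuming the `augly` package is installed; the sample license text is only a placeholder.

```python
# Minimal sketch of the two AugLy text augmentations mentioned above.
# Assumes the `augly` package is installed; the license text is a placeholder.
import augly.text as txtaugs

license_text = ["Permission is hereby granted, free of charge, to any person ..."]

# Introduce realistic typos into the license text.
typo_variants = txtaugs.simulate_typos(license_text)

# Replace characters with visually similar Unicode characters.
unicode_variants = txtaugs.replace_similar_unicode_chars(license_text)

print(typo_variants)
print(unicode_variants)
```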
Week 4 Progress
- Validated the Markov and n-gram files through the terminal. Since the dataset was large, multiple cores were used for the validation.
- Updated my scripts to automate the entire process.
- Apart from the SPDX licenses, the same approach will be applied to the other licenses present in the FOSSology database. Extracted license_header and text fields from the JSON file (see the extraction sketch after this list).
- Working on a script to remove the discarded files from the validated output and segregate the correctly labelled files (see the segregation sketch after this list).
- Work samples: augly_implementation, validation-jaccard, Sample-Script-GeneratingFiles, final_script_markov, final_script_ngram
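
A minimal sketch of the JSON extraction step; the export file name and the `rf_shortname` and `text` keys are assumptions about the layout, with `license_header` taken from the note above.

```python
# Minimal sketch: extract license headers and texts from the FOSSology license
# JSON export. The file name and the key names ("rf_shortname",
# "license_header", "text") are assumptions about the export layout.
import json
from pathlib import Path

with Path("fossology_licenses.json").open(encoding="utf-8") as fh:
    licenses = json.load(fh)

corpus = {
    entry["rf_shortname"]: {
        "header": entry.get("license_header", ""),
        "text": entry.get("text", ""),
    }
    for entry in licenses
}
print(f"Extracted {len(corpus)} licenses")
```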
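
A minimal sketch of the planned segregation script, assuming the Nomos results were saved as a CSV of (filename, expected license, detected license) rows; the paths and the result format are assumptions.

```python
# Minimal sketch: discard files whose label Nomos did not confirm and move the
# correctly labelled files into one folder per license. The CSV layout
# (filename, expected license, detected license) and all paths are assumptions.
import csv
import shutil
from pathlib import Path

GENERATED_DIR = Path("generated")        # hypothetical location of generated files
VALIDATED_DIR = Path("validated")        # destination for correctly labelled files
RESULTS_CSV = Path("nomos_results.csv")  # hypothetical validation results

with RESULTS_CSV.open(newline="") as fh:
    for filename, expected, detected in csv.reader(fh):
        src = GENERATED_DIR / filename
        if not src.exists():
            continue
        if expected == detected:
            dest_dir = VALIDATED_DIR / expected
            dest_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest_dir / filename))
        else:
            src.unlink()  # remove the discarded file
```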
Conclusion and Further Plans
Segregate the validated files into different folders and continue with the AugLy implementation.