Coding Week-4 Meeting
Attendees
- Gaurav Mishra
- Anupam Ghosh
- Michael C. Jaeger
- Shaheem Azmal
- Ayush Bhardwaj
- Vasudev Maduri
- Omar Mohamed
- Kaushlendra Pratap
- Shreya Singh
Discussions
- Validated the generated files through Nomos using the terminal; the n-gram scripts generated around 300,000 files and the Markov scripts generated 134,000 files (see the parallel validation sketch after this list).
- Added multiprocessing to the text-file generation scripts to speed up the process.
- Using different text augmentation libraries such as AugLy to reduce bias in the dataset; compared results on one of the licenses.
- AugLy's simulate_typos and ReplaceSimilarUnicodeChars will be used for further text augmentation (see the AugLy sketch after this list).
- After Nomos validation, the files will be segregated into different folders.
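
As referenced above, a minimal sketch of validating the generated files through the Nomos terminal scanner on multiple cores; the `nomossa` binary name, its output format, and the folder path are assumptions rather than the exact scripts used.

```python
# Minimal sketch: validate the generated license files with the nomos CLI in
# parallel. Assumes a `nomossa` binary on PATH that prints the detected
# license(s) for a file; adjust the binary name and parsing to your setup.
import subprocess
from multiprocessing import Pool
from pathlib import Path

GENERATED_DIR = Path("generated/markov")  # hypothetical folder of generated files

def scan_file(path: Path) -> tuple[str, str]:
    """Run nomos on one file and return (filename, raw scanner output)."""
    result = subprocess.run(
        ["nomossa", str(path)], capture_output=True, text=True, check=False
    )
    return path.name, result.stdout.strip()

if __name__ == "__main__":
    files = sorted(GENERATED_DIR.glob("*.txt"))
    # Use multiple cores, since the dataset has hundreds of thousands of files.
    with Pool() as pool:
        for name, output in pool.imap_unordered(scan_file, files, chunksize=64):
            print(name, "->", output)
```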
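
A minimal sketch of the AugLy augmentations named above, assuming the `augly` package is installed; the sample license text is only a placeholder.

```python
# Minimal sketch of the two AugLy text augmentations mentioned above.
# Assumes the `augly` package is installed; the license text is a placeholder.
import augly.text as txtaugs

license_text = ["Permission is hereby granted, free of charge, to any person ..."]

# Introduce realistic typos into the license text.
typo_variants = txtaugs.simulate_typos(license_text)

# Replace characters with visually similar Unicode characters.
unicode_variants = txtaugs.replace_similar_unicode_chars(license_text)

print(typo_variants)
print(unicode_variants)
```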
Week 4 Progress
- Validated the Markov and n-gram files through the terminal. Since the dataset was large, multiple cores were used for the validation.
- Updated my scripts to automate the entire process.
- Apart from the SPDX licenses, the same approach will be applied to the other licenses present in the FOSSology database. Extracted license_header and text fields from the JSON file (see the extraction sketch after this list).
- Working on a script to remove the discarded files from the validated output and segregate the correctly labelled files (see the segregation sketch after this list).
- Work samples: augly_implementation, validation-jaccard, Sample-Script-GeneratingFiles, final_script_markov, final_script_ngram
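
A minimal sketch of the JSON extraction step; the export file name and the `rf_shortname` and `text` keys are assumptions about the layout, with `license_header` taken from the note above.

```python
# Minimal sketch: extract license headers and texts from the FOSSology license
# JSON export. The file name and the key names ("rf_shortname",
# "license_header", "text") are assumptions about the export layout.
import json
from pathlib import Path

with Path("fossology_licenses.json").open(encoding="utf-8") as fh:
    licenses = json.load(fh)

corpus = {
    entry["rf_shortname"]: {
        "header": entry.get("license_header", ""),
        "text": entry.get("text", ""),
    }
    for entry in licenses
}
print(f"Extracted {len(corpus)} licenses")
```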
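
A minimal sketch of the planned segregation script, assuming the Nomos results were saved as a CSV of (filename, expected license, detected license) rows; the paths and the result format are assumptions.

```python
# Minimal sketch: discard files whose label Nomos did not confirm and move the
# correctly labelled files into one folder per license. The CSV layout
# (filename, expected license, detected license) and all paths are assumptions.
import csv
import shutil
from pathlib import Path

GENERATED_DIR = Path("generated")        # hypothetical location of generated files
VALIDATED_DIR = Path("validated")        # destination for correctly labelled files
RESULTS_CSV = Path("nomos_results.csv")  # hypothetical validation results

with RESULTS_CSV.open(newline="") as fh:
    for filename, expected, detected in csv.reader(fh):
        src = GENERATED_DIR / filename
        if not src.exists():
            continue
        if expected == detected:
            dest_dir = VALIDATED_DIR / expected
            dest_dir.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dest_dir / filename))
        else:
            src.unlink()  # remove the discarded file
```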
Conclusion and Further Plans
Segregate the validated files into different folders and continue with the AugLy implementation.