Coding Week-0 Meeting
Attendees
- Gaurav Mishra
- Anupam Ghosh
- Michael C. Jaeger
- Shaheem Azmal
- Ayush Bhardwaj
- Vasudev Maduri
- Omar Mohamed
- Kaushlendra Pratap
- Shreya Singh
Discussions
- Brainstormed ways to generate the dataset and broke the task down into parts.
- Python library for generating text from regex: Xeger/Intxeger.
- Reuse the existing script that n-grams the paragraphs of license texts to generate different permutations and combinations of them.
- Regexes for different licenses can be extracted from licenses.json, exceptions.json, or STRINGS.in.
- Shifting the Atarashi codebase to dask/vaex should improve runtime. This is to be done in parallel with the contribution. Results from a few Python files showed a significant speedup.
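The paragraph n-gram and combination idea discussed above could look roughly like the sketch below. This is not the existing script; `split_paragraphs`, `paragraph_ngrams`, and `paragraph_combinations` are hypothetical names, and the sketch assumes paragraphs are separated by blank lines.

```python
import re
from itertools import combinations


def split_paragraphs(text):
    """Split text into paragraphs on blank lines (assumed separator)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def paragraph_ngrams(paragraphs, n):
    """Return every run of n consecutive paragraphs."""
    return [paragraphs[i:i + n] for i in range(len(paragraphs) - n + 1)]


def paragraph_combinations(paragraphs, k):
    """Return every order-preserving choice of k paragraphs (e.g. para1+para3)."""
    return [list(c) for c in combinations(paragraphs, k)]


# Placeholder text standing in for a real license file.
sample_text = "para1\n\npara2\n\npara3\n\npara4"
paras = split_paragraphs(sample_text)
print(paragraph_ngrams(paras, 2))
print(paragraph_combinations(paras, 2))
```

The n-gram version keeps neighbouring paragraphs together, while the combination version also yields non-adjacent pairings.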
Week 0 Progress
- I reused the script.py file to split the license files into combinations of paragraphs, such as para1+para3 and para2+para4, with new combinations produced on each traversal.
- Licenses from different organizations are 60-70% similar, and different versions of the same license are 90% similar. So keywords and the regex of the specific license_header can be added to the split files.
- Tested Intxeger performance on the regex from the STRINGS.in file. We are able to generate "Nsamples" and add them to all the paras. The randomness could be minimized.
import intxeger
x = intxeger.build(r"motosoto open source licen[cs]e =FEW= (v|version )0\.?9\.?1")
print(x.sample(N=5))
Output: ['motosoto open source license =FEW= version 0.9.1', 'motosoto open source licence =FEW= version 0.9.1', 'motosoto open source licence =FEW= v0.91', 'motosoto open source license =FEW= v0.91', 'motosoto open source license =FEW= v0.9.1']
- The statements generated in Nsamples were unique, which automatically makes the generated text files unique.
- The number of generated files depends on both the number of paras and Nsamples.
- Regex from the SPDX-released licenses.json and exceptions.json can be extracted by following the detailsUrl of each entry in the JSON file and reading standardLicenseTemplate (regex) and licenseText (complete text).
- Work Samples : Texts-Intxeger.ipynb, Texts-difflibraries.ipynb
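The SPDX lookup chain noted above (licenses.json entry, then detailsUrl, then standardLicenseTemplate and licenseText) might be sketched as follows. To stay runnable offline, the sketch works on already-parsed JSON and leaves out fetching each detailsUrl. The sample dictionaries are made-up stand-ins rather than real SPDX data, `details_urls` and `extract_texts` are hypothetical helper names, and only the licenses.json layout is covered.

```python
def details_urls(license_list):
    """Map licenseId -> detailsUrl from a parsed licenses.json document."""
    return {
        entry["licenseId"]: entry["detailsUrl"]
        for entry in license_list.get("licenses", [])
    }


def extract_texts(details):
    """Pull the template and the full text out of one parsed details document."""
    return {
        "template": details.get("standardLicenseTemplate", ""),
        "text": details.get("licenseText", ""),
    }


# Made-up stand-ins for parsed JSON; a real run would load licenses.json
# and then download the document behind each detailsUrl.
license_list = {
    "licenses": [
        {"licenseId": "EXAMPLE-1.0",
         "detailsUrl": "https://example.org/EXAMPLE-1.0.json"},
    ]
}
details = {
    "standardLicenseTemplate": "placeholder template",
    "licenseText": "placeholder full text",
}

print(details_urls(license_list))
print(extract_texts(details))
```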
Conclusion and Further Plans
- Implement Intxeger on one of the licenses and generate files using it.
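A minimal sketch of the file-generation step, under stated assumptions: header strings are passed in as a plain list (for example, the result of an Intxeger `sample` call), and every header is paired with every paragraph combination, which is only one possible way to combine them. `write_variants` and the file naming are hypothetical.

```python
import tempfile
from itertools import product
from pathlib import Path


def write_variants(headers, bodies, out_dir):
    """Write one file per (header, body) pair and return the created paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, (header, body) in enumerate(product(headers, bodies)):
        path = out_dir / f"variant_{index:04d}.txt"
        path.write_text(f"{header}\n\n{body}\n", encoding="utf-8")
        paths.append(path)
    return paths


# Header strings as Intxeger might return them; bodies are placeholders.
headers = [
    "motosoto open source license =FEW= version 0.9.1",
    "motosoto open source licence =FEW= v0.91",
]
bodies = ["para1\n\npara3", "para2\n\npara4"]

with tempfile.TemporaryDirectory() as tmp:
    created = write_variants(headers, bodies, tmp)
    print(len(created))
```

With this pairing, the number of files is the number of headers multiplied by the number of paragraph combinations.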