Coding Week-0 Meeting

Attendees

Discussions

  1. Brainstormed various ways in which the dataset can be generated, and broke the task down into different parts.
  2. Python library to be used for text generation from regex: Xeger/Intxeger.
  3. Reuse the script that n-grams the paragraphs of license texts to generate different permutations and combinations of them.
  4. Regexes for different licenses can be extracted from licenses.json, exceptions.json, or STRINGS.in.
  5. Shifting the Atarashi codebase to dask/vaex should reduce the runtime. This is to be done in parallel with the contribution. Results from a few Python files showed a significant runtime improvement.
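
As a rough illustration of item 3, paragraph combinations could be produced along these lines. This is only a sketch and not the existing script: the helper names split_paragraphs and paragraph_combinations are hypothetical, paragraphs are assumed to be separated by blank lines, and itertools.combinations stands in for whatever traversal the script actually uses.

```python
from itertools import combinations


def split_paragraphs(text):
    """Split a license text into paragraphs on blank lines (assumed separator)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def paragraph_combinations(paragraphs, size=2):
    """Yield every order-preserving combination of `size` paragraphs as one text."""
    for combo in combinations(paragraphs, size):
        yield "\n\n".join(combo)


# Placeholder text standing in for a real license file.
sample = "para1 text\n\npara2 text\n\npara3 text\n\npara4 text"
paras = split_paragraphs(sample)
variants = list(paragraph_combinations(paras, size=2))
print(len(variants))  # 4 paragraphs taken 2 at a time give 6 variants
```

With four paragraphs this yields pairs such as para1+para2, para1+para3 and para2+para4; raising size gives longer combinations.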

Week 0 Progress

  1. I reused the script.py file to split the license files into different combinations of paragraphs, such as para1+para3 and para2+para4, after each traversal.
  2. Licenses from different organizations are 60-70% similar, and different versions of the same license are 90% similar. So keywords and the regex of the specific license_header can be added to the split files.
  3. Tested Intxeger performance on the regexes from the STRINGS.in file. We are able to generate "Nsamples" and add them to all the paras, so the randomness could be minimized.
    import intxeger
    # Build a generator from one STRINGS.in regex and draw 5 samples.
    x = intxeger.build(r"motosoto open source licen[cs]e =FEW= (v|version )0\.?9\.?1")
    print(x.sample(N=5))
    Output: ['motosoto open source license =FEW= version 0.9.1', 'motosoto open source licence =FEW= version 0.9.1', 'motosoto open source licence =FEW= v0.91', 'motosoto open source license =FEW= v0.91', 'motosoto open source license =FEW= v0.9.1']
  4. The statements generated in Nsamples were unique, which automatically makes the generated text files unique.
  5. The number of datasets generated will depend on the number of paras + Nsamples.
  6. Regexes from the SPDX-released licenses.json and exceptions.json can be extracted by following the detailsUrl of each JSON entry -> standardLicenseTemplate (regex) -> licenseText (complete text).
  7. Work Samples : Texts-Intxeger.ipynb, Texts-difflibraries.ipynb
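
The lookup chain in item 6 might be sketched as follows. The field names licenseId, detailsUrl, standardLicenseTemplate and licenseText are the ones used for license entries; the helper names and the tiny in-memory JSON are hypothetical stand-ins, not real SPDX data. Fetching each detailsUrl over the network is left out on purpose, so the sketch only shows how the parsed JSON would be read.

```python
import json


def details_urls(index):
    """Map each licenseId in a parsed licenses.json to its detailsUrl."""
    return {e["licenseId"]: e["detailsUrl"] for e in index.get("licenses", [])}


def template_and_text(details):
    """Return (standardLicenseTemplate, licenseText) from a parsed details JSON."""
    return details.get("standardLicenseTemplate"), details.get("licenseText")


# Made-up stand-ins trimmed to the fields used here (assumption, not SPDX data).
index = json.loads(
    '{"licenses": [{"licenseId": "Example-1.0",'
    ' "detailsUrl": "https://example.org/Example-1.0.json"}]}'
)
details = json.loads(
    '{"licenseId": "Example-1.0",'
    ' "standardLicenseTemplate": "Example template text",'
    ' "licenseText": "Example license text"}'
)

urls = details_urls(index)
template, text = template_and_text(details)
print(urls["Example-1.0"])
print(template, "|", text)
```

In practice each URL returned by details_urls would be downloaded and parsed before being passed to template_and_text.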

Conclusion and Further Plans

Implementation of Intxeger on one of the licenses and generating files using it.
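
One possible shape for the file-generation step is sketched below. It assumes the generated headers (for example the Intxeger samples shown earlier) and the paragraph combinations are already available as lists of strings, and it pairs every header with every body, which is an assumption about how the two are combined. The function name write_variants and the sample_NNNN.txt naming are hypothetical.

```python
import tempfile
from itertools import product
from pathlib import Path


def write_variants(headers, bodies, out_dir):
    """Write one text file per (header, body) pair and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, (header, body) in enumerate(product(headers, bodies)):
        path = out_dir / f"sample_{i:04d}.txt"
        path.write_text(f"{header}\n\n{body}\n", encoding="utf-8")
        paths.append(path)
    return paths


# Two of the Intxeger samples from the notes, plus placeholder paragraph bodies.
headers = [
    "motosoto open source license =FEW= version 0.9.1",
    "motosoto open source licence =FEW= v0.91",
]
bodies = ["para1 text\n\npara3 text", "para2 text\n\npara4 text"]

with tempfile.TemporaryDirectory() as tmp:
    written = write_variants(headers, bodies, tmp)
    file_count = len(written)
    first_text = written[0].read_text(encoding="utf-8")

print(file_count)  # 2 headers x 2 bodies
```

Because every header is unique, each written file differs from the others even when the paragraph body is shared.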