atarashi.libs.ngram module

Copyright 2018 Aman Jain (amanjain5221@gmail.com)

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

atarashi.libs.ngram.createNgrams(licenseList, ngramJsonLoc, threads=4, verbose=0)

Creates an Ngram_keywords.json file, in the location specified by the user, that contains the unique n-grams for each license cluster.

Parameters:
  • licenseList – Processed license list (CSV)
  • ngramJsonLoc – Location of the n-gram JSON file
  • threads – Number of CPUs to use for creating n-grams, which speeds up the process (default: 4)
  • verbose – Whether verbose mode is on (default: 0, i.e. off)
Returns:
  • The n-gram JSON file location
  • matched_output – Array of licenses that have a non-zero number of unique n-gram identifiers
  • no_keyword_matched – Array of licenses with zero unique n-gram identifiers

atarashi.libs.ngram.find_ngrams(input_list, n)

Zips together the n-grams of the given length n from the input list.
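The word "zip" in the description suggests the common Python idiom of zipping shifted copies of a list. The following is a minimal sketch under that assumption, not a copy of the library's code; it returns a list for readability, which may differ from the actual return type of find_ngrams.

```python
def find_ngrams(input_list, n):
    """Return the n-grams of length n from input_list as a list of tuples."""
    # Build n copies of the list, the i-th one shifted left by i items.
    shifted = (input_list[i:] for i in range(n))
    # zip stops at the shortest copy, so only complete n-grams are produced.
    return list(zip(*shifted))


tokens = "permission is hereby granted free of charge".split()
bigrams = find_ngrams(tokens, 2)
```

If the input list is shorter than n, the shortest shifted copy is empty and the result is an empty list.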

atarashi.libs.ngram.load_database(licenseList, verbose=0)

Stores the unique n-grams (N = 2, 5, 6, 7, 8) for each license cluster.

Parameters:
  • licenseList – Path to the processed license list
Returns:

The uniqueNgrams array, the license cluster array, and the licenses array.

atarashi.libs.ngram.unique_ngrams(uniqueNGram)

Parameters:
  • uniqueNGram – List of all n-grams of a cluster
Returns:

List of n-grams that uniquely identify the license cluster.
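
Taken together, load_database and unique_ngrams describe keeping only the n-grams that occur in a single license cluster. The sketch below illustrates that idea in isolation. It is not atarashi's implementation: the function names ngram_set and unique_ngrams_per_cluster, and the dictionary input of tokens per cluster, are hypothetical and chosen only for this example.

```python
from collections import Counter


def ngram_set(tokens, n):
    """Return the set of n-grams of length n found in tokens."""
    return set(zip(*(tokens[i:] for i in range(n))))


def unique_ngrams_per_cluster(cluster_tokens, sizes=(2, 5, 6, 7, 8)):
    """Map each cluster name to the n-grams that appear in no other cluster.

    cluster_tokens is a hypothetical input shape: a dict mapping a cluster
    name to the list of tokens of its license text.
    """
    # Collect every n-gram of the requested sizes for each cluster.
    per_cluster = {
        name: set().union(*(ngram_set(tokens, n) for n in sizes))
        for name, tokens in cluster_tokens.items()
    }
    # Count in how many clusters each n-gram occurs.
    counts = Counter(gram for grams in per_cluster.values() for gram in grams)
    # Keep only the n-grams seen in exactly one cluster.
    return {
        name: {gram for gram in grams if counts[gram] == 1}
        for name, grams in per_cluster.items()
    }


clusters = {"A": "a b c".split(), "B": "b c d".split()}
unique = unique_ngrams_per_cluster(clusters, sizes=(2,))
```

A cluster whose result is an empty set corresponds to the no_keyword_matched case described for createNgrams, since it has no n-gram that distinguishes it from the others.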