Utilities to scan, score, and save data on the licenses found.

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <signal.h>
#include <libgen.h>
#include <limits.h>
#include <stdlib.h>
#include "nomos.h"
#include "licenses.h"
#include "nomos_utils.h"
#include "util.h"
#include "list.h"
#include "nomos_regex.h"
#include "parse.h"
#include <_autodefs.h>


Macros
#define	_GNU_SOURCE

#define	HASHES "#####################"

#define	DEBCPYRIGHT "debian/copyright"

#define	MAX(a, b) ((a) > (b) ? a : b)
	Max of two.

#define	MIN(a, b) ((a) < (b) ? a : b)
	Min of two.

#define	LINE_BYTES 50

#define	LINE_WORDS 8

#define	WC_BYTES 30

#define	WC_WORDS 3

#define	PUNT_LINES 3

#define	MIN_LINES 1

Functions
static void	makeLicenseSummary (list_t *l, int highScore, char *target, int size)
	Construct a 'computed license'. Wherever possible, leave off the entries for None and LikelyNot; those are individual-file results and we're making an 'aggregate summary' here.

static void	noLicenseFound ()
	Mark the current scan as LS_NOSUM (No_license_found).

static int	searchStrategy (int, char *, int)

static void	saveLicenseData (scanres_t *scores, int nCand, int nElem, int lowWater)
	Save or create all the license data in a specific directory (a temp directory?).

static int	scoreCompare (const void *arg1, const void *arg2)
	Compare two scores.

static void	printHighlightInfo (GArray *keyWords, GArray *theMatches)
	Print highlight info about matches.

void	licenseInit ()
	License initialization.

char *	createRelativePath (item_t *p, scanres_t *scp)

void	scanForKeywordsAndSetScore (scanres_t *scores, list_t *licenseList)

void	relaxScoreCriterionForSingleFile (scanres_t *scores)
	Reset the score to 1 if it is 0.

int	fiterResultsOfKeywordScan (int lowWater, scanres_t *scores, int nFiles)
	Run through the list once more.

void	licenseScan (list_t *licenseList)
	Scan the list for one or more licenses.

static void	printKeyWordMatches (scanres_t *scores, int idx)
	Prints keywords match to STDOUT.

static gint	compare_integer (gconstpointer a, gconstpointer b)
	Compare two integers.

static void	rescanOriginalTextForFoundLicences (char *textp, int isFileMarkupLanguage, int isPS)
	Rescan original content for the licenses already found.

Variables
static char	any [6]

static char	some [7]

static char	few [6]

static char	year [7]

Detailed Description

Utilities to scan, score, and save data on the licenses found.

Version: "$Id: licenses.c 4032 2011-04-05 22:16:20Z bobgo $"

Definition in file licenses.c.

Macro Definition Documentation

◆ LINE_BYTES

#define LINE_BYTES 50

fudge for punctuation, etc.

Definition at line 245 of file licenses.c.

◆ LINE_WORDS

#define LINE_WORDS 8

assume this many words per line

Definition at line 246 of file licenses.c.

◆ MIN_LINES

#define MIN_LINES 1

normal minimum-extra-lines

Definition at line 250 of file licenses.c.

◆ PUNT_LINES

#define PUNT_LINES 3

if "dunno", guess this line-count

Definition at line 249 of file licenses.c.

◆ WC_BYTES

#define WC_BYTES 30

wild-card counts this many bytes

Definition at line 247 of file licenses.c.

◆ WC_WORDS

#define WC_WORDS 3

wild-card counts this many words

Definition at line 248 of file licenses.c.

Function Documentation

◆ compare_integer()

static gint compare_integer	(	gconstpointer	a,
		gconstpointer	b
	)

static

Compare two integers.

Returns: negative value if a < b; zero if a = b; positive value if a > b.
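
The return contract above is the usual three-way comparator contract. A minimal sketch, assuming the two pointers refer to plain int values (the function body is not shown on this page); `const void *` stands in for GLib's `gconstpointer`, and the names `compare_int_sketch` and `sort_ints` are hypothetical:

```c
#include <stdlib.h>

/* Hypothetical stand-in for compare_integer(); assumes both pointers
 * refer to int values. */
static int compare_int_sketch(const void *a, const void *b)
{
  int x = *(const int *) a;
  int y = *(const int *) b;

  /* (x > y) - (x < y) yields 1, 0 or -1 without the overflow risk of x - y. */
  return (x > y) - (x < y);
}

/* Sorts an int array in ascending order using the comparator above. */
static void sort_ints(int *values, size_t count)
{
  qsort(values, count, sizeof values[0], compare_int_sketch);
}
```

The same comparator shape works with qsort() here and with GLib's array sorting functions, since both expect a negative, zero, or positive result.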

Definition at line 916 of file licenses.c.

◆ fiterResultsOfKeywordScan()

int fiterResultsOfKeywordScan	(	int	lowWater,
		scanres_t *	scores,
		int	nFiles
	)

Run through the list once more.

This time we record and count the license candidates to process. License candidates are determined by either (score >= low) OR matching a set of filename patterns.
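
The candidate rule above (score at or above the low-water mark, OR a matching filename) can be sketched as follows. This is an illustration only: `struct file_score` is a simplified, hypothetical stand-in for scanres_t, and the two filename patterns are assumed examples, since the real pattern set is not listed on this page.

```c
#include <string.h>

/* Hypothetical, simplified stand-in for scanres_t. */
struct file_score {
  const char *path;   /* file name or path */
  int score;          /* keyword score from the earlier scan */
  int isCandidate;    /* set to 1 when the file should be processed */
};

/* Assumed example patterns; the real set is not listed on this page. */
static int name_looks_like_license(const char *path)
{
  return strstr(path, "LICENSE") != NULL || strstr(path, "COPYING") != NULL;
}

/* Marks and counts the candidates: score >= lowWater OR a matching name. */
static int filter_candidates(int lowWater, struct file_score *scores, int nFiles)
{
  int nCand = 0;

  for (int i = 0; i < nFiles; i++) {
    scores[i].isCandidate =
        scores[i].score >= lowWater || name_looks_like_license(scores[i].path);
    nCand += scores[i].isCandidate;
  }
  return nCand;
}
```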

Parameters

lowWater	Lowest score to filter
scores	Scores to filter
nFiles	Number of files

Returns

Definition at line 701 of file licenses.c.

◆ licenseInit()

void licenseInit ( )

License initialization.

Examine the search strings in licSpec looking for 3 corner-cases to optimize all the regex-searches we'll be making:

The seed string is the same as the text-search string
The text-search string has length 1 and contents == "."
The seed string is the 'null-string' indicator

Step 1, copy the tseed "search seed", decrypt it, and munge any wild-cards in the string. Note that once we eliminate the compile-time string encryption, we could re-use the same exact data. In fact, some day (in our copious spare time), we could effectively remove licSpec.

Step 2, add the search-seed to the search-cache

Step 3, handle special cases of NULL seeds and (regex == seed)

Step 4, decrypt and fix the regex (since seed != regex here). Once we have all that, searchStrategy() helps determine how many lines above and below [the seed] to save – see findPhrase() for details.

Now that we've computed the above- and below-values for license searches, set each of the appropriate entries with the MAX values determined. Limit 'above' values to 3 and 'below' values to 6.

QUESTION: the above has worked in the past - is it STILL valid?

Finally (if enabled), compare each of the search strings to see if there are duplicates, and determine if some of the regexes can be searched via strstr() (instead of its slower-but-more-functional regex brethren).
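
The strstr() shortcut mentioned above can be sketched as follows, assuming a case-sensitive match and POSIX-style regex metacharacters. The helper names are hypothetical and the actual test used by licenseInit() is not shown on this page.

```c
#include <string.h>

/* Returns 1 when the pattern contains none of the common POSIX regex
 * metacharacters, so a literal substring search gives the same answer. */
static int is_plain_literal(const char *pattern)
{
  return pattern[strcspn(pattern, ".[]()*+?{}|^$\\")] == '\0';
}

/* Hypothetical helper: literal patterns go through strstr(); anything else
 * would be handed to the regex engine (not shown here). Returns 1 on a
 * match, 0 on no match, and -1 when a regex search is required. */
static int fast_search(const char *text, const char *pattern)
{
  if (!is_plain_literal(pattern)) {
    return -1;
  }
  return strstr(text, pattern) != NULL;
}
```

Deciding this once at initialization means the per-file search loop never has to inspect the pattern again.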

Definition at line 70 of file licenses.c.

◆ licenseScan()

void licenseScan ( list_t * licenseList )

Scan the list for one or more licenses.

This routine takes a list, but in fossology we always pass in a single file.

Set up defaults for the minimum scores for which we'll save files. Try to ensure a minimum number of license files will be recorded for this source/package (try, don't force it too hard); see if lower scores yield a better fit, but recognize that the chance of picking up a non-license file increases as we lower the bar.
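
One way to picture "lowering the bar" is the loop below. It is a sketch under stated assumptions, not the routine's actual logic: the starting score, the floor, and the wanted number of files are all hypothetical parameters, and scores are held in a plain int array.

```c
/* Counts how many scores reach the given threshold. */
static int count_at_or_above(const int *scores, int nFiles, int threshold)
{
  int n = 0;

  for (int i = 0; i < nFiles; i++) {
    n += scores[i] >= threshold;
  }
  return n;
}

/* Starts at startScore and lowers the threshold one step at a time until at
 * least `wanted` files qualify or the floor is reached. All parameters are
 * hypothetical. */
static int pick_low_water(const int *scores, int nFiles,
                          int startScore, int floorScore, int wanted)
{
  int lowWater = startScore;

  while (lowWater > floorScore &&
         count_at_or_above(scores, nFiles, lowWater) < wanted) {
    lowWater--;
  }
  return lowWater;
}
```

The floor is what keeps the search from "forcing it too hard": once it is reached, the routine accepts fewer files rather than admitting everything.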

Definition at line 747 of file licenses.c.

◆ makeLicenseSummary()

static void makeLicenseSummary	(	list_t *	l,
		int	highScore,
		char *	target,
		int	size
	)

static

Construct a 'computed license'. Wherever possible, leave off the entries for None and LikelyNot; those are individual-file results and we're making an 'aggregate summary' here.

parseLicenses() added license components found, as long as they were considered "interesting" to some extent. Components of significant interest had their iFlag set to 1; those of lower-interest were set to 0. In this way we can tier license components into 4 distinct levels: 'interesting', 'medium interest', 'nothing significant', and 'Zero'.
==> If the list is EMPTY, there's nothing, period.
==> If listCount() returns non-zero, "interesting" stuff is in it and we can safely ignore things of 'significantly less interest'.
==> If neither of these is the case, only the licenses of the above 'significantly less interest' category exist (don't ignore them).

We need to be VERY careful in this routine about the length of the license-summary created; they COULD be indefinitely long! For now, just check to see if we're going to overrun the buffer...
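
The overrun check described above can be sketched as a bounded append. This is a minimal illustration, assuming a comma-separated summary; the helper name `append_license` is hypothetical, and it takes a size_t capacity where makeLicenseSummary() takes an int.

```c
#include <string.h>

/* Appends `name` to the comma-separated summary in `target` (capacity `size`
 * bytes, including the terminating NUL). Returns 1 on success, or 0 and
 * leaves `target` unchanged when the result would not fit. */
static int append_license(char *target, size_t size, const char *name)
{
  size_t used = strlen(target);
  size_t sep = used > 0 ? 1 : 0;           /* one byte for ',' between names */
  size_t need = used + sep + strlen(name) + 1;

  if (need > size) {
    return 0;                              /* would overrun the buffer */
  }
  if (sep) {
    target[used++] = ',';
  }
  strcpy(target + used, name);             /* safe: length checked above */
  return 1;
}
```

Checking the full required length before writing anything means a rejected append never leaves a half-written entry in the summary.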

Note: This function adds licenses to cur.compLic

Definition at line 1264 of file licenses.c.

◆ printHighlightInfo()

static void printHighlightInfo	(	GArray *	keyWords,
		GArray *	theMatches
	)

static

Print highlight info about matches.

This function prints to STDOUT only if OPTS_HIGHLIGHT_STDOUT is set.

Format:
Keyword at <start>, length <length>, index = 0,
License #<name># at <start>, length <length>, index = <license_index>,

Parameters

keyWords	Keywords matches
theMatches	License matches
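
The two output shapes described above can be sketched with snprintf(). `struct match_info` is a hypothetical, simplified record; the real element type stored in the GArray arguments is not shown on this page.

```c
#include <stdio.h>

/* Hypothetical, simplified match record. */
struct match_info {
  int start;          /* offset of the match in the text */
  int length;         /* length of the match */
  int index;          /* license index; keywords always print 0 */
  const char *name;   /* license name; unused for keywords */
};

/* Formats one keyword entry into buf; returns snprintf()'s result. */
static int format_keyword(char *buf, size_t size, const struct match_info *m)
{
  return snprintf(buf, size, "Keyword at %i, length %i, index = 0,",
                  m->start, m->length);
}

/* Formats one license entry into buf; returns snprintf()'s result. */
static int format_license(char *buf, size_t size, const struct match_info *m)
{
  return snprintf(buf, size, "License #%s# at %i, length %i, index = %i,",
                  m->name, m->start, m->length, m->index);
}
```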

Definition at line 854 of file licenses.c.

◆ relaxScoreCriterionForSingleFile()

void relaxScoreCriterionForSingleFile ( scanres_t * scores )

Reset the score to 1 if it is 0.

If we were invoked with a single-file-only option, just override the score calculation: give the file any greater-than-zero score so it appears as a valid candidate. This is important when the file to be evaluated has no keywords, yet might contain authorship inferences.

Parameters

scores

Note: It is always the case that we are doing one file at a time.
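
In code, the override amounts to a single guarded assignment. A minimal sketch, where `struct single_score` is a hypothetical, simplified stand-in for scanres_t:

```c
/* Hypothetical, simplified stand-in for scanres_t. */
struct single_score {
  int score;
};

/* Gives a zero-scored file the smallest positive score so it is still
 * treated as a candidate; non-zero scores are left untouched. */
static void relax_single_file(struct single_score *s)
{
  if (s->score == 0) {
    s->score = 1;
  }
}
```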

Definition at line 679 of file licenses.c.

◆ rescanOriginalTextForFoundLicences()

static void rescanOriginalTextForFoundLicences	(	char *	textp,
		int	isFileMarkupLanguage,
		int	isPS
	)

static

Rescan original content for the licenses already found.

Parameters

textp	Original text string
isFileMarkupLanguage	Is original text a markup text
isPS	Is original text a PostScript text

Definition at line 936 of file licenses.c.

◆ saveLicenseData()

static void saveLicenseData	(	scanres_t *	scores,
		int	nCand,
		int	nElem,
		int	lowWater
	)

static

Save or create all the license data in a specific directory (a temp directory?).

Note: OF SPECIAL INTEREST: this function changes directory!
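
Because the working directory is process-wide state, a caller may want to save and restore it around such a call. The following is a caller-side sketch using POSIX getcwd() and chdir(); the helper `run_in_directory` and the demonstration callback are hypothetical, and nothing on this page says saveLicenseData() restores the directory itself.

```c
#include <limits.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 4096   /* fallback for systems that do not define it */
#endif

static int work_calls;                         /* call counter, for demonstration */
static void count_call(void) { work_calls++; }

/* Runs work() inside `dir`, then returns to the directory that was current
 * on entry. Returns 0 on success and -1 if any directory operation fails. */
static int run_in_directory(const char *dir, void (*work)(void))
{
  char saved[PATH_MAX];

  if (getcwd(saved, sizeof saved) == NULL) {
    return -1;
  }
  if (chdir(dir) != 0) {
    return -1;
  }
  work();
  return chdir(saved) == 0 ? 0 : -1;
}
```

For example, `run_in_directory("/", count_call)` invokes the callback once with "/" as the working directory and then returns to the original directory.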

Todo:

CDB - Some initializations happen here for no particular reason

We should filter some names out like the shellscript does. For instance, word-spell-dictionary files will score high but will likely NOT contain a license. But the shellscript filters these names AFTER they're already scanned. Think about it.

BUG: When _FTYP_POSTSCR is "(postscript|utf-8 unicode)", the resulting license-parse yields 'NoLicenseFound' but when both "postscript" and "utf-8 unicode" are searched independently, parsing definitely finds quantifiable licenses. WHY?

Definition at line 982 of file licenses.c.

◆ scanForKeywordsAndSetScore()

void scanForKeywordsAndSetScore	(	scanres_t *	scores,
		list_t *	licenseList
	)

For EACH file, determine if we want to scan it, and if so, scan the candidate files for keywords (to obtain a "score" – the higher the score, the more likely it has a real open source license in it).

There are lots of things that will 'disinterest' us in a file (below).

Parameters

scores
licenseList

Note: This loop is called 400,000 to 500,000 times when parsing a distribution. Little slow-downs ADD UP quickly! Some other part of FOSSology has already decided we want to scan this file, so we need to look into removing this file-scoring stuff.

Todo: We don't currently use _UTIL_FILTER, which is set up to exclude some files by filename.

Definition at line 594 of file licenses.c.

◆ scoreCompare()

static int scoreCompare	(	const void *	arg1,
		const void *	arg2
	)

static

Compare two scores.

Returns:
-1 if score1 > score2
1 if score1 < score2
-1 if fullpath1 != NULL and fullpath2 == NULL
1 if fullpath1 == NULL and fullpath2 != NULL
Otherwise, the result of a string comparison of the two fullpath values

Note: this procedure is a qsort callback that provides a REVERSE integer sort (highest to lowest)
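
The ordering rules above can be sketched as a qsort() callback. `struct scored_file` is a hypothetical, simplified stand-in for scanres_t, and treating two NULL paths as equal is an assumption, since the page does not cover that case.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for scanres_t. */
struct scored_file {
  int score;
  const char *fullpath;   /* may be NULL */
};

/* Reverse sort on score (highest first), then non-NULL paths before NULL
 * paths, then alphabetical order of the paths. */
static int score_compare_sketch(const void *arg1, const void *arg2)
{
  const struct scored_file *s1 = arg1;
  const struct scored_file *s2 = arg2;

  if (s1->score > s2->score) return -1;
  if (s1->score < s2->score) return 1;
  if (s1->fullpath != NULL && s2->fullpath == NULL) return -1;
  if (s1->fullpath == NULL && s2->fullpath != NULL) return 1;
  if (s1->fullpath == NULL) return 0;     /* both NULL: assumed equal */
  return strcmp(s1->fullpath, s2->fullpath);
}
```

Returning -1 for the larger score is what turns qsort()'s ascending order into a highest-to-lowest result.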

Definition at line 809 of file licenses.c.

◆ searchStrategy()

static int searchStrategy	(	int	index,
		char *	regex,
		int	aboveCalc
	)

static

Note: This function should be called BEFORE the wild-card specifier =ANY= is converted to a REAL regex ".*" (e.g., before fixSearchString())!

ASSUME a "standard line-length" of 50 characters/bytes. That's likely too small, but err on the side of being too conservative.

Determining the number of text-lines ABOVE involves finding out how far into the 'license footprint' the seed-word resides. ASSUME a standard line-length of 50 (probably too small, but we'll err on the side of being too conservative). If the seed isn't IN the regex, assume a generally-bad worst-case and search 2-3 lines above.

Determining the number of text-lines BELOW involves finding out how long the 'license footprint' actually is, plus adding some fudge based on the number of wild-cards in the footprint.

Parameters

index	License index from Strings.in
regex	regex to match for
aboveCalc	Set to look above
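
The above/below estimates can be pictured with the arithmetic below. The macro values are the ones listed on this page, but the formulas themselves are assumptions made for illustration; the helper names are hypothetical and do not mirror searchStrategy()'s signature.

```c
#include <string.h>

#define LINE_BYTES 50   /* values taken from this page */
#define WC_BYTES   30
#define PUNT_LINES 3
#define MIN_LINES  1

/* Counts non-overlapping occurrences of the =ANY= wild-card specifier. */
static int count_wildcards(const char *regex)
{
  int n = 0;

  for (const char *p = regex; (p = strstr(p, "=ANY=")) != NULL; p += 5) {
    n++;
  }
  return n;
}

/* Lines to keep ABOVE the seed: how far into the footprint the seed sits. */
static int lines_above(const char *regex, const char *seed)
{
  const char *hit = strstr(regex, seed);

  if (hit == NULL) {
    return PUNT_LINES;              /* seed not in the regex: guess */
  }
  return (int) ((hit - regex) / LINE_BYTES) + MIN_LINES;
}

/* Lines to keep BELOW: footprint length plus fudge for each wild-card. */
static int lines_below(const char *regex)
{
  size_t bytes = strlen(regex) + (size_t) count_wildcards(regex) * WC_BYTES;

  return (int) (bytes / LINE_BYTES) + MIN_LINES;
}
```

Counting =ANY= literally is why the note above matters: once the specifier has been rewritten to ".*", the wild-cards can no longer be told apart from other regex syntax.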

Definition at line 273 of file licenses.c.
