FOSSology
4.4.0
Open Source License Compliance by Open Source Software
Utilities to scan, score, and save found license data.
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <time.h>
#include <signal.h>
#include <libgen.h>
#include <limits.h>
#include <stdlib.h>
#include "nomos.h"
#include "licenses.h"
#include "nomos_utils.h"
#include "util.h"
#include "list.h"
#include "nomos_regex.h"
#include "parse.h"
#include <_autodefs.h>
Macros | |
#define | _GNU_SOURCE |
#define | HASHES "#####################" |
#define | DEBCPYRIGHT "debian/copyright" |
#define | MAX(a, b) ((a) > (b) ? a : b) |
Max of two. | |
#define | MIN(a, b) ((a) < (b) ? a : b) |
Min of two. | |
#define | LINE_BYTES 50 |
#define | LINE_WORDS 8 |
#define | WC_BYTES 30 |
#define | WC_WORDS 3 |
#define | PUNT_LINES 3 |
#define | MIN_LINES 1 |
Functions | |
static void | makeLicenseSummary (list_t *l, int highScore, char *target, int size) |
Construct a 'computed license'. Wherever possible, leave off the entries for None and LikelyNot; those are individual-file results and we're making an 'aggregate summary' here. | |
static void | noLicenseFound () |
Mark current scan as LS_NOSUM (No_license_found). | |
static int | searchStrategy (int, char *, int) |
static void | saveLicenseData (scanres_t *scores, int nCand, int nElem, int lowWater) |
Save/create all the license data in a specific (temporary?) directory. | |
static int | scoreCompare (const void *arg1, const void *arg2) |
Compare two scores. | |
static void | printHighlightInfo (GArray *keyWords, GArray *theMatches) |
Print highlight info about matches. | |
void | licenseInit () |
License initialization. | |
char * | createRelativePath (item_t *p, scanres_t *scp) |
void | scanForKeywordsAndSetScore (scanres_t *scores, list_t *licenseList) |
void | relaxScoreCriterionForSingleFile (scanres_t *scores) |
Reset the score to 1 if it is 0. | |
int | fiterResultsOfKeywordScan (int lowWater, scanres_t *scores, int nFiles) |
Run through the list once more. | |
void | licenseScan (list_t *licenseList) |
Scan the list for license(s). | |
static void | printKeyWordMatches (scanres_t *scores, int idx) |
Prints keywords match to STDOUT. | |
static gint | compare_integer (gconstpointer a, gconstpointer b) |
Compare two integers. | |
static void | rescanOriginalTextForFoundLicences (char *textp, int isFileMarkupLanguage, int isPS) |
Rescan original content for the licenses already found. | |
Variables | |
static char | any [6] |
static char | some [7] |
static char | few [6] |
static char | year [7] |
utilities to scan, score and save license found data
Definition in file licenses.c.
#define LINE_BYTES 50
fudge for punctuation, etc.
Definition at line 245 of file licenses.c.
#define LINE_WORDS 8
assume this many words per line
Definition at line 246 of file licenses.c.
#define MIN_LINES 1
normal minimum-extra-lines
Definition at line 250 of file licenses.c.
#define PUNT_LINES 3
if "dunno", guess this line-count
Definition at line 249 of file licenses.c.
#define WC_BYTES 30
wild-card counts this many bytes
Definition at line 247 of file licenses.c.
#define WC_WORDS 3
wild-card counts this many words
Definition at line 248 of file licenses.c.
static gint compare_integer (gconstpointer a, gconstpointer b)
Compare two integers.
Definition at line 916 of file licenses.c.
int fiterResultsOfKeywordScan (int lowWater, scanres_t *scores, int nFiles)
Run through the list once more.
This time we record and count the license candidates to process. License candidates are determined by either (score >= low) OR matching a set of filename patterns.
lowWater | Lowest score to filter |
scores | Scores to filter |
nFiles | Number of files |
Definition at line 701 of file licenses.c.
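The selection rule described above (score at or above the low-water mark, OR a file name matching a pattern) can be sketched roughly as follows. This is only an illustration: `file_score_t`, `nameLooksLikeLicense()`, `countCandidates()` and the pattern list are hypothetical stand-ins, not the actual `scanres_t` layout or the patterns nomos uses.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for scanres_t. */
typedef struct {
    const char *path;   /* file name being considered */
    int score;          /* keyword score from the earlier scan */
    int isCandidate;    /* set to 1 when the file is selected */
} file_score_t;

/* Assumed file-name patterns; the real list is not given on this page. */
static int nameLooksLikeLicense(const char *path)
{
    static const char *patterns[] = { "COPYING", "LICENSE", "copyright" };
    size_t i;

    for (i = 0; i < sizeof(patterns) / sizeof(patterns[0]); i++) {
        if (strstr(path, patterns[i]) != NULL) {
            return 1;
        }
    }
    return 0;
}

/* Mark and count candidates: (score >= lowWater) OR name matches a pattern. */
int countCandidates(int lowWater, file_score_t *scores, int nFiles)
{
    int i;
    int nCand = 0;

    for (i = 0; i < nFiles; i++) {
        scores[i].isCandidate = (scores[i].score >= lowWater)
                             || nameLooksLikeLicense(scores[i].path);
        if (scores[i].isCandidate) {
            nCand++;
        }
    }
    return nCand;
}
```

With a low-water mark of 3, a file scoring 5 is selected on score alone, while a file named debian/copyright is selected even with a score of 0.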
void licenseInit ()
License initialization.
Examine the search strings in licSpec looking for 3 corner-cases to optimize all the regex-searches we'll be making:
Step 1, copy the tseed "search seed", decrypt it, and munge any wild-cards in the string. Note that once we eliminate the compile-time string encryption, we could re-use the same exact data. In fact, some day (in our copious spare time), we could effectively remove licSpec.
Step 2, add the search-seed to the search-cache
Step 3, handle special cases of NULL seeds and (regex == seed)
Step 4, decrypt and fix the regex (since seed != regex here). Once we have all that, searchStrategy() helps determine how many lines above and below [the seed] to save – see findPhrase() for details.
Now that we've computed the above- and below-values for license searches, set each of the appropriate entries with the MAX values determined. Limit 'above' values to 3 and 'below' values to 6.
QUESTION: the above has worked in the past - is it STILL valid?
Finally (if enabled), compare each of the search strings to see if there are duplicates, and determine if some of the regexes can be searched via strstr() (instead of its slower-but-more-functional regex brethren).
Definition at line 70 of file licenses.c.
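Two of the ideas above lend themselves to a short sketch: capping the computed 'above' and 'below' line counts at 3 and 6, and deciding whether a search string is plain enough for strstr(). The helper names `clampLines()` and `isPlainString()` and the macros `MAX_ABOVE` and `MAX_BELOW` are hypothetical; the real licenseInit() may decide this differently, and the sketch assumes case-sensitive matching.

```c
#include <string.h>

#define MAX_ABOVE 3   /* limit for 'above' values, as stated in the text */
#define MAX_BELOW 6   /* limit for 'below' values, as stated in the text */

/* Clamp a computed line count into the range [0, limit]. */
int clampLines(int value, int limit)
{
    if (value < 0) {
        return 0;
    }
    return (value > limit) ? limit : value;
}

/*
 * Return 1 if the pattern contains no regex metacharacters, in which case
 * a case-sensitive strstr() finds the same matches as the regex would.
 * Assumption: POSIX extended regex syntax.
 */
int isPlainString(const char *pattern)
{
    return pattern[strcspn(pattern, ".^$*+?()[]{}|\\")] == '\0';
}
```

A pattern such as "licen[cs]e" still needs the regex engine, whereas a fixed phrase with no metacharacters can take the cheaper strstr() path.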
void licenseScan (list_t *licenseList)
Scan the list for license(s).
This routine takes a list, but in FOSSology we always pass in a single file.
Set up defaults for the minimum-scores for which we'll save files. Try to ensure a minimum # of license files will be recorded for this source/package (try, don't force it too hard); see if lower scores yield a better fit, but recognize that the likelihood of recording a non-license file increases as we lower the bar.
Definition at line 747 of file licenses.c.
static void makeLicenseSummary (list_t *l, int highScore, char *target, int size)
Construct a 'computed license'. Wherever possible, leave off the entries for None and LikelyNot; those are individual-file results and we're making an 'aggregate summary' here.
parseLicenses() added license components found, as long as they were considered "interesting" to some extent. Components of significant interest had their iFlag set to 1; those of lower-interest were set to 0. In this way we can tier license components into 4 distinct levels: 'interesting', 'medium interest', 'nothing significant', and 'Zero'.
==> If the list is EMPTY, there's nothing, period.
==> If listCount() returns non-zero, "interesting" stuff is in it and we can safely ignore things of 'significantly less interest'.
==> If neither of these is the case, only the licenses of the above 'significantly less interest' category exist (don't ignore them).
We need to be VERY careful in this routine about the length of the license-summary created; they COULD be indefinitely long! For now, just check to see if we're going to overrun the buffer...
Definition at line 1264 of file licenses.c.
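The overrun check mentioned above could look something like the following bounded append. `appendLicenseName()` is a hypothetical helper written for illustration, and the comma separator is an assumption; neither is taken from licenses.c.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append 'name' to the summary in 'target' (capacity 'size' bytes,
 * including the terminating NUL), separating entries with a comma.
 * Returns 0 on success, or -1 (leaving 'target' untouched) if the
 * result would not fit.
 */
int appendLicenseName(char *target, size_t size, const char *name)
{
    size_t used = strlen(target);
    size_t nameLen = strlen(name);
    size_t need = nameLen + (used > 0 ? 1 : 0);   /* +1 for the separator */

    if (used + need + 1 > size) {                 /* +1 for the NUL */
        return -1;
    }
    if (used > 0) {
        target[used++] = ',';
    }
    memcpy(target + used, name, nameLen + 1);     /* copies the NUL too */
    return 0;
}
```

Checking the required length before writing anything means a summary that is too long is truncated at an entry boundary rather than in the middle of a license name.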
static void printHighlightInfo (GArray *keyWords, GArray *theMatches)
Print highlight info about matches.
This function prints to STDOUT only if OPTS_HIGHLIGHT_STDOUT is set.
Format:
Keyword at <start>, length <length>, index = 0,
License #<name># at <start>, length <length>, index = <license_index>,
keyWords | Keywords matches |
theMatches | License matches |
Definition at line 854 of file licenses.c.
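As a rough illustration of the output format above, the two line shapes can be produced with snprintf(). The `match_t` struct and both function names are hypothetical; the real function reads its data from GArray containers whose element type is not shown on this page.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical match record, for illustration only. */
typedef struct {
    int start;          /* offset of the match in the text */
    int length;         /* length of the match */
    int index;          /* license index (always printed as 0 for keywords) */
    const char *name;   /* license name; unused for keywords */
} match_t;

/* Format a keyword match: "Keyword at <start>, length <length>, index = 0," */
int formatKeyword(char *out, size_t size, const match_t *m)
{
    return snprintf(out, size, "Keyword at %d, length %d, index = 0,",
                    m->start, m->length);
}

/* Format a license match:
 * "License #<name># at <start>, length <length>, index = <license_index>," */
int formatLicense(char *out, size_t size, const match_t *m)
{
    return snprintf(out, size, "License #%s# at %d, length %d, index = %d,",
                    m->name, m->start, m->length, m->index);
}
```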
void relaxScoreCriterionForSingleFile (scanres_t *scores)
Reset scores to 1 if it is 0.
If we were invoked with a single-file-only option, just override the score calculation: give the file any greater-than-zero score so it appears as a valid candidate. This is important when the file to be evaluated has no keywords, yet might contain authorship inferences.
scores |
Definition at line 679 of file licenses.c.
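In its simplest form, the override described above amounts to the following. `single_score_t` and `relaxScore()` are hypothetical names; the real function works on `scanres_t` and also checks the single-file option, which this sketch leaves out.

```c
/* Hypothetical, simplified stand-in for scanres_t. */
typedef struct {
    int score;   /* keyword score for the file */
} single_score_t;

/* Give a zero-scored file the smallest positive score so it stays a candidate. */
void relaxScore(single_score_t *scores)
{
    if (scores->score == 0) {
        scores->score = 1;
    }
}
```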
static void rescanOriginalTextForFoundLicences (char *textp, int isFileMarkupLanguage, int isPS)
Rescan original content for the licenses already found.
textp | Original text string |
isFileMarkupLanguage | Is original text a markup text |
isPS | Is original text a PostScript text |
Definition at line 936 of file licenses.c.
static void saveLicenseData (scanres_t *scores, int nCand, int nElem, int lowWater)
Save/create all the license data in a specific (temporary?) directory.
CDB - Some initializations happen here for no particular reason
We should filter some names out like the shellscript does. For instance, word-spell-dictionary files will score high but will likely NOT contain a license. But the shellscript filters these names AFTER they're already scanned. Think about it.
BUG: When _FTYP_POSTSCR is "(postscript|utf-8 unicode)", the resulting license-parse yields 'NoLicenseFound' but when both "postscript" and "utf-8 unicode" are searched independently, parsing definitely finds quantifiable licenses. WHY?
Definition at line 982 of file licenses.c.
void scanForKeywordsAndSetScore (scanres_t *scores, list_t *licenseList)
For EACH file, determine if we want to scan it, and if so, scan the candidate files for keywords (to obtain a "score" – the higher the score, the more likely it has a real open source license in it).
There are lots of things that will 'disinterest' us in a file (below).
scores | |
licenseList |
Definition at line 594 of file licenses.c.
static int scoreCompare (const void *arg1, const void *arg2)
Compare two scores.
Definition at line 809 of file licenses.c.
static int searchStrategy (int index, char *regex, int aboveCalc)
ASSUME a "standard line-length" of 50 characters/bytes. That's likely too small, but err on the side of being too conservative.
Determining the number of text-lines ABOVE involves finding out how far into the 'license footprint' the seed-word resides. If the seed isn't IN the regex, assume a generally-bad worst-case and search 2-3 lines above.
Determining the number of text-lines BELOW involves finding out how long the 'license footprint' actually is, plus adding some fudge based on the number of wild-cards in the footprint.
index | License index from Strings.in |
regex | regex to match for |
aboveCalc | Set to look above |
Definition at line 273 of file licenses.c.
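A minimal sketch of the 'lines BELOW' estimate, using the macro values documented on this page. `estimateLinesBelow()` is a hypothetical function, and treating each ".*" as one wild-card worth WC_BYTES is an assumption; the real searchStrategy() also handles the ABOVE calculation and may count words as well as bytes.

```c
#include <stddef.h>

#define LINE_BYTES 50   /* assumed standard line length */
#define WC_BYTES   30   /* a wild-card counts as this many bytes */
#define PUNT_LINES 3    /* guess when the footprint is unknown */
#define MIN_LINES  1    /* normal minimum of extra lines */

/*
 * Estimate how many lines below the seed to keep: footprint length in
 * bytes (each ".*" counted as WC_BYTES) divided by the assumed line
 * length, plus the minimum fudge.
 */
int estimateLinesBelow(const char *regex)
{
    size_t bytes = 0;
    const char *p;

    if (regex == NULL || *regex == '\0') {
        return PUNT_LINES;                /* "dunno": fall back to a guess */
    }
    for (p = regex; *p != '\0'; p++) {
        if (p[0] == '.' && p[1] == '*') {
            bytes += WC_BYTES;            /* wild-card: add the fudge */
            p++;                          /* skip the '*' */
        } else {
            bytes++;
        }
    }
    return (int)(bytes / LINE_BYTES) + MIN_LINES;
}
```

Under these assumptions a short fixed phrase yields one extra line, and each additional 50 bytes of footprint (or each pair of wild-cards, roughly) adds another.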