Week 7

(July 15, 2025 – July 22, 2025)

Meeting 1

Date: July 15, 2025
Attendees:

Summary

Presented the fixes that complete the frontend part of the project. One of the major enhancements was the implementation of server-side pagination.
Got insights on how to start implementing the new agent that is to be developed.

Progress

Conducted a thorough review of the existing codebase to identify agents with comparable functionality to the new agent.
Discovered that the MONKBULK agent closely aligns with the requirements of the new agent, making it an ideal reference implementation.
Performed an in-depth analysis of the MONKBULK agent, resulting in comprehensive documentation detailing its architecture, logic, and workflow.

MONKBULK Documentation

Overview

The MonkBulk agent is a specialized FOSSology component that performs bulk license scanning operations. It extends the core Monk agent to handle batch processing of license text matching across multiple files within an upload tree structure.

Core Architecture

Main Components

Entry Point: monkbulk.c - Main executable with scheduler integration
Database Layer: database.c - PostgreSQL interactions and queries
Text Processing: string_operations.c - Tokenization with custom delimiters
File Operations: file_operations.c - File I/O and content reading
License Processing: license.c - License text handling and extraction
Pattern Matching: match.c - Core matching algorithms
Shared Headers: monk.h, monkbulk.h - Data structures and constants

Key Data Structures

// Individual license actions (declared first because BulkArguments refers to it)
typedef struct {
  long licenseId;
  int removing; // 1 = removing, 0 = adding
  char* comment;
  char* reportinfo;
  char* acknowledgement;
} BulkAction;

// Core bulk operation parameters
typedef struct {
  long bulkId;
  long uploadTreeId;
  long uploadTreeLeft, uploadTreeRight;
  long licenseId;
  int uploadId, jobId, userId, groupId;
  char* refText;
  char* delimiters;
  bool ignoreIrre;
  bool scanFindings;
  BulkAction** actions;
} BulkArguments;

// Agent state management
typedef struct {
  fo_dbManager* dbManager;
  int agentId;
  int scanMode; // MODE_BULK = 3
  int verbosity;
  bool ignoreFilesWithMimeType;
  void* ptr; // Points to BulkArguments
} MonkState;
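To illustrate how the generic ptr field ties the agent state to the bulk parameters, here is a minimal sketch that casts ptr back to BulkArguments and walks the action list. The structures are trimmed-down copies of the ones above, count_actions is a hypothetical helper, and the sketch assumes the actions array is terminated by a NULL entry, which should be verified against the MONKBULK sources.

```c
#include <assert.h>
#include <stddef.h>

#define MODE_BULK 3

/* Trimmed-down copies of the structures above; only the fields used here. */
typedef struct {
  long licenseId;
  int removing; /* 1 = removing, 0 = adding */
} BulkAction;

typedef struct {
  long bulkId;
  BulkAction** actions; /* assumption: terminated by a NULL entry */
} BulkArguments;

typedef struct {
  int scanMode;
  void* ptr; /* points to BulkArguments when scanMode is MODE_BULK */
} MonkState;

/* Hypothetical helper: counts how many actions add and how many remove a license. */
static void count_actions(const MonkState* state, size_t* adding, size_t* removing) {
  assert(state != NULL && adding != NULL && removing != NULL);
  *adding = 0;
  *removing = 0;

  /* The cast is only meaningful in bulk mode. */
  if (state->scanMode != MODE_BULK || state->ptr == NULL) {
    return;
  }

  const BulkArguments* args = (const BulkArguments*) state->ptr;
  if (args->actions == NULL) {
    return;
  }

  for (size_t i = 0; args->actions[i] != NULL; i++) {
    if (args->actions[i]->removing) {
      (*removing)++;
    } else {
      (*adding)++;
    }
  }
}
```

Keeping the mode check next to the cast makes it harder to misinterpret ptr when the same state structure is reused by other scan modes.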

Processing Flow

1. Initialization

Connect to database via FOSSology scheduler
Query agent ID and register with system
Set scan mode to MODE_BULK

2. Job Processing Loop

while (fo_scheduler_next() != NULL) {
  // Parse bulk ID from job parameters
  // Query bulk arguments from database
  // Create ARS (Agent Result Set) entry
  // Execute bulk identification
  // Update ARS with results
  // Clean up resources
}
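The first step of the loop, parsing the bulk ID, could look roughly like the sketch below. It assumes the scheduler hands the job parameter over as a decimal string; parse_bulk_id is a hypothetical helper written for this report, not a function from the FOSSology code base.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: returns the bulk ID encoded in param,
 * or -1 if param is not a positive decimal integer. */
static long parse_bulk_id(const char* param) {
  if (param == NULL) {
    return -1;
  }

  char* end = NULL;
  errno = 0;
  long value = strtol(param, &end, 10);

  /* Reject overflow and inputs without any digits. */
  if (errno != 0 || end == param) {
    return -1;
  }

  /* Tolerate trailing whitespace, but nothing else. */
  while (*end == ' ' || *end == '\t' || *end == '\r' || *end == '\n') {
    end++;
  }
  if (*end != '\0' || value <= 0) {
    return -1;
  }

  return value;
}
```

Rejecting malformed input before touching the database keeps the rest of the loop body free of special cases: a return value of -1 can simply skip the job.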

3. Bulk Identification Process

Parse BulkArguments from database queries
Tokenize reference license text using custom delimiters
Query files within upload tree boundaries (left/right traversal)
Multi-threaded processing with OpenMP
Match files against license patterns
Save results to database via callbacks

Database Integration

Key Tables

license_ref_bulk - Bulk operation parameters
license_set_bulk - License actions for bulk operations
uploadtree - File system tree structure
clearing_event - License clearing decisions
highlight_bulk - Match highlighting information

Critical SQL Queries

-- Bulk parameters retrieval  
SELECT ut.upload_fk, ut.uploadtree_pk, lrb.user_fk, lrb.group_fk,  
lrb.rf_text, lrb.ignore_irrelevant, lrb.bulk_delimiters, lrb.scan_findings  
FROM license_ref_bulk lrb  
INNER JOIN uploadtree ut ON ut.uploadtree_pk = lrb.uploadtree_fk  
WHERE lrb_pk = $1  
  
-- File selection within tree boundaries  
SELECT DISTINCT pfile_fk FROM uploadtree  
WHERE upload_fk = $1 AND (lft BETWEEN $2 AND $3) AND pfile_fk != 0  
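The second query relies on the nested-set layout of uploadtree: every node inside a subtree has an lft value between the lft and rgt of the subtree root. The sketch below mirrors the WHERE clause and the DISTINCT in plain C over an in-memory array. TreeNode, in_subtree and select_pfiles are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for one uploadtree row. */
typedef struct {
  long pfileId; /* 0 means the node has no file content (e.g. a directory) */
  long lft;     /* nested-set left value */
} TreeNode;

/* Mirrors "lft BETWEEN $2 AND $3" (both bounds inclusive). */
static bool in_subtree(long lft, long left, long right) {
  return lft >= left && lft <= right;
}

/* Mirrors "SELECT DISTINCT pfile_fk ... AND pfile_fk != 0".
 * Writes the distinct pfile IDs to out (which must hold at least
 * count entries) and returns how many were written. */
static size_t select_pfiles(const TreeNode* nodes, size_t count,
                            long left, long right, long* out) {
  assert(nodes != NULL && out != NULL);
  size_t found = 0;

  for (size_t i = 0; i < count; i++) {
    if (nodes[i].pfileId == 0 || !in_subtree(nodes[i].lft, left, right)) {
      continue;
    }

    /* Skip pfiles already collected: identical content is scanned once. */
    bool seen = false;
    for (size_t j = 0; j < found; j++) {
      if (out[j] == nodes[i].pfileId) {
        seen = true;
        break;
      }
    }
    if (!seen) {
      out[found++] = nodes[i].pfileId;
    }
  }
  return found;
}
```

The range check is what allows a bulk scan to be restricted to a single folder of an upload without any recursive tree walk.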

Text Processing & Tokenization

Default Configuration

#define DELIMITERS " \t\n\r\f#^%,*"  
#define MAX_ALLOWED_DIFF_LENGTH 256  
#define MIN_ADJACENT_MATCHES 3  
#define MIN_ALLOWED_RANK 66  

Token Structure

typedef struct {
  unsigned int length;
  unsigned int removedBefore;
  uint32_t hashedContent;
} Token;
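A rough sketch of how text might be turned into such tokens is shown below. It assumes that removedBefore counts the delimiter characters skipped in front of a token, and it uses FNV-1a as a stand-in hash. The hash function actually used by Monk may differ, and tokenize_sketch and fnv1a are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  unsigned int length;
  unsigned int removedBefore;
  uint32_t hashedContent;
} Token;

/* FNV-1a, used here only as a stand-in hash (assumption). */
static uint32_t fnv1a(const char* data, size_t len) {
  uint32_t hash = 2166136261u;
  for (size_t i = 0; i < len; i++) {
    hash ^= (unsigned char) data[i];
    hash *= 16777619u;
  }
  return hash;
}

/* Splits text on any character in delimiters and fills out with at most
 * maxTokens tokens. Returns the number of tokens written. */
static size_t tokenize_sketch(const char* text, const char* delimiters,
                              Token* out, size_t maxTokens) {
  assert(text != NULL && delimiters != NULL && out != NULL);
  size_t count = 0;
  const char* p = text;

  while (count < maxTokens) {
    /* Skip the delimiters in front of the next token and remember how many. */
    size_t skipped = strspn(p, delimiters);
    p += skipped;

    size_t len = strcspn(p, delimiters);
    if (len == 0) {
      break; /* reached the end of the text */
    }

    out[count].length = (unsigned int) len;
    out[count].removedBefore = (unsigned int) skipped;
    out[count].hashedContent = fnv1a(p, len);
    count++;
    p += len;
  }
  return count;
}
```

Because only the hash and the lengths are stored, two token sequences can be compared with integer comparisons, and the removedBefore and length fields are enough to map a match back to byte offsets in the original file.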

Tokenization Process

Custom delimiter support via bulk_delimiters field
Escape sequence processing (\n, \t, etc.)
Special handling for comment delimiters (//, /*, */)
Hash-based token comparison for efficiency
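The escape sequence step can be sketched as follows. The assumption is that the bulk_delimiters value arrives with literal backslash sequences such as \n that must be decoded before tokenization. unescape_delimiters is a hypothetical helper, and the exact set of sequences supported by MONKBULK should be checked against its sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly allocated copy of input in which
 * \n, \t, \r, \f and \\ are decoded. Unknown sequences and a trailing
 * lone backslash are copied unchanged. The caller frees the result. */
static char* unescape_delimiters(const char* input) {
  assert(input != NULL);
  size_t len = strlen(input);

  /* The decoded string is never longer than the input. */
  char* out = malloc(len + 1);
  if (out == NULL) {
    return NULL;
  }

  size_t o = 0;
  for (size_t i = 0; i < len; i++) {
    if (input[i] == '\\' && i + 1 < len) {
      i++;
      switch (input[i]) {
        case 'n':  out[o++] = '\n'; break;
        case 't':  out[o++] = '\t'; break;
        case 'r':  out[o++] = '\r'; break;
        case 'f':  out[o++] = '\f'; break;
        case '\\': out[o++] = '\\'; break;
        default: /* keep unknown sequences as they were */
          out[o++] = '\\';
          out[o++] = input[i];
          break;
      }
    } else {
      out[o++] = input[i];
    }
  }
  out[o] = '\0';
  return out;
}
```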

Multi-threading Implementation

OpenMP Integration

#ifdef MONK_MULTI_THREAD
#pragma omp parallel
#endif
{
  // Each thread works on its own copy of the shared state
  MonkState threadLocalStateStore = *state;
  MonkState* threadLocalState = &threadLocalStateStore;
  threadLocalState->dbManager = fo_dbManager_fork(state->dbManager);

#ifdef MONK_MULTI_THREAD
#pragma omp for schedule(dynamic)
#endif
  for (int i = 0; i < resultsCount; i++) {
    // Process files in parallel
  }
}

Thread Safety

Each thread gets isolated database connection
Thread-local state copies prevent race conditions
Shared resources protected by OpenMP directives
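The thread-local copy pattern can be tried out in isolation with the self-contained sketch below. ScanState, scan_file and scan_all are hypothetical stand-ins for MonkState and the real matching code; the "scan" just counts occurrences of one character. Built without OpenMP, the pragmas are skipped and the loop runs sequentially with the same results.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MonkState: what to look for plus a counter
 * that each thread updates on its own copy. */
typedef struct {
  char needle;
  int filesSeen;
} ScanState;

/* Counts occurrences of state->needle in one "file". */
static int scan_file(ScanState* state, const char* content) {
  int hits = 0;
  for (const char* p = content; *p != '\0'; p++) {
    if (*p == state->needle) {
      hits++;
    }
  }
  state->filesSeen++; /* safe: state is private to the calling thread */
  return hits;
}

static void scan_all(const ScanState* shared, const char* const* files,
                     int* results, int count) {
  assert(shared != NULL && files != NULL && results != NULL);
#ifdef _OPENMP
#pragma omp parallel
#endif
  {
    /* Declared inside the parallel region, so every thread gets its own copy. */
    ScanState local = *shared;

#ifdef _OPENMP
#pragma omp for schedule(dynamic)
#endif
    for (int i = 0; i < count; i++) {
      /* Each iteration writes to a distinct slot, so no locking is needed. */
      results[i] = scan_file(&local, files[i]);
    }
  }
}
```

Dynamic scheduling suits this workload because files differ widely in size: idle threads pick up the next file instead of waiting for a fixed share to finish.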

Memory Management

Allocation Patterns

BulkArguments: Dynamic allocation with custom cleanup
Token arrays: GLib GArray structures
String handling: Mix of GLib (g_strdup) and standard C (malloc)

Cleanup Functions

void bulkArguments_contents_free(BulkArguments* bulkArguments);  
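As an illustration of the allocation and cleanup pattern, the sketch below builds an action with heap-allocated strings and releases everything again. It uses only malloc and free; strings created with g_strdup would need g_free instead. The structures are reduced copies, the actions array is assumed to be NULL-terminated, and copy_string, bulk_action_new and bulk_arguments_contents_free_sketch are hypothetical names, not the actual MONKBULK implementation.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
  long licenseId;
  int removing;
  char* comment;
  char* reportinfo;
  char* acknowledgement;
} BulkAction;

typedef struct {
  char* refText;
  char* delimiters;
  BulkAction** actions; /* assumption: terminated by a NULL entry */
} BulkArguments;

/* Returns a malloc'ed copy of text, or NULL for NULL input or on failure. */
static char* copy_string(const char* text) {
  if (text == NULL) {
    return NULL;
  }
  size_t size = strlen(text) + 1;
  char* copy = malloc(size);
  if (copy != NULL) {
    memcpy(copy, text, size);
  }
  return copy;
}

/* Hypothetical constructor for one action; optional strings stay NULL. */
static BulkAction* bulk_action_new(long licenseId, int removing, const char* comment) {
  BulkAction* action = calloc(1, sizeof(BulkAction));
  if (action != NULL) {
    action->licenseId = licenseId;
    action->removing = removing;
    action->comment = copy_string(comment);
  }
  return action;
}

/* Frees everything owned by args, but not args itself. Pointers are reset
 * to NULL so that calling the function twice is harmless. */
static void bulk_arguments_contents_free_sketch(BulkArguments* args) {
  if (args == NULL) {
    return;
  }
  if (args->actions != NULL) {
    for (size_t i = 0; args->actions[i] != NULL; i++) {
      free(args->actions[i]->comment);
      free(args->actions[i]->reportinfo);
      free(args->actions[i]->acknowledgement);
      free(args->actions[i]);
    }
    free(args->actions);
    args->actions = NULL;
  }
  free(args->refText);
  args->refText = NULL;
  free(args->delimiters);
  args->delimiters = NULL;
}
```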

Build System

CMake Configuration

Shared source files with main Monk agent
OpenMP support (-fopenmp)
Large file support (-D_FILE_OFFSET_BITS=64)
Case insensitive matching (-DMONK_CASE_INSENSITIVE)

Dependencies

libfossology - Core FOSSology library
libpq - PostgreSQL client
glib-2.0 - Utility functions
magic - File type detection
OpenMP - Multi-threading

Scheduler Integration

Job Queue Processing

Integrates with FOSSology's job scheduler system
Heartbeat mechanism for progress reporting
ARS (Agent Result Set) tracking for results

Agent Registration

queryAgentId(state, AGENT_BULK_NAME, AGENT_BULK_DESC);  

Result Processing

Match Callback System

bulk_onAllMatches() - Processes matching results
Database transaction management
Clearing event insertion with user context
Highlight information storage

Transaction Handling

ACID compliance for result storage
Rollback on processing errors
Referential integrity maintenance

Configuration Options

Scanning Modes

ignoreIrre - Skip irrelevant files
scanFindings - Process only files with existing findings
Custom delimiter configuration per bulk operation

Performance Tuning

Multi-threading control via OpenMP
Memory limits for token arrays
Database connection pooling

Error Handling

Database Errors

Connection failure handling
Query result validation
Transaction rollback on errors

File System Errors

Permission checking
File access validation
Resource cleanup on failures

Similarities between MONKBULK and the new agent

Both agents scan for exact matches only; partial matches are disregarded.
The overall matching algorithm of the MONKBULK agent is the same as the one required for the new agent.

Dissimilarities between MONKBULK and the new agent

The new agent should use the custom_phrase table instead of the license_ref_bulk table.
Users should be able to trigger the new agent from the uploads page, instead of going to the license page of a particular file as is done for the MONKBULK agent.
