Week 7
(July 15, 2025 – July 22, 2025)
Meeting 1
Date: July 15, 2025
Attendees:
Summary
- Presented the fixes that complete the frontend part of the project. One of the major enhancements was the implementation of server-side pagination.
- Got insights on how to start implementing the new agent that is to be developed.
Progress
- Conducted a thorough review of the existing codebase to identify agents with comparable functionality to the new agent.
- Discovered that the MONKBULK agent closely aligns with the requirements of the new agent, making it an ideal reference implementation.
- Performed an in-depth analysis of the MONKBULK agent, resulting in comprehensive documentation detailing its architecture, logic, and workflow.
MONKBULK Documentation
Overview
The MonkBulk agent is a specialized FOSSology component that performs bulk license scanning operations. It extends the core Monk agent to handle batch processing of license text matching across multiple files within an upload tree structure.
Core Architecture
Main Components
- Entry Point: monkbulk.c - Main executable with scheduler integration
- Database Layer: database.c - PostgreSQL interactions and queries
- Text Processing: string_operations.c - Tokenization with custom delimiters
- File Operations: file_operations.c - File I/O and content reading
- License Processing: license.c - License text handling and extraction
- Pattern Matching: match.c - Core matching algorithms
- Shared Headers: monk.h, monkbulk.h - Data structures and constants
Key Data Structures
// Core bulk operation parameters
typedef struct {
long bulkId;
long uploadTreeId;
long uploadTreeLeft, uploadTreeRight;
long licenseId;
int uploadId, jobId, userId, groupId;
char* refText;
char* delimiters;
bool ignoreIrre;
bool scanFindings;
BulkAction** actions;
} BulkArguments;
// Individual license actions
typedef struct {
long licenseId;
int removing; // 1 = removing, 0 = adding
char* comment;
char* reportinfo;
char* acknowledgement;
} BulkAction;
// Agent state management
typedef struct {
fo_dbManager* dbManager;
int agentId;
int scanMode; // MODE_BULK = 3
int verbosity;
bool ignoreFilesWithMimeType;
void* ptr; // Points to BulkArguments
} MonkState;
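The relationship between the agent state and the bulk parameters can be illustrated with a minimal sketch. The types `BulkArgsSketch` and `StateSketch` and the helper `bulkArgsOf` are hypothetical, trimmed-down stand-ins and not the real Monk definitions; only the `MODE_BULK = 3` value and the idea that `ptr` points to the bulk arguments come from the structures above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODE_BULK 3 /* value taken from the MonkState comment above */

/* Hypothetical, reduced stand-ins for BulkArguments and MonkState. */
typedef struct {
  long bulkId;
  bool ignoreIrre;
} BulkArgsSketch;

typedef struct {
  int scanMode;
  void* ptr; /* expected to point to BulkArgsSketch when scanMode == MODE_BULK */
} StateSketch;

/* Return the bulk arguments behind the untyped pointer, or NULL when the
 * state is not in bulk mode (so the pointer must not be interpreted). */
const BulkArgsSketch* bulkArgsOf(const StateSketch* state) {
  assert(state != NULL);
  if (state->scanMode != MODE_BULK) {
    return NULL;
  }
  return (const BulkArgsSketch*)state->ptr;
}
```

Checking the scan mode before casting is one way to keep the `void*` field safe to use; whether the real agent guards the cast like this is an assumption.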
Processing Flow
1. Initialization
- Connect to database via FOSSology scheduler
- Query agent ID and register with system
- Set scan mode to MODE_BULK
2. Job Processing Loop
while (fo_scheduler_next() != NULL) {
// Parse bulk ID from job parameters
// Query bulk arguments from database
// Create ARS (Agent Result Set) entry
// Execute bulk identification
// Update ARS with results
// Clean up resources
}
3. Bulk Identification Process
- Parse BulkArguments from database queries
- Tokenize reference license text using custom delimiters
- Query files within upload tree boundaries (left/right traversal)
- Multi-threaded processing with OpenMP
- Match files against license patterns
- Save results to database via callbacks
Database Integration
Key Tables
- license_ref_bulk - Bulk operation parameters
- license_set_bulk - License actions for bulk operations
- uploadtree - File system tree structure
- clearing_event - License clearing decisions
- highlight_bulk - Match highlighting information
Critical SQL Queries
-- Bulk parameters retrieval
SELECT ut.upload_fk, ut.uploadtree_pk, lrb.user_fk, lrb.group_fk,
lrb.rf_text, lrb.ignore_irrelevant, lrb.bulk_delimiters, lrb.scan_findings
FROM license_ref_bulk lrb
INNER JOIN uploadtree ut ON ut.uploadtree_pk = lrb.uploadtree_fk
WHERE lrb_pk = $1
-- File selection within tree boundaries
SELECT DISTINCT pfile_fk FROM uploadtree
WHERE upload_fk = $1 AND (lft BETWEEN $2 AND $3) AND pfile_fk != 0
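The file-selection query relies on the nested-set layout of uploadtree: every node inside a subtree has a `lft` value between the subtree root's left and right bounds. The sketch below mirrors that filter in plain C on an in-memory array. `TreeNode`, `inSubtree`, and `countSelectable` are hypothetical names for illustration, and the sketch ignores the `DISTINCT` and `upload_fk` parts of the query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one uploadtree row. */
typedef struct {
  long pfileId; /* pfile_fk; 0 means the node carries no file content */
  long lft;     /* nested-set left bound */
  long rgt;     /* nested-set right bound */
} TreeNode;

/* Same test as "lft BETWEEN $2 AND $3" in the query above. */
bool inSubtree(const TreeNode* node, long left, long right) {
  return node->lft >= left && node->lft <= right;
}

/* Count the rows the WHERE clause would keep for the given bounds. */
size_t countSelectable(const TreeNode* nodes, size_t n, long left, long right) {
  assert(nodes != NULL || n == 0);
  size_t count = 0;
  for (size_t i = 0; i < n; i++) {
    if (nodes[i].pfileId != 0 && inSubtree(&nodes[i], left, right)) {
      count++;
    }
  }
  return count;
}
```

Because containment is a simple range comparison, a whole directory can be selected without walking the tree recursively.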
Text Processing & Tokenization
Default Configuration
#define DELIMITERS " \t\n\r\f#^%,*"
#define MAX_ALLOWED_DIFF_LENGTH 256
#define MIN_ADJACENT_MATCHES 3
#define MIN_ALLOWED_RANK 66
Token Structure
typedef struct {
unsigned int length;
unsigned int removedBefore;
uint32_t hashedContent;
} Token;
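As a rough illustration of how the three Token fields could be filled, the sketch below splits a string on a delimiter set and records, for each token, its length, the number of delimiter characters skipped before it, and a hash of its content. The function names `hashBytes`, `tokenizeSketch`, and `tokenEqualsSketch` are hypothetical, and the FNV-1a hash is chosen only for the example; it is not necessarily the hash Monk uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  unsigned int length;
  unsigned int removedBefore;
  uint32_t hashedContent;
} Token;

/* FNV-1a over len bytes; an illustrative choice, not Monk's actual hash. */
uint32_t hashBytes(const char* s, size_t len) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < len; i++) {
    h ^= (unsigned char)s[i];
    h *= 16777619u;
  }
  return h;
}

/* Split text on any character in delimiters and write at most maxTokens
 * tokens into out. Returns the number of tokens written. */
size_t tokenizeSketch(const char* text, const char* delimiters,
                      Token* out, size_t maxTokens) {
  assert(text != NULL && delimiters != NULL && out != NULL);
  size_t count = 0;
  size_t pos = 0;
  size_t textLen = strlen(text);
  while (pos < textLen && count < maxTokens) {
    size_t removed = strspn(text + pos, delimiters); /* skipped delimiters */
    pos += removed;
    size_t len = strcspn(text + pos, delimiters);    /* token characters */
    if (len == 0) {
      break; /* only trailing delimiters were left */
    }
    out[count].length = (unsigned int)len;
    out[count].removedBefore = (unsigned int)removed;
    out[count].hashedContent = hashBytes(text + pos, len);
    count++;
    pos += len;
  }
  return count;
}

/* Compare two tokens by length and hash instead of by their characters. */
int tokenEqualsSketch(const Token* a, const Token* b) {
  return a->length == b->length && a->hashedContent == b->hashedContent;
}
```

Storing `removedBefore` lets a matcher map token positions back to byte offsets in the original file, which is what highlighting needs.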
Tokenization Process
- Custom delimiter support via bulk_delimiters field
- Escape sequence processing (\n, \t, etc.)
- Special handling for comment delimiters (//, /*, */)
- Hash-based token comparison for efficiency
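Escape sequence processing matters because the delimiter string arrives as text, so a user who types a backslash followed by `n` means a newline character. A minimal sketch of such decoding is shown below. `unescapeInPlace` is a hypothetical helper, and the set of supported escapes and the choice to keep unknown escapes verbatim are assumptions rather than the agent's documented behavior.

```c
#include <assert.h>
#include <stddef.h>

/* Decode \n, \t, \r, \f and \\ in place. Unknown escapes are copied
 * unchanged. Returns the length of the decoded string. */
size_t unescapeInPlace(char* s) {
  assert(s != NULL);
  size_t r = 0; /* read index */
  size_t w = 0; /* write index, never ahead of r */
  while (s[r] != '\0') {
    if (s[r] == '\\' && s[r + 1] != '\0') {
      char decoded = '\0';
      switch (s[r + 1]) {
        case 'n':  decoded = '\n'; break;
        case 't':  decoded = '\t'; break;
        case 'r':  decoded = '\r'; break;
        case 'f':  decoded = '\f'; break;
        case '\\': decoded = '\\'; break;
        default:   break; /* unknown escape: fall through and copy as-is */
      }
      if (decoded != '\0') {
        s[w++] = decoded;
        r += 2;
        continue;
      }
    }
    s[w++] = s[r++];
  }
  s[w] = '\0';
  return w;
}
```

Decoding in place works because the output is never longer than the input.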
Multi-threading Implementation
OpenMP Integration
#ifdef MONK_MULTI_THREAD
#pragma omp parallel
#endif
{
MonkState* threadLocalState = &threadLocalStateStore;
threadLocalState->dbManager = fo_dbManager_fork(state->dbManager);
#pragma omp for schedule(dynamic)
for (int i = 0; i < resultsCount; i++) {
// Process files in parallel
}
}
Thread Safety
- Each thread gets isolated database connection
- Thread-local state copies prevent race conditions
- Shared resources protected by OpenMP directives
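The thread-local pattern can be sketched without any FOSSology dependencies. In the example below, each thread owns a `WorkerState` declared inside the parallel region, and the only shared value is a counter combined through an OpenMP reduction. `WorkerState`, `scanFileSketch`, and `scanAllSketch` are hypothetical names, and the "scan" is a placeholder; the real agent forks a database connection per thread instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread state; stands in for the forked DB connection. */
typedef struct {
  long filesProcessed;
} WorkerState;

/* Placeholder for scanning one file: reports a "match" for even ids. */
int scanFileSketch(WorkerState* state, long pfileId) {
  state->filesProcessed++;
  return (pfileId % 2 == 0) ? 1 : 0;
}

/* Scan all files, in parallel when compiled with OpenMP support. */
long scanAllSketch(const long* pfileIds, int count) {
  assert(pfileIds != NULL || count == 0);
  long totalMatches = 0;
#ifdef _OPENMP
#pragma omp parallel reduction(+ : totalMatches)
#endif
  {
    /* Declared inside the region, so every thread gets its own copy. */
    WorkerState local = {0};
#ifdef _OPENMP
#pragma omp for schedule(dynamic)
#endif
    for (int i = 0; i < count; i++) {
      totalMatches += scanFileSketch(&local, pfileIds[i]);
    }
  }
  return totalMatches;
}
```

The sketch gives the same result with or without `-fopenmp`, since the pragmas are guarded by `_OPENMP` and the reduction keeps the shared counter free of races.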
Memory Management
Allocation Patterns
- BulkArguments: Dynamic allocation with custom cleanup
- Token arrays: GLib GArray structures
- String handling: Mix of GLib (g_strdup) and standard C (malloc)
Cleanup Functions
void bulkArguments_contents_free(BulkArguments* bulkArguments);
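A contents-free function of this kind typically releases every owned member but not the struct itself. The sketch below shows the idea on reduced, hypothetical types (`ActionSketch`, `ArgsSketch`, `dupString`, `argsSketch_contents_free`). It assumes a NULL-terminated actions array and plain `malloc` allocations throughout; in the real agent, memory obtained from `g_strdup` should be released with `g_free`.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-ins for BulkAction and BulkArguments. */
typedef struct {
  long licenseId;
  char* comment;
} ActionSketch;

typedef struct {
  char* refText;
  char* delimiters;
  ActionSketch** actions; /* assumed to be a NULL-terminated array */
} ArgsSketch;

/* Portable replacement for strdup; returns NULL on allocation failure. */
char* dupString(const char* s) {
  assert(s != NULL);
  size_t size = strlen(s) + 1;
  char* copy = malloc(size);
  if (copy != NULL) {
    memcpy(copy, s, size);
  }
  return copy;
}

/* Free everything the struct owns and reset the pointers, so that a
 * second call is harmless. The struct itself is left to the caller. */
void argsSketch_contents_free(ArgsSketch* args) {
  if (args == NULL) {
    return;
  }
  if (args->actions != NULL) {
    for (ActionSketch** a = args->actions; *a != NULL; a++) {
      free((*a)->comment);
      free(*a);
    }
    free(args->actions);
  }
  free(args->refText);
  free(args->delimiters);
  args->actions = NULL;
  args->refText = NULL;
  args->delimiters = NULL;
}
```

Resetting the pointers after freeing is a defensive choice for the sketch; it makes accidental double cleanup on an error path safe.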
Build System
CMake Configuration
- Shared source files with main Monk agent
- OpenMP support (-fopenmp)
- Large file support (-D_FILE_OFFSET_BITS=64)
- Case insensitive matching (-DMONK_CASE_INSENSITIVE)
Dependencies
- libfossology - Core FOSSology library
- libpq - PostgreSQL client
- glib-2.0 - Utility functions
- magic - File type detection
- OpenMP - Multi-threading
Scheduler Integration
Job Queue Processing
- Integrates with FOSSology's job scheduler system
- Heartbeat mechanism for progress reporting
- ARS (Agent Result Set) tracking for results
Agent Registration
queryAgentId(state, AGENT_BULK_NAME, AGENT_BULK_DESC);
Result Processing
Match Callback System
- bulk_onAllMatches() - Processes matching results
- Database transaction management
- Clearing event insertion with user context
- Highlight information storage
Transaction Handling
- ACID compliance for result storage
- Rollback on processing errors
- Referential integrity maintenance
Configuration Options
Scanning Modes
- ignoreIrre - Skip irrelevant files
- scanFindings - Process only files with existing findings
- Custom delimiter configuration per bulk operation
Performance Tuning
- Multi-threading control via OpenMP
- Memory limits for token arrays
- Database connection pooling
Error Handling
Database Errors
- Connection failure handling
- Query result validation
- Transaction rollback on errors
File System Errors
- Permission checking
- File access validation
- Resource cleanup on failures
Similarities between MONKBULK and the new agent
- Both agents scan for exact matches only; partial matches are disregarded.
- The overall matching algorithm of the MONKBULK agent is the same as that of the new agent.
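To make "exact match" concrete, the sketch below looks for a contiguous run of token hashes from the reference text inside the token hashes of a file. This is a naive search written for illustration under the assumption that both inputs are already tokenized and hashed; `findExactSequence` is a hypothetical helper and not the matching routine that MONKBULK actually implements in match.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first position in haystack where all needleLen
 * hashes of needle occur contiguously, or -1 if there is none. */
long findExactSequence(const uint32_t* haystack, size_t hayLen,
                       const uint32_t* needle, size_t needleLen) {
  assert(haystack != NULL && needle != NULL);
  if (needleLen == 0 || needleLen > hayLen) {
    return -1;
  }
  for (size_t start = 0; start + needleLen <= hayLen; start++) {
    size_t k = 0;
    while (k < needleLen && haystack[start + k] == needle[k]) {
      k++;
    }
    if (k == needleLen) {
      return (long)start; /* every token matched: exact hit */
    }
  }
  return -1; /* partial overlaps are not reported */
}
```

A file that contains only part of the reference phrase yields -1 here, which corresponds to partial matches being disregarded.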
Dissimilarities between MONKBULK and the new agent
- The new agent should use the custom_phrase table instead of the license_ref_bulk table.
- Users should be able to trigger the new agent from the uploads page, instead of going to the license page of a particular file as is done for the MONKBULK agent.