Week 3

(June 17, 2025 - June 18, 2025)

Meeting 1

(June 18, 2025)

Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection.
Compared MinHash (Jaccard-based) vs SimHash (cosine-based) algorithms.
Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers).
Discussed handling large-scale corpora with caching and sampling strategies.
Mentors proposed a 3-step architecture for Atarashi involving:
1. Initial keyword detection using STRINGS.in.
2. License prediction via LSH-based classifier.
3. Final license verification for correctness.
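
The three steps can be read as a short-circuiting pipeline. The sketch below is only an illustration of that control flow: the function name detect_license and the three stage callables are hypothetical, and the choice to stop early when a stage finds nothing is an assumption, not something decided in the meeting.

```python
from typing import Callable, Optional

def detect_license(
    text: str,
    find_keywords: Callable[[str], bool],      # stage 1 (hypothetical)
    predict: Callable[[str], Optional[str]],   # stage 2 (hypothetical)
    verify: Callable[[str, str], bool],        # stage 3 (hypothetical)
) -> Optional[str]:
    """Run the three stages in order and return a license name or None."""
    # Stage 1: cheap keyword check; assumed to gate the later stages.
    if not find_keywords(text):
        return None
    # Stage 2: LSH-based prediction of the most likely license.
    candidate = predict(text)
    if candidate is None:
        return None
    # Stage 3: keep the prediction only if verification confirms it.
    return candidate if verify(text, candidate) else None
```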

Initial implementation using MinHash with character shingles and Jaccard similarity yielded poor results: it lacked robustness against paraphrased or partial text.
Switched to SimHash, which is more suitable for high-dimensional dense vector spaces and performs well with cosine similarity.
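
For reference, a minimal standard-library sketch of the MinHash baseline described above. The shingle size k=5, the 64 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions made for illustration; the actual experiment may have used different settings or a library.

```python
import hashlib

def shingles(text, k=5):
    """Set of overlapping character k-grams of lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def _hash(shingle, seed):
    # One 64-bit hash function per seed, built from salted BLAKE2b.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(shingle_set, num_perm=64):
    """Signature = minimum hash value of the set under each hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Because shingles are purely lexical, two texts that express the same grant in different words share almost no shingles, which is consistent with the weak results on paraphrased text.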

SimHash works by projecting high-dimensional vectors into binary hash codes based on weighted sign projections.
Vectors that are closer in cosine distance map to hash codes with small Hamming distances.
Enables fast similarity search using hash buckets, significantly reducing lookup time.
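
A minimal sketch of this idea, assuming the random-hyperplane formulation of SimHash (each bit is the sign of the vector's dot product with a random Gaussian direction). The notes do not say which variant was used, so the function names, the 16-bit code length, and the single-table bucket index below are illustrative assumptions.

```python
import random
from collections import defaultdict

def make_hyperplanes(dim, num_bits=16, seed=0):
    """Draw num_bits random Gaussian directions of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def simhash(vector, hyperplanes):
    """Pack the sign of each projection into one integer hash code."""
    code = 0
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code

def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")

def build_index(items, hyperplanes):
    """Map hash code -> list of item ids; items is a dict of id -> vector."""
    buckets = defaultdict(list)
    for item_id, vec in items.items():
        buckets[simhash(vec, hyperplanes)].append(item_id)
    return buckets
```

Only the direction of a vector affects its code, so scaled copies collide in the same bucket and opposite vectors differ in every bit. A practical index would typically use several tables or bands rather than one exact-match table; that is omitted here for brevity.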

TF-IDF Vectorizer was initially used but produced sparse vectors, which worked poorly with SimHash.
Transitioned to sentence-transformers, using the all-MiniLM-L6-v2 model, which generates dense sentence embeddings suitable for SimHash.

Implemented caching to avoid repeated vector generation for the same files.
Due to dataset size (~162k files), limited vectorization to a representative subset of 10,000 files for faster experimentation.
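
A rough sketch of both strategies using only the standard library. The class EmbeddingCache, the function sample_files, the in-memory dict, and the seeded uniform sampling are all assumptions for illustration; the notes do not say how the cache was stored or how the representative subset was chosen.

```python
import hashlib
import random

class EmbeddingCache:
    """In-memory cache keyed by a SHA-256 hash of the file content."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn   # any function mapping text -> vector
        self._store = {}
        self.misses = 0             # how many times embed_fn was actually called

    def get(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(text)
        return self._store[key]

def sample_files(paths, k=10_000, seed=42):
    """Reproducible uniform sample of at most k paths."""
    paths = sorted(paths)           # fixed order so the seed fully determines the sample
    if len(paths) <= k:
        return paths
    return random.Random(seed).sample(paths, k)
```

Keying on content rather than path means duplicate files are embedded only once. Uniform sampling will under-represent rare licenses, so a stratified sample per license may be a better fit if labels are available.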

Combined all Minerva files into a single corpus and indexed using SimHash-based LSH.
Indexed 10,000 sample files, including:
- 46 unique licenses (out of 654 total)
- 20 known non-license texts
- Total queries: 674 (654 licenses + 20 non-license texts).

Key Metrics:

Metric	Value
Indexed licenses	46 / 654
Correctly retrieved licenses	All 46
Correctly rejected non-license text	20 / 674
Detected unseen licenses (not indexed)	203 / 608
Indexed file subset	10,000 / 162,833
Overall trend	Positive performance despite limited indexing
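
The ratios in the table can be turned into percentages as follows. This only restates the reported counts; the helper name rate is hypothetical.

```python
def rate(hits, total):
    """Percentage of hits out of total."""
    return 100.0 * hits / total

unseen_licenses = 654 - 46                        # licenses not in the index: 608
indexed_license_coverage = rate(46, 654)          # about 7.0 %
unseen_detection_rate = rate(203, 608)            # about 33.4 %
indexed_file_fraction = rate(10_000, 162_833)     # about 6.1 %

print(f"License coverage in index: {indexed_license_coverage:.1f}%")
print(f"Unseen licenses detected:  {unseen_detection_rate:.1f}%")
print(f"Files indexed:             {indexed_file_fraction:.1f}%")
```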

TF-IDF vectors were too sparse, reducing effectiveness with SimHash.
Limited indexing (only 10k files) restricts generalization for rare licenses.
Not all licenses were present in the indexed corpus; broader coverage is needed.
False negatives among non-license texts indicate further tuning is required.
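
One possible way to approach the tuning is to sweep the acceptance threshold over a labeled query set and pick the value with the best F1 score. The sketch below assumes each query has been reduced to a pair (Hamming distance to its best match, whether it really is a license); the function names and the choice of F1 as the criterion are illustrative, not the project's settled method.

```python
def evaluate(threshold, scored):
    """Precision, recall and F1 when predicting 'license' for distance <= threshold.

    scored: list of (hamming_distance_to_best_match, is_license) pairs.
    """
    tp = sum(1 for d, y in scored if d <= threshold and y)
    fp = sum(1 for d, y in scored if d <= threshold and not y)
    fn = sum(1 for d, y in scored if d > threshold and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(scored, candidates):
    """Smallest candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: evaluate(t, scored)[2])
```

A tighter threshold rejects more non-license text but also misses more licenses that are absent from the index, so the sweep makes that trade-off visible.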

Start working on Stage 1 of the proposed pipeline:
- Extract and match keywords from STRINGS.in (used by Nomos) to identify candidate license regions.
Expand indexed dataset to include more diverse license types.
Improve non-license detection rate through better negative sampling and filtering.
Continue tuning SimHash and embedding-based search thresholds.
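
As a starting point for the Stage 1 keyword step, a minimal sketch of locating candidate license regions by keyword. The keyword list here is a hypothetical stand-in: the real patterns would be loaded from STRINGS.in, whose parsing is not shown, and the one-line context window is an arbitrary choice.

```python
import re

# Hypothetical keywords for illustration; real patterns would come from STRINGS.in.
KEYWORDS = ["license", "copyright", "permission is hereby granted", "redistribution"]

def find_candidate_regions(text, keywords=KEYWORDS, context=1):
    """Return merged (start_line, end_line) ranges, 0-based and inclusive,
    covering each keyword-matching line plus `context` lines on either side."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    lines = text.splitlines()
    regions = []
    for i, line in enumerate(lines):
        if pattern.search(line):
            start = max(0, i - context)
            end = min(len(lines) - 1, i + context)
            if regions and start <= regions[-1][1] + 1:
                # Overlapping or adjacent to the previous region: extend it.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Only the returned regions would then need to be passed to the LSH classifier, which keeps the more expensive embedding step off files with no license-like text.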