Week 3
(June 17, 2025 - June 18, 2025)
Meeting 1
(June 18, 2025)
Attendees
Discussions
- Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection.
- Compared MinHash (Jaccard-based) vs SimHash (cosine-based) algorithms.
- Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers).
- Discussed handling large-scale corpora with caching and sampling strategies.
- Mentors proposed a 3-step architecture for Atarashi involving:
  - Initial keyword detection using STRINGS.in.
  - License prediction via an LSH-based classifier.
  - Final license verification for correctness.
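The three stages proposed by the mentors can be pictured as a short chain of filters. The sketch below is only an illustration of that control flow: the function name `detect_license` and the three callables passed into it are hypothetical placeholders, not part of Atarashi.

```python
from typing import Callable, Optional


def detect_license(
    text: str,
    has_keywords: Callable[[str], bool],         # stage 1 (hypothetical)
    predict: Callable[[str], Optional[str]],     # stage 2 (hypothetical)
    verify: Callable[[str, str], bool],          # stage 3 (hypothetical)
) -> Optional[str]:
    """Run the three stages in order and stop at the first one that fails."""
    # Stage 1: cheap keyword check to skip files with no license-like text.
    if not has_keywords(text):
        return None
    # Stage 2: the LSH-based classifier proposes a candidate license.
    candidate = predict(text)
    if candidate is None:
        return None
    # Stage 3: confirm the candidate before reporting it.
    return candidate if verify(text, candidate) else None
```

Keeping each stage as a separate callable means the expensive LSH lookup only runs on text that already passed the keyword check.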
LSH Algorithm and Implementation Updates
From MinHash to SimHash
- The initial implementation, which used MinHash with character shingles and Jaccard similarity, yielded poor results: it lacked robustness against paraphrased or partial text.
- Switched to SimHash, which is more suitable for high-dimensional dense vector spaces and performs well with cosine similarity.
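For reference, the MinHash approach described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the shingle size `k=5`, the 64 hash functions, and the use of salted `blake2b` hashes are all assumptions chosen for the example.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping character k-grams of the normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_perm=64):
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Because the shingles are literal character sequences, a paraphrase that changes the wording shares few shingles with the original, which is consistent with the weakness noted above.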
SimHash Overview
- SimHash works by projecting high-dimensional vectors into binary hash codes based on weighted sign projections.
- Vectors that are closer in cosine distance map to hash codes with small Hamming distances.
- Enables fast similarity search using hash buckets, significantly reducing lookup time.
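The idea can be illustrated with the random-hyperplane variant of SimHash. The sketch below is an assumption about how such a hash could look, not the code in the repository: the 64-bit code length, the Gaussian hyperplanes, and the helper names are all chosen for the example.

```python
import random


def make_hyperplanes(dim, n_bits=64, seed=0):
    """Draw one random Gaussian hyperplane per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def simhash(vec, planes):
    """Each bit records on which side of a hyperplane the vector falls."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def hamming(a, b):
    """Number of differing bits between two hash codes."""
    return bin(a ^ b).count("1")


def bucket_key(code, n_bits=64, prefix_bits=16):
    """Use the leading bits of the code as a hash-bucket key."""
    return code >> (n_bits - prefix_bits)


planes = make_hyperplanes(dim=8)
v = [0.3, -1.2, 0.5, 0.0, 2.0, -0.7, 1.1, 0.4]
scaled = [2 * x for x in v]      # same direction, so cosine similarity is 1
opposite = [-x for x in v]       # opposite direction, so cosine similarity is -1
```

Only the direction of a vector affects its code, so `v` and `scaled` hash identically, while `opposite` lands on the other side of every hyperplane. Grouping codes by `bucket_key` is one simple way to restrict a lookup to a single bucket instead of scanning the whole index.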
Vectorization Techniques
- A TF-IDF vectorizer was used initially, but it produced sparse vectors, which worked poorly with SimHash.
- Transitioned to sentence-transformers, using the all-MiniLM-L6-v2 model, which generates dense sentence embeddings suitable for SimHash.
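Generating the dense embeddings could look roughly like the following. This is a sketch under assumptions: the wrapper `embed_texts` is a hypothetical helper, and normalizing the embeddings is a choice made for the example rather than something the notes confirm.

```python
def embed_texts(texts, model=None, model_name="all-MiniLM-L6-v2"):
    """Return one dense embedding per input text.

    `model` may be any object with a sentence-transformers style `encode`
    method; when omitted, the named model is loaded on first use.
    """
    if model is None:
        # Imported lazily so the helper can be defined without the library installed.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(model_name)
    # all-MiniLM-L6-v2 produces 384-dimensional vectors; normalizing them to
    # unit length leaves cosine similarity unchanged.
    return model.encode(list(texts), normalize_embeddings=True)
```

Unlike TF-IDF, where most entries of a vector are zero, nearly every dimension of these embeddings carries a value, so sign-based hashing has signal in every bit.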
Performance Optimizations
- Implemented caching to avoid repeated vector generation for the same files.
- Due to dataset size (~162k files), limited vectorization to a representative subset of 10,000 files for faster experimentation.
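Both optimizations can be sketched with the standard library. The code below is illustrative only: keying the cache on a SHA-256 digest of the file content, storing vectors with `pickle`, and drawing a seeded uniform random sample are assumptions, since the notes do not say how the cache or the representative subset were actually built.

```python
import hashlib
import pickle
import random
from pathlib import Path


class EmbeddingCache:
    """Store one embedding per distinct text on disk, keyed by content hash."""

    def __init__(self, cache_dir, embed_fn):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.embed_fn = embed_fn  # hypothetical callable: text -> vector

    def _path(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.pkl"

    def get(self, text):
        path = self._path(text)
        if path.exists():
            # Cache hit: skip vector generation entirely.
            with path.open("rb") as fh:
                return pickle.load(fh)
        vector = self.embed_fn(text)
        with path.open("wb") as fh:
            pickle.dump(vector, fh)
        return vector


def sample_files(paths, k=10_000, seed=42):
    """Pick a reproducible subset of at most k paths."""
    paths = sorted(paths)
    if len(paths) <= k:
        return paths
    return random.Random(seed).sample(paths, k)
```

Hashing the content rather than the path means identical files stored under different names share one cached vector, and the fixed seed keeps the subset stable between runs.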
Experimental Results
- Combined all Minerva files into a single corpus and indexed using SimHash-based LSH.
- Indexed 10,000 sample files, including:
  - 46 unique licenses (out of 654 total)
  - 20 known non-license texts
- Total: 674 queries (654 licenses + 20 non-license texts).
Key Metrics:
Metric | Value |
---|---|
Indexed licenses | 46 / 654 |
Correctly retrieved licenses | All 46 |
Correctly rejected non-license text | 20 / 674 |
Detected unseen licenses (not indexed) | 203 / 608 |
Indexed file subset | 10,000 / 162,833 |
Overall trend | Positive performance despite limited indexing |
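To put the counts in the table on a common scale, they can be converted to percentages. The snippet below only restates three rows of the table; the dictionary keys are labels made up for the example.

```python
# (numerator, denominator) pairs copied from the Key Metrics table.
metrics = {
    "indexed licenses": (46, 654),
    "unseen licenses detected": (203, 608),
    "indexed file subset": (10_000, 162_833),
}

# Percentage for each row, rounded to one decimal place.
rates = {name: round(100 * num / den, 1) for name, (num, den) in metrics.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

This works out to roughly 7.0% of licenses indexed, 33.4% of unseen licenses detected, and 6.1% of the files indexed.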
Code Repository: atarashi-classifier
Problems Identified
- TF-IDF vectors were too sparse, reducing effectiveness with SimHash.
- Limited indexing (only 10k files) restricts generalization for rare licenses.
- Not all licenses were present in the indexed corpus; broader coverage is needed.
- False negatives among non-license texts indicate further tuning is required.
Planning for Next Week
- Start working on Stage 1 of the proposed pipeline:
  - Extract and match keywords from STRINGS.in (used by Nomos) to identify candidate license regions.
- Expand indexed dataset to include more diverse license types.
- Improve non-license detection rate through better negative sampling and filtering.
- Continue tuning SimHash and embedding-based search thresholds.
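As a starting point for Stage 1, keyword matching over lines of text could be sketched as follows. This is a hypothetical illustration: it does not parse STRINGS.in, whose format is not described in these notes, and the function name, the `context` window, and the plain keyword list passed in are all assumptions.

```python
import re


def find_candidate_regions(text, keywords, context=2):
    """Return (start, end) line ranges, inclusive, around lines that contain a keyword.

    Overlapping or adjacent ranges are merged into one region.
    """
    if not keywords:
        return []
    lines = text.splitlines()
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    regions = []
    for i, line in enumerate(lines):
        if not pattern.search(line):
            continue
        start = max(i - context, 0)
        end = min(i + context, len(lines) - 1)
        if regions and start <= regions[-1][1] + 1:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions
```

Only the returned line ranges would then need to be passed on to the LSH-based classifier, which keeps the more expensive embedding step away from text with no license-like wording.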