Skip to main content

Week 3

(June 17, 2025 - June 18, 2025)

Meeting 1

(June 18, 2025)

Attendees

Discussions

  • Presented progress on improving the Locality Sensitive Hashing (LSH) approach for license detection.
  • Compared MinHash (Jaccard-based) vs SimHash (cosine-based) algorithms.
  • Shared insights from experimenting with different vectorization techniques (TF-IDF vs. Sentence Transformers).
  • Discussed handling large-scale corpora with caching and sampling strategies.
  • Mentors proposed a 3-step architecture for Atarashi involving:
    1. Initial keyword detection using STRINGS.in.
    2. License prediction via LSH-based classifier.
    3. Final license verification for correctness.

LSH Algorithm and Implementation Updates

From MinHash to SimHash

  • Initial implementation using MinHash with character shingles and Jaccard similarity yielded poor results — lacked robustness against paraphrased or partial text.
  • Switched to SimHash, which is more suitable for high-dimensional dense vector spaces and performs well with cosine similarity.

SimHash Overview

  • SimHash works by projecting high-dimensional vectors into binary hash codes based on weighted sign projections.
  • Vectors that are closer in cosine distance map to hash codes with small Hamming distances.
  • Enables fast similarity search using hash buckets, significantly reducing lookup time.

Vectorization Techniques

  • TF-IDF Vectorizer was initially used but resulted in sparse vectors, which are incompatible with SimHash.
  • Transitioned to sentence-transformers — used all-MiniLM-L6-v2 model, which generates dense sentence embeddings suitable for SimHash.

Performance Optimizations

  • Implemented caching to avoid repeated vector generation for the same files.
  • Due to dataset size (~162k files), limited vectorization to a representative subset of 10,000 files for faster experimentation.

Experimental Results

  • Combined all Minerva files into a single corpus and indexed using SimHash-based LSH.
  • Indexed 10,000 sample files, including:
    • 46 unique licenses (out of 654 total)
    • 20 known non-license texts
    • Total -> 674 queries.

Key Metrics:

MetricValue
Indexed licenses46 / 654
Correctly retrieved licensesAll 46
Correctly rejected non-license text20 / 674
Detected unseen licenses (not indexed)203 / 608
Indexed file subset10,000 / 162,833
Overall trendPositive performance despite limited indexing

Code Repository: atarashi-classifier


Problems Identified

  • TF-IDF vectors were too sparse, reducing effectiveness with SimHash.
  • Limited indexing (only 10k files) restricts generalization for rare licenses.
  • Not all licenses were present in the indexed corpus — needs broader coverage.
  • False negatives among non-license texts indicate further tuning is required.

Planning for Next Week

  • Start working on Stage 1 of the proposed pipeline:
    • Extract and match keywords from STRINGS.in (used by Nomos) to identify candidate license regions.
  • Expand indexed dataset to include more diverse license types.
  • Improve non-license detection rate through better negative sampling and filtering.
  • Continue tuning SimHash and embedding-based search thresholds.