Meeting 11

(August 8, 2024)

Attendees:

Discussion:

Semantic Search Code Cleanup and Refinement

  • Evaluation and Refinement of Changes:

    1. Approach: Began cleaning up the semantic search code before pushing it to the Atarashi repository. This involved reviewing every modification made over the course of the project, from cosine similarity and sentence transformers to fuzzywuzzy and Levenshtein distance methods.

    2. Challenges: Because so many changes had accumulated, deciding which ones to retain took considerable time, particularly those affecting how file comments and license text are grouped and chunked.
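
For reference, the two families of similarity measures mentioned above can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the project's code: the bag-of-words vectors stand in for sentence-transformer embeddings, the normalized edit-distance score is only similar in spirit to what fuzzywuzzy reports, and all function names here are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Assumption: word counts stand in for the embedding vectors that a
    sentence-transformer model would produce.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0
```

The cosine score compares which words two texts share regardless of order, while the edit-distance score is sensitive to character-level differences, which is one reason the two approaches can rank the same license candidates differently.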

Atarashi Repository Exploration

  • Cloning and Local Build:

    1. Approach: Cloned the Atarashi repository and started investigating areas for contribution, with a focus on understanding the package structure and functionality.

    2. Issues: Encountered some local build issues but managed to put together a close approximation of the code I intended to run this week.

  • Next Steps: Continue investigating the local build issues and further refine my contributions to the repository.

License-Relevant Text Detection

  • Task Design and LLM Experimentation:

    1. Objective: Designed a task prompt specifically for detecting license-relevant text within code files, without identifying which specific license the text belongs to.

    2. Experimentation: Started experimenting with LLMs for this task and observed positive initial results, which show promise for further improving the detection of license-relevant sections.
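
As an illustration of the task framing, a prompt of this kind might be assembled as follows. The actual prompt wording is not recorded in these notes, so the template text, the `build_prompt` helper, and the `max_chars` limit are all hypothetical; the sketch only builds the prompt string and does not call any particular LLM API.

```python
# Hypothetical template: asks only whether text is license-relevant,
# deliberately not asking which license it belongs to.
PROMPT_TEMPLATE = (
    "You are given the contents of a source code file.\n"
    "Copy out, verbatim, every passage that is license-relevant "
    "(license notices, copyright statements, references to license terms).\n"
    "Do not name or guess which license the text belongs to.\n"
    "If there is no license-relevant text, answer NONE.\n"
    "\n"
    "FILE CONTENTS:\n"
)


def build_prompt(file_text: str, max_chars: int = 4000) -> str:
    """Return the task prompt followed by a truncated copy of the file text.

    Assumption: truncating to max_chars is a stand-in for whatever
    chunking strategy is actually used to fit the model's context window.
    """
    return PROMPT_TEMPLATE + file_text[:max_chars]
```

Keeping the instruction separate from the file contents makes it easy to iterate on the prompt wording while reusing the same set of test files.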

Conclusions and Next Steps

  • Push Code to Atarashi: Finalize the clean-up and push the refined semantic search algorithm to the repository.

  • Fix Local Build Issues: Resolve any remaining build issues to contribute effectively to the Atarashi repository.

  • Continue LLM Experimentation: Improve the prompt and refine the LLM experiments for detecting license-relevant text.