Week 4

(June 21, 2023)

Attendees:

Updates:

  • Started curating a copyright dataset. Instead of labelling manually through the Fossology UI, I explored automating the work with the GPT-3.5 (ChatGPT) API. I wrote a set of functions that traverse directories, extract the commented content of each file, and send that text together with a specific prompt to the ChatGPT API, with the aim of isolating any copyright statements it contains. The approach mostly worked, but it needed several iterations to improve. The related code is accessible here, and my findings are hosted here.
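The pipeline described above can be sketched roughly as follows. This is a minimal illustration and not the code linked in this report: the extension map, the prompt wording, and every function name are assumptions made for the example. The model call is passed in as a plain callable (`ask_model`) so that no particular client library is assumed, and only single-line comments are handled.

```python
import os

# Hypothetical map of file extensions to their line-comment prefix.
COMMENT_PREFIXES = {
    ".py": "#", ".sh": "#",
    ".c": "//", ".h": "//", ".cpp": "//", ".java": "//", ".js": "//",
}

# Hypothetical prompt wording; the prompt actually used is not given here.
PROMPT_TEMPLATE = (
    "Extract every copyright statement from the following source comments. "
    "Reply with one statement per line, or NONE if there are none.\n\n"
    "{comments}"
)


def iter_source_files(root):
    """Yield paths under root whose extension has a known comment prefix."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in COMMENT_PREFIXES:
                yield os.path.join(dirpath, name)


def extract_comment_lines(path):
    """Return the text of each line comment in the file, prefix removed."""
    prefix = COMMENT_PREFIXES[os.path.splitext(path)[1].lower()]
    comments = []
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if stripped.startswith(prefix):
                comments.append(stripped[len(prefix):].strip())
    return comments


def build_prompt(comments):
    """Combine the fixed instruction with the extracted comment text."""
    return PROMPT_TEMPLATE.format(comments="\n".join(comments))


def collect_copyrights(root, ask_model):
    """Map each file that has comments to the model's reply for its prompt.

    ask_model is any callable that takes a prompt string and returns text,
    for example a thin wrapper around a chat-completion API.
    """
    results = {}
    for path in iter_source_files(root):
        comments = extract_comment_lines(path)
        if comments:
            results[path] = ask_model(build_prompt(comments))
    return results
```

Passing the model call in as a parameter keeps the traversal and extraction logic testable offline; a stub such as `lambda prompt: "NONE"` can stand in for the real API while the prompt is being iterated on.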

Methodology Challenge

  • Although the approach above worked reasonably well, it is not viable for this project: the dataset has to be created from Fossology's own output so that it can be used to correct Fossology's false positives.

Fossology API

  • Learned that Fossology has an API that can extract the copyright statements Fossology generates. This can be used to build the dataset.

LDA Model

  • Ran a basic LDA (Latent Dirichlet Allocation) model with two topics, copyright and non-copyright. The results were promising: the topics showed relevant associations. The code is available here.
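To make the two-topic idea concrete, here is a small self-contained sketch of LDA fitted with collapsed Gibbs sampling. It is not the code linked above, which most likely relied on an existing library; the tokenizer, the hyperparameters (`alpha`, `beta`), and all names are assumptions chosen for illustration. Note that LDA is unsupervised, so nothing guarantees that the two learned topics line up with "copyright" and "non-copyright"; that has to be checked by inspecting the top words.

```python
import random
import re


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA on tokenized docs with collapsed Gibbs sampling.

    Returns (theta, top_words): theta[d] is the topic distribution of
    document d, and top_words[k] lists the most frequent words of topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables maintained by the sampler.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            z = rng.randrange(num_topics)
            doc_assignments.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed per-document topic proportions.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[v] for v in sorted(range(vocab_size), key=lambda v: -topic_word[k][v])[:5]]
        for k in range(num_topics)
    ]
    return theta, top_words
```

A typical call would tokenize each comment block with `tokenize`, pass the resulting list of token lists to `lda_gibbs`, and then inspect `top_words` to see whether either topic is dominated by terms such as "copyright" or "reserved".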

Problems and Solutions:

Problem 1

  • Creating a dataset manually is tedious, slow, and error-prone.

Solution 1

  • Automated the task using ChatGPT. However, it required careful prompt design to get even semi-reliable results.

Problem 2

  • It was unclear which parts of a file to send to ChatGPT for copyright extraction.

Solution 2

  • Wrote a function that captures only the commented lines from files with the most common extensions. When that function was not sufficient, the entire file was sent to ChatGPT, which eventually proved counterproductive. Gaurav later introduced me to Nirjas, a Python library under the Fossology project that already handles comment extraction.

Conclusion and Further Plans:

Dataset Creation

  • Build the dataset using the Fossology API.