Week 5
Meeting 6
(July 14th, 2022)
GSOC 2022 weekly update
Attendees
- Shaheem Azmal M MD
- Gaurav Mishra
- Ayush Bhardwaj
- Kaushlendra Pratap
- Vasudev Maduri
- Shruti Agarwal
- Sushant Kumar
- Feng Wenhan
- Rohit Pandey
- Thanvi Lahari Pendyala
- Krishna Mahato
- Soham Banerjee
Discussions
-
Made changes suggested by the mentors on the pr creted on Minerva.
-
Implemented agent on atarashi for okapibm25 and got the accuracy score of 62%.
Total files scanned = 100
Successfully matched = 62
++++++++++++++++++ Result ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++
---> Total time elapsed: 11.67 Seconds <---
---> Accuracy: 62.0% <---
++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++ -
And raised the pr for the same.
def scan(self, filePath):
'''
Read the content of filename, extract the comments and preprocess them.
Find the license of the preprocessed file.
:param filePath: Path of the file to scan
:return: Returns the license's short name with highest similarity scores
'''
processedData = super().loadFile(filePath)
with open(filePath) as file:
raw_data = file.read()
spdx_identifers = spdx_identifer(raw_data,
self.licenseList['shortname'])
match = []
if spdx_identifers:
match.extend(spdx_identifers)
else:
tokenize_corpus = []
corpus_identifier = []
for idx in range(len(self.licenseList)):
tok = self.licenseList.iloc[idx]['processed_text'].split(" ")
tokenize_corpus.append(tok)
tok_identifier = self.licenseList.iloc[idx]['shortname']
corpus_identifier.append(tok_identifier)
bm25 = BM25Okapi(tokenize_corpus)
doc_scores = bm25.get_scores(processedData.split(" "))
indices = np.argsort(doc_scores)[::-1][:10]
for index in indices:
match.append({
"shortname": str(corpus_identifier[index]),
"sim_score": doc_scores[index],
"sim_type": "bm25",
"description": ""
})
return match- In above given code it can be seen that I have used bm25 to transform processed text and then the similarity score has been calculated.
Conclusion and Further Plans
- Will make the changes according to further suggestion.