Week 4
Meeting 6
(July 07th, 2022)
GSOC 2022 weekly update
Attendees
- Shaheem Azmal M MD
- Gaurav Mishra
- Anupam Ghosh
- Ayush Bhardwaj
- Shruti Agarwal
- Avinal Kumar
- Sushant Kumar
- Feng Wenhan
- Rohit Pandey
- Thanvi Lahari Pendyala
- Samuel Dushimimana
- Krishna Mahato
- Soham Banerjee
Discussions
- Created Python packages for both the LogisticRegression and Linear SVC models. The file structure of the created packages is shown below:
+-- linearsvc
│   +-- LICENSE
│   +-- MANIFEST.in
│   +-- README.md
│   +-- setup.py
│   +-- src
│       +-- linearsvc
│       │   +-- data
│       │   │   +-- linearsvc
│       │   +-- __init__.py
│   +-- model_train.py
+-- logreg
    +-- LICENSE
    +-- MANIFEST.in
    +-- README.md
    +-- setup.py
    +-- src
        +-- logreg
        │   +-- data
        │   │   +-- logreg
        │   +-- __init__.py
    +-- model_train.py
- Modified __init__.py in the src folder of both Python packages as suggested:
- In the code below, the linearsvc class has two methods:
  - linearsvc.classify() returns the model classifier, which the Atarashi agent can then use to predict the license shortname by calling the classifier's predict() function.
  - linearsvc.predict_shortname() uses the preprocessed file passed to the constructor and returns the license shortname directly.
- Similar methods have been implemented for the logreg model as well.
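import pickle

from pkg_resources import resource_filename

class linearsvc():
    def __init__(self, preprocessed_file):
        # Preprocessed license text that the model will classify
        self.preprocessed_file = preprocessed_file

    def classify(self):
        # Locate the pickled model shipped as package data and load it
        data = resource_filename("linearsvc", "data/linearsvc")
        with open(data, 'rb') as f:
            classifier = pickle.load(f)
        return classifier

    def predict_shortname(self):
        # Load the classifier and return the predicted license shortname
        predictor = self.classify()
        return predictor.predict(self.preprocessed_file)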
- Implemented the agent for Linear SVC in Atarashi locally.
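The model-loading step in classify() relies on pkg_resources, which recent setuptools releases mark as deprecated in favor of importlib.resources. A minimal sketch of the same lookup with the standard library follows, assuming Python 3.9 or newer. The function name load_packaged_model and its parameters are hypothetical and not part of the packages described in these notes.

```python
import pickle
from importlib.resources import files


def load_packaged_model(package: str, resource: str):
    """Unpickle a model file shipped as package data.

    `package` and `resource` are illustrative parameters, for example
    ("linearsvc", "data/linearsvc") for the layout shown in these notes.
    """
    target = files(package)
    # Walk the resource path one segment at a time.
    for part in resource.split("/"):
        target = target / part
    # Only unpickle files you trust: pickle can execute arbitrary code.
    with target.open("rb") as f:
        return pickle.load(f)
```

With the linearsvc package installed, load_packaged_model("linearsvc", "data/linearsvc") would be expected to return the same classifier object as classify(), on which predict() can then be called.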
Conclusion and Further Plans
- Will make changes according to further suggestions.
- Will start implementing Okapi BM25 in place of TfidfTransformer for ranking the license text in the dataset used to train the models, and will compare which of the two performs better on that dataset.
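To make the planned comparison concrete, here is a minimal sketch of Okapi BM25 term weighting in plain Python. It is not the project's implementation: the function name bm25_weights, the defaults k1=1.5 and b=0.75, and the non-negative IDF variant log(1 + (N - n + 0.5) / (n + 0.5)) are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bm25_weights(
    docs: Sequence[Sequence[str]], k1: float = 1.5, b: float = 0.75
) -> List[Dict[str, float]]:
    """Return one {term: BM25 weight} mapping per tokenized document."""
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    # Assumed IDF variant that stays non-negative for very common terms.
    idf = {
        term: math.log(1.0 + (n_docs - n + 0.5) / (n + 0.5))
        for term, n in df.items()
    }

    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are damped.
        length_norm = 1.0 - b + b * (len(doc) / avg_len if avg_len else 0.0)
        weighted.append(
            {
                term: idf[term] * freq * (k1 + 1.0) / (freq + k1 * length_norm)
                for term, freq in tf.items()
            }
        )
    return weighted
```

Unlike plain TF-IDF, the term-frequency part saturates as a term repeats (controlled by k1) and is normalized by document length (controlled by b), which may matter for license texts of very different lengths.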