Week 11
Meeting 13
(August 25th, 2022)
GSoC 2022 weekly update
Updates
- Implemented the algorithm for the BERT transformer:
- The implementation first labels the different license classes: a dictionary is created where each license short name is the key and its integer label is the value.
possible_labels = df.short_name.unique()

# Map each unique license short name to an integer label
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
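The encoding step below filters on a data_type column that is not created in the snippet above. A minimal sketch of one way such a column could be produced, assuming a stratified split with scikit-learn (test_size, random_state, and the label column name are assumptions, not taken from the original):

from sklearn.model_selection import train_test_split

# Attach the integer labels defined in label_dict (assumed column name)
df['label'] = df.short_name.replace(label_dict)

# Hypothetical stratified train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,       # assumed split ratio
    random_state=42,      # assumed seed
    stratify=df.label.values
)

# Mark each row so it can be filtered during encoding
df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'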
- For tokenizing and encoding, the bert-base-uncased pretrained model is used:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

# Encode the training split, padding/truncating every text to 256 tokens
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    max_length=256,
    return_tensors='pt'
)
- For now, the transformer model has been trained on a smaller part of the Minerva dataset, because training on the whole dataset requires a lot of RAM and time; a sketch of the fine-tuning step is given below.
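A minimal sketch of what the fine-tuning step could look like, assuming BertForSequenceClassification and the encodings from above (batch size, learning rate, and epoch count are assumptions; the actual training script may differ):

import torch
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import BertForSequenceClassification

# Wrap the encoded inputs and their labels in a dataset
labels_train = torch.tensor(df[df.data_type == 'train'].label.values)
dataset_train = TensorDataset(encoded_data_train['input_ids'],
                              encoded_data_train['attention_mask'],
                              labels_train)
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=16)  # assumed batch size

# One output class per license short name in label_dict
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_dict))

optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed learning rate

model.train()
for epoch in range(4):  # assumed number of epochs
    for input_ids, attention_mask, labels in dataloader_train:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        outputs.loss.backward()  # cross-entropy loss computed by the model
        optimizer.step()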
- Created a simple notebook for the trained model. It can be seen here.
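For illustration, a hedged sketch of how such a notebook might query the trained model (the sample text and variable names are hypothetical):

# Classify a single text snippet with the fine-tuned model
model.eval()
text = "Permission is hereby granted, free of charge, to any person..."
enc = tokenizer(text, padding='max_length', truncation=True,
                max_length=256, return_tensors='pt')
with torch.no_grad():
    logits = model(**enc).logits

# Invert label_dict to map the predicted index back to a license short name
inv_label_dict = {v: k for k, v in label_dict.items()}
print(inv_label_dict[logits.argmax(dim=1).item()])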
Conclusion and Further Plans
- Will keep contributing to the organization.