Week 11
Meeting 13
(August 25th, 2022)
GSoC 2022 weekly update
Updates
- Implemented the algorithm for the BERT transformer:
- The implementation first labels the different license classes: a dictionary is created where each license short name is the key and its integer label is the value.
possible_labels = df.short_name.unique()

# Map each unique license short name to an integer label
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index
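The encoding step below filters on a data_type column that is not created in the snippet above. A minimal sketch of one way such a column could be produced, assuming a stratified split with scikit-learn (test_size, random_state, and the label column name are assumptions, not taken from the original):

from sklearn.model_selection import train_test_split

# Attach the integer labels defined in label_dict (assumed column name)
df['label'] = df.short_name.replace(label_dict)

# Hypothetical stratified train/validation split
X_train, X_val, y_train, y_val = train_test_split(
    df.index.values,
    df.label.values,
    test_size=0.15,       # assumed split ratio
    random_state=42,      # assumed seed
    stratify=df.label.values
)

# Mark each row so it can be filtered during encoding
df['data_type'] = 'not_set'
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'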
- For tokenizing and encoding, the bert-base-uncased pretrained model is used:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

# Encode the training split, padding/truncating every text to 256 tokens
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type == 'train'].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,
    max_length=256,
    return_tensors='pt'
)
- For now, the transformer model has been trained on a smaller part of the Minerva dataset, because training on the whole dataset requires a lot of RAM and time; a sketch of the fine-tuning step is given below.
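A minimal sketch of what the fine-tuning step could look like, assuming BertForSequenceClassification and the encodings from above (batch size, learning rate, and epoch count are assumptions; the actual training script may differ):

import torch
from torch.optim import AdamW
from torch.utils.data import TensorDataset, DataLoader, RandomSampler
from transformers import BertForSequenceClassification

# Wrap the encoded inputs and their labels in a dataset
labels_train = torch.tensor(df[df.data_type == 'train'].label.values)
dataset_train = TensorDataset(encoded_data_train['input_ids'],
                              encoded_data_train['attention_mask'],
                              labels_train)
dataloader_train = DataLoader(dataset_train,
                              sampler=RandomSampler(dataset_train),
                              batch_size=16)  # assumed batch size

# One output class per license short name in label_dict
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_dict))

optimizer = AdamW(model.parameters(), lr=2e-5)  # assumed learning rate

model.train()
for epoch in range(4):  # assumed number of epochs
    for input_ids, attention_mask, labels in dataloader_train:
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        outputs.loss.backward()  # cross-entropy loss computed by the model
        optimizer.step()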
- Created a simple notebook for the trained model. It can be seen here.
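For illustration, a hedged sketch of how such a notebook might query the trained model (the sample text and variable names are hypothetical):

# Classify a single text snippet with the fine-tuned model
model.eval()
text = "Permission is hereby granted, free of charge, to any person..."
enc = tokenizer(text, padding='max_length', truncation=True,
                max_length=256, return_tensors='pt')
with torch.no_grad():
    logits = model(**enc).logits

# Invert label_dict to map the predicted index back to a license short name
inv_label_dict = {v: k for k, v in label_dict.items()}
print(inv_label_dict[logits.argmax(dim=1).item()])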
Conclusion and Further Plans
- Will keep contributing to the organization.