GPT-3 vanilla
For our classification task, we access GPT-3 via the API provided by OpenAI. Input to the model is given as a prompt that contains a natural-language description of the task the model is to perform. We tested various alternative prompts. The one that worked best on the task at hand is documented in the function below, which predicts the label(s) for a single publication item.
import pandas as pd
import os
import numpy as np
import openai
import backoff # exponential backoff to avoid calling the API at too high a rate
# This notebook uses openai's API, which is a paid service.
# To recreate the notebook you need your own API key, which
# can be acquired through openai's website. For instructions go to:
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
import sys
sys.path.append('../GPT_class/')
import api_key
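The api_key module imported above is a small local helper that is not part of this notebook. A minimal sketch of how the key could be registered with the client, assuming the module simply exposes the key in a variable named API_KEY (that variable name is our assumption; the module could equally well set the key itself on import):
openai.api_key = api_key.API_KEY # register the secret key with the openai client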
Import our pickled data and get the list of labels:
data_dir = 'Physics_2020_Dimensions'
df = pd.read_pickle(os.path.join(data_dir, 'dimensions_data_clean.pkl'))
all_labels_full = [item for sublist in df.Labels_str.values.tolist() for item in sublist]
all_labels = list(set(all_labels_full))
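As a quick sanity check, the resulting label list should contain exactly the ten subfields that appear in the prompt below:
print(len(all_labels), sorted(all_labels)) # we expect the 10 subfield labels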
This function sends one abstract to the GPT-3 API and returns the predicted label(s).
@backoff.on_exception(backoff.expo, openai.error.RateLimitError) # decorator to avoid calling the API at too high a rate
def predict_gpt(abstract):
    response = openai.Completion.create(
        model="text-davinci-003", # strongest model of the GPT-3 family provided by openai at the time of writing
        prompt="""Does this abstract fall into the subfield 'Astronomical Sciences',
'Atomic Molecular and Optical Physics', 'Classical Physics','Condensed Matter Physics',
'Medical and Biological Physics', 'Nuclear and Plasma Physics','Particle and High Energy Physics',
'Quantum Physics','Space Sciences','Synchrotrons and Accelerators'. Name one or more of the listed
subfields.""" + abstract, # the actual text prompt
        temperature=0, # be as deterministic as possible
        max_tokens=2048, # amply sufficient, as we have restricted our abstracts to a length of max. 1500 characters
        top_p=1, # use as little randomness in the output as possible
        frequency_penalty=0, # no penalty if tokens repeat (e.g. the word Physics appears in several labels)
        presence_penalty=0, # no penalty if a word is already present in the prompt (e.g. an abstract may contain a subfield self-assignment)
        logprobs=5 # return logprobs; we do not use them for now, but will use the logprobs of the fine-tuned model
    )
    return response["choices"][0]["text"]
Some observations regarding the prompt design:
The prompt that worked best after probing several different versions was:
Does this abstract fall into the subfield ‘Astronomical Sciences’, ‘Atomic, Molecular and Optical Physics’, ‘Classical Physics’, ‘Condensed Matter Physics’, ‘Medical and Biological Physics’, ‘Nuclear and Plasma Physics’, ‘Particle and High Energy Physics’, ‘Quantum Physics’, ‘Space Sciences’, ‘Synchrotrons and Accelerators’. Name one or more of these subfields.
The most difficult aspect of designing an appropriate prompt was getting GPT-3 to perform actual multilabel classification, i.e., to make it return more than one label/subfield where appropriate. The phrase that worked best in our trials was ‘Name one or more of these subfields.’ With this prompt, GPT-3 does occasionally return more than one label, but sparingly, as desired.
GPT-3’s output for the above prompt is quite consistent. As a rule, it returns a single subfield label or a comma-separated list of subfield labels. This list sometimes ends with a period, sometimes not, which needs to be taken into account when processing the output. On rare occasions, GPT-3 was also observed to deviate from this output format altogether; such deviations from the expected format are flagged in the batch prediction below. The take-home message is that the model can be nudged, but not forced, to stick to a certain output format. It is therefore always advisable to check that the output meets the assumed format.
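For instance, a minimal sketch of such a format check, using the all_labels list from above (the helper name parse_and_check and its exact cleanup steps are ours, not part of the original notebook), could look like this:
def parse_and_check(raw_output):
    # split the comma-separated completion, strip whitespace and trailing periods
    labels = [label.strip().rstrip('.') for label in raw_output.strip().split(',')]
    # flag anything that is not one of the ten expected subfield labels
    unexpected = [label for label in labels if label not in all_labels]
    if unexpected:
        print('Unexpected output format: {}'.format(unexpected))
    return labels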
The abstracts in the Dimensions DB are rather heterogeneous. They may contain formatting that interferes with GPT-3’s prompt logic (for instance, line breaks within the abstract) as well as content without any informational value for the task at hand (for instance, LaTeX formatting commands). Hence, we do some light preprocessing of the abstracts. This function could be much improved in a production version.
def preprocess_abstract(abstract):
    abstract = abstract.replace("\n", " ") # no newlines in abstracts
    abstract = abstract.replace("abstract:", "").replace("Abstract:", "").replace("ABSTRACT:", "") # no "abstract:" label inside the abstract
    abstract = " " + abstract # due to GPT-3's tokenization it is preferable to prepend a space
    return abstract
We are now finally ready to do our first prediction:
predict_item = df.sample(1).iloc[0]
predict_abstract = preprocess_abstract(predict_item['Abstract'])
print('Abstract: {}\nTrue: {}\nPredicted: {}'.format(predict_abstract,
', '.join(predict_item['Labels_str']),
predict_gpt(predict_abstract).strip()))
Abstract: In Phys. Rev. Lett. 102, 197002 (2009) it was reported that the element Eu becomes superconducting in the pressure and temperature range [84-142GPa], [1.8-2.75K]. The claim was largely based on ac susceptibility measurements. Recently reported ac susceptibility measurements on a hydride compound under pressure that appears to become superconducting near room temperature (Nature 586, 373 (2020)) cast serious doubt on the validity of the results for Eu as well as for the hydride. Here I present results that shed new light on the true behaviour of Eu. It is argued that the experiments on Eu have to be repeated to either validate or rule out the claim that it is a superconducting element.
True: Condensed Matter Physics
Predicted: Condensed Matter Physics
This seems to work well, and we can proceed to batch-predict on a larger number of publication items to check the performance of the model on our task of labeling abstracts by subfield. We sample a fixed number (nr_samples) of publication items for each of the 10 subfield labels. Since the distribution of items over the subfields in our dataset is not homogeneous (cf. analysis), we thereby over- and under-represent some subfields, and the overall accuracy of the batch prediction will thus not be representative of an arbitrary sample of publication items from our Dimensions corpus. The benefit is that we can evaluate and directly compare our result metrics per subfield.
This function samples publication items and predicts on them:
# if my_model is specified we use a fine-tuned model, otherwise we use GPT-3 vanilla
def batch_predict(test_data, nr_samples, my_model=False, verbose=False):
    prediction_results = []
    j = 0
    for field_label in all_labels: # iterate over the subfields
        predict_items = test_data[test_data[field_label]==True].sample(nr_samples) # sample nr_samples publication items per subfield
        for i in range(nr_samples): # predict on each individual item
            j += 1
            print(j, end='\r')
            predict_item = predict_items.iloc[i]
            predict_abstract = preprocess_abstract(predict_item['Abstract']) # preprocess the abstract
            predicted_labels = predict_gpt(predict_abstract).strip().replace(".", "").split(',')
            predicted_labels = [item.strip() for item in predicted_labels] # the predicted labels
            true_labels = [item.replace('\'', '') for item in predict_item.Labels_str] # the true labels
            if not all(item in all_labels for item in predicted_labels): # check whether the prediction has the expected format
                print("Houston we've got a problem with this {}".format(predicted_labels))
            if set(predicted_labels) == set(true_labels):
                if verbose: print("Fully correct labeling")
                full_correct, one_correct = True, True
            elif not set(predicted_labels).isdisjoint(set(true_labels)):
                if verbose: print("Partially correct labeling")
                full_correct, one_correct = False, True
            else:
                if verbose: print("Wrong labeling")
                full_correct, one_correct = False, False
            prediction_results.append([predict_item, predicted_labels, full_correct, one_correct])
    return prediction_results
Batch predict on 100 publication items (10 per subfield label):
print('This will cost money, do you really want to proceed (y/n)?')
proceed = input()
if proceed == 'n':
    print('Skipping')
elif proceed == 'y':
    nr_samples = 10
    prediction_results = batch_predict(df, nr_samples)
else:
    print('Please answer y or n')
This will cost money, do you really want to proceed (y/n)?
y
100
Next we evaluate the accuracy of the predictions.
In multilabel classification, a prediction can be fully correct, partially correct, or fully incorrect, and different evaluation metrics have been proposed. Here we only consider the exact match ratio, i.e., the proportion of fully correct predictions, and, in addition, the proportion of predictions that are at least partially correct, i.e., where at least one label is predicted correctly.
Furthermore, for each subfield we assess two proportions over the items carrying that label: the proportion for which at least one label was predicted correctly, and the proportion for which, in addition, the subfield in question itself was predicted correctly.
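To make these ratios concrete, here is a small toy example (hypothetical label sets, not taken from our data) of how a single prediction is scored in the batch prediction above:
true_labels = ['Quantum Physics', 'Condensed Matter Physics']
predicted_labels = ['Condensed Matter Physics']
exact_match = set(predicted_labels) == set(true_labels) # False: the label sets differ
partial_match = not set(predicted_labels).isdisjoint(set(true_labels)) # True: at least one label matches
The following function aggregates these outcomes over all prediction results: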
def result_metric(prediction_results):
    results_metric = dict() # dict to store the metrics per label
    nr_one_correct, nr_full_correct = 0, 0
    for prediction in prediction_results:
        if prediction[2] == True:
            nr_full_correct += 1
        elif prediction[3] == True:
            nr_one_correct += 1
        for subfield in all_labels: # besides the overall accuracy we want to assess the accuracies per subfield
            for item in ["_all", "_label", "_item"]:
                results_metric[subfield+item] = results_metric.get(subfield+item, 0)
            if subfield in prediction[0]['Labels_str']:
                results_metric[subfield+"_all"] = results_metric.get(subfield+"_all", 0) + 1
                if prediction[3] == True: # at least one label of the item was predicted correctly
                    results_metric[subfield+"_item"] = results_metric.get(subfield+"_item", 0) + 1
                    if subfield in prediction[1]: # ... and the current subfield is among the predicted labels
                        results_metric[subfield+"_label"] = results_metric.get(subfield+"_label", 0) + 1
    print('Partial match ratio: {:.2%}\nExact match ratio: {:.2%}'.format(
        (nr_one_correct + nr_full_correct)/(nr_samples*10), # nr_samples items for each of the 10 subfields
        nr_full_correct/(nr_samples*10)))
    for subfield in all_labels:
        results_metric[subfield+'_label_rel'] = results_metric[subfield+'_label']/results_metric[subfield+'_all']
        results_metric[subfield+'_item_rel'] = results_metric[subfield+'_item']/results_metric[subfield+'_all']
        print('\n{}: \nTotal: {}\nPartially or fully correct: {:.2%}\nPartially or fully and subfield correct: {:.2%}'.format(subfield,
            results_metric[subfield+"_all"],
            results_metric[subfield+'_item_rel'],
            results_metric[subfield+'_label_rel']))
result_metric(prediction_results)
Partial match ratio: 60.00%
Exact match ratio: 21.00%
Space Sciences:
Total: 13
Partially or fully correct: 100.00%
Partially or fully and subfield correct: 100.00%
Nuclear and Plasma Physics:
Total: 24
Partially or fully correct: 58.33%
Partially or fully and subfield correct: 20.83%
Medical and Biological Physics:
Total: 10
Partially or fully correct: 60.00%
Partially or fully and subfield correct: 50.00%
Condensed Matter Physics:
Total: 12
Partially or fully correct: 100.00%
Partially or fully and subfield correct: 100.00%
Astronomical Sciences:
Total: 13
Partially or fully correct: 46.15%
Partially or fully and subfield correct: 15.38%
Particle and High Energy Physics:
Total: 17
Partially or fully correct: 58.82%
Partially or fully and subfield correct: 52.94%
Classical Physics:
Total: 11
Partially or fully correct: 27.27%
Partially or fully and subfield correct: 9.09%
Synchrotrons and Accelerators:
Total: 14
Partially or fully correct: 42.86%
Partially or fully and subfield correct: 14.29%
Atomic Molecular and Optical Physics:
Total: 12
Partially or fully correct: 41.67%
Partially or fully and subfield correct: 8.33%
Quantum Physics:
Total: 15
Partially or fully correct: 80.00%
Partially or fully and subfield correct: 80.00%
Results:
Only about 20% of GPT-3’s labelings are fully correct
In about 60% of the cases, it gets at least one label right
Some labels, like ‘Classical Physics’, are only very rarely predicted correctly
GPT-3 apparently has problems distinguishing between some pairs of labels, such as ‘Astronomical Sciences’ and ‘Space Sciences’.
The low number of fully correct predictions is certainly due to the fact that we only feed GPT-3 the papers’ abstracts, whereas the Dimensions classification is based on the full papers as input. For publication items labeled with more than one subfield, we can almost always rank the labels and distinguish between primary and secondary attributions. Whereas the primary subfield can mostly be inferred from the abstract, the secondary attributions often cannot, as they are based on information contained only in the article itself.
While GPT-3, with its ‘world knowledge’, outperforms the chance benchmark by far, the per-subfield results are far from satisfactory and not usable for scientometric studies in a scientific context. Hence, in the next experiment, we try to improve the results by fine-tuning GPT-3 appropriately.