GPT-3 fine-tuned
We fine-tune a GPT-3 model for the task of labeling abstracts of publication items from Dimensions. Unfortunately, OpenAI’s fine-tuned models cannot currently be shared. As a result, most cells in this notebook cannot be executed by others. However, this notebook provides a ‘how-to’ guide for users attempting to do the same or something similar.
import pandas as pd
import numpy as np
import openai
import os
import re
import backoff # for exponential backoff, to avoid calling the API at too high a rate, which would interrupt the run
from IPython.display import display, Markdown
# This notebook uses openai's API which is a paid service.
# To recreate the notebook you need your own API key which
# can be acquired through openai's website. For instructions go to:
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
import sys
sys.path.append('../GPT_class/')
import api_key
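The api_key module imported above is local to the authors’ repository; it presumably just registers the key with the openai package. A minimal stand-in, assuming the key is stored in an OPENAI_API_KEY environment variable, could look like this:
# Hypothetical replacement for the local api_key module:
# read the key from an environment variable and register it with the openai package.
openai.api_key = os.getenv('OPENAI_API_KEY')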
Import our pickled data:
data_dir = 'Physics_2020_Dimensions'
df = pd.read_pickle(os.path.join(data_dir, 'dimensions_data_clean.pkl'))
all_labels_full = [item for sublist in df.Labels_str.values.tolist() for item in sublist]
all_labels = list(set(all_labels_full))
The functions below have already been introduced and used in the previous section; see there for a detailed explanation.
def batch_predict(test_data, nr_samples, my_model=False, verbose=False):
    # if my_model is specified we use a fine-tuned model, if not we use GPT-3 vanilla
    prediction_results = []
    j = 0
    for field_label in all_labels: # iterate through the subfields
        predict_items = test_data[test_data[field_label]==True].sample(nr_samples) # sample n publication items per subfield
        for i in range(nr_samples): # predict on each individual item
            j += 1
            print(j, end='\r')
            predict_item = predict_items.iloc[i]
            predict_abstract = preprocess_abstract(predict_item['Abstract']) # preprocess abstract
            if my_model: # if a model is provided we are using a fine-tuned model
                # whose output needs to be processed a bit differently
                predicted_labels = predict_fine_tuned(predict_abstract, my_model)
                predicted_labels = predicted_labels['choices'][0]['text']
                predicted_labels = re.sub(r".*?\.", "", predicted_labels)
                predicted_labels = predicted_labels.replace('\n\n###\n\n','').split('\n')
                predicted_labels = [item.replace('\'','').strip() for item in predicted_labels] # the predicted labels
            else: # vanilla model as the fallback
                predicted_labels = predict_gpt(predict_abstract).strip().replace(".","").split(',')
                predicted_labels = [item.strip() for item in predicted_labels] # the predicted labels
            true_labels = [item.replace('\'','') for item in predict_item.Labels_str] # the true labels
            if not all(item in all_labels for item in predicted_labels): # check if predictions have the correct format
                print("Houston we've got a problem with this {}".format(predicted_labels))
            if set(predicted_labels) == set(true_labels):
                if verbose: print("Fully correct labeling")
                full_correct, one_correct = True, True
            elif not (set(predicted_labels)).isdisjoint(set(true_labels)):
                if verbose: print("Partially correct labeling")
                full_correct, one_correct = False, True
            else:
                if verbose: print("Wrong labeling")
                full_correct, one_correct = False, False
            prediction_results.append([predict_item, predicted_labels, full_correct, one_correct])
    return prediction_results
def result_metric(prediction_results):
    results_metric = dict() # dict to store metrics per label
    nr_one_correct, nr_full_correct = 0, 0
    for prediction in prediction_results:
        if (prediction[2]==True):
            nr_full_correct += 1
        elif (prediction[3]==True):
            nr_one_correct += 1
        for subfield in all_labels: # besides overall accuracy we want to assess the accuracies per subfield
            for item in ["_all", "_label", "_item"]:
                results_metric[subfield+item] = results_metric.get(subfield+item, 0)
            if subfield in prediction[0]['Labels_str']:
                results_metric[subfield+"_all"] = results_metric.get(subfield+"_all", 0) + 1
                if (prediction[3]==True): # item was predicted at least partially correctly
                    results_metric[subfield+"_item"] = results_metric.get(subfield+"_item", 0) + 1
                    if (subfield in prediction[1]): # ... and the current subfield is among the predicted labels
                        results_metric[subfield+"_label"] = results_metric.get(subfield+"_label", 0) + 1
    # nr_samples is taken from the enclosing scope; the factor 10 is the number of subfields
    print('Partial match ratio: {:.2%}\nExact match ratio: {:.2%}'.format(
        (nr_one_correct + nr_full_correct)/(nr_samples*10),
        nr_full_correct/(nr_samples*10)))
    for subfield in all_labels:
        results_metric[subfield+'_label_rel'] = results_metric[subfield+'_label']/results_metric[subfield+'_all']
        results_metric[subfield+'_item_rel'] = results_metric[subfield+'_item']/results_metric[subfield+'_all']
        print('\n{}: \nTotal: {}\nPartially or fully correct: {:.2%}\nPartially or fully and subfield correct: {:.2%}'.format(subfield,
              results_metric[subfield+"_all"],
              results_metric[subfield+'_item_rel'],
              results_metric[subfield+'_label_rel']))
def preprocess_abstract(abstract):
    abstract = abstract.replace("\n", " ") # no newlines in abstracts
    abstract = abstract.replace("abstract:", "").replace("Abstract:", "").replace("ABSTRACT:", "")
    abstract = " " + abstract # due to GPT-3's tokenization it is preferable to add a space at the beginning
    return abstract
First we prepare an appropriate training dataset by randomly sampling 300 publication items per subfield and writing them to a JSONL file.
df_finetune = pd.DataFrame()
nr_samples = 300
# As we have drawn an equal number of items per subfield above, we will do the same here
# to guarantee comparability
for field_label in list(set(all_labels)):
    df_finetune = pd.concat([df[df[field_label]==True].sample(nr_samples).squeeze(axis=0), df_finetune])
df_finetune['Abstract'] = df_finetune['Abstract'].apply(lambda x: preprocess_abstract(x))
# insert a marker to indicate that the prompt has ended
df_finetune['Abstract'] = df_finetune['Abstract'].apply(lambda x: x+'\n\n###\n\n')
df_finetune.rename(columns = {'Abstract':'prompt', 'Labels_str':'completion'}, inplace = True)
df_finetune['completion'] = df_finetune['completion'].astype(str)
df_finetune['completion'] = df_finetune['completion'].str.replace('\[', '').str.replace('\]', '')
# instead of a comma we use a newline character as a separator for the individual labels. This has yielded
# better results in our trials.
df_finetune['completion'] = df_finetune['completion'].str.replace(', ', '\n')
# insert a marker to indicate that the answer has ended. The model will reproduce this marker, which can
# be used as a signal to stop generating
df_finetune['completion'] = df_finetune['completion'].apply(lambda x: x+'\n@@@')
# write the JSONL file
df_finetune[['prompt', 'completion']].to_json("pretrain.jsonl", orient='records', lines=True)
/var/folders/x8/4w9fb82j5_gcsm0cvwtnz7d40000gq/T/ipykernel_2460/215565130.py:18: FutureWarning: The default value of regex will change from True to False in a future version.
df_finetune['completion'] = df_finetune['completion'].str.replace('\[', '').str.replace('\]', '')
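For illustration, a single record in pretrain.jsonl produced by the cell above should look roughly like this (the abstract and labels here are made up):
{"prompt": " We study the pairing symmetry of a two-band superconductor ...\n\n###\n\n", "completion": "'Condensed Matter Physics'\n'Quantum Physics'\n@@@"}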
OpenAI provides a tool that parses, corrects, and reformats the input file for fine-tuning where required. The tool interactively asks the user a series of questions; since we call it from within the notebook, we pipe the answers into the command.
!printf '%s\n' Y Y n Y | openai tools fine_tunes.prepare_data -f pretrain.jsonl
Analyzing...
- Your file contains 3000 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 14 duplicated prompt-completion sets. These are rows: [566, 1398, 1721, 1936, 2007, 2470, 2653, 2697, 2868, 2881, 2909, 2920, 2928, 2977]
- All prompts end with suffix `\n\n###\n\n`
- All prompts start with prefix ` `
- The completion should start with a whitespace character (` `). This tends to produce better results due to the tokenization we use. See https://beta.openai.com/docs/guides/fine-tuning/preparing-your-dataset for more details
Based on the analysis we will perform the following actions:
- [Recommended] Remove 14 duplicate rows [Y/n]: - [Recommended] Add a whitespace character to the beginning of the completion [Y/n]: /Users/buettner/miniforge3/envs/tf_m1/lib/python3.8/site-packages/openai/validators.py:421: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
x["completion"] = x["completion"].apply(
- [Recommended] Would you like to split into training and validation set? [Y/n]:
Your data will be written to a new JSONL file. Proceed [Y/n]:
Wrote modified file to `pretrain_prepared (1).jsonl`
Feel free to take a look!
Now use that file when fine-tuning:
> openai api fine_tunes.create -t "pretrain_prepared (1).jsonl"
After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["s'\n@@@"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 1.23 hours to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.
We can now fine-tune GPT-3 on our data. We use OpenAI’s ‘curie’ model, one of the stronger models available for fine-tuning, and train for 5 epochs.
N.B.: We do not explicitly train this as a classification task (which could be done by providing the --classification_n_classes flag to the fine-tuning command) because, as of now, GPT-3 would not perform multi-label classification but would instead interpret each combination of labels in the input as an individual class.
!openai api fine_tunes.create -m curie --n_epochs 5 -t "pretrain_prepared.jsonl"
!openai api fine_tunes.follow -i ft-0gXDRTVFEESktPojt2KTi7YG # this reconnects to the output stream of the training process if required
[2023-03-09 10:30:58] Created fine-tune: ft-0gXDRTVFEESktPojt2KTi7YG
[2023-03-09 10:39:26] Fine-tune costs $10.39
[2023-03-09 10:39:27] Fine-tune enqueued. Queue number: 0
[2023-03-09 10:39:29] Fine-tune started
[2023-03-09 10:45:55] Completed epoch 1/5
[2023-03-09 10:51:17] Completed epoch 2/5
[2023-03-09 10:56:40] Completed epoch 3/5
[2023-03-09 11:02:04] Completed epoch 4/5
[2023-03-09 11:07:27] Completed epoch 5/5
[2023-03-09 11:07:42] Uploaded model: curie:ft-personal-2023-03-09-10-07-42
[2023-03-09 11:07:43] Uploaded result file: file-ZL7ePyptETasXli6DqCwD1Sk
[2023-03-09 11:07:43] Fine-tune succeeded
Job complete! Status: succeeded 🎉
Try out your fine-tuned model:
openai api completions.create -m curie:ft-personal-2023-03-09-10-07-42 -p <YOUR_PROMPT>
#my_model = "curie:ft-personal-2023-03-06-15-32-53" # this model has been trained on 1000 items
my_model = "curie:ft-personal-2023-03-09-10-07-42" # this model has been trained on 3000 items
N.B.: As can be seen above, fine-tuning on 3000 prompt-completion pairs for 5 epochs cost $10.39. Given the excellent results (see below), this does not seem overly expensive, but it should be noted that this is not a free technology, and processing fees will also be incurred for the later predictions.
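If you want a rough feeling for the prediction costs up front, you can estimate the number of tokens that will be sent to the API. The sketch below uses the common rule of thumb of roughly 1.33 tokens per English word and a placeholder price per 1,000 tokens; both are assumptions on our part, so look up OpenAI’s current pricing for the model you actually use:
# Rough, order-of-magnitude cost estimate for the later predictions.
# PRICE_PER_1K_TOKENS is a hypothetical placeholder -- check OpenAI's pricing page for the real value.
PRICE_PER_1K_TOKENS = 0.012  # hypothetical value in USD

def estimate_prediction_cost(abstracts, price_per_1k_tokens=PRICE_PER_1K_TOKENS):
    # ~1.33 tokens per word is a common rule of thumb for English text;
    # the few tokens generated for the labels are neglected here.
    est_tokens = sum(int(len(str(a).split()) * 1.33) for a in abstracts)
    return est_tokens / 1000 * price_per_1k_tokens

print('Estimated cost for 100 abstracts: ${:.2f}'.format(estimate_prediction_cost(df['Abstract'].sample(100))))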
We are finally ready to use our own fine-tuned model for predicting. Let’s test it and perform a single prediction:
item = df.sample(1).iloc[0]
my_prompt = preprocess_abstract(item['Abstract']) +"\n\n###\n\n"
def predict_fine_tuned(my_prompt, my_model):
    prediction = openai.Completion.create(
        model=my_model,
        prompt=my_prompt,
        temperature=0,   # deterministic output
        stop='\n@@@',    # stop at the end-of-answer marker introduced during fine-tuning
        max_tokens=1500,
        top_p=1,
        frequency_penalty=1,
        presence_penalty=0,
        logprobs=5,
    )
    return prediction
prediction_fine_tuned = predict_fine_tuned(my_prompt, my_model)
print('Abstract: {}'.format(my_prompt))
print('True: {}\nPredicted: {}.'.format(', '.join([label for label in item['Labels_str']]), prediction_fine_tuned['choices'][0]['text']))
Abstract: Theoretical simulations on single dielectric microspherical particles illuminated by plane wave are performed systematically to find out the key parameters for (i) generating elongated photonic nanojet and (ii) obtaining more focal length of the microsphere or working distance. These simulations are performed using analytical theory proposed by Aden and Kerker [1]. In addition, the dependence of the intensity distribution inside and outside the core-shell particles, length and width of the PNJs on the refractive index of the surrounding medium is studied in detail. The difference in the PNJs of the microparticles illuminated with resonant and non-resonant illuminations is also investigated.
###
True: Classical Physics
Predicted: 'Classical Physics'.
Hooray, this seems to work! Let’s test how good it really is by predicting on a larger number of abstracts. First, however, we have to create a test dataset by removing the training data from our full dataset, so as not to predict on items that were used to train the model.
df_test = df.drop(df_finetune.index)
We can now go ahead and batch predict.
print('This will cost money, do you really want to proceed (y/n)?')
proceed = input()
if proceed == 'n':
    print('Skipping')
elif proceed == 'y':
    nr_samples = 10
    prediction_results = batch_predict(df_test, nr_samples, my_model)
else:
    print('Please answer yes or no')
This will cost money, do you really want to proceed (y/n)?
y
Houston we've got a problem with this ['', '', '### Condensed Matter Physics', 'Quantum Physics', 'Condensed Matter Physics', 'Quantum Physics', 'Condensed Matter Physics', 'Quantum Physics', 'Condensed Matter Physics', 'Quantum Physics', 'Condensed Matter Physics', 'Quantum Physics']
100
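A practical note: issuing this many completion requests in quick succession can run into OpenAI’s rate limits, which is what the backoff package imported at the top of the notebook is meant to handle. A minimal sketch of how the call could be wrapped (our assumption of a sensible setup, not necessarily the exact code used for these runs):
# Retry the completion call with exponential backoff whenever the API signals a rate limit.
@backoff.on_exception(backoff.expo, openai.error.RateLimitError, max_tries=8)
def predict_fine_tuned_with_backoff(my_prompt, my_model):
    return predict_fine_tuned(my_prompt, my_model)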
Get result metrics (see explanation in previous section) and compare them to the results achieved with the vanilla GPT-3.
result_metric(prediction_results)
Partial match ratio: 90.00%
Exact match ratio: 65.00%
Astronomical Sciences:
Total: 13
Partially or fully correct: 92.31%
Partially or fully and subfield correct: 76.92%
Quantum Physics:
Total: 16
Partially or fully correct: 87.50%
Partially or fully and subfield correct: 56.25%
Synchrotrons and Accelerators:
Total: 15
Partially or fully correct: 86.67%
Partially or fully and subfield correct: 73.33%
Condensed Matter Physics:
Total: 14
Partially or fully correct: 100.00%
Partially or fully and subfield correct: 92.86%
Nuclear and Plasma Physics:
Total: 29
Partially or fully correct: 100.00%
Partially or fully and subfield correct: 93.10%
Atomic Molecular and Optical Physics:
Total: 17
Partially or fully correct: 88.24%
Partially or fully and subfield correct: 64.71%
Medical and Biological Physics:
Total: 10
Partially or fully correct: 100.00%
Partially or fully and subfield correct: 90.00%
Classical Physics:
Total: 11
Partially or fully correct: 72.73%
Partially or fully and subfield correct: 54.55%
Space Sciences:
Total: 11
Partially or fully correct: 90.91%
Partially or fully and subfield correct: 90.91%
Particle and High Energy Physics:
Total: 21
Partially or fully correct: 90.48%
Partially or fully and subfield correct: 71.43%
Results
The share of fully correct labelings more than triples compared to the vanilla approach, rising to over 60%.
In more than 90% of the cases the fine-tuned model gets at least one label right.
There are now correct predictions for all labels, and the gap between the worst and the best performing label is rather narrow.
N.B. 1: In some cases, the model has been observed to simply dream up a continuation of the abstract it receives as the input prompt; this invented text is prepended to the actual classification in the output. This is an example of the unexpected behavior one should always keep in mind when working with GPT-3. We can, however, identify and filter such output, because the invented text does not appear in the label list.
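A minimal sketch of such a filter (assuming the all_labels list and the post-processing already used in batch_predict above):
# Keep only those parts of a completion that correspond to known subfield labels.
def filter_predicted_labels(completion_text):
    candidates = completion_text.replace('\n\n###\n\n', '').split('\n')
    candidates = [c.replace('\'', '').strip() for c in candidates]
    return [c for c in candidates if c in all_labels]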
Below is such a case: all text before \n\n###\n\n was simply generated by the model!
print('Abstract:')
print(df[df.DOI == '10.48550/arxiv.2004.11821']['Abstract'].values[0])
print('Prediction together with \'imagined\' completion of abstract:\n')
predict_fine_tuned(df[df.DOI == '10.48550/arxiv.2004.11821']['Abstract'].values[0], my_model = "curie:ft-personal-2023-03-06-15-32-53")['choices'][0]['text']
Abstract:
We model the Galactic population of detached binaries that harbor black holes
with (0.5-1.7) solar mass companions -- the remnants of case B mass-exchange
that rapidly cross Hertzsprung gap after the termination of the Roche-lobe
overflow or as He-shell burning stars. Several such binaries can be currently
present in the Galaxy. The range of black hole masses in them is about 4 to 10
solar ones, the orbital periods are tens to hundreds day. The unique BH-binary
LB-1 fits well into this extremely rare class of double stars.
Prediction together with 'imagined' completion of abstract:
"We show that the\nbinary system is a good candidate for the formation of black holes in case B.\nThe orbital period of LB-1 is about 0.5 days, which is close to the period of binary\nsystems with (0.5-1) solar mass BHs and orbital periods from tens to hundreds day. The\nperiod ratio between these systems and LB-1 can be as high as 1:2, which means that they are\nlikely formed by similar mechanisms -- mass transfer from a companion star or merger of two stars. We also discuss the possibility that some other binaries with orbital periods up to several years could be formed in this way, but we do not expect such systems to be common in Galactic space due to their low probability for occurrence and long lifetime (several decades). We also show that there are no known cases when two BH binaries merge into one system after crossing Hertzsprung gap; however, it is possible for them to merge if they have similar masses and orbits at the time when they cross Hertzsprung gap. In this case, we expect them to merge into one single system with a mass greater than 2 solar ones; however, such mergers are still rare events because most BH binaries have very short lifetimes (several decades). Finally we discuss how our model can help us understand how black holes form in case A stars -- massive stars without companions -- by merging with other massive stars or merging with neutron stars during their evolution after He-burning phase ends.\n###\n\n 'Astronomical Sciences'"
N.B. 2: Upon inspection of cases where the predictions deviate from the Dimensions ground truth, it appears that our fine-tuned model’s predictions are as reliable as the ground truth classifications. In cases where the GPT-3 predictions differ from the labeling in Dimensions, we found that either the ground truth label is incorrect or the labeling is ambiguous, and the label predicted by our fine-tuned model is just as justifiable as the one contained in the Dimensions DB.
Here are two examples:
Our fine-tuned model predicts that the item with the DOI 10.1051/epjconf/202023812018 falls into the subfield ‘Atomic Molecular and Optical Physics,’ whereas it is labeled ‘Particle and High Energy Physics’ and ‘Synchrotrons and Accelerators’ in Dimensions. Given that the item was published in the proceedings of ‘Topical Meeting (TOM) 13- Advances and Applications of Optics and Photonics,’ GPT-3’s prediction is certainly correct here.
Our fine-tuned model predicts that the item with the DOI 10.1088/1361-6587/abb0f7 falls into the subfield ‘Nuclear and Plasma Physics,’ but it is labeled as Astronomical Sciences in Dimensions. Given that the paper was published in the journal ‘Plasma Physics and Controlled Fusion,’ it is rather clear that GPT-3’s prediction is correct here!
print('Publication item:')
display(df[df.DOI == '10.1051/epjconf/202023812018'].head(1))
print('Predicted:')
predict_fine_tuned(df[df.DOI == '10.1051/epjconf/202023812018']['Abstract'].values[0],
my_model = "curie:ft-personal-2023-03-09-10-07-42")['choices'][0]['text'].replace('\n\n###\n\n', '')
Publication item:
| Rank | DOI | Title | Dimensions URL | Abstract | ANZSRC | Length | Labels | Labels_num | Labels_str | Space Sciences | Quantum Physics | Atomic Molecular and Optical Physics | Medical and Biological Physics | Synchrotrons and Accelerators | Classical Physics | Condensed Matter Physics | Particle and High Energy Physics | Astronomical Sciences | Nuclear and Plasma Physics
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
60526 | 100 | 10.1051/epjconf/202023812018 | Optical Simulation and Design of Spatial Heter... | https://app.dimensions.ai/details/publication/... | We present the optical simulation and design c... | 51 Physical Sciences; 5107 Particle and High E... | 585 | [[5107, 5110], [Particle and High Energy Physi... | [5107, 5110] | [Particle and High Energy Physics, Synchrotron... | False | False | False | False | True | False | False | True | False | False |
Predicted:
" 'Atomic Molecular and Optical Physics'"
print('Publication item:')
display(df[df.DOI == '10.1088/1361-6587/abb0f7'].head(1))
print('Predicted:')
predict_fine_tuned(df[df.DOI == '10.1088/1361-6587/abb0f7']['Abstract'].values[0],
my_model = "curie:ft-personal-2023-03-09-10-07-42")['choices'][0]['text'].replace('\n\n###\n\n', '')
Publication item:
| Rank | DOI | Title | Dimensions URL | Abstract | ANZSRC | Length | Labels | Labels_num | Labels_str | Space Sciences | Quantum Physics | Atomic Molecular and Optical Physics | Medical and Biological Physics | Synchrotrons and Accelerators | Classical Physics | Condensed Matter Physics | Particle and High Energy Physics | Astronomical Sciences | Nuclear and Plasma Physics
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3069 | 100 | 10.1088/1361-6587/abb0f7 | Electronegative magnetized plasma sheath prope... | https://app.dimensions.ai/details/publication/... | The three-fluid model was employed to study el... | 51 Physical Sciences; 5101 Astronomical Sciences | 1155 | [[5101], [Astronomical Sciences]] | [5101] | [Astronomical Sciences] | False | False | False | False | False | False | False | False | True | False |
Predicted:
" 'Nuclear and Plasma Physics'"