Fine-Tuning GPT: Creating a Real AI Version of Star Trek’s Lt. Cdr. Data

Last year, OpenAI released a fine-tuning API that allows developers to customize the latest GPT models, including GPT-4. To explore its capabilities, I embarked on an interesting experiment: creating a real AI based on a fictional one. Using every line of dialogue from Star Trek: The Next Generation’s android character Data, I fine-tuned GPT to replicate his unique communication patterns. This project not only showcases the practical applications of the new API but also highlights the fascinating intersection between science fiction and real artificial intelligence. Let’s look at how the new fine-tuning process works, what it takes to prepare training data, and how well the resulting model performs compared to the base GPT model.

This article draws upon insights from the e-learning course Machine Learning, Data Science and Generative AI with Python, which covers a comprehensive range of topics in AI development.

Introduction to the Fine-Tuning API

In August 2023, OpenAI introduced an updated fine-tuning API, marking a significant advancement in AI customization. This API allows developers to fine-tune the latest models, such as GPT-4o. This development opens up new possibilities for tailoring AI models to specific needs and applications.

Structure and Requirements for Fine-Tuning

The updated fine-tuning API operates similarly to the older version, with a key difference: the input data must be in the chat completions API format. While you still provide a JSONL file for training, containing one JSON object per line, the data now requires more structure to accommodate the chat format.

Example: Fine-Tuning a Model to Simulate Lt. Cdr. Data from Star Trek

Imagine creating a fine-tuned model of GPT to play the role of Data from the TV series Star Trek: The Next Generation. Data is a fictional android, and the goal is to transform this fictional character into a real artificial intelligence.

In this scenario, you can assign a system role with each message, providing additional context about the task. For instance, you can instruct the system to embody Data, the android from the show. You can then supply examples of user and assistant responses, potentially expanding to larger conversations for more extensive training. Typically, this involves using pairs of dialogue lines where someone interacts with Data, and Data responds.

For example, consider Data’s first line in Star Trek, season one. Picard says, “You will agree, Data, that Starfleet’s instructions are difficult,” to which Data responds, “Difficult? How so? Simply solve the mystery of Farpoint Station.” Feeding these exchanges into the fine-tuning API builds a model that learns to talk like Data, and the same process applies to whichever current models the API supports.
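That exchange maps directly onto one line of the JSONL training file. Here is a minimal sketch of how such a line can be built; the `PICARD:`/`DATA:` speaker prefixes mirror how the extraction script later in this article formats dialogue:

```python
import json

# One training example in chat-completions format, from Data's first scene.
example = {
    "messages": [
        {"role": "system",
         "content": "Data is an android in the TV series Star Trek: The Next Generation."},
        {"role": "user",
         "content": "PICARD: You will agree, Data, that Starfleet's instructions are difficult."},
        {"role": "assistant",
         "content": "DATA: Difficult? How so? Simply solve the mystery of Farpoint Station."},
    ]
}

# Each line of the JSONL file is one such object, serialized onto a single line.
print(json.dumps(example))
```

Using json.dumps here, rather than concatenating strings, guarantees that quotes and other special characters are escaped into valid JSON.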

Necessity of Fine-Tuning vs. Prompt Engineering

OpenAI emphasizes the importance of evaluating whether fine-tuning is truly necessary, as their models perform exceptionally well out of the box. Often, you can achieve desired results through prompt engineering—refining your prompts or providing a few examples—rather than resorting to the more costly and complex process of fine-tuning.

With older models, fine-tuning was more incentivized, but the newer models, like GPT-4o, often don’t require it. For instance, even GPT-3.5 can adequately play the role of Data from Star Trek without fine-tuning, although fine-tuning enhances its performance. (This capability does raise questions about how the model may have been trained on copyrighted TV scripts.)
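As a sketch of the prompt-engineering alternative, the base model can be asked to play Data purely through the system message, with no fine-tuning at all. The prompt wording here is my own illustration, and the call is guarded so it only runs when an API key is configured:

```python
import os

# A system prompt alone can keep the base model in character.
messages = [
    {"role": "system",
     "content": "You are Data, the android from Star Trek: The Next Generation. "
                "Speak precisely, avoid contractions, and reason like an android."},
    {"role": "user", "content": "Data, how are you feeling today?"},
]

# Only call the API when a key is available in the environment.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires the openai package, v1 or later
    client = OpenAI()
    response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
    print(response.choices[0].message.content)
```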

Creating Training Data for Fine-Tuning

Once you have a fine-tuned model, it will be used with the chat completions API instead of the legacy completions API. The process is straightforward: define the message structure, omitting the assistant role to let the model generate it. You can provide both training and validation files for the fine-tuning job. Use OpenAI’s file API to upload these files in advance, referring to them by their IDs in subsequent requests to initiate the fine-tuning job. This approach provides objective metrics on the model’s performance, allowing you to assess how accurately it predicts responses, such as those Data from Star Trek might have given.

Python Script for Extracting Dialogue

Let’s explore an example of creating a real AI based on a fictional AI using OpenAI’s fine-tuning API. We aim to fine-tune GPT to simulate Data from Star Trek: The Next Generation. By training the model with every line of dialogue Data ever said, we can produce a simulation closely resembling the original character.

To gather this data, we extract every line spoken by Data and the preceding line from the scripts of Star Trek: The Next Generation. Although sharing the scripts directly would infringe copyright, they are accessible online for personal use. 

The challenge is to create a training data file for fine-tuning GPT. The system role is defined as an android from Star Trek: The Next Generation, with the user being whoever interacts with Data. For example, Picard might say, “You will agree, Data, that Starfleet’s instructions are difficult,” and Data would respond, “Difficult? How so?” By doing this for all of Data’s lines, we can use the chat completions API to generate responses consistent with Data’s speech in the show.

Uploading Files and Starting the Fine-Tuning Job

To begin the fine-tuning process, we need to prepare a script. This script, named extract_script_new.py, is a pre-processing Python script designed to handle data wrangling, which is a crucial part of the job. The script starts by using the process_directory function, which points to the directory where all the Star Trek scripts are stored. If you’re replicating this process, ensure you adjust the path to match where your scripts are saved.

The script can extract dialogue for any Star Trek character, but for this project we focus on Data: there is a certain poetry in building a real AI from a fictional one.

# extract_script_new.py
import os
import re
import json
import random

character_lines = []

def strip_parentheses(s):
    return re.sub(r'\(.*?\)', '', s)

def is_single_word_all_caps(s):
    # First, we split the string into words
    words = s.split()

    # Check if the string contains only a single word
    if len(words) != 1:
        return False

    # Make sure it isn't a line number
    if bool(re.search(r'\d', words[0])):
        return False

    # Check if the single word is in all caps
    return words[0].isupper()

def process_directory(directory_path, character_name):
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):  # Ignore directories
            extract_character_lines(file_path, character_name)

    with open(f'./{character_name}_lines.jsonl', 'w', newline='') as outfile:
        prevLine = ''
        for s in character_lines:
            if s.startswith(character_name + ':'):
                # Serialize with json.dumps so quotes, backslashes, and other
                # special characters are escaped into valid JSON.
                entry = {"messages": [
                    {"role": "system", "content": "Data is an android in the TV series Star Trek: The Next Generation."},
                    {"role": "user", "content": prevLine},
                    {"role": "assistant", "content": s},
                ]}
                outfile.write(json.dumps(entry) + '\n')
            prevLine = s

def extract_character_lines(file_path, character_name):
    with open(file_path, 'r') as script_file:
        lines = script_file.readlines()

    is_character_line = False
    current_line = ''
    current_character = ''
    for line in lines:
        strippedLine = line.strip()
        if (is_single_word_all_caps(strippedLine)):
            is_character_line = True
            current_character = strippedLine
        elif (line.strip() == '') and is_character_line:
            is_character_line = False
            dialog_line = strip_parentheses(current_line).strip()
            dialog_line = dialog_line.replace('"', "'")
            character_lines.append(current_character + ": " + dialog_line)
            current_line = ''
        elif is_character_line:
            current_line += line.strip() + ' '
            
def split_file(input_filename, train_filename, eval_filename, split_ratio=0.8, max_lines=10000):
    """
    Splits the lines of the input file into training and evaluation files.

    :param input_filename: Name of the input file to be split.
    :param train_filename: Name of the output training file.
    :param eval_filename: Name of the output evaluation file.
    :param split_ratio: Ratio of lines allocated to training. Default is 0.8, i.e., 80%.
    :param max_lines: Maximum number of lines to keep, applied after shuffling.
    """
    
    with open(input_filename, 'r') as infile:
        lines = infile.readlines()

    # Shuffle lines to ensure randomness
    random.shuffle(lines)
    
    lines = lines[:max_lines]

    # Calculate the number of lines for training
    train_len = int(split_ratio * len(lines))

    # Split the lines
    train_lines = lines[:train_len]
    eval_lines = lines[train_len:]

    # Write to the respective files
    with open(train_filename, 'w') as trainfile:
        trainfile.writelines(train_lines)

    with open(eval_filename, 'w') as evalfile:
        evalfile.writelines(eval_lines)

process_directory('e:/Downloads23/scripts_tng', 'DATA')
split_file('./DATA_lines.jsonl', './DATA_train.jsonl', './DATA_eval.jsonl')

The process_directory function iterates through each file in the specified directory, confirming each is a file and not a directory. It then passes the file to the extract_character_lines function. Since scripts are not structured data, assumptions are made to process the information. The script reads each line into an array called lines and tracks whether the current line is part of a dialogue.

The assumption is that a line with a single word in all caps, without numbers, indicates a character’s dialogue. For example, if “PICARD” is in all caps, the following lines are assumed to be Picard’s dialogue. A blank line indicates the end of the dialogue.
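This heuristic is easy to exercise on a few sample lines. A small standalone version of the check, mirroring the is_single_word_all_caps function above, behaves like this:

```python
import re

def looks_like_character_cue(s):
    # One word, all caps, containing no digits -- the same rule the
    # extraction script uses to spot a character name.
    words = s.split()
    return len(words) == 1 and not re.search(r'\d', words[0]) and words[0].isupper()

print(looks_like_character_cue("PICARD"))       # True: a character cue
print(looks_like_character_cue("42"))           # False: a page or line number
print(looks_like_character_cue("INT. BRIDGE"))  # False: two words, a scene heading
```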

The script identifies who is speaking by looking for these capitalized words, using the is_single_word_all_caps function to confirm character names. The output is written to DATA_lines.jsonl, formatted for the chat completions API. Each line includes a system role, which provides context that Data is an android in “Star Trek: The Next Generation,” a user role for the person speaking to Data, and an assistant role for Data’s response.

Finally, the script creates training and evaluation files, allowing OpenAI to measure the model’s progress. This process helps determine how accurately the model simulates Data’s speech, providing metrics to evaluate its performance objectively.

Monitoring the Fine-Tuning Process

To prepare for fine-tuning, we use the split_file function to split the original DATA_lines.jsonl file into DATA_train.jsonl and DATA_eval.jsonl, following an 80-20 split for training and testing data. This split ensures that the model has sufficient data for training while reserving a portion for evaluation.

For cost efficiency during development, you can limit the number of lines processed by setting max_lines to a smaller number. Initially, I used 100 lines of Data’s dialogue to expedite the process and minimize costs. However, the complete dataset contains approximately 6,000 lines, and for this example, we increase it to 10,000 lines to ensure comprehensive training.

To generate these files, execute the command python extract_script_new.py. This script will create the necessary training and evaluation files, readying them for use in fine-tuning the model.

Comparing Fine-Tuned and Non-Fine-Tuned Models

To evaluate the effectiveness of the fine-tuning process, we can use a Jupyter notebook to run comparisons between the fine-tuned and non-fine-tuned models. This step involves executing the fine-tuning process and analyzing the results. Note that fine-tuning can incur costs, potentially around $50, so consider this before proceeding with your own experiments.

!pip install openai --upgrade

Output:

Requirement already satisfied: openai in e:\anaconda3\lib\site-packages (0.27.9)
Requirement already satisfied: tqdm in e:\anaconda3\lib\site-packages (from openai) (4.64.1)
Requirement already satisfied: requests>=2.20 in e:\anaconda3\lib\site-packages (from openai) (2.28.1)
Requirement already satisfied: aiohttp in e:\anaconda3\lib\site-packages (from openai) (3.8.3)
Requirement already satisfied: idna<4,>=2.5 in e:\anaconda3\lib\site-packages (from requests>=2.20->openai) (3.4)
Requirement already satisfied: certifi>=2017.4.17 in e:\anaconda3\lib\site-packages (from requests>=2.20->openai) (2022.12.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in e:\anaconda3\lib\site-packages (from requests>=2.20->openai) (1.26.13)
Requirement already satisfied: charset-normalizer<3,>=2 in e:\anaconda3\lib\site-packages (from requests>=2.20->openai) (2.1.1)
Requirement already satisfied: frozenlist>=1.1.1 in e:\anaconda3\lib\site-packages (from aiohttp->openai) (1.3.3)
Requirement already satisfied: aiosignal>=1.1.2 in e:\anaconda3\lib\site-packages (from aiohttp->openai) (1.3.1)
Requirement already satisfied: multidict<7.0,>=4.5 in e:\anaconda3\lib\site-packages (from aiohttp->openai) (6.0.2)
Requirement already satisfied: async-timeout<5.0,>=4.0.0a3 in e:\anaconda3\lib\site-packages (from aiohttp->openai) (4.0.2)
Requirement already satisfied: attrs>=17.3.0 in e:\anaconda3\lib\site-packages (from aiohttp->openai) (22.1.0)
Requirement already satisfied: yarl<2.0,>=1.0 in e:\anaconda3\lib\site-packages (from aiohttp->openai) (1.8.1)
Requirement already satisfied: colorama in e:\anaconda3\lib\site-packages (from tqdm->openai) (0.4.6)

To begin, ensure you have the latest version of the OpenAI package installed, as version 0.27.9 or later is required for the fine-tuning API. After installation, import the necessary os and OpenAI packages. Store your OpenAI developer key in the OPENAI_API_KEY environment variable; the client picks it up automatically, so once set, you won’t need to reference it again in code.

import os
from openai import OpenAI
client = OpenAI()

To begin the fine-tuning process, upload the training and evaluation data files using the OpenAI file API. Specify the file path and set the purpose to ‘fine-tune’ to ensure proper usage. After uploading, the API will return file IDs, which are necessary for subsequent requests.

client.files.create(
  file=open("./DATA_train.jsonl", "rb"),
  purpose='fine-tune'
)

Output:

<File file id=file-9lI2ovFA1UJskgOPpxDTwEhG at 0x2266872ea90> JSON: {
  "object": "file",
  "id": "file-9lI2ovFA1UJskgOPpxDTwEhG",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 1774941,
  "created_at": 1692794707,
  "status": "uploaded",
  "status_details": null
}

client.files.create(
  file=open("./DATA_eval.jsonl", "rb"),
  purpose='fine-tune'
)

Output:

<File file id=file-UqPVnkk9z8Q74BEUqPlnhjHL at 0x226669e59f0> JSON: {
  "object": "file",
  "id": "file-UqPVnkk9z8Q74BEUqPlnhjHL",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 442619,
  "created_at": 1692794711,
  "status": "uploaded",
  "status_details": null
}

To ensure your files are ready for fine-tuning, it’s important to check their status. This step will reveal any validation errors, such as malformed JSON. You can use the files.retrieve method with the file ID to verify the status. For instance, when checking an evaluation file, it should confirm that everything is fine. If there are issues, it will indicate errors and guide you on how to fix them. OpenAI plans to release a Python script to help validate data and estimate training costs, which may be available by the time you read this.

client.files.retrieve("file-UqPVnkk9z8Q74BEUqPlnhjHL")

Output:

<File file id=file-UqPVnkk9z8Q74BEUqPlnhjHL at 0x2266865c180> JSON: {
  "object": "file",
  "id": "file-UqPVnkk9z8Q74BEUqPlnhjHL",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 442619,
  "created_at": 1692794711,
  "status": "uploaded",
  "status_details": null
}

With our training and evaluation data ready, we can initiate the fine-tuning job. To do this, we pass the ID of the training file uploaded via the file API, along with the optional validation file, and specify the model we wish to fine-tune, which in this case is GPT-3.5-turbo but should also work with newer models. This process allows us to customize GPT to meet specific needs, such as simulating Data from Star Trek.

client.fine_tuning.jobs.create(
  training_file="file-9lI2ovFA1UJskgOPpxDTwEhG",
  validation_file="file-UqPVnkk9z8Q74BEUqPlnhjHL",
  model="gpt-3.5-turbo"
)

Output:

<FineTuningJob fine_tuning.job id=ftjob-mQlhbPB5vsog1SeDLNx2xAMj at 0x226669e58b0> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-mQlhbPB5vsog1SeDLNx2xAMj",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692794886,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-DBeDgDH8c36NSJobwuaBPXrW",
  "result_files": [],
  "status": "created",
  "validation_file": "file-UqPVnkk9z8Q74BEUqPlnhjHL",
  "training_file": "file-9lI2ovFA1UJskgOPpxDTwEhG",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": null
}

Once the fine-tuning job is initiated, it provides information about the job status. Monitoring the progress is important, as the process can take some time; in this case, it took about half an hour, which is quite efficient. To track the job, use the fine_tuning.jobs.retrieve function with the job ID returned by the create call. This function gives general information about the fine-tuning job, ensuring it is proceeding correctly.

For instance, the default number of epochs is set to three, but you can specify a different number if needed. Keep in mind that costs are incurred per epoch based on the number of tokens in your training dataset. Therefore, it’s advisable not to exceed the necessary number of epochs unless you’re prepared for the additional expense.
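Since cost scales with the token count multiplied by the number of epochs, a back-of-the-envelope estimate is easy to sketch. The price below is an assumption based on gpt-3.5-turbo fine-tuning pricing at launch; check OpenAI’s current pricing page before relying on it:

```python
def estimate_training_cost(total_tokens, n_epochs=3, price_per_1k_tokens=0.008):
    # price_per_1k_tokens is an assumed figure (gpt-3.5-turbo fine-tuning
    # training price at launch); substitute the current rate.
    return total_tokens * n_epochs * price_per_1k_tokens / 1000

# e.g. a 2-million-token training file over the default 3 epochs
print(round(estimate_training_cost(2_000_000), 2))  # roughly $48
```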

client.fine_tuning.jobs.retrieve("ftjob-mQlhbPB5vsog1SeDLNx2xAMj")

Output:

<FineTuningJob fine_tuning.job id=ftjob-mQlhbPB5vsog1SeDLNx2xAMj at 0x22666a16220> JSON: {
  "object": "fine_tuning.job",
  "id": "ftjob-mQlhbPB5vsog1SeDLNx2xAMj",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1692794886,
  "finished_at": null,
  "fine_tuned_model": null,
  "organization_id": "org-DBeDgDH8c36NSJobwuaBPXrW",
  "result_files": [],
  "status": "running",
  "validation_file": "file-UqPVnkk9z8Q74BEUqPlnhjHL",
  "training_file": "file-9lI2ovFA1UJskgOPpxDTwEhG",
  "hyperparameters": {
    "n_epochs": 3
  },
  "trained_tokens": null
}

To monitor the progress of your fine-tuning job, you can use the fine_tuning.jobs.list_events function. This lets you specify which job’s events to track and how many of the most recent events you wish to see. By passing in the ID of your fine-tuning job, you can receive updates on its status. Once the job is complete, you’ll receive a message confirming its success, along with the ID of the fine-tuned model. This ID is crucial for using the model with the chat completions API or in the playground.

client.fine_tuning.jobs.list_events(fine_tuning_job_id="ftjob-mQlhbPB5vsog1SeDLNx2xAMj", limit=10)

Output:

<OpenAIObject list at 0x22668759950> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-n0GA9lmPtAulghPIgsfSSdM1",
      "created_at": 1692797270,
      "level": "info",
      "message": "Fine-tuning job successfully completed",
      "data": null,
      "type": "message"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-miblzvSktANikUk7sJOQe6Ir",
...(output truncated)...
  ],
  "has_more": false
}

Upon completion, the training loss was recorded at 1.89, and the token accuracy was 0.54. Unlike previous experiments where metrics were provided throughout the training process, this time only the final metrics were available. Typically, you would see these metrics fluctuate as training progresses, providing insight into the model’s improvement.
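The comparison itself then comes down to sending the same prompt to the base model and to the fine-tuned model. A sketch of that side-by-side check, where the fine-tuned model ID is a placeholder for the one your completed job reports:

```python
import os

PROMPT = [
    {"role": "system",
     "content": "Data is an android in the TV series Star Trek: The Next Generation."},
    {"role": "user", "content": "PICARD: Data, what do you make of this anomaly?"},
]

# The second ID is a placeholder -- substitute the fine-tuned model ID
# reported when your job finishes.
MODELS = ["gpt-3.5-turbo", "ft:gpt-3.5-turbo-0613:my-org::example"]

# Guarded so the sketch runs only when an API key is configured.
if os.environ.get("OPENAI_API_KEY"):
    from openai import OpenAI  # requires the openai package, v1 or later
    client = OpenAI()
    for model in MODELS:
        reply = client.chat.completions.create(model=model, messages=PROMPT)
        print(f"{model}:\n{reply.choices[0].message.content}\n")
```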

Conclusion: The Future of AI Development

In this article, we explored the fascinating process of fine-tuning GPT to create a real AI version of Star Trek’s Data. We walked through the intricacies of OpenAI’s fine-tuning API, the preparation of training data, and the steps involved in customizing an AI model. This experiment not only showcased the practical applications of advanced AI technologies but also highlighted the intersection between science fiction and real-world artificial intelligence. By understanding these processes, we gain valuable insights into the potential and limitations of current AI development techniques. Thank you for joining us on this journey through the cutting edge of AI technology.

If you’re intrigued by the possibilities of AI and machine learning, and want to dive deeper into these topics, consider exploring the comprehensive Machine Learning, Data Science and Generative AI with Python course. This course covers a wide range of topics from basic statistics to advanced AI techniques, providing you with the skills needed to excel in the rapidly evolving field of artificial intelligence.

Published by

Frank Kane

Our courses are led by Frank Kane, a former Amazon and IMDb developer with extensive experience in machine learning and data science. With 26 issued patents and 9 years of experience at the forefront of recommendation systems, Frank brings real-world expertise to his teaching. His ability to explain complex concepts in accessible terms has helped over one million students worldwide gain valuable skills in machine learning, data engineering, and AI development.
