Fine-Tuning a Custom GPT Model with Personal Chat History using Infinite.Tech

Transforming the conversations into Fine-Tune JSONL format.

The exported conversations.json file needs to be converted into a training set. This can be done with the Python script below; I also needed to do some handwork afterward to remove a few broken entries.

Python Scripts

Save the script below to a file named conversationsToTuningSet.py and run it from the folder containing your exported conversations.json:

python conversationsToTuningSet.py [line_count]
# Run this script in a folder containing a conversations.json file.
# Creates a .jsonl file that can be used as a tuning set for the chatbot.
#
# Usage:
# python conversationsToTuningSet.py [line_count]

import json
import sys
import random

# Load your dataset (replace with the path to your dataset file)
with open('conversations.json') as file:
    data = json.load(file)


# Function to process each conversation entry
def process_dataset(data):
    processed_data = []

    for entry in data:
        title = entry.get('title', '')
        if title is None:
            title = "No title"
        mapping = entry.get('mapping', {})

        # Add the conversation title as a system message
        newMessage = {"messages": [{"role": "system", "content": title}]}

        # Iterate through the messages in the mapping and add them to the conversation
        for key, value in mapping.items():
            message_info = value.get('message')
            if message_info:
                role = message_info.get('author', {}).get('role')
                content = message_info.get('content')
                parts = content.get('parts') if content else None

                # Skip system and tool messages
                if role == "system" or role == "tool":
                    continue

                if role and parts and len(parts) > 0:
                    newMessage["messages"].append({"role": role, "content": parts[0]})

        # Skip conversations that only contain the title/system message
        if len(newMessage["messages"]) < 2:
            continue

        processed_data.append(json.dumps(newMessage))

    return processed_data


# Process the dataset
processed_data = process_dataset(data)

# If a line count argument was given, randomly reduce the dataset to that size
if len(sys.argv) > 1:
    line_count = int(sys.argv[1])
    processed_data = random.sample(processed_data, line_count)

# Re-encode with utf-8, ignoring characters that cannot be encoded
processed_data = [line.encode('utf-8', errors='ignore').decode('utf-8') for line in processed_data]

with open('conversations_processed.jsonl', 'w') as file:
    for line in processed_data:
        file.write(line + '\n')

  • Creates a new File with only the conversations for experimentation
  • Transforms the simplified JSON into a JSONL Training Set
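
For reference, each line of the resulting conversations_processed.jsonl is a standalone JSON object in the chat fine-tuning format produced by the script above; the content values below are made up for illustration:

{"messages": [{"role": "system", "content": "Reading list ideas"}, {"role": "user", "content": "Can you suggest a few books for 2024?"}, {"role": "assistant", "content": "Here are a few to start with..."}]}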

Some Handwork (Encoding Errors) on the JSONL set

  • Remove broken lines, nulls, and other oddities by hand (a quick validation pass like the sketch below helps flag them)
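
A quick validation pass can flag most of the broken lines before editing them by hand. This is only a sketch; the output filename (conversations_clean.jsonl) and the drop criteria are assumptions, not part of the original workflow:

# Sketch: flag and drop broken lines from the tuning set before hand-editing.
# Assumes conversations_processed.jsonl from the script above.
import json

clean = []
with open('conversations_processed.jsonl', encoding='utf-8', errors='replace') as f:
    for i, line in enumerate(f, 1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as e:
            print(f'Dropping line {i}: invalid JSON ({e})')
            continue
        messages = obj.get('messages', [])
        # Keep only entries where every message has non-empty string content
        if messages and all(isinstance(m.get('content'), str) and m['content'].strip() for m in messages):
            clean.append(line.rstrip('\n'))
        else:
            print(f'Dropping line {i}: empty or non-string content')

with open('conversations_clean.jsonl', 'w', encoding='utf-8') as f:
    f.write('\n'.join(clean) + '\n')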

A sample of 600 to start without breaking the bank

  • I took a set of 600 random lines to see what would happen (see the command below)
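
Since the conversion script already accepts a line-count argument and samples randomly, producing that subset is a single command:

python conversationsToTuningSet.py 600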

Tuning on The Playground with Conversation Sample

Once the training file is ready, the fine-tuning job can be submitted through the OpenAI Playground.
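
The same job can also be submitted programmatically. Here is a minimal sketch using the OpenAI Python client; the openai package, an OPENAI_API_KEY environment variable, and gpt-3.5-turbo as the base model are assumptions here (the article itself used the Playground UI):

# Sketch: upload the training set and start a fine-tuning job via the API.
from openai import OpenAI

client = OpenAI()

# Upload the JSONL file produced by the conversion script
training_file = client.files.create(
    file=open("conversations_processed.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job against the uploaded file
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

print("Job submitted:", job.id, job.status)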

Completed Model

The model is trained on a sample set of 600 conversation prompts and responses with a file size of 1.3 MB. After running the scripts, I have 12 more MB of pure conversation data… Training used 900,471 tokens and cost about $7.
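
As a rough sanity check on that cost, assuming gpt-3.5-turbo fine-tuning was billed at about $0.008 per 1K training tokens at the time: 900,471 tokens × $0.008 / 1,000 ≈ $7.20, which lines up with the ~$7 figure.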

The testing and evaluation of the trained FR-1 model were performed within Infinite.Tech.

First Impressions: Random Questions

“List a series of Ideas”

“Reading list for 2024”

Evaluating Broad Responses using GPT-4: Part 1

After the answers to the various questions were collected with their model assignments, they were fed into GPT-4. With a general idea of each model's characteristics, a series of preliminary evaluations was outlined to better pinpoint the FR-1 model's specializations, shortcomings, and uses.

Evaluating GPT Domain Niche with GPT-4: Part 2

A series of test prompts to discover strengths and weaknesses

Evaluating The Specialization Responses with GPT-4 (Large Context): Part 3

The Lathe Protocol Gameshow

Using four different models, FR-1 was evaluated in one of its discovered specializations: safety protocol writing.

After using GPT-4 with a large context to evaluate all the created protocols, FR-1 performed second only to GPT-4 in its domain!
