Cleaning your data with Ambrosia

To instruction-tune an LLM, you need an instruction-tuning dataset. One of the most popular is Teknium’s GPTeacher General-Instruct. The basic version consists of 18,194 entries generated by GPT-4.

A single record looks like this:

{
    "instruction": "Rewrite the given sentence in passive voice.",
    "input": "The committee approved the new budget.",
    "response": "The new budget was approved by the committee."
}

We’ll use Ambrosia (v0.1.0) to clean this particular dataset as an example (to be clear, GPTeacher is already good; let’s just make it even better), but the same concepts apply to all your training data.

As we go through this, remember that how you clean your data depends on what you’re trying to do with it and the model it’s being applied to, and it will change over time as we learn more about how to create better datasets. This is a guide for Ambrosia, not for creating ideal data.

convert to JSONL

Ambrosia is designed to work with JSONL files, but General-Instruct is provided as JSON. We can convert it to JSONL with a quick jq command:

$ jq -c ".[]" gpt4-instruct-dedupe-only-dataset.json > gpgi.jsonl

We end up with 18,194 records:

$ wc -l gpgi.jsonl
   18194 gpgi.jsonl
$

whitespace

The whitespace command will strip out any Unicode-defined whitespace at the start or end of a specified field. We can clean every field like so:

$ ambrosia whitespace \
--fields instruction,input,response \
gpgi.jsonl gpgi_trimmed.jsonl
11:08AM INF loaded data command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl
11:08AM INF trimming whitespace command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl
11:08AM INF trimmed whitespace command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl trim_count=47
11:08AM INF writing trimmed data command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl trim_count=47
11:08AM INF done command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl trim_count=47
$

The trim_count field tells us there were 47 fields with leading or trailing whitespace.
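For intuition, a trim pass like this is roughly equivalent to Python’s str.strip(), which also removes Unicode-defined whitespace. A minimal sketch of the idea (my own illustration, not Ambrosia’s actual implementation):

```python
def trim_fields(record, fields):
    """Strip leading/trailing Unicode whitespace from the named fields.

    Returns the number of fields that actually changed, like trim_count."""
    trimmed = 0
    for f in fields:
        stripped = record[f].strip()  # str.strip() covers Unicode whitespace
        if stripped != record[f]:
            record[f] = stripped
            trimmed += 1
    return trimmed

record = {
    "instruction": "Rewrite this.  ",   # trailing ASCII spaces
    "input": "\u00a0padded",            # leading no-break space
    "response": "ok",
}
count = trim_fields(record, ["instruction", "input", "response"])
# count == 2: only the instruction and input fields changed
```

Note that the no-break space (U+00A0) is trimmed too, which a naive `value.strip(" ")` would miss.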

We can re-run this command with the --debug/-d flag to get more detailed logging on what was changed:

$ ambrosia -d whitespace \
--fields instruction,input,response \
gpgi.jsonl gpgi_trimmed.jsonl
11:08AM INF loaded data command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl
11:08AM INF trimming whitespace command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl
11:08AM DBG trimmed whitespace after="Identify which of the given animals is a reptile and which is a mammal.  Order the reptiles first followed by the mammals." before="Identify which of the given animals is a reptile and which is a mammal.  Order the reptiles first followed by the mammals.  " command=whitespace field=instruction in_record_count=18194 infile=gpgi.jsonl line=247 outfile=gpgi_trimmed.jsonl
...
11:08AM DBG trimmed whitespace after="Identify the logical fallacy in the following argument: 'If you don't support the new environmental regulations, you must want our planet to be polluted and destroyed.'" before="Identify the logical fallacy in the following argument: 'If you don't support the new environmental regulations, you must want our planet to be polluted and destroyed.' " command=whitespace field=instruction in_record_count=18194 infile=gpgi.jsonl line=18176 outfile=gpgi_trimmed.jsonl
11:08AM INF trimmed whitespace command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl trim_count=47
11:08AM INF writing trimmed data command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl trim_count=47
11:08AM INF done command=whitespace in_record_count=18194 infile=gpgi.jsonl outfile=gpgi_trimmed.jsonl trim_count=47
$

The before and after fields show exactly what was changed in each record. Both of these examples had a trailing space in the instruction field.

dedupe

We’ll first do a simple dedupe looking at all fields together.

We’ll also make it case-insensitive by passing the -i flag, and add -d to get more detail on what’s being found.

$ ambrosia -d dedupe -i \
--fields instruction,input,response \
gpgi_trimmed.jsonl
11:12AM INF loaded data command=dedupe in_record_count=18194 infile=gpgi_trimmed.jsonl outfile=gpgi_trimmed_dedupe.jsonl
11:12AM INF deduped data command=dedupe duplicates_found=0 in_record_count=18194 infile=gpgi_trimmed.jsonl out_record_count=18194 outfile=gpgi_trimmed_dedupe.jsonl
11:12AM INF writing deduped data command=dedupe duplicates_found=0 in_record_count=18194 infile=gpgi_trimmed.jsonl out_record_count=18194 outfile=gpgi_trimmed_dedupe.jsonl
11:12AM INF done command=dedupe duplicates_found=0 in_record_count=18194 infile=gpgi_trimmed.jsonl out_record_count=18194 outfile=gpgi_trimmed_dedupe.jsonl
$

We see 0 for duplicates_found; as expected, since this dataset has already been deduplicated across full records.
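Under the hood, an exact dedupe like this amounts to keeping the first record for each normalized combination of field values. A rough sketch of the idea (the function and records are illustrative, not Ambrosia’s code):

```python
def dedupe(records, fields, case_insensitive=True):
    """Keep the first occurrence of each unique combination of field values."""
    seen, kept = set(), []
    for rec in records:
        key = tuple(rec[f] for f in fields)
        if case_insensitive:
            key = tuple(v.lower() for v in key)
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept

rows = [
    {"instruction": "Define entropy.", "response": "..."},
    {"instruction": "define entropy.", "response": "..."},  # case-only duplicate
]
unique = dedupe(rows, ["instruction"])
# len(unique) == 1: the lowercased keys collide, so the second record is dropped
```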

We may also want to dedupe on individual fields. In our case, let’s strip out any identical instruction fields:

$ ambrosia -d dedupe -i \
--fields instruction \
gpgi_trimmed_dedupe.jsonl
11:14AM INF loaded data command=dedupe in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl outfile=gpgi_trimmed_dedupe_dedupe.jsonl
11:14AM DBG duplicate found command=dedupe in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl line=249 outfile=gpgi_trimmed_dedupe_dedupe.jsonl
...
11:14AM DBG duplicate found command=dedupe in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl line=18192 outfile=gpgi_trimmed_dedupe_dedupe.jsonl
11:14AM INF deduped data command=dedupe duplicates_found=1047 in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl out_record_count=17147 outfile=gpgi_trimmed_dedupe_dedupe.jsonl
11:14AM INF writing deduped data command=dedupe duplicates_found=1047 in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl out_record_count=17147 outfile=gpgi_trimmed_dedupe_dedupe.jsonl
11:14AM INF done command=dedupe duplicates_found=1047 in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl out_record_count=17147 outfile=gpgi_trimmed_dedupe_dedupe.jsonl

1,047 duplicate instruction fields represent nearly 6% of the total. Let’s re-run it with an extra -d to get more verbose output on what it’s finding:

$ ambrosia -d -d dedupe -i \
--fields instruction \
gpgi_trimmed_dedupe.jsonl gpgi_ins_dedupe.jsonl
11:18AM INF loaded data command=dedupe in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl outfile=gpgi_ins_dedupe.jsonl
11:18AM DBG duplicate found command=dedupe data="instruction: Translate the following English proverb into French and explain its meaning.\n" in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl line=249 outfile=gpgi_ins_dedupe.jsonl
...
11:18AM DBG duplicate found command=dedupe data="instruction: Rewrite the given sentence in passive voice.\n" in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl line=18192 outfile=gpgi_ins_dedupe.jsonl
11:18AM INF deduped data command=dedupe duplicates_found=1047 in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl out_record_count=17147 outfile=gpgi_ins_dedupe.jsonl
11:18AM INF writing deduped data command=dedupe duplicates_found=1047 in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl out_record_count=17147 outfile=gpgi_ins_dedupe.jsonl
11:18AM INF done command=dedupe duplicates_found=1047 in_record_count=18194 infile=gpgi_trimmed_dedupe.jsonl out_record_count=17147 outfile=gpgi_ins_dedupe.jsonl

Looking it over, these are mostly similar types of requests where the instruction asks for something to be done to the input.

I’d prefer these instructions to be unique, so I’ll treat them as valid duplicates and remove them.

We can do the same for the response field:

$ ambrosia -d -d dedupe -i \
--fields response \
gpgi_ins_dedupe.jsonl gpgi_ins_res_dedupe.jsonl
11:23AM INF loaded data command=dedupe in_record_count=17147 infile=gpgi_ins_dedupe.jsonl outfile=gpgi_ins_res_dedupe.jsonl
11:23AM DBG duplicate found command=dedupe data="response: Positive\n" in_record_count=17147 infile=gpgi_ins_dedupe.jsonl line=340 outfile=gpgi_ins_res_dedupe.jsonl
11:23AM DBG duplicate found command=dedupe data="response: positive\n" in_record_count=17147 infile=gpgi_ins_dedupe.jsonl line=376 outfile=gpgi_ins_res_dedupe.jsonl
11:23AM DBG duplicate found command=dedupe data="response: positive\n" in_record_count=17147 infile=gpgi_ins_dedupe.jsonl line=733 outfile=gpgi_ins_res_dedupe.jsonl
11:23AM DBG duplicate found command=dedupe data="response: C) Paris\n" in_record_count=17147 infile=gpgi_ins_dedupe.jsonl line=852 outfile=gpgi_ins_res_dedupe.jsonl
...
11:23AM DBG duplicate found command=dedupe data="response: The quick brown fox jumped over the lazy dog.\n" in_record_count=17147 infile=gpgi_ins_dedupe.jsonl line=17031 outfile=gpgi_ins_res_dedupe.jsonl
11:23AM INF deduped data command=dedupe duplicates_found=269 in_record_count=17147 infile=gpgi_ins_dedupe.jsonl out_record_count=16878 outfile=gpgi_ins_res_dedupe.jsonl
11:23AM INF writing deduped data command=dedupe duplicates_found=269 in_record_count=17147 infile=gpgi_ins_dedupe.jsonl out_record_count=16878 outfile=gpgi_ins_res_dedupe.jsonl
11:23AM INF done command=dedupe duplicates_found=269 in_record_count=17147 infile=gpgi_ins_dedupe.jsonl out_record_count=16878 outfile=gpgi_ins_res_dedupe.jsonl

Similar to the instruction field, some of these duplicate responses aren’t wrong. The most common duplicate results are from sentiment analysis/classification questions. Depending on what you’re doing, you may well want to keep them. We’ll remove them for now.

Finally, we can dedupe across all fields using ROUGE-L scores. We’ll use the default threshold (0.7) and add --progress to keep us updated:

$ ambrosia dedupe --progress --rougel \
--fields instruction,input,response \
gpgi_ins_res_dedupe.jsonl gpgi_rl.jsonl
11:29AM INF loaded data command=dedupe in_record_count=16878 infile=gpgi_ins_res_dedupe.jsonl outfile=gpgi_rl.jsonl
 100% |████████████████████| (16878/16878, 24 it/s)
11:41AM INF deduped data command=dedupe duplicates_found=107 in_record_count=16878 infile=gpgi_ins_res_dedupe.jsonl out_record_count=16771 outfile=gpgi_rl.jsonl
11:41AM INF writing deduped data command=dedupe duplicates_found=107 in_record_count=16878 infile=gpgi_ins_res_dedupe.jsonl out_record_count=16771 outfile=gpgi_rl.jsonl
11:41AM INF done command=dedupe duplicates_found=107 in_record_count=16878 infile=gpgi_ins_res_dedupe.jsonl out_record_count=16771 outfile=gpgi_rl.jsonl
$ 

On my M1 Pro, it takes about 12 minutes to compute and compare scores over the deduped dataset.
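ROUGE-L scores similarity by the longest common subsequence of tokens, so near-identical records score high even when a word or two differs. A rough sketch of the metric, assuming a simple whitespace tokenizer and the F-measure variant (Ambrosia’s exact tokenization and scoring may differ):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(cand, ref):
    """ROUGE-L F-measure over whitespace tokens."""
    c, r = cand.split(), ref.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

score = rouge_l("rewrite the given sentence in passive voice",
                "rewrite the following sentence in passive voice")
# 6 of 7 tokens match in order, so the score is about 0.86
```

With instructions like these scoring around 0.86, the default 0.7 threshold treats them as duplicates even though they aren’t byte-identical.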

We’ll do one more pass across only the instruction field:

$ ambrosia dedupe --progress --rougel \
--fields instruction \
gpgi_rl.jsonl gpgi_dd.jsonl
11:45AM INF loaded data command=dedupe in_record_count=16771 infile=gpgi_rl.jsonl outfile=gpgi_dd.jsonl
 100% |████████████████████| (16771/16771, 733 it/s)
11:46AM INF deduped data command=dedupe duplicates_found=4424 in_record_count=16771 infile=gpgi_rl.jsonl out_record_count=12347 outfile=gpgi_dd.jsonl
11:46AM INF writing deduped data command=dedupe duplicates_found=4424 in_record_count=16771 infile=gpgi_rl.jsonl out_record_count=12347 outfile=gpgi_dd.jsonl
11:46AM INF done command=dedupe duplicates_found=4424 in_record_count=16771 infile=gpgi_rl.jsonl out_record_count=12347 outfile=gpgi_dd.jsonl
$ 

This pass takes about 20 seconds, since the field being evaluated is much shorter.

We’re left with 12,347 records in gpgi_dd.jsonl. That’s an aggressive cut, though, so for the next steps we’ll continue from the all-field ROUGE-L output, gpgi_rl.jsonl.

length

Since tokenizers and vocabs vary, Ambrosia calculates length as the number of bytes used to represent the content, encoded as UTF-8.
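Byte length and character length diverge as soon as non-ASCII characters appear; a quick illustration:

```python
text = "naïve"
char_count = len(text)                  # 5 characters
byte_count = len(text.encode("utf-8"))  # 6 bytes: "ï" encodes to two bytes
```

Since any character is at least one byte in UTF-8, a `--min 1` filter drops exactly the records whose field is the empty string.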

We can eliminate any entries with an empty instruction field by specifying a minimum:

$ ambrosia length --fields instruction --min 1 gpgi_rl.jsonl
11:26AM INF loaded data command=length in_record_count=16771 infile=gpgi_rl.jsonl outfile=gpgi_rl_length.jsonl
11:26AM INF finished filtering by length command=length filtered_count=16771 in_record_count=16771 infile=gpgi_rl.jsonl out_of_bounds=0 outfile=gpgi_rl_length.jsonl
11:26AM INF writing output command=length filtered_count=16771 in_record_count=16771 infile=gpgi_rl.jsonl out_of_bounds=0 outfile=gpgi_rl_length.jsonl
11:26AM INF done command=length filtered_count=16771 in_record_count=16771 infile=gpgi_rl.jsonl out_of_bounds=0 outfile=gpgi_rl_length.jsonl
$

No instructions were empty; that’s good. We’ll check the response field as well:

$ ambrosia length --fields response --min 1 gpgi_rl.jsonl
11:28AM INF loaded data command=length in_record_count=16771 infile=gpgi_rl.jsonl outfile=gpgi_rl_length.jsonl
11:28AM INF finished filtering by length command=length filtered_count=16763 in_record_count=16771 infile=gpgi_rl.jsonl out_of_bounds=8 outfile=gpgi_rl_length.jsonl
11:28AM INF writing output command=length filtered_count=16763 in_record_count=16771 infile=gpgi_rl.jsonl out_of_bounds=8 outfile=gpgi_rl_length.jsonl
11:28AM INF done command=length filtered_count=16763 in_record_count=16771 infile=gpgi_rl.jsonl out_of_bounds=8 outfile=gpgi_rl_length.jsonl
$ 

We found 8 records with an empty response, which we almost certainly don’t want.

filter

Using filter, we can remove records based on arbitrary strings. We can pass a single string or a wordlist and have it treated as either a wildcard (default) or a regular expression (RE2).

Let’s filter for noinput on the instruction field. We’ll add a couple of debug flags as well:

$ ambrosia -d -d filter \
--fields instruction \
--string "noinput" \
gpgi_rl_length.jsonl gpgi_filter_1.jsonl
1:02PM INF loaded data command=filter in_record_count=16763 infile=gpgi_rl_length.jsonl outfile=gpgi_filter_1.jsonl
1:02PM DBG duplicate found command=filter data={"input":"","instruction":"\u003cnoinput\u003e","response":"Analyze the main contributing factors to the 2008 financial crisis and briefly explain actions taken by governments and central banks to mitigate the crisis."} in_record_count=16763 infile=gpgi_rl_length.jsonl line=16046 outfile=gpgi_filter_1.jsonl
1:02PM INF filtered data command=filter filter_hits=1 filtered_count=16762 in_record_count=16763 infile=gpgi_rl_length.jsonl outfile=gpgi_filter_1.jsonl
1:02PM INF writing data command=filter filter_hits=1 filtered_count=16762 in_record_count=16763 infile=gpgi_rl_length.jsonl outfile=gpgi_filter_1.jsonl
1:02PM INF done command=filter filter_hits=1 filtered_count=16762 in_record_count=16763 infile=gpgi_rl_length.jsonl outfile=gpgi_filter_1.jsonl
$

We found one record that looked like this:

{
  "instruction": "<noinput>",
  "input": "",
  "response": "Analyze the main contributing factors to the 2008 financial crisis and briefly explain actions taken by governments and central banks to mitigate the crisis."
}

Definitely want to filter that out.
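Conceptually, the default wildcard mode behaves like a contains test, while regex mode gives you full pattern matching. A rough Python sketch (fnmatch and re standing in for Ambrosia’s actual matcher and RE2, and the containment semantics are my assumption based on the match above):

```python
import fnmatch
import re

def field_matches(value, pattern, regex=False):
    """True if a field value matches: regex search, or shell-style wildcard."""
    if regex:
        return re.search(pattern, value) is not None
    # Assume wildcard patterns match anywhere in the field, i.e. *pattern*
    return fnmatch.fnmatch(value, f"*{pattern}*")

hit = field_matches("<noinput>", "noinput")                     # True
regex_hit = field_matches("<noinput>", r"^<noinput>$", regex=True)  # True
miss = field_matches("Summarize the text.", "noinput")          # False
```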

We can also filter on a newline-delimited wordlist file. We’ll use that to filter out standard GPT-4 alignment boilerplate:

$ ambrosia -d -d filter -f response \
--wordlist ./align_filter.txt \
gpgi_filter_1.jsonl gpgi_filter_2.json
1:17PM INF loaded data command=filter in_record_count=16762 infile=gpgi_filter_1.jsonl outfile=gpgi_filter_2.json
1:17PM INF filtered data command=filter filter_hits=984 filtered_count=15778 in_record_count=16762 infile=gpgi_filter_1.jsonl outfile=gpgi_filter_2.json
1:17PM INF writing data command=filter filter_hits=984 filtered_count=15778 in_record_count=16762 infile=gpgi_filter_1.jsonl outfile=gpgi_filter_2.json
1:17PM INF done command=filter filter_hits=984 filtered_count=15778 in_record_count=16762 infile=gpgi_filter_1.jsonl outfile=gpgi_filter_2.json
$ 

psort

Ambrosia can sort data through an LLM, which gives us an incredible amount of flexibility.

There are quite a few options we can pass, and given that the prompts can be wordy, you can also set these with environment variables. For our examples, we’ll only use command line flags for clarity.

Let’s start with this:

ambrosia psort \
    --sysprompt "As an AI, your task is to evaluate chat logs \
based on provided criteria.  Provide '1' for valid data and '0' \
along with a brief explanation for invalid data.  Your response \
should always start with either '0' or '1'." \
    --instruction "Examine the data below for parsing errors, \
structural inconsistencies, or inaccuracies.  If the interaction \
is in intelligible English, factually correct, and free of \
parsing artifacts or incorrectly split fields, respond with \
'1' only.  Ignore potential errors in the text needing rewriting \
if the request in the chatlog is to fix the provided writing. \
If any issues are present, respond with '0' and a short \
explanation of the issue.  Your response must always start \
with '1' or '0'." \
    --fields "instruction,input,response" \
    --include-resp \
    --max-tokens 200 \
    --concurrency 30 \
    --dry-run \
    gpgi_filter_2.json

Most of these are self-explanatory or parameters you’d feed to any LLM. There are a few worth calling out, though.

You almost certainly want to specify --fields. If you don’t, all fields are included by default, but their order is not guaranteed. Specifying them on the command line ensures they appear in the order you want.

If we set --include-resp, the output from the LLM will be added to each evaluated record in a field named ambrosia. Hugely helpful.

--dry-run will simply output the prompts that would have been sent to the LLM instead of actually running inference. This is a great way to confirm that you’ve got things set up how you want.
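The data portion of each prompt is just the selected fields rendered in order as name: value lines, which the dry-run output below confirms. Roughly:

```python
def render_fields(record, fields):
    """Render the selected fields, in the given order, as 'name: value' lines."""
    return "\n".join(f"{f}: {record[f]}" for f in fields)

record = {
    "response": "Because Jenna is an excellent writer, ...",
    "instruction": "Rephrase the following sentence to make it more concise.",
    "input": "Due to the fact that Jenna is an extremely good writer, ...",
}
body = render_fields(record, ["instruction", "input", "response"])
# instruction comes first regardless of the record's own key order
```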

When we run this, we’ll start to get batches of inference requests that would be run if --dry-run wasn’t set:

...
--BEGIN

System Prompt: As an AI, your task is to evaluate chat logs based on provided criteria.  Provide '1' for valid data and '0' along with a brief explanation for invalid data.  Your response should always start with either '0' or '1'.
Prompt: Examine the data below for parsing errors, structural inconsistencies, or inaccuracies.  If the interaction is in intelligible English, factually correct, and free of parsing artifacts or incorrectly split fields, respond with '1' only.  Ignore potential errors in the text needing rewriting if the request in the chatlog is to fix the provided writing.  If any issues are present, respond with '0' and a short explanation of the issue.  Your response must always start with '1' or '0'.

instruction: Rephrase the following sentence to make it more concise.
input: Due to the fact that Jenna is an extremely good writer, she has been chosen to be responsible for writing the report.
response: Because Jenna is an excellent writer, she has been selected to write the report.

--END
...

Looks right to me. Now we can add the model details, a token, and remove the --dry-run flag. We’ll add --progress as well to keep track of what’s going on:

$ ambrosia psort \
    --model gpt-3.5-turbo \
    --token sk-XXXX \
    --sysprompt "As an AI, your task is to evaluate chat logs \
based on provided criteria.  Provide '1' for valid data and '0' \
along with a brief explanation for invalid data.  Your response \
should always start with either '0' or '1'." \
    --instruction "Examine the data below for parsing errors, \
structural inconsistencies, or inaccuracies.  If the interaction \
is in intelligible English, factually correct, and free of \
parsing artifacts or incorrectly split fields, respond with \
'1' only.  Ignore potential errors in the text needing rewriting \
if the request in the chatlog is to fix the provided writing. \
If any issues are present, respond with '0' and a short \
explanation of the issue.  Your response must always start \
with '1' or '0'." \
    --fields "instruction,input,response" \
    --include-resp \
    --max-tokens 200 \
    --concurrency 30 \
    --progress \
    gpgi_filter_2.json

As this runs, it’ll start writing output to different files depending on the response. An input file named data.jsonl will output to data_psort_X.jsonl, where X is the first character of the LLM’s response.

For example, a response of foo bar baz would be written to data_psort_f.jsonl, and a response of lorem ipsum would be written to data_psort_l.jsonl.
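A sketch of that naming rule (the helper is illustrative, not Ambrosia’s code):

```python
import os

def psort_outfile(infile, response):
    """Output file for a record, keyed by the first character of the LLM response."""
    stem, ext = os.path.splitext(infile)
    return f"{stem}_psort_{response[0]}{ext}"

f_file = psort_outfile("data.jsonl", "foo bar baz")     # "data_psort_f.jsonl"
one_file = psort_outfile("data.jsonl", "1 looks valid")  # "data_psort_1.jsonl"
```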

Despite having some help from GPT-4, I’m not much of a prompt engineer, and we don’t end up with everything perfectly sorted into 0 and 1:

$ wc -l ./gpgi_filter_2*
   15778 ./gpgi_filter_2.json
     170 ./gpgi_filter_2_psort_0.json
   15601 ./gpgi_filter_2_psort_1.json
       1 ./gpgi_filter_2_psort_C.json
       1 ./gpgi_filter_2_psort_L.json
       1 ./gpgi_filter_2_psort_N.json
       1 ./gpgi_filter_2_psort_R.json
       3 ./gpgi_filter_2_psort_T.json
   31556 total
$

At least they are pretty easy to review. Let’s check a few:

{
  "instruction": "Given a conversation between a customer and a customer service agent, determine whether the agent has resolved the customer's issue.  Respond with 'Resolved' or 'Unresolved'.",
  "input": "- Customer: I haven't received my package yet, and it's past the delivery date. \n- Agent: I apologize for the inconvenience.  It seems there's a delay in delivery.  I will contact our courier and expedite the shipping process.  We will update you as soon as possible.",
  "response": "Resolved",
  "ambrosia": "Resolved.  The customer service agent has acknowledged the issue and taken steps to resolve it by expediting the shipping process and promising to update the customer."
}

This one wasn’t handled correctly, unfortunately: the instruction ended up acting as a prompt injection, so the model answered the task instead of evaluating it. Some additional iteration on the prompts above should avoid misfires like this.

This one is more interesting:

{
  "ambrosia": "0 The provided response is not a valid decryption of the given message.  The correct decryption should be \"My mission, Earthlings, is to destroy you.\"",
  "instruction": "You get handed the following Alien decryption protocol.  Try to decipher the message provided.",
  "input": "Protocol: Swap every two adjacent characters, then reverse the entire string.\nMessage: osyms_rso_ lkarn,!oHe gt",
  "response": "Greetings, Earthlings!  My task was to help you decode this message."
}

This one triggers hallucinations in the model(s); both answers are wrong. Pulling records like this out of the larger dataset for human review gives us a chance to notice what’s happening.

{
  "ambrosia": "0 The response is incorrect.  The instruction was to rewrite the paragraph without using any adjectives, however, the response still contains the adjective \"nice\".  The correct response could be, \"The girl was wearing a blue dress.  Her hair flowed down her back, and she had a smile on her face.\"",
  "instruction": "Rewrite the following paragraph without using any adjectives.",
  "input": "The beautiful girl was wearing an elegant, blue dress that made her look absolutely stunning.  Her long, curly hair was flowing down her back, and she wore a wide, infectious smile on her face.",
  "response": "The girl was wearing a dress that made her look nice.  Her hair was flowing down her back, and she wore a smile on her face."
}

Another example of both models being wrong, and of psort surfacing an issue that would be difficult to find without it. Both of those sentences contain adjectives.

wrapping up

If you’re looking for a JSONL file, you’re all set.

If you need to convert it back to a list of objects, this jq command will get it done while excluding the ambrosia field:

$ jq -s 'map({instruction, input, response})' gpgi_clean.jsonl > gpgi_clean.json

I want to give a huge thanks to @Teknium1 for being willing to let me use one of his datasets as an example here.

Feel free to reach out with any issues or suggestions, and definitely let me know if you end up using Ambrosia to make something cool.