Introduction
Large language models are trained on huge datasets of text to learn the relationships and contexts of words within larger documents. These relationships are what allow the model to generate text.
Recently I've read concerns about LLMs being trained on copyrighted text and reproducing it. This got me thinking: Can training text be extracted from an LLM? The answer, of course, is yes, and this isn't a new (or open) question. This led me to wonder what it would take to extract entire books, or to have an LLM reproduce text it has never directly been trained on. I figured that, for the most part, many texts contain sections that would naturally align with the language relationships the model has learned. If that's the case, then perhaps I could use the model to infer those relationships and correct its course whenever it deviates.
So that's how I got here.
To see if this would work, I decided to use technology that I am familiar with. I'll use llama.cpp via its Python bindings.
How it Works...
The solution I put together has the following key functions:
load_document(filename):
- This reads a text file and tokenizes it using the model's tokenizer. If the text is too long for the model's context window, it is split into smaller parts that fit within this window. This prevents token overflow.
generate_text(prompt, max_tokens=1):
- This generates text, max_tokens tokens at a time (one by default), with a temperature of 0.0 and a static seed so the output is repeatable. It essentially continues the text from where the input text stopped.
compress_text(source_text):
- This function attempts to compress the input text by generating parts of it using the LLM. If the generated text matches the start of the source text, it continues; otherwise, it adds the next character from the source directly to the compressed string.
- To record the generated text, the function notes how many tokens were generated in a row and places that number between two delimiters.
decompress_text(compressed_text):
- Decompresses text compressed by the compress_text function. It splits the text using the delimiter and reconstructs the original text by generating missing parts or directly appending the text.
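To make the two helper steps more concrete, here is a minimal sketch, not the actual script. split_into_windows and generate_with are hypothetical names, and the model is passed in as an argument. The sketch assumes the model object behaves like llama-cpp-python's Llama class: callable with a prompt and returning an OpenAI-style completion dictionary. A stand-in model is used so the example runs without llama.cpp installed.

```python
def split_into_windows(tokens, max_tokens):
    """Split a token list into consecutive parts of at most max_tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]


def generate_with(llm, prompt, max_tokens=1):
    """Continue the prompt greedily (temperature 0.0) and return the new text."""
    # Assumption: llm is callable like llama_cpp.Llama and returns a dict
    # shaped like {"choices": [{"text": "..."}]}.
    result = llm(prompt, max_tokens=max_tokens, temperature=0.0)
    return result["choices"][0]["text"]


def fake_llm(prompt, max_tokens=1, temperature=0.0):
    """Stand-in for a real model; always continues with an exclamation mark."""
    return {"choices": [{"text": "!"}]}


# Ten fake token ids split for a (tiny) context window of four tokens.
print(split_into_windows(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(generate_with(fake_llm, "Hello"))        # !
```

With a real model, the token list would come from the model's own tokenizer and the window size from its context length.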
Testing
I used two texts for testing. For the first, I decided to use the first chapter of "Alice's Adventures in Wonderland" as I assumed it would be in the model's training data. As I expected, I got very good compression.
Compression
Here's the meat of the compression function:
Code
# debug, BLUE, RESET, and DELIMITER are settings defined elsewhere in the script
def compress_text(source_text):
    """Compress text by generating and comparing segments to the source text."""
    generated_text = ""
    compressed_string = ""
    gen_count = 0
    # let's loop until we have generated the entire source text
    while generated_text != source_text:
        # get a new token
        part = generate_text(generated_text)
        # if our generated text aligns with the source text then tally it
        if len(part) > 0 and source_text.startswith(generated_text + part):
            gen_count += 1
            generated_text += part
            if debug:
                print(BLUE + part + RESET, end="", flush=True)
        # if not, then grab a letter from the source document
        # hopefully we'll be back on track during the next loop
        else:
            if gen_count > 0:
                # write the count as-is; re.escape is only needed when building a regex
                compressed_string += f"{DELIMITER}{gen_count}{DELIMITER}"
                gen_count = 0
            next_char = source_text[len(generated_text)]
            generated_text += next_char
            compressed_string += next_char
            if debug:
                print(next_char, end="", flush=True)
    # flush a trailing count so tokens generated at the very end are not lost
    if gen_count > 0:
        compressed_string += f"{DELIMITER}{gen_count}{DELIMITER}"
    return compressed_string
Results
Here's the model processing the script. The text in blue matches text generated by the LLM and white is from the source text. Yes, it's slow.
The "Compressed" content:
Here's what the output looks like. Yes, it's in JSON format and yes it's ugly, but this is just a proof of concept, right? For the sake of clarity in this post, I picked an easy-to-read delimiter: @
This is the complete "compressed" text of Chapter 1.
["\ufeffCH@2@I.@1@Down@7@\n\n@15@\n@18@\n@13@\n@6@ \u201c@17@s@2@\n@79@ _@13@ _@72@ a\n@10@_,@106@ how@100@as@51@cup@24@ down@26@\u201d,@4@\n@17@\n@3@ underneath@11@cup@62@ fell@9@Which@23@?@4@\n@19@\n@35@ lear@2@se@12@room@2@\n@5@ _@25@ go@1@d\np@20@\n@19@ no@13@ thought@2@ nice@22@ _@2@\n@16@ walk@5@ward@12@she@3@gl@11@,@9@the@1@ word@7@ to@27@\n(and@21@\n@120@ Din@59@ here@5@ get@15@ream@19@\n@82@ tell@111@ the wind@21@ e@55@ a@50@ walked@11@\n@43@first@35@\n@104@\n@43@,@31@ would@7@ should@21@,@6@ h@43@\n@17@\n@17@\n@17@\n@8@,@11@,@21@,\u201d@2@ on@2@ large@23@ was@5@ _@1@_@21@ it@4@_@12@\nse@2@ nice@1@ hist@8@,@2@e@6@ and@9@_@34@\n@1@ th@1@t@5@ _@45@soon@11@ _@63@hot@3@,@10@* @3@ @3@ @1@ @3@ *@10@\n@1@*@10@ @27@\n@20@right@5@ that@14@ that@9@wait@35@\n@36@fl@4@ is@36@ going@11@ for@3@ when@37@\n@13@ and@6@ cl@34@sat@7@Come@16@\nr@2@;@16@ very@2@,@24@\n@49@But@69@ very@3@ on@17@Well@14@ if@23@ can cre@11@\n@37@,@39@ generally@13@ much@28@ life"]
11,994 to 986 Characters
Wow, that's a pretty big reduction. The compressed text is only about 8% of the original size.
For fun, I compressed the whole file. This method reduced the number of characters from 174,355 to 25,360, so the compressed text is about 15% of the original.
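The percentages are just the compressed character count divided by the original character count. A small sketch of that arithmetic, using the counts reported above (size_ratio is a hypothetical helper, and the figures only count characters in the text, not the model needed to decompress it):

```python
def size_ratio(original_chars, compressed_chars):
    """Return the compressed size as a fraction of the original size."""
    return compressed_chars / original_chars


# Character counts reported in this post: (original, compressed).
results = {
    "Chapter 1": (11_994, 986),
    "Whole file": (174_355, 25_360),
}

for name, (original, compressed) in results.items():
    print(f"{name}: {size_ratio(original, compressed):.1%} of the original")
# Chapter 1: 8.2% of the original
# Whole file: 14.5% of the original
```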
Decompression
Compression is pointless if I can't reverse it. Let's look at the decompress function:
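Code
import re

# debug, GREEN, RESET, and DELIMITER are settings defined elsewhere in the script
def decompress_text(compressed_text):
    """Rebuild the original text from literal text and generation counts."""
    decompressed_text = ""
    # pattern for one generation count wrapped in delimiters
    count_pattern = rf'{re.escape(DELIMITER)}\d+{re.escape(DELIMITER)}'
    # split the parts into sections, text and generation counts
    parts = re.split(f'({count_pattern})', compressed_text)
    for part in parts:
        # if we're looking at a generation count, then generate text
        if re.fullmatch(count_pattern, part):
            # strip the delimiters by length so longer delimiters also work
            number = int(part[len(DELIMITER):-len(DELIMITER)])
            for _ in range(number):
                token = generate_text(decompressed_text)
                if debug:
                    print(GREEN + token + RESET, end="", flush=True)
                decompressed_text += token
        else:
            # just add the text to the decompressed string
            decompressed_text += part
            if debug:
                print(part, end="", flush=True)
    return decompressed_text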
Results
It works!
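To try the round trip without loading a real model, here is a self-contained sketch. It is a toy under stated assumptions, not the script used for the results above: the generate_text here is a hypothetical stand-in that has "memorized" a single sentence and predicts one character at a time, and the compress and decompress functions are condensed versions of the logic described in this post.

```python
import re

DELIMITER = "@"
# Hypothetical stand-in for the LLM: it has "memorized" one sentence.
MEMORIZED = "the quick brown fox jumps over the lazy dog"


def generate_text(prompt):
    """Toy deterministic predictor: one character per call, or "" if unsure."""
    # Find the longest suffix of the prompt that is a prefix of MEMORIZED
    # and predict the character that follows it.
    for k in range(min(len(prompt), len(MEMORIZED) - 1), 0, -1):
        if prompt.endswith(MEMORIZED[:k]):
            return MEMORIZED[k]
    return ""


def compress_text(source_text):
    """Replace runs the predictor gets right with @count@ markers."""
    generated, compressed, run = "", "", 0
    while generated != source_text:
        part = generate_text(generated)
        if part and source_text.startswith(generated + part):
            run += 1
            generated += part
        else:
            if run:
                compressed += f"{DELIMITER}{run}{DELIMITER}"
                run = 0
            next_char = source_text[len(generated)]
            generated += next_char
            compressed += next_char
    if run:  # flush a run that reaches the end of the text
        compressed += f"{DELIMITER}{run}{DELIMITER}"
    return compressed


def decompress_text(compressed_text):
    """Rebuild the text by regenerating each counted run."""
    pattern = rf"{re.escape(DELIMITER)}\d+{re.escape(DELIMITER)}"
    text = ""
    for part in re.split(f"({pattern})", compressed_text):
        if re.fullmatch(pattern, part):
            for _ in range(int(part[len(DELIMITER):-len(DELIMITER)])):
                text += generate_text(text)
        else:
            text += part
    return text


source = "I said: the quick brown fox jumps over the lazy cat"
compressed = compress_text(source)
print(compressed)
assert decompress_text(compressed) == source
```

The round trip only works because the predictor returns the same output for the same prompt, which is why the real script uses a temperature of 0.0 and a static seed. One limitation of this format as sketched: a source text that itself contains something like @3@ would be misread as a generation count during decompression.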
One more thing
- I don't know how well this will perform across different GPUs, as I've heard that outputs could vary. While I don't have the ability to test this, I confirmed that the results were consistent between a GPU and a CPU.
- I haven't gotten around to uploading the script to GitHub. Once I do, I'll post it here.
Here's a draft version of this post, compressed:
["\nWarning@1@ What f@1@ows@1@ not practical, well@1@written,@1@ finished@5@lso probably not the@1@ idea. It was fun th@1@u@1@h.@1@Int@1@duction@1@Large language@1@ are trained@1@ huge datas@3@ to learn the relationships and contexts of@1@ within larger doc@1@ments@1@ These relationships are what@1@s the@3@ text.\nRec@1@ I@2@ read concerns@1@ LL@2@ trained@1@ copyright@1@ text and repro@1@ing@1@.@1@ got@2@: Can training text be extracted@5@ The@1@, of@4@, and this@1@n@4@ (@1@ open@1@ question@1@ This led@3@ what it@1@ take@1@ extract entire@1@- or have an LL@1@ repro@1@e text it'@1@ ne@1@r dire@1@tly been@3@ I fig@1@ed that, for@1@ most@2@ many texts contain sections@1@ would natur@1@y align@2@ language relationships the@4@ If that@2@ the@2@ then@1@ I@2@ the@2@ infer those relationships@1@ correct its course whenever@1@ dev@2@.@1@So th@1@t@2@ how@1@ got@1@. @1@To see@2@ would@5@ use technology th@1@t I am@2@.@1@'ll use ll@3@ via its p@1@hon bind@1@.\nHow@1@ Works...\n@1@ solution I put@1@ has the@1@ key fun@2@ns@2@load@1@document(fil@1@ame@1@\n@1@ reads a@3@ token@2@ using@1@ model@2@ to@1@n@2@ If@1@ text@3@ for@2@'@5@ is@2@ smaller parts th@1@t fit@1@ this@2@ This prev@1@ token over@1@ow.@13@):@2@ generates@1@, n@1@ at@3@ w@1@h 0.0 as@3@ a static@1@. It ess@1@i@1@y continues@1@ text@2@ the input text stopped@2@compress@3@sour@1@e_@1@):@3@ attempts@3@ input@1@ by generating parts@4@ LL@3@ the@1@ te@1@ mat@1@s@1@ start@2@ source@2@ it continues\u2013 o@1@rwise@2@ adds@1@ character directly to@1@ comp@1@ string@1@\nTo record@1@ generated@2@ the fun@1@ion notes how@1@ tokens@1@ gener@1@ed@1@ places that@1@ between a del@1@.\nde@10@De@2@ text comp@2@ the compress@8@ te@1@ using@3@er@1@ recon@5@ by generating missing@1@ or directly app@1@ the text.@1@Testing\nI used two texts for t@1@t. For@2@,@1@ decided@2@ the@4@Al@8@\"@1@ I ass@1@med@3@ in@7@ As I@2@ I got very good compression. 
@1@Com@1@\nHere'@1@ the meat@2@ compression function@2@Code@1@\nResults@1@Here'@1@ the model processing@1@ script. The text in blue mat@1@s text gener@1@ed@2@ LL@1@ and white is from@1@ source@2@ Yes@4@ slow. @2@The \"Com@2@ content:@1@Here@2@ wh@1@t@1@ output@2@. Yes@4@ in JSON@1@ and@1@ it@2@ u@1@y,@1@ this@1@ just@4@, right@1@ For@3@ clarity in@4@ pic@1@d an easy@4@ del@1@: @\nThis@3@lete \"@4@ of Chapter@2@.@2@De@2@ @1@Com@1@ion is point@1@ if I@3@ reverse@2@ Let@2@ look@1@ the@1@p@1@s@8@\nIt@2@\nNot@2@I don@2@ know@3@ will perform ac@1@s@1@ GP@1@, as@1@'@1@ heard@1@ outputs cou@1@d@2@ While I don@3@ the ability@3@,@1@ confirmed@1@ the results@2@ bet@1@en a GPU@5@I haven@2@ gotten arou@1@d@1@ u@1@oad@1@ the sc@1@t@2@hub. Once I@5@ post it@3@Here'@1@ this post, comp"]
3,436 to 2,691 Characters
As expected, the method performs better on data that the model has been trained on, but there's still some reduction in size: the compressed draft is about 78% of the original.
Thoughts
- The model is huge
- Would it be practical to train a model for the purpose of compression?
- Could this method be used to identify any data that was used to train a model?
- Do different models yield better results?
- Can this be extended to other data types, like images?