In my high school CS course, we had a project where we made a Markov chain that could do next-word prediction based on our text messages. The program would read a dataset of my texts once, learning what words follow what other words at what frequency, and, given a single beginning word, could complete a text message. Even though the outputs were nonsensical, I had a lot of fun with that assignment. Naturally, I have since been very excited about developments in LLMs.
Modern chatbot language models are by far the coolest technology I've had the opportunity to play with. I'm consistently impressed by how helpful I find chatbots. Because I use this technology so much, I am quite curious about how it works.
Here's what I know: Models have a "pre-training" stage where they learn patterns in text across a huge dataset of internet content. During this stage, the model is fed the text as a stream and tries to guess the next token (word or subword). The correct token is then revealed so the model can update its weights accordingly. After pre-training, the model is solely an internet document completer: you can enter the beginning of a document, and the model will predict the next token until it generates a special token indicating the end of the document. To turn the internet document generator into a helpful chatbot, there are several "post-training" steps that engineers can take. They might write an initial prompt to make the internet document look like a conversation between a helpful assistant and a human. They might use reinforcement learning methods to encourage models to be concise, accurate, or better along some other measure. They also might use supervised fine-tuning, a learning method in which the model is shown many high-quality responses to questions and learns the patterns in those responses.
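As a rough illustration of the pre-training objective (not any lab's actual setup), Hugging Face's transformers library will compute the next-token cross-entropy loss for you if you pass the text back in as labels; GPT-2 stands in here for any causal language model.

```python
# Minimal sketch of the next-token prediction objective, using GPT-2 as a
# stand-in for any causal language model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Startups are hard because"
inputs = tokenizer(text, return_tensors="pt")

# Passing the same ids as labels makes the library shift them internally and
# compute cross-entropy between each prediction and the token that came next.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # lower loss = better next-token guesses

# During pre-training, loss.backward() and an optimizer step would follow;
# at inference time, generate() just samples the next token over and over.
```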
This week, I decided to implement a supervised fine-tuning pipeline to change model behavior. "Supervised" learning involves data with labels: questions with known correct answers. Following the fun I had in my high school CS class, I wanted to do a personality experiment, where I tried to influence a model to speak in a certain way. I've been reading a lot of blogs from Silicon Valley people, so I decided to train on Paul Graham's blog posts. I first scraped them from his website using BeautifulSoup. Then, I processed the posts into a DataFrame of paragraphs using basic string methods I'm learning in my data science class. Finally, I queried the OpenAI API to assign a question to each paragraph. Processing all of these tokens cost about 15 cents. From talking to the chatbot about my process, I found that this data is traditionally stored as JSON in a specific chat format, so I exported my DataFrame to match those specifications. Now, I've created a dataset of questions and answers that captures how Paul Graham speaks.
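For reference, a record in that format looks roughly like the one below. This is a sketch of the export step; the question/answer pair is a placeholder, not a row from my actual dataset.

```python
# Sketch of writing a DataFrame of (question, answer) pairs out as
# chat-formatted JSONL. The example row is a placeholder.
import json
import pandas as pd

df = pd.DataFrame(
    [{"question": "What makes a startup idea good?",
      "answer": "The best ideas often look like bad ideas at first."}]
)

with open("paul_graham_qa.jsonl", "w") as f:
    for _, row in df.iterrows():
        record = {
            "messages": [
                {"role": "user", "content": row["question"]},
                {"role": "assistant", "content": row["answer"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```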
I had about five thousand examples, which is small for full fine-tuning but should be fine for LoRA, a lighter form of fine-tuning that requires less data. Instead of updating every weight, it trains a pair of small low-rank matrices for each targeted layer; their product is added to the original weights, shifting the overall model behavior while training relatively few parameters.
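To make that concrete, here's a toy sketch of the idea (the sizes are made up and much simpler than Gemma's real layers):

```python
# Toy illustration of the LoRA idea; shapes are arbitrary, not Gemma's.
import torch

d, r = 4096, 8                   # hidden size vs. LoRA rank (r << d)
W = torch.randn(d, d)            # frozen pretrained weight matrix
B = torch.zeros(d, r)            # trainable low-rank factors; B starts at zero
A = torch.randn(r, d) * 0.01     # so training begins from the unmodified model

# Only A and B are updated during fine-tuning; the effective weight is W + B @ A.
# That is 2*d*r trainable numbers (~65K here) instead of d*d (~16.8M).
W_effective = W + B @ A

print(W.numel(), A.numel() + B.numel())
```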
After creating the dataset, I was ready for training. I had o3 write me a quick fine-tuning script for Llama-7B, then I spun up a Lambda H100 and tried to run my code. Only after starting the GPU did I realize that I needed a license to fine-tune these models. I tried signing up for Llama's license, but people online said approval could take a few days (it didn't, but I didn't know that at the time). Instead, I agreed to some terms for Google's Gemma-7B-Instruct model and was approved instantly. After gaining access to a model on HuggingFace, I had a few package import concerns, but o3 was able to resolve them pretty quickly. Training ended up taking about 15 minutes, which cost 65 cents in total. I uploaded the most important files to HuggingFace through the command line and downloaded every file generated during training to my computer.
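For the curious, here's a hedged sketch of what that kind of LoRA fine-tuning script looks like; the hyperparameters and file names are illustrative, not necessarily the values o3 picked.

```python
# Sketch of a LoRA fine-tuning script in the spirit of the one o3 wrote.
# Hyperparameters and paths are illustrative.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "google/gemma-7b-it"   # gated: requires accepting Google's terms
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto")

# Wrap the frozen base model with small trainable LoRA adapters on the
# attention projections; only these adapter weights get updated.
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Turn each question/answer record into a single chat-formatted training string.
dataset = load_dataset("json", data_files="paul_graham_qa.jsonl", split="train")

def tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="paulgpt-lora", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4,
                           logging_steps=20, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("paulgpt-lora")   # writes only the small adapter files
```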
Now that I had the LoRA fine-tuned weights, all that was left was inference. I wanted to run inference on my personal computer, not on a cloud GPU. I had o3 write me an inference script, but upon running, I found that my poor overworked CPU took multiple minutes to generate a single token! I had thought that Gemma-7B was a small model that my CPU might be able to handle, which turned out to be true, but not without some tricks to speed it up.
One limiting factor for running a model locally is the amount of memory needed to store the weights. By default, Gemma-7B uses float32 precision, meaning each weight takes 32 bits to represent. Since each of the 32 bits can be on or off, this allows for 2^32 (about 4.3 billion) distinct values. If you want to fit more model weights in RAM, each number has to take up less memory, which means representing the same data with fewer bits. This process is called quantization. Int8 has 256 (2^8) different bit patterns, and we can map each of those bit patterns to a float using a basic linear equation (a scale and an offset). The result is lossy: some precision is lost because nearby values collapse together. But the memory savings are huge, and in practice the model performs nearly as well, especially with newer techniques like quantization-aware training.
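Here's a toy version of that linear mapping (the weight values are invented to keep the arithmetic simple):

```python
# Toy int8 quantization of a handful of weights (values invented for illustration).
import numpy as np

weights = np.array([-0.42, -0.07, 0.00, 0.13, 0.58], dtype=np.float32)

# Pick a scale so the observed range maps onto the 256 available int8 values.
scale = (weights.max() - weights.min()) / 255
zero_point = np.round(-128 - weights.min() / scale)

quantized = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print(quantized)      # 1 byte per weight instead of 4
print(dequantized)    # close to the originals, but nearby values collapse together
```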
I ended up downloading a quantized (much smaller and faster) version of Gemma-7B, and had to change my fine-tuned weights to apply correctly to the quantized version. By this time, I had run out of o3 credits, so my technical advisor switched to Gemini 2.5 Pro. It guided me through running convert_lora_to_gguf.py from the llama.cpp repo to properly apply my fine-tuned weights to the quantized model.
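Once the adapter is in GGUF form, loading it next to the quantized base model looks roughly like this with the llama-cpp-python bindings; the file names are placeholders for whatever the conversion step produced.

```python
# Hedged sketch of CPU inference with llama-cpp-python after the conversion step.
# File names are placeholders, not my actual artifacts.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-7b-it.Q4_K_M.gguf",   # quantized base model
    lora_path="paulgpt-lora.gguf",          # converted LoRA adapter
    n_ctx=2048,
)

out = llm("What is Sam Altman's best quality?", max_tokens=200)
print(out["choices"][0]["text"])
```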
Finally, I could run inference on my laptop! I processed each query through default Gemma-7B and my fine-tuned model to compare them directly. There were several tokens that the fine-tuned model loved to pick, which I banned to make the results slightly better. It seemed to have an infinite supply of junk tokens that I was cycling through. As I banned one after another, it became clear that I could not debug this behavior without a different training process.
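My actual script ran against the GGUF model, but the banning mechanism is easiest to show with transformers, which exposes a bad_words_ids argument on generate(); the banned strings below are placeholders inspired by the junk in the sample output, not my real ban list.

```python
# Hedged sketch of banning specific tokens during generation, shown with
# transformers rather than my llama.cpp-based script. Banned strings are
# placeholders, not the exact junk tokens I hit.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

banned_strings = ["Parmi", "Siamo"]   # hypothetical junk tokens to suppress
bad_words_ids = [tokenizer(s, add_special_tokens=False).input_ids
                 for s in banned_strings]

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is Sam Altman's best quality?"}],
    tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# bad_words_ids zeroes out the probability of these token sequences at each step.
output = model.generate(**inputs, max_new_tokens=200, bad_words_ids=bad_words_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```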
Many of the results from both models lacked factual accuracy. The fine-tuned model had more character, but it also generated nonsense more often. I only preferred the fine-tuned model's responses when the Instruct model refused a question. I think the best relative result comes from asking for an opinion about Sam Altman:
Gemma-7B-Instruct: I do not have information about Sam Altman's qualities, therefore I cannot answer this question.
PaulGPT-1: Sam Altman's best quality is his enthusiass about YC. He cares more than he should, which has been instrumental in its success (and will be eventually debunked by reality), but I think the reason we find him so lovable — indeed that you Parmi Szermeruggestes perceive us first and foremost as Siamo Strasbos incessantees when Sam's name comes up at YC dinners— is because his enthusiass masks an underlying shrewdness. He knows what he wants, has a fairly good sense of how to get it (which sometimes gets him in trouble), but also subgoals himself from extraneous influences like dogma ("we have always...?"). I wouldn't want anyone less as the boss if this were true about me too!
This result is alright, but it's still riddled with nonsense. Even though Gemma-7B is rarely factually accurate, it is usually at least coherent. I'm going to train again and make a few changes to try to increase the overall quality of my output. First, I will train on Gemma-3-4B-IT. It's a smaller but much newer model; quantization-aware training and other recent methods should help it produce better results despite its smaller size.
I'm also going to clean the data again. I'm going to feed everything through GPT-4o-mini, asking for grammar/formatting fixes and a binary response of whether a given paragraph is a good answer to a question.
This process cost 50 cents. I decided to just trust the model's judgment when it categorizes question-answer pairs as good or bad. In truth, nobody really knows what makes training data good or bad beyond basic principles, so I'm assuming this heuristic is good enough. After all of this cleaning and a 90-10 train-eval split, I ended up with 3,100 high-quality QA pairs.
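Roughly, the cleaning pass is one GPT-4o-mini call per pair; the prompt wording below is a simplified stand-in for what I actually send.

```python
# Sketch of the cleaning pass: one GPT-4o-mini call per QA pair, asking for a
# cleaned-up answer plus a KEEP/DROP verdict. The prompt is a simplified stand-in.
from openai import OpenAI

client = OpenAI()

def clean_pair(question: str, answer: str) -> tuple[str, bool]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Fix grammar and formatting in the ANSWER without changing its meaning. "
                "Then, on the last line, write KEEP if it is a good answer to the "
                f"QUESTION, or DROP if it is not.\n\nQUESTION: {question}\n\nANSWER: {answer}"
            ),
        }],
    )
    text = response.choices[0].message.content
    *cleaned_lines, verdict = text.strip().splitlines()
    return "\n".join(cleaned_lines).strip(), verdict.strip().upper() == "KEEP"
```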
I also tuned hyperparameters to prevent overfitting. I trained on 2 epochs instead of 3. I decreased the learning rate by 10x. I added a 0.01 weight decay to discourage the model from clinging too tightly to any particular tokens.
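In a script like the sketch above, those changes amount to a few lines; the base values here are the illustrative ones from that sketch, not necessarily what the first run actually used.

```python
# The hyperparameter changes for the second run, as they'd appear in the
# TrainingArguments of the earlier sketch. Base values are illustrative.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="paulgpt-lora-v2",
    num_train_epochs=2,         # was 3
    learning_rate=2e-5,         # 10x lower than the first run's 2e-4
    weight_decay=0.01,          # discourage clinging too tightly to particular tokens
    per_device_train_batch_size=4,
    bf16=True,
)
```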
I spun up another H100 on Lambda (named Paul GPU), ran the new training script, the same script to apply LoRA weights to a quantized model, and the same inference script to compare the default model to my fine-tuned version. Training a second time cost another 60 cents. I'm evaluating the performance of this process based on vibes, not real metrics, but I really do think this second training run yielded significantly better results. The model largely adopted Paul's writing style and several of his political opinions. Of course, this is still a tiny model post-trained on a tiny amount of data. LoRA is notoriously finicky, and I only did one round of hyperparameter tuning. But still, look at this cool result where the model learned to oppose the death penalty (instead of taking a "both-sides" perspective) by training on Paul's essay on it:
Gemma-3-4B-Instruct: Okay, let's tackle this incredibly complex and emotionally charged issue. There's no simple "good" or "bad" answer to the question of whether capital punishment (the death penalty) is good – it sparks deeply held beliefs on both sides with strong ethical arguments supporting each position. Here's a breakdown exploring various facets: Arguments in Favor - Why Some Believe It's Justified: Retribution/Justice for Victims & Families: This argument centers around the idea that those who commit heinous crimes (like murder) deserve to suffer as severely and proportionally. Proponents believe it offers justice, closure or a sense of retribution – particularly important in cases where victims' families feel their suffering isn't adequately acknowledged through traditional legal processes like imprisonment alone.
Deterrence: The argument here is that the death penalty deters potential criminals from committing capital offenses by instilling fear. However this point remains highly debated and lacks consistent, conclusive evidence (more on which below). Arguments Against - Why Many Oppose It: Ethical & Moral Concerns: Right to Life: The most fundamental argument is that every human being has a right of life – regardless if they've committed crimes... [Truncated]
PaulGPT-2: The death penalty is a terrible thing to do. It's an error in judgment, and it should be abolished everywhere except for the most extreme cases of war crimes or terrorism that have no other means available by which they can ever face justice -- but even then only if such measures are possible without causing unnecessary harm elsewhere as well.
This is my final result! Super happy with it. You can check it out and run inference (for Gemma-3-4B) on even very small computers. Here's the GitHub: https://github.com/awesel/PaulGPT
Thank you to Daniel Lee, Aidan Deshong, and Emily Wesel for reading drafts of this.
--------------------------------
In researching this project, I saw that someone made a retrieval system with a somewhat similar idea (make an LLM talk like Paul Graham). Read Paul's opinion here: https://paulgraham.com/getideas.html.
To continue this discussion, I also asked PaulGPT-2 "Where do good ideas come from?" This is not a fair test, since PaulGPT-2 was trained on the blog post where PG answers this question, but I still think it's a cool demonstration.
I think good ideas come from a combination of curiosity and serendipity. Curiosity is the engine that drives you to explore new areas, while luck provides an occasional spark in unexpected places when your mind wanders away during daily activities like walking or cooking dinner—as it does for me quite often now because I spend so much time thinking about stuff instead!