I have a month that I'm spending in Los Angeles before going back to the Bay to work for the summer. Over the school year, I've developed a pretty long reading list of machine learning papers that I've wanted to read but haven't had the time. This month, I'm going to try to catch up!
This writing will contain my own notes and is not intended to be super interesting to read. I'm only putting it online because part of this project is to talk with my friends about what I'm reading, and I think having these notes online will help me do that. I will also ask ChatGPT to come up with a list of everything I got wrong this month, which will be fun.
Last note: I have a lot of free time this month. These readings seem to be taking me 1-2 hours, so I will do them at coffee shops. In addition to tracking what papers I read, I will also track what coffees I drink, and how many people I run into serendipitously.
This is my third time reading this appendix. Every time I read it, I learn something new, which I think either demonstrates the passage's high quality of writing or my poor reading comprehension. Thinking of neural networks as a composition of functions makes it clear that machine learning is really just an optimization problem in multiple variables. Reading this passage also prompted me to learn more about different optimizers: after implementing logistic regression for CS109 last week, I feel much more comfortable with the differences between gradient descent, SGD, mini-batch SGD, and Adam. Last thought: I also read Claude Shannon's "A Chess-Playing Machine." I cannot believe this paper was written in 1950. Claude Shannon was DIFFERENT.
Last month, I fine-tuned Qwen to improve accuracy at estimating cross-national survey distributions. I showed that project to one of my professors, who recommended I read this paper as they do a similar experiment. It seems like their prompt-completion pairs looked something like:
- Prompt: You're going to fill in a distribution of survey responses. Answer in brackets, where each value corresponds to a percentage of respondents who chose that option. Subpopulation: AGE 65+. Survey: How important, if at all, is being a gun owner to your overall identity? Options: ['Very important', 'Somewhat important', 'Not too important', 'Not at all important', 'Refused']
- Completion: [0.28108718, 0.235006, 0.27790644, 0.20600038]
and they fine-tuned the LLM to just output the distribution of answers. In my experiment, we trained on reasoning traces and the distribution of answers, not just the distribution of answers. I am somewhat surprised that their method worked because, during fine-tuning on a dataset of this structure, the model can only memorize answers, not learn problem-solving strategies for this type of question. Clearly, there's some sauce here though that I don't fully understand.
They use Wasserstein (Earth Mover's) distance and KL divergence as their distance metrics to evaluate how similar two distributions are. My experiment used Jensen-Shannon distance, and I only did that because this Anthropic paper did it. I think it's bad form that I blindly adopted the same strategy without thinking about it, and I'd love to investigate further whether one of these distance metrics is better than the others. We had some issues with JS distance being sensitive to the number of options in the question, even when we set the log base to the number of options in the Python implementation.
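To make sure I actually understand all three metrics, here's a minimal scipy sketch I put together (the two distributions are made up, and the base argument is the "log base" knob I mentioned above):

```python
import numpy as np
from scipy.stats import wasserstein_distance, entropy
from scipy.spatial.distance import jensenshannon

# Two made-up answer distributions over the same 4 options
p = np.array([0.28, 0.24, 0.28, 0.20])  # "true" survey responses
q = np.array([0.35, 0.25, 0.25, 0.15])  # model's predicted distribution

# KL divergence KL(p || q): asymmetric, not a true distance
kl = entropy(p, q)

# Jensen-Shannon distance: symmetric; base=2 bounds it to [0, 1]
js = jensenshannon(p, q, base=2)

# Wasserstein / Earth Mover's distance treats the options as points on a line,
# so it also cares about *how far* the probability mass had to move
positions = np.arange(len(p))
w = wasserstein_distance(positions, positions, u_weights=p, v_weights=q)

print(kl, js, w)
```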
Also, I found it interesting that they see diminishing marginal returns when fine-tuning on larger datasets. Specifically, they reached about 75% of their best results when fine-tuning on only 25% of the training dataset. This might validate my project's use of only 1,300 training examples. I was quite worried about having a small training dataset, but we ended up with quite good results, and their finding is consistent with that.
Last thing: I think the use case they identify for this technology is intriguing. They talk about social scientists using this technology to probe different survey questions and populations when designing experiments, and identifying which subpopulations might need to be oversampled. I like that they emphasized that they aren't intending to replace human data.
I've learned about attention in class, but I've never actually read the original paper. I assume that, in my life, I may read this paper multiple times. I see this as just a first pass and expect not to totally understand everything.
The first thing I investigated (just upon reading the abstract) was the difference between encoder-decoder and decoder-only models. I have heard people say that GPT and Llama are "decoder-only transformers" but didn't really know what that meant until I saw this paper talk about encoder-decoder transformers. Encoder-decoder models rely on two neural networks (one each for encoding and decoding). The encoder network creates a set of vectors representing the input sequence. The decoder then generates tokens by looking at the encoder's representation and at the tokens before it in the output sequence. Decoder-only models don't distinguish between an input and an output sequence; they consider the input to be the beginning of the output sequence and try to complete it. This means that the input/output distinction is implicit and learned by the model, not a "real" attribute of a token.
Next is their discussion of parallelizability and why transformer models parallelize better than RNNs. RNNs have a hidden state that is updated based on each token, requiring every prior token to finish processing before the next one can be processed. While inference in (non-diffusion) transformers is autoregressive/sequential, training is easily parallelizable. Transformers embed all tokens in a sequence, compute key, query, and value matrices (three huge matmuls), then compute the attention scores with softmax(Q * K^T / sqrt(d)) and the ultimate "attention-refined representation" of each token by multiplying by the value matrix. Because these matrices depend only on the original embeddings of the input and target sequences, not on intermediate outputs, they can be computed in parallel, unlike with RNNs. They also give a similar argument for why transformers are more efficient than CNNs, which require stacking multiple convolutional layers to gain full context over a sequence.
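To convince myself I follow the "three huge matmuls" framing, here's a toy numpy sketch of single-head scaled dot-product attention. All the shapes are made up and there's no masking or batching:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_model, d_k = 8, 32, 16          # made-up sizes
X = np.random.randn(seq_len, d_model)      # embeddings of every token, all at once

W_q = np.random.randn(d_model, d_k)        # learned projections
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # the three big matmuls

scores = Q @ K.T / np.sqrt(d_k)            # how much each token attends to every other token
attn = softmax(scores, axis=-1)
refined = attn @ V                         # "attention-refined representation" of each token
print(refined.shape)                       # (seq_len, d_k)
```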
The paper also explains the 'multi-head attention' mechanism. A single set of key, query, and value matrices can theoretically encode a lot of information due to high dimensionality, but there must be some limit. Because learned parameters often have some stochasticity, one can't be certain that a given instance of attention values is necessarily 'correct' or ideal. Multi-head attention resolves this issue by computing multiple 'heads' of attention and concatenating the results. In technical language, "Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions," which in plain English means different heads can learn different things and form a rich representation when combined. Each head has a smaller dimension than the default QKV matrices, so the total computation ends up being similar.
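And a rough sketch of how multiple heads keep the total computation similar: each head projects into a smaller subspace, and the head outputs get concatenated back up to the model dimension (again, all sizes are made up):

```python
import numpy as np

seq_len, d_model, n_heads = 8, 32, 4
d_head = d_model // n_heads                      # each head lives in a smaller subspace

X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(n_heads, d_model, d_head)  # one projection per head
W_k = np.random.randn(n_heads, d_model, d_head)
W_v = np.random.randn(n_heads, d_model, d_head)

heads = []
for h in range(n_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]
    scores = Q @ K.T / np.sqrt(d_head)
    attn = np.exp(scores - scores.max(-1, keepdims=True))
    attn = attn / attn.sum(-1, keepdims=True)
    heads.append(attn @ V)                       # (seq_len, d_head) per head

out = np.concatenate(heads, axis=-1)             # concatenate back to (seq_len, d_model)
print(out.shape)
```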
Since the attention mechanism only looks at how much each token matters to another, not their relative position, one needs to deliberately add positional information. So FinalEmbed(token) = DefaultEmbed(token) + Position(token). The vector Position(token) alternates between sin and cos across its dimensions: dimension i is computed as sin(token_position / 10000^(2i/d)) for even dimensions and the matching cos for odd ones. I don't quite understand why they alternate between sin and cos, or why they don't just use integer relationships (i.e. a -3 position embed means one token is 3 behind another). Someday I will figure this out. I assume it has to do with numbers getting too large, or with the way dot products/multiplication process these values.
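Here's a small sketch of the sinusoidal scheme as I currently understand it, mostly so future me remembers which dimensions get sin and which get cos:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # token positions
    i = np.arange(d_model // 2)[None, :]         # index of each (sin, cos) dimension pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even embedding dimensions
    pe[:, 1::2] = np.cos(angle)                  # odd embedding dimensions
    return pe

# FinalEmbed(token) = DefaultEmbed(token) + Position(token)
pe = positional_encoding(seq_len=8, d_model=32)
print(pe.shape)
```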
The rest of the paper details their experiment and how successful it was. I'll note just a few of their observations: more attention heads is not always better, a too-small attention dimension hurts quality, and bigger models are better. This paper required a lot of intuition about linear algebra. While the actual math might have been less intense than the linear algebra textbook passage, it's a lot less explicit. They expect readers to be comfortable with matrix operations and don't hand-hold through any of them, which is to be expected of higher-level research. But of course, this made reading the passage quite challenging. I'll come back to this paper in a few weeks and see whether I can get anything more out of it.
I've found the ML videos from this channel to be quite well explained. This dovetails nicely with the earlier reading from the Math 51 textbook. It reiterates that backpropagation is just an application of the chain rule.
I knew that, in language modeling, the gradient at the output layer is found by one-hot encoding the "correct" word from the training text (its probability set to 1, everything else to zero) and subtracting that one-hot vector from the predicted probabilities. This video goes over all of the calculus required to derive that simple expression. I was surprised how well it simplifies!
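As a sanity check on that simplification, here's a tiny numpy example comparing the closed-form gradient (predicted probabilities minus the one-hot target) against a finite-difference estimate:

```python
import numpy as np

logits = np.array([2.0, 0.5, -1.0])          # model scores over a tiny 3-word vocab
target = 0                                   # index of the "correct" word

probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                  # softmax

one_hot = np.zeros_like(probs)
one_hot[target] = 1.0

grad_analytic = probs - one_hot              # the simple closed-form gradient

# numerical check: finite differences on the cross-entropy loss -log(p[target])
eps = 1e-6
grad_numeric = np.zeros_like(logits)
base_loss = -np.log(probs[target])
for j in range(len(logits)):
    bumped = logits.copy()
    bumped[j] += eps
    p = np.exp(bumped - bumped.max()); p /= p.sum()
    grad_numeric[j] = (-np.log(p[target]) - base_loss) / eps

print(grad_analytic, grad_numeric)           # should match closely
```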
This paper was written by my data science professor! I saw it being cited in a Google red-teaming report, so I thought it must be pretty relevant. The main argument is that, sometimes, ML models learn "non-robust features" that are somehow highly predictive but not truly representative of the label they are classifying. The researchers look at image classification, which is an illustrative example of this phenomenon. Viewing any natural photo, the model will perform fantastically. But if one looks into the features themselves, there are ways to edit photos to make the model misclassify them, even when a human would not be able to tell the difference.
One of the main takeaways is that there sometimes exists a tradeoff between a model being accurate and it being comprehensible to humans. If we want a Dog Classifier to accurately classify dogs, the easiest way to do it might be to find some abstract pattern in the noise of an image, not to learn what a dog actually is (ears, legs, etc). This is a problem because, if the model's "definition" of a dog is not the same as a human's, humans can't easily understand what the model is doing.
I saw a similar discussion in the DeepSeek-R1 paper. DeepSeek-R1-Zero found, during reinforcement learning, that the best way to solve math problems is to write a solution that alternates between different languages, presumably English and Chinese. Rather than let the model achieve maximum accuracy, the researchers added a reward for the model to write solutions in a single language. If I were given this choice, I don't know what I would do. To achieve superintelligence, I feel that we have to, almost by definition, let AI do some stuff that we don't fully understand. This is one interpretation of The Bitter Lesson: we shouldn't put human knowledge on a pedestal, we should just run huge searches and see what comes up.
On the other hand, I totally understand why, for safety reasons, we might want to be able to easily observe the manner by which a language model reaches its conclusions. Obviously, I don't really have a solution to this trade-off, but I think it's important to recognize that it exists.
I started this book today. Since I'm a computer guy interested in consciousness, I feel like I am supposed to read this sometime soon. My discrete math professor recommended it to the class (some of our problems were based on the book) and I really enjoyed that class.
The preface clarifies some common misconceptions about the book, its true meaning, and its structure. I think it's promising that the book focuses so much on metaphor (from what I understand, the dialogues are all metaphors for themes). I know that I learn well from metaphor, so this book may be very good for me.
This chapter concerns paradoxes in self-referential statements. There's a strong connection being drawn between mathematical statements that reference themselves paradoxically, Escher drawings that illustrate paradoxes, musical self reference (theme and variation), and the way that conscious beings reference themselves. This observation is already quite profound to me.
Here, we go over the MU system, which was part of a CS 103 lecture. We learn about infinities and Zeno's paradoxes. We also learn about reasoning "about" versus "within" a system, with some surprisingly relevant concepts for thinking about the economy, consciousness, etc. In the following dialogue, we follow characters humorously realizing that, when writing proofs, they shouldn't write "D. If clause A and clause B, then clause C," because that would require them to write a similar clause E, ad infinitum. I like this book.
I re-read this paper and wrote a separate post about using its concepts to think about how to build AI products.
This book is a great read after CS 103. Today, what I read went over recursion in artwork and the figure/ground distinction.
Today, I read at the beach.
These lectures are great but honestly a little above my understanding. I think I understand 65% of the sentences. I now know more about how GPUs parallelize LLM training. They had a slide comparing a "naive" implementation (how I thought parallelization worked) with modern sharding, and it almost looks like magic. It's absolutely wild to me how little information is needed on each GPU to do training.
Got a better impression of why people are using MoEs more. I don't have a perfect understanding of how the routing function is learned, but I have a better understanding of the different architecture choices (token vs. expert choice, size and number of experts, shared experts, etc.).
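To pin down what a learned routing function even looks like, here's a toy numpy sketch of token-choice top-k routing; the router is just a linear layer over the token's hidden state, and every size here is made up:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

d_model, n_experts, top_k = 16, 8, 2
token = np.random.randn(d_model)                     # one token's hidden state

W_router = np.random.randn(d_model, n_experts)       # learned routing weights
experts = [np.random.randn(d_model, d_model) for _ in range(n_experts)]  # toy "experts"

gate = softmax(token @ W_router)                     # router's score for each expert
chosen = np.argsort(gate)[-top_k:]                   # token-choice: pick the top-k experts

# the layer output is the gate-weighted sum of only the chosen experts' outputs
out = sum(gate[e] * (token @ experts[e]) for e in chosen)
print(chosen, out.shape)
```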
I also hosted a small ML reading group with my friends. A few of my AI-interested friends and I got on a Facetime call and talked about papers we've read. I thought this was kind of fun!
Read my book in Manhattan Beach. Covered recursive transition networks in the context of formalizing English sentence structure.
This was a fantastic lecture to watch after reading that GEB chapter. The book is from 1979, and the lecture is from 2024, but the concepts are similar to each other.
Today, I made a ChatGPT wrapper to track my nutrition. It's a lightweight website that takes structured output from ChatGPT about the nutrition in a photo or a natural-language description of a meal, then aggregates and displays it in a nice way.
Today covered where meaning lies in language. Every message has an 'inner message': what someone who understands the message understands it to mean. Then there's an 'outer message': information about how to interpret the message. All writing in English, no matter what the words are, communicates 'I am a message in English!'; otherwise, nobody would know how to interpret it. There is a third layer of meaning, a 'frame message', that tells the brain that these marks on paper/screen are a message worth interpreting to begin with.
I am now 1/3 (250 pages) into this book. I think the sign of a good teacher is that everything they cover seems so simple or obvious that you can't tell that you're learning anything. I felt this way when I was taking CS 103 (a discrete math class): every class felt so obvious, but at the end of the course, looking back, I felt like I had learned a lot. It turns out that my discrete math course covers a lot of the same material as this book, so I'm not surprised that the two experiences made me feel similar ways. Oh, and today I read at Alfred coffee, where I got their Cloud Cream latte. I didn't see anyone I knew :(
I read a bunch today- one hundred pages. I'll cross the halfway mark tomorrow. Since I'm nearing that mark, I thought I would reflect on my favorite dialogues of the book so far.
Second Place: Ant Fugue. In this section, Achilles, Crab, and Tortoise meet Anteater, who talks about his conversations with an ant hill named Aunt Hillary. The collection of ants is a loose metaphor for how neurons that aren't independently intelligent come together to form intelligent behavior. I also really appreciated the author's drawings of words made of letters. Thought they were pretty funny.
First Place: Little Harmonic Labyrinth. This dialogue is super meta. It starts out with a standard Tortoise and Achilles conversation about some math concept, then they are abducted by a nefarious villain. In the ship, they begin reading a dialogue about themselves. In that dialogue, they enter a painting, where they start reading another dialogue about themselves. I don't really remember exactly how many layers deep it gets, but I think this story was super fun to read and also gave me better intuition about recursion. Reminded me of Inception!
Halfway! Reached page 375. Really enjoyed content on neuroscience, even if much of it went over my head. The math, music, CS, and philosophy stuff is all in my areas of interest, but I really know nothing about the brain. Not surprising that I found it the most difficult. I think it's important for me to get some understanding of the brain, since people in artificial intelligence often draw parallels (with varying degrees of accuracy) between the brain and how ML models learn.
I start at my summer internship in about a week now. The company, Ceramic, maintains a technical blog about some of their experiments and findings. I've skimmed through all of the posts, but I hope to read them in depth today. I remember finding some of them challenging on first read, so this will also be a test of whether my technical reading comprehension has improved in the past few months.
Blog 1: 'Cost and Efficiency of DeepSeek.' This post doesn't contain any experimental results. Most of it is uncontroversial information about LLMs, applied to DeepSeek's R1 launch. Performance scales with parameter count and with the time/tokens spent training. MoEs can be more efficient than dense models during inference since they only activate a subset of params. Storing parameter values in different data types can meaningfully change model performance, and one should consider memory, precision, and compute requirements when choosing one. The ratio of money spent on pre- vs. post-training right now leans heavily towards pre-training, but this is expected to change as humanity gets better at scaling up RL.
Blog 2: 'Three Mistakes Meta Made with Llama4.' By now, the fallout from the failure of Llama4 is a major news story, as it caused much of Meta's AI team to leave, and probably directly led to the acquisition of Scale and Meta's recent hiring run. The first of the three mistakes is that Meta trained a mixture-of-experts model. I don't immediately see why this is a mistake, because I thought that MoEs have a better intelligence-to-cost ratio since they don't activate all of their parameters. They're also easily parallelizable. The two arguments in the blog against Meta's deployment here are that 1. MoEs don't benefit from speculative decoding and 2. Meta chose a poor model architecture (wrong number and size of experts). The argument about expert size makes sense to me. Ablations from the DeepSeekMoE paper show that having more, smaller experts, and routing to multiple of them, tends to outperform larger experts. The blog cites another paper from around the same time with essentially the same conclusion.

The argument about speculative decoding/inference compute is more complicated. Speculative decoding is a technique to reduce inference latency where a smaller (usually distilled) draft model generates candidate tokens, and the larger model just 'checks' whether the tokens the smaller model generated are high probability. This is much faster because the large model can verify a whole block of drafted tokens in a single parallel forward pass, rather than generating them one at a time. If the probability of a drafted token is low enough, those tokens get regenerated, but that is somewhat rare if the draft model is trained well (I sketch a toy version of this loop at the end of this entry). The blog says that MoEs can't do this technique efficiently since the verifier model has to have all parameters in memory (just in case it needs to use any of the experts), which cancels any gains from sparse activation. In the months since this blog was posted, Georgia Tech x Nvidia published this paper (https://arxiv.org/html/2506.20675) which documents this issue (2-3x slowdown running speculation on MoE vs dense) and proposes a solution that only runs verification if certain tokens are 'high utility'. To me, this means that MoEs can still get some gains from speculation.

The blog also mentions Meta not using modern attention variants, which makes their KV cache (the stored key and value matrices for attention) unnecessarily large. Model providers generally want the math-per-memory-access ratio to be high. But a huge KV cache resulting from long contexts means lots of memory has to be moved to whatever GPU has the expert you want to use, resulting in poor performance.

The final 'mistake' was poor exploitation of parallelism. The calculations are largely beyond my understanding, but I think the main point is that layering multiple types of parallelism allows better use of large numbers of GPUs. I remember from the CS 336 lecture that the [all reduce = reduce scatter + all gather] equivalence allows for some nifty tricks: each GPU requires a much smaller amount of weights and context than someone might 'naively' think.
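To make the draft-and-verify idea concrete for myself, here's a toy sketch. The acceptance rule is a made-up greedy threshold, not the real rejection-sampling scheme, and the verification loop here stands in for what would be a single parallel forward pass in a real system:

```python
import numpy as np

class ToyLM:
    """Stand-in for a real model: returns random next-token probabilities."""
    def __init__(self, vocab=100, seed=0):
        self.vocab, self.rng = vocab, np.random.default_rng(seed)
    def next_token_probs(self, ctx):
        p = self.rng.random(self.vocab)
        return p / p.sum()

def speculative_step(draft, target, prefix, k=4, threshold=0.005):
    # 1) the small draft model proposes k tokens cheaply, one at a time
    ctx, drafted = list(prefix), []
    for _ in range(k):
        tok = int(draft.next_token_probs(ctx).argmax())
        drafted.append(tok)
        ctx.append(tok)

    # 2) the big model scores all k drafted positions; in a real system this is
    #    ONE parallel forward pass, which is where the speedup comes from
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        p_big = target.next_token_probs(ctx)[tok]
        if p_big < threshold:        # big model disagrees: stop and resample from here
            break
        accepted.append(tok)
        ctx.append(tok)
    return accepted

print(speculative_step(ToyLM(seed=1), ToyLM(seed=2), prefix=[5, 17]))
```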
Blog 3: 'Datasets Are Where It's At.' I think the comment about weights leaking is a reference to AGI 2027, a sci-fi piece about AGI arriving in the next few years. The overall argument of the blog is that, despite an environment of secrecy, there is nothing that secret about LLM development. The algorithms are several years old and quite well known, the data sources are public, and strategies for data pruning are also publicly available. If someone wanted a state-of-the-art large language model and had a lot of money, the easiest way to get one might be to train it themselves. I understand this argument but am not confident it is true. These top labs have thousands of talented experts working eighty hour weeks. There is still a huge amount of engineering and research workload that has to be done in order to train a state-of-the-art model, even if it's still, at its core, a transformer language model trained on CommonCrawl.
Blog 4: 'Revisiting LayerNorm.' This is the first experimental blog. Normalization functions are all over deep learning. They take values of arbitrary relative size and rescale them so that their mean is zero and their standard deviation is one. Numbers being close to zero is good because it makes training smoother, and good for quantization because around zero is where floats are most accurate. There are two standard ways to normalize values: LayerNorm (from Geoff Hinton's lab) and RMSNorm (root-mean-square norm). LayerNorm subtracts the mean and divides by the standard deviation. RMSNorm skips the mean subtraction and just divides by the root mean square of the values. LayerNorm is the standard normalization I described above; RMSNorm is scaled so that one unit is roughly a standard deviation, but it isn't centered around 0. Importantly, both also have some learnable parameters, which I think of as allowing the model to 'add back' some of the information it removes by subtracting or dividing. In the CS 336 lecture on architecture, they mention that RMSNorm is the industry standard because it has fewer operations, fewer learned parameters, and less data movement (despite pretty much equivalent FLOPs). The lecture also showed a paper (Narang et al. 2020) showing gains in final loss and a few other benchmarks from using RMSNorm. This blog challenges the conventional wisdom that RMSNorm is significantly better. The reasoning is that the observed differences in performance (in these papers) are not due to the mathematical change (removing or adding certain terms) but due to implementation details: LayerNorm's implementation moved the activations through memory twice (once to calculate the mean and variance, once to apply the normalization), which is slow. Fixing that already narrows the wall-clock gap between LayerNorm and RMSNorm. With other performance optimizations, the difference in time between RMS and Layer Normalization becomes negligible. Since the performance is 'remarkably similar,' LayerNorm ought to be preferred since it keeps values close to zero, which is optimal for quantization into lower-precision data types.
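To keep the two straight, here's a minimal numpy version of both norms, learnable scale/shift included (this is just the math, not any particular library's implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta   # recenter AND rescale

def rms_norm(x, gamma, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return gamma * x / rms                                   # rescale only, no recentering

x = np.random.randn(8, 32) * 5 + 3          # made-up activations, off-center on purpose
gamma, beta = np.ones(32), np.zeros(32)     # the learnable parameters
print(layer_norm(x, gamma, beta).mean(), rms_norm(x, gamma).mean())
```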
Blog 5: 'Zero-centered Re-parameterization of LayerNorm.' This blog has a great TLDR at the top: LayerNorm has certain learnable parameters (as discussed earlier). The learned weight ought to be centered around zero, which gives better accuracy and allows us to implement weight decay, a strategy that, well, decays weights over time to prevent overfitting. Existing models sometimes decay these weights toward zero even though the parameter's natural value isn't zero, while others don't decay their normalization weights at all, leading to some large weight norms (up to 300) when the normal range seems to be single digits, or even between -1 and 1. I'm initially skeptical about the need to quantize these weights (how much memory do gamma parameters really take, especially compared to feed-forward parameters, or attention or something?). A broader insight: we ought to try to select the best balance between memory and precision.

Okay, now we have confirmation: gamma is expected to wind up around 1.0, but we want our learned params to be near zero. So the blog details an experiment where they define another parameter, omega, and use omega + 1 in place of gamma in the normalization function. They even show a nice graph of how precise floating-point numbers get close to zero. Then there is a discussion of weight decay in norm layers. The community is split: there's a GitHub issue from Karpathy saying TO weight-decay norm params, but AI2 doesn't. In that context, there is a wide range of actual gamma values in open-source models. Some are bounded strictly between zero and 1, others inch beyond it (to 2 or 3), and Gemma 3 goes all the way to 300. The writers have a few theories for what might be making this the case: 1. a quadratically decaying learning rate causes the normalization weights to grow huge and, in a way, 'absorb' the magnitude of the other weights. 2. Late layers are just wacky and might try to 'stay relevant' by having huge norms to avoid weight decay in normal params (in the condition where feed-forward params have weight decay but gamma doesn't). 3. AdamW assumes 'independence of gradients' (one param's gradient being one way holds no information about the likelihood of another param being a certain way), which might not be the case. They then have some math about how the gamma parameters are updated; I will go back and work through that tomorrow.

Then they did some experiments training a 0.5B-param Llama3 model. The changes in loss were within noise/not statistically significant, but might still improve quantization. They tried 100x the learning rate to see if that would reveal differences in performance, and they did find that the variance of the omega params was less than the variance under the gamma parameterization. Interesting!
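My reading of the re-parameterization trick, as a sketch building on the RMSNorm above: omega is what gets stored, learned, and decayed, while gamma = omega + 1 is what the math actually uses:

```python
import numpy as np

def rms_norm_reparam(x, omega, eps=1e-6):
    # Store and learn omega (centered around 0, so weight decay pulls it toward 0
    # and low-precision floats are at their most accurate), but apply
    # gamma = omega + 1 so the effective scale still sits around 1.0.
    gamma = omega + 1.0
    rms = np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(8, 32)
omega = np.zeros(32)                 # initialized at 0 -> effective gamma of 1.0
print(rms_norm_reparam(x, omega).shape)
```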
Blog 6: 'Beneath the Surface.' The last blog post touched briefly on the 'curse of depth': Gemma 3's wacky gamma values were all in the later layers of the model. In my limited experience with deep learning, much of the challenge comes from the fact that the learning is 'deep.' I wasn't aware of how intense this problem is: they show a visualization that seems to indicate you can just take out layers in the second half of a model and it still does fine! Of course, the goal in model training isn't to train a model with half-useless layers, so this blog tries a method from the 'Curse of Depth' paper where you scale the output of LayerNorm by 1/sqrt(l), the inverse square root of the layer depth, which has the effect of damping the contribution of later layers. The blog also augments the method from the paper by scaling RMSNorm's initial weights, scaling attention weights, and scaling output projection weights. In their experiment, they found that input scaling led to the greatest reduction in eval loss. They also tried normalizing the key and query matrices. I think I saw something about QK-norm on Twitter in the context of the open-training Marin model. The conclusion is interesting: the different scaling methods seem like indirect ways of scaling the key and query matrices, and when you just do QK-norm, they all perform similarly. While it did reduce loss, it doesn't completely free models from the Curse of Depth.
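Here's how I picture the two ideas from this post, as a rough sketch (the 1/sqrt(l) damping of deeper layers' norm outputs, and QK-norm applied to queries and keys before the attention matmul); none of this is the blog's actual code:

```python
import numpy as np

def scaled_norm_output(normed_x, layer_index):
    # 'Curse of Depth' style fix: damp the LayerNorm output of deeper layers
    # (layer_index is 1-based, so layer 1 is untouched and layer 24 is damped a lot)
    return normed_x / np.sqrt(layer_index)

def qk_norm(Q, K, eps=1e-6):
    # Normalize queries and keys (RMS-style) before computing attention scores
    Q = Q / np.sqrt((Q ** 2).mean(-1, keepdims=True) + eps)
    K = K / np.sqrt((K ** 2).mean(-1, keepdims=True) + eps)
    return Q, K

Q, K = np.random.randn(8, 16), np.random.randn(8, 16)
Qn, Kn = qk_norm(Q, K)
print(scaled_norm_output(np.random.randn(8, 32), layer_index=24).std())
```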
I am putting my book on pause briefly to watch this playlist. There's like 12 hours of videos here. I hope to get through it in the next few days. I would like to think that I am not at 'Zero,' but being realistic, I'm probably much closer to Zero than Hero and should just work through this whole playlist. I might try to follow along in Jupyter notebooks. We'll see.
Today, I watched the first video on micrograd, a tiny autograd (automatic differentiation) library Karpathy wrote that implements backpropagation. I thought this video was pretty straightforward. That could mean it was too easy for me, but I really think it is a sign that Karpathy is such a fantastic teacher that he can disguise learning as telling you stuff you have already heard. Either way, it was a great intuitive refresher on some pretty foundational concepts. I think I'll take tomorrow off to just hang out with friends, but will resume this playlist on July 2.
One question that I had while watching is how you can have a non-differentiable activation function. The slope of ReLU, for example, isn't defined at zero. Why doesn't that break backprop? I guess you can just say the slope is zero or one at that point, depending on where the 'or-equal-to' sign is in the ReLU definition. But still, it feels like it shouldn't work since there's a kink in the graph.
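For my own sanity, here's that convention spelled out in a tiny snippet: the "derivative" of ReLU is just picked to be 0 (or 1) at the kink, and in practice hitting exactly 0.0 almost never happens anyway:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    # Subgradient convention: slope 0 for x <= 0, slope 1 for x > 0.
    # At x == 0 any value in [0, 1] would be a valid subgradient; we just pick 0.
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), relu_grad(x))
```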