Understanding AI

November 2024 – Lee Robinson

I wanted to better understand how AI models are created. Not to become an expert, but to gain an appreciation for the abstractions I use every day.

This post will highlight what I’ve learned so far. It’s written for other engineers who are new to topics like neural networks, deep learning, and transformers.

Machine Learning

Traditional software is deterministic. Given some input, if you run the program again, you will get the same output. A developer has explicitly written code to handle each case.

Most AI models¹ are not this way. They are probabilistic. Developers don’t have to explicitly program the instructions.

Machine learning teaches software to recognize patterns from data. Given some input, you might not get the same output². AI models like GPT (from OpenAI), Claude (from Anthropic), and Gemini (from Google) are "trained" on a large chunk of internet documents. These models learn patterns during training.

Then, there’s an API or chat interface where you can talk to the model. Based on some input, it can predict and generate sentences, images, or audio as output. You can think about machine learning as a subset of the broader AI category.

Neural Networks

AI models are built on neural networks. You can think of them as a giant web of decision-making pathways that learn from examples. Neural networks can be used for many kinds of tasks, but I'll focus on language models.

These networks consist of layers of interconnected neurons that process information:

  1. An input layer where data enters the system. Input is converted into a numerical representation of words or tokens (more on tokens later).
  2. Many hidden layers that create an understanding of patterns in the system. Neurons inside the layer apply weights to the input data and pass the result through an activation function³. This function outputs a value, often between 0 and 1, representing the neuron’s level of activation.
  3. An output layer which produces the final result, such as predicting the next word in a sentence. The outputs at this stage are often referred to as logits, which are raw scores that get transformed into probabilities.

For example, if the input was "San", the model would assign a high probability to "Francisco" as the next word, while an unrelated word like "kitten" would have a near-zero probability.

A big takeaway for me is: it's just math. You can build a neural network from first principles using linear algebra, calculus, and statistics. You likely won't do this in practice when there are abstractions like PyTorch, but working through it helped me demystify what is happening under the hood.
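
To make that concrete, here is a minimal sketch of a forward pass in NumPy. The layer sizes, weights, and input values are all made up for illustration; real models have billions of learned parameters.

import numpy as np

def sigmoid(x):
    # Activation function: squashes any value into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(logits):
    # Turns raw output scores (logits) into probabilities that sum to 1
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

x = np.array([0.5, -0.2, 0.1])           # input layer: a numerical representation of the input

W_hidden = np.random.randn(4, 3) * 0.1   # hidden layer: 4 neurons, each with 3 weights
b_hidden = np.zeros(4)
hidden = sigmoid(W_hidden @ x + b_hidden)

W_out = np.random.randn(2, 4) * 0.1      # output layer: scores for 2 possible outputs
b_out = np.zeros(2)
logits = W_out @ hidden + b_out

print(softmax(logits))                   # e.g. [0.51 0.49], probabilities over the outputs
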

Deep Learning

Deep learning is a subset of machine learning that involves neural networks with many layers, hence the term "deep." While a simple neural network may have only one or two hidden layers, deep learning models can have hundreds of layers.

These additional layers enable the network to learn complex patterns. For example, language models have been trained on multi-language datasets. This allows them to understand, generate, and translate text in multiple languages.

You can ask a question in English, and get a response back in Japanese.

Tokenization

Before a neural network can process text, the text must be converted into numerical data, a process known as tokenization.

  1. Tokenization breaks down text into smaller units called tokens.
  2. Each token is then mapped to a numerical value that the model can process.
  3. Models learn to understand the statistical relationships between these tokens, which helps them predict and produce the next token in a given sequence.

To handle the complexity of human language, including rare words and misspellings, models use subword tokenization techniques like Byte Pair Encoding (BPE). BPE starts with individual characters and iteratively merges the most frequent pairs of symbols to form new tokens.

For instance, the word "unbelievable" might be tokenized into ["un", "believ", "able"]. This allows the model to understand and generate words it has not explicitly seen during training. Newer models have also developed specialized tokenizers for processing math and code, as both have unique syntax and structure.
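
To get a feel for BPE, here is a toy sketch of the merge loop. The corpus and merge count are invented, and real tokenizers (such as OpenAI's tiktoken) are far more sophisticated, but the core idea is this simple.

from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs across the whole corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged token
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# A tiny made-up corpus: each word as a tuple of characters with a frequency
words = {tuple("unbelievable"): 5, tuple("believe"): 10, tuple("able"): 8}
for _ in range(5):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)
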

Pretraining

Large Language Models (LLMs) are trained on vast amounts of data.

Collecting and cleaning this data is much more complex than simply scraping websites. If you feed the model poor-quality data, you'll get poor-quality predictions in return. Garbage in, garbage out.

Training, also called pretraining, is the process of teaching the network to recognize patterns in the input data. Essentially, we're taking large chunks of internet text and compressing them into knobs the model can tune, known as weights and biases (collectively called parameters).

The number of parameters is often used as a measure of a model's power⁷. For example, open models like Llama 4 have versions ranging from 17 billion active parameters (Scout) to 288 billion active parameters (Behemoth).

You can think of training like a big for loop that runs many times, each time adjusting the weights and biases slightly. Each complete pass through the entire training dataset is called an epoch. But how does the network know how to adjust these values?

The training process involves several key steps:

  1. Forward Pass: The network takes input data and passes it through its layers to produce an output. This is called the forward pass. During this step, the network makes a prediction based on its current weights and biases.
  2. Loss Function: After obtaining the output, we need to measure how good or bad the prediction is. This is done using a loss function, which compares the prediction to the correct answer and gives us a value to minimize.
  3. Backward Pass (Backpropagation): To improve the network’s performance, we need to adjust the weights and biases to reduce the loss. This involves computing the gradient of the loss function with respect to each weight and bias. The gradients indicate the direction and magnitude by which to adjust the parameters to minimize the loss.
  4. Gradient Descent: Using the calculated gradients, we update the weights and biases. This process is called gradient descent, an optimization algorithm⁴ that iteratively adjusts the parameters to minimize the loss function.
(Figure: input flows through the network to produce a prediction; each epoch repeats the cycle of forward pass, loss calculation, backward pass, and weight update.)

Training Pseudocode

weights, biases = init_random_parameters()

epochs = 10
learning_rate = 0.01  # See footnote ⁴ on choosing a learning rate

for epoch in range(epochs):
  for input_data, target in training_dataset:
    # Step 1: Forward Pass (make a prediction)
    predictions = forward_pass(input_data, weights, biases)

    # Step 2: Loss Function (measure how wrong the prediction is)
    loss = loss_function(predictions, target)

    # Step 3: Backward Pass (compute gradients via backpropagation)
    gradients = backpropagation(loss, weights, biases)

    # Step 4: Gradient Descent (update the parameters)
    weights -= learning_rate * gradients["weights"]
    biases -= learning_rate * gradients["biases"]

  print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss}")

For a more complete example, see Karpathy’s microGPT.

By repeatedly performing these steps over many epochs, the network learns to make better predictions. It uses the gradients calculated during backpropagation to adjust its internal parameters, effectively minimizing the loss function and improving its performance over time.

Aside: What are scaling laws?

You might hear people talk about scaling laws in the context of training large language models. In "Scaling Laws for Neural Language Models", researchers at OpenAI showed that as we make models larger and train them with more data and compute power, their performance tends to improve in predictable ways.

By throwing more compute at the problem and building bigger models, we’ve been able to achieve better and better results. In 2024, there was active debate about whether we were approaching a limit to this approach. By 2026, pretraining scaling has continued to work, but the bigger story has been test-time compute: giving models more time to reason, run tools, and iterate during inference.

Transformers

"Attention Is All You Need" introduced the transformer architecture, a more efficient and powerful way of building language models.

Before processing, words are converted into numerical vectors called embeddings. These embeddings capture semantic information, meaning that similar words are represented by similar vectors in a high-dimensional space.

For example, the relationship between "king" and "queen" is similar to that of "man" and "woman". If you subtract the vector for "man" from "king" and add "woman", you get a result close to "queen".
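
Here is that arithmetic with toy vectors. These 3-dimensional values are hand-picked so the example works out exactly; real embeddings have hundreds or thousands of dimensions learned from data.

import numpy as np

# Hand-picked toy embeddings, not real learned values
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.5, 0.9, 0.1]),
    "woman": np.array([0.5, 0.2, 0.8]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in exactly the same direction
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman should land near queen
result = embeddings["king"] - embeddings["man"] + embeddings["woman"]
for word, vec in embeddings.items():
    print(word, round(cosine_similarity(result, vec), 3))
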

(Figure: words like "king", "queen", "man", "woman", "prince", and "princess" plotted as vectors in high-dimensional space.)

Transformers use attention mechanisms, specifically self-attention. While attention had been used in earlier papers and models, those models still read sequences word by word. Transformers can consider the entire input sequence at once.

This allows the model to focus on different parts of the input when generating each part of the output, much like how a human understands language by considering the context of a sentence or paragraph.

For example, in the sentence "The cat sat on the mat because it was tired," the word "it" refers to "cat." Self-attention helps the model establish this connection by assigning higher weights to related words, enabling it to understand that "it" refers to "cat" and not "mat."

Using self-attention, transformers can process entire sequences of data in parallel rather than one element at a time, making training faster and more efficient, especially with GPUs.
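
Here is a minimal sketch of the scaled dot-product attention at the core of this mechanism. For brevity it skips the learned query, key, and value projection matrices (and the multiple attention heads) that real transformers use.

import numpy as np

def self_attention(X):
    # Scaled dot-product attention. Real transformers first project X
    # into separate query, key, and value matrices with learned weights;
    # we use X directly for all three here.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)      # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ X                 # each output is a weighted mix of all tokens

# 10 tokens ("The cat sat on the mat because it was tired"),
# each represented by a made-up 4-dimensional embedding
X = np.random.randn(10, 4)
print(self_attention(X).shape)  # (10, 4): one context-aware vector per token
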

Fine-tuning

After pretraining, the model may need to be adapted for specific tasks or aligned with human values. Fine-tuning is the process of adjusting a pretrained model to have a unique style or perform well in a specific domain.

Fine-tuning involves using a smaller, high-quality dataset relevant to the specific task. This data is often human-reviewed to ensure quality and alignment with the desired outcomes. During fine-tuning, the model’s parameters—the weights and biases we’ve discussed earlier—are adjusted slightly to improve performance on this new dataset.

Reinforcement Learning

To further improve the model’s usefulness and ensure it behaves in desirable ways, a technique called Reinforcement Learning from Human Feedback (RLHF) is used. RLHF involves a few steps:

  1. The AI model generates several possible responses to a given input or question.
  2. Human reviewers compare these responses and rank them from best to worst based on criteria like accuracy, helpfulness, and alignment.
  3. Using the human rankings, a separate model called a "reward model" is trained to predict the quality of responses.
  4. The original AI model is then fine-tuned using reinforcement learning techniques guided by the reward model.
(Figure: the model generates multiple candidate responses, for example a brief answer, a detailed explanation, and a refusal; humans rank them, a reward model is trained on the rankings, and the original model is fine-tuned against it.)
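
Here is a sketch of step 3, training the reward model from rankings. The scoring function is a stand-in (a real reward model is itself a neural network, often initialized from the base model), and the pairwise loss shown is one common formulation.

import numpy as np

def reward_model(response):
    # Stand-in scorer for illustration only; in practice this is a
    # neural network trained on the human rankings.
    return 0.1 * len(response.split())

def pairwise_loss(chosen, rejected):
    # Push the score of the human-preferred response above the
    # rejected one: -log(sigmoid(score_chosen - score_rejected))
    margin = reward_model(chosen) - reward_model(rejected)
    return -np.log(1 / (1 + np.exp(-margin)))

# Lower loss when the reward model already agrees with the human ranking
print(pairwise_loss("Let me explain in detail...", "No."))
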

RLHF can be effective with hundreds to thousands of high-quality human-rated examples. The emphasis is on quality over quantity, as each example reflects careful human judgment.

For example, if you want a language model for customer support, you might fine-tune it on transcripts of customer service interactions and then use RLHF to ensure it responds politely and effectively resolves customer issues. Increasingly, AI is also used to generate initial content and humans help refine it through this feedback loop.

Inference

After training and fine-tuning, the model is ready for inference: the process of making predictions or generating outputs from new inputs. During inference, the model applies what it has learned to produce responses. The model may support multimodal input (text, images, audio, etc.) and output.

One important parameter during inference is temperature, which controls how random or focused the model's outputs are. A lower temperature makes the model more deterministic, while a higher temperature increases creativity and randomness.

"The cat sat on the ___"
mat
45%
floor
25%
couch
15%
table
8%
bed
4%
chair
3%
Temperature1.0 - Balanced
0.1 (deterministic)2.0 (random)

Moderate temperature balances between likely and diverse outputs. Good for most tasks.
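
Here is how temperature reshapes that distribution mathematically. The probabilities are taken from the example above; dividing the log-probabilities by the temperature before re-normalizing is the standard trick.

import numpy as np

probs = {"mat": 0.45, "floor": 0.25, "couch": 0.15,
         "table": 0.08, "bed": 0.04, "chair": 0.03}

def apply_temperature(probs, temperature):
    # Divide log-probabilities by the temperature, then re-normalize
    logits = np.log(np.array(list(probs.values())))
    scaled = logits / temperature
    exps = np.exp(scaled - scaled.max())
    return dict(zip(probs.keys(), exps / exps.sum()))

print(apply_temperature(probs, 0.1))  # "mat" dominates: nearly deterministic
print(apply_temperature(probs, 2.0))  # flatter distribution: more random
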

Modern language models can perform complex reasoning tasks using techniques like chain of thought. This involves the model generating intermediate reasoning steps before arriving at a final answer, much like how you might work through a math problem step by step⁵. CoT started as a prompting technique, but newer reasoning models have this behavior trained directly into the model.
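
As a sketch, the prompting version of this technique can be as simple as changing the instruction. The model name and API call are omitted; this only illustrates the prompt structure.

direct_prompt = "What is 17 * 24?"

# Asking for intermediate steps often improves accuracy on
# multi-step problems:
cot_prompt = "What is 17 * 24? Think step by step before answering."

# A model might respond to the second prompt with something like:
# "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. Final answer: 408."
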

Distillation

Sometimes you need a model that can run efficiently on smaller devices or handle millions of requests with lower inference costs. That’s where model distillation comes in, which is a technique for creating smaller, faster models that still perform well.

The process works by training a smaller "student" model to mimic a larger "teacher" model. Instead of learning from raw data, the student learns from the teacher’s outputs. Think of it like a compressed version of the knowledge. You lose some detail, but keep most of what matters.
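
Here is a sketch of the core objective. The distributions are invented; the point is that the student trains against the teacher's full probability distribution over tokens (soft targets), not just the single correct token.

import numpy as np

teacher_probs = np.array([0.70, 0.20, 0.05, 0.05])  # teacher's soft targets over 4 tokens
student_probs = np.array([0.40, 0.30, 0.20, 0.10])  # student's current prediction

# KL divergence measures how far the student's distribution is from
# the teacher's; training minimizes it, pulling the student closer.
kl = np.sum(teacher_probs * np.log(teacher_probs / student_probs))
print(kl)  # approaches 0 as the student matches the teacher
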

For example, there are many distilled versions of Llama, which is an open-source model. The distilled versions run faster and have lower inference costs, while still being very good at certain tasks. The trade-off is usually a small decrease in quality for a significant gain in efficiency.

Evaluation

How do you know if your model is good? This is where evaluation (often shortened to "evals") becomes important. Just like you wouldn't ship code without tests (right?), you shouldn't deploy AI models without evals.

Evaluation involves testing your model against a set of tasks with known correct answers. These can be standardized benchmarks like Terminal-Bench for coding tasks, or custom evals specific to your use case. The key is to measure what actually matters for your app.

Good evals are comprehensive enough to catch edge cases, but also representative of real-world usage. Building custom evals for your domain is just as important as the model training itself. Otherwise, you won't know if quality regresses.
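
Here is a sketch of a minimal custom eval harness. generate() is a hypothetical stand-in for a call to whatever model you're testing, and the cases are invented; a real suite would have many more cases and fuzzier matching.

def generate(prompt):
    # Hypothetical stand-in for a real model call
    return "4"

eval_cases = [
    {"prompt": "What is 2 + 2?", "expected": "4"},
    {"prompt": "What is the capital of France?", "expected": "Paris"},
]

def run_evals(cases):
    # Score the fraction of cases where the model's answer matches
    passed = sum(generate(c["prompt"]).strip() == c["expected"] for c in cases)
    return passed / len(cases)

print(f"Accuracy: {run_evals(eval_cases):.0%}")  # track this across model versions
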

Example: Coding Agent

Let’s say you want to build a coding agent⁶ like Cursor Composer:

  1. You’d likely start with an existing model that already understands code, perhaps Kimi K2.5 or a similar open-source model. These models have been pretrained on massive amounts of public code and understand syntax, common patterns, and multiple programming languages.

  2. Next, you’d fine-tune this model on data specific to the coding domain. This could include high-quality code, internal documentation, coding standards, or architectural patterns. The model learns the nuances of how real developers write code.

  3. To ensure the assistant gives helpful suggestions, you’d likely use both RLHF and general reinforcement learning. RLHF works well with hundreds to thousands of carefully human-rated examples. But for coding, you can also use RL at a much larger scale by defining automated reward signals: does the generated code compile? Does it pass tests? Does the resulting diff actually solve the issue? This lets you run thousands of rollouts and continuously improve the model without needing a human to judge every single output.

  4. Throughout training, you’d want the RL environment to closely match the real product environment. For example, if your agent has tools like file reading, code search, and terminal commands, you’d train the model using those exact same tools so it learns to use them effectively. This is how Cursor trained Composer: by running rollouts in an environment that mirrored their production IDE, the model learned to call tools in parallel, search codebases effectively, and read before editing.

  5. Finally, you’d continuously evaluate the model using both automated benchmarks (like Terminal-Bench) and internal benchmarks built from real usage.

Then, you’d keep iterating. As developers use the model, their feedback provides more signal for training. RL is particularly powerful here because improvements compound: more rollouts, better rewards, and a tighter feedback loop between the model and the real coding environment it operates in.

Resources

Still interested? There’s so much more to learn. Hopefully I have piqued your interest. If you have other resources, please reach out and I'll add them here.

¹: Some older AI models were not this way; see ELIZA, the first chatbot, created in the mid-1960s.

²: This is where evals come in to add predictability, which you can think of like end-to-end tests for AI. You can also change the temperature of the model. A lower temperature makes the output more deterministic and focused on the most probable responses, while a higher temperature increases randomness, allowing for more creative outputs.

³: An activation function decides how much information a neuron should pass along to the next layer, often squashing the values to a range like 0 to 1 (or -1 to 1). Think of it like a light switch or dimmer that controls how "on" or "off" a neuron is. Common examples include ReLU (Rectified Linear Unit) and sigmoid.

⁴: In basic gradient descent, a fixed learning rate determines the size of the steps we take towards minimizing the loss function. Setting this learning rate can be tricky; a rate that’s too high might cause the model to overshoot the minimum, while a rate that’s too low can make training painfully slow. More advanced optimization algorithms adjust the learning rate on the fly for you, taking larger steps at the beginning and smaller steps later on.

⁵: Prompt engineering is the art of crafting effective inputs to get the best outputs from AI models. Techniques include being specific, providing examples (few-shot learning), and structuring requests clearly. For engineers, mastering prompt engineering is often more practical than training new models.

⁶: In specialized domains like coding, midtraining is also commonly used. This is an additional pretraining phase between general pretraining and fine-tuning, where the model is trained on a curated, domain-specific dataset (e.g., high-quality code and documentation) using a next-token prediction objective. It bridges the gap between broad web text and the more structured data used in post-training.

⁷: Update 2026: Many frontier models now use a Mixture of Experts (MoE) architecture, which changes how we think about parameter counts. In a MoE model, only a fraction of the total parameters are active for any given input. A "router" learns to send each token to the most relevant expert sub-networks. For example, a model might have 1 trillion total parameters but only use 32 billion per token. This means the model has the knowledge capacity of a much larger model while being as fast as a much smaller one.