Understanding AI

Lee Robinson

I wanted to better understand how AI models are created. Not to become an expert, but to gain an appreciation for the abstractions I use every day.

This post will highlight what I’ve learned so far. It’s written for other engineers who are new to topics like neural networks, deep learning, and transformers.

Machine Learning

Software is deterministic. Given some input, if you run the program again, you will get the same output. A developer has explicitly written code to handle each case.

Most AI models¹ are not this way — they are probabilistic. Developers don’t have to explicitly program the instructions.

Machine learning teaches software to recognize patterns from data. Given some input, you might not get the same output². AI models like GPT (from OpenAI), Claude (from Anthropic), and Gemini (from Google) are “trained” on a large chunk of internet documents. These models learn patterns during training.

Then, there’s an API or chat interface where you can talk to the model. Based on some input, it can predict and generate sentences, images, or audio as output. You can think about machine learning as a subset of the broader AI category.
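 
To illustrate the difference, here's a tiny, made-up sketch: the first function always returns the same answer, while the second samples from a probability distribution, the way a language model samples its next word. The words and weights are invented purely for illustration.

import random

def deterministic_program(x):
    # Traditional software: the same input always produces the same output
    return x * 2

def probabilistic_model(prompt):
    # Stand-in for an AI model: it samples from a probability distribution,
    # so the same prompt can produce different outputs on different runs
    return random.choices(["Francisco", "Diego", "Jose"], weights=[0.7, 0.2, 0.1])[0]

print(deterministic_program(21), deterministic_program(21))    # always 42 42
print(probabilistic_model("San"), probabilistic_model("San"))  # may differ between runs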

Neural Networks

AI models are built on neural networks — think of them as a giant web of decision-making pathways that learn from examples. Neural networks can be used for anything, but I'll focus on language models.

These networks consist of layers of interconnected neurons that process information:

  1. An input layer where data enters the system. Input is converted into a numerical representation of words or tokens (more on tokens later).
  2. Many hidden layers that create an understanding of patterns in the system. Neurons inside the layer apply weights (also known as parameters) to the input data and pass the result through an activation function³. This function outputs a value, often between 0 and 1, representing the neuron's level of activation.
  3. An output layer which produces the final result, such as predicting the next word in a sentence. The outputs at this stage are often referred to as logits, which are raw scores that get transformed into probabilities.

For example, if the input was “San”, there is likely an activation close to 1 for “Francisco” as the next word. An unrelated word like “kitten” would be close to 0.

A big takeaway for me is: it’s just math. You can build a neural network from first principles using linear algebra, calculus, and statistics. You likely won’t do this when there are helpful abstractions like PyTorch, but it helps me demystify what is happening under the hood.
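 
As a rough sketch of that math, here's a single forward pass through a tiny network in NumPy. The layer sizes, weights, and input are made up; a real model learns its weights during training.

import numpy as np

def relu(x):
    # Activation function: passes positive values through, zeroes out negatives
    return np.maximum(0, x)

def softmax(logits):
    # Turns raw output scores (logits) into probabilities that sum to 1
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Toy dimensions: 4 input features, 8 hidden neurons, 3 possible outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # input -> hidden
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)   # hidden -> output

x = rng.normal(size=4)          # numerical representation of the input
hidden = relu(x @ W1 + b1)      # hidden layer applies weights and an activation
logits = hidden @ W2 + b2       # raw scores from the output layer
probs = softmax(logits)         # probabilities, e.g. over candidate next words

print(probs)  # three numbers between 0 and 1 that sum to 1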

Deep Learning

Deep learning is a subset of machine learning that involves neural networks with many layers—hence the term "deep." While a simple neural network may have just one or two hidden layers, deep learning models can have hundreds or even thousands of layers.

These additional layers enable the network to learn complex patterns. For example, large language models have been trained on multi-language datasets. This allows them to understand, generate, and translate text in multiple languages.

You can ask a question in English, and get a response back in Japanese.

Tokenization

Before a neural network can process text, the text must be converted into numerical data—a process known as tokenization.

  1. Tokenization breaks down text into smaller units called tokens.
  2. Each token is then mapped to a numerical value that the model can process.
  3. Models learn to understand the statistical relationships between these tokens, which helps them predict and produce the next token in a given sequence.

To handle the complexity of human language—including rare words and misspellings—models use subword tokenization techniques like Byte Pair Encoding (BPE). BPE starts with individual characters and iteratively merges the most frequent pairs of symbols to form new tokens.

For instance, the word "unbelievable" might be tokenized into ["un", "believ", "able"]. This allows the model to understand and generate words it has not explicitly seen during training. Newer models also use specialized tokenizers designed to handle math and code, as both have unique syntax and structure.
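 
Here's a toy sketch of the core BPE loop: count the most frequent adjacent pair of symbols and merge it into a new token. The miniature "corpus" and the number of merges are made up; production tokenizers add many details on top of this.

from collections import Counter

def most_frequent_pair(words):
    # Count how often each adjacent pair of symbols appears across the corpus
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged token
    merged = {}
    for symbols, freq in words.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        merged[tuple(new_symbols)] = freq
    return merged

# Tiny "corpus": each word starts as a tuple of characters, with a count
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}

for _ in range(3):  # perform a few merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words))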

Pretraining

Large Language Models (LLMs) are trained on vast amounts of data.

Collecting and cleaning this data is much more complex than simply scraping websites. If you feed the model poor-quality data, you'll get poor-quality predictions in return. Garbage in, garbage out.

Training, also called pretraining, is the process of teaching the network to recognize patterns in the input data. Essentially, we're taking large chunks of internet text and compressing them into knobs the model can tune, known as weights and biases (collectively called parameters).

The number of parameters is often used as a measure of a model's power. For example, open-source models like Llama come in 7 billion and 70 billion parameter versions (with the latter being more powerful).

You can think of training like a big for loop that runs many times, each time adjusting the weights and biases slightly. Each complete pass through the entire training dataset is called an epoch. But how does the network know how to adjust these values?

The training process involves several key steps:

  1. Forward Pass: The network takes input data and passes it through its layers to produce an output. This is called the forward pass. During this step, the network makes a prediction based on its current weights and biases.
  2. Loss Function: After obtaining the output, we need to measure how good or bad the prediction is. This is done using a loss function, which compares the prediction to the target and gives us a single value to minimize.
  3. Backward Pass (Backpropagation): To improve the network's performance, we need to adjust the weights and biases to reduce the loss. This involves computing the gradient of the loss function with respect to each weight and bias. The gradients indicate the direction and magnitude by which to adjust the parameters to minimize the loss.
  4. Gradient Descent: Using the calculated gradients, we update the weights and biases. This process is called gradient descent, an optimization algorithm that iteratively adjusts the parameters to minimize the loss function.
Training Pseudocode
weights, biases = init_random_parameters()

epochs = 10
learning_rate = 0.01  # See ⁴

for epoch in range(epochs):
  for input_data, target in training_dataset:
    # Step 1: Forward Pass
    predictions = forward_pass(input_data, weights, biases)

    # Step 2: Loss Function
    loss = loss_function(predictions, target)

    # Step 3: Backward Pass (Backpropagation)
    gradients = backpropagation(loss, weights, biases)

    # Step 4: Gradient Descent (Parameter Update)
    weights -= learning_rate * gradients["weights"]
    biases -= learning_rate * gradients["biases"]

  print(f"Epoch {epoch + 1}/{epochs}, Loss: {loss}")

By repeatedly performing these steps over many epochs, the network learns to make better predictions. It uses the gradients calculated during backpropagation to adjust its internal parameters, effectively minimizing the loss function and improving its performance over time.

Aside: What are scaling laws?

You might hear people talk about scaling laws in the context of training large language models. In "Scaling Laws for Neural Language Models", researchers at OpenAI showed that as we make models larger and train them with more data and compute power, their performance tends to improve in predictable ways.


Essentially, by throwing more compute at the problem and building bigger models, we've been able to achieve better and better results. However, there's ongoing debate about whether we're approaching a limit to this approach. Some experts suggest that we might be “hitting a wall”, meaning that increasing model size and compute may not continue to yield substantial improvements. But the jury is still out.
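 
As a rough illustration of "predictable ways," the paper describes power-law relationships: loss falls smoothly as you scale up parameters, data, or compute. The constants below are invented purely to show the shape of such a curve; they are not the published fits.

def predicted_loss(n_params, n_c=1e12, alpha=0.08):
    # Illustrative power law: loss shrinks smoothly (but slowly) as models grow.
    # n_c and alpha are made up here; the paper fits such constants from experiments.
    return (n_c / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} parameters -> predicted loss {predicted_loss(n):.2f}")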

Transformers

One paper changed the course of AI: "Attention Is All You Need". It introduced the transformer architecture, a more efficient and powerful way of building language models.

Before processing, words are converted into numerical vectors called embeddings. These embeddings capture semantic information, meaning that similar words are represented by similar vectors in a high-dimensional space.

For example, the relationship between "king" and "queen" is similar to that of "man" and "woman". If you subtract the vector for "man" from "king" and add "woman", you get a result close to "queen".
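 
Here's a toy sketch of that vector arithmetic with made-up three-dimensional embeddings; real embeddings are learned during training and have hundreds or thousands of dimensions.

import numpy as np

# Made-up, tiny embeddings just to illustrate the idea
embeddings = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.1, 0.9]),
    "man":   np.array([0.2, 0.9, 0.1]),
    "woman": np.array([0.2, 0.1, 0.9]),
}

def cosine_similarity(a, b):
    # Similar vectors point in similar directions, giving a value near 1
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

result = embeddings["king"] - embeddings["man"] + embeddings["woman"]

# The closest word to king - man + woman should be "queen"
closest = max(embeddings, key=lambda w: cosine_similarity(result, embeddings[w]))
print(closest)  # queen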

Transformers use attention mechanisms, specifically self-attention. Attention had been used in earlier papers and models, but those models still read sequences word by word. Transformers can consider the entire input sequence at once.

This allows the model to focus on different parts of the input when generating each part of the output, much like how a human understands language by considering the context of a sentence or paragraph.

For example, in the sentence "The cat sat on the mat because it was tired," the word "it" refers to "cat." Self-attention helps the model establish this connection by assigning higher weights to related words, enabling it to understand that "it" refers to "cat" and not "mat."

By utilizing self-attention, transformers can process entire sequences of data in parallel rather than one element at a time, making training faster and more efficient—especially when leveraging GPUs.
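 
Here's a minimal sketch of scaled dot-product self-attention in NumPy, with random vectors standing in for learned token embeddings and projection weights. It shows how every token's attention weights over the whole sequence are computed in one matrix operation rather than word by word.

import numpy as np

def softmax(x, axis=-1):
    exps = np.exp(x - x.max(axis=axis, keepdims=True))
    return exps / exps.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # Project each token's embedding into query, key, and value vectors
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Every token attends to every other token; scores are computed in parallel
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)   # one row of attention weights per token
    return weights @ V                   # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 10, 16  # e.g. the 10 words of "The cat sat on the mat because it was tired"
X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (10, 16): one context-aware vector per input token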

Fine-tuning

After pretraining, the model may need to be adapted for specific tasks or aligned with human values. Fine-tuning is the process of adjusting a pretrained model to have a unique style or perform well in a specific domain.

Fine-tuning involves using a smaller, high-quality dataset relevant to the specific task. This data is often human-reviewed to ensure quality and alignment with the desired outcomes. During fine-tuning, the model's parameters—the weights and biases we've discussed earlier—are adjusted slightly to improve performance on this new dataset.

To further improve the model's usefulness and ensure it behaves in desirable ways, a technique called Reinforcement Learning from Human Feedback (RLHF) is used. RLHF involves a few steps:

  1. The AI model generates several possible responses to a given input or question.
  2. Human reviewers compare these responses and rank them from best to worst based on criteria like accuracy, helpfulness, and alignment.
  3. Using the human rankings, a separate model called a "reward model" is trained to predict the quality of responses.
  4. The original AI model is then fine-tuned using reinforcement learning techniques guided by the reward model.

For example, if you want a language model for customer support, you might fine-tune it on transcripts of customer service interactions and then use RLHF to ensure it responds politely and effectively resolves customer issues. Increasingly, AI is used to generate initial content and humans help refine it through this feedback loop, enhancing the model's performance in real-world applications.
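 
To make step 3 of RLHF a bit more concrete, here's a minimal sketch of the pairwise loss commonly used to train a reward model: the loss is small when the response humans preferred scores higher than the one they rejected. The scores here are made-up numbers standing in for a real reward model's outputs.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def reward_model_loss(score_preferred, score_rejected):
    # Pairwise comparison loss: low when the preferred response scores higher
    return -np.log(sigmoid(score_preferred - score_rejected))

# Made-up scores from a hypothetical reward model for two candidate responses
print(reward_model_loss(score_preferred=2.0, score_rejected=0.5))  # small loss
print(reward_model_loss(score_preferred=0.5, score_rejected=2.0))  # large loss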

Inference

After training and fine-tuning, the model is ready for inference—the process of making predictions or generating outputs from new inputs. During inference, the model applies what it has learned to produce responses. The model may support multimodal input (text, images, audio, etc.) and output.

Modern language models can perform complex reasoning tasks using techniques like chain of thought. This involves the model generating intermediate reasoning steps before arriving at a final answer, much like how you might work through a math problem step by step.

Resources

Still interested? There's so much more to learn. Hopefully I have piqued your interest. If you have other resources, please reach out and I'll add them here.


¹: Some older AI models were rule-based and deterministic – see the first chatbot ELIZA from 1964.


²: This is where evals come in to add predictability; you can think of them like end-to-end tests for AI. You can also change the temperature of the model. A lower temperature makes the output more deterministic and focused on the most probable responses, while a higher temperature increases randomness, allowing for more creative outputs.
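 
As a small sketch of what temperature does under the hood, here's how dividing the logits by a temperature before the softmax changes the resulting probabilities (the logits are made-up numbers):

import numpy as np

def softmax_with_temperature(logits, temperature):
    # Lower temperature sharpens the distribution; higher temperature flattens it
    scaled = np.array(logits) / temperature
    exps = np.exp(scaled - scaled.max())
    return exps / exps.sum()

logits = [2.0, 1.0, 0.5]                      # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.5))  # the most probable token dominates
print(softmax_with_temperature(logits, 2.0))  # probabilities are spread more evenly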


³: An activation function decides how much information a neuron should pass along to the next layer, often squashing the values to a range like 0 to 1 (or -1 to 1). Think of it like a light switch or dimmer that controls how "on" or "off" a neuron is. Common examples are ReLU (Rectified Linear Unit) and the sigmoid function.


⁴: In basic gradient descent, a fixed learning rate determines the size of the steps we take towards minimizing the loss function. Setting this learning rate can be tricky—a rate that's too high might cause the model to overshoot the minimum, while a rate that's too low can make training painfully slow. More advanced optimization algorithms adjust the learning rate on the fly for you, taking larger steps at the beginning and smaller steps later on.
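 
As a tiny, made-up illustration of "larger steps at the beginning and smaller steps later on," a simple schedule might shrink the learning rate each epoch:

initial_lr = 0.1  # invented starting value

for epoch in range(5):
    lr = initial_lr / (1 + epoch)  # decays: 0.100, 0.050, 0.033, 0.025, 0.020
    print(f"epoch {epoch}: learning rate {lr:.3f}")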