Proposition: The ability to make predictions over extended time horizons \(\iff\) abstraction.

We live in an era of machines endowed with ever-increasing long-term predictive ability. ChatGPT exhibits linguistic intelligence through a chat interface, driven by language prediction (next-token prediction). AlphaFold and ESM have largely solved protein folding, driven by missing-residue prediction and the PDB. Vision transformer models have achieved SoTA on most AI image-processing tasks, driven by missing-patch prediction.

To what extent do these models form “rational” abstractions to achieve these feats of prediction?
To what extent do these models *truly understand* the data they process and the world in which they act?
Is there a fundamental difference between the mode of intelligence we humans use to understand the world and the mode of intelligence used by these machines?

The construction of abstract models of the world (“world models”) has emerged as a proxy for asking whether a system “understands” its world. The question of whether LLMs are learning a world model or not is of great interest to AI researchers. People like Ilya Sutskever, co-founder and Chief Scientist of OpenAI, believe that predicting language implies a world model. People like Yann LeCun beg to differ.

The idea that “*The ability to make predictions over extended time horizons \(\iff\) abstraction*” keeps coming up as I try to formalize what abstraction is and why it’s interesting.
In this article, I attempt to formalize the notion of abstraction and show its relationship with long-time horizon prediction.

Abstraction involves mapping complex input information to an (often compressed) representation that preserves the relevant features for a given task. Formally, we define an abstraction function \(A\) that operates on input information \(I\) in feature space \(\mathcal F\).

**Definition:** An abstraction function \(A: I \to I'\) maps the input information \(I\in \mathcal F\) to an abstract representation \(I'\in \mathcal F'\), where the abstract feature space \(\mathcal F'\subseteq \mathcal F^*\) and \(\mathcal F^*\) is the set of all computable functions on the input feature space \(\mathcal F\).

An effective abstraction should satisfy four key properties:

1. **Information preservation**: \(MI(I; I')\), the mutual information between input \(I\) and abstract representation \(I'\), must be high.
2. **Complexity reduction**: \(K(I')\), the Kolmogorov complexity of the abstract representation \(I'\), must be low.
3. **Task-relevance**: \(P(A, t; k)\), the performance of abstraction \(A\) on task \(t\) using an optimal program of Kolmogorov complexity \(\leq k\), must be maximized.
4. **Generalization**: \(G(A) = \frac{1}{N} \sum_{i=1}^{N} P(A, t_i; k)\), the average performance of optimal machines of Kolmogorov complexity \(\leq k\) using representation \(A\) on a related task family \(t_1, \dots, t_N\), must be high.

I suspect some of these properties will turn out to be redundant (e.g., properties 3-4 may be implied by 1-2). This list covers the main attributes of abstraction that are exciting to me. The more I thought about it, the more it seemed like properties 1-2 (information preservation and complexity reduction) are equivalent to “abstract representations that enable long-term prediction at minimal computational cost”.

A system \(S\) can predict process \(\{I_j\}_{j=1}^N\) if it can generate all \(\{I_j\}_{j\in [N]}\) based on \(I_1\). We are generally interested in computationally bounded systems \(S\) that are able to predict process \(\{I_j\}_{j\in[N]}\) with high fidelity (e.g., can assign maximal likelihood to sequences generated from the process \(\{I_j\}_{j\in [N]}\)).

Consider a bounded computational system \(S\) with Kolmogorov complexity \(\|S\| \leq k\) capable of predicting the future states of an input signal \(I_1, \dots, I_N\) over a long time horizon \(N\) based on starting state \(I_1\). This implies the existence of a program \(p\) of complexity \(\|p\| \leq k + H(I_1)\) that can generate \(I\). If system \(S\) has a memory smaller than \(|I_i|\), an information-preserving, reduced-complexity representation of \(I\), denoted \(I'\), is implicit in system \(S\).

**Proof sketch**: Given system \(S, \|S\| \leq k\) capable of predicting all \(I_1, \dots, I_N\) from starting state \(I_1\), create program \(p = \{S, I_1\}\) with complexity \(\|p\| \leq k + H(I_1)\) that can generate \(I_1, \dots, I_N\) by running \(S\) on \(I_1\).
If the memory state of program \(p\) has complexity less than \(\|I_i\|\), the memory state can be taken as \(I'_i\), the reduced-complexity representation of input signal \(I\).
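
As a toy illustration of this constructive argument (my own example, with a sine wave standing in for the process \(I\)): a long periodic signal can be predicted over its entire horizon from a memory state far smaller than the signal itself.

```python
# Toy example: a 1000-sample periodic signal predicted from a 3-number
# abstract state (amplitude, period, phase) -- far smaller than the signal.
import numpy as np

N = 1000
t = np.arange(N)
I = np.sin(2 * np.pi * t / 50)  # "ground truth" process {I_j}

# Abstract representation I': three parameters instead of 1000 samples.
amplitude, period, phase = 1.0, 50.0, 0.0

def predict(n):
    """Bounded system S: generates I_n from the compressed state alone."""
    return amplitude * np.sin(2 * np.pi * n / period + phase)

# Long-horizon prediction succeeds with a tiny memory footprint.
assert np.allclose(predict(t), I)
```

The compressed state here plays the role of \(I'\): it preserves everything needed to regenerate the signal while having far lower complexity than the raw samples.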

Suppose we have a program \(p\) that can generate representation \(I' = I'_1, \dots, I'_N\) where \(I'\) is the abstract representation \(A(I)\) of ground truth data \(I = I_1, \dots, I_N\). Assume that \(MI(I; I') \to H(I)\). This implies the existence of a bounded computational system \(S\) with Kolmogorov complexity \(|S| \leq |p|\) capable of predicting the future states of input signal \(I_1, \dots, I_N\) based on starting state \(I_1\).

**Proof sketch**: Same constructive argument as above. In this case, we can always create \(S = p\) – a system that ignores the first input \(I_1\) and just uses \(p\) to generate the signal \(I\).
In general, we would expect a more efficient solution to exist where \(|S| < |p|\).

*Exercise for the reader.*

In cortical neuroscience, it is thought that higher-abstraction representations are localized in layers of the cortex closer to the surface, and lower-abstraction representations are localized in the more internal layers of the cortex. It is believed that the cortex is largely a prediction machine, optimizing its representations and activity to predict subsequent sensory-motor input and other information projected from non-cortical brain regions (see Jeff Hawkins’s work at Numenta for a more comprehensive description of these ideas). Since sensory-motor information (and myriad projections from non-cortical brain regions) is received at the inner-most layers of the cortex, outer regions are forced to work with “stale” information, and can only propagate the results of their computation back to the inner layers with a transmission delay. The idea that long time horizon prediction necessitates the formation of abstract representations – and vice versa – may help us understand this “design decision” in the structure of the cortex. It’s a helpful idea if you’re interested in building computronium and thinking machines, too.

Large language models (LLMs) are increasingly being used as components within software systems. We live in a world where you can get computational fluid dynamics help from your friendly neighborhood Chevy car salesman chatbot. You can ask an LLM to perform automated literature reviews. You can even use them to simulate military strategy.

With the increasing zero-shot capabilities of frontier language models like GPT-4, Claude, and Gemini, we already see the proliferation of “LLM-powered” software systems. It feels like we will soon be able to build hyper-competent AI systems and agents just by prompting an extremely smart model!

On the other hand, LLMs are extremely hard to predict. Subtle shifts in prompting yield radically different performance (cf. Prompt Breeder, the GPT-3 paper, and How Can We Know What Language Models Know?). To make matters worse, Yann LeCun says that LLMs are “exponentially diverging stochastic processes”, which is rough for those of us hypothetically interested in building an AGI using LLMs.

I believe that control theory can help us make progress on the barriers
separating us from being able to build hyper-competent LLM-based systems.
Control-theoretic notions of reachability, controllability, and stability are
readily applicable to LLM systems.
Moreover, the lens of control theory naturally leads to a wide variety of
**tractable, fundamental problems** to work on, using both empirical and
analytic methods.

This post focuses on my motivations for pursuing LLM control theory. I hope you consider reading our paper “What’s the Magic Word? A Control Theory of LLM Prompting” for the details of our formalization and results on the controllability of LLMs.

*Magic Words paper abstract – available at arXiv:2310.04444*

Studying and augmenting LLM capabilities currently revolves around zero-shot and
few-shot benchmarks.
To demonstrate the utility of a technique, LLM researchers often measure success
on benchmarks like “HellaSwag”, “MMLU”, “TruthfulQA”, “MATH”, and other
creatively named benchmarks.
These benchmarks aim to measure how well an LLM is able to answer knowledge,
reasoning, and mathematical questions.
Benchmarks are a useful tool for understanding models, but they fail to account
for the **dynamical** nature of LLM-based software systems.
LLM system designers – a.k.a. prompt engineers – build
software around an LLM to achieve some goal (e.g., teach students, sell cars,
review job applications, perform research, etc.).
The interaction between the software and the LLM yields non-trivial dynamics as
the LLM generates text based on the current state (context), influencing the
software, which in turn influences subsequent generation by modifying the LLM’s
state.

Currently, LLM system design and prompt engineering are highly empirical.
We lack guiding principles and theory on how these more dynamical LLM systems
will act, particularly when we have partial control over the input (e.g., we
directly control the system prompt) but incomplete control over some imposed
tokens (e.g., user input or programmatic feedback from tools).
Given a limited budget of controllable prompt tokens \(k\) and some imposed state
tokens \(\mathbf x_0\), does there exist a control input \(\mathbf u\) with
\(|\mathbf u| \leq k\) that steers the model to produce some desired output
\(\mathbf y\)?
If not, is there some structure that determines which outputs are **reachable**?
Can we find patterns in the **controllability** of language models from the
perspective of zero-shot prompting?
These are exactly the questions we seek to answer in our paper.

Thinking in the language of control theory has brought me a lot of clarity in
thinking about the questions that naturally arise in LLM systems development.
Control theory studies how a “plant” system can be influenced toward
a desired state using a “control signal” – often in the presence of
disturbances and uncertainty.
This is precisely our goal when building LLM-based systems.
We have a strange, somewhat unpredictable system (LLM) for which we must build a
programmatic **controller** that steers it toward achieving some objective,
often in the presence of external disturbances (e.g., unpredictable user input).
The system has an internal state, and is impinged upon by some external input
(e.g., user input, programmatic tools like web browsers and terminals).
The state is updated by sampling new tokens from the LLM or receiving external
input tokens.
Changes to the state affect future state updates, yielding non-trivial dynamics.

Control theory is usually taught in terms of continuous-time linear ordinary differential equations (ODEs). LLM systems, on the other hand, operate on variable length strings of discrete tokens, and are generally run in a stochastic manner. We highlighted the following differences between conventional ODE-based systems and LLM-based systems in our paper:

- **Discrete state and time:** LLM systems operate on sequences of discrete tokens over a discrete time set, in contrast to the continuous state spaces and time sets studied in classical control theory.
- **Shift-and-grow state dynamics:** Whereas the system state in an ODE-based system has a fixed size over time, the system state \(\mathbf x(t)\) for LLM systems grows as tokens are added to the state sequence.
- **Mutual exclusion between control input tokens and generated tokens:** The LLM system state \(\mathbf x(t)\) is written one token at a time. The newest token is either drawn from the control input \(u(t)\) or generated by the LLM by sampling \(x'\sim P_{LM}(x' \mid \mathbf x(t))\). This differs from traditional discrete stochastic systems, where the control sequence and internal dynamics generally affect the state synchronously.

Despite these differences, the mathematical machinery of control theory is still applicable. Our recent paper, What’s the Magic Word? A Control Theory of LLM Prompting develops control theory for LLMs, starting with the fundamental set-theoretic basis of mathematical systems and control theory. This lets us formalize notions of reachability, controllability, stability, and more for LLM-based systems. Importantly, our formalization is general enough to apply to LLM systems with a variety of augmentations, including tool-wielding, user interaction, and chain-of-thought style reasoning schemes.
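
As a cartoon of what reachability means in this setting (my own toy construction; a parity rule over a three-token vocabulary stands in for the LLM, and is not the paper's setup):

```python
# Toy sketch of k-token reachability: can a control prompt u with |u| <= k,
# prepended to the imposed tokens x0, steer the "model" to output y?
# The "model" here is a contrived lookup rule, NOT a real LLM.
import itertools

VOCAB = ["a", "b", "c"]

def next_token(state):
    # contrived deterministic dynamics: the output depends on the parity
    # of the number of 'a' tokens in the state
    return "b" if state.count("a") % 2 == 0 else "c"

def steering_prompt(x0, y, k):
    """Return the shortest control prompt u (|u| <= k) reaching output y, or None."""
    for n in range(k + 1):
        for u in itertools.product(VOCAB, repeat=n):
            if next_token("".join(u) + x0) == y:
                return "".join(u)
    return None

print(repr(steering_prompt("a", "c", 2)))  # '' -- reachable with the empty prompt
print(repr(steering_prompt("a", "b", 2)))  # 'a' -- one control token flips the output
```

With a real LLM, `next_token` would be the argmax of \(P_{LM}(x' \mid \mathbf u \oplus \mathbf x_0)\), and the brute-force search would be replaced by prompt optimization.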

Developing methods to control a system is a great way to understand the system.
Excitingly, the control theoretic lens immediately suggests a variety of
**tractable, fundamental questions** about the nature of LLM systems.
Here are a few exciting open questions we highlighted in the paper:

- **Control Properties of Chain-of-Thought:** Chain-of-Thought is a powerful technique where LLMs are allowed to generate intermediate tokens (i.e., “thoughts”) between a question and an answer. The control properties (e.g., stability, reachability) of systems leveraging these techniques are of great interest for understanding and composing systems of LLMs in the real world.
- **Distributional Control:** How precisely can we control the next-token distribution by manipulating the prompt? Can we force the KL-divergence between the next-token distribution and an arbitrary desired distribution to zero? While our work focuses on manipulating the probability distribution’s argmax (i.e., the *most likely* next token), it remains unclear how controllable the *distribution* is.
- **Learnability of Control:** To what extent can LLMs learn to control each other? The paper “Large Language Models Are Human-Level Prompt Engineers” showed – you guessed it – that LLMs are capable of human-level prompt engineering, but it is unclear how well an LLM can learn to control another when explicitly optimized on the objective of LLM control.
- **Controllable Subspaces:** In the control of linear dynamical systems, it is known that uncontrollable systems are often coordinate-transformable into a representation where a subset of the coordinates are controllable and a subset are uncontrollable. Our analytic results showed that controllable and uncontrollable components naturally emerge for self-attention heads. Can this be generalized to transformer blocks with nonlinearities and residual streams?
- **Composable LLM Systems:** One of the greatest boons of control theory is the ability to compose control modules and subsystems into an interpretable, predictable, and effective whole. The composition of LLM systems (potentially with non-LLM control modules) is an exciting avenue for scaling superintelligent systems.
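
To make the distributional control question concrete, here is a toy numeric sketch (the lookup table of prompt-conditioned next-token distributions is entirely invented; it only stands in for querying a real LLM):

```python
# Toy sketch of distributional control: search over candidate control prompts
# for the one whose next-token distribution is closest (in KL) to a target.
# The distributions below are made-up stand-ins for a real LLM's outputs.
import numpy as np

# hypothetical next-token distributions over a 3-token vocabulary,
# one per candidate one-token control prompt (plus the empty prompt "")
toy_distributions = {
    "":  np.array([0.7, 0.2, 0.1]),
    "a": np.array([0.4, 0.5, 0.1]),
    "b": np.array([0.1, 0.8, 0.1]),
    "c": np.array([0.3, 0.3, 0.4]),
}
target = np.array([0.1, 0.8, 0.1])  # desired next-token distribution

def kl(p, q):
    """KL divergence KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

best = min(toy_distributions, key=lambda u: kl(target, toy_distributions[u]))
print(best, kl(target, toy_distributions[best]))  # prompt 'b' drives KL to 0
```

In a real experiment, the dictionary would be replaced by softmaxed logits from the model under each candidate prompt.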

In our paper, What’s the Magic Word? A Control Theory of LLM Prompting, we take the following steps toward establishing the discipline of LLM control theory:

- Formalize LLMs as a class of **discrete stochastic dynamical systems**.
- Investigate the **reachable set** of system outputs \(\mathcal R_y(\mathbf x_0)\), for which there exists a control input sequence \(\mathbf u\) for each \(\mathbf y \in \mathcal R_y(\mathbf x_0)\) that steers the LLM to output \(\mathbf y\) from initial state sequence \(\mathbf x_0\).
- **Prove** an upper bound on the controllability of token representations in self-attention.
- Empirically study the **controllability of a panel of open-source language models** (Falcon-7b, Llama-7b, Falcon-40b) w.r.t. initial states sampled from the Wikitext dataset, developing a tractable statistical metric (“\(\pmb k\)-\(\pmb \epsilon\) **controllability**”) for measuring LLM steerability.
- We find that the **correct next Wikitext token** following sequence \(\mathbf x_0\) is reachable over 97% of the time with prompts of \(k\leq 10\) tokens.
- We also establish that the **top 75** most likely next tokens, as estimated by the LLM itself, are reachable at least 85% of the time with prompts of \(k\leq 10\) tokens.
- Short prompt sequences can dramatically alter the likelihood of specific outputs, even making the **least likely tokens become the most likely ones**.

I hope you consider reading the full paper and joining us in investigating LLMs through the lens of control theory!

\[P_\theta(x_{i+1} \mid x_1, \dots, x_i) \tag{1}\]

*Where \(\theta\) are the many parameters of the LLM and \(x_1, \dots, x_i\) is a sequence of tokens representing some text.*

This tutorial will walk through loading the
Falcon-7b model from HuggingFace,
show how to **fine-tune** it, and finally create some visualizations of the
hidden representations inside the LLM.

The GPU market is insane. I’ve rented GPU time on a few services like Google Colab, Google Compute Engine, and Paperspace. Overall, I think most services where you can get root access to your own Linux machine with an Nvidia GPU are fine.

Working on an `mps` (Apple Silicon) machine is sometimes limiting since there’s a bit of development lag between advances in the field and creating a version that runs on Apple Silicon. Plus, the upper bound on GPU power is low compared to Nvidia, especially for multi-GPU setups.

Colab is also limiting because you’re forced to use the GPU through an IPython kernel, which limits the amount of control you have. Google also tends to rug-pull long compute jobs, and you’re limited to 1 A100 and a notebook interface (unless you break Google’s rules against setting up an ssh tunnel, which could get you banned from Colab in the future).

Overall, you end up saving a lot of time by developing on the same (or at least similar) infrastructure as you plan to deploy on. I would recommend an Ubuntu server with a big GPU (I’m using one of Paperspace’s A100 instances with ML in a Box). This guide should also work if you are on Apple Silicon with enough RAM to support Falcon-7b.

I use Python virtual environments with the `venv` module (docs) to keep my projects’ dependencies from influencing each other. If you plan to deploy your code on a more diverse range of machines, consider using a Docker or Singularity container as well.
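
Concretely, creating and activating a `venv` might look like this (the `.venv` directory name is my own convention):

```shell
# create the virtual environment in ./.venv
python3 -m venv .venv
# activate it (bash/zsh; on Windows use .venv\Scripts\activate)
source .venv/bin/activate
```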

Once activated, install the requirements in `requirements.txt`:

```
numpy
matplotlib
pandas
jupyter
torch
# Transformer-related modules
transformers # huggingface transformers
einops
accelerate
xformers
```

Thanks to the great work at HuggingFace, it’s very straightforward to load models and start playing with them.
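
A minimal way to load the model and its tokenizer looks roughly like this (the `tiiuae/falcon-7b` checkpoint id is the standard one on the Hub; the `bfloat16` dtype is my choice to keep memory usage reasonable):

```python
# Sketch: load Falcon-7b and its tokenizer from the HuggingFace Hub.
# NOTE: downloading ~14 GB of weights requires network access and patience.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # half-precision to fit ~7B params in memory
)
```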

With this, HuggingFace will download the weights to `~/.cache/huggingface` and load the model and its tokenizer for further use. It will take a while to download the weights when you first run it.

Before we use the model, let’s take a look at some of its attributes:
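
A snippet along these lines (assuming `model` and `tokenizer` are already loaded as above) produces the output shown:

```python
# Inspect the loaded model and tokenizer objects.
print(f"The type of `model` is: {type(model)}")
print(f"The type of `tokenizer` is: {type(tokenizer)}")
print(f"`model` is currently on device `{model.device}`")
print(f"Number of parameters: {model.num_parameters()}")
```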

The output here should be:

```
The type of `model` is: <class 'transformers.models.falcon.modeling_falcon.FalconForCausalLM'>
The type of `tokenizer` is: <class 'transformers.tokenization_utils_fast.PreTrainedTokenizerFast'>
`model` is currently on device `cuda:0`
Number of parameters: 6921720704
```

As expected, this model has ~7 billion parameters. We can also take a look at the architecture by printing the `model` object:

```
FalconForCausalLM(
(transformer): FalconModel(
(word_embeddings): Embedding(65024, 4544)
(h): ModuleList(
(0-31): 32 x FalconDecoderLayer(
(self_attention): FalconAttention(
(rotary_emb): FalconRotaryEmbedding()
(query_key_value): FalconLinear(in_features=4544, out_features=4672, bias=False)
(dense): FalconLinear(in_features=4544, out_features=4544, bias=False)
(attention_dropout): Dropout(p=0.0, inplace=False)
)
(mlp): FalconMLP(
(dense_h_to_4h): FalconLinear(in_features=4544, out_features=18176, bias=False)
(act): GELU(approximate='none')
(dense_4h_to_h): FalconLinear(in_features=18176, out_features=4544, bias=False)
)
(input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
)
)
(ln_f): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
)
(lm_head): Linear(in_features=4544, out_features=65024, bias=False)
)
```

We are nearly ready to start using the model as a Markov chain – its most fundamental mathematical form! First, we send the model to the GPU so it can run fast:

```
# %% Move to the GPU
# check if cuda is available:
if torch.cuda.is_available():
    device = torch.device("cuda")
    print("There are %d GPU(s) available." % torch.cuda.device_count())
    print("We will use the GPU:", torch.cuda.get_device_name(0))
# check if mps is available
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    print("We will use the MPS GPU:", device)
else:
    device = torch.device("cpu")
    print("No GPU available; falling back to the CPU.")
model = model.to(device)
model.eval()  # get the model ready for inference
```

Let’s apply the LLM’s Markov model \(P_{\theta}\) to the sentence, “I love France. The capital of France is "”. In reality the LLM computes Equation 1 for all \(i = 1, \dots, N\) during the forward pass on tokens \(\{x_i\}_{i=1}^N\).

The first step is to tokenize our input string. Check out the OpenAI tokenizer demo for more information about BPE tokenization:

```
# %% Define the input text, convert it into tokens.
input_text = "I love France. The capital of France is \""
input_ids = tokenizer.encode(input_text, return_tensors="pt").to(device)
print("input_text: ", input_text)
print("input_ids: ", input_ids)
print("input_ids.shape: ", input_ids.shape)
# sanity check: let's decode the input_ids back into text
print("input_ids decoded: ", tokenizer.decode(input_ids[0]))
```

From which we get

```
input_text: I love France. The capital of France is "
input_ids: tensor([[ 52, 1163, 5582, 25, 390, 4236, 275, 5582, 304, 204, 13]],
device='cuda:0')
input_ids.shape: torch.Size([1, 11])
input_ids decoded: I love France. The capital of France is "
```

`input_ids` has shape `[batch, num_tokens]` (`[1, 11]` in this case).

As mentioned above, a forward pass of the transformer produces a next-token prediction \(x_{i+1}\) for ALL \(i = 1, \dots, N\). Let’s take a look for ourselves.

```
# %% Run inference on the tokenized input text.
output = model(input_ids)
print("Output object keys: ", output.keys())
print("Output logits shape: ", output.logits.shape)
```

The output object’s keys look like this:

```
Output object keys: odict_keys(['logits', 'past_key_values'])
Output logits shape: torch.Size([1, 11, 65024])
```

The `past_key_values` are generally used to cache representations that would otherwise be recomputed during iterative generation. The `logits` contain the next-token prediction information we’re currently interested in.

Here the shape `[1, 11, 65024]` corresponds to `[batch, sequence_len, vocabulary_size]`. The sequence length is the same as the number of input tokens, and the `vocabulary_size` is the total number of unique tokens in the model’s vocabulary.

The `logits` from the model output are unnormalized – i.e., they don’t sum to 1. Let’s apply a `softmax()` to get normalized probabilities for each token.

```
# %% Softmax the logits to get probabilities
# index the 0th logit batch (we have batch=1)
probs = torch.nn.functional.softmax(output['logits'][0], dim=-1)
probs = probs.cpu().detach().numpy() # move to the cpu, convert to numpy array
probs.shape # [sequence_len, vocab_size]
# get the probabilities of the next token
next_token_probs = probs[-1,:]
```

Let’s have a look at the probability distribution over next tokens after the input string.

```
# %% Plot the probability distribution over the final token
import matplotlib.pyplot as plt

plt.plot(next_token_probs)
plt.title("Probability Distribution over Final Token")
plt.xlabel("token id")
plt.ylabel("probability")
plt.show()
```

We can now print a ranked list of the highest probability next tokens:

```
# %% Now let's see what the highest probability tokens are.
# First we decode range(vocab_size) to get the string representation
# of each token in the vocabulary.
import numpy as np

vocab_size = tokenizer.vocab_size
vocab = [tokenizer.decode([i]) for i in range(vocab_size)]
# sorted_idx will contain the indices that yield the sorted probabilities
# in descending order.
sorted_idx = np.argsort(next_token_probs)[::-1]
# Print out the top 10 tokens and their probabilities
for i in range(10):
    print(vocab[sorted_idx[i]], "\t\t", probs[-1, sorted_idx[i]], "\t\t", sorted_idx[i])
```

```
Paris 0.6987319 38765
par 0.03665255 1346
The 0.029361937 487
La 0.02517883 4317
Par 0.020812064 5336
the 0.014450773 1410
la 0.008413297 2854
France 0.0075454805 31126
PAR 0.005467169 18562
Paris 0.00530823 6671
```

Looks like the model predicted the correct answer (Paris) with 69% probability!
