aji's Blog

This Could Be a Notebook

2026-05-22T12:00:00Z

I’ve been using Jupyter notebooks for a while now, mostly for the small tasks Jupyter is good at: loading some data, processing that data, plotting some stuff, going back and redoing stuff differently, etc. I like that cells can produce rich outputs, even things like progress bars or interactive plots. I also like being able to include Markdown cells, which feels like it achieves what literate programming was meant to.

I keep trying to think up ways to use notebooks for other things in a way that actually adds value and isn’t just a tech demo, and so far my attempts have been unsuccessful. I got a decent part of the way through setting up Hakyll so I could use .ipynb files as posts on this blog, but the rate at which jank was accumulating made me give up on that idea pretty quickly. (I still think this is practical; it just has some rough spots that need to be worked out first.) I’ve experimented with using notebooks (with both Python and Deno kernels) for simple ops tasks, and I keep feeling like a script would do the job better on too many axes. I do think there’s value in using notebooks as a format for things like analyzing logs, but this is just an ops-flavored version of a task we already know notebooks are good for, with CSVs swapped for JSONL files.

Maybe there’s not actually a problem here, but I think the raw material of the notebook workflow has a lot of potential for ops work. To me, the following points are critical limitations of the status quo:

Cells can be run out of order, multiple times, etc. This is an important part of exploration and experimentation, but makes notebooks pretty ineffective at ensuring a series of steps are executed in a specific order a specific number of times.
Re-running a cell erases the output of previous executions. This means a notebook doesn’t tell you what was actually done, only what was done last.

What I would do differently

I’m imagining a notebook-based tool that enables the following style of automation:

A workflow is generated from a parameterized notebook.
The workflow executes one cell at a time.
The workflow stops on failed cells. The operator can choose to either re-run the cell or cancel the workflow. If retrying a failed cell, the workflow will be marked as having gone off script.
The operator can edit cells, re-run successful cells, or run cells out of order. Any of these will also mark the workflow as having gone off script.
The exact sequence of cells that were run is logged, with timestamps.
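The list above could be sketched roughly as follows. This is only a toy model of the idea, not an existing tool: the Workflow class, its method names, and the choice to model cells as plain Python callables are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class LogEntry:
    timestamp: float
    cell_index: int
    ok: bool
    note: str = ""


@dataclass
class Workflow:
    """Hypothetical restricted-mode runner: one cell at a time, in order."""
    cells: List[Callable[[], object]]
    next_cell: int = 0
    off_script: bool = False
    failed: bool = False
    log: List[LogEntry] = field(default_factory=list)

    def _run(self, index: int, note: str = "") -> bool:
        # Every execution is logged with a timestamp, successful or not.
        try:
            self.cells[index]()
            ok = True
        except Exception as exc:
            ok = False
            note = f"{note} {exc!r}".strip()
        self.log.append(LogEntry(time.time(), index, ok, note))
        return ok

    def step(self) -> bool:
        """Run the next scripted cell; refuses to continue past a failure."""
        if self.failed or self.next_cell >= len(self.cells):
            return False
        if self._run(self.next_cell):
            self.next_cell += 1
            return True
        self.failed = True
        return False

    def retry(self) -> bool:
        """Re-run the failed cell; this marks the workflow as off script."""
        if not self.failed:
            return False
        self.off_script = True
        ok = self._run(self.next_cell, "retry")
        if ok:
            self.failed = False
            self.next_cell += 1
        return ok

    def run_out_of_order(self, index: int) -> bool:
        """Any manual execution also marks the workflow as off script."""
        self.off_script = True
        return self._run(index, "manual")
```

The log is append-only, so re-running a cell adds an entry instead of replacing the previous one, which addresses the "only what was done last" problem described earlier.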

This would be a relatively small addition to current notebook tooling, essentially being a “restricted mode” for notebook execution with some additional logging. It has a number of properties that I think make it an interesting possibility for ops work:

The ability to include rich outputs such as graphs, tables, etc. is useful both during and after workflow execution.
Auth is consolidated, since all actions are performed by the notebook kernel on behalf of the operator. There’s no need to think about SSH keys, API tokens, web UI access, etc. Such a tool could also conceivably allow notebooks to be executed with a particular role, allowing finer grained control over what actions the notebook is allowed to perform.
Logging all cell executions provides both valuable audit information as well as a good starting point for investigating why a workflow went off script and how it might need to be changed.
If .ipynb files can be imported as new workflows, then developing a new workflow is as simple as opening the notebook editor of your choice and getting to work. (Although this workflow tool as described would need to include all the functionality necessary for a notebook editor, and in the context of the aforementioned auth point, it might make sense to write new workflows directly in the tool.)

I’m gonna have a look around and see if there’s anything like this out there.

Hexbot Part 4: Did It Work?

2026-05-15T12:00:00Z

This is the final post in a series of posts about my foray into building a bot to play Hex. The focus of this post is on quantifying the strength of our shiny new bot, plus some miscellaneous followups.

Hexbot Part 1: The 101 Unit
Hexbot Part 2: The Board and Our First Bot
Hexbot Part 3: Using Neural Networks
Hexbot Part 4: Did It Work?

Now that we’ve trained up a neural network to play Hex, it would be nice to know if our effort (and cloud computing bill) have paid off, and there are a number of things we can do to get some insight here.

The first is to simply play against the bot to get a qualitative sense for how well it plays. I’ve played a few games against it and while I’m still kind of a beginner, it’s clearly much stronger than me.¹ I copied the moves between it and the “Impossible” CPU in Clubhouse Games and it won. Both of these were with the model playing as the second player, without the swap rule, and with 2 seconds of thinking time per turn. Neither of these achievements is very impressive though (even the Clubhouse Games “Impossible” difficulty is still pretty weak in the grand scheme of Hex-playing programs) and they don’t tell us much about the bot’s strength quantitatively.²

Benchmarking against MCTS

In part 2 we noted that the strength of MCTS can be configured by changing the number of search iterations, i.e. the number of rollouts. Thus, we might ask how many rollouts are required for MCTS to play at roughly the same strength as our bot. Furthermore, we can check if this number actually increased over the course of training, and at what rate.

As described in part 3, the purpose of the neural network is to augment an MCTS search with a value function and a prior over available moves, but we can actually just use the prior (i.e. the policy) directly, evaluating the current board once and immediately picking the move considered “most likely” instead of searching the tree deeper. By finding the MCTS configuration that wins about half of its games against this type of model-only strategy, we can get a sense for the amount of “MCTS knowledge” contained within a model checkpoint.
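As a minimal sketch of the model-only strategy: take the policy output for the current board and pick the best-scoring cell. The function name, the flat list of 121 scores, and the explicit filtering of occupied cells are assumptions made for illustration, not the repo's API.

```python
def model_only_move(log_policy, occupied):
    """Pick the legal move with the highest policy score, with no search.

    log_policy: one score per cell (e.g. the log_softmax output, 121 values).
    occupied:   set of cell indices that already hold a piece.
    """
    legal = [i for i in range(len(log_policy)) if i not in occupied]
    return max(legal, key=lambda i: log_policy[i])
```

Because log is monotonic, taking the argmax over log-probabilities picks the same move as taking it over probabilities.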

In principle we could run hundreds or thousands of games at different MCTS strengths, but each game can take minutes to run, so we would like to need as few of them as possible since we have 142 checkpoints to evaluate. Instead we can use Bayesian inference on a distribution over a range of MCTS strengths, where the likelihood is given by the win rate predictor we calibrated in part 2. The prior informs the choice of which strength to try next, then we calculate the mean and variance of the posterior distribution after a number of experiments have been run.
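Here is a rough sketch of that kind of inference on a discrete grid. The logistic form of win_prob, its slope, the grid resolution, and the "play at the current posterior mean" rule are all assumptions for illustration; the real analysis uses the predictor calibrated in part 2.

```python
import math


def win_prob(x, theta, slope=2.0):
    """Assumed predictor: P(MCTS at strength x beats a bot of strength theta).

    Strengths are log10 rollout counts; slope is a made-up calibration value.
    """
    return 1.0 / (1.0 + math.exp(-slope * (x - theta)))


def make_grid(lo=0.0, hi=8.0, n=81):
    """Candidate values for the checkpoint's MCTS-equivalent strength."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def update(grid, belief, x, mcts_won):
    """One Bayes update after a game played with MCTS at strength x."""
    post = []
    for theta, p in zip(grid, belief):
        like = win_prob(x, theta)
        post.append(p * (like if mcts_won else 1.0 - like))
    total = sum(post)
    return [p / total for p in post]


def mean_var(grid, belief):
    """Mean and variance of the current belief."""
    mean = sum(t * p for t, p in zip(grid, belief))
    var = sum((t - mean) ** 2 * p for t, p in zip(grid, belief))
    return mean, var


def next_strength(grid, belief):
    """One simple choice of next experiment: play at the current mean."""
    return mean_var(grid, belief)[0]
```

Starting from a uniform belief, each game result reweights the grid, and the posterior from one game serves as the prior for the next.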

The result of this analysis is the following plot:

The X axis is checkpoint index (correlating roughly with training time) and the Y axis is the MCTS strength ($\log_{10}$ of rollout count) required to get a roughly 50/50 win rate against the checkpoint playing in model-only mode. (The dashed line marks the MCTS strength where the amount of time spent thinking is the same for MCTS and a model-only move.) The Y axis stops at 8 because 100,000,000 rollouts is already quite a lot for my laptop to handle and the analysis would take way too long if the MCTS were allowed to think any longer. Clearly the latest checkpoints are much stronger than pure MCTS!

Generalized benchmarking

The same type of analysis from the last section can be extended to any two Hex-playing programs, since we have a single configuration axis we can use for all of them: thinking time. That is, instead of asking how many rollouts are required for pure MCTS to play at the same strength as a model checkpoint, we can ask what ratio of thinking time is required for two programs to play at the same strength.³

With this in mind, an interesting thing we can try is to train a smaller model on the self-play data generated when training a bigger one. A small model needs less time to think for the same number of search iterations, so it might actually need less time to play at the same level.

To test this out, I ran the optimizer for a much smaller model (about 6x smaller) using the same self-play data as generated for the model we’ve been using until now, then benchmarked it against the bigger model trained on the same data. Unsurprisingly, the smaller model plays just as well, and much more quickly! This is the models/model-v0-16-8-64-20260514 file in the repo, which when benchmarked against the v0-16-32-256 model comes in around 10x faster for the same level of gameplay. (I’m training a 12-16-64 model and it’s even stronger. I think 24 hours wasn’t enough for the big model!)

What’s next?

This is the end of this series of blog posts at least. I might write some additional posts if I do anything else interesting that I feel like sharing, but this first iteration is done for now.

As for the project itself, I feel like I’ve mostly gotten what I wanted out of it. I had a lot of fun and learned a lot along the way, and got a pretty decent Hex-playing program out of it too! There are a number of additional directions I might take this project in if I feel up to it someday:

Hinted at in the previous section, there’s still room to optimize thinking time by training smaller models. At some point the model is too small to play well, but it would be interesting to try to find this point.
Every time we do a search, we start with an empty move tree. However, it’s possible to reuse the previous move tree, thus saving some computation, especially for moves that were expected (and thus have well-explored subtrees). A bot can also be programmed to “ponder”, i.e. to think while it’s the opponent’s turn to play. For simplicity I chose not to introduce either of these things when evaluating bot strength, but for the strongest possible play these are important (and straightforward) optimizations.
The big training run only went for 24 hours. While the resulting model maxed out our MCTS benchmark, there’s no reason to believe it’s as strong as it will ever be for that model size. Unfortunately, training is time-consuming and expensive, so I’m not very keen on digging much deeper here unless a bunch of GPUs fall in my lap. (Also, I would want to get CUDA and quantization working before I do that.) Some experimentation here with a 12-16-64 model is producing some good results, and I’m going to keep pushing it until it seems like it’s topping out.
Our model can only play 11x11 Hex without the swap rule, but “real” Hex bots like KataHex can play on a variety of board sizes with or without the swap rule.
We only explored pure MCTS and AlphaZero-style MCTS, but there are many other ways to write a Hex-playing program! Claude Shannon built an analog Hex-playing device that used features of an electric potential field to choose moves. It would be interesting to try some other architectures and to benchmark them against the ones we have.
The UI, bin/play.rs, is not very good! A more full-featured UI, with configurable bot players, time controls, analysis features, etc. would be nice to have around. (Maybe you could benchmark yourself against the AI!)

But I think for now I’m going to take a break. I’m starting to see hexagons whenever I close my eyes.

If you’d like to play against it yourself, you can check out the code at https://github.com/aji/hex-table and run cargo run --release --bin play -F ui,candle -- --model models/model-v0-16-32-256-20260510 but the UI is not very good at the moment :( I’m not sure if I’ll ever get around to making it better, but I hope to!↩︎
A more interesting followup would be to set up some tournaments or leagues with much more established bots like MoHex or KataHex. This is a lot of work to set up though and I might get around to it someday. I would be surprised if my little bot does very well but I won’t know for sure until I try.↩︎
There are a lot of assumptions buried in here. For example, we’re assuming that the relative change in strength due to extra thinking time is roughly the same for all programs, that the strength difference between 2 minutes and 1 minute is the same as the difference between 2ms and 1ms, etc.↩︎

Hexbot Part 3: Using Neural Networks

2026-05-13T12:00:00Z

This is the third post in a series of posts about my foray into building a bot to play Hex. The focus of this post is on how to augment MCTS with a neural network, and in particular the AlphaZero-style strategy for doing so.

Hexbot Part 1: The 101 Unit
Hexbot Part 2: The Board and Our First Bot
Hexbot Part 3: Using Neural Networks
Hexbot Part 4: Did It Work?

Improving MCTS

In part 2 we went over MCTS, which serves as a tunable baseline for how to choose moves in a game of Hex: more rollouts means stronger play at the cost of thinking time, where “strength” grows logarithmically with rollout count.

We can, of course, do a lot better than basic MCTS, though many paths forward incorporate domain knowledge about Hex strategy. In part 1 we already mentioned alpha-beta search, a different search algorithm entirely which relies on a value function, often a hand-crafted one.

MCTS itself can also be improved in a number of ways. The expand step, for example, can be modified to use heavy rollouts where a heuristic is used instead of choosing moves completely at random. The search step can be augmented by calculating a prior over the possible moves, which influences the weight assigned to moves with low visit count.

The AlphaZero architecture, which we will be applying to Hex, takes essentially both of these approaches, using a neural network to augment an MCTS in two ways:

A value network is used instead of a pure rollout, giving each new leaf node a number indicating how favorable the network considers the board state for either player, and these are aggregated up the tree in the same way as rollout outcomes.
A policy network initializes each node with a prior distribution over child nodes, i.e. available moves, which biases the search towards moves the network deems more likely.

The code I used for this type of search is on GitHub, but it’s very similar to the basic MCTS we discussed before, just with a different function to maximize when descending the tree, one that accounts for the values from the policy network.
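For reference, the selection rule from the AlphaGo Zero paper (often called PUCT) looks roughly like this. This is a sketch of the published formula, not the code in the repo; the dict-based child representation and the c_puct value are placeholders.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Mean value plus an exploration bonus weighted by the policy prior."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + bonus


def select_child(children, c_puct=1.5):
    """Return the index of the child maximizing the PUCT score.

    children: list of dicts with keys 'prior', 'visits', 'value_sum'.
    """
    # Using at least 1 here keeps the prior relevant at a fresh node
    # (an implementation choice for this sketch).
    parent_visits = max(1, sum(ch["visits"] for ch in children))

    def score(ch):
        q = ch["value_sum"] / ch["visits"] if ch["visits"] else 0.0
        return puct_score(q, ch["prior"], parent_visits, ch["visits"], c_puct)

    return max(range(len(children)), key=lambda i: score(children[i]))
```

Compared with UCT, the bonus term is scaled by the prior, so moves the policy considers unlikely need stronger value evidence before the search spends time on them.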

Improving the neural network

The question of course is how to train these networks. One strategy is to use expert games as a training set, and this was how AlphaGo was initially trained. However, a key insight from that line of work was that the network-augmented MCTS is an algorithm for taking policy calculations from the model and aggregating them into a better policy. By having the model play games against itself and training the model to predict the improved policy returned by MCTS, the model can improve through self-play. The discovery of AlphaGo Zero is that this idea of improvement through self-play works even when starting from a randomly-initialized network.¹

It may seem a little mysterious that this works at all, and indeed there are a lot of ways for it to go wrong, but the basic idea makes sense. The model is essentially learning to predict what an MCTS search will do, and the self-play is structured to ensure that a diversity of positions are explored and that the problem is computationally feasible by gradually “compressing” recent MCTS effort into the model. The model improves the MCTS, and the MCTS improves the model.
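To make the training targets concrete, here is a small sketch of turning one finished self-play game into training examples. It assumes the AlphaZero-style convention where the policy target is the normalized root visit counts and the value target is +1 or -1 from the point of view of the player to move; the function names and tuple layout are made up for illustration.

```python
def policy_target(visit_counts):
    """Normalize root visit counts into a probability distribution."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]


def make_examples(game, winner):
    """Build (board, policy_target, value_target) triples for one game.

    game:   list of (board, player_to_move, root_visit_counts), one per turn.
    winner: the player who eventually won the game.
    """
    examples = []
    for board, player, counts in game:
        # Every position in the game is labeled with the final outcome.
        value = 1.0 if player == winner else -1.0
        examples.append((board, policy_target(counts), value))
    return examples
```

The policy target is where the search's extra effort gets "compressed" back into the model: the network is asked to output directly what MCTS needed many evaluations to compute.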

The model architecture

The model is fundamentally a convolutional neural network, taking an “image”² of the board and applying a convolution and several residual convolutional layers to it before taking the final image and flattening it to be mapped into a policy and value output.

The model size has three configuration parameters:

conv_channels, the number of channels in the intermediate board images.
conv_layers, the number of residual convolutional layers to apply.
value_hidden, the number of hidden neurons in the value calculation.

In pseudo-Python, the model calculation looks like this:

def evaluate(self, board):
    im_input = self.board_to_image(board)

    im = self.conv_input(im_input)
    for (ker0, ker1) in self.residual_conv_layers:
        im = self.residual_conv_layer(im, ker0, ker1)

    policy = self.policy_head(im)
    value = self.value_head(im)

    return policy, value

def board_to_image(self, board):
    # an 11x11 image with 2 channels, one for red pieces
    # and one for blue pieces where a "pixel" is 1.0 if
    # there is a piece and 0.0 otherwise
    return ...

def conv_input(self, im):
    # conv2d with conv_channels 3x3 filters
    return (
        im
        .hex_conv2d(self.conv_input_kernels)
        .batch_norm()
        .leaky_relu()
    )

def residual_conv_layer(self, im, ker0, ker1):
    # two conv2d steps, each with conv_channels 3x3 filters
    return (
        im
        .hex_conv2d(ker0)
        .batch_norm()
        .leaky_relu()
        .hex_conv2d(ker1)
        .batch_norm()
        .add(im) # skip connection
        .leaky_relu()
    )

def policy_head(self, im):
    # conv2d with two 1x1 filters
    return (
        im
        .hex_conv2d(self.policy_kernels)
        .batch_norm()
        .leaky_relu()
        .flatten()
        .matmul(self.policy_linear) # 242 -> 121
        .log_softmax()
    )

def value_head(self, im):
    # conv2d with one 1x1 filter
    return (
        im
        .hex_conv2d(self.value_kernel)
        .batch_norm()
        .leaky_relu()
        .flatten()
        .matmul(self.value_linear0) # 121 -> value_hidden
        .leaky_relu()
        .matmul(self.value_linear1) # value_hidden -> 1
    )

The actual model code is on GitHub, and is built on Burn, a tensor and autodiff library for Rust. hex_conv2d here refers to a normal convolution but with some of the kernel values set to zero, making e.g. a 3x3 kernel only able to incorporate information from cells that are up to 1 cell away. (I was sort of just winging it when I decided to do this, but I was worried that information being able to travel more quickly in one direction might result in weird behavior.)
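As an illustration of which kernel entries get zeroed, here is a sketch that builds the mask from hex distance. It assumes the skewed board layout from part 2, where the six neighbors of $(r,c)$ are at offsets $(0,\pm 1)$, $(\pm 1,0)$, $(-1,+1)$ and $(+1,-1)$; with the opposite skew the other pair of corners would be masked. The function names are hypothetical.

```python
def hex_kernel_mask(size=3):
    """size x size mask: 1 where the offset is within hex distance size // 2."""
    radius = size // 2
    mask = []
    for dr in range(-radius, radius + 1):
        row = []
        for dc in range(-radius, radius + 1):
            # Hex distance of an offset (dr, dc) in this skewed layout.
            dist = max(abs(dr), abs(dc), abs(dr + dc))
            row.append(1 if dist <= radius else 0)
        mask.append(row)
    return mask


def apply_mask(kernel, mask):
    """Zero out the kernel entries that the mask excludes."""
    return [[k * m for k, m in zip(krow, mrow)]
            for krow, mrow in zip(kernel, mask)]
```

For a 3x3 kernel this keeps 7 of the 9 entries: the center, its six hex neighbors, and nothing from the two corners that are 2 cells away.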

This architecture is lifted more or less directly from the AlphaGo Zero paper, with the aforementioned “hex convolution” change.

Training it

The training procedure at this point is straightforward in concept but full of many little optimization³ problems⁴ and fiddly implementation⁵ details⁶. I’ll go over the main points quickly here but will leave discussion of smaller points for another time.

I wrote some code to set up distributed training with components that interact over HTTP:

A controller, which serves model checkpoints and the self-play log. This was just an HTTP server with a light enough workload to run on my laptop.
A self-play daemon, which periodically downloads the latest checkpoint and generates self-play data. I ran three instances of this in a Google Cloud Run worker pool with NVIDIA L4 GPUs.
An optimizer daemon, which follows the self-play log and runs gradient descent on the latest model checkpoint, periodically uploading a new checkpoint to the controller. I ran this component on my laptop while at my computer and moved it to Cloud Run while away (because running my M4 unattended at high loads makes me nervous).

I trained the model for about 24 hours with the above setup and the following model configuration:

{
  "conv_layers": 16,
  "conv_channels": 32,
  "value_hidden": 256
}

Up next

In the next post, I’ll explain how to decide if the work we’ve done is any good.

This paragraph only describes how the policy network is trained, but the value network is an equally important part of the model. However, its training target is rather straightforward: the value network is simply trained to predict the outcome of the game. An obvious question is why the self-play data shouldn’t use the MCTS-reported value as the training target. I don’t know for sure and imagine there is some reinforcement learning wisdom at play here, since MCTS also takes value calculations and aggregates them into a better value estimate. But it’s worth pointing out that the final outcome is a more “clean” signal that directly reflects the rules, and as a training target it lets the information learned at the end of the game affect how all earlier positions are evaluated, instead of each position only seeing as far ahead as the MCTS was able to get.↩︎
Not a literal image, but I’ll use this term to refer to any two-dimensional array of vectors. In machine learning the two arguments to a convolution are often referred to as the image and the kernel, regardless of whether the image is actually an “image” in the conventional sense. Likewise, the dimensions of the vectors are referred to as “channels”, and I may even refer to an element of the two-dimensional array as a “pixel” sometimes although this is a bit of an abuse of terminology in this context.↩︎
Keeping the GPU busy while doing self-play is an interesting performance challenge. The board image and model outputs need to be sent between the GPU and CPU for the search to work, which introduces a lot of latency, especially for smaller models. The obvious solution is to batch model evaluations by sending multiple board states at once, but basic MCTS only looks at one state at a time. There are strategies for parallelizing MCTS, but these are tricky to implement in Rust since they involve concurrent edits to a data structure. Instead my strategy was to simply run many self-play games concurrently and to send their evaluation requests to another thread which batches them up for execution on the GPU. This resulted in significantly more positions per second per GPU, which means more self-play data generated per second.↩︎
A weird problem I ran into is that the performance of the L4s was not actually very impressive by comparison with my laptop. I assume this is because I was using wgpu in both places, which means the kernels get compiled into Metal shaders or Vulkan shaders depending on the available hardware, and I imagine I would have gotten better performance from the L4s if I had used Burn’s CUDA backend instead, but this seemed like enough of a headache to get working that I punted on it. I would like to come back to this at some point and maybe also look at quantization and other performance tricks, but at the time I mostly wanted to get a single training run taken care of just to see if things were moving in the right direction.↩︎
It turns out that simply installing Vulkan inside a container is not enough to actually be able to use the NVIDIA L4s attached to a Cloud Run instance, because certain configuration files are still missing. This was a little tricky to figure out and came with an overwhelming feeling of being well off the beaten path, and I think it should have been a strong indicator that CUDA would have been the right choice.↩︎
One interesting thing to point out about training via self-play is that the self-play portion turns out to be the most computationally intensive part, and not by a small amount. I used 800 model evaluations per turn, and each turn results in one new example for the self-play training set. Meanwhile each iteration of the optimizer is doing only one model evaluation per example in the minibatch, plus autodiff overhead. It’s difficult to quantify the relative value of generating self-play examples and doing optimizer iterations, but the asymmetric computational requirements should be clear, and we would hope each GPU-hour of optimization would be matched by tens or hundreds of GPU-hours of self-play. Indeed, according to the AlphaZero paper they used 64 second-generation TPUs for optimization and 5000 first-generation TPUs for self-play, a massive FLOPS asymmetry even accounting for the different hardware. Running the optimizer on my laptop felt much more reasonable when considering that I had “only” three L4s to use for generating self-play data.↩︎

Hexbot Part 2: The Board and Our First Bot

2026-05-12T12:00:00Z

This is the second post in a series of posts about my foray into building a bot to play Hex. The focus of this post is on the bitboard implementation and an MCTS-based bot using it, as well as some analysis of the MCTS algorithm’s relative strength for different iteration counts.

Hexbot Part 1: The 101 Unit
Hexbot Part 2: The Board and Our First Bot
Hexbot Part 3: Using Neural Networks
Hexbot Part 4: Did It Work?

The bitboard

The first step in implementing a bot for any board game is to have logic and data structures for the game itself. For Hex, as with any game, there are a variety of approaches that would work here, but for our purposes, and for MCTS especially, we want this code to be as lightweight as possible.

To start, let’s consider how to represent a hexagonal grid on a computer in the first place. Luckily for us, the grid for Hex is a square that’s been skewed into a rhombus. We can simply represent the data in memory as a square and then skew or un-skew as appropriate, e.g. for displaying the board:

AA  AB  AC  AD  AE  AF  AG  AH  AI  AJ  AK
  BA  BB  BC  BD  BE  BF  BG  BH  BI  BJ  BK
    CA  CB  CC  CD  CE  CF  CG  CH  CI  CJ  CK
      DA  DB  DC  DD  DE  DF  DG  DH  DI  DJ  DK
        EA  EB  EC  ED  EE  EF  EG  EH  EI  EJ  EK
          FA  FB  FC  FD  FE  FF  FG  FH  FI  FJ  FK
            GA  GB  GC  GD  GE  GF  GG  GH  GI  GJ  GK
              HA  HB  HC  HD  HE  HF  HG  HH  HI  HJ  HK
                IA  IB  IC  ID  IE  IF  IG  IH  II  IJ  IK
                  JA  JB  JC  JD  JE  JF  JG  JH  JI  JJ  JK
                    KA  KB  KC  KD  KE  KF  KG  KH  KI  KJ  KK

We also define the red player as the one attempting to connect the left and right edges and the blue player as the one attempting to connect the top and bottom edges.

Bitboards are one class of board representation that uses bitmap layers to represent various features of the game. This works nicely for Hex because each of the 121 cells of an 11x11 board can be in one of 3 states. We’ll use two u128s, one for the red pieces and one for the blue pieces:

#[derive(Copy, Clone)]
struct Bitboard {
    pub red: u128,
    pub blue: u128,
}

With some bits left over, we can also choose one of the unused bits to represent the player whose turn it is to move:

const NEXT_MOVE: u128 = 1 << 127;

fn next_move(&self) -> Player {
    match self.red & NEXT_MOVE != 0 {
        true => Player::Red,
        false => Player::Blue,
    }
}

This struct is only 32 bytes, making it very cheap to copy around, and we can even derive a Copy impl for it. With this representation, the 121 least significant bits of red represent cells containing a red piece, and likewise for blue. (Strictly speaking it’s possible for both planes to have a 1 in the same position, but we’ll simply assume this never happens.) A function for mapping a row and column to a cell on the board is rather simple:

fn rc_mask(&self, r: usize, c: usize) -> u128 {
    1 << (120 - r * 11 - c)
}

fn rc(&self, r: usize, c: usize) -> Option<Player> {
    let mask = self.rc_mask(r, c);
    let is_red = (self.red & mask) != 0;
    let is_blue = (self.blue & mask) != 0;
    match (is_red, is_blue) {
        (true, _) => Some(Player::Red),
        (_, true) => Some(Player::Blue),
        _ => None,
    }
}

We can set cells and do other basic manipulations in a similar manner, and anyone familiar with bit manipulation should feel quite comfortable with the ideas expressed so far. Queries such as calculating the number of available moves should also be fairly obvious at this point (look up popcount if you’re stuck).
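For anyone who is stuck, here is the available-moves query sketched in Python rather than Rust (Python integers are unbounded, so the board is masked to 121 bits explicitly). In Rust the counting step would be the built-in count_ones method on u128. The function names are illustrative.

```python
BOARD_MASK = (1 << 121) - 1  # the 121 least significant bits are cells


def empty_bits(red, blue):
    """Bit set of unoccupied cells, ignoring the unused high bits."""
    return ~(red | blue) & BOARD_MASK


def num_available_moves(red, blue):
    """Count the set bits in the empty-cell mask (the popcount operation)."""
    return bin(empty_bits(red, blue)).count("1")
```

Masking with BOARD_MASK also drops the turn-to-move flag stored in the high bit of the red plane, so it is never counted as a cell.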

The more interesting problem in front of us is how to check whether the game is over, i.e. which player has connected the two edges. This brings us to another strength of bitboards: we can represent a translated board state using bit manipulation. In our case, we want to be able to translate the board by one cell in each of the 6 directions. The basic premise for any given direction is to use a mask to select the bits that will move, then use a bit shift to move them to their new locations. This works because the bit offset between the cell at $(r,c)$ and the cell at $(r+dr,c+dc)$ is the same for all such pairs of cells: $dr\times 11 + dc$.
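To spell out the offset claim, take the bit layout used by rc_mask above, where the cell at $(r,c)$ lives at bit index $b(r,c) = 120 - 11r - c$. Then:

$b(r+dr, c+dc) - b(r,c) = -(11\,dr + dc)$

The difference doesn’t depend on $r$ or $c$, so one shift amount works for the whole board. Assuming the skew shown in the board diagram, the six neighbor offsets are $(0,\pm 1)$, $(\pm 1,0)$, $(+1,-1)$ and $(-1,+1)$: moving by $(0,1)$ is a right shift by 1, $(1,0)$ a right shift by 11, and $(1,-1)$ a right shift by 10, while the three opposite directions are left shifts by the same amounts. The mask is needed because a plain shift wraps around the row ends: without it, shifting right by 1 would move the cell at column 10 of one row to column 0 of the next.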

To check whether a player has connected their edges, we start with the cells on one edge, traverse the graph of adjacent cells of their color, then check if the traversal reached any cells on the opposite edge. The function that implements the graph traversal on bit sets is called bb_fill by analogy with flood filling:

fn bb_fill(start: u128, traversable: u128) -> u128 {
    let mut cur = start & traversable;
    loop {
        let mut next = cur;
        next |= (cur & ADJ0_KEEP) >> ADJ0_SHR;
        next |= (cur & ADJ1_KEEP) >> ADJ1_SHR;
        next |= (cur & ADJ2_KEEP) >> ADJ2_SHR;
        next |= (cur & ADJ3_KEEP) << ADJ3_SHL;
        next |= (cur & ADJ4_KEEP) << ADJ4_SHL;
        next |= (cur & ADJ5_KEEP) << ADJ5_SHL;
        next &= traversable;
        if next == cur {
            return cur;
        }
        cur = next;
    }
}

Where ADJn_KEEP and ADJn_SHd are masks and offsets for translations as described above. cur stores the set of currently reachable cells, and the loop body translates this set in all 6 directions and masks it with traversable to calculate the new set of reachable cells. This process is repeated until no new cells are added to cur and returns the full set of cells that are reachable from start through traversable. With this function, checking if a player has won is fairly straightforward:

const RED_START: u128 = ...; // left edge bits
const RED_END: u128 = ...; // right edge bits
const BLUE_START: u128 = ...; // top edge bits
const BLUE_END: u128 = ...; // bottom edge bits

fn win(&self) -> Option<Player> {
    let r = bb_fill(RED_START, self.red) & RED_END;
    let b = bb_fill(BLUE_START, self.blue) & BLUE_END;
    match (r != 0, b != 0) {
        (true, _) => Some(Player::Red),
        (_, true) => Some(Player::Blue),
        _ => None,
    }
}
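To see the whole thing work end to end, here is a self-contained Python sketch of the same flood fill, with the masks and shift amounts generated instead of hard-coded. The neighbor offsets assume the skew shown in the board diagram, and keep_mask, ADJ and red_wins are names made up for this sketch; the actual constants in the repo may be laid out differently.

```python
N = 11


def bit(r, c):
    """Same layout as rc_mask: row 0, column 0 is the highest cell bit."""
    return 1 << (120 - r * N - c)


# The six neighbor offsets (dr, dc) in the skewed layout.
NEIGHBORS = [(0, 1), (0, -1), (1, 0), (-1, 0), (1, -1), (-1, 1)]


def keep_mask(dr, dc):
    """Cells whose (dr, dc) neighbor is still on the board."""
    return sum(bit(r, c) for r in range(N) for c in range(N)
               if 0 <= r + dr < N and 0 <= c + dc < N)


# (mask, shift) pairs; a positive shift means shifting right.
ADJ = [(keep_mask(dr, dc), dr * N + dc) for dr, dc in NEIGHBORS]


def bb_fill(start, traversable):
    """Cells reachable from start by stepping through traversable cells."""
    cur = start & traversable
    while True:
        nxt = cur
        for keep, shift in ADJ:
            moved = cur & keep
            nxt |= (moved >> shift) if shift > 0 else (moved << -shift)
        nxt &= traversable
        if nxt == cur:
            return cur
        cur = nxt


RED_START = sum(bit(r, 0) for r in range(N))    # left edge
RED_END = sum(bit(r, N - 1) for r in range(N))  # right edge


def red_wins(red):
    return (bb_fill(RED_START, red) & RED_END) != 0
```

A full row of red cells connects the edges, while the pair (0, 10) and (1, 0) does not: those two cells are neighbors in bit order but not on the board, which is exactly the wrap-around the masks exist to prevent.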

And for a basic Hex board, that’s really all there is to it! Bitboards are still used for games with much more complex rules (I’ve even written one for Khet), but the simplicity of Hex works tremendously in our favor and makes an efficient bitboard possible with very little code.

The full code for the bitboard module is on GitHub. (The actual code differs from the code in this post in a number of more or less trivial ways, such as “black and white” instead of “red and blue” and bool instead of Player, but the basic idea is the same.)

Monte Carlo tree search

As discussed in part 1, a basic MCTS requires very little domain-specific code beyond what we have in our bitboard, i.e. a way to enumerate valid moves and check if the game has ended. Described briefly, the algorithm is as follows:

An MCTS search tree contains a node for each visited state, where the node’s children represent states reachable by making a valid move from that state. The edges from state $s$ to $a$ track various statistics:

$N(s,a) =$ the number of times $(s, a)$ has been traversed during the search.

$V(s,a) =$ the total value of actions in the subtree rooted at $a$.

Each iteration of an MCTS search proceeds as follows:

Select. Starting from the root, descend the move tree until hitting a node $L$ which hasn’t been seen yet. From a node $s$, choose the child node $a$ which maximizes a function $\mathrm{select}(s, a)$. One popular choice is called Upper Confidence bounds applied to Trees (UCT):

$\mathrm{UCT}(s, a) = \frac{V(s, a)}{N(s, a)} + C \sqrt{\frac{\ln N(s)}{N(s, a)}}$

Expand. Evaluate $L$ with a rollout, playing random moves until the game ends. The winner determines the value $v$ which will be used in the next step.

Backpropagate. Returning to the root of the tree, update $V(s,a) = V(s,a) + v$ and $N(s,a) = N(s,a) + 1$, altering $v$ as appropriate depending on whose turn it is to play from $s$.

Once enough iterations have been performed, the $a$ that maximizes $N(\mathrm{root},a)$ is played.
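The bookkeeping in the steps above can be sketched as follows. This is a generic illustration rather than the repo's implementation: the Node class is hypothetical, values are assumed to be +1 for a win and -1 for a loss so that "altering $v$" becomes a sign flip at each ply, and the exploration constant is arbitrary.

```python
import math


class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}  # move -> Node
        self.visits = 0     # N for the edge leading into this node
        self.value = 0.0    # V, from the view of the player who made that move


def uct(child, parent_visits, c=1.4):
    """Average value plus an exploration bonus for rarely tried moves."""
    if child.visits == 0:
        return math.inf  # always try unvisited moves first
    exploit = child.value / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore


def select_move(node, c=1.4):
    """Select step: the move whose child maximizes UCT."""
    return max(node.children,
               key=lambda m: uct(node.children[m], node.visits, c))


def backpropagate(leaf, v):
    """Walk back to the root, adding v and flipping its sign at each ply."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.value += v
        v = -v
        node = node.parent


def best_move(root):
    """After the search, play the most visited move."""
    return max(root.children, key=lambda m: root.children[m].visits)
```

Picking the most visited move at the end, instead of the one with the highest average value, avoids choosing a move that merely got lucky in a handful of rollouts.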

For a more in-depth explanation of MCTS, there are a variety of helpful resources online and I won’t try to outdo them here. The Wikipedia article and Chess Programming Wiki article provide a good starting point for more information.

The baseline MCTS implementation used in the code more or less follows the algorithm described above, but with one important optimization which is worth pointing out. A naive rollout would play randomly and check the win condition, but for Hex we can make a much faster and much more efficient approximation, once again owing to the simplicity of Hex. We know that every additional move selects an unoccupied cell, and that a completely filled board is guaranteed to be a terminal state. Thus, in a single step, we can perform an MCTS rollout by assigning the empty cells randomly and checking who has won:

fn mcts_rollout(&self) -> Player {
    let empty = self.empty_bits();
    let mask = rand::random::<u128>();
    let red = self.red | (mask & empty);
    match bb_fill(RED_START, red) & RED_END != 0 {
        true => Player::Red,
        false => Player::Blue,
    }
}

This code assigns a random subset of empty cells to red and checks if this is a winning placement of red cells. If not, its complement must be a winning placement of blue cells. Note that this is not strictly correct, since it will likely result in too many or too few cells being assigned to red. It’s equivalent to making random moves in a game where players flip a coin to decide who plays next. However, this difference does not significantly reduce the effectiveness of MCTS, and the amount of computation saved is well worth it. (An experiment worth doing would be to compare the relative strength of “proper” rollouts, but I haven’t taken the time to do this.)

Measuring the relative strength of MCTS

An MCTS search can be stopped at any time for any reason and return a result. This lets us use rollout count (i.e. iteration count) as a way of configuring or measuring the “strength” of an MCTS search. In practice, such as when playing with a clock, the decision of how much computation to spend per move becomes an important strategic one, but a fixed rollout count per move works well as a simple reference point.

What we would like is to model the probability that an MCTS playing as red with $n$ rollouts wins against an MCTS playing as blue with $m$ rollouts. A simple model is to use logistic regression with $\log n - \log m$ as the independent variable. This is essentially the same idea as the Elo rating system, where $\log n$ plays the role of the rating. Logistic regression lets us calibrate our rating system so we can turn rollout count ratios into win probabilities. The choice of $\log n$ instead of simply $n$ or some other function of $n$ is motivated by two important assumptions: $\log n$ correlates with the average depth of the search tree, and the depth of the search tree correlates with strength. An effect of these assumptions is that only the ratio of computational effort matters: thinking 20x as long is assumed to be equally strong no matter the absolute amount of computational effort invested.

By generating pairs $(n, m)$ (sampled so $\log n$ and $\log m$ are uniform and i.i.d.) and running games, we can generate a dataset of outcomes and perform logistic regression to calibrate our rating system. Something like the following turns out to be a decent predictor:

fn p_red_win(red_rollouts: usize, blue_rollouts: usize) -> f64 {
    let red_rank = (red_rollouts as f64).log10();
    let blue_rank = (blue_rollouts as f64).log10();
    let x = red_rank - blue_rank;
    sigmoid(0.3 + 1.8 * x)
}

fn sigmoid(x: f64) -> f64 {
    1.0 / (1.0 + (-x).exp())
}

(The mathematically inclined will notice that I’ve got some terms with an $\mathrm{exp}({\log m - \log n})$ shape here meaning we can simplify to a ratio of powers of $m$ and $n$ but I’ll leave that as an exercise.)
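As a quick sanity check on those coefficients, here are two values of p_red_win worked by hand. Reading the intercept as a first-player advantage is one plausible interpretation of the fit, not something the regression itself establishes.

```latex
% With x = \log_{10} n - \log_{10} m and \sigma(t) = 1 / (1 + e^{-t}):
\begin{align*}
n = m:   \quad x = 0, \quad & \sigma(0.3) = \frac{1}{1 + e^{-0.3}} \approx 0.574 \\
n = 10m: \quad x = 1, \quad & \sigma(0.3 + 1.8) = \frac{1}{1 + e^{-2.1}} \approx 0.891
\end{align*}
```

So with equal rollout counts the model gives red a bit better than even odds, and a 10x rollout advantage for red moves the prediction to roughly 89%.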

The code and notebook for this analysis are on GitHub.

Up next

In the next post, I’ll talk about how we can use an AlphaZero-style neural network to significantly improve the effectiveness of MCTS.

Hexbot Part 1: The 101 Unit

2026-05-10T12:00:00Z

This is the first post in a series of posts about my foray into building a bot to play Hex. The focus of this post is on Hex itself and board game algorithms more generally.

Hexbot Part 1: The 101 Unit
Hexbot Part 2: The Board and Our First Bot
Hexbot Part 3: Using Neural Networks
Hexbot Part 4: Did It Work?

I’ve been vaguely aware of Hex for years, but didn’t really discover it until playing it in Clubhouse Games: 51 Worldwide Classics for Nintendo Switch. I figured out rather quickly that it’s a very strategically interesting game, kind of the checkers to Go’s chess.

The game takes place on a grid of hexagonal tiles arranged in a rhombus. Many sizes are possible, but $11 \times 11$ is popular:

Players take turns placing pieces in unoccupied hexagons racing to be the first to make a path between the edges of their color:

That’s all the rules!¹ And yet despite this simplicity, Hex has a lot of strategic depth. (I won’t go into Hex strategy here. I’m not very good at the game, and there are plenty of resources online for how to play Hex, including the Wikipedia article. However, if you want a taste for it, the best thing to do would be to simply try playing a few games!)

Terminology note: The choice of colors for the first and second player is a matter of convention. I’ve chosen to represent the first and second players as red and blue respectively, and I’ll use this terminology consistently throughout these posts.

The simplicity of Hex gives it a number of interesting properties that are easily shown from the rules.

Games can never end in a draw, since blocking all paths for one color requires making a complete path of the opposite color. This fact was known in 1942 by Piet Hein, the game’s inventor.
It’s known that there exists a strategy for the first player to force a win, although the specific strategy is not yet known. This fact is reflected in the strong first-player advantage in pure Hex, which is commonly corrected using a swap rule. The first written existence proof was given by John Nash in 1952, and board sizes up to 9x9 have been completely solved by computers.

The Wikipedia article and Wolfram MathWorld page for Hex have additional information. Hex has also been written about extensively in an academic context, especially in regards to game-playing AI, and there are plenty of papers and other resources that discuss aspects of Hex in great detail.

Let’s make a bot!

Making bots to play board games has always been an application of AI that fascinates me, so naturally upon encountering Hex I decided I wanted to try making a bot for it. I’ve made bots for board games before, such as for Khet (also known as Laser Chess), and most of what I’ve been doing with Hex has been essentially a further iteration of that work. (The main differences are that the Khet bots were all CPU-based, even the neural network ones, and also that my math and programming skills have grown a bit since then.)

Our ultimate goal will be to build a bot similar to AlphaZero, an MCTS-based search algorithm augmented by a neural network trained on self-play. Owing to the popularity of Hex there are already plenty of very strong bots out there, including prior art for this exact idea: KataHex, a fork of KataGo, is a popular AlphaZero-style bot for Hex. But retracing those steps is educational and fun, and what are board games about if not fun. The resulting implementation is small, hackable, and surprisingly strong.

How a computer plays a board game

Hex, like many abstract strategy games, can be thought of as a tree of board positions where the edges represent moves. The root of the tree is the initial board state, and a game proceeds by descending down the move tree. This tree is enormous, making it infeasible to search exhaustively even up to a fairly shallow depth. Most algorithms still search the move tree, but use heuristics to decide which parts of the tree are worth spending time exploring. Search algorithms vary wildly, from highly complex algorithms incorporating a lot of domain knowledge, to very general ones that need no more than just an understanding of the rules.

A popular search algorithm is alpha-beta search, which traverses the move tree using a number of heuristics in order to reduce the branching factor. One such heuristic is a value function that can assign an approximate “value” for a board state, e.g. for chess by scoring material advantage. Though even simple value functions can result in good gameplay, the strength of the algorithm relies on this function’s accuracy, and necessarily incorporates domain knowledge about the game, e.g. for chess the relative value of the different pieces. Additionally, alpha-beta search performs better when visiting the most promising moves first, called move ordering, which is yet another way its performance relies on domain knowledge. While the sensitivity of alpha-beta search to domain knowledge is not by itself a problem (and indeed, strong chess programs such as Stockfish and Komodo have used this algorithm to great success), it does require the author to have some understanding about the game’s strategy.²

The most general practical search algorithm is arguably a pure Monte Carlo tree search (MCTS), where one only has to be able to take a board state and determine if it’s terminal (win/loss/draw) or enumerate all the valid moves from it. The Wikipedia article provides a good explanation of how it works, but since it forms the foundation of the AlphaZero-based model we will be implementing, it’s worth going over.

Monte Carlo tree search (MCTS)

An MCTS search is iterative, where each iteration adds a new leaf node to a game tree data structure kept in memory. The algorithm chooses where to add the node by descending the tree, choosing edges in a way that balances exploration of under-explored subtrees and exploitation of promising subtrees (from the perspective of which player is to make the given move). Upon descending to where a new leaf node is to be added, the algorithm performs a rollout (also sometimes called a “simulation”) to assign a value (win/loss/draw) to the new node: random moves are played until reaching a terminal state, which becomes the initial value of this node. This value is then propagated back up the tree, with the average value of each subtree being aggregated all the way up to the root.

These iterations can be performed as many times as desired, with each iteration adding to the amount of information available at the root node. When the search is stopped (e.g. by hitting an iteration count or computation time limit) the algorithm plays the move represented by the node with the largest visit count, i.e. the number of times the algorithm descended to that node. (Note that the node with the largest value is not chosen, but the search algorithm will visit high-value nodes many times, so this is a subtle but minor distinction.)

The reason MCTS works at all might not be immediately obvious, since valuing new leaf nodes by playing random moves to the end of the game might seem like it provides very little useful information at all. However, the idea is that each iteration descends the move tree all the way to a single terminal state, using information from previous traversals when available and random moves otherwise, then aggregates this information up the tree for later iterations. In other words each iteration of MCTS is sampling the entire move tree, with each new sample adding more detail.

Up next

In the next post, I’ll be covering the board representation and go more in depth on the baseline MCTS implementation.

¹ An additional rule called the “swap rule” is commonly implemented as well, in order to correct for the rather significant first-player advantage. If playing with the swap rule, then after the first move the second player can choose to swap sides, thus incentivizing the first player to choose a move that is not too strong for either player.
² It’s interesting to think what kind of value function and move ordering strategy could be used for a Hex bot based on alpha-beta search. Hex strategy revolves essentially entirely around positional judgement, and concepts like “material advantage” don’t exist at all. There are alpha-beta implementations for Hex and I’ll leave it as a homework assignment to go learn about what value functions they use.

Undefined Behavior

2024-07-11T12:00:00Z

There’s a lot of confusion among programmers about what C’s “undefined behavior”, or “UB” as it is commonly abbreviated, is for, and why C compilers are allowed to assume UB never happens. This post won’t talk about why the spec contains UB, but will attempt to shed some light on what from my perspective is a rather confusing aspect of how compilers work with it.

A particularly tricky aspect of UB for the typical C programmer is that it sometimes causes the compiler to do counterintuitive things, even to seemingly unrelated parts of the program. It’s a confusing moment at first when disabling optimizations changes the visible effects of your program. If UB represents some kind of abstract “out-of-spec” error condition, why is the compiler allowed to change the behavior of statements leading up to it? There is a very reasonable explanation you could give here, how the spec covers the meaning of whole programs and not just individual statements, but I think it’s much easier understood via time travel.

The unreachable() macro is one way for a program to explicitly invoke undefined behavior. Because the behavior is undefined, the implementation of unreachable() could be a program that sends a robot back in time to prevent the code from running, then destroys the entire world so that none of us are alive in a timeline where unreachable() finishes execution. A final printf("Undefined behavior can *never* occur.\n"); is the last thing the universe sees.

This solution may of course have some retroactive effects that at first seem unrelated. The time traveling robot may choose not to simply terminate the program right before it enters unreachable(), and may instead prevent the program from being started in the first place. It might even delete the code that was compiled to produce the program, leaving a cryptic commit message before deactivating itself in a car crusher. It might go even further back in time to prevent the programmer’s parents from meeting. An oversight in the robot’s programming might even cause it to be overzealous in its interpretation of its mission, as it points the time machine to “1971, Bell Labs.” Technically, the specification prohibits none of these things.

Pending the development of time travel, though, foresight will have to do. Timelines containing an invocation of UB can be safely ignored, and while compilers are not mandated to prevent its occurrence, or even informed that they should assume it never occurs, it’s a useful model for what a conforming compiler is allowed to do in the remaining timelines. The futures where the optimizer breaks its neck doing assembly parkour are not futures we expect to find ourselves in. Or at least, their behavior is undefined.

Fast Dice

2024-07-01T12:00:00Z

I got nerd sniped by a fun JavaScript performance puzzle recently, having to do with efficiently calculating the probability of a dice-based Bernoulli trial, for the purpose of a game I’m working on. It goes like this:

The player works to collect sets of up to 4 dice, which can be d4s, d6s, or d10s.
The player then chooses up to 10 sets of dice from their collection and rolls them.
If at least 3 of the sets rolled have a total of 10 or more, the player advances to the next round. All rolled sets are removed from the player’s collection, regardless of whether they advance or not.

It’s helpful to be able to calculate the exact probabilities involved when balancing the game, in addition to playtesting normally. Furthermore, having the exact probabilities allows game design elements to reflect the chance of success (e.g. showing things in a different color) in a way that gives the player some useful information while still leaving a bit of uncertainty. However, with the way this process is designed, calculating the probabilities presents a few challenges:

Sets can have any combination of dice, e.g. 4d6, 2d4+d10, 3d6+d10, etc.
Sets of dice are tested based on their sum.
The player advances if at least 3 sets score 10 or more, but has the option to play more than 3 if they think it would help their chances enough to be worth it.

So, to get this problem out of the way and avoid coming up with any tricky equations, I went with the most general solution I know: convolutions of probability distributions. (This formulation is equivalent to multiplying probability generating functions, since we are dealing with discrete random variables.)

Background: Too convoluted?

The calculation works in two passes:

For each set of dice, generate a CDF of the dice total by reducing the PMFs of the dice with convolutions and doing a cumulative sum. The CDF at 9 represents the Bernoulli parameter for whether the set fails to reach a score of 10.
For each Bernoulli parameter p calculated in the previous step, generate a PMF [p, 1-p] to represent a random variable that has a value of 1 if the test succeeds and 0 otherwise. Reduce these by convolution and do a cumulative sum to get the CDF for the sum of these scores. The CDF at 2 represents the Bernoulli parameter for whether fewer than 3 sets succeeded.

If you’re not familiar with convolution, the 3Blue1Brown video about it is an excellent introduction. However, if you’re in a hurry, you can think of it as a multiplication of two polynomials, where the i-th element of each input is the coefficient of xⁱ. The fact that the probability distribution of a sum of random variables is a convolution of their individual distributions is extremely useful for numerically calculating probabilities where simpler analytic solutions are out of reach, and is the foundation of the approach outlined above.
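To see the polynomial view in action, here is a small sketch that multiplies two coefficient arrays and uses it to count the ways 2d4 can land on each total. The name `polyMultiply` is made up for this illustration, and it works on integer outcome counts rather than probabilities to keep the numbers exact.

```javascript
// Multiply two polynomials given as coefficient arrays, where index i holds
// the coefficient of x^i. This is exactly a (full) convolution of a and b.
function polyMultiply(a, b) {
  const out = new Array(a.length + b.length - 1).fill(0);
  for (let i = 0; i < a.length; i++) {
    for (let j = 0; j < b.length; j++) {
      // x^i * x^j contributes to the x^(i+j) term.
      out[i + j] += a[i] * b[j];
    }
  }
  return out;
}

// Outcome counts for one d4: no way to roll 0, one way each to roll 1 to 4.
// As a polynomial this is x + x^2 + x^3 + x^4.
const oneD4 = [0, 1, 1, 1, 1];

// Ways to reach each total with 2d4 (index = total).
const twoD4 = polyMultiply(oneD4, oneD4);
console.log(twoD4); // [0, 0, 1, 2, 3, 4, 3, 2, 1]
```

Dividing each entry by 16 (the number of equally likely outcomes) turns the counts into the PMF of the 2d4 total.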

Attempt 1: Just write the code

These days JavaScript VMs are very fast, so I generally try not to overthink things unless there is a clear need for it. Naturally I wrote some code that looked like this:

function range(n) {
  return [...Array(n).keys()];
}
function sum(a) {
  return a.reduce((x, y) => x + y, 0);
}
function convolve(a, b) {
  return range(a.length + b.length - 1).map((i) =>
    sum(range(b.length).map((j) => (a[i - j] ?? 0) * b[j]))
  );
}
function cumsum(a) {
  let sum = 0;
  return a.map((x) => (sum += x));
}
function dicepmf(n) {
  return range(n + 1).map((i) => (1 <= i && i <= n ? 1 / n : 0));
}

function pSetFail(ds) {
  const cdf = cumsum(ds.map((n) => dicepmf(n)).reduce(convolve, [1]));
  return cdf[9] ?? 1;
}
function pSetsFail(sets) {
  const pdf = sets
    .map(pSetFail)
    .map((p) => [p, 1 - p])
    .reduce(convolve, [1]);
  const cdf = cumsum(pdf);
  return cdf[2] ?? 1;
}

const sets = [
  [4, 6, 6],
  [4, 6, 6],
  [4, 6, 6],
];
console.log(1 - pSetsFail(sets));

If we run the above, we get 0.125, which is what we would expect, since the odds of d4+2d6 totaling 10 or more is exactly 50%. In other words, we’re flipping a coin 3 times and trying to get HHH.
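The 50% figure is easy to confirm independently with a brute-force count over all outcomes of d4+2d6. This is only a cross-check sketch, separate from the convolution code:

```javascript
// Enumerate every outcome of d4 + d6 + d6 and count the ones totaling 10+.
let hits = 0;
let total = 0;
for (let a = 1; a <= 4; a++) {
  for (let b = 1; b <= 6; b++) {
    for (let c = 1; c <= 6; c++) {
      total++;
      if (a + b + c >= 10) hits++;
    }
  }
}
console.log(hits, total); // 72 144
```

The totals range from 3 to 16 and are distributed symmetrically around 9.5, which is why exactly half of the 144 outcomes land on 10 or more.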

So this solution works and has the benefit of being concise and readable, but how fast is it? Naive convolution is O(n²), so let’s time its worst case in our context, 10 attempts of 4d10:

const worstcase = range(10).map(() => [10, 10, 10, 10]);

const start = performance.now();
for (let i = 0; i < 1000; i++) pSetsFail(worstcase);
const end = performance.now();

console.log((end - start) / 1000);

This comes out to about 1.4ms on my machine in Node.js, which is definitely usable if you only need it occasionally. Okay, problem solved, let’s move on to the next thing, we’ve got a game to build.

…But it does seem a bit slow, doesn’t it? We’re only dealing with 40 dice here, and only up to d10. What if we want to offer d20s or d100s? We’re just doing multiplications and additions, so surely we can do better, right? We should be able to call this 100 times a frame if we want!

Attempt 2: Loops

The convolve function we wrote is definitely a big part of the problem, and you don’t need a profiler to figure that out. It’s two loops, where the inner loop is generating an array just to calculate its sum. We’re also deliberately looking up keys that don’t exist and using the nullish coalescing operator to convert them to 0s, instead of explicitly checking the index. Furthermore, we’re calculating up to 41 elements of each PMF when we only need the first 10. We can omit those without changing the result. There’s a lot of room for some cheap improvements and we owe it to ourselves to at least try, don’t we?

In particular, we suspect the following things might help:

Replace most calls to range(), map(), and reduce() with explicit loops and mutation.
Replace the nullish coalescing operator with an explicit bounds check.
Truncate the PMFs.

When we make the above changes, the code looks like this:

function convolve(a, b, limit) {
  const n = Math.min(limit, a.length + b.length - 1);
  const out = [];
  for (let i = 0; i < n; i++) {
    out[i] = 0;
    for (let j = 0; j < b.length; j++) {
      const k = i - j;
      if (0 <= k && k < a.length) {
        out[i] += a[k] * b[j];
      }
    }
  }
  return out;
}
function cumsum(a) {
  for (let i = 1; i < a.length; i++) {
    a[i] += a[i - 1];
  }
  return a;
}
function dicepmf(n) {
  const out = [0];
  for (let i = 1; i <= n; i++) {
    out[i] = 1 / n;
  }
  return out;
}

function pSetFail(ds) {
  let pdf = [1];
  for (const n of ds) {
    pdf = convolve(pdf, dicepmf(n), 10);
  }
  const cdf = cumsum(pdf);
  return cdf.length < 10 ? 1 : cdf[9];
}
function pSetsFail(sets) {
  let pdf = [1];
  for (const ds of sets) {
    const p = pSetFail(ds);
    pdf = convolve(pdf, [p, 1 - p], 3);
  }
  const cdf = cumsum(pdf);
  return cdf.length < 3 ? 1 : cdf[2];
}

We’ve sacrificed a bit of conciseness perhaps but this is still quite readable. After quickly double-checking that our first example still prints 0.125, we time it and find that each calculation of the worst case input only takes around 15 microseconds. That’s a 90x improvement!

Of course, we should probably identify how much of an impact each of the above changes had, so here’s rough timing figures for each one individually:

Loops and mutation: 340us, 4.0x improvement
Bounds checks: 1.0ms, 1.3x improvement
PMF truncation: 670us, 2.0x improvement

Loops definitely seem to have the biggest impact by themselves… but where is the 90x improvement coming from when taken together? Microbenching a language like JavaScript in the particular way I’m doing isn’t an exact science but there is definitely something fishy going on. Let’s try the other direction and remove each optimization from the fastest solution to see which one results in the biggest slowdown:

No loops and mutation: 410us, 28x slowdown
No bounds checks: 290us, 20x slowdown
No PMF truncation: 39us, 2.6x slowdown

Bizarre! Loops once again seem to be responsible for the biggest improvement, but the bounds checks are an impressive factor as well. I double checked the code to make sure I didn’t get something wrong here, but using explicit bounds checks does seem to be responsible for a noticeable improvement. I don’t know enough about V8 to know why this would be the case, but it’s an interesting thing to keep in mind when trying to write JIT-friendly code I suppose. Maybe I’ll do a deep dive someday to figure out why this happens.

But okay, 15 microseconds is pretty dang fast, and that’s a worst case! If we use the 3 sets of d4+2d6 input we’ve been using for validation, we get speeds closer to 2.5 microseconds. So we’re done, right?

Right??

Attempt 3: Insanity

Our calculation has a very predictable shape: ten times do 10-element convolutions of four sequences and sum their entries, then do 3-element convolutions of ten sequences and sum those. What if we just… hard-coded this? No loops, minimal branches. How fast would it be?

I won’t bore you with the details and will just show you the code I came up with. I’m not going to use this code, of course. It just felt like a fun puzzle. Here’s the function that calculates the same result as pSetFail() above, with slightly different parameters:

function pSetFail(d0, d1, d2, d3) {
  let w0, w1, w2, w3, w4, w5, w6, w7, w8, w9;
  let x0, x1, x2, x3, x4, x5, x6, x7, x8, x9;
  let y0, y1, y2, y3, y4, y5, y6, y7, y8, y9;
  let z0, z1, z2, z3, z4, z5, z6, z7, z8, z9;
  let m;

  w0 = d0 === 0 ? 1 : 0;
  w1 = d0  >= 1 ? 1 : 0;
  w2 = d0  >= 2 ? 1 : 0;
  w3 = d0  >= 3 ? 1 : 0;
  w4 = d0  >= 4 ? 1 : 0;
  w5 = d0  >= 5 ? 1 : 0;
  w6 = d0  >= 6 ? 1 : 0;
  w7 = d0  >= 7 ? 1 : 0;
  w8 = d0  >= 8 ? 1 : 0;
  w9 = d0  >= 9 ? 1 : 0;

  m = 1;
  m *= d0 === 0 ? 1 : d0;
  m *= d1 === 0 ? 1 : d1;
  m *= d2 === 0 ? 1 : d2;
  m *= d3 === 0 ? 1 : d3;

  switch (d1) {
    case 0:
      x0=w0;       x1=w1;       x2=w2;       x3=w3;       x4=w4;
      x5=w5;       x6=w6;       x7=w7;       x8=w8;       x9=w9;       break;
    case 4:
      x0=0;        x1=w0+x0;    x2=w1+x1;    x3=w2+x2;    x4=w3+x3;
      x5=w4+x4-w0; x6=w5+x5-w1; x7=w6+x6-w2; x8=w7+x7-w3; x9=w8+x8-w4; break;
    case 6:
      x0=0;        x1=w0+x0;    x2=w1+x1;    x3=w2+x2;    x4=w3+x3;
      x5=w4+x4;    x6=w5+x5;    x7=w6+x6-w0; x8=w7+x7-w1; x9=w8+x8-w2; break;
    case 10:
      x0=0;        x1=w0+x0;    x2=w1+x1;    x3=w2+x2;    x4=w3+x3;
      x5=w4+x4;    x6=w5+x5;    x7=w6+x6;    x8=w7+x7;    x9=w8+x8;    break;
  }

  switch (d2) {
    case 0:
      y0=x0;       y1=x1;       y2=x2;       y3=x3;       y4=x4;
      y5=x5;       y6=x6;       y7=x7;       y8=x8;       y9=x9;       break;
    case 4:
      y0=0;        y1=x0+y0;    y2=x1+y1;    y3=x2+y2;    y4=x3+y3;
      y5=x4+y4-x0; y6=x5+y5-x1; y7=x6+y6-x2; y8=x7+y7-x3; y9=x8+y8-x4; break;
    case 6:
      y0=0;        y1=x0+y0;    y2=x1+y1;    y3=x2+y2;    y4=x3+y3;
      y5=x4+y4;    y6=x5+y5;    y7=x6+y6-x0; y8=x7+y7-x1; y9=x8+y8-x2; break;
    case 10:
      y0=0;        y1=x0+y0;    y2=x1+y1;    y3=x2+y2;    y4=x3+y3;
      y5=x4+y4;    y6=x5+y5;    y7=x6+y6;    y8=x7+y7;    y9=x8+y8;    break;
  }

  switch (d3) {
    case 0:
      z0=y0;       z1=y1;       z2=y2;       z3=y3;       z4=y4;
      z5=y5;       z6=y6;       z7=y7;       z8=y8;       z9=y9;       break;
    case 4:
      z0=0;        z1=y0+z0;    z2=y1+z1;    z3=y2+z2;    z4=y3+z3;
      z5=y4+z4-y0; z6=y5+z5-y1; z7=y6+z6-y2; z8=y7+z7-y3; z9=y8+z8-y4; break;
    case 6:
      z0=0;        z1=y0+z0;    z2=y1+z1;    z3=y2+z2;    z4=y3+z3;
      z5=y4+z4;    z6=y5+z5;    z7=y6+z6-y0; z8=y7+z7-y1; z9=y8+z8-y2; break;
    case 10:
      z0=0;        z1=y0+z0;    z2=y1+z1;    z3=y2+z2;    z4=y3+z3;
      z5=y4+z4;    z6=y5+z5;    z7=y6+z6;    z8=y7+z7;    z9=y8+z8;    break;
  }

  return (z0+z1+z2+z3+z4+z5+z6+z7+z8+z9)/m;
}

And here’s the function that computes the same result as pSetsFail() above, again with slightly different parameters:

function pSetsFail(ds) {
  const p0 = pSetFail(ds[ 0], ds[ 1], ds[ 2], ds[ 3]);
  const p1 = pSetFail(ds[ 4], ds[ 5], ds[ 6], ds[ 7]);
  const p2 = pSetFail(ds[ 8], ds[ 9], ds[10], ds[11]);
  const p3 = pSetFail(ds[12], ds[13], ds[14], ds[15]);
  const p4 = pSetFail(ds[16], ds[17], ds[18], ds[19]);
  const p5 = pSetFail(ds[20], ds[21], ds[22], ds[23]);
  const p6 = pSetFail(ds[24], ds[25], ds[26], ds[27]);
  const p7 = pSetFail(ds[28], ds[29], ds[30], ds[31]);
  const p8 = pSetFail(ds[32], ds[33], ds[34], ds[35]);
  const p9 = pSetFail(ds[36], ds[37], ds[38], ds[39]);

  const a1 =          1-p0;

  const b0 = p1*p0;
  const b1 = p1*a1 + (1-p1)*p0;
  const b2 =         (1-p1)*a1;

  const c0 = p2*b0;
  const c1 = p2*b1 + (1-p2)*b0;
  const c2 = p2*b2 + (1-p2)*b1;

  const d0 = p3*c0;
  const d1 = p3*c1 + (1-p3)*c0;
  const d2 = p3*c2 + (1-p3)*c1;

  const e0 = p4*d0;
  const e1 = p4*d1 + (1-p4)*d0;
  const e2 = p4*d2 + (1-p4)*d1;

  const f0 = p5*e0;
  const f1 = p5*e1 + (1-p5)*e0;
  const f2 = p5*e2 + (1-p5)*e1;

  const g0 = p6*f0;
  const g1 = p6*f1 + (1-p6)*f0;
  const g2 = p6*f2 + (1-p6)*f1;

  const h0 = p7*g0;
  const h1 = p7*g1 + (1-p7)*g0;
  const h2 = p7*g2 + (1-p7)*g1;

  const i0 = p8*h0;
  const i1 = p8*h1 + (1-p8)*h0;
  const i2 = p8*h2 + (1-p8)*h1;

  const j0 = p9*i0;
  const j1 = p9*i1 + (1-p9)*i0;
  const j2 = p9*i2 + (1-p9)*i1;

  return j0+j1+j2;
}

When running this on the worst case input with the same benchmarking strategy as before, I get times of 400 nanoseconds.

Why is it so much faster? Well, I can’t be 100% sure without digging deeper, but my guess is this is very JIT-friendly code, and that the CPU code it generates is very cache-friendly. pSetFail is also all integer sums and products until the return statement, which probably helps a bit. Maybe with a bit more digging it could be made even faster. But I think 400ns is pretty fast. That’s a 3000x improvement from where we started. Not bad!

By the way, if you look very closely at pSetFail, you might notice that it looks a lot like the ultimate brute force solution: counting up the possible ways to get different sums.
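For comparison, here is a loop-based sketch of the same counting idea. `addDie` is a hypothetical helper written for this illustration, not code from the game; it applies the sliding-window update that each switch case above spells out by hand.

```javascript
// Hypothetical helper: fold one more die into a truncated table of outcome
// counts. counts[i] is the number of ways the dice so far can total exactly
// i, for 0 <= i < limit. Assumes counts.length >= limit.
function addDie(counts, sides, limit) {
  const out = new Array(limit).fill(0);
  for (let i = 1; i < limit; i++) {
    // out[i] should equal counts[i-1] + counts[i-2] + ... + counts[i-sides].
    // Slide the window from out[i-1]: add the entry entering on the right...
    out[i] = out[i - 1] + counts[i - 1];
    // ...and drop the entry leaving on the left, once the window is full.
    if (i - 1 - sides >= 0) out[i] -= counts[i - 1 - sides];
  }
  return out;
}

// Start with "no dice rolled yet": exactly one way to total 0.
let counts = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0];
for (const sides of [4, 6, 6]) {
  counts = addDie(counts, sides, 10);
}

// Outcomes of d4+2d6 that total 9 or less, out of 4 * 6 * 6 = 144.
const failing = counts.reduce((x, y) => x + y, 0);
console.log(failing, failing / 144); // 72 0.5
```

Everything stays an integer until the final division, which mirrors the `/m` in the return statement of the hard-coded version.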

The future?

We’re still using a naive convolution algorithm, one essentially based directly on the definition. For small sequences this works well enough, but for larger sequences this becomes prohibitive. This type of convolution comes up in signal processing a lot, where you might want to compute the convolution of two sequences that have tens of thousands of elements. In those scenarios you can take advantage of the convolution theorem, which lets you multiply the Fourier transforms of the inputs point-wise and do an inverse Fourier transform to get the same result. This is also the basis of some fast integer multiplication algorithms.
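The convolution theorem itself can be demonstrated in a few lines, with one big caveat: the sketch below uses a naive O(n²) discrete Fourier transform, so it is not faster than direct convolution. It only shows that transforming, multiplying point-wise, and transforming back reproduces the same result; a real speedup would need an FFT in place of `dft`. Both function names are made up for this example.

```javascript
// Naive discrete Fourier transform of a complex sequence given as separate
// real and imaginary arrays. This is O(n^2); an FFT computes the same values
// in O(n log n), which is where an actual speedup would come from.
function dft(re, im, inverse) {
  const n = re.length;
  const sign = inverse ? 1 : -1;
  const outRe = new Array(n).fill(0);
  const outIm = new Array(n).fill(0);
  for (let k = 0; k < n; k++) {
    for (let t = 0; t < n; t++) {
      const angle = (sign * 2 * Math.PI * k * t) / n;
      const c = Math.cos(angle);
      const s = Math.sin(angle);
      // Complex product (re + i*im) * (c + i*s).
      outRe[k] += re[t] * c - im[t] * s;
      outIm[k] += re[t] * s + im[t] * c;
    }
    if (inverse) {
      // The inverse transform carries a 1/n normalization.
      outRe[k] /= n;
      outIm[k] /= n;
    }
  }
  return [outRe, outIm];
}

function convolveViaDft(a, b) {
  // Zero-pad to the full output length so the circular convolution computed
  // by the DFT does not wrap around.
  const n = a.length + b.length - 1;
  const pad = (x) => x.concat(new Array(n - x.length).fill(0));
  const zeros = new Array(n).fill(0);
  const [aRe, aIm] = dft(pad(a), zeros, false);
  const [bRe, bIm] = dft(pad(b), zeros, false);
  // Point-wise complex multiplication in the frequency domain.
  const prodRe = aRe.map((_, k) => aRe[k] * bRe[k] - aIm[k] * bIm[k]);
  const prodIm = aRe.map((_, k) => aRe[k] * bIm[k] + aIm[k] * bRe[k]);
  // The imaginary part of the inverse is only rounding noise for real inputs.
  const [outRe] = dft(prodRe, prodIm, true);
  return outRe;
}

// Outcome counts for a d4, convolved with itself to get the 2d4 counts.
const d4Counts = [0, 1, 1, 1, 1];
const viaDft = convolveViaDft(d4Counts, d4Counts);
console.log(viaDft.map((x) => x.toFixed(3)));
```

The printed values should match the direct 2d4 counts (1, 2, 3, 4, 3, 2, 1 for totals 2 through 8) up to floating-point error, which is also a reminder that this route trades exact integer arithmetic for approximate results.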

Are our inputs too small to benefit from this knowledge? Probably. But I can’t help but be a little curious. If I ever decide to give it a shot, I’ll be sure to write about it.

Reader Mode: It Just Works

2024-01-29T12:00:00Z

I’m the kind of internet user for whom the Firefox Reader View feature is only occasionally useful. Most mainstream browsers have something like it these days. I’m sure there are people out there who can’t imagine using the internet without it. For me, it’s a lot like having a unique screwdriver that’s a better choice for some jobs than resorting to a suitably-sized flathead and hoping for the best: most of the time you can’t use it, but when you can, you’re really glad to have it.

I’ve often wondered how exactly this feature works. This is partially out of curiosity. Sometimes it gets things slightly wrong, which, sadly, makes me reluctant to trust it. Its shortcomings suggest something really tricky going on under the hood. I can’t help but wonder where in the pinball machine of tags and CSS class names that particular bit of authorial intent got lost. Sometimes you notice the little Reader View icon in the address bar for pages that can’t plausibly look decent that way, or sometimes it’s missing from pages that by all accounts should have it. But in addition to rubbernecking and taking cheap shots at a brave attempt to address a difficult problem, there’s also a more cooperative motivation: how can I structure the stuff I put online in a way that will reliably look good in Reader View, and reader modes in general?

This is, sadly, a poorly-documented topic. Resources consist mainly of Stack Overflow questions, questions on the Webmasters Stack Exchange, threads on Mozilla’s support forums, now-deleted blog posts, etc. all of which link to each other. The advice seems to boil down to “It applies a bunch of heuristics that work pretty well if your HTML is good.”

Why isn’t it standardized yet?

After some digging I found this excellent series of 4 articles by Daniel Aleksandersen from 2018 about features like Reader View that discusses their history, the failed attempts at standardization, an overview of how these features work and how they differ across implementations, and a plea to finally standardize the dang thing. The fact that Daniel is writing from 2018 is a little disheartening. A lot of what he writes still feels very relatable. It’s a relatively simple problem, at least compared to what’s typical for the web, so how could we not have made any progress on it in almost 6 years?

An uncharitable reading of the situation might conclude that Reader View and features like it are useless, something completely forgotten by browser vendors and content authors alike, the kind of problem that standardization alone can’t solve. But a quick scan of the issues and pull requests for Readability.js (the tool used for implementing the Firefox Reader View) suggests that there are at least a few people who want the feature to be good, and some of them even work at Mozilla. From another perspective, the scattered and wishy-washy advice for how to cooperate with Reader View is a testament to the feature’s overall reliability, an indication that people are content to accept the situation as-is. If it’s not broken, why fix it? How broken is it actually?

To Mozilla’s credit, they, like all browser vendors, are up against some pretty steep odds. These days, most of the HTML on the internet is written with desktop and mobile users in mind, and only occasionally with screen readers and dark modes in mind as well. For everything else, you just have to make do with what you’ve got. A well-known fact about standards is that their mere existence doesn’t mean much on its own, and this plays out on the web with discouraging regularity. Many of the web’s standards, such as CSS media queries, semantic HTML, ARIA definitions, and even private standards like AMP, feel hopelessly optimistic when pitted against the chaotic and laissez-faire reality of web content. In a world where quirks mode exists, is it anything but a waste of time to give people the option to ditch the separate “print view” and instead do it all in one page with HTML and CSS? Maybe some problems are doomed by human nature to remain unsolved forever, and a reliable reader mode is one of them.

Unfortunately, the incentives are rarely aligned between browser vendors and web content authors (although the situation is improving), and vendors are significantly outnumbered. There are only a handful of browsers, even including those with just a small fraction of the total market, and they are motivated to be secure, robust, and compliant with all the latest standards. Content authors, however, vary wildly in their goals, and not all of them will necessarily care (or have the time and resources to care) about things like whether the page works on a phone, whether “Print to PDF” looks good, whether CSS grid would have been a better choice than <table> (what year is it?), whether repeated &nbsp; is the right way to do indentation, etc. If any of these things creates a problem for users, realistically it’s up to the browser vendors to do something about it (usually something hacky) since you can’t pin your hopes of capturing market share on content authors doing the right thing: if CNN looks good in every browser but yours, that’s your problem, even if it’s really CNN’s fault.

Denial

And this is why reader modes are interesting to me. The incentives are aligned here. Browser vendors (with the possible exception of Google) want them to work well, users want them to work well, and, clearly, there are more than a few web content authors that want them to work well for their sites. Furthermore, the idea seems pretty aligned with accessibility. So why does the implementation still feel like a hack? Why has nobody tried to standardize this? At this point it’s clear that reader modes are here to stay, so why not try to make them good? How is this different from AMP, which for a brief time was everywhere even though people hated it? How is this different from the Open Graph protocol? How is this different from Sitemaps or WebP or Flexbox?

Anger

The main difference? A standard won’t help. Not that much, anyway. The web community finds itself in a situation which is all too familiar to software engineers: things are good enough, and the problems aren’t a big deal. The only changes to reader modes that anyone feels are worth the time and energy are the small, incremental ones that gradually improve the situation, and widespread adoption of a standard is simply not one of those things. Ultimately, most of the people who would play nice with a standardized reader mode are already reciting the incantations they got from Stack Overflow.

Bargaining

Furthermore, for the small number of people that want a good reader mode experience for their website but can’t make it work, a standard won’t necessarily help them. Standards can be poorly thought out, can be caught off guard by broader changes, can be merely a restatement of an existing implementation, can specify things that never get implemented, etc. Standards can be bad too, and fixing a broken standard is not an easy task. You’re much more likely to get the extraction for your page fixed by submitting an issue to a GitHub repo (or fixing it yourself, maybe) than by going through the process of having the standard amended in a way that everyone feels will fix all extraction everywhere for all time.

Depression

Cynically, a standard could even make things worse. The whole point of a reader mode is to reduce clutter, and a standard would only provide a jumping-off point for innovations in content treachery. The effectiveness of things like search rankings, ad blockers, and tracking prevention relies on website authors not knowing how they work, or not caring enough to know. Perhaps it’s a good thing that reader modes, an obscure, subtly-broken, and poorly-documented feature, function similarly by pure chance. I can’t imagine any ad-supported websites would be particularly enthusiastic to help browsers show users just what they came for, with no ads, no links to other articles, no comments sections, no social media buttons. Imagine the HTML crimes they’d do to sneak them back in. Imagine the cat-and-mouse game that would ensue. Imagine the ways they’d meddle in the standardization process.

Acceptance

This is not to suggest that a lack of a standard is a good thing, or that it’s a purposeful effort by browser vendors to thwart those who would ruin it for everybody. But I don’t think the lack of a standard is a bad thing either. Standards are nice to have, but they aren’t strictly necessary, and it seems difficult to create a standard that would benefit users for whom the current implementation is inadequate, but not those for whom a large population of reader mode users would be valuable clickbait targets. Perhaps a solution is possible, but it doesn’t seem trivial, either technically or socially, and it really isn’t such a big deal. The reader mode we’ve got isn’t perfect, but it works. It’ll be our little secret.