<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2025-10-26T04:30:07+00:00</updated><id>/feed.xml</id><title type="html">William Thurston’s Blog</title><subtitle>My blog where I write about tech leadership and 15+ years building and scaling foundational platforms in Cloud and AI.</subtitle><entry><title type="html">Language Modeling with Hierarchical Reasoning Models: Lessons from 1M Parameters</title><link href="/ml/language-models/transformers/2025/10/25/language-modeling-with-hierarchical-reasoning-models.html" rel="alternate" type="text/html" title="Language Modeling with Hierarchical Reasoning Models: Lessons from 1M Parameters" /><published>2025-10-25T07:00:00+00:00</published><updated>2025-10-25T07:00:00+00:00</updated><id>/ml/language-models/transformers/2025/10/25/language-modeling-with-hierarchical-reasoning-models</id><content type="html" xml:base="/ml/language-models/transformers/2025/10/25/language-modeling-with-hierarchical-reasoning-models.html"><![CDATA[<p>The <a href="https://arxiv.org/pdf/2506.21734">Hierarchical Reasoning Models (HRM) paper</a> made waves by demonstrating strong performance on ARC-AGI reasoning tasks with very few parameters, and the <a href="https://arxiv.org/pdf/2510.04871">Tiny Recursive Models (TRM) paper</a> pushed it further, setting higher benchmarks with just 7M parameters. The architecture uses hierarchical reasoning through recursion of transformer layers—effectively running the same parameters through multiple forward passes to reach impressive small model performance.</p>

<p>But HRM and TRM were designed for puzzles, not language. I wanted to answer a simple question: can these hierarchical reasoning models generate coherent text at extremely small scales?</p>

<p>Spoiler: Yes, but with some important caveats.</p>

<h2 id="what-i-built">What I Built</h2>

<p>I adapted TRM for autoregressive language modeling and trained several variants on <a href="https://huggingface.co/datasets/roneneldan/TinyStories">TinyStories</a>—a dataset of simple children’s stories. All models were trained at roughly 1M parameters to keep experiments manageable.</p>

<blockquote>
  <p>I decided to run all of these experiments on my MacBook Pro. It’s slow, but workable at very tiny scales, and it let me have some fun with model research on a machine I already owned. If you check out the <a href="https://github.com/jhspaybar/TinyRecursiveModels">GitHub repository</a>, you should be able to reproduce what I did here in a few hours on a newer MacBook Pro. I plan to keep pushing the limits of my laptop and writing about my experiences.</p>
</blockquote>

<p><strong>The experiments:</strong></p>

<ul>
  <li><strong>Transformer baseline</strong>: A standard transformer, included mainly as a point of comparison</li>
  <li><strong>TRM with dense layers</strong>: Various recursion configurations with dense, wide layers</li>
  <li><strong>TRM with sparse layers</strong>: Various recursion configurations with a sparse MoE at a matched parameter count</li>
  <li><strong>Recursion with carry</strong>: For both dense and sparse configurations, allow 4 iterations of refinement on the carry</li>
</ul>

<p>Each model saw ~2.1M training sequences of length 512 over 31,250 steps with identical hyperparameters (lr=0.01, batch_size=64, warmup + cosine decay).</p>
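
<p>A minimal sketch of a warmup + cosine decay schedule, using the peak learning rate and step count quoted above. The warmup length and the decay floor are assumptions made for illustration; the post does not state them, and the repository may use different values.</p>

```python
import math

PEAK_LR = 0.01        # from the post
TOTAL_STEPS = 31_250  # from the post
WARMUP_STEPS = 1_000  # assumption: not stated in the post
MIN_LR = 0.0          # assumption: decay all the way to zero


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (PEAK_LR - MIN_LR) * cosine
```

<p>The learning rate climbs linearly during warmup, peaks at 0.01, and then follows half a cosine wave down to the floor by the final step.</p>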

<p>For TRM with sparse layers, I implemented the approach described in <a href="https://arxiv.org/pdf/2405.16039">MoEUT</a>, which shows that MoE models with recursion can match dense models without recursion at an equal parameter count. The claims largely hold up inside the TRM harness at small scales.</p>

<p>Several pieces of research, including the MoEUT paper, show that 2 layers are required for transformer generalization, and the MoEUT paper found that a group size of 2 with additional recursion worked better than larger groups of layers with less recursion. Following this and similar findings, I trained all TRM models with just 2 layers and recursion. This meant a wider-than-standard expansion for the dense models, and many experts for the MoE variants.</p>
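
<p>To make the trade-off concrete, here is a rough sketch of the feed-forward parameter arithmetic behind "wide dense layers" versus "many experts". The width, expansion factor, and expert sizes are hypothetical numbers chosen so the totals match; they are not the configurations used in the experiments. Biases and router weights are ignored.</p>

```python
from dataclasses import dataclass


@dataclass
class FFNBudget:
    total_params: int   # weights stored in the layer
    active_params: int  # weights used for a single token


def dense_ffn(d_model: int, expansion: int) -> FFNBudget:
    # Two projections (up and down); every weight is used for every token.
    params = 2 * d_model * (expansion * d_model)
    return FFNBudget(params, params)


def moe_ffn(d_model: int, n_experts: int, top_k: int, d_expert: int) -> FFNBudget:
    # Each expert is a small two-projection FFN; only top_k run per token.
    per_expert = 2 * d_model * d_expert
    return FFNBudget(n_experts * per_expert, top_k * per_expert)


def effective_depth(n_layers: int, recursion_steps: int) -> int:
    # With shared weights, depth grows with recursion but parameters do not.
    return n_layers * recursion_steps


d_model = 64  # hypothetical width
dense = dense_ffn(d_model, expansion=8)
# Split the same hidden units across 8 experts, 2 of which are active per token.
moe = moe_ffn(d_model, n_experts=8, top_k=2, d_expert=(8 * d_model) // 8)
```

<p>With these numbers both layers store the same number of weights, while the MoE layer touches only a quarter of them per token. Recursion is what buys the extra depth without adding parameters.</p>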

<h2 id="training-results-simpler-is-better">Training Results: Simpler is Better</h2>

<p>At 1M parameters on a language modeling task, architectural complexity doesn’t help:</p>

<p><strong>Final loss rankings (best to worst):</strong></p>

<ol>
  <li>Transformer baseline</li>
  <li>(Tied) TRM sparse MoE with compute-similar recursion</li>
  <li>(Tied) TRM with dense + wide layers</li>
  <li>Carry refinement variants</li>
</ol>

<p>The transformer baseline achieved the lowest training loss, though all of the “bests” were quite close and resulted in usable models. Deep recursion and carry refinement added overhead and compute without improving perplexity or generation quality compared to the baseline. Spending more compute did still improve results relative to other TRM variants, but none reached baseline transformer performance.</p>

<h2 id="dense-trm-models-generate-coherent-text">Dense TRM Models Generate Coherent Text</h2>

<p><strong>Sample from an H-cycle=3, L-cycle=6 training run without carry refinement (greedy, temp=0.0):</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: "Once upon a time, there was a little girl named Lily"

Output: ". She loved to play outside in the park. One day, she
saw a big, red ball on the ground. She wanted to play with it,
but it was too high for her.

Lily's mom saw her struggling and asked, "What's wrong, sweetie?"

"I can't reach the ball, but I can't reach it," her mom said.

Lily was sad, but she knew she had to help the ball. She went
to her mom and said, "I can help you, mommy. I can help you."

Her mom was happy to help and they went back to the park
together. They played on the swing and had lots of fun. When it
was time to go home, they went home and had a happy day at the
park. The end."
</code></pre></div></div>

<blockquote>
  <p>But don’t get too excited! Remember, the transformer baselines reach slightly lower loss at a similar size on this dataset. The TinyStories dataset is known to allow reasonable generations at very low parameter counts, and all of the models and variants here produce similar output.</p>
</blockquote>

<p>Temperature and top-k sampling work as expected, producing more diverse outputs at higher temperatures.</p>

<p><strong>Sample from same model (temp=0.5, top_k=20):</strong></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prompt: "The dog ran very fast to"

Output: " the park. He was so excited to see the other kids. He
ran around the park and saw a big, red ball. He wanted to play
with it, but he was too small. He was too small and he couldn't
reach it. He tried to get it, but he couldn't. He was so sad.
He started to cry..."
</code></pre></div></div>
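
<p>The two decoding settings used for the samples above can be sketched as follows. This is a generic illustration of greedy, temperature, and top-k sampling over a list of logits, not the repository's generation code; the function name and signature are my own.</p>

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, rng=random):
    """Pick a token id from raw logits.

    temperature == 0.0 is treated as greedy decoding (argmax).
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])

    # Keep only the top_k highest-scoring token ids (all of them if top_k is None).
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    kept = ranked[:top_k] if top_k else ranked

    # Softmax over the kept logits after temperature scaling.
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]
```

<p>Lower temperatures sharpen the distribution toward the highest-scoring token, and top-k removes the long tail of unlikely tokens before sampling.</p>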

<h2 id="moeut-at-1m-parameters">MoEUT at 1M Parameters</h2>

<p>I implemented MoEUT following the <a href="https://arxiv.org/pdf/2405.16039">original paper</a>, with one modification: I added <a href="https://arxiv.org/pdf/2408.15664v1">loss-free bias balancing</a> as described in the DeepSeek paper. The configuration:</p>

<ul>
  <li>Sigmoid-based expert routing</li>
  <li>Loss-free bias balancing</li>
  <li>8 FFN experts with k=2 active</li>
  <li>2 attention experts with k=1 active</li>
</ul>

<p>Training was stable. Loss-free bias balancing kept expert loads balanced without auxiliary losses. Compared to the dense TRM models, both training loss and generation quality were comparable. This is expected from the MoEUT paper, which shows that MoE models can match dense models at an equal parameter count by employing recursion as described in <a href="https://arxiv.org/pdf/1807.03819">Universal Transformers</a>.</p>
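
<p>As a rough illustration of how sigmoid routing and loss-free bias balancing fit together, here is a sketch for a single token, based on my reading of the linked balancing paper. The function names and the <code class="language-plaintext highlighter-rouge">update_rate</code> value are assumptions for illustration, and this is not the repository's implementation.</p>

```python
import math


def route(logits, bias, top_k):
    """Return (selected expert ids, gate weights) for one token.

    The bias only influences which experts are selected; the gate
    weights come from the unbiased sigmoid scores.
    """
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    order = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = order[:top_k]
    return chosen, [scores[e] for e in chosen]


def update_bias(bias, load, update_rate=0.01):
    """Nudge the bias up for underloaded experts and down for overloaded ones."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, n in zip(bias, load):
        if n > mean_load:
            new_bias.append(b - update_rate)   # overloaded: make selection less likely
        elif mean_load > n:
            new_bias.append(b + update_rate)   # underloaded: make selection more likely
        else:
            new_bias.append(b)
    return new_bias
```

<p>Because the bias is adjusted outside of the loss, balancing does not add a gradient term that competes with the language modeling objective.</p>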

<p><strong>Implementation note on Apple Silicon:</strong> My MoEUT implementation isn’t truly compute-equal to the dense models due to a hardware-specific constraint. On Apple Silicon (MPS backend), sparse tensor operations caused unbounded memory growth during training—likely due to memory fragmentation. To work around this, I had to compute dense projections for all experts and then filter to the top-k, rather than computing only the selected experts. This means MoEUT variants used more compute than they theoretically should. This limitation is specific to the MPS backend and likely wouldn’t occur on CUDA hardware with better sparse operation support.</p>
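
<p>The workaround described above can be sketched with plain Python callables standing in for experts. This is a toy illustration of the idea only (a real model would operate on tensors), and both function names are hypothetical.</p>

```python
def moe_dense_then_mask(x, experts, chosen, gates):
    """Evaluate every expert on x, then keep only the routed ones.

    x       : input value (a float here, a tensor in a real model)
    experts : list of callables, one per expert
    chosen  : ids of the top-k experts for this token
    gates   : gate weight for each chosen expert
    """
    all_outputs = [expert(x) for expert in experts]   # cost scales with len(experts)
    gate_of = dict(zip(chosen, gates))
    # Experts that were not routed get a gate of 0.0 and drop out of the sum.
    return sum(gate_of.get(e, 0.0) * out for e, out in enumerate(all_outputs))


def moe_sparse(x, experts, chosen, gates):
    """Reference version: only the routed experts are evaluated."""
    return sum(g * experts[e](x) for e, g in zip(chosen, gates))  # cost scales with top-k
```

<p>Both functions return the same value; the dense version simply pays for every expert on every token, which is why the MoEUT runs here used more compute than a truly sparse implementation would.</p>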

<p>The MoEUT paper demonstrated advantages at larger scales and would be an interesting path for future work, particularly on hardware that can efficiently handle sparse computations.</p>

<h2 id="carry-refinement">Carry Refinement</h2>

<p>HRM and TRM allow the model to refine previous outputs, improving quality with more compute at inference time. In theory, this should help with difficult tokens.</p>

<p>In practice, at 1M parameters on TinyStories, carry refinement variants consistently lagged behind fully connected recursion in training loss. The carry refinement mechanism added complexity and compute without improving generation quality.</p>
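
<p>For readers unfamiliar with the mechanism, here is a toy sketch of recursion with H-cycles and L-cycles, plus a carry that is passed from one refinement step to the next. It loosely follows the update pattern described in the HRM and TRM papers, with scalars instead of tensors. <code class="language-plaintext highlighter-rouge">toy_block</code> is a made-up stand-in for the shared transformer block, and none of this is the repository's code.</p>

```python
def recurse(x, z_high, z_low, block, h_cycles, l_cycles):
    """One recursive pass: the same `block` is reused for every update."""
    for _ in range(h_cycles):
        for _ in range(l_cycles):
            z_low = block(z_low, z_high + x)   # fast, low-level updates
        z_high = block(z_high, z_low)          # slow, high-level update
    return z_high, z_low


def refine_with_carry(x, block, h_cycles, l_cycles, refinement_steps):
    """Repeat the recursive pass, carrying the latent state between steps."""
    z_high, z_low = 0.0, 0.0                   # initial carry
    history = []
    for _ in range(refinement_steps):
        z_high, z_low = recurse(x, z_high, z_low, block, h_cycles, l_cycles)
        history.append(z_high)                 # a real model would decode logits here
    return history


def toy_block(state, context):
    # Hypothetical stand-in for the shared block: a simple contraction.
    return 0.5 * state + 0.25 * context


history = refine_with_carry(3.0, toy_block, h_cycles=3, l_cycles=6, refinement_steps=4)
```

<p>In this toy, each refinement step starts from the previous step's state and moves the output closer to a fixed point. Whether that extra movement is useful depends on the task, which is the question the experiments above were probing.</p>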

<p>This doesn’t mean carry refinement is flawed—just that for simple narrative generation at small scale, the overhead isn’t justified.</p>

<h2 id="key-takeaways">Key Takeaways</h2>

<p><strong>What I observed:</strong></p>

<p>The hierarchical reasoning mechanisms that excel at puzzles don’t automatically transfer to language modeling—at least not on simple narrative tasks at small scales. The recursion that helps TRM “think through” an ARC-AGI puzzle doesn’t provide comparable benefits when generating children’s stories.</p>

<h2 id="similar-work">Similar Work</h2>

<p>There are a number of papers about reasoning in <a href="https://arxiv.org/pdf/2412.06769">latent space</a> or <a href="https://arxiv.org/pdf/2502.05171">scaling up test compute dynamically</a>. Hierarchical reasoning models appear to operate similarly. When generating stories, it was possible to run more or fewer refinements than the models were trained with and receive different outputs. Anecdotally, some story generations appeared slightly improved when running 4 carry refinement steps or additional L-cycles at inference. It’s possible I was just looking for this result, but it likely deserves follow-up on larger models to further study latent space reasoning on language modeling tasks.</p>

<h2 id="future-work">Future Work</h2>

<p>Several important questions remain unanswered:</p>

<p><strong>Scale:</strong> Does MoEUT show advantages at 40M+ parameters where the original paper demonstrated benefits?</p>

<p><strong>Task complexity:</strong> Would recursive models excel on reasoning-heavy language tasks like GSM8K or other mathematical/logical problems where deeper computation might matter?</p>

<p>I focused on TinyStories because it’s fast to train and well-understood. But hierarchical reasoning might shine on tasks that actually require… reasoning. That’s the obvious next experiment.</p>

<h2 id="conclusion">Conclusion</h2>

<p>Hierarchical Reasoning Models can generate text. At small scales on simple tasks, they’re not better than standard transformers, but they’re not really worse either (when configured correctly). They seem quite similar to universal transformers, and are likely relevant to research into latent space reasoning. At least at the scales I tested, architectural sophistication needs task-appropriate justification. Deep recursion and carry refinement are tools, not magic. They excel where they’re needed and add overhead where they’re not.</p>

<hr />

<p>Code and configs available at: <a href="https://github.com/jhspaybar/TinyRecursiveModels">github.com/jhspaybar/TinyRecursiveModels</a></p>]]></content><author><name></name></author><category term="ml" /><category term="language-models" /><category term="transformers" /><summary type="html"><![CDATA[The Hierarchical Reasoning Models (HRM) paper made waves by demonstrating strong performance on ARC-AGI reasoning tasks with very few parameters, and the Tiny Recursive Models (TRM) paper pushed it further, setting higher benchmarks with just 7M parameters. The architecture uses hierarchical reasoning through recursion of transformer layers—effectively running the same parameters through multiple forward passes to reach impressive small model performance.]]></summary></entry><entry><title type="html">Create Custom AWS ECS Schedulers With ecs_state</title><link href="/aws/ecs/docker/2015/08/20/create-custom-aws-ecs-schedulers-with-ecs-state.html" rel="alternate" type="text/html" title="Create Custom AWS ECS Schedulers With ecs_state" /><published>2015-08-20T07:00:00+00:00</published><updated>2015-08-20T07:00:00+00:00</updated><id>/aws/ecs/docker/2015/08/20/create-custom-aws-ecs-schedulers-with-ecs-state</id><content type="html" xml:base="/aws/ecs/docker/2015/08/20/create-custom-aws-ecs-schedulers-with-ecs-state.html"><![CDATA[<p>In the last couple years Docker and other container technologies have seen a lot of interest and adoption. They provide a simple interface and API for creating self contained applications that once built run pretty much anywhere. Conceptually, a container isn’t much different from a static binary or a super jar. It’s a bundle of files and configurations necessary to run one or more processes, though as implemented today does provide a fair bit of isolation necessary to prevent accidental interference between applications. 
With this ease of packaging and running applications has come an increase in the speed with which developers expect to move both when creating software and deploying it. This desire to move faster in production has led to a number of cluster management systems focused on deploying Docker containers.</p>

<p>Before we dive deeper, if you’re just interested in seeing the code, head over to GitHub and check out <a href="https://github.com/jhspaybar/ecs_state">ecs_state</a>.</p>

<p>While there are many cluster management systems on the market today, I’ll be focusing on EC2 Container Service (ECS) from AWS. I’m very familiar with it, as I was a founding member of the team and wrote large portions of the cluster management and task placement systems. It is a “shared state” cluster management system informed by the learnings shared by Google in their <a href="https://research.google.com/pubs/archive/41684.pdf">Omega paper</a>. The core concept of a system built in this manner is that the cluster manager will share what is happening in the cluster with anyone or anything that asks. In ECS, this means that anyone or anything that calls the List and Describe APIs can inspect what is happening not just with their own resources, but with all resources available in the cluster. Armed with this information, a decision can then be made about how to modify this state, either by starting or stopping a Docker container somewhere in the cluster. A human can perform these actions, though it is time consuming; a machine performing them is referred to as a “scheduler”. The scheduler reads in data about cluster state, and then stops and starts Docker containers based on a set of rules. In the case where multiple schedulers race for the same resources, the cluster manager must choose a winner and reject the other request for placement in the cluster.</p>
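
<p>The race between schedulers can be illustrated with a small in-memory model. This is a toy sketch, not the ECS API: the class, the method names, and the placement rule are all hypothetical, and only CPU is tracked.</p>

```python
class ClusterManager:
    """Toy shared-state manager: anyone can read state; placements are validated."""

    def __init__(self, cpu_by_instance):
        self._free_cpu = dict(cpu_by_instance)

    def describe(self):
        # Every scheduler sees the full cluster state, not just its own tasks.
        return dict(self._free_cpu)

    def start_task(self, instance_id, cpu):
        # The manager is the arbiter: a request based on stale state is rejected.
        if self._free_cpu.get(instance_id, 0) < cpu:
            return False
        self._free_cpu[instance_id] -= cpu
        return True


def pick_instance(snapshot, cpu):
    """Scheduler rule (hypothetical): first instance, by id, with enough free CPU."""
    for instance_id in sorted(snapshot):
        if snapshot[instance_id] >= cpu:
            return instance_id
    return None
```

<p>If two schedulers take a snapshot at the same moment and both pick the same instance, the first <code class="language-plaintext highlighter-rouge">start_task</code> call succeeds and the second is rejected. The losing scheduler then refreshes its view of the cluster and tries again.</p>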

<p>The most fundamental way to place a task in an ECS cluster is to use the List and Describe APIs, apply some logic, and then call StartTask, which takes as arguments a TaskDefinition (a manifest containing the Docker image to run, the resources to use, and other configuration) and the identifier of the machine where the task should be started. Because direct placement is allowed, even applications with complex requirements that may only be capable of running on a very specific machine can pick that spot through this process. Many applications, however, don’t need this form of control, and ECS provides us with some concrete examples of what a scheduler might look like for these applications. In ECS, there is an API called <code class="language-plaintext highlighter-rouge">RunTask</code> which is described in the <a href="http://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html">documentation</a> as “Start a task using random placement and the default Amazon ECS scheduler.” When calling this API, we can see that it returns quickly with details about the task that was placed, or provides us with an error like <code class="language-plaintext highlighter-rouge">reason: RESOURCE:CPU</code>, meaning it could not find a location for the task because the CPU resource constraint could not be satisfied. This is currently a simple scheduler that inspects the state of the cluster, finds which machines in the cluster can accept the task placement, and then randomly places up to 10 tasks on the available machines. It’s simple, but for a quick job like a build task or image processing, the location of placement may not matter as much as just quickly finding somewhere for the task to run.</p>
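
<p>The random placement behaviour described above might look roughly like this. It is an in-memory sketch under simplifying assumptions (only CPU and memory are tracked, and the data layout is invented); it does not call the real ECS API, and the real scheduler's internals are not public.</p>

```python
import random

MAX_TASKS_PER_CALL = 10  # the post notes placement of up to 10 tasks


def run_task(instances, cpu, memory, count=1, rng=random):
    """Toy random placement.

    instances: dict of instance id -> {"cpu": free units, "memory": free MiB}
    Returns (placements, failures); each failure carries a reason string
    shaped like the one quoted above.
    """
    if count > MAX_TASKS_PER_CALL:
        raise ValueError("count exceeds the per-call limit")
    placements, failures = [], []
    for _ in range(count):
        fits_cpu = [i for i, r in instances.items() if r["cpu"] >= cpu]
        fits_all = [i for i in fits_cpu if instances[i]["memory"] >= memory]
        if not fits_all:
            reason = "RESOURCE:CPU" if not fits_cpu else "RESOURCE:MEMORY"
            failures.append({"reason": reason})
            continue
        chosen = rng.choice(fits_all)          # any machine that fits is acceptable
        instances[chosen]["cpu"] -= cpu
        instances[chosen]["memory"] -= memory
        placements.append(chosen)
    return placements, failures
```

<p>Filtering first and choosing randomly afterwards keeps the scheduler fast and stateless, which suits short-lived jobs where the exact location does not matter.</p>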

<p>Another scheduler that is part of the ECS APIs is the Service Scheduler. It is built to run long-running tasks within the cluster, restart them if they stop (possibly because the machine they’re running on crashed), and optionally manage the lifecycle necessary to place those tasks behind an Elastic Load Balancer. It even allows for rolling zero-downtime updates to your containers while properly draining connections from the load balancers. There’s really no magic involved though, and nothing is happening that couldn’t be done by a user external to ECS. When starting a service, ECS expects a role to be provided which allows for describing and registering machines behind an Elastic Load Balancer. Currently, if you start a service with a load balancer and have CloudTrail logs enabled, you can see the <a href="http://docs.aws.amazon.com/ElasticLoadBalancing/latest/APIReference/API_DescribeInstanceHealth.html">DescribeInstanceHealth</a> and <a href="http://docs.aws.amazon.com/ElasticLoadBalancing/latest/APIReference/API_RegisterInstancesWithLoadBalancer.html">RegisterInstancesWithLoadBalancer</a> calls being made by the ECS scheduler as tasks are started and stopped. Just as with the <code class="language-plaintext highlighter-rouge">RunTask</code> API, the Service Scheduler is inspecting cluster state and then making decisions about where and when to start and stop tasks. The logic is more complex, but it is still just List, Describe, Start, and Stop.</p>
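
<p>A service scheduler of this kind is essentially a reconciliation loop. The sketch below shows one pass of such a loop over in-memory data. The action names and data shapes are hypothetical, and load balancer registration is reduced to a single "deregister" action for tasks whose machine has gone away.</p>

```python
def reconcile(desired_count, running_tasks, healthy_instances):
    """Return the actions a toy service scheduler would take on one pass.

    running_tasks     : dict of task id -> instance id
    healthy_instances : set of instance ids currently available
    """
    actions = []
    # Tasks on instances that disappeared are treated as stopped.
    alive = {t: i for t, i in running_tasks.items() if i in healthy_instances}
    for task_id in sorted(set(running_tasks) - set(alive)):
        actions.append(("deregister", task_id))

    missing = desired_count - len(alive)
    surplus = len(alive) - desired_count
    if missing > 0:
        # Placement is left open here; a real scheduler would pick a machine.
        actions.extend(("start", None) for _ in range(missing))
    elif surplus > 0:
        for task_id in sorted(alive)[:surplus]:
            actions.append(("stop", task_id))
    return actions
```

<p>Running this on a timer, with fresh List and Describe results each time, is enough to keep a fixed number of tasks alive across machine failures.</p>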

<p>One of the benefits of this shared state model is that each of these schedulers can be developed and released independently. In previous types of cluster management systems, the state was often stored in a single location which would then add more and more specialized logic, culminating in a giant ball of spaghetti that was slow to modify as new customer requests arrived. One drawback, however, is that the logic to inspect and query state is now duplicated in many separate schedulers. In order to reduce this effort I’m releasing <a href="https://github.com/jhspaybar/ecs_state">ecs_state</a>, a small Go library that uses the ECS List and Describe APIs to store information about running tasks and available resources in memory in SQLite. There is a set of APIs to allow control over when to refresh state, as well as an API to search for machines with the resources available to accept a task. Further logic and filtering can then be applied in memory before finally calling the <code class="language-plaintext highlighter-rouge">StartTask</code> or <code class="language-plaintext highlighter-rouge">StopTask</code> APIs. The ECS forums have seen quite a few requests for schedulers that run a task once on every machine, or run tasks at specific times like cron. I’m hoping that with a little bit of a head start, these schedulers and others will become simpler and quicker to create.</p>]]></content><author><name></name></author><category term="aws" /><category term="ecs" /><category term="docker" /><summary type="html"><![CDATA[In the last couple years Docker and other container technologies have seen a lot of interest and adoption. They provide a simple interface and API for creating self contained applications that once built run pretty much anywhere. Conceptually, a container isn’t much different from a static binary or a super jar. 
It’s a bundle of files and configurations necessary to run one or more processes, though as implemented today does provide a fair bit of isolation necessary to prevent accidental interference between applications. With this ease of packaging and running applications has come an increase in the speed with which developers expect to move both when creating software and deploying it. This desire to move faster in production has led to a number of cluster management systems focused on deploying Docker containers.]]></summary></entry></feed>