The Hierarchical Reasoning Models (HRM) paper made waves by demonstrating strong performance on ARC-AGI reasoning tasks with very few parameters, and the Tiny Recursive Models (TRM) paper pushed the idea further, setting higher benchmarks with just 7M parameters. The architecture gets its depth from recursion: the same transformer layers are run through multiple forward passes, which lets a very small parameter budget reach impressive performance.

But HRM and TRM were designed for puzzles, not language. I wanted to answer a simple question: can these hierarchical reasoning models generate coherent text at extremely small scales?

Spoiler: Yes, but with some important caveats.

What I Built

I adapted TRM for autoregressive language modeling and trained several variants on TinyStories—a dataset of simple children’s stories. All models were trained at roughly 1M parameters to keep experiments manageable.

I decided to run all of these experiments on my MacBook Pro. It’s slow, but workable at very tiny scales, and it let me have some fun with model research on a machine I already owned. If you check out the GitHub repo, you should be able to reproduce what I did here in a few hours on a newer MacBook Pro. I plan to keep pushing the limits of my laptop and writing about my experiences.

The experiments:

  • Transformer baseline: A standard transformer, included mostly as a comparison point
  • TRM with dense layers: Various recursion configurations with dense + wide layers
  • TRM with sparse layers: Various recursion configurations with a sparse MoE at parameter equality
  • Recursion with carry: For both dense and sparse configurations, allowing a 4-iteration refinement of the carry

Each model saw ~2.1M training sequences of length 512 over 31,250 steps with identical hyperparameters (lr=0.01, batch_size=64, warmup + cosine decay).
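
For reference, here is a minimal sketch of that schedule in PyTorch (the warmup length and the optimizer choice are illustrative assumptions, not values read from the repo):

    import math
    import torch

    STEPS, WARMUP = 31_250, 1_000   # total steps from the post; warmup length is assumed
    BASE_LR, BATCH_SIZE, SEQ_LEN = 0.01, 64, 512

    model = torch.nn.Linear(256, 256)  # stand-in for the ~1M-parameter model
    opt = torch.optim.AdamW(model.parameters(), lr=BASE_LR)

    def lr_lambda(step: int) -> float:
        # Linear warmup, then cosine decay to zero over the remaining steps.
        if step < WARMUP:
            return step / max(1, WARMUP)
        progress = (step - WARMUP) / max(1, STEPS - WARMUP)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    # Inside the training loop: opt.step(); sched.step()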

For TRM with sparse layers, I implemented the approach described in MoEUT, which shows that MoE models using recursion can reach parameter-equal outcomes compared to dense models without recursion. The claims largely hold up inside the TRM harness at small scales.

Several pieces of research, including the MoEUT paper, suggest that 2 layers are required for transformer generalization, and the MoEUT paper found that a group size of 2 with additional recursion worked better than larger groups of layers with less recursion. Following these findings, I trained all TRM models with just 2 layers and recursion. This meant a wider-than-standard FFN expansion for the dense models and many experts for the MoE variants.
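
To make "2 layers plus recursion" concrete, here is a rough sketch of the shape of such a block, using PyTorch's stock encoder layer and placeholder dimensions (this is not the repo's actual module):

    import torch
    import torch.nn as nn

    class RecursiveBlock(nn.Module):
        """Two transformer layers reused every cycle (weight tying via recursion)."""
        def __init__(self, d_model=128, n_heads=4, ffn_mult=8):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.TransformerEncoderLayer(
                    d_model, n_heads, dim_feedforward=ffn_mult * d_model,
                    batch_first=True, norm_first=True)
                for _ in range(2)                      # group size 2
            ])

        def forward(self, x, n_cycles=6):
            # Causal mask so the block can be used for autoregressive LM.
            L = x.size(1)
            mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
            # The same two layers run n_cycles times: effective depth (and compute)
            # grows while the parameter count stays fixed.
            for _ in range(n_cycles):
                for layer in self.layers:
                    x = layer(x, src_mask=mask)
            return x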

Training Results: Simpler is Better

At 1M parameters on a language modeling task, architectural complexity doesn’t help:

Final loss rankings (best to worst):

  1. Transformer baseline
  2. (Tied) TRM sparse MoE with compute-similar recursion
  3. (Tied) TRM with dense + wide layers
  4. Carry refinement variants

The transformer baseline achieved the lowest training loss, though all of the best runs were quite close and produced usable models. Deep recursion and carry refinement added overhead and extra compute without improving perplexity or generation quality relative to the baseline. Spending more compute did improve results relative to the other TRM variants, but none of them reached baseline transformer performance.

Dense TRM Models Generate Coherent Text

Sample from an H-cycle=3, L-cycle=6 run without carry refinement (greedy decoding, temp=0.0):

Prompt: "Once upon a time, there was a little girl named Lily"

Output: ". She loved to play outside in the park. One day, she
saw a big, red ball on the ground. She wanted to play with it,
but it was too high for her.

Lily's mom saw her struggling and asked, "What's wrong, sweetie?"

"I can't reach the ball, but I can't reach it," her mom said.

Lily was sad, but she knew she had to help the ball. She went
to her mom and said, "I can help you, mommy. I can help you."

Her mom was happy to help and they went back to the park
together. They played on the swing and had lots of fun. When it
was time to go home, they went home and had a happy day at the
park. The end."

But don’t get too excited! Remember, the transformer baseline reaches slightly lower loss at a similar size on this dataset. TinyStories is known to allow reasonable generations at very small parameter counts, and all of the models and variants here produce similar output.

Temperature and top-k sampling work as expected, producing more diverse outputs at higher temperatures.
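
The sampling itself is the standard temperature plus top-k recipe; here is a minimal sketch of what that means per token (the function name and defaults are mine):

    import torch

    def sample_next_token(logits, temperature=0.5, top_k=20):
        # logits: (vocab_size,) scores for the next token position.
        if temperature == 0.0:
            return int(torch.argmax(logits))              # greedy decoding
        logits = logits / temperature                     # flatten or sharpen the distribution
        kth_best = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
        probs = torch.softmax(logits, dim=-1)
        return int(torch.multinomial(probs, num_samples=1))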

Sample from same model (temp=0.5, top_k=20):

Prompt: "The dog ran very fast to"

Output: " the park. He was so excited to see the other kids. He
ran around the park and saw a big, red ball. He wanted to play
with it, but he was too small. He was too small and he couldn't
reach it. He tried to get it, but he couldn't. He was so sad.
He started to cry..."

MoEUT at 1M Parameters

I implemented MoEUT following the original paper, with one modification: loss-free bias balancing, as shown in the DeepSeek paper. The key pieces:

  • Sigmoid-based expert routing
  • Loss-free bias balancing
  • 8 FFN experts with k=2 active
  • 2 attention experts with k=1 active

Training was stable, and loss-free bias balancing kept expert loads balanced without auxiliary losses. Compared to the dense TRM models, both training loss and generation quality were comparable. This is expected from the MoEUT paper, which shows that MoE models can reach parameter-equal outcomes by employing recursion as described in Universal Transformers.

Implementation note on Apple Silicon: My MoEUT implementation isn’t truly compute-equal to the dense models due to a hardware-specific constraint. On Apple Silicon (MPS backend), sparse tensor operations caused unbounded memory growth during training—likely due to memory fragmentation. To work around this, I had to compute dense projections for all experts and then filter to the top-k, rather than computing only the selected experts. This means MoEUT variants used more compute than they theoretically should. This limitation is specific to the MPS backend and likely wouldn’t occur on CUDA hardware with better sparse operation support.
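
Concretely, the workaround looks roughly like this (a sketch with my own names; the gather-based dispatch it replaces would only run each token through its selected experts):

    import torch

    def moe_ffn_dense_workaround(x, experts, idx, weights):
        # x: (tokens, d); experts: list of FFN modules; idx, weights: (tokens, k) from the router.
        # MPS workaround: run every expert on every token, then keep only the routed
        # outputs, instead of gathering each expert's tokens and running it sparsely.
        all_out = torch.stack([expert(x) for expert in experts], dim=1)      # (tokens, E, d)
        picked = torch.gather(all_out, 1,
                              idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))  # (tokens, k, d)
        return (picked * weights.unsqueeze(-1)).sum(dim=1)                   # (tokens, d)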

The MoEUT paper demonstrated advantages at larger scales and would be an interesting path for future work, particularly on hardware that can efficiently handle sparse computations.

Carry Refinement

HRM and TRM allow the model to refine previous outputs, improving quality with more compute at inference time. In theory, this should help with difficult tokens.
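
For intuition, here is a toy version of the pattern (GRU cells stand in for the actual recursive transformer blocks, and every name and size below is a placeholder, not the repo's code):

    import torch
    import torch.nn as nn

    class CarryRefiner(nn.Module):
        """Toy sketch of carry refinement: the same small network repeatedly
        updates a latent carry z and an answer state y, so extra inference
        passes can revise earlier outputs."""
        def __init__(self, d=128, vocab=1024):
            super().__init__()
            self.latent_step = nn.GRUCell(2 * d, d)   # refine z from (x, y)
            self.answer_step = nn.GRUCell(d, d)       # revise y from z
            self.to_logits = nn.Linear(d, vocab)      # placeholder vocab projection

        def forward(self, x, n_refine=4):
            # x: (batch, d) context features; y and z start at zero.
            y = torch.zeros_like(x)
            z = torch.zeros_like(x)
            for _ in range(n_refine):                 # more passes = more refinement compute
                z = self.latent_step(torch.cat([x, y], dim=-1), z)
                y = self.answer_step(z, y)
            return self.to_logits(y)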

In practice, at 1M parameters on TinyStories, the carry refinement variants consistently lagged behind plain recursion without carry in training loss. The carry refinement mechanism added complexity and compute without improving generation quality.

This doesn’t mean carry refinement is flawed—just that for simple narrative generation at small scale, the overhead isn’t justified.

Key Takeaways

What I observed:

The hierarchical reasoning mechanisms that excel at puzzles don’t automatically transfer to language modeling—at least not on simple narrative tasks at small scales. The recursion that helps TRM “think through” an ARC-AGI puzzle doesn’t provide comparable benefits when generating children’s stories.

Similar Work

There are a number of papers about reasoning in latent space or scaling up test-time compute dynamically, and hierarchical reasoning models appear to operate in a similar way. When generating stories, it was possible to run more or fewer refinement steps than the models were trained with and get different outputs. Anecdotally, some story generations appeared slightly better when running 4 carry refinement steps or additional L-cycles at inference. It’s possible I was just looking for this result, but it likely deserves follow-up on larger models to further study latent-space reasoning on language modeling tasks.

Future Work

Several important questions remain unanswered:

Scale: Does MoEUT show advantages at 40M+ parameters where the original paper demonstrated benefits?

Task complexity: Would recursive models excel on reasoning-heavy language tasks like GSM8K or other mathematical/logical problems where deeper computation might matter?

I focused on TinyStories because it’s fast to train and well-understood. But hierarchical reasoning might shine on tasks that actually require… reasoning. That’s the obvious next experiment.

Conclusion

Hierarchical Reasoning Models can generate text. At small scales on simple tasks, they’re not better than standard transformers, but they’re not really worse either (when configured correctly). They seem quite similar to universal transformers, and are likely relevant to research into latent space reasoning. At least at the scales I tested, architectural sophistication needs task-appropriate justification. Deep recursion and carry refinement are tools, not magic. They excel where they’re needed and add overhead where they’re not.


Code and configs available at: github.com/jhspaybar/TinyRecursiveModels