How to Map LLM Architecture with a Physical Thinking Tool
The transformer architecture is the foundation of nearly every large language model in production today: GPT, Gemini, Claude, all of them. It was introduced in a 2017 Google paper and has since become the dominant framework for sequence modeling. The original architecture has two sides: an encoder that processes the input and a decoder that generates the output, connected by a cross-attention mechanism that lets the decoder draw on the encoder's output. Understanding how those components relate is less a matter of memorizing the diagram and more a matter of being able to reason about what each piece does and why the connections between them matter.
Watch the transformer architecture assemble block by block on a wall: the encoder stack on the left, the decoder stack on the right, and the cross-attention bridge connecting them, until the full architecture is visible and holdable.
Switch-Its makes system architecture something you build
Switch-Its magnetic dry erase blocks let you write each component on its own block: input embedding, positional encoding, multi-head attention, feed forward, add and norm. Place them on a magnetic surface in the arrangement that makes their relationships visible. Moving a block is a claim about how the system works, which makes the architecture an argument you construct rather than a diagram you copy.

Build the encoder and decoder stacks
Each stack starts from the bottom: the input embedding on the encoder side and the output embedding on the decoder side, then positional encoding, then multi-head attention and feed-forward layers building upward. Placing the blocks in parallel makes the symmetry between the two sides immediately visible and the differences between them equally obvious.
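The same side-by-side comparison can be sketched in a few lines of Python. This is only an illustration: the block labels follow the standard layout of the original 2017 encoder-decoder transformer for a single layer, and the names ENCODER_STACK, DECODER_STACK, side_by_side, and only_in are made up for this example.

```python
# Blocks listed bottom to top, one layer per stack (labels are illustrative).
ENCODER_STACK = [
    "input embedding",
    "positional encoding",
    "multi-head attention",
    "add and norm",
    "feed forward",
    "add and norm",
]

DECODER_STACK = [
    "output embedding",
    "positional encoding",
    "masked multi-head attention",
    "add and norm",
    "cross-attention",
    "add and norm",
    "feed forward",
    "add and norm",
    "linear",
    "softmax",
]


def side_by_side(left, right):
    """Pair blocks row by row from the bottom, padding the shorter stack."""
    height = max(len(left), len(right))

    def pad(stack):
        return stack + [""] * (height - len(stack))

    return list(zip(pad(left), pad(right)))


def only_in(stack, other):
    """Blocks that appear in one stack but have no same-named block in the other."""
    return [block for block in stack if block not in other]


rows = side_by_side(ENCODER_STACK, DECODER_STACK)

# Print top to bottom, the way the wall reads.
for left, right in reversed(rows):
    print(f"{left:<24}| {right}")

print("Decoder only:", only_in(DECODER_STACK, ENCODER_STACK))
print("Encoder only:", only_in(ENCODER_STACK, DECODER_STACK))
```

The shared rows are the symmetry; the blocks that show up on only one side are the differences worth talking through at the wall.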

Place the cross-attention bridge
The block connecting encoder to decoder is the most important piece in the architecture. It's the mechanism that lets the decoder attend to the full encoded input while generating output. Placing it physically between the two stacks makes the relationship concrete: this is where the two sides of the model talk to each other.
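For readers who want to see what that bridge computes, here is a minimal sketch of cross-attention in plain Python. It is deliberately simplified and rests on stated assumptions: a single head, no learned query/key/value projections, and tiny hand-picked vectors. The function names and example values are hypothetical, not taken from any library.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, encoder_states):
    """Each decoder position (query) attends over every encoder position.

    Simplification: the encoder states serve directly as keys and values,
    where a real model would apply learned projections first.
    """
    dim = len(encoder_states[0])
    outputs, weights = [], []
    for query in decoder_states:
        # Scaled dot product between this decoder position and each encoder position.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in encoder_states
        ]
        row = softmax(scores)
        # Weighted average of the encoder states.
        mixed = [
            sum(w * value[i] for w, value in zip(row, encoder_states))
            for i in range(dim)
        ]
        weights.append(row)
        outputs.append(mixed)
    return outputs, weights


# Three encoded input positions, two decoder positions (made-up numbers).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 2.0]]

outputs, weights = cross_attention(decoder_states, encoder_states)
for position, row in enumerate(weights):
    print(f"decoder position {position} attends with weights {row}")
```

Every decoder position gets one weight per encoder position, which is the "attend to the full encoded input" relationship the bridge block stands for. Note that the flow is one-directional: the decoder reads from the encoder, not the other way around.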

The full architecture on the wall
With all components placed, the transformer is a physical object you can point at, explain, and reorganize. Any block can be pulled off to ask what happens if that component changes, which turns a static diagram into an active thinking tool for anyone reasoning about how modern AI systems are built.
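One "what if" worth trying is pulling the cross-attention block off the decoder side. The sketch below mimics that move on a list of block labels; the labels and the pull_block helper are hypothetical, written only for this example. What remains, with the encoder stack set aside, is roughly the decoder-only layer layout that GPT-style models use.

```python
# One decoder layer, bottom to top (labels are illustrative).
DECODER_LAYER = [
    "masked multi-head attention",
    "add and norm",
    "cross-attention",
    "add and norm",
    "feed forward",
    "add and norm",
]


def pull_block(layer, name):
    """Remove a block together with the add-and-norm step directly above it."""
    remaining = []
    skip_next_norm = False
    for block in layer:
        if block == name:
            skip_next_norm = True
            continue
        if skip_next_norm and block == "add and norm":
            skip_next_norm = False
            continue
        skip_next_norm = False
        remaining.append(block)
    return remaining


decoder_only_layer = pull_block(DECODER_LAYER, "cross-attention")
print(decoder_only_layer)
```

The original list is left untouched, so the block can go back on the wall and a different one can come off.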
Complex technical architectures are easier to reason about when they're physical: when you can hold a component, place it in relation to the others, and move it when your understanding shifts. That's the same principle behind visible thinking at work, and it connects directly to the broader case for putting ideas on the wall developed in Put the Plan on the Wall.