Anthropic Performance Take-Home
A performance optimization challenge for a custom VLIW SIMD architecture simulator, originally used by Anthropic for technical interviews.
Overview
This challenge involves optimizing a kernel that traverses binary trees while performing hash operations. The goal is to minimize execution cycles on a simulated VLIW SIMD processor.
Architecture Specifications
The challenge uses a custom VLIW SIMD architecture with five execution engines:
| Engine | Slots/Cycle | Purpose |
|---|---|---|
| ALU | 12 | Scalar arithmetic |
| VALU | 6 | Vector arithmetic (VLEN=8) |
| Load | 2 | Memory reads |
| Store | 2 | Memory writes |
| Flow | 1 | Control flow |
Key Parameters:
- SCRATCH_SIZE: 1536 32-bit words (register file)
- Vector Length (VLEN): 8 elements per vector operation
- Batch Size: 256 items processed per round
- Rounds: 16 iterations total
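The engine table and parameters above can be summarized in a small sketch that computes the machine's peak issue capacity per cycle (the numbers come straight from the table; the dict layout itself is just an illustration):

```python
# Per-cycle slot limits for each execution engine, as given in the table above.
ENGINE_SLOTS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}
VLEN = 8  # vector lanes per VALU operation

# Peak arithmetic lanes per cycle: scalar ALU slots plus VALU slots * VLEN.
peak_arith_lanes = ENGINE_SLOTS["alu"] + ENGINE_SLOTS["valu"] * VLEN
print(peak_arith_lanes)  # 60

# Total instruction slots per bundle -- the "up to 23x" packing ceiling.
slots_per_bundle = sum(ENGINE_SLOTS.values())
print(slots_per_bundle)  # 23
```

The 23-slot bundle width is where the "Instruction Packing (up to 23x)" figure below comes from; in practice dependencies keep most bundles well under full.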
Performance Benchmarks
| Cycles | Description |
|---|---|
| 147,734 | Baseline (unoptimized scalar) |
| 18,532 | Updated starting point (2-hour version) |
| 1,790 | Best human ~2hr / Claude Opus 4.5 casual |
| 1,487 | Impressive threshold |
| 1,363 | Best known (Claude Opus 4.5 improved harness) |
The Problem
Each round processes a batch of 256 items through these steps:
- Load index and value from memory
- XOR value with tree node value at index
- Apply 6-stage hash function
- Branch left (`idx*2+1`) if the hash is even, else right (`idx*2+2`)
- Wrap to root if the index passes the tree bounds
- Store updated index and value
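The per-item steps above can be sketched as a scalar reference in Python. Note this is a minimal illustration, not the challenge's actual kernel: the real 6-stage hash constants, tree size, and memory layout are not given in this document, so the values below are placeholders.

```python
TREE_SIZE = 2**10 - 1  # placeholder tree bound; the real size differs
MASK = 0xFFFFFFFF      # 32-bit word arithmetic

def hash6(v: int) -> int:
    # Placeholder 6-stage mix; the kernel's actual stages and constants differ.
    for _ in range(6):
        v = ((v ^ (v >> 13)) * 0x9E3779B1) & MASK
    return v

def step(idx: int, val: int, tree: list[int]) -> tuple[int, int]:
    val ^= tree[idx]                                     # XOR with tree node at idx
    val = hash6(val)                                     # apply 6-stage hash
    idx = idx * 2 + 1 if val % 2 == 0 else idx * 2 + 2   # branch left/right on parity
    if idx >= TREE_SIZE:                                 # wrap to root past bounds
        idx = 0
    return idx, val
```

Each round applies this step to all 256 batch items, and 16 rounds run in total.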
Optimization Strategies Applied
The optimization journey from 147K to ~1.4K cycles involves:
- Vectorization (8x) - Process 8 batch items simultaneously using VALU
- Instruction Packing (up to 23x) - Fill all engine slots per cycle
- Loops - Replace unrolled code with `cond_jump` loops
- Memory Batching - Use `vload`/`vstore` for 8-element transfers
- Constant Caching - Pre-load constants to scratch space
- Hash Pipelining - Overlap hash stages across elements
Combined theoretical speedup: 100x+ (8x vectorization * ~13x packing efficiency)
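The vectorization strategy can be illustrated by reshaping the scalar step into an 8-wide form, where each list comprehension below corresponds conceptually to one VALU operation across VLEN=8 lanes. As before, the hash stages and tree size are placeholders, not the kernel's actual values:

```python
VLEN = 8
TREE_SIZE = 2**10 - 1  # placeholder tree bound
MASK = 0xFFFFFFFF

def hash6_lane(v: int) -> int:
    for _ in range(6):  # placeholder 6-stage mix
        v = ((v ^ (v >> 13)) * 0x9E3779B1) & MASK
    return v

def step_vec(idx: list[int], val: list[int], tree: list[int]):
    # Gather + XOR + hash across 8 lanes (one conceptual VALU chain).
    val = [hash6_lane(v ^ tree[i]) for i, v in zip(idx, val)]
    # Branch left/right on hash parity, then wrap to root past bounds.
    idx = [i * 2 + 1 if v % 2 == 0 else i * 2 + 2 for i, v in zip(idx, val)]
    idx = [0 if i >= TREE_SIZE else i for i in idx]
    return idx, val
```

One caveat the "Lessons Learned" below touches on: the `tree[idx]` lookup is a gather with non-contiguous addresses, so on the real machine it cannot use a single `vload` and becomes a key constraint on the vectorized inner loop.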
Background Knowledge
For the foundational concepts needed to understand this challenge, see the VLIW SIMD Architecture Notes:
- Scalar vs Vector - Understanding data parallelism
- ALU and Engines - Execution units
- VLIW Architecture - Instruction bundling
- Instruction Set - Available operations
- Optimization Strategies - General techniques
Key Commands
Lessons Learned
- VLIW forces explicit parallelism - You must manually schedule what runs together
- Data dependencies are the bottleneck - Finding independent operations is key
- Vectorization is powerful but constrained - Requires contiguous memory access patterns
- Instruction packing requires careful planning - Scratch space allocation matters