
Anthropic Performance Take-Home

A performance optimization challenge for a custom VLIW SIMD architecture simulator, originally used by Anthropic for technical interviews.

Overview

This challenge involves optimizing a kernel that traverses binary trees while performing hash operations. The goal is to minimize execution cycles on a simulated VLIW SIMD processor.

Architecture Specifications

The challenge uses a custom VLIW SIMD architecture with five execution engines:

Engine | Slots/Cycle | Purpose
ALU    | 12          | Scalar arithmetic
VALU   | 6           | Vector arithmetic (VLEN=8)
Load   | 2           | Memory reads
Store  | 2           | Memory writes
Flow   | 1           | Control flow

Key Parameters:

  • SCRATCH_SIZE: 1536 32-bit words (register file)
  • Vector Length (VLEN): 8 elements per vector operation
  • Batch Size: 256 items processed per round
  • Rounds: 16 iterations total
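
To make the scale of the workload concrete, the parameters above can be turned into a few derived counts. This is only back-of-the-envelope arithmetic on the listed values; the variable names are illustrative and are not taken from the challenge code.

```python
# Values copied from the "Key Parameters" list.
SCRATCH_SIZE = 1536  # 32-bit words of scratch space
VLEN = 8             # elements per vector operation
BATCH_SIZE = 256     # items processed per round
ROUNDS = 16          # iterations in total

# Derived quantities (names are illustrative, not from the challenge code).
vectors_per_round = BATCH_SIZE // VLEN                # vector-wide groups per round
total_item_updates = BATCH_SIZE * ROUNDS              # scalar item updates overall
total_vector_iterations = vectors_per_round * ROUNDS  # vector-wide updates overall
scratch_vector_capacity = SCRATCH_SIZE // VLEN        # VLEN-wide values that fit in scratch

print(f"vector groups per round:      {vectors_per_round}")
print(f"item updates overall:         {total_item_updates}")
print(f"vector iterations overall:    {total_vector_iterations}")
print(f"vector-sized scratch entries: {scratch_vector_capacity}")
```

A fully vectorized kernel therefore performs 512 vector-wide updates instead of 4,096 scalar ones, and the scratch space holds at most 192 vector-sized values, which bounds how many batch items can be kept in flight at once.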

Performance Benchmarks

Cycles  | Description
147,734 | Baseline (unoptimized scalar)
18,532  | Updated starting point (2-hour version)
1,790   | Best human ~2hr / Claude Opus 4.5 casual
1,487   | Impressive threshold
1,363   | Best known (Claude Opus 4.5 improved harness)

The Problem

Each round processes a batch of 256 items through these steps:

  1. Load index and value from memory
  2. XOR value with tree node value at index
  3. Apply 6-stage hash function
  4. Branch left (idx*2+1) if hash is even, else right (idx*2+2)
  5. Wrap to root if past tree bounds
  6. Store updated index and value
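
The six steps can be sketched as a scalar reference loop. This is a minimal sketch, not the challenge's kernel: `placeholder_hash` is a hypothetical stand-in for the real 6-stage hash (which is not reproduced here), and the sketch assumes the tree is stored as a flat array and that the hashed value is what gets written back.

```python
MASK32 = 0xFFFFFFFF


def placeholder_hash(x: int) -> int:
    """Hypothetical 6-stage mixer. NOT the hash used by the challenge."""
    for stage in range(6):
        x = (x * 0x9E3779B1 + stage) & MASK32  # multiply-add, kept to 32 bits
        x ^= x >> 15                           # fold high bits into low bits
    return x


def run_round(tree, indices, values):
    """Apply one round to every batch item, updating the lists in place."""
    n_nodes = len(tree)
    for i in range(len(indices)):
        idx, val = indices[i], values[i]          # 1. load index and value
        val ^= tree[idx]                          # 2. XOR with the node value
        val = placeholder_hash(val)               # 3. hash (placeholder)
        if val % 2 == 0:                          # 4. even -> left child
            idx = 2 * idx + 1
        else:                                     #    odd  -> right child
            idx = 2 * idx + 2
        if idx >= n_nodes:                        # 5. wrap to root past the leaves
            idx = 0
        indices[i], values[i] = idx, val          # 6. store updated state


if __name__ == "__main__":
    tree = [3, 1, 4, 1, 5, 9, 2]  # toy 7-node tree, flat array layout (assumed)
    indices = [0, 0, 0, 0]
    values = [10, 11, 12, 13]
    for _ in range(16):
        run_round(tree, indices, values)
    print(indices, values)
```

Each item's next index depends on its own hash result, so the work within one item is a serial chain, while different items are independent of each other. That independence is what the later vectorization and packing steps exploit.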

Optimization Strategies Applied

The optimization journey from 147K to ~1.4K cycles involves:

  1. Vectorization (8x) - Process 8 batch items simultaneously using VALU
  2. Instruction Packing (up to 23x) - Fill all engine slots per cycle (12 + 6 + 2 + 2 + 1 = 23 slots)
  3. Loops - Replace unrolled code with cond_jump loops
  4. Memory Batching - Use vload/vstore for 8-element transfers
  5. Constant Caching - Pre-load constants to scratch space
  6. Hash Pipelining - Overlap hash stages across elements

Combined theoretical speedup: 100x+ (8x vectorization * ~13x packing efficiency)
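
The arithmetic behind these figures can be checked against the numbers reported in the tables. The snippet below only recomputes ratios from values already stated in this document; the dictionary names are illustrative.

```python
# Slots per cycle for each engine, from the architecture table.
SLOTS = {"ALU": 12, "VALU": 6, "Load": 2, "Store": 2, "Flow": 1}
peak_ops_per_cycle = sum(SLOTS.values())  # upper bound behind "up to 23x"

# Theoretical estimate quoted above: 8x vectorization times ~13x packing.
theoretical_speedup = 8 * 13

# Cycle counts from the benchmark table.
BASELINE = 147_734
RESULTS = {
    "Updated starting point": 18_532,
    "Best human ~2hr": 1_790,
    "Impressive threshold": 1_487,
    "Best known": 1_363,
}

print(f"peak slots per cycle: {peak_ops_per_cycle}")
print(f"theoretical speedup:  {theoretical_speedup}x")
for label, cycles in RESULTS.items():
    print(f"{label}: {BASELINE / cycles:.1f}x over baseline")
```

The best known result works out to about 108x over the baseline, in line with the 100x+ estimate.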

Background Knowledge

For the foundational concepts needed to understand this challenge, see the VLIW SIMD Architecture Notes.

Key Commands

# Run submission tests (official validation)
python tests/submission_tests.py

# Run cycle count test
python perf_takehome.py Tests.test_kernel_cycles

# Generate and view trace
python perf_takehome.py Tests.test_kernel_trace
python watch_trace.py  # Open browser, click "Open Perfetto"

# Validate tests unchanged (important!)
git diff origin/main tests/

Lessons Learned

  1. VLIW forces explicit parallelism - You must manually schedule what runs together
  2. Data dependencies are the bottleneck - Finding independent operations is key
  3. Vectorization is powerful but constrained - Requires contiguous memory access patterns
  4. Instruction packing requires careful planning - Scratch space allocation matters
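
The first two lessons can be illustrated with a toy greedy scheduler that packs operations into per-cycle bundles. This is a sketch under stated assumptions, not the simulator's API: the `Op` class, the `pack` function, and the rule that a result becomes visible one cycle after issue are all hypothetical simplifications. Only the slot limits come from the architecture table.

```python
from dataclasses import dataclass

# Per-cycle slot limits, from the architecture table.
SLOT_LIMITS = {"alu": 12, "valu": 6, "load": 2, "store": 2, "flow": 1}


@dataclass
class Op:
    name: str
    engine: str
    deps: tuple = ()  # names of ops whose results this op reads


def pack(ops):
    """Greedy list scheduling: return one list of op names per cycle.

    Assumption: an op may issue only in a cycle strictly after every op
    it depends on has issued (single-cycle latency).
    """
    issued_at = {}  # op name -> cycle in which it was issued
    bundles = []
    remaining = list(ops)
    while remaining:
        cycle = len(bundles)
        used = {engine: 0 for engine in SLOT_LIMITS}
        bundle, waiting = [], []
        for op in remaining:
            ready = all(issued_at.get(d, cycle) < cycle for d in op.deps)
            if ready and used[op.engine] < SLOT_LIMITS[op.engine]:
                used[op.engine] += 1
                issued_at[op.name] = cycle
                bundle.append(op.name)
            else:
                waiting.append(op)
        if not bundle:
            raise ValueError("no op can issue: missing or cyclic dependency")
        bundles.append(bundle)
        remaining = waiting
    return bundles


def hash_chain(vec, stages=3):
    """A serial chain of VALU ops for one vector (stage s reads stage s-1)."""
    return [
        Op(f"v{vec}s{s}", "valu", (f"v{vec}s{s - 1}",) if s else ())
        for s in range(stages)
    ]


if __name__ == "__main__":
    one = pack(hash_chain(0))
    four = pack([op for v in range(4) for op in hash_chain(v)])
    print(len(one), "cycles for 1 chain;", len(four), "cycles for 4 chains")
```

In this toy model one dependent chain of three ops takes three cycles, and four independent chains also take three cycles, because the extra ops fill VALU slots that would otherwise sit idle. A greedy pass like this is not guaranteed to find the shortest schedule, which is part of why packing needs careful planning.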