Designing a Small Workflow for AI Accelerator Exploration

A short note on how I like to structure early-stage accelerator exploration before committing to RTL.

3 min read · May 18, 2026

2026 · #ai-accelerator #hardware-design #workflow · blog

When I start exploring a new AI accelerator idea, I usually try to avoid jumping directly into RTL. The first few days are much more productive when the workflow is structured around fast feedback: define the problem, set a few measurable targets, and build a lightweight model that makes trade-offs visible.

This is especially important for systolic and dataflow-driven architectures, where small choices in tiling, buffering, or operand movement can dominate the final performance and energy story.

1. Start with the bottleneck, not the block diagram

It is tempting to begin with a clean architecture drawing. In practice, I get better results by starting from one question:

What is the real bottleneck for this workload under a realistic resource budget?

For example, in a transformer-oriented accelerator study, the answer is often not simply “MAC throughput.” It may be:

on-chip SRAM capacity
activation movement between tiles
reduction latency
utilization collapse on irregular shapes

Once that bottleneck is clear, the architectural discussion becomes much sharper.
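One way to make that question concrete is a roofline-style back-of-envelope check. The following is a minimal Python sketch, not a calibrated model. The function names, the 1-byte operands, and the 64 bytes/cycle off-chip bandwidth are assumptions chosen for illustration. It also assumes perfect on-chip reuse and that each PE owns one output element at a time.

```python
import math


def gemm_bottleneck(m, n, k, pe_rows, pe_cols,
                    bytes_per_elem=1, dram_bytes_per_cycle=64):
    """Compare ideal compute cycles with ideal off-chip transfer cycles.

    bytes_per_elem and dram_bytes_per_cycle are assumed placeholder values.
    """
    # Ideal compute time: every PE performs one MAC per cycle.
    compute_cycles = (m * n * k) / (pe_rows * pe_cols)
    # Best-case traffic: read A (m x k) and B (k x n) once, write C (m x n) once.
    min_bytes = (m * k + k * n + m * n) * bytes_per_elem
    memory_cycles = min_bytes / dram_bytes_per_cycle
    bound = "compute" if compute_cycles >= memory_cycles else "memory"
    return compute_cycles, memory_cycles, bound


def output_utilization(m, n, pe_rows, pe_cols):
    """Fraction of PEs doing useful work when outputs are padded to the array shape."""
    padded_m = math.ceil(m / pe_rows) * pe_rows
    padded_n = math.ceil(n / pe_cols) * pe_cols
    return (m * n) / (padded_m * padded_n)


# A large square GEMM versus a single-row (GEMV-like) shape on a 16x16 array.
print(gemm_bottleneck(4096, 4096, 4096, 16, 16))
print(gemm_bottleneck(1, 4096, 4096, 16, 16))
print(output_utilization(1, 4096, 16, 16))
```

Under these assumptions the square GEMM comes out compute-bound, while the single-row shape is memory-bound and keeps only 1/16 of the array busy. That is the kind of result that moves the discussion away from raw MAC count.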

2. Build a small model before building detailed hardware

My preferred first step is a compact simulator or analytical model that can report:

cycle estimate
buffer requirement
PE utilization
data reuse opportunity
communication pressure

It does not need to be perfect. It only needs to be fast to run and consistent enough to compare design choices honestly.

For instance, a simple configuration like this is already useful:

workload:
  m: 4096
  n: 4096
  k: 4096

array:
  rows: 16
  cols: 16

mapping:
  tile_m: 128
  tile_n: 128
  tile_k: 64
  dataflow: output-stationary

With a setup like this, I can quickly test whether a change in tile size improves reuse or only increases control complexity.
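As a rough illustration, the configuration above can be fed to a first-order analytical model like the one sketched here. This is a simplified example under stated assumptions, not the model I actually use: the `estimate` function and its output keys are hypothetical names, each PE is assumed to hold one output element and perform one MAC per cycle, fill and drain latency are ignored, and partial edge tiles are charged as full tiles.

```python
import math

# Same values as the configuration above, written as a plain dict.
config = {
    "workload": {"m": 4096, "n": 4096, "k": 4096},
    "array": {"rows": 16, "cols": 16},
    "mapping": {"tile_m": 128, "tile_n": 128, "tile_k": 64,
                "dataflow": "output-stationary"},
}


def estimate(cfg):
    """First-order cycle, buffer, traffic, and utilization estimate for a tiled GEMM."""
    w, a, t = cfg["workload"], cfg["array"], cfg["mapping"]
    if t["dataflow"] != "output-stationary":
        raise ValueError("this sketch only models the output-stationary dataflow")

    tiles_m = math.ceil(w["m"] / t["tile_m"])
    tiles_n = math.ceil(w["n"] / t["tile_n"])
    tiles_k = math.ceil(w["k"] / t["tile_k"])

    # Number of array-sized passes needed to cover one output tile.
    passes = math.ceil(t["tile_m"] / a["rows"]) * math.ceil(t["tile_n"] / a["cols"])
    # Each pass accumulates over tile_k, one MAC per PE per cycle.
    cycles = tiles_m * tiles_n * tiles_k * passes * t["tile_k"]

    # On-chip storage for one A tile, one B tile, and one C (accumulator) tile.
    buffer_elems = (t["tile_m"] * t["tile_k"] + t["tile_k"] * t["tile_n"]
                    + t["tile_m"] * t["tile_n"])

    # C stays on chip across the k loop; A is re-read once per column of
    # output tiles, B once per row of output tiles, C is written once.
    traffic_elems = (w["m"] * w["k"] * tiles_n + w["k"] * w["n"] * tiles_m
                     + w["m"] * w["n"])

    useful_macs = w["m"] * w["n"] * w["k"]
    utilization = useful_macs / (cycles * a["rows"] * a["cols"])

    return {"cycles": cycles, "buffer_elems": buffer_elems,
            "traffic_elems": traffic_elems, "utilization": utilization}


print(estimate(config))
```

In this sketch, raising `tile_m` and `tile_n` from 128 to 256 halves the modeled A and B traffic and leaves the cycle count unchanged, but triples the buffer footprint from 32768 to 98304 elements. Even a model this crude shows what a larger tile buys and what it costs.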

3. Keep the evaluation metrics simple and comparable

At the beginning, I try to track only a few metrics consistently:

total cycles
estimated memory traffic
utilization
latency to first output
basic area proxies such as buffer count or array size

Even rough metrics are valuable if they are measured in a consistent way across all candidates.
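A small fixed record type is one way to enforce that consistency. The sketch below is only an illustration: the `Metrics` class, its field names, and the `relative_to` helper are hypothetical, and the numbers are placeholders chosen to keep the arithmetic easy to read, not measurements.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Metrics:
    """One row per candidate; every candidate must fill in every field."""
    name: str
    total_cycles: int
    memory_traffic_bytes: int
    utilization: float
    first_output_cycles: int
    buffer_bytes: int  # simple area proxy


def relative_to(baseline, candidate):
    """Return candidate / baseline for every numeric metric."""
    base, cand = asdict(baseline), asdict(candidate)
    return {key: cand[key] / base[key] for key in base if key != "name"}


# Placeholder values for illustration only.
baseline = Metrics("tile-128", 1000, 4000, 0.8, 100, 64)
candidate = Metrics("tile-256", 1000, 2000, 0.8, 120, 192)

ratios = relative_to(baseline, candidate)
for key, value in ratios.items():
    print(f"{key:>22}: {value:.2f}x")
```

Reporting every candidate as a ratio against one baseline keeps the comparison honest even when the absolute numbers are rough.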

4. Delay implementation detail until the design space becomes narrower

I have learned that early over-implementation can hide weak ideas behind a lot of code. If the model still shows unclear trade-offs, it is usually too early to optimize controllers, interfaces, or verification infrastructure.

Once two or three promising candidates remain, then it makes sense to move toward:

a cleaner simulator
RTL microarchitecture sketches
interface definition
verification planning

That transition is much easier when the earlier exploration already established why a design should exist.

Closing thought

Good accelerator work is not only about building fast hardware. It is also about building a workflow that helps us reject bad assumptions early.

That is the kind of note I want to keep sharing here: small, practical ideas from architecture exploration, hardware implementation, and research-in-progress.


Bochen Ye(叶波辰)
