Here's the uncomfortable thing about rendering your UI to a GPU canvas: to a screen reader, you've drawn nothing. There's one <canvas> element and a blob of pixels. A blind user tabbing through your app hears silence. This is the wall every canvas UI hits (Flutter's web target, Figma, anything game-engine-shaped), and most of them either ship a half-hearted hidden parallel DOM or quietly don't solve it.
Vel renders everything on the GPU through Lume, so I had the same problem. The way out was to stop treating the pixels as the UI, and treat them as one projection of it.
One tree, two backends
A Vel widget already knows how to describe itself for layout and paint. The insight was that it can describe itself a second way — not "here's how I look" but "here's what I am": a role, a label, a value, some state. Pixels for the eye; semantics for everything else.
So every widget fills out a SemNode in a describe() method, and a single collector walks the tree:
enum class SemRole {
    None, Group, Text, Heading, Button, Link,
    Checkbox, Radio, Switch, Slider,
    TextField, TextArea,               // editable: single- and multi-line
    Image, ProgressBar, Alert, List, ListItem,
};

struct SemNode {
    SemRole     role = SemRole::None;
    std::string label;                 // accessible name (button text, field label…)
    std::string value;                 // current value (field text, slider readout…)
    bool focusable = false;            // participates in Tab order
    bool editable  = false;            // mount a real <input> for IME/keyboard
    bool checked = false, selected = false, disabled = false;
    bool hasRange = false;             // slider/progress
    float valueMin = 0, valueMax = 0, valueNow = 0;
    Rect rect;                         // logical-pixel bounds == CSS px on the web
    int  id = 0;                       // stable for one frame
    // …
};

std::vector<SemNode> collectSemantics(Widget& root);
The roles are deliberately small and intent-based — they map 1:1 to ARIA/DOM roles on the web and to platform accessibility roles elsewhere. A Btn reports Button with its text as the label. A Slider reports Slider with hasRange and the live valueNow. Decorative containers report None and get dropped from the output entirely, so the semantic tree is flatter and cleaner than the visual tree — assistive tech doesn't care about your spacer boxes.
collectSemantics walks the widget tree in paint order and produces a flat list, stamping each node with its post-layout rect and an id that is stable for that frame. That flat list is the whole accessibility model. Everything downstream is a consumer of it.
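A minimal sketch of that walk, to make the "describe, drop, stamp" steps concrete. The Rect, Widget, and Btn definitions here are trimmed stand-ins, and the describe(SemNode&) signature, the bounds field, and the sequential id scheme are assumptions for illustration, not Vel's actual internals.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real Vel types, trimmed to what the walk needs.
struct Rect { float x = 0, y = 0, w = 0, h = 0; };
enum class SemRole { None, Group, Text, Button };

struct SemNode {
    SemRole role = SemRole::None;
    std::string label;
    bool focusable = false;
    Rect rect;
    int id = 0;
};

struct Widget {
    Rect bounds;                                    // assumed: filled in by layout
    std::vector<std::unique_ptr<Widget>> children;  // assumed: stored in paint order
    virtual ~Widget() = default;
    // Default: a decorative widget leaves the role as None.
    virtual void describe(SemNode&) const {}
};

struct Btn : Widget {
    std::string text;
    explicit Btn(std::string t) : text(std::move(t)) {}
    void describe(SemNode& n) const override {
        n.role = SemRole::Button;
        n.label = text;
        n.focusable = true;
    }
};

// Depth-first walk; children are visited in stored order, so the output
// order follows paint order.
static void collectInto(const Widget& w, std::vector<SemNode>& out, int& nextId) {
    SemNode n;
    w.describe(n);
    if (n.role != SemRole::None) {  // decorative widgets are dropped
        n.rect = w.bounds;          // post-layout bounds
        n.id = nextId++;            // assigned in walk order for this pass
        out.push_back(std::move(n));
    }
    for (const auto& child : w.children) collectInto(*child, out, nextId);
}

std::vector<SemNode> collectSemantics(Widget& root) {
    std::vector<SemNode> out;
    int nextId = 1;
    collectInto(root, out, nextId);
    return out;
}
```

With a root that holds a plain container wrapping a "Save" button, followed by a "Cancel" button, this returns two nodes: the container contributes nothing, and the buttons come out in walk order.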
The DOM mirror
On the web, a small JS layer takes that flat list and maintains a hidden DOM tree positioned exactly over the canvas: a <div role="button"> at the button's rect, an <input> at each editable field, an <h2> for headings. It's invisible on screen (the pixels are the real UI) but it's structurally real: screen readers read it, Tab moves through the focusable nodes in order, and focus rings track the actual element.
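The real mirror lives in JS, but the decision it makes per node is easy to show in C++. This is a sketch of that mapping only: MirrorElement and mirrorElementFor are hypothetical names, and the choices beyond the three elements named above (the div-plus-role pairs for links, checkboxes, sliders, and groups) are my assumption, using standard ARIA role names.

```cpp
#include <string>

// Trimmed copies of the types from the post; only the fields the mapping reads.
enum class SemRole { None, Group, Text, Heading, Button, Link, Checkbox, Slider, TextField, TextArea };

struct SemNode {
    SemRole role = SemRole::None;
    bool focusable = false;
    bool editable = false;
};

// Hypothetical description of the element the mirror would create for a node.
struct MirrorElement {
    std::string tag;       // HTML element name
    std::string ariaRole;  // empty when no role attribute is needed
    bool tabbable;         // whether the element should join the Tab order
};

MirrorElement mirrorElementFor(const SemNode& n) {
    // Editable nodes get a real form control so the browser's text stack applies.
    if (n.editable) {
        const bool multiline = (n.role == SemRole::TextArea);
        return {multiline ? "textarea" : "input", "", true};
    }
    switch (n.role) {
        case SemRole::Heading:  return {"h2", "", false};
        case SemRole::Button:   return {"div", "button", n.focusable};
        case SemRole::Link:     return {"div", "link", n.focusable};
        case SemRole::Checkbox: return {"div", "checkbox", n.focusable};
        case SemRole::Slider:   return {"div", "slider", n.focusable};
        case SemRole::Group:    return {"div", "group", false};
        default:                return {"div", "", false};  // plain text and anything unmapped
    }
}
```

The editable check comes first on purpose: a text field should become a native control regardless of what else the node says about itself.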
The editable flag is the one that earns its keep. When a node is editable, the mirror mounts a real, transparent <input>/<textarea> over it. That single decision hands you the entire native text stack for free: IME composition for CJK, dictation, autocorrect, the OS clipboard, mobile virtual keyboards. I am not reimplementing input method editors in C++ — I'm letting the browser do what it's extremely good at, on an element the user can't see, while the canvas renders the caret and selection.
This is also why the playground's code editor is canvas-drawn rather than a <textarea>: keeping the visual surface on the canvas lets it render crisply at any zoom and stay one coherent scene, while the hidden mirror still gives it TextArea semantics and real keyboard/IME. You get the rendering control of a canvas and the accessibility of a DOM input, instead of choosing one.
A few smaller things that matter and are easy to forget: the framework honors prefers-reduced-motion (the eased transitions collapse to instant), and focus is a first-class widget state, not a paint afterthought — so a keyboard user always sees where they are.
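As an illustration of "collapse to instant", here is a small sketch of a transition curve that honors the preference. The function name, the parameters, and the smoothstep curve are assumptions; on the web the flag would presumably come from the host checking the prefers-reduced-motion media query and passing the result in.

```cpp
// Returns animation progress in [0, 1] for a transition that started
// `elapsed` seconds ago. The names and the smoothstep curve are illustrative.
float transitionProgress(float elapsed, float duration, bool reducedMotion) {
    // Reduced motion (or a zero-length transition) jumps straight to the end state.
    if (reducedMotion || duration <= 0.0f) return 1.0f;
    float t = elapsed / duration;
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return t * t * (3.0f - 2.0f * t);  // smoothstep easing
}
```

Because the check sits inside the curve rather than in each widget, every eased transition picks up the preference without per-widget code.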
The same projection is machine-readable UI
Here's the part I didn't expect when I started. The semantic tree serializes to compact JSON:
std::string semanticsToJson(const std::vector<SemNode>& nodes);
That JSON is exposed to the page (and to the WASM host) as vel_dump_semantics. It was built for the DOM mirror — but it turns out to be exactly what an AI agent needs to operate the UI: read the interface as a small structured document, find the node labeled "Save" by role and label, act on it by id. No screenshot, no pixel-OCR, no vision model guessing at button boundaries. The accessibility tree and the agent-automation tree are the same tree, because both are asking the same question — "what is actually here and what can I do with it?" — that pixels can't answer.
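To show what "a small structured document" can look like, here is a sketch of a serializer and the role-plus-label lookup an agent would do. The JSON field names, the lowercase role strings, and findByRoleAndLabel are assumptions for illustration; the post only fixes the semanticsToJson signature, not its output schema.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Trimmed copies of the post's types; the real SemNode has more fields.
enum class SemRole { None, Button, TextField };
struct SemNode { SemRole role = SemRole::None; std::string label; int id = 0; };

static const char* roleName(SemRole r) {
    switch (r) {
        case SemRole::Button:    return "button";
        case SemRole::TextField: return "textfield";
        default:                 return "none";
    }
}

// Escapes quotes, backslashes, and control characters for a JSON string body.
static std::string jsonEscape(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
            out += static_cast<char>(c);
        } else if (c < 0x20) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

std::string semanticsToJson(const std::vector<SemNode>& nodes) {
    std::string out = "[";
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        const SemNode& n = nodes[i];
        if (i > 0) out += ',';
        out += "{\"id\":" + std::to_string(n.id)
             + ",\"role\":\"" + roleName(n.role)
             + "\",\"label\":\"" + jsonEscape(n.label) + "\"}";
    }
    return out + "]";
}

// The lookup an agent performs: match on role and label, then act on the id.
const SemNode* findByRoleAndLabel(const std::vector<SemNode>& nodes,
                                  SemRole role, const std::string& label) {
    for (const SemNode& n : nodes)
        if (n.role == role && n.label == label) return &n;
    return nullptr;
}
```

The lookup returns a pointer so a miss is explicit; an agent that can't find "Save" should report that rather than click the nearest guess.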
It also means the initial HTML is crawlable and the UI degrades to something legible with no JS, which I get without any extra work.
What it costs
The honest tradeoffs:
- Widgets have to opt in. A widget that forgets to describe() itself is invisible to the mirror exactly the way the raw canvas is. Accessibility isn't automatic: it's a second method every widget owes, and the discipline to fill it in is on the framework author. The registry widgets do; a careless custom widget might not.
- It's a projection, so it can drift. The semantic rect is the widget's logical bounds, recomputed each frame. If a widget lies about what it is, a screen reader believes the lie. There's no pixel-level verification that the mirror matches what's drawn; they're consistent only because they come from the same node.
- Coordinate-anchored overlays are fiddly. Positioning hidden DOM exactly over GPU-drawn rects, across DPR changes and scrolling, is the kind of thing that's correct until a transform sneaks in.
But the result is the one that mattered: a GPU-rendered UI where the screen reader announces real buttons, Tab lands where you'd expect, CJK input works, and an agent can read the whole interface as JSON — all from one extra description per widget. The canvas accessibility wall isn't a law of physics. It's just what happens when pixels are the only thing you project.
Next on this front: richer live-region announcements (so toasts and async results get spoken), and per-platform native accessibility backends — the same SemNode tree, but feeding macOS's NSAccessibility and Windows UIA directly instead of only the web DOM.