HACKER Q&A
📣 AS_YC

Are we missing a middleware layer between LLM agents and the web?


I’ve been experimenting with browser agents (OpenClaw Browser, agent-browser, Playwright setups with Claude/Cursor).

Even with:

- accessibility snapshots
- element references (E1, E2)
- semantic locators
- session isolation

they still feel fundamentally fragile.

LLMs are reasoning over DOM trees step by step. It works — but barely. Small UI changes break everything.
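For concreteness, this is roughly what that setup looks like today: an isolated Playwright context, role/label-based locators, and an accessibility snapshot handed to the model as its view of the page. A minimal sketch; the URL and field labels are placeholders, not from any real site.

```typescript
import { chromium } from 'playwright';

// Session isolation: each agent run gets its own browser context.
const browser = await chromium.launch();
const context = await browser.newContext();
const page = await context.newPage();

// Placeholder URL and labels, purely illustrative.
await page.goto('https://example.com/login');

// Semantic locators: resolve elements by role/label rather than brittle CSS paths.
await page.getByLabel('Email').fill('agent@example.com');
await page.getByRole('button', { name: 'Sign in' }).click();

// Accessibility snapshot: the pruned tree the agent reasons over
// instead of raw markup.
const snapshot = await page.accessibility.snapshot();
console.log(JSON.stringify(snapshot, null, 2));

await browser.close();
```

Even with all of that in place, the agent is still inferring intent from a rendered tree, which is exactly what breaks when the UI shifts.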

It feels like we’re missing an abstraction layer.

What if instead of agents operating on markup, websites exposed structured “interaction surfaces” — something closer to tools or world models rather than DOM nodes?

Instead of:

- parse DOM
- guess selector
- click element

It would be:

- request action
- receive structured state
- operate over stable semantic primitives
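To make that concrete, here is a hypothetical sketch of what such an interaction surface could look like as a typed contract. Every name in it (InteractionSurface, describe, perform, the field names) is invented for illustration; this is not an existing spec.

```typescript
// A hypothetical contract a site could expose to agents:
// named actions in, structured state out, no selectors anywhere.

type ActionDescriptor = {
  name: string;                       // e.g. "add_to_cart"
  description: string;                // what the action does, in plain language
  params: Record<string, string>;     // parameter names and rough types
};

type ActionResult = {
  ok: boolean;
  state: Record<string, unknown>;     // stable semantic state, not DOM nodes
  availableActions: ActionDescriptor[]; // what the agent can do next
};

interface InteractionSurface {
  // Discover what the current page/state lets an agent do.
  describe(): Promise<ActionDescriptor[]>;

  // Perform an action by name and get the resulting state back,
  // instead of clicking a node and re-parsing the DOM.
  perform(name: string, params: Record<string, unknown>): Promise<ActionResult>;
}
```

In spirit this is an MCP-style tool schema, but hosted by the site itself rather than bolted on by a client-side DOM wrapper.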

Is this already being explored somewhere beyond MCP experiments? Or is everyone still stuck in DOM-land?

Curious if others see the same limitation — and whether a middleware “site-agent” layer makes sense.

Would love to hear your thoughts


  👤 andsoitis Accepted Answer ✓
Another approach is for LLMs to operate software, such as a web browser, like a human would.

For instance, see “Computer use” in the recent Sonnet 4.6 announcement: https://www.anthropic.com/news/claude-sonnet-4-6
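For anyone who hasn't tried it, the loop looks roughly like this with Anthropic's computer-use beta: the model receives screenshots and emits click/type actions that your harness executes. The model string, beta flag, and tool type below are from the earlier Claude 3.5 Sonnet beta, so treat them as assumptions to check against the current docs rather than the announcement's exact API.

```typescript
import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.beta.messages.create({
  model: 'claude-3-5-sonnet-20241022',
  max_tokens: 1024,
  betas: ['computer-use-2024-10-22'],
  tools: [
    {
      type: 'computer_20241022',
      name: 'computer',
      display_width_px: 1280,
      display_height_px: 800,
    },
  ],
  messages: [
    { role: 'user', content: 'Open the pricing page and summarize the tiers.' },
  ],
});

// Each tool_use block in response.content is a UI action (screenshot,
// mouse move, click, type) to run against a real browser or VM; the loop
// then feeds back a tool_result and calls the API again until it finishes.
console.log(response.content);
```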