The AI compute crisis is being solved backwards
Hyperscalers are bolting GPU racks to people's houses to meet demand for AI inference. Consumer devices with capable neural engines are being ignored. There's a better way to think about this.
Last month, SPAN and NVIDIA announced something that deserves more scrutiny than it has received. SPAN, best known for smart electrical panels, launched XFRA: a distributed data centre network that places compute nodes directly in residential homes and small businesses. Not a Raspberry Pi. Not a router. A Dell PowerEdge server loaded with 16 NVIDIA RTX Pro 6000 Blackwell GPUs, 4 AMD EPYC CPUs, and 3TB of RAM, bolted to your house next to the air conditioning unit, drawing power from whatever headroom is left on your residential electrical service.
NVIDIA is a launch partner. PulteGroup, one of America's largest homebuilders, is already designing new homes around it.
This is where the AI capacity race has brought us.
How we got here
The numbers are insane. US data centres consumed 183 terawatt-hours of electricity in 2024, over 4% of the country's total consumption. By 2030, that figure is projected to exceed 9%. The grid infrastructure required to support this scale takes a decade or more to permit and build. Some projects already underway have been waiting years for grid interconnection approval.
Meanwhile, inference, the act of running AI models rather than training them, is set to account for more than half of all AI workloads by 2030. It is not the exotic research problem people imagine. Inference is autocomplete. It is summarisation. It is the chatbot answering your support ticket. It is routine, repetitive, and happening billions of times a day.
So hyperscalers and neoscalers are running out of places to put the compute. Their answer, apparently, is your neighbourhood.
SPAN's pitch is that the average home operates at only 40% of its peak electrical capacity, leaving headroom that can be harvested for inference workloads. Homeowners get a free smart panel and discounted electricity. The AI cloud gets a distributed GPU node. Everyone wins.
Except this is a strange solution to a problem that is partly of the industry's own making.
The device in your pocket already has a neural engine
The device you are reading this on almost certainly has dedicated AI hardware built into it. Apple has shipped Neural Engine silicon in every iPhone since 2017. The latest Apple Silicon chips perform tens of trillions of operations per second on-device. Qualcomm's Snapdragon X Elite, shipping in a growing number of Windows laptops, includes a dedicated NPU rated at 45 TOPS. Intel's Core Ultra series has neural processing units built in. Google's Tensor chips power on-device AI features across the Pixel lineup. Even mid-range Android devices now routinely include machine learning accelerators.
We are not in a world where consumer hardware is incapable of AI inference. We are in a world where the industry has largely ignored that capability in favour of a centralised cloud model that generates more controllable, measurable, billable API calls.
Sending an inference request to a data centre (and now, apparently, to a GPU rack strapped to someone's eaves) is an architectural and economic choice, not a technical necessity. For a large class of workloads, it is not the right choice.
The Chrome backlash has it backwards
This week, Google Chrome became the subject of widespread outrage for silently downloading Gemini Nano, a 4GB on-device language model, to users' computers without clear consent. The file, buried in a folder called OptGuideOnDeviceModel, redownloads itself if you delete it. On systems with multiple user profiles, Chrome installs a separate copy for each one. The coverage has been critical across The Register, TechSpot, and Digital Trends. Hacker News has been reliably furious.
The anger about consent is entirely justified. Silently consuming 4GB of someone's disk, particularly on a budget laptop with a small SSD, without any notification is a legitimate grievance. Google deserves the criticism for how it did this.
But the idea behind it is not wrong.
The alternative to running Gemini Nano locally is sending those same requests (draft emails, browsing summaries, scam detection checks) across the internet to Google's data centres. There, they are processed on GPU clusters drawing hundreds of megawatts, cooled by industrial systems, and served back over a network with non-trivial latency and energy overhead. Every single request.
A 4GB model downloaded once, stored locally, and used indefinitely for on-device inference is, in energy terms, dramatically cheaper than the equivalent cloud workload at population scale. Across Chrome's roughly four billion users, the aggregate energy savings of even a modest shift from cloud to on-device inference are significant.
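The download-once argument comes down to a break-even calculation, which can be sketched in a few lines. This is a rough illustration, not a measurement: the function name, its parameters, and the numbers passed to it are all assumptions made for the example, and real per-request energy figures would have to be measured.

```python
def break_even_requests(download_j: float, local_j: float, cloud_j: float) -> float:
    """Number of requests after which a one-off model download pays for itself.

    download_j: one-off energy cost of fetching and storing the model (joules)
    local_j:    energy per request when run on-device (joules)
    cloud_j:    energy per request when sent to the cloud, network included (joules)
    """
    saving_per_request = cloud_j - local_j
    if saving_per_request <= 0:
        # Local inference is no cheaper per request, so the download never pays off.
        return float("inf")
    return download_j / saving_per_request


# Placeholder inputs chosen only to exercise the function; they are not measurements.
print(break_even_requests(download_j=1000.0, local_j=1.0, cloud_j=5.0))  # prints 250.0
```

The point of the sketch is the shape of the trade-off: the download is a fixed cost, the saving is per request, so the case for on-device inference strengthens with every request made after the break-even point.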
The outrage about disk space is understandable. But the conversation should be about consent and transparency, not about reversing the direction of travel. On-device is the right direction. Sneaking onto people's drives without asking is not.
Applying progressive enhancement to AI
There is a concept in web development called progressive enhancement. The principle is straightforward: start by delivering the best experience the current environment can support, then layer on additional capability as it becomes available. You do not send a full JavaScript-heavy SPA to a browser that cannot handle it. You start with what works where the user is, and enhance from there.
AI workloads could work the same way.
Imagine a tiered model:
Tier 1: On-device. The request is handled entirely locally. No network round-trip. No data leaving the device. Instant response, works offline, zero marginal cost, and full privacy. This is the right tier for the majority of common inference tasks: text summarisation, translation, autocomplete, image tagging, voice recognition, and a rapidly growing list of others. Modern NPUs can handle this today.
Tier 2: Local network. If the on-device model is not sufficient, because the task is too complex, the model too large, or the context too long, the request escalates transparently to a more capable device on the same local network. Your laptop hands off to your home server, a NAS, or a desktop with a discrete GPU a few metres away. Still local. Still private. No data centre involved.
Tier 3: Data centre. Only when the workload genuinely requires it, for training, very large model inference, or tasks with no local analogue, does the request leave your network and head to centralised compute. By this point, you have already filtered out the overwhelming majority of everyday inference requests.
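The three tiers above can be sketched as a small routing function. This is a minimal illustration under stated assumptions: the task names, the token limits, and the `choose_tier` function are all hypothetical, and a real router would derive its limits from the hardware and models actually available.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ON_DEVICE = 1
    LOCAL_NETWORK = 2
    DATA_CENTRE = 3


@dataclass
class Request:
    task: str
    context_tokens: int


# Hypothetical capability limits; real values depend on the hardware and model.
ON_DEVICE_TASKS = {"summarise", "translate", "autocomplete", "tag_image", "transcribe"}
ON_DEVICE_MAX_TOKENS = 4_000
LOCAL_NETWORK_MAX_TOKENS = 32_000


def choose_tier(req: Request, lan_peer_available: bool) -> Tier:
    """Pick the closest tier that can plausibly handle the request."""
    if req.task == "train":
        # Training always goes to centralised compute in this sketch.
        return Tier.DATA_CENTRE
    if req.task in ON_DEVICE_TASKS and req.context_tokens <= ON_DEVICE_MAX_TOKENS:
        return Tier.ON_DEVICE
    if lan_peer_available and req.context_tokens <= LOCAL_NETWORK_MAX_TOKENS:
        return Tier.LOCAL_NETWORK
    return Tier.DATA_CENTRE


print(choose_tier(Request("summarise", 1_200), lan_peer_available=False))
print(choose_tier(Request("summarise", 20_000), lan_peer_available=True))
print(choose_tier(Request("summarise", 20_000), lan_peer_available=False))
```

The checks run from the nearest tier outward, so a request only leaves the device, and then the network, when a nearer tier has been ruled out.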
This is not a radical idea. It is the architecture that already governs caching, CDNs, and edge computing in every other domain of software infrastructure. We built a whole discipline around the principle that compute should happen as close to the consumer as possible. Then we threw all of that out the window for AI.
The waste is real
The current model is not just architecturally inelegant. It is wasteful along almost every axis that matters.
Energy. A simple summarisation request routed to a data centre travels across the internet, through routing infrastructure, to a GPU shared across thousands of concurrent users, then back again. That round trip consumes orders of magnitude more energy than running the same inference on the NPU already sitting inside the device making the request.
Latency. Round-trip network latency is not a solved problem, particularly for users outside major metro areas or in parts of the world with less developed internet infrastructure. On-device inference has, by definition, zero network latency.
Privacy. Every request sent to a cloud endpoint is a potential data exposure event. On-device inference is inherently private: no logs, no telemetry, no terms of service governing what gets done with your query.
Resilience. A system that degrades gracefully under network failure, falling back to local inference when the cloud is unavailable, is a more robust system. The current architecture has a single point of failure built into every interaction.
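Graceful degradation of the kind described under resilience can be sketched as a wrapper that catches network failures and answers locally instead. The function and stub names here are hypothetical, and the two callables stand in for whatever cloud client and on-device model an application actually uses.

```python
from typing import Callable


def infer_with_fallback(
    prompt: str,
    cloud: Callable[[str], str],
    local: Callable[[str], str],
) -> tuple[str, str]:
    """Try the cloud endpoint; on a network-style failure, answer locally.

    Returns the answer together with a label saying which path produced it.
    """
    try:
        return cloud(prompt), "cloud"
    except OSError:
        # ConnectionError and TimeoutError are both subclasses of OSError.
        return local(prompt), "local"


# Stand-in backends for demonstration only.
def offline_cloud(prompt: str) -> str:
    raise ConnectionError("network unreachable")


def tiny_local(prompt: str) -> str:
    return "local answer for: " + prompt


answer, source = infer_with_fallback("summarise this email", offline_cloud, tiny_local)
print(source, "->", answer)
```

A local-first design would invert the order and try the on-device model before anything else; the wrapper shape stays the same either way.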
The right frame is accessibility
SPAN's framing for XFRA is clever: unlocking underutilised grid capacity, a win for homeowners and utilities alike (though we haven't really heard what the utility providers actually think about this yet). There are genuine use cases where distributed edge inference makes sense, particularly for latency-sensitive applications or as a bridge while grid infrastructure catches up to demand.
The question is why we are optimising the supply side of AI compute so aggressively while almost completely ignoring the demand side.
Accessibility thinking, designing for the most constrained user first and then enhancing upward, forces you to ask different questions. Instead of "how do we get more GPU capacity faster?", you ask "why does this inference request need to leave the device at all?" Instead of building distributed GPU networks in residential streets, invest in better on-device model compression, quantisation tooling, and developer APIs that make local inference the default path rather than the exotic edge case.
The industry has the hardware. The M4 MacBook Pro sitting on someone's desk right now is, for a wide range of workloads, a better inference endpoint than a GPU node bolted to the outside of a house three streets away. The software stack that treats it that way largely does not exist yet outside of the hobbyist AI community.
What a saner architecture looks like
None of this requires abandoning data centres. Large model training will always require centralised, high-density compute. There are inference workloads (very long context windows, multimodal tasks at scale, real-time processing of large media streams) that genuinely exceed what local hardware can support today.
But "today" is moving fast. The NPUs shipping in consumer devices in 2026 are dramatically more capable than those from 2023. The trajectory is clear.
A sane architecture for AI compute would:
- Treat on-device inference as the default, not the fallback. If a task can be handled locally, it should be.
- Define transparent escalation criteria: complexity thresholds, context length limits, and latency requirements that determine when a request moves to the next tier, automatically and without user friction.
- Build local network discovery into the AI stack. Devices should be able to collaborate. The capable GPU on your workstation should be a first-class inference endpoint for your phone.
- Reserve cloud compute for workloads that genuinely need it, and when a request does go to the cloud, be honest about the cost in energy, latency, and privacy terms.
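As a rough sketch of how escalation criteria and first-class local endpoints might fit together, the example below picks the nearest endpoint that satisfies a request's context length and latency budget. Everything in it is an assumption made for illustration: the `Endpoint` fields, the device names, and the numbers are hypothetical, and actual network discovery is out of scope, so the endpoint list is taken as already populated.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    location: str  # "device", "lan", or "cloud"
    max_context_tokens: int
    est_latency_ms: float


# Lower rank means closer to the user, and therefore preferred.
PREFERENCE = {"device": 0, "lan": 1, "cloud": 2}


def pick_endpoint(
    endpoints: list[Endpoint], context_tokens: int, latency_budget_ms: float
) -> Optional[Endpoint]:
    """Return the closest endpoint meeting both criteria, or None if none does."""
    candidates = [
        e
        for e in endpoints
        if e.max_context_tokens >= context_tokens
        and e.est_latency_ms <= latency_budget_ms
    ]
    if not candidates:
        return None
    # Prefer nearer locations first, then lower estimated latency.
    return min(candidates, key=lambda e: (PREFERENCE[e.location], e.est_latency_ms))


# Hypothetical endpoints with made-up capabilities.
known = [
    Endpoint("phone-npu", "device", 4_000, 50.0),
    Endpoint("workstation-gpu", "lan", 32_000, 120.0),
    Endpoint("hosted-model", "cloud", 200_000, 400.0),
]

print(pick_endpoint(known, context_tokens=20_000, latency_budget_ms=500.0))
```

Keeping the criteria as plain data makes the escalation decision inspectable: a user or developer can see exactly why a request left the device.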
Google's instinct with Chrome was right. The execution was wrong: silent, unconsented, and resistant to deletion. The lesson is not "do not put AI on devices." It is "ask first, and make the case clearly." The case is strong. A model you download once and run locally is vastly more efficient than one you invoke remotely a hundred times a day.
The AI industry is extraordinarily good at scaling supply. It is considerably less good at questioning whether the demand it is building to serve could be met more efficiently by looking at what is already in people's hands, rather than what can be installed into their (already strained) power grid.
Sources: SPAN/NVIDIA XFRA press release · mGrid XFRA technical analysis · The Register: Chrome silently installs 4GB LLM · 9to5Google: Chrome 4GB AI storage details
This is the approach we took when building Clozyx. AI inference that runs on your device, with no data sent to our servers, is not just better for privacy: it is the architecturally correct choice. Read more about how on-device AI works in Clozyx.
Clozyx is a local-first, AI-powered CRM built on exactly these principles. Join the waitlist.