PDF & Image Tools

Running a background-removal neural net in the browser, with no two devices treated the same

By Swathik · 10 min read
Tags: webgpu, webassembly, machine-learning, privacy
[Image: A phone in sharp focus on a warm wooden desk, with a tablet showing an image in an editing app and a laptop running a photo library behind it]

Every "AI background remover, 100% in the browser" demo you've ever clicked runs beautifully. On the author's MacBook. Then someone opens it on a three-year-old Android, the tab spins for forty seconds, and either the model never loads or the page goes white and the OS quietly reaps it. The author never sees this, because the author has a MacBook, and a MacBook is structurally incapable of reproducing the bug.

I shipped one of these tools. It removes the background from an image entirely on-device: no upload, works offline after the first load. Getting it running on a desktop GPU took an afternoon. Getting the same idea to survive contact with actual phones took weeks, and the fix was never "optimize the model." It was: stop pretending one engine fits every device.

That's the part the demos skip. So here's the real routing, why each branch exists, and the one specific combination of headers and threads that reliably hangs an iPhone.

The afternoon version (and why it lies)

The happy path really is short. transformers.js will load an ONNX segmentation model and run it on WebGPU with fp16 weights in about six lines:

import { AutoModel, AutoProcessor, env } from "@huggingface/transformers";

env.allowLocalModels = false;

const MODEL_ID = "studioludens/birefnet-lite-512"; // ~98MB fp16, Apache-2.0

const processor = await AutoProcessor.from_pretrained(MODEL_ID);
const model = await AutoModel.from_pretrained(MODEL_ID, {
  dtype: "fp16",
  device: "webgpu",
});

On a desktop with a real GPU this flies. The fp16 weights halve both the download and the VRAM compared with fp32, the mask comes out clean, and you feel like a genius. Ship it. Tweet about it.

Then the bug reports land. They're all from phones, and no two of them describe the same failure. One phone loads the model and freezes the entire UI mid-inference. Another never finishes booting the worker at all. A third goes white and reloads itself, like it's embarrassed. None of it reproduces on the machine you built it on. That's the worst flavor of bug, the kind where your test setup physically cannot see what's wrong.

A few things turn out to be true at the same time, and you don't get to ignore any of them:

  • Most phones don't have usable WebGPU in the browser. iOS Safari is the cruel one here: it reports WebGPU support, then hangs during ONNX inference instead of failing loudly. A hang is worse than a crash, because a crash you can at least catch.
  • The fallback, multi-threaded WASM, needs SharedArrayBuffer. Which needs cross-origin isolation. Which is the exact thing that makes some phones refuse to start the worker in the first place.
  • A mid-range phone has a sliver of your laptop's RAM and memory bandwidth, so the tempting "just run it on the main thread" shortcut turns a 4-second job into a 40-second frozen tab.

So the real architecture isn't "the browser version" of anything. It's a router that hands out a different model, a different runtime, a different threading model, and different HTTP headers depending on what's holding the page open.

Routing, not optimizing

The decision happens once, at worker-spawn time, off a user-agent sniff. UA sniffing is deeply unfashionable and I'd take a cleaner signal in a heartbeat. But the failure modes here are device-class-specific (iOS WebKit, Android RAM ceilings), and there is no feature flag for "this phone is about to OOM." So, user-agent it is. I'm not proud, I'm just shipping.

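let sharedWorker: Worker | null = null;

function getSharedWorker(): Worker {
  // Decide once per page and reuse the same worker afterwards.
  if (sharedWorker) return sharedWorker;

  const ua = navigator.userAgent;
  const isPhone = /iPhone|iPod|Android/i.test(ua);

  // iPad reports a Mac-class UA, so detect it by touch + Mac UA, not /iPad/ alone.
  const isIPad =
    /iPad/i.test(ua) ||
    (/Macintosh|Mac OS X/i.test(ua) && navigator.maxTouchPoints > 1);

  const useMobileEngine = isPhone || isIPad;

  // { type: "module" } because both workers use ES imports.
  sharedWorker = useMobileEngine
    ? new Worker(new URL("./bg-remover-mobile-worker.ts", import.meta.url), { type: "module" })
    : new Worker(new URL("./bg-remover-worker.ts", import.meta.url), { type: "module" });
  return sharedWorker;
}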

That iPad branch earns its rent. Modern iPads ship a desktop "Macintosh" user-agent, so /iPad/ matches nothing and /Macintosh/ matches a real Mac and an iPad. The tiebreaker is maxTouchPoints: a real Mac reports 0, an iPad reports more than 1. A touchscreen Windows laptop has a non-Mac UA, so it won't false-positive. It takes three conditions to answer the question "is this an iPad," which sounds insane until you remember Apple is the one who decided iPads should lie about being Macs. I'm just cleaning up after that decision.
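Because this one sniff ends up deciding the worker, the model, and (in the proxy) the headers, it can help to pull it into a pure function you can unit-test with canned user-agent strings instead of a drawer full of devices. A minimal sketch, assuming a hypothetical `classifyDevice` helper that mirrors the regexes above; it is not the author's code, and the UA strings you feed it in tests are illustrative, not exhaustive:

```typescript
type DeviceClass = "desktop" | "ipad" | "phone";

// Hypothetical pure twin of the sniff in getSharedWorker(): it takes the UA and
// touch-point count as arguments so it can run outside a browser.
function classifyDevice(ua: string, maxTouchPoints: number): DeviceClass {
  // Phones name themselves. Checked first because iPhone UAs also contain
  // "like Mac OS X", which would otherwise look Mac-like below.
  if (/iPhone|iPod|Android/i.test(ua)) return "phone";

  // Older iPads say "iPad"; iPadOS sends a desktop Mac UA, so a Mac-looking
  // UA on a multi-touch device is treated as an iPad. Real Macs report 0.
  const macLike = /Macintosh|Mac OS X/i.test(ua);
  if (/iPad/i.test(ua) || (macLike && maxTouchPoints > 1)) return "ipad";

  return "desktop";
}
```

In the page it would be called as `classifyDevice(navigator.userAgent, navigator.maxTouchPoints)`. Both `"phone"` and `"ipad"` map to the mobile worker; only `"phone"` loses the isolation headers.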

Two engines fall out of this.

Desktop: transformers.js on WebGPU

Desktop runs the transformers.js worker and pulls birefnet-lite-512 at fp16, preferring WebGPU and dropping to multi-threaded WASM only on a machine with no usable GPU. The device preference is, literally, a list with a fallback:

// iosForceWasm and hasWebGPU() are defined elsewhere in the desktop worker.
function preferredDevices(): ("webgpu" | "wasm")[] {
  if (iosForceWasm) return ["wasm"];           // iOS: skip WebGPU, it hangs
  return hasWebGPU() ? ["webgpu", "wasm"] : ["wasm"];
}

iosForceWasm is a belt-and-suspenders flag: iOS and iPadOS Safari say yes to WebGPU and then hang. That hang is the whole reason Apple's touch devices never run this worker at all. They get their own engine, which is the next section.
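Turning that list into an actual load is a loop: try each device in order and fall back on failure. Because the failure mode here is a hang rather than an exception, the sketch below also races each attempt against a timeout. `loadWithFallback` and `withTimeout` are hypothetical helpers, and the 60-second default is an arbitrary assumption. A timeout only rejects the promise; it cannot cancel work already stuck inside the runtime, so a truly stuck worker may still need `worker.terminate()` from the main thread.

```typescript
type Device = "webgpu" | "wasm";

// Reject if `p` hasn't settled within `ms`. This does not cancel the work
// behind `p`; it only stops us from waiting on it forever.
function withTimeout<T>(p: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; clear the timer either way.
  return Promise.race([p, timeout]).finally(() => clearTimeout(timer));
}

// Hypothetical: try each device in preference order and return the first that
// loads. Failures and timeouts are collected and reported if every device fails.
async function loadWithFallback<T>(
  devices: Device[],
  load: (device: Device) => Promise<T>,
  timeoutMs = 60000,
): Promise<{ device: Device; value: T }> {
  const errors: string[] = [];
  for (const device of devices) {
    try {
      const value = await withTimeout(load(device), timeoutMs, device);
      return { device, value };
    } catch (err) {
      errors.push(`${device}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`No device could load the model (${errors.join("; ")})`);
}
```

With transformers.js this might be called as `loadWithFallback(preferredDevices(), (device) => AutoModel.from_pretrained(MODEL_ID, { dtype: "fp16", device }))`. It is a backstop, not a substitute for routing iOS away from WebGPU in the first place.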

iPad and phones: onnxruntime-web 1.18, one thread, no transformers.js

iPads and phones get their own worker, their own runtime, and their own pair of models. Not transformers.js. Plain onnxruntime-web, pinned to 1.18, aliased in package.json so two versions of the runtime can live in the same project without fighting:

import * as ort from "ort-stable"; // onnxruntime-web@1.18, NON-threaded build

ort.env.wasm.wasmPaths = "/ort-stable/";
ort.env.wasm.numThreads = 1;        // single thread. on purpose.

const session = await ort.InferenceSession.create(modelUrl, {
  executionProviders: ["wasm"],
  graphOptimizationLevel: "all",
});

numThreads = 1 is not a TODO I forgot to come back to. It's load-bearing. One thread means no SharedArrayBuffer, which means no cross-origin isolation, which means the worker actually boots on the exact phones that rejected the isolated build. The model still runs inside a Web Worker, so the UI never locks up. It just grinds away on a single CPU core instead of several. Slower per image, sure. But it finishes, and it doesn't take the whole tab down with it, which turns out to be the bar that matters.
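What that single-thread session actually consumes is a float tensor. A common layout for segmentation models is planar NCHW with per-channel normalization, and the sketch below converts RGBA pixels (for example from `getImageData` on an `OffscreenCanvas` already resized to the model's input size) into that layout. `rgbaToNCHW` is a hypothetical helper, and the mean/std defaults are the usual ImageNet values, which is an assumption: check each model's preprocessing config, since the ISNet and BiRefNet exports may not share one.

```typescript
// Hypothetical preprocessing: interleaved RGBA bytes -> planar float32 NCHW
// (batch of 1), with each channel scaled to [0, 1] and then normalized.
function rgbaToNCHW(
  rgba: Uint8ClampedArray,
  width: number,
  height: number,
  mean: [number, number, number] = [0.485, 0.456, 0.406], // assumed ImageNet stats
  std: [number, number, number] = [0.229, 0.224, 0.225],
): Float32Array {
  const plane = width * height;
  if (rgba.length !== plane * 4) {
    throw new Error("expected width * height * 4 RGBA bytes");
  }
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    for (let c = 0; c < 3; c++) {
      // Alpha (offset 3) is ignored; channel c goes into its own plane.
      out[c * plane + i] = (rgba[i * 4 + c] / 255 - mean[c]) / std[c];
    }
  }
  return out;
}
```

The result would then be wrapped as `new ort.Tensor("float32", data, [1, 3, height, width])` and passed to `session.run({ [session.inputNames[0]]: tensor })`. The loop allocates only the output buffer, which matters when the whole job is sharing one core and a tight memory budget.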

The model itself also forks by phone:

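type MobileModel = "isnet" | "birefnet";
const UA = navigator.userAgent; // WorkerNavigator exposes the UA inside the worker

const MODEL: MobileModel = /iPhone|iPod/i.test(UA) ? "isnet" : "birefnet";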

iPhone gets imgly's MIT-licensed ISNet (fp16, ~88MB). Android and iPad get BiRefNet-Lite (fp16, ~94MB), since the iPad reports a "Macintosh" UA and falls to the non-iPhone branch. This split is empirical, not some elegant principle: those were the pairings that gave the cleanest masks per platform without blowing the memory budget.

The iPad landed on this lean engine for a different reason than phones: a phone needs it because the isolated, threaded build won't even boot there, while an iPad will boot the heavier transformers.js engine and then run itself out of memory mid-batch. Same destination, two different cliffs.

One gotcha I'll write down so you don't lose an afternoon to it like I did: ISNet runs a sigmoid inside the graph. If you sigmoid the output again on the way out, you get a washed-out, half-transparent mask, and you will swear up and down the model is broken. The model is fine. You did the sigmoid twice. (I did the sigmoid twice.)
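The double-sigmoid bug is easy to see numerically: a probability in [0, 1] pushed through a second sigmoid lands in roughly [0.5, 0.73], so every pixel ends up partly opaque. A minimal sketch of post-processing that applies the sigmoid exactly once, assuming the model emits one value per pixel. `toAlphaMask` and its `outputIsProbability` flag are hypothetical; whether a given BiRefNet export needs the sigmoid depends on whether its graph already ends in one, so check the export rather than assuming.

```typescript
// Logistic sigmoid: maps a raw logit to a probability in (0, 1).
function sigmoid(x: number): number {
  return 1 / (1 + Math.exp(-x));
}

// Hypothetical post-processing: one model output value per pixel -> 8-bit alpha.
// Pass outputIsProbability = true when the graph already ends in a sigmoid
// (as ISNet's does), so the sigmoid is applied exactly once.
function toAlphaMask(output: Float32Array, outputIsProbability: boolean): Uint8ClampedArray {
  const alpha = new Uint8ClampedArray(output.length);
  for (let i = 0; i < output.length; i++) {
    const p = outputIsProbability ? output[i] : sigmoid(output[i]);
    // Uint8ClampedArray clamps to 0..255 on assignment.
    alpha[i] = Math.round(p * 255);
  }
  return alpha;
}
```

For ISNet the call would be `toAlphaMask(output, true)`; calling it with `false` on the same output reproduces the washed-out mask. The mask would still need resizing back to the source image's dimensions before it is applied as an alpha channel.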

The combination that hangs iOS

Here's the trap that ate the most time, because it's the one every tutorial walks you cheerfully into.

To run multi-threaded WASM you need SharedArrayBuffer. To get SharedArrayBuffer you need the page cross-origin isolated, which means serving these two headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: credentialless

Every "run ONNX in the browser with threads" guide tells you to set them. The guides are right, for desktop. The trouble starts when you set them globally and a phone wanders in. Isolation plus a threaded worker is precisely the cocktail that hangs iOS Safari and keeps some Android workers from booting at all. It isn't one header. It isn't one thread. It's the combination of isolation and a threaded worker running on a memory-pinched WebKit or Chrome-on-phone runtime.

So the headers are conditional, set in the proxy by user-agent, and only for the workspace route:

// Runs in the proxy for the workspace route only; `res` is the outgoing response.
const ua = request.headers.get("user-agent") ?? "";
const isPhone = /iPhone|iPod|Android/i.test(ua);

if (!isPhone) {
  // Desktop isolates so threaded WASM / SharedArrayBuffer works. (iPad matches this branch
  // and gets the headers too, but it runs single-threaded, so for it they're a no-op.)
  res.headers.set("Cross-Origin-Opener-Policy", "same-origin");
  res.headers.set("Cross-Origin-Embedder-Policy", "credentialless");
}
// Phones: send NO isolation headers. The single-thread 1.18 build
// needs no SharedArrayBuffer and only boots when un-isolated.

That's the whole secret, and it's a genuinely annoying one: the same headers that switch on the desktop engine break the phone engine. There is no single config that satisfies both. You serve isolation to the devices that need it and you withhold it from the devices it kills, and you make your peace with the asymmetry.
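The proxy rule is small enough to get wrong during a refactor, which is exactly how the "simplify it into one config" regressions happen. A hypothetical pure version makes the asymmetry unit-testable. `WORKSPACE_PREFIX` is an assumed route name, not the site's real path:

```typescript
// Assumed workspace route; replace with the real path the engine runs on.
const WORKSPACE_PREFIX = "/background-remover";

const ISOLATION_HEADERS: Record<string, string> = {
  "Cross-Origin-Opener-Policy": "same-origin",
  "Cross-Origin-Embedder-Policy": "credentialless",
};

// Hypothetical pure form of the proxy rule: which isolation headers a response
// for this path and user agent should carry.
function isolationHeadersFor(pathname: string, ua: string): Record<string, string> {
  // Outside the workspace route nothing is isolated.
  if (!pathname.startsWith(WORKSPACE_PREFIX)) return {};
  // Phones get no isolation: the single-thread 1.18 build only boots un-isolated.
  // iPad sends a Mac UA, so it falls through and gets the headers (unused there).
  if (/iPhone|iPod|Android/i.test(ua)) return {};
  return { ...ISOLATION_HEADERS };
}
```

In the proxy, the result could be applied with `for (const [k, v] of Object.entries(isolationHeadersFor(url.pathname, ua))) res.headers.set(k, v);`.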

I had to carve this rule into my own notes, because I tried to "simplify" it back into one config more than once and re-broke every phone each time: phones and the iPad stay on ONNX 1.18, single-thread, in a worker. Phones stay non-isolated; the iPad gets the isolation headers but doesn't use them. Only the desktop runs the threaded / WebGPU transformers.js engine. Do not merge them. Future me, this means you.
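The same rule can also be checked from the other side at runtime. `crossOriginIsolated` is a standard global in both windows and workers, and `navigator.hardwareConcurrency` reports logical cores. A minimal sketch of a hypothetical `pickWasmThreads` helper that only asks for threads when isolation is actually in effect and never on the mobile engine; the cap of 4 and the one core of headroom are arbitrary assumptions:

```typescript
// Hypothetical thread-count guard. `isolated` would come from the
// crossOriginIsolated global and `cores` from navigator.hardwareConcurrency.
function pickWasmThreads(isolated: boolean, cores: number, mobileEngine: boolean, cap = 4): number {
  // Phones and iPad run the single-thread engine no matter what headers arrived.
  if (mobileEngine) return 1;
  // Without isolation there is no SharedArrayBuffer, so threads are off the table.
  if (!isolated) return 1;
  // Leave one core for the UI thread and clamp to [1, cap].
  return Math.max(1, Math.min(cap, cores - 1));
}
```

On the desktop worker the result would feed the runtime's thread setting (for transformers.js, `env.backends.onnx.wasm.numThreads`), so the requested thread count always matches what the page can actually provide.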

What I'd do differently

A few honest ones, because the writeups that only list wins aren't useful to anybody.

I'd stop trusting supportsWebGPU-style checks a lot sooner. The entire iOS detour exists because Safari answers "yes" to WebGPU and then hangs, which makes a capability check that returns true strictly more dangerous than one that returns false. I treat iOS as WASM-only by policy now, not by feature detection. I paid for that lesson in evenings.

I'd also accept earlier that there's a memory ceiling I simply cannot see from JavaScript. A low-end Android can OOM on cold model load, and there's no error to catch, no event, no nothing. The OS just takes the tab. And I dropped navigator.deviceMemory on purpose, because it's a fingerprinting signal and the entire point of this tool is privacy, so I can't even read the RAM to bow out gracefully. The mitigation is unglamorous: smaller fp16 models, one image at a time, free the WASM-heap tensors the instant inference returns. Nothing clever. Just stop being greedy with memory you can't measure.
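Both halves of that mitigation can be made structural rather than a matter of discipline. The sketch below shows a hypothetical serial queue, so a new image never starts until the previous one has settled, and a hypothetical `withTensors` helper that disposes every tracked tensor in a `finally`, even when inference throws. It assumes the tensor objects expose a `dispose()` method, as onnxruntime-web `Tensor`s do in recent versions; exactly what that frees depends on the runtime version.

```typescript
// Anything with a dispose() method, e.g. a runtime tensor (assumed).
interface HasDispose {
  dispose(): void;
}

// Hypothetical serial queue: each job starts only after the previous one has
// settled, so at most one image's tensors are alive at a time.
function createSerialQueue() {
  let tail: Promise<unknown> = Promise.resolve();
  return function enqueue<T>(job: () => Promise<T>): Promise<T> {
    // Run after the previous job, whether it resolved or rejected.
    const run = tail.then(job, job);
    // Swallow this job's failure on the chain so later jobs still run;
    // the caller still sees the rejection through `run`.
    tail = run.catch(() => undefined);
    return run;
  };
}

// Hypothetical helper: register tensors with `track` as they are created;
// all of them are disposed when `work` settles, even if it throws.
async function withTensors<T>(
  work: (track: <D extends HasDispose>(d: D) => D) => Promise<T>,
): Promise<T> {
  const owned: HasDispose[] = [];
  const track = <D extends HasDispose>(d: D): D => {
    owned.push(d);
    return d;
  };
  try {
    return await work(track);
  } finally {
    // Copy any data you need out of output tensors before `work` resolves.
    for (const d of owned) d.dispose();
  }
}
```

Each image would then be processed as `enqueue(() => withTensors(async (track) => { /* create inputs with track(...), run, copy the mask out */ }))`.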

And the limit nobody puts on the landing page: these lightweight models find a salient subject. Aim one at a clean portrait and it nails it. Aim it at a cluttered gym-mirror selfie and it might hand you a near-empty mask, because it genuinely can't decide who the subject is among the towels, mirrors, and three other people. A bigger model would handle that better. A bigger model is also the exact thing that won't load on the phone you're bending over backwards to support. That tradeoff doesn't resolve. You just choose where to stand on it and say so out loud, which is most of what honest engineering is.
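Since the model won't tell you it gave up, one cheap option is to measure the mask and tell the user when it keeps almost nothing, instead of silently returning a blank cutout. A sketch under stated assumptions: `foregroundCoverage` and `looksLikeNoSubject` are hypothetical helpers, and both the 128 alpha threshold and the 2% minimum coverage are arbitrary starting points, not tuned values.

```typescript
// Hypothetical: fraction of pixels whose alpha is at or above `threshold`.
function foregroundCoverage(alpha: Uint8ClampedArray, threshold = 128): number {
  if (alpha.length === 0) return 0;
  let kept = 0;
  for (let i = 0; i < alpha.length; i++) {
    if (alpha[i] >= threshold) kept++;
  }
  return kept / alpha.length;
}

// Hypothetical: flag masks that keep almost nothing, so the UI can say
// "no clear subject found" rather than hand back an empty image.
function looksLikeNoSubject(alpha: Uint8ClampedArray, minCoverage = 0.02): boolean {
  return foregroundCoverage(alpha) < minCoverage;
}
```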

The takeaway

"AI in the browser" isn't one feature. It's a fan-out. The model is the easy 10%. The other 90% is that the WebGPU desktop engine and the lean single-thread engine for iPad, iPhone, and Android each want a different runtime version and different models, and the desktop and the phones even want mutually exclusive HTTP headers. The only way to know which device you're dealing with is to sniff the user-agent and accept that you are, in fact, treating no two of them the same.

If your in-browser ML demo has exactly one code path, it doesn't run everywhere. It runs on your machine. Those are not the same claim, and the gap between them is most of your users.

Built by Swathik, solo, in India. The tool lives at pdfandimagetools.com/background-remover if you want to open your own Network tab and watch nothing get uploaded.

Every tool on PDF & Image Tools runs entirely in your browser. Your files never leave your device.
