PDF & Image Tools

Compressing to an exact target size in the browser, with a binary search

By Swathik · 10 min read
Tags: algorithms, pdf, javascript, image-processing

Someone fills out a passport application and the upload box says "PDF must be under 200 KB." Their scan is 4 MB. So they search "compress pdf to 200kb", land on the first result, upload their passport scan to a server they have never heard of, and hope for the best.

I wanted to do that last part in the browser, with the file never leaving the device. Shrinking a PDF locally is easy. The word that makes it hard is "exact." "Under 200 KB" is a hard constraint, not a vibe, and the obvious approaches all quietly fail it.

Here is the algorithm I landed on. It is a binary search, but the interesting part is what you search over and why you cannot just compute the answer up front.

You can't know the size until you render

If you have never touched image or PDF compression, the first instinct is a formula. "I want 200 KB, the file is 4 MB, so set quality to 0.05 and ship it." That instinct is wrong, and it is wrong in a way that took me embarrassingly long to accept.

JPEG quality does not move linearly with output size. Dropping from 0.9 to 0.8 might cut you 40%. Dropping from 0.5 to 0.4 might cut 4%. The curve depends entirely on the content. A page of dense text-as-image compresses one way, a beach photo another way, a mostly-white form a third way. There is no closed-form size = f(quality) sitting there waiting to be solved.

So you cannot predict the size. You can only measure it, and measuring means actually re-encoding the thing. Re-encoding a whole multi-page PDF is the expensive part. Re-render every page from scratch on every guess and a 30-page document at six guesses costs you 180 full page renders. On a phone, that is a reliable way to get the tab killed.

The whole trick is separating the two costs:

  • Rendering the PDF pages to pixels. Expensive, do it once.
  • Re-encoding those pixels to JPEG at some quality. Cheap-ish, do it many times.

Once that split clicks, the algorithm basically writes itself.
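To put numbers on that split, here is a tiny counting sketch. The function names are made up for illustration; it only tallies how many page renders and JPEG encodes each strategy needs for the 30-page, six-guess example above.

```javascript
// Naive strategy: every guess re-renders and re-encodes every page.
function naiveCost(pages, guesses) {
  return { renders: pages * guesses, encodes: pages * guesses };
}

// Split strategy: render each page once, then only re-encode per guess.
function splitCost(pages, guesses) {
  return { renders: pages, encodes: pages * guesses };
}

console.log(naiveCost(30, 6)); // { renders: 180, encodes: 180 }
console.log(splitCost(30, 6)); // { renders: 30, encodes: 180 }
```

The encode count does not change. What disappears is 150 of the 180 expensive renders.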

Step 1: render every page once, hold the bitmaps

The "maximum compression" path turns each page into an image and rebuilds the PDF out of JPEGs. (Yes, the text stops being selectable. There is an opt-in OCR layer that draws an invisible recognised-text layer back on top, but that is a separate story for a separate post.) Rasterising is the only mode that gives you a continuous quality knob to search, so it is the only mode that can hit a target.

I render each page with pdf.js at a memory-budgeted ceiling DPI and keep the result as an ImageBitmap. One page at a time, freeing the canvas the instant I am done with it, so peak canvas memory stays pinned to a single page while the DPI budget caps what the held bitmaps add up to:

// Budget the ceiling DPI so all the held bitmaps fit a memory cap.
const budgetDpi = 72 * Math.sqrt(BITMAP_BUDGET_PX / totalAreaPt);
const ceilingDpi = Math.max(48, Math.min(opts.ceilingDpi, budgetDpi));

const bitmaps = [];
for (let i = 1; i <= total; i++) {
  const page = await doc.getPage(i);
  const scale = ceilingDpi / 72;
  const viewport = page.getViewport({ scale });

  const canvas = document.createElement("canvas");
  canvas.width = Math.round(viewport.width);
  canvas.height = Math.round(viewport.height);
  const ctx = canvas.getContext("2d");
  ctx.fillStyle = "#ffffff"; // JPEG is opaque; paint white behind transparency
  ctx.fillRect(0, 0, canvas.width, canvas.height);

  await page.render({ canvas, canvasContext: ctx, viewport }).promise;
  page.cleanup();

  bitmaps.push(await createImageBitmap(canvas));
  canvas.width = canvas.height = 0; // free the backing store now, not at GC's leisure
}

That canvas.width = canvas.height = 0 line looks like cargo-cult superstition. It is not. On iOS Safari, canvas backing stores do not get freed when you would expect them to, and a 30-page render quietly walks you into an out-of-memory kill with no error you can catch. Zeroing the dimensions hands the memory back immediately. Ask me how I found this out. Actually, don't.
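For a feel of what the DPI budget works out to, here is a worked sketch of the same formula as a pure function. The 64-megapixel budget and the A4 page size are assumptions picked for the example, not the values the tool ships with.

```javascript
// Hypothetical budget: 64 million pixels across all held bitmaps.
const BITMAP_BUDGET_PX = 64e6;

// pageSizesPt: [{ width, height }] in PDF points (1 pt = 1/72 inch).
function ceilingDpiFor(pageSizesPt, requestedDpi, budgetPx = BITMAP_BUDGET_PX) {
  const totalAreaPt = pageSizesPt.reduce((a, p) => a + p.width * p.height, 0);
  // A page rendered at `dpi` covers areaPt * (dpi / 72)^2 pixels, so solve
  // totalAreaPt * (dpi / 72)^2 = budgetPx for dpi.
  const budgetDpi = 72 * Math.sqrt(budgetPx / totalAreaPt);
  // The 48 DPI floor wins over the budget, so a huge document can still
  // exceed the budget rather than render unreadably small.
  return Math.max(48, Math.min(requestedDpi, budgetDpi));
}

const a4 = { width: 595, height: 842 };
const thirtyPages = Array.from({ length: 30 }, () => a4);

console.log(ceilingDpiFor([a4], 150));        // one page: the requested DPI fits
console.log(ceilingDpiFor(thirtyPages, 150)); // thirty pages: the budget trims it slightly
```

With these assumed numbers a single page renders at the requested 150 DPI, and thirty pages get trimmed to a little under that.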

Now I have an array of bitmaps sitting in memory. Every guess from here is just "draw bitmap to a scratch canvas at some scale, encode JPEG at some quality, read off the size." No more page rendering. That is the whole point.
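The "draw bitmap to a scratch canvas at some scale" step might look roughly like this. It is a sketch, not the tool's actual helper: this version takes the canvas as a third argument so it can be exercised outside a browser, whereas the snippets in this post assume a shared scratch canvas.

```javascript
// Draw `bitmap` onto `canvas`, downsampled by `scale` (0 < scale <= 1).
// `bitmap` is anything drawImage accepts that has width and height,
// such as an ImageBitmap.
function drawScaled(bitmap, scale, canvas) {
  const w = Math.max(1, Math.round(bitmap.width * scale));
  const h = Math.max(1, Math.round(bitmap.height * scale));
  canvas.width = w;  // assigning dimensions also clears the canvas
  canvas.height = h;
  const ctx = canvas.getContext("2d");
  ctx.imageSmoothingEnabled = true;
  ctx.imageSmoothingQuality = "high"; // ignored where unsupported
  ctx.drawImage(bitmap, 0, 0, w, h);
  return canvas;
}

// In a browser, usage would be along the lines of:
//   const scratch = document.createElement("canvas");
//   drawScaled(bitmaps[0], 0.72, scratch);
```

The bitmaps are rendered onto white before they are captured, so the scratch canvas needs no background fill of its own.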

Step 2: estimate cheaply, before you commit

For a single guess, I do not want to build an entire PDF just to learn it came out too big. Building the PDF (embedding the JPEGs, writing the object table) carries its own overhead. So the search runs on a cheap estimate: sum each page's standalone JPEG size, then add a modeled container overhead.

async function canvasJpegSize(canvas, q) {
  const blob = await new Promise((res) =>
    canvas.toBlob(res, "image/jpeg", q)
  );
  return blob ? blob.size : Number.MAX_SAFE_INTEGER;
}

async function estimate(scale, q) {
  let sum = 0;
  for (const bm of bitmaps) {
    drawScaled(bm, scale);              // bitmap -> scratch canvas at `scale`
    sum += await canvasJpegSize(scratch, q);
  }
  // Modeled container overhead: xref table + per-image stream dicts, roughly
  const containerOver = 1024 + bitmaps.length * 400;
  return sum + containerOver;
}

toBlob re-encodes straight from the already-rendered bitmap, which is fast. The container overhead is modeled rather than measured (1024 + pages * 400 bytes, in the right ballpark) because the search only needs to land close. The real guarantee comes later, and it does not trust this number.
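As a quick sanity check on the overhead model, here it is as a pure function with a worked number. The helper names are mine, and the 60,000-byte page sizes are invented for the example.

```javascript
// Modeled container overhead from this post: a fixed base plus a per-page cost.
function containerOverhead(pages) {
  return 1024 + pages * 400;
}

// pageJpegSizes: byte size of each page encoded as a standalone JPEG.
function estimateFromSizes(pageJpegSizes) {
  const sum = pageJpegSizes.reduce((a, b) => a + b, 0);
  return sum + containerOverhead(pageJpegSizes.length);
}

// Three pages at 60,000 bytes each:
console.log(estimateFromSizes([60000, 60000, 60000])); // 182224
```

For image-heavy pages the overhead term is close to a rounding error (about 2 KB against 180 KB here), which is why a rough model is good enough for steering the search.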

Step 3: the actual binary search

Two knobs move the size: dimension scale (how much to downsample the held bitmap) and JPEG quality. I want the best-looking file that still fits, so the preference order is largest scale first (keep resolution as long as you can), and within a scale, the highest quality that fits.

So I walk a ladder of scales from 1.0 downward. For each scale, I first check whether even the lowest quality fits the target. If it does not, that scale is hopeless, skip it. If it can fit, binary-search the largest quality whose estimate lands at or under the target:

const SCALE_LADDER = [1.0, 0.85, 0.72, 0.6, 0.5, 0.42, 0.34, 0.27, 0.2, 0.15];

let pick = null;
for (const scale of SCALE_LADDER) {
  const estMin = await estimate(scale, minQ);
  if (estMin > target - tolerance) continue; // even lowest quality overshoots

  // This scale fits. Binary-search the highest quality that still fits.
  let lo = minQ, hi = maxQ;
  if ((await estimate(scale, maxQ)) <= target - tolerance) {
    lo = maxQ; // even max quality fits, take it
  } else {
    for (let k = 0; k < MAX_Q_ITERS; k++) { // ~6 iterations
      const mid = (lo + hi) / 2;
      if ((await estimate(scale, mid)) <= target - tolerance) lo = mid;
      else hi = mid;
    }
  }
  pick = { scale, q: lo };
  break;
}

Six iterations over the [0.4, 0.92] quality range narrow you to within roughly 0.008 of the boundary: the interval halves each time, so (0.92 - 0.4) / 2^6 ≈ 0.008. That is plenty. JPEG quality steps finer than that are not visually meaningful, so chasing them just burns iterations. The tolerance (a few KB below target) keeps the result off the exact line, which matters the moment you remember the estimate is not the real size.
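If you want to poke at the quality search in isolation, here is a synchronous sketch of the inner loop run against a made-up size curve. The curve, the function name, and the 4 KB tolerance are all assumptions for the demo; the real search awaits toBlob-based estimates instead.

```javascript
// Returns the highest q in [minQ, maxQ] (to within the iteration precision)
// with sizeAt(q) <= limit, or null if even minQ overshoots.
// Assumes sizeAt never decreases as q increases.
function highestQualityThatFits(sizeAt, limit, minQ = 0.4, maxQ = 0.92, iters = 6) {
  if (sizeAt(minQ) > limit) return null;   // this scale is hopeless
  if (sizeAt(maxQ) <= limit) return maxQ;  // even max quality fits
  let lo = minQ; // invariant: sizeAt(lo) fits
  let hi = maxQ; // invariant: sizeAt(hi) does not
  for (let k = 0; k < iters; k++) {
    const mid = (lo + hi) / 2;
    if (sizeAt(mid) <= limit) lo = mid;
    else hi = mid;
  }
  return lo;
}

// Invented convex curve standing in for real measurements.
const fakeSize = (q) => 50_000 + 400_000 * q * q;
const limit = 200_000 - 4_000; // target minus tolerance

const q = highestQualityThatFits(fakeSize, limit);
console.log(q, fakeSize(q)); // q comes out near 0.603, just under the limit
```

Returning lo rather than mid is what keeps the answer on the safe side: lo is always a quality that was actually measured to fit.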

Step 4: verify for real, because estimates lie a little

This is the part that separates a demo from something you can actually ship. The estimate is a sum of standalone JPEG sizes plus a guessed container overhead. The real PDF, after pdf-lib embeds everything and writes the cross-reference table with object streams, is not exactly that number. Usually close. Sometimes a hair over.

If I trusted the estimate and the real file came out at 203 KB against a 200 KB target, I would have broken the one promise the user actually cares about. The upload box does not grade on a curve. So after the search picks a (scale, quality), I do real pdf-lib saves and step down until the measured bytes genuinely fit:

let chosen = pick ?? smallestFallback;
// Track the ladder position so "drop resolution" knows where to go next.
// Assumes chosen.scale is one of the SCALE_LADDER values.
let scaleIdx = SCALE_LADDER.indexOf(chosen.scale);
let bytes = await realSave(chosen.scale, chosen.q);
let saves = 1;

while (bytes.byteLength > target && saves < MAX_REAL_SAVES) {
  if (chosen.q > qFloor + 0.06) {
    // drop quality first, but never below the floor
    chosen = { scale: chosen.scale, q: Math.max(qFloor, chosen.q - 0.12) };
  } else if (scaleIdx < SCALE_LADDER.length - 1) {
    chosen = { scale: SCALE_LADDER[++scaleIdx], q: maxQ };  // then drop resolution
  } else {
    break; // already at the floor; this is as small as it gets
  }
  bytes = await realSave(chosen.scale, chosen.q);
  saves++;
}

Total cost: one render pass, a handful of cheap estimates during the search (each one a toBlob per page), and at most six real saves (the MAX_REAL_SAVES cap). On a typical document it is one or two real saves, because the estimate was close to begin with. The verify loop is insurance, not the main road.
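The verify loop is easier to test if the save function is passed in. Below is a self-contained sketch along those lines: verifyAndStepDown, its options object, and the fake save are hypothetical names for the demo, and the quality clamp at the floor is my addition.

```javascript
const SCALE_LADDER = [1.0, 0.85, 0.72, 0.6, 0.5, 0.42, 0.34, 0.27, 0.2, 0.15];

// start: { scaleIdx, q }. realSave(scale, q) resolves to something with a
// byteLength, e.g. the Uint8Array a real PDF save would produce.
async function verifyAndStepDown(start, realSave, target, opts = {}) {
  const { qFloor = 0.4, maxQ = 0.92, qStep = 0.12, maxRealSaves = 6 } = opts;
  let { scaleIdx, q } = start;
  let bytes = await realSave(SCALE_LADDER[scaleIdx], q);
  let saves = 1;

  while (bytes.byteLength > target && saves < maxRealSaves) {
    if (q > qFloor + qStep / 2) {
      q = Math.max(qFloor, q - qStep); // drop quality first, clamped to the floor
    } else if (scaleIdx < SCALE_LADDER.length - 1) {
      scaleIdx++;                      // then drop resolution
      q = maxQ;
    } else {
      break;                           // bottom of the ladder
    }
    bytes = await realSave(SCALE_LADDER[scaleIdx], q);
    saves++;
  }

  return {
    bytes, q, saves,
    scale: SCALE_LADDER[scaleIdx],
    reachedTarget: bytes.byteLength <= target,
  };
}

// Stand-in for a real save: size grows with pixel area and quality.
const fakeSave = async (scale, q) => ({
  byteLength: Math.round(scale * scale * q * 300000),
});

verifyAndStepDown({ scaleIdx: 0, q: 0.7 }, fakeSave, 200000).then((r) =>
  console.log(r.saves, r.reachedTarget) // 2 true
);
```

With the fake save, the first attempt lands at 210,000 bytes, one quality step brings it to 174,000, and the loop stops after two saves.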

The honesty rules (the part I'm actually proud of)

Compression algorithms are easy to make look impressive and dishonest. Two rules keep this one straight.

Rule 1: never hand back a bigger file. It sounds absurd that "compression" could inflate a file, but it absolutely can. A crisp text-based PDF rasterised into JPEGs can come out larger than the original, because you threw away the efficient text representation and replaced it with photographs of letters. So there is a guard. If the output comes out at 97% or more of the input size, throw the result away and return the original bytes with a note that it is already about as small as it gets. Sometimes a no-op is the correct answer.

const KEEP_ORIGINAL_THRESHOLD = 0.97;
if (outputSize >= inputSize * KEEP_ORIGINAL_THRESHOLD) {
  return { blob: originalBlob, unchanged: true,
           note: "Already optimised. Nothing more to compress." };
}

Rule 2: when the target is impossible, say so plainly. Sometimes 200 KB just is not happening. A 50-page document has a floor, and below some point you are encoding mush. The search bottoms out at the smallest scale and the lowest quality, and that is the floor, full stop. When the verify loop hits the bottom of the ladder and still cannot meet the target, I do not quietly return something twice the size and call it done. I return the smallest achievable file and tell the truth:

const reachedTarget = size <= target;
// ...
if (!reachedTarget) {
  result.note = `Smallest we can make this is ${formatBytes(size)}.`;
}

"The smallest we can make this is 340 KB" is a genuinely useful sentence. It tells the user to drop a page, or split the PDF, or just accept it. A spinner that runs forever, or a 600 KB file labeled "done" when they asked for 200, helps nobody and burns the trust you needed to keep them.
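The formatBytes helper in that snippet is not shown in this post. A minimal version could look like the sketch below, assuming 1 KB = 1024 bytes and whole-number KB, which may not match what the tool actually prints.

```javascript
// Human-readable byte count. Assumes binary units (1 KB = 1024 bytes).
function formatBytes(n) {
  if (n < 1024) return `${n} B`;
  const kb = Math.round(n / 1024);
  if (kb < 1024) return `${kb} KB`;
  return `${(n / (1024 * 1024)).toFixed(1)} MB`;
}

console.log(formatBytes(348160));          // "340 KB"
console.log(formatBytes(4 * 1024 * 1024)); // "4.0 MB"
```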

What I'd do differently

A few honest dents I will own:

  • It runs on the main thread. The whole loop lives on the UI thread today, kept survivable by one-page-at-a-time rendering, a 16-megapixel canvas clamp, and freeing canvases eagerly. The proper home for this is an OffscreenCanvas worker, and I shaped the function signature so that move is a drop-in later. I have not actually needed it yet, so per my own "do not add a worker until you measure a freeze" rule, I have not. Future me may disagree, loudly, from inside a janky tab.
  • The container-overhead model is a fudge. 1024 + pages * 400 is empirical, not derived. It is close enough that the search lands tight and the verify loop mops up the rest, but a from-first-principles estimate would shave a real save or two off the worst case.
  • The scale ladder is hand-tuned. Those ten values are a sensible roughly-geometric sequence I picked by eye. A content-aware first guess (seed the search near where similar documents tend to land) would converge faster, but it would add state and complexity to a loop that already finishes in well under a second on most files. Not worth it yet.
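For comparison with the hand-tuned ladder, a strictly geometric one between the same endpoints is easy to generate. This is only a sketch for eyeballing the difference. The function name is made up, and nothing here suggests the generated values would look better than the hand-picked ones.

```javascript
// Build `steps` scales from `top` down to `bottom`, each a constant ratio
// below the previous one, rounded to two decimals.
function geometricLadder(top, bottom, steps) {
  const ratio = Math.pow(bottom / top, 1 / (steps - 1));
  return Array.from({ length: steps }, (_, i) =>
    Number((top * Math.pow(ratio, i)).toFixed(2))
  );
}

console.log(geometricLadder(1.0, 0.15, 10));
```

A constant ratio from 1.0 to 0.15 in ten steps works out to about 0.81 per step. The hand-tuned ladder takes gentler steps near the top (around 0.85) and steeper ones near the bottom (around 0.75), spending more rungs where resolution loss is most visible.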

The shape of the solution travels well past PDFs, though. Any time you are hitting a target on an output you can only measure and not predict, the pattern holds: do the expensive part once, search over a cheap proxy, then verify the real thing and step down until you are honestly under the line. And when you cannot get under the line, the most useful thing your code can do is admit it out loud.

The version I described runs entirely client-side, no upload, on pdfandimagetools.com if you want to throw a weird PDF at it and watch the Network tab stay empty. Try to break the target search. I would genuinely like to know what does.

