Tutorial: Pagination and backoff for large result sets

When you need to walk a large result set — say, every transaction of the last 12 months — naive pagination will trip over rate limits and consistency issues. This tutorial gives you a battle-tested loop that:

  • Uses the maximum page size (100).
  • Honors Retry-After on 429.
  • Survives transient 5xx with exponential backoff.
  • Avoids reading duplicates when records are inserted while the loop is running (the cursor-style variant below handles this case).

Read Pagination and Rate limits for the policies first; this is the practical recipe.

The robust loop

// Node.js — works in any modern runtime
const BASE = "https://api.cardda.com";
const PAGE = 100;
const MAX_RETRIES = 6;

async function* paginate({ apiKey, companyId, path, query = {} }) {
  let start = 0;
  let attempt = 0;

  while (true) {
    const url = new URL(BASE + path);
    for (const [k, v] of Object.entries(query)) url.searchParams.set(k, v);
    url.searchParams.set("_start", start);
    url.searchParams.set("_end", start + PAGE);

    const res = await fetch(url, {
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "company-id": companyId,
      },
    });

    if (res.status === 429) {
      // Retry-After may be an HTTP-date rather than seconds; Number() would
      // yield NaN and setTimeout(NaN) fires immediately, so fall back to 1s.
      const wait = Number(res.headers.get("Retry-After")) || 1;
      await sleep(wait * 1000);
      continue; // do not bump attempt — rate limit is not a "failure"
    }

    if (res.status >= 500 && res.status <= 599) {
      attempt += 1;
      if (attempt >= MAX_RETRIES) throw new Error(`Gave up after ${MAX_RETRIES} 5xx`);
      const baseSec = Math.min(2 ** attempt, 30);
      await sleep((baseSec + Math.random() * 0.3 * baseSec) * 1000);
      continue;
    }

    if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);

    attempt = 0;
    const items = await res.json();
    if (items.length === 0) return;
    yield* items;
    if (items.length < PAGE) return; // last page
    start += PAGE;
  }
}

const sleep = ms => new Promise(r => setTimeout(r, ms));

Use it:

for await (const tx of paginate({
  apiKey: process.env.API_KEY,
  companyId: process.env.COMPANY_ID,
  path: "/v1/banking/bank_transactions",
  query: { "status": "authorized", "_order": "ASC", "_field": "created_at" },
})) {
  console.log(tx.id, tx.amount);
}

The cursor-style alternative

Offset pagination has one fundamental problem: if rows are inserted or deleted before your current offset while you walk, the pages shift underneath you, so you can read the same record twice or skip one entirely. For long-running backfills, switch to cursor style: paginate by created_at ascending with id as a tie-breaker, and remember both the last seen timestamp and the last seen id. Using created_at alone can drop records when several rows share the same timestamp at a page boundary.

import requests, time

def paginate_cursor(api_key, company_id, base_path, query=None):
    H = {"Authorization": f"Bearer {api_key}", "company-id": company_id}
    page = 100
    last_created_at = "1970-01-01T00:00:00Z"  # watermark from last successful run
    last_id = None                             # tie-breaker for rows sharing a timestamp

    while True:
        params = dict(query or {})
        # Cursor filter: $gte on created_at so rows sharing the watermark
        # timestamp are not skipped, plus $gt on id to advance past ties.
        # Caveat: the unconditional id[$gt] assumes ids grow with created_at
        # (e.g. sequential ids). With non-monotonic ids (UUIDs) it would skip
        # rows; drop it and rely on the client-side dedupe below instead.
        params["created_at[$gte]"] = last_created_at
        if last_id is not None:
            params["id[$gt]"] = last_id
        params["_start"] = 0
        params["_end"] = page
        params["_order"] = "ASC"
        params["_field"] = "created_at"  # API also tie-breaks on id server-side

        r = requests.get(f"https://api.cardda.com{base_path}", headers=H, params=params, timeout=30)
        if r.status_code == 429:
            # Retry-After may be an HTTP-date rather than seconds; fall back to 1s.
            retry_after = r.headers.get("Retry-After", "1")
            time.sleep(int(retry_after) if retry_after.isdigit() else 1)
            continue
        r.raise_for_status()
        items = r.json()
        if not items:
            return
        for item in items:
            # Defensive dedupe: skip the row we used as the cursor on the previous page.
            if item["created_at"] == last_created_at and last_id is not None and item["id"] <= last_id:
                continue
            yield item
            last_created_at = item["created_at"]
            last_id = item["id"]
        if len(items) < page:
            return

Why this is better:

  • Idempotent. Replaying from the last cursor never reads a record twice.
  • Resumable. Save (last_created_at, last_id) to disk; restart from there after a crash.
  • Avoids deep offsets. Some databases struggle past OFFSET 100000; cursor pagination keeps every query O(page_size).
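
To make the resumability point concrete, here is a minimal checkpoint sketch. The cursor.json file name and JSON layout are illustrative choices, not part of the API; note that paginate_cursor above hardcodes its starting watermark, so to resume you would pass the loaded values in as parameters.

import json, os, tempfile

CHECKPOINT = "cursor.json"  # illustrative path, not an API concept

def load_cursor():
    # First run: start from the epoch with no id tie-breaker.
    if not os.path.exists(CHECKPOINT):
        return "1970-01-01T00:00:00Z", None
    with open(CHECKPOINT) as f:
        state = json.load(f)
    return state["last_created_at"], state["last_id"]

def save_cursor(last_created_at, last_id):
    # Write to a temp file, then rename: os.replace is atomic, so a crash
    # mid-write never leaves a corrupt checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump({"last_created_at": last_created_at, "last_id": last_id}, f)
    os.replace(tmp, CHECKPOINT)

Call save_cursor after each fully processed page; saving per item is safer after a crash but costs one write per record.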

Estimating "how big is this query?"

Read X-Total-Count from the first response:

curl -i "https://api.cardda.com/v1/banking/bank_transactions?_start=0&_end=1" ...
HTTP/1.1 200 OK
X-Total-Count: 12345
Content-Range: items 0-0/12345

Use the value to size your worker pool, set a progress bar, etc.
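
A sketch of that sizing step, using the same requests setup as the cursor example (the count_items helper name is ours):

import math, os, requests

def count_items(api_key, company_id, base_path):
    H = {"Authorization": f"Bearer {api_key}", "company-id": company_id}
    # Ask for a single record purely to read the X-Total-Count header.
    r = requests.get(f"https://api.cardda.com{base_path}", headers=H,
                     params={"_start": 0, "_end": 1}, timeout=30)
    r.raise_for_status()
    return int(r.headers["X-Total-Count"])

total = count_items(os.environ["API_KEY"], os.environ["COMPANY_ID"],
                    "/v1/banking/bank_transactions")
print(f"{total} items, {math.ceil(total / 100)} pages at the maximum page size")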

Parallelism

If you have multiple machines crunching the same dataset:

# Worker N takes pages where (page_index % n_workers == worker_id)
def parallel_paginate(api_key, company_id, base_path, worker_id, n_workers):
    H = {"Authorization": f"Bearer {api_key}", "company-id": company_id}
    PAGE = 100
    page_index = 0
    while True:
        if page_index % n_workers != worker_id:
            page_index += 1
            continue
        start = page_index * PAGE
        r = requests.get(f"https://api.cardda.com{base_path}", headers=H,
                         params={"_start": start, "_end": start + PAGE}, timeout=30)
        r.raise_for_status()  # add the 429/5xx handling from above in production
        items = r.json()
        yield from items
        if len(items) < PAGE:
            return  # a short page means this worker is past the end
        page_index += 1

Two pitfalls to avoid:

  • Rate-limit buckets are keyed by company-id, not by worker. All workers handling the same company-id share a single 10 req/s budget; coordinate globally instead of assuming each worker has its own headroom (see the token-bucket sketch after this list).
  • Don't try to "speed up" with smaller page sizes. 100 records in one round-trip is much cheaper than 100 round-trips of one record.
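
One way to coalesce is an in-process token bucket, sketched below. The 10 req/s figure is the shared budget described above; if your workers run on separate machines, you would need a shared store such as Redis instead of a threading lock.

import threading, time

class TokenBucket:
    # Blocks callers so combined throughput stays under rate_per_sec.
    def __init__(self, rate_per_sec=10, capacity=10):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill in proportion to elapsed time, capped at capacity.
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
            time.sleep(0.01)  # bucket empty; wait briefly and re-check

bucket = TokenBucket()
# Call bucket.acquire() right before every request so all threads hitting
# the same company-id draw from one 10 req/s budget.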

Quick reference

Symptom                                 | Cause                                        | Fix
Loop hangs forever                      | You forgot to bail when items.length < PAGE  | Add the early return.
Duplicate records                       | Records inserted at the front during walk    | Switch to cursor pagination.
429 storms                              | Multiple workers, same company-id            | Coalesce through a token bucket on your side.
Content-Range: items 0--1/N (negative)  | _end < _start                                | Validate before the request.
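
For that last row, a small guard (the check_range name is ours) catches the inverted range before the request goes out:

def check_range(start, end):
    # Per the table above, _end < _start yields a negative Content-Range,
    # so validate the window before spending a request on it.
    if not (0 <= start < end):
        raise ValueError(f"bad page range: _start={start}, _end={end}")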

Related