Tutorial: Pagination and backoff for large result sets
When you need to walk a large result set — say, every transaction of the last 12 months — naive pagination will trip over rate limits and consistency issues. This tutorial gives you a battle-tested loop that:
- Uses the maximum page size (100).
- Honors `Retry-After` on `429`.
- Survives transient `5xx` with exponential backoff.
- Avoids reading duplicates when records are inserted while the loop is running.
Read Pagination and Rate limits for the policies first; this is the practical recipe.
The robust loop
// Node.js — works in any modern runtime
const BASE = "https://api.cardda.com";
const PAGE = 100;
const MAX_RETRIES = 6;
async function* paginate({ apiKey, companyId, path, query = {} }) {
  let start = 0;
  let attempt = 0;
  while (true) {
    const url = new URL(BASE + path);
    for (const [k, v] of Object.entries(query)) url.searchParams.set(k, v);
    url.searchParams.set("_start", start);
    url.searchParams.set("_end", start + PAGE);
    const res = await fetch(url, {
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "company-id": companyId,
      },
    });
    if (res.status === 429) {
      const wait = Number(res.headers.get("Retry-After") ?? "1");
      await sleep(wait * 1000);
      continue; // do not bump attempt — rate limit is not a "failure"
    }
    if (res.status >= 500 && res.status <= 599) {
      attempt += 1;
      if (attempt >= MAX_RETRIES) throw new Error(`Gave up after ${MAX_RETRIES} 5xx`);
      const baseSec = Math.min(2 ** attempt, 30);
      await sleep((baseSec + Math.random() * 0.3 * baseSec) * 1000);
      continue;
    }
    if (!res.ok) throw new Error(`HTTP ${res.status}: ${await res.text()}`);
    attempt = 0;
    const items = await res.json();
    if (items.length === 0) return;
    yield* items;
    if (items.length < PAGE) return; // last page
    start += PAGE;
  }
}
const sleep = ms => new Promise(r => setTimeout(r, ms));

Use it:
for await (const tx of paginate({
  apiKey: process.env.API_KEY,
  companyId: process.env.COMPANY_ID,
  path: "/v1/banking/bank_transactions",
  query: { "status": "authorized", "_order": "ASC", "_field": "created_at" },
})) {
  console.log(tx.id, tx.amount);
}

The cursor-style alternative
Offset pagination has one problem: if records are inserted at the front while you walk, you can read the same record twice. For long-running backfills, switch to cursor style — paginate by created_at ascending with id as a tie-breaker, and remember both the last seen timestamp and the last seen id. Using created_at alone can drop records when several rows share the same timestamp at a page boundary.
import requests, time
def paginate_cursor(api_key, company_id, base_path, query=None):
    H = {"Authorization": f"Bearer {api_key}", "company-id": company_id}
    page = 100
    last_created_at = "1970-01-01T00:00:00Z"  # watermark from last successful run
    last_id = None  # tie-breaker for rows sharing a timestamp
    while True:
        params = dict(query or {})
        # Use $gte on created_at and $gt on id when timestamps tie; this keeps
        # rows with identical created_at from being skipped at page boundaries.
        params["created_at[$gte]"] = last_created_at
        if last_id is not None:
            params["id[$gt]"] = last_id
        params["_start"] = 0
        params["_end"] = page
        params["_order"] = "ASC"
        params["_field"] = "created_at"  # API also tie-breaks on id server-side
        r = requests.get(f"https://api.cardda.com{base_path}", headers=H, params=params, timeout=30)
        if r.status_code == 429:
            time.sleep(int(r.headers.get("Retry-After", "1")))
            continue
        r.raise_for_status()
        items = r.json()
        if not items:
            return
        for item in items:
            # Defensive dedupe: skip the row we used as the cursor on the previous page.
            if item["created_at"] == last_created_at and last_id is not None and item["id"] <= last_id:
                continue
            yield item
            last_created_at = item["created_at"]
            last_id = item["id"]
        if len(items) < page:
            return

Why this is better:
- Idempotent. Replaying from the last cursor never reads a record twice.
- Resumable. Save `(last_created_at, last_id)` to disk and restart from there after a crash (see the sketch below).
- Avoids deep offsets. Some databases struggle past `OFFSET 100000`; cursor pagination keeps every query O(page_size).
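A minimal checkpointing sketch of that resumability is shown here. The `cursor.json` file name is a placeholder, and it assumes you seed `last_created_at` / `last_id` in `paginate_cursor` from the saved values instead of the hard-coded epoch watermark:

```python
import json
import os

CURSOR_FILE = "cursor.json"  # hypothetical path for the persisted watermark

def load_cursor():
    # First run: fall back to the epoch watermark used in paginate_cursor.
    if not os.path.exists(CURSOR_FILE):
        return "1970-01-01T00:00:00Z", None
    with open(CURSOR_FILE) as f:
        saved = json.load(f)
    return saved["last_created_at"], saved["last_id"]

def save_cursor(last_created_at, last_id):
    # Write to a temp file and rename so a crash mid-write cannot corrupt the cursor.
    tmp = CURSOR_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last_created_at": last_created_at, "last_id": last_id}, f)
    os.replace(tmp, CURSOR_FILE)
```

Call `save_cursor(item["created_at"], item["id"])` after each record you process; `load_cursor()` on the next run gives you the point to resume from.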
Estimating "how big is this query?"
Read X-Total-Count from the first response:
curl -i "https://api.cardda.com/v1/banking/bank_transactions?_start=0&_end=1" ...

HTTP/1.1 200 OK
X-Total-Count: 12345
Content-Range: items 0-0/12345

Use the value to size your worker pool, set a progress bar, etc.
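To read it programmatically, a sketch along these lines works; the `X-Total-Count` header is the one shown above, while the helper name and the sizing math are illustrative:

```python
import math
import requests

def count_matching(api_key, company_id, base_path, query=None):
    """Fetch a single row and return (total_rows, total_pages) from X-Total-Count."""
    params = dict(query or {}, _start=0, _end=1)
    r = requests.get(
        f"https://api.cardda.com{base_path}",
        headers={"Authorization": f"Bearer {api_key}", "company-id": company_id},
        params=params,
        timeout=30,
    )
    r.raise_for_status()
    total = int(r.headers["X-Total-Count"])
    return total, math.ceil(total / 100)  # 100 = max page size
```

`total` drives the progress bar; `pages` tells you how many chunks to hand out to workers.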
Parallelism
If you have multiple machines crunching the same dataset:
# Worker N takes pages where (page_index % N == worker_id)
def parallel_paginate(api_key, company_id, base_path, worker_id, n_workers):
    PAGE = 100
    page_index = 0
    while True:
        if page_index % n_workers != worker_id:
            page_index += 1
            continue
        # fetch page page_index ...
        page_index += 1

Two pitfalls to avoid:
- Rate-limit buckets are keyed by `company-id`, not by worker. All workers handling the same `company-id` share a single 10 req/s budget; coordinate globally instead of assuming each worker has its own headroom (see the sketch after this list).
- Don't try to "speed up" with smaller page sizes. 100 records in one round-trip is much cheaper than 100 round-trips of one record.
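For the first pitfall, one way to coordinate is a token bucket sized to the shared 10 req/s budget. The sketch below is a single-process version with an illustrative class name; across several machines you would back the same idea with a shared store rather than local state.

```python
import threading
import time

class TokenBucket:
    """Minimal token bucket: at most `rate` acquisitions per second, shared by all threads."""

    def __init__(self, rate=10, capacity=10):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # burst ceiling
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def acquire(self):
        while True:
            with self.lock:
                now = time.monotonic()
                # Refill based on elapsed time, capped at capacity.
                self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
                self.updated = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    return
                needed = (1 - self.tokens) / self.rate
            time.sleep(needed)  # sleep outside the lock, then retry
```

Every worker in the process calls `bucket.acquire()` immediately before each request, so the combined rate stays under the budget no matter how many threads are paginating.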
Quick reference
| Symptom | Cause | Fix |
|---|---|---|
| Loop hangs forever | You forgot to bail when items.length < PAGE | Add the early return. |
| Duplicate records | Records inserted at the front during walk | Switch to cursor pagination. |
| 429 storms | Multiple workers, same company-id | Coalesce through a token bucket on your side. |
| Content-Range: items 0--1/N (negative) | _end < _start | Validate before the request. |
Related
- Pagination — formal reference.
- Rate limits — backoff strategy.
- Tutorial: Filter transactions — what to put in `query`.
