What is the best library for bypassing Cloudflare in 2026?

Camoufox is the strongest open-source option for bypassing Cloudflare in 2026, achieving a 100% pass rate in March 2026 benchmarks. It patches Firefox at the C++ level using Mozilla's Juggler protocol, making it undetectable via JavaScript inspection. For HTTP-only scraping, curl_cffi with impersonate='chrome131' handles most Cloudflare targets without a full browser.

What is JA4+ TLS fingerprinting and how does it affect web scraping?

JA4+ is a TLS fingerprinting standard that identifies scrapers before any HTTP headers are exchanged. It hashes the TLS ClientHello fields (cipher suites, extensions, ALPN) in a sort-stable way that survives Chrome's extension order randomisation. Cloudflare deploys JA4 in a Rust crate at CDN edge, Akamai in an EdgeWorker. Python's requests library has a unique JA4 hash that gets blocked instantly. The fix is curl_cffi, which impersonates real Chrome TLS down to HTTP/2 SETTINGS frames.

How do you bypass Akamai Bot Manager in 2026?

Akamai Bot Manager in 2026 probes 60 Chrome extension URLs via fetch() to detect headless browsers. Real Chrome always has at least a few extensions installed. LinkedIn does a particularly aggressive version of this — they probe for Grammarly, 1Password, and other popular extensions by fetching resources at known extension paths. Your scraper never touches the DOM, but the DOM tells on you anyway. The fix: use CloakBrowser (loads real extension profiles) or Camoufox (uses Juggler protocol, no CDP artifacts). Combine with residential or ISP proxies since Akamai flags datacenter ASNs instantly. Set geoip=True in Camoufox to align WebRTC, DNS, and timezone with your proxy exit country.

What is the difference between residential and datacenter proxies for web scraping?

Datacenter proxies are fast and cheap but carry known ASNs (AWS AS16509, GCP AS15169) that anti-bots flag immediately. Residential proxies route through real ISP connections, making traffic look like genuine users. In 2026, most protected sites block datacenter IPs outright. ISP proxies (static residential) offer the best of both: residential IP authority with datacenter speeds. Use datacenter for unprotected APIs, ISP for medium targets, rotating residential for the hardest targets like Cloudflare-protected e-commerce.

How do you scrape JavaScript-rendered websites with Python in 2026?

For JavaScript-rendered websites in 2026, use Camoufox (Python, patches Firefox at C++ level, bypasses Cloudflare), PatchRight (undetected Playwright drop-in, bypasses Kasada), or scrapy-stealth middleware (adds TLS fingerprinting and browser engine to Scrapy). For AI-powered extraction, Crawl4AI (60K stars) and Firecrawl (111K stars) convert pages to clean Markdown. Avoid plain Playwright without stealth patches — navigator.webdriver=true is trivially detected by all major anti-bots.

What is curl_cffi and why is it better than requests for web scraping?

curl_cffi is a Python library that wraps libcurl with BoringSSL patches to produce exact Chrome and Firefox TLS fingerprints. Unlike Python requests (which has a unique JA4 hash that anti-bots recognise instantly), curl_cffi sends a ClientHello identical to a real browser including HTTP/2 SETTINGS frames. It is 10-50x faster than browser automation and works as a drop-in requests replacement: curl_cffi.requests.get(url, impersonate='chrome131').

How do I intercept mobile app API traffic for scraping?

To intercept mobile app API traffic: install Android Studio and create a virtual device with API 30+, root it using rootAVD and Magisk, install HTTP Toolkit to intercept HTTPS traffic and bypass SSL pinning automatically. Once you capture the API request, replicate it with curl_cffi for production scraping. Mobile APIs serve the same data as the website but with far weaker anti-bot protection — no Cloudflare, no JA4 fingerprinting.

What is the best Scrapy anti-bot middleware in 2026?

scrapy-stealth is the most complete Scrapy anti-bot middleware in 2026. It adds TLS fingerprint spoofing, HTTP/2 impersonation, proxy rotation, fingerprint cycling, and a real browser engine via CDP — all unavailable in scrapy-playwright, scrapy-splash, or scrapy-selenium. It supports per-request engine switching via request.meta, so easy URLs use the fast HTTP engine while protected pages use the browser engine. Install with: pip install scrapy-stealth

Why does nodriver beat patched Playwright forks for anti-bot bypass?

The difference is automation-protocol fingerprinting, not fingerprint patches. Patched Chromium forks still drive the browser over the Chrome DevTools Protocol (CDP), which leaves detectable traces in how the browser is controlled. nodriver avoids the standard CDP automation surface, so on targets that fingerprint the automation protocol itself (rather than navigator properties), it passes where a heavily patched fork is still blocked. In a 7-tool benchmark across 31 protected targets, nodriver was the only tool with zero blocks.

What is fingerprint harvesting and why does it matter for scraping in 2026?

Fingerprint harvesting is a commercialised practice where real browser fingerprints are collected from genuine users and replayed against anti-bot systems. Tools embed JavaScript on real sites to capture authentic device profiles (canvas, WebGL, TLS, audio) and sell them for injection into automated sessions. The consequence for scrapers: when a stealth browser passes a canvas probe, it may be replaying a real harvested hash rather than spoofing one. Anti-bot vendors respond by adding replay-resistant signals like WASM SIMD CPU timing that require genuine hardware.

Here is everything I know about scraping

They built walls.
I spent 7 years finding doors.

I started scraping in 2018. Since then I have worked across five companies, built hundreds of production spiders, and fought every major anti-bot system that exists. This guide is everything that actually worked.

JA4+ TLS Fingerprinting Scrapy Playwright Akamai Cloudflare DataDome Kasada F5 Shape AI Scraping MCP Tools curl_cffi Camoufox Scrapling

★ About the author

7 years of production scraping

Asad Ikram
Data Engineer & Scraping specialist

I started scraping the web in 2018. Since then I have worked at five companies including Fix.com, Dubizzle Labs, and M+C Saatchi Fluency, building production scrapers at scale across MENA and Europe.

Currently Data Engineer at M+C Saatchi Fluency and co-founder of ArtemisAI Ltd. Chevening Scholar 2024/25, MSc Data Analytics with Distinction.

I built this guide to share everything I know about scraping properly, the bypasses, the failures, the patterns that hold up in production. No guesswork, no generic tutorials.

500+ production spiders built
50M+ data points extracted
5 companies with production scrapers
2024 Chevening Scholar, UK Govt

01 Attack strategy

The scraping
decision flow

Walk the steps in order. Stop at the first win. Complexity and cost increase with each step. Most production scraping is solved at steps 1–3.

Recon first (Step 0): Before picking a step from the flow, capture a real session through Burp Suite and run PortSwigger's MCP server with Claude Code. One prompt traces the entire cookie lifecycle (_abck, cf_clearance, datadome, reese84), identifies sensor payload endpoints, and tells you which step from the flow below will actually work for this target. What used to be a 4-hour manual walk through HTTP history is now a 2-minute prompt.

Asad's Priority Order, start left, move right only when needed

Step 1 · 📱 Mobile API · HTTP Toolkit, Frida, mitmproxy
Step 2 · 🔍 XHR Endpoint · Chrome DevTools, Burp + MCP, webclaw
Step 3 · 🗃️ JSON in HTML · __NEXT_DATA__, chompjs, Parsel
Step 4 · ⚡ HTTP Scraping · curl_cffi, Scrapy, Scrapling
Step 5 · 🌐 C++ Browser · Camoufox, CloakBrowser
Step 6 · ☁️ Managed API · Bright Data, Zyte, Firecrawl

Rule #1, Asad's priority: Never start at Step 5. The mobile app often hits the same backend with zero anti-bot protection. Confirmed on a major retailer: a direct GraphQL endpoint bypassed all HTML anti-bot protection entirely. Find the API first, always.
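The "walk in order, stop at the first win" rule can be sketched as a tiny driver loop. This is a minimal illustration, not a production harness: the step names and the lambda stand-ins are hypothetical placeholders for whatever check you run at each step.

```python
def walk_decision_flow(steps, target):
    """Try each (name, attempt) pair in order and stop at the first success."""
    tried = []
    for name, attempt in steps:
        tried.append(name)
        if attempt(target):
            return name, tried
    return None, tried


# Hypothetical stand-ins: each returns True when that approach yields the data.
steps = [
    ("mobile_api", lambda target: False),
    ("xhr_endpoint", lambda target: False),
    ("json_in_html", lambda target: "__NEXT_DATA__" in target["html"]),
    ("http_scraping", lambda target: True),
    ("cpp_browser", lambda target: True),
    ("managed_api", lambda target: True),
]

page = {"html": '<script id="__NEXT_DATA__">{}</script>'}
winner, tried = walk_decision_flow(steps, page)
print(winner, tried)
```

Because the loop returns on the first success, the expensive browser and managed-API steps are never attempted when an earlier step already works.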

Before we go deeper: The flow above tells you what order to try things. But to understand why those steps exist in that order, and what happens when you skip one, you need to understand how detection actually works. The next section breaks down every signal anti-bots collect, starting at the TCP handshake.

02 The anatomy of detection

Before you send a single byte,
you've already been judged.

The moment your scraper opens a TCP connection to a CDN, a fingerprinting pipeline triggers. By the time your HTTP request body arrives, four independent scoring systems have already assigned you a trust score. Here's exactly what each one measures, and why defeating just one is never enough.

The fundamental insight: Anti-bots don't make binary decisions. They assign a continuous trust score across all four layers simultaneously. A perfect TLS fingerprint with a datacenter IP and machine-like mouse movement still fails, just at a different layer. The only winning strategy is addressing all four at once.

Layer 1, TLS Fingerprinting: The Handshake That Betrays You

This fires before a single HTTP byte is exchanged. Understanding it is non-negotiable.

Origin 2017 · Salesforce Research

JA3, The First Fingerprint

When any HTTPS client connects, it sends a TLS ClientHello message. JA3 extracts five fields from it and MD5-hashes the combination:

TLS Version + Cipher Suites + Extensions + Elliptic Curves + Curve Formats

This produced a stable 32-char hex hash. Python's requests library has always had the same JA3 hash. Every major anti-bot catalogued it. By 2021, your Python scraper was identifiable before the first HTTP header.

JA3's weakness: Chrome started randomising TLS extension order in early 2023 (Chrome 110). Same browser, different JA3 on every connection. The fingerprint became unstable and unreliable.
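A minimal sketch of the JA3 construction, and of why extension-order randomisation broke it. The five fields are written as decimal numbers, joined with dashes inside a field and commas between fields, then MD5-hashed. The field values below are made up for illustration, not captured from a real client, and GREASE filtering is left out for brevity.

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Join the five ClientHello fields JA3-style and MD5-hash the result."""
    fields = [
        str(version),
        "-".join(str(c) for c in ciphers),
        "-".join(str(e) for e in extensions),
        "-".join(str(c) for c in curves),
        "-".join(str(p) for p in point_formats),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative values only (771 is the legacy version number for TLS 1.2).
base = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

# Same extensions in a different order: JA3 hashes the order as sent,
# so the same client now produces a different hash.
shuffled = ja3_hash(771, [4865, 4866, 4867], [23, 0, 10, 65281, 11], [29, 23, 24], [0])

print(base != shuffled)
```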

2023 · FoxIO · Replaces JA3

JA4+, The Unbreakable Standard

JA4 was engineered specifically to survive Chrome's randomisation. Instead of hashing raw extension order, it sorts the cipher suites and extensions by value and removes GREASE values before hashing. The result is stable regardless of Chrome's ordering.
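The sort-then-hash idea can be sketched in a few lines. This is a simplified illustration, not the full JA4 algorithm: as I understand the specification, real JA4 also excludes the SNI and ALPN extensions from the hashed list and appends the signature algorithms, which this sketch omits. The extension list is illustrative.

```python
import hashlib
import random


def is_grease(value):
    """GREASE values (RFC 8701) look like 0x0a0a, 0x1a1a, ... 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def stable_extension_hash(extensions):
    """Drop GREASE, sort by value, SHA-256, truncate to 12 hex characters."""
    cleaned = sorted(e for e in extensions if not is_grease(e))
    joined = ",".join(f"{e:04x}" for e in cleaned)
    return hashlib.sha256(joined.encode("ascii")).hexdigest()[:12]


extensions = [0x0A0A, 0x0000, 0x0017, 0xFF01, 0x000A, 0x000B, 0x0023, 0x0010]

# Simulate a second connection: new order and a different GREASE value.
shuffled = extensions[:]
random.shuffle(shuffled)
shuffled[shuffled.index(0x0A0A)] = 0x3A3A

print(stable_extension_hash(extensions) == stable_extension_hash(shuffled))
```

Because the list is sorted and GREASE is stripped before hashing, the comparison holds no matter how the shuffle lands.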

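JA4 format: t13d1516h2_8daaf6152771_b0da82dd1658
t = TLS over TCP, 13 = TLS 1.3, d = SNI present (a domain name), 15 = number of cipher suites, 16 = number of extensions, h2 = first ALPN value (HTTP/2). The second block is a truncated hash of the sorted cipher suites; the third is a truncated hash of the sorted extensions plus signature algorithms.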
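To make the layout concrete, here is a small parser for the readable prefix of a JA4 string. It is a sketch based on my reading of the FoxIO JA4 specification (protocol letter, two-character TLS version, SNI flag, two-digit cipher count, two-digit extension count, two-character ALPN); the dictionary key names are my own.

```python
def parse_ja4(ja4):
    """Split a JA4 string into its readable prefix fields and its two hashes."""
    prefix, cipher_hash, extension_hash = ja4.split("_")
    return {
        "protocol": {"t": "TCP", "q": "QUIC", "d": "DTLS"}.get(prefix[0], "unknown"),
        "tls_version": prefix[1:3],
        "sni": {"d": "domain (SNI present)", "i": "ip (no SNI)"}.get(prefix[3], "unknown"),
        "cipher_count": int(prefix[4:6]),
        "extension_count": int(prefix[6:8]),
        "alpn": prefix[8:10],
        "cipher_hash": cipher_hash,
        "extension_hash": extension_hash,
    }


parsed = parse_ja4("t13d1516h2_8daaf6152771_b0da82dd1658")
for key, value in parsed.items():
    print(f"{key}: {value}")
```

Reading the prefix this way is a quick sanity check on any fingerprint you capture: a client that claims to be a browser but reports no ALPN or no SNI is already an outlier before the hashes are compared.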

JA4+ extends this with: JA4H (HTTP header fingerprint), JA4X (X.509 certificate), JA4SSH (SSH handshake), JA4T (TCP window + options). Cloudflare deployed it in a Rust crate at CDN edge. Akamai in an EdgeWorker. Both fire before your request reaches origin.

HTTP/2 · Wireshark Observable

HTTP/2 Frame Fingerprinting

Even with a perfect JA4 hash, HTTP/2 itself leaks your client identity. The SETTINGS frame that every HTTP/2 client sends at connection start has parameters that vary by implementation:

HEADER_TABLE_SIZE, MAX_CONCURRENT_STREAMS, INITIAL_WINDOW_SIZE, MAX_FRAME_SIZE, MAX_HEADER_LIST_SIZE

Chrome's exact values are documented. Python's httpx sends different values. curl sends different values. The ordering of these settings, the window update frame sizes, and the HPACK compression decisions all create a secondary fingerprint that cannot be spoofed without rewriting the HTTP/2 client, which is exactly what curl_cffi does.
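A rough sketch of how a SETTINGS-based fingerprint can be serialised and compared. The string layout is loosely modelled on the Akamai-style HTTP/2 fingerprint (settings, window update, pseudo-header order) with the priority section left out. The setting IDs come from the HTTP/2 RFC; the numeric values for both clients are illustrative assumptions and should be verified against a live capture.

```python
SETTING_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def h2_fingerprint(settings, window_update, pseudo_header_order):
    """Serialise SETTINGS (in the order sent), WINDOW_UPDATE and pseudo-header order."""
    settings_part = ";".join(f"{SETTING_IDS[name]}:{value}" for name, value in settings)
    headers_part = ",".join(h.lstrip(":")[0] for h in pseudo_header_order)
    return f"{settings_part}|{window_update}|{headers_part}"


# Illustrative browser-like values (assumption, not a verified capture).
browser_like = h2_fingerprint(
    [("HEADER_TABLE_SIZE", 65536), ("ENABLE_PUSH", 0),
     ("INITIAL_WINDOW_SIZE", 6291456), ("MAX_HEADER_LIST_SIZE", 262144)],
    15663105,
    [":method", ":authority", ":scheme", ":path"],
)

# Illustrative generic-library values (assumption).
library_like = h2_fingerprint(
    [("HEADER_TABLE_SIZE", 4096), ("ENABLE_PUSH", 0),
     ("MAX_CONCURRENT_STREAMS", 100), ("INITIAL_WINDOW_SIZE", 65535)],
    16777216,
    [":method", ":scheme", ":authority", ":path"],
)

print(browser_like)
print(browser_like == library_like)
```

Order matters in this string on purpose: two clients sending the same values in a different sequence still produce different fingerprints.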

2024+ · Emerging Standard

QUIC / HTTP/3 Fingerprinting

As HTTP/3 adoption grows, JA4Q and QUIC Initial packet fingerprinting are being deployed. QUIC's handshake carries its own fingerprint surface: connection ID length, transport parameters, initial packet number, token presence.

Chrome's QUIC stack, libcurl's QUIC implementation, and Python's aioquic all differ from one another. Each leaves a unique signature in the Initial packets.

Current status: JA4+ covers QUIC. Cloudflare has begun collecting QUIC fingerprints. Not yet widely enforced for blocking, but the infrastructure is live. Tools like curl_cffi are actively implementing QUIC parity.

python

# Compare your client's JA4 fingerprint using tls.browserleaks.com
import requests
from curl_cffi import requests as cffi

URL = "https://tls.browserleaks.com/json"

# ❌ requests: exposes the Python/urllib3 TLS stack, a JA4 that matches no browser
r1 = requests.get(URL, timeout=30)
print(r1.json().get("ja4"))  # Python-stack fingerprint (exact value varies by OpenSSL build)

# ✓ curl_cffi: emits Chrome 124's ClientHello, HTTP/2 frames, and cipher order
r2 = cffi.get(
    URL,
    impersonate="chrome124",  # also: chrome110, chrome107, safari17_0
)
report = r2.json()
print(report.get("ja4"))  # compare with what a real Chrome 124 shows on the same page

# Also check the HTTP/2 fingerprint fields in the same report
# (key names depend on the service's response schema, so filter rather than index)
print({k: v for k, v in report.items() if "http2" in k or "akamai" in k})

Practical · How to actually spoof TLS in 2026

From theory to working code

All the JA4+ research is academic until you ship it. Three tiers of solution, in order of how often you should reach for each:

Tier 1 · 80% of cases

Use a TLS-impersonating HTTP client

curl_cffi (Python), tls-client (Go), noble-tls, hrequests. One line of code, exact Chrome/Firefox JA4. Drop-in replacement for requests.

curl_cffi.requests.get(url, impersonate="chrome131")

Tier 2 · Scrapy projects

Plug a stealth middleware in

scrapy-stealth adds TLS + HTTP/2 fingerprinting + proxy rotation + fingerprint cycling to existing Scrapy spiders via DOWNLOADER_MIDDLEWARES. Per-request engine switching keeps simple URLs fast.

meta={"stealth": {"profile": "chrome_147"}}

Tier 3 · Hardest targets

Browser with C++ patches

When TLS spoofing alone fails (Akamai extension probes, Kasada toString checks, behavioural ML), reach for Camoufox, rayobrowse, or CloakBrowser. C++ binary patches ship a real-browser TLS stack along with everything else.

Cost: 200MB+ memory per browser instance

⚠ Common mistakes

1. Spoofing User-Agent without TLS. If your UA says Chrome but JA4 says Python urllib3, you flag faster than no spoofing at all, the mismatch is the signal.
2. Forgetting HTTP/2 SETTINGS frames. Even perfect JA4 fails if your HTTP/2 SETTINGS (header table size, max concurrent streams, initial window size) do not match the browser you claim to be. curl_cffi and tls-client handle this; rolling your own usually does not.
3. Using stale impersonation profiles. Chrome 120 fingerprints in 2026 are themselves suspicious, real users rolled forward. Keep impersonate="chrome131" or newer.
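Mistakes 1 and 3 are cheap to catch before a request ever leaves your machine. The sketch below is a hypothetical pre-flight check, not part of curl_cffi: it compares the Chrome major version in the User-Agent you plan to send with the one named in the impersonation profile, and flags profiles older than a floor you choose (131 here, following the rule above).

```python
import re


def chrome_major(text, pattern):
    """Return the Chrome major version matched by `pattern`, or None."""
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None


def coherence_problems(user_agent, impersonate, min_major=131):
    """List mismatches between the User-Agent header and the TLS profile name."""
    ua_major = chrome_major(user_agent, r"Chrome/(\d+)")
    tls_major = chrome_major(impersonate, r"^chrome(\d+)")
    if ua_major is None or tls_major is None:
        return ["User-Agent or profile is not a recognisable Chrome identity"]
    problems = []
    if ua_major != tls_major:
        problems.append(
            f"User-Agent says Chrome {ua_major}, TLS profile says Chrome {tls_major}"
        )
    if tls_major < min_major:
        problems.append(f"TLS profile chrome{tls_major} is older than chrome{min_major}")
    return problems


ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
)
print(coherence_problems(ua, "chrome131"))
print(coherence_problems(ua, "chrome120"))
```

This only covers the header-versus-profile mismatch; the HTTP/2 SETTINGS check from mistake 2 has to be verified on the wire.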

What "impersonate a browser" means at the byte level, and why a real browser fingerprint is not one value but a set of consistent tells. The reason a default HTTP client gets a 403 it cannot explain is that the ClientHello is built by its TLS library (OpenSSL for requests, Go's crypto/tls for Go clients) with that library's own cipher order, extension set, and ordering, a fingerprint that matches no shipping browser. Tools in the curl-impersonate lineage (and the Go uTLS library that pioneered this) do not tweak that ClientHello, they construct it manually, byte by byte, to reproduce a specific browser's exact handshake. Knowing the tells they have to match is what lets you tell a convincing impersonation from a leaky one.

The per-browser discriminators

GREASE presence, curves, and extension order

A handful of stable signals separate the real stacks. GREASE (RFC 8701: random sentinel values like 0x0a0a a client injects so servers stay tolerant of unknown values) is present in Chrome, Edge, and Safari, but Firefox sends none, so a ClientHello that claims to be Chrome with no GREASE is self-contradictory. Safari advertises curves the others usually omit (for example secp521r1) and orders its suites and extensions distinctly. Chrome and Edge are both Chromium but not byte-identical. The renegotiation-info extension (0xff01) is present in stock browsers and a common omission in spoofs. Each TLS library has its own defaults (Chrome uses BoringSSL, Firefox uses NSS), and the gap between those defaults and a browser's exact bytes is exactly what a fingerprint database scores.
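Two of these tells are simple enough to check mechanically. The sketch below is an illustration of the idea, not any vendor's scoring logic: it takes the browser a client claims to be and the extension IDs from its ClientHello, then reports the self-contradictions described above. The GREASE expectations per browser follow the paragraph above; the function and table names are my own.

```python
RENEGOTIATION_INFO = 0xFF01

# Whether each browser family is expected to send GREASE, per the text above.
EXPECTS_GREASE = {"chrome": True, "edge": True, "safari": True, "firefox": False}


def is_grease(value):
    """GREASE values (RFC 8701) look like 0x0a0a, 0x1a1a, ... 0xfafa."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def hello_contradictions(claimed_browser, extensions):
    """Return the ways a ClientHello contradicts the browser it claims to be."""
    found = []
    has_grease = any(is_grease(e) for e in extensions)
    expected = EXPECTS_GREASE.get(claimed_browser)
    if expected is True and not has_grease:
        found.append(f"claims {claimed_browser} but sends no GREASE")
    if expected is False and has_grease:
        found.append(f"claims {claimed_browser} but sends GREASE")
    if RENEGOTIATION_INFO not in extensions:
        found.append("renegotiation_info (0xff01) is missing")
    return found


print(hello_contradictions("chrome", [0x0000, 0x0017, 0x000A]))
print(hello_contradictions("chrome", [0x1A1A, 0x0000, 0x0017, 0xFF01]))
```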

Why pinning a profile is fragile

Match the whole stack, and keep it current

Impersonation libraries pin to a named profile (HelloChrome_131 and the like) because a known-good fingerprint is safer than auto-updating to an untested one, but that stability cuts both ways. Real Chrome keeps rolling forward and has randomised its extension order per connection since 2023, so a frozen profile drifts from the live browser, and named uTLS CVEs have shown pinned fingerprints leaking subtle inconsistencies (an ECH cipher choice that real Chrome never makes) on a fraction of connections. Two rules follow. Keep the profile current: a 2024 Chrome hash in 2026 is itself a tell. And keep it coherent across the whole stack: the TLS fingerprint, the HTTP/2 SETTINGS frame, the header wire order, and the User-Agent must all name the same browser, because a Chrome JA3 under a Firefox User-Agent is an instant contradiction.

Layer 2, JavaScript Fingerprinting: The Page That Interrogates You

2026 update: Microsoft Edge now silently returns navigator.webdriver = false for AI-agent-driven Playwright sessions, and Google patched out the most common CDP-detection technique in V8. The browser vendors that wrote the automation-transparency rules quietly stopped enforcing them. Any bypass or detection strategy pivoting on these flags should treat them as unreliable. See the Innovation Feed card "The Vendors That Wrote the Detection Rules" for details.

Once your TLS passes, the page loads its anti-bot script. This is a 500KB+ obfuscated interrogation that runs dozens of tests in parallel.

Most stable signal · GPU dependent

Canvas + WebGL Fingerprinting

The page draws invisible shapes, gradients, and text using canvas.getContext('2d') then calls canvas.toDataURL(). The exact pixel output varies by:

GPU manufacturer and model (NVIDIA vs AMD vs Intel)
Driver version and sub-pixel rendering
OS-level font rendering (Windows ClearType vs macOS CoreText)
Canvas size and DPI scaling

A headless Chromium with no GPU produces a software-rendered canvas with a known hash. Botasaurus and CloakBrowser spoof this at the C++ level by injecting slight noise into the pixel values before toDataURL() returns, enough to vary the hash while remaining visually identical.

GPU vendor string · Renderer string

WebGL Fingerprinting

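WebGL exposes the GPU through gl.getParameter(gl.RENDERER) and gl.getParameter(gl.VENDOR), and in unmasked form through the WEBGL_debug_renderer_info extension (UNMASKED_RENDERER_WEBGL and UNMASKED_VENDOR_WEBGL). Real Chrome returns something like ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0).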

Headless Chrome returns a generic string or crashes on WebGL entirely. Anti-bots cross-reference: if WebGL says "Intel UHD 620" but Canvas hash shows software rendering, that's a contradiction, you're flagged.

WebGL extensions list is also fingerprinted. Real GPUs expose 30–40 extensions. Software renderers expose a different subset. The exact combination is GPU-specific and stable across sessions.

Deterministic · Hard to spoof

AudioContext Fingerprinting

The page creates an AudioContext, generates a sine wave through an OscillatorNode, runs it through a DynamicsCompressorNode, and reads the output buffer values. The floating-point output depends on:

CPU architecture (x86 vs ARM floating-point precision)
Operating system audio stack
Audio driver implementation

Headless environments often return 0.0 across the buffer (no audio context), or a software-emulated value that differs from hardware. CloakBrowser patches this at the Chromium C++ audio rendering layer.

Runtime patches exposed · 2026 standard

Function.toString() Detection

This is why playwright-stealth fails against Kasada in 2026.

When JS patches a native function, for example navigator.webdriver, it replaces the getter with a custom function. Calling Function.prototype.toString.call(getter) on the patched function returns the patch's own source (in effect function () { [custom code] }) instead of function () { [native code] }.

Kasada specifically tests dozens of native functions this way. playwright-stealth patches them in JavaScript, so toString() reveals the patch. PatchRight fixes this at the Python source level, before Chrome even starts. There's no JS to inspect.

The harvest-and-replay threat (Castle Research, April 2026): Real canvas hashes from real Macs are now commercially traded. Services like Bablosoft's PerfectCanvas render canvas on a real remote GPU and inject the result into headless sessions. A "passing" canvas hash is not proof of a real browser, it may be a replayed hash from a legitimate device. Detection systems are responding by pairing canvas probes with harder-to-replay signals: WASM SIMD CPU timing, behavioural Bezier-curve physics, and controlled variability in their own probe scripts.

Akamai specific · 60 probes in 2026

Chrome Extension Probing (Akamai)

Akamai's sensor.js fetches 60 known Chrome extension resource URLs using fetch('chrome-extension://[id]/manifest.json'). Real Chrome browsers have at least a few extensions installed (ad blockers, password managers, etc.).

A headless browser returns net::ERR_FAILED on all 60 requests simultaneously, a statistically impossible result for a real user. The extension IDs probed include:

cjpalhdlnbpafiamejdnhcphjbkeiagm (uBlock Origin)
hdokiejnpimakedhajhdlcegeplioahd (LastPass)
nngceckbapebfimnlniiiahkandclblb (Bitwarden)

Fix: CloakBrowser loads real extension profiles. You install 1Password or Bitwarden into it so some probes return real manifest data.
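The "all probes fail at once" signal is easy to reason about in code. The sketch below only models the decision described above; it is a guess at the shape of such a check, not Akamai's actual scoring, and it uses just the three extension IDs listed in this section. The function names and the result format are hypothetical.

```python
PROBED_EXTENSIONS = {
    "cjpalhdlnbpafiamejdnhcphjbkeiagm": "uBlock Origin",
    "hdokiejnpimakedhajhdlcegeplioahd": "LastPass",
    "nngceckbapebfimnlniiiahkandclblb": "Bitwarden",
}


def probe_url(extension_id):
    """Build the resource URL a page would try to fetch for one extension."""
    return f"chrome-extension://{extension_id}/manifest.json"


def all_probes_failed(results):
    """results maps extension id -> True if the fetch resolved, False if it errored."""
    return bool(results) and not any(results.values())


# A headless session: every probe errors out.
headless = {ext_id: False for ext_id in PROBED_EXTENSIONS}

# A profile with Bitwarden installed: one probe resolves.
with_bitwarden = dict(headless, nngceckbapebfimnlniiiahkandclblb=True)

print(probe_url("nngceckbapebfimnlniiiahkandclblb"))
print(all_probes_failed(headless), all_probes_failed(with_bitwarden))
```

This is why loading even one real extension changes the picture: the session no longer matches the zero-hit pattern.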

CDP timing leaks · Protocol signals

Headless / CDP Detection

Beyond navigator.webdriver, CDP-controlled browsers expose themselves through subtler signals:

Timing: CDP's Runtime.enable command leaves a timing gap between page parse and script execution that doesn't exist in real Chrome.
Execution context: window.cdc_adoQpoasnfa76pfcZLmcfl_Array and similar artifacts left by ChromeDriver are checked.
Permission API: Real Chrome returns realistic permission states. ChromeDriver returns defaults inconsistent with a "normal" browser.
Plugins: Headless Chrome has zero plugins. Real Chrome always has at least the PDF viewer plugin.

Camoufox's solution: Uses Mozilla's Juggler protocol, which sits below CDP entirely, none of these artifacts exist.

How the top tier weaponises the gap between two layers: the Akamai escalation trap. The layers above are usually described as independent checks you pass one at a time. The hardest deployments do the opposite, they cross-check one layer against another so that passing each in isolation still fails. A cold-start agentic mapping of a live Akamai sensor laid this out cleanly, and it is worth internalising because it explains why a technically perfect client payload can still earn a permanent block.

The trap

TLS at the edge cross-checks the behavioural payload

Akamai evaluates the connection's TLS fingerprint at the edge before any client JavaScript runs. A consumer TLS stack (for example Safari emulated through curl_cffi) gets a 200 carrying the sec-cpt challenge; an automation stack (stock Node or Puppeteer Chromium) gets an immediate 403 and never sees the challenge. The trap is what happens next. Even if a real Chromium driven by a stealth tool solves sec-cpt and serialises a mathematically perfect roughly 2.2KB behavioural payload to validate the _abck cookie, the edge still refuses it when the TLS fingerprint on that same connection reads as the automation tool. Perfect behaviour over the wrong TLS escalates to a permanent 403 for the session. The lesson the guide repeats applies at its sharpest here: every layer must agree at once, a flawless payload over a mismatched handshake is worse than useless.

The anti-hook

The "clean iframe" defeats window-object patches

Stealth tools commonly patch native APIs on the page's window (overriding navigator.webdriver, wrapping DOM methods in a Proxy, replacing toString). Akamai sidesteps all of it by injecting a hidden zero-size iframe and pulling pristine API references straight from iframe.contentWindow, so it can call the original native functions even when the main window is heavily spoofed. It pairs this with Function.prototype.toString checks for the [native code] marker and Object.getOwnPropertyDescriptor inspection of getters that should not exist, then ships an integrity hash to a separate anti-hook endpoint before the main telemetry is even sent. Practical consequence: patching the visible window is not enough, your patches must survive being compared against a clean realm the page can conjure at will.

The reference open-source fingerprinter: FingerprintJS, and the one line in its own README that every scraper should read. If you want to actually see what a fingerprinter collects rather than read about it, the canonical starting point is FingerprintJS, the most widely used open-source (MIT) browser-fingerprinting library. It runs entirely client-side, queries a broad set of browser and device attributes, and hashes them into a single stable visitorId, and the property that makes fingerprinting matter at all is easy to demonstrate with it: open its demo, then reopen in incognito or after clearing browser data, and the identifier is unchanged, because none of the signals it reads live in cookies or local storage. Running it against your own scraper is the fastest honest audit of how identifiable your setup is, and it pairs naturally with the local BotD and CreepJS harness described just below.

Identification is not detection

FingerprintJS vs BotD: same company, different jobs

Keep two of this vendor's libraries straight, because they answer different questions. FingerprintJS does identification, is this the same browser I saw before, producing a stable visitor ID for deduplication and tracking. BotD, from the same team and referenced just below, does bot detection, is this browser automated at all. A scraper is exposed on both axes at once: FingerprintJS-style signals link your sessions together across IPs (the rotation-inversion problem from the proxy section), while BotD-style checks flag the automation itself. Beating one does nothing for the other, which is why a coherent identity and a clean automation surface are separate pieces of work.

The admission in the README

Client-side fingerprints can be spoofed, so the hard version moves server-side

The most useful thing in the FingerprintJS documentation, from a scraper's point of view, is its own stated limitation: because the fingerprint is generated and processed in the browser, it is openly described as vulnerable to spoofing and reverse engineering, and the open-source accuracy is "significantly lower" than the commercial tier. That commercial version closes the gap precisely by moving work out of reach, combining the client signals with server-side processing over 100-plus signals plus network-level data, and validating that signals were not tampered with or replayed. That single design choice is the whole detection arms race in miniature: anything computed in the browser you can eventually forge, so serious systems anchor the verdict where your code cannot reach, which is the same lesson as the tanh and coherence sections, viewed from the vendor's side.

A measured warning: the JavaScript-surface score tells you almost nothing about whether you are caught. It is tempting to judge a stealth setup by how clean it looks to an in-page fingerprint probe, but that score and the actual block decision are only loosely related. The signal that flips many verdicts lives below JavaScript, in the driver binary. A practitioner who stood up the open-source detectors locally (BotD and CreepJS) and ran stock Selenium, selenium-stealth, and undetected-chromedriver against real Chrome found the gap directly: selenium-stealth cleaned up roughly 94 percent of the JS-surface checks, yet BotD still labelled it "selenium," because the stealth layer spoofs in-page properties but never strips the ChromeDriver binary's $cdc_ markers, which is exactly what BotD keys on. Only undetected-chromedriver, which patches those markers out of the binary, passed outright.
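The binary tell is literally a byte string, which is why JavaScript patches cannot reach it. As a minimal sketch of the idea (not undetected-chromedriver's actual patching code, which is more involved), the functions below find the cdc_ marker in a blob of bytes and overwrite it with a same-length replacement so file offsets stay intact. The sample bytes and the replacement string are made up for illustration.

```python
def find_markers(binary, marker=b"cdc_"):
    """Return the byte offsets where the marker occurs."""
    offsets = []
    start = binary.find(marker)
    while start != -1:
        offsets.append(start)
        start = binary.find(marker, start + 1)
    return offsets


def neutralise_markers(binary, marker=b"cdc_", replacement=b"xyz_"):
    """Replace the marker with a same-length string so offsets do not shift."""
    if len(replacement) != len(marker):
        raise ValueError("replacement must be the same length as the marker")
    return binary.replace(marker, replacement)


# Made-up stand-in for a slice of a driver binary.
sample = b"\x00\x01var key = '$cdc_asdjflasutopfhvcZLmcfl_';\x00\x02"

print(find_markers(sample))
patched = neutralise_markers(sample)
print(find_markers(patched), len(patched) == len(sample))
```

Running `find_markers` on your own driver binary before and after patching gives the same kind of before-and-after evidence as a local BotD harness.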

The counterintuitive part

Naive stealth can make you more detectable

The same harness showed selenium-stealth's getter-based spoofing adding two CreepJS "lies" that plain Selenium did not have. When you override a property with a JavaScript getter, the override is itself a detectable artifact: a function where a native value belongs, a toString that does not read as [native code], a descriptor that should not exist. A consistency-checking detector counts each of those as evidence of tampering. So a half-measure that patches the visible surface while leaving the binary tell in place can score worse than no stealth at all, because you have removed nothing that mattered and added new lies to catch.

The operating rule

Fix the layer the detector actually reads

This is why the library section rates binary-patching and below-CDP tools (undetected-chromedriver while it held, nodriver, the source-patched browsers) above JS-injection stealth on hard targets. The rule generalises past Selenium: identify the layer a given detector actually keys on (driver binary markers, the TLS handshake, the automation protocol) and fix it there, rather than polishing a JavaScript surface the detector may barely weight. And measure it: a local BotD or CreepJS harness turns "I think this is stealthy" into a before-and-after number, which is the only way to notice when a patch made things worse.

A new hardware tell shipping in Chrome: the CPU Performance API. Sites have always wanted to know how powerful your device is, historically by running a timing benchmark, which is noisy and easy to perturb. Chrome is standardising a shortcut. Reading navigator.cpuPerformance returns a stable integer tier from 1 (low) to 4 (high), with 0 for unknown, a coarse hardware class handed over with no benchmark, no opt-out, and no cross-origin limit. It is a WICG proposal, deliberately bucketed (each tier must cover a few hundred CPU models) to cap the entropy it adds, but Mozilla still estimates it at one to three extra bits per user and WebKit declined to implement it, which tells you how the other engines weigh the tradeoff. There is even a Chrome Enterprise policy to override the value (0 to 4).

Why a fixed number helps the defender

A tier you cannot benchmark your way out of

For a real user the tier is just a stable fact about their machine. For automation it is a fresh consistency check that costs the detector nothing. Because the value is declared by the browser rather than measured, a stealth setup cannot earn it by behaving well, it either matches the rest of the device story or it does not. And because the buckets are time-invariant, the same machine reports the same tier across visits and across origins, so it composes with every other signal a fingerprint already carries. A single coarse number is weak alone, which is the whole design, but it is one more axis that has to agree.

Where it bites a spoofed box

Cross-check the tier against the rest of the silicon story

The interesting move for a detector is to cross the advertised tier against the other hardware signals and look for a machine telling two stories about itself. Report navigator.hardwareConcurrency of 16 (spoofed) on a 2-vCPU cloud box whose cpuPerformance lands at a low tier, and the two claims disagree loudly. Push further and a smart WASM micro-benchmark (SIMD timings, feature detection) fingerprints the actual silicon and either corroborates the tier or exposes the lie. The practical lesson for a scraper is the one this section keeps arriving at: if you spoof one hardware property you now own all of them, the tier, the core count, and the measured compute all have to tell a single coherent story, and a headless fleet on undersized cloud instances is exactly where they stop doing so.
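A detector-side cross-check of this kind could look roughly like the sketch below. The thresholds (16 cores, a two-tier gap) and the idea of mapping a WASM benchmark onto the same 1 to 4 scale are assumptions made for illustration; nothing here reflects a known vendor implementation.

```python
def hardware_conflicts(cpu_tier, hardware_concurrency, measured_tier=None):
    """Return contradictions between declared tier, core count and measured compute."""
    if cpu_tier not in (0, 1, 2, 3, 4):
        raise ValueError("cpu_tier must be 0 (unknown) or 1 to 4")
    conflicts = []
    # Assumed threshold: a low tier alongside a very high core count is suspicious.
    if cpu_tier in (1, 2) and hardware_concurrency >= 16:
        conflicts.append("low performance tier but 16+ logical cores reported")
    # Assumed threshold: a benchmark two or more tiers off the declared value.
    if measured_tier is not None and cpu_tier != 0:
        if abs(measured_tier - cpu_tier) >= 2:
            conflicts.append("measured compute is two or more tiers from the declared tier")
    return conflicts


# Spoofed core count on a small cloud instance.
print(hardware_conflicts(cpu_tier=1, hardware_concurrency=16))

# A coherent high-end desktop.
print(hardware_conflicts(cpu_tier=4, hardware_concurrency=16, measured_tier=4))
```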

A worked example of what "spoof one thing and you now own all of them" looks like in practice. It is worth seeing the coherence principle as a real remediation report reads it, because the failures are never where a beginner looks. Take a headless desktop Chromium configured to present as an Android mobile Chrome (claimed identity: Chromium / Android / mobile / Chrome, with the CPU architecture field left empty). To a first-order fingerprint it looks plausible. A full audit finds it contradicting its own story in at least seven independent places at once, and every one of them is a separate subsystem the operator forgot had an opinion.

The contradictions an audit surfaces

Seven subsystems, one incoherent device

The CPU architecture, exposed through a WASM relaxed-SIMD instruction-selection oracle, reads as x86 desktop, not the ARM an Android device would run. navigator.maxTouchPoints is 0, but a phone reports one or more touch points. The Opus audio codec ships its desktop complexity default, a mobile build tunes it differently. The Widevine DRM stack and PaymentRequest service are missing or stubbed, where a real Android Chrome exposes both. The WebGL renderer string names desktop silicon, and the media power-efficiency profile matches a plugged-in desktop, not a phone. The automation protocol (CDP) is independently detectable on top. No single one of these is exotic; together they describe a device that cannot exist.
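A coherence audit of this kind can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's real logic: the DeviceClaims fields, the DESKTOP_GPU_HINTS list, and the rule set are all hypothetical, and only six of the contradictions above are modelled (the Opus and power-efficiency checks are left out because they need media-stack measurements).

```python
from dataclasses import dataclass

# Hypothetical substrings that suggest desktop silicon in a WebGL renderer string.
DESKTOP_GPU_HINTS = ("nvidia", "geforce", "radeon", "intel")


@dataclass
class DeviceClaims:
    claimed_platform: str      # what the User-Agent says, e.g. "Android"
    claimed_mobile: bool       # the "mobile" flag the client presents
    measured_cpu_arch: str     # what a SIMD oracle observed: "x86" or "arm"
    max_touch_points: int      # navigator.maxTouchPoints
    has_widevine: bool
    has_payment_request: bool
    webgl_renderer: str
    cdp_detected: bool


def audit(c: DeviceClaims) -> list[str]:
    """Return every place where the device contradicts its own story."""
    findings = []
    android = c.claimed_platform == "Android"
    if android and c.measured_cpu_arch != "arm":
        findings.append("CPU oracle reports non-ARM silicon on an Android claim")
    if c.claimed_mobile and c.max_touch_points < 1:
        findings.append("mobile claim with zero touch points")
    if android and not c.has_widevine:
        findings.append("Widevine missing on an Android Chrome claim")
    if android and not c.has_payment_request:
        findings.append("PaymentRequest missing on an Android Chrome claim")
    if c.claimed_mobile and any(h in c.webgl_renderer.lower() for h in DESKTOP_GPU_HINTS):
        findings.append("WebGL renderer names desktop silicon")
    if c.cdp_detected:
        findings.append("automation protocol (CDP) detected")
    return findings
```

The point of returning a list rather than a boolean is the one the audit makes: no single finding is decisive, but each one comes from a separate subsystem, so the count is what describes a device that cannot exist.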

Why the fix order is the lesson

Config is cheap, coherence is architectural

The instructive part of such a report is the remediation ladder. The cheapest wins are config and build flags: disabling the CDP automation domain, patching the Opus complexity constant, forcing maxTouchPoints to a mobile value, shipping a real Widevine CDM. But the deep fixes are architectural and cannot be faked from JavaScript: to make the SIMD CPU oracle agree with the ARM claim you have to actually run on ARM64 hardware or a VM, and a JS override of PaymentRequest or the GPU renderer leaves its own tamper artifact that a coherence check then catches. That is the whole thesis of this section in one page: the shallow tells are a checklist you can grind through, but the ones that decide the verdict require the environment to genuinely be what it claims, which is why a desktop box pretending to be a phone loses no matter how many properties it patches.

The purest example of "it is harder to lie than to tell the truth": your browser does math differently on each operating system. Most fingerprint signals leak because a browser exposes a property it could, in principle, be patched to fake. This one is different, because the browser is not the thing computing the answer. IEEE 754 fixes how a floating-point number is stored but does not require transcendental functions like sin, cos, or tanh to be correctly rounded to the last bit. Every operating system ships its own math library (glibc on Linux, libsystem_m on macOS, UCRT on Windows) with its own polynomial coefficients and rounding, so the same call returns bits that differ in the last place depending on the OS underneath.

Why tanh in particular

The one JavaScript math function that leaks the OS

For most of Math, V8 bundles its own implementation, statically linked and identical on every OS, so sin, cos, and pow give the same bits everywhere and leak nothing. Math.tanh is the exception. As of Chrome 148, V8 no longer computes tanh with its own bundled routine and instead calls std::tanh, which uses the host OS math library. The result: on Chrome 148 and later, Math.tanh of the right input returns Linux bits on Linux and Mac bits on Mac, and a Linux server spoofing macOS is caught the instant a page evaluates it, because the browser genuinely cannot change the answer: the OS produced it. Chrome 147 and earlier do not leak here, which itself pins a version range.
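The same effect is easy to observe outside the browser. CPython's math.tanh also delegates to the platform C math library, so the sketch below prints the exact bits your own OS produces and measures how far two results sit apart in units in the last place. The probe inputs are arbitrary choices for illustration, not the inputs any real detector uses, and no expected output is shown because it depends on the host.

```python
import math
import struct


def float_bits(x: float) -> int:
    """Reinterpret a double as its raw 64-bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def ulp_distance(a: float, b: float) -> int:
    """Steps between two finite doubles of the same sign, in last-place units."""
    return abs(float_bits(a) - float_bits(b))


def tanh_probe(inputs=(0.5, 1.0, 2.5, 7.25)) -> list[str]:
    """Hex-exact tanh results as computed by the host math library."""
    return [math.tanh(x).hex() for x in inputs]


if __name__ == "__main__":
    # Run this on Linux, macOS, and Windows and compare the lines:
    # any difference is confined to the last hex digit or two.
    for value in tanh_probe():
        print(value)
```

Comparing the hex strings rather than the decimal repr matters here, because a one-ULP difference is exactly the kind of detail decimal formatting can hide.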

The wider surface

CSS trig leaks everywhere, and the edges are checked

JavaScript Math is a tell in essentially one place, but CSS trig functions leak everywhere, because the rendering engine calls the host math library directly for every sin, cos, tan, and inverse. A defender can probe those and also the domain edges, where implementations diverge more loudly: asin(2) is out of domain and resolves to zero on a real Mac (the NaN is clamped), not the ninety degrees a naive reproduction returns. The lesson for a scraper is the sharpest version of this whole section's theme. You cannot patch your way to a coherent lie here, because the signal is produced below the browser, by hardware and the OS, so the only setup that survives a math probe is one whose claimed OS is the OS it runs on. Spoofing the User-Agent is free; making the silicon agree is not.

Layer 2.6, Side Channels And State Leaks: The Signals Nobody Thinks To Patch

Timing side channel · navigator.storage

Incognito Detection By Write Speed

A recent revival of incognito detection skips feature checks entirely and times the storage backend. In a normal window navigator.storage is backed by disk, in a private window it is backed by RAM, and writing to RAM is measurably faster than writing to disk. Write and flush a single byte, time it, repeat a few times to weed out noise, and a flush under roughly a tenth of a millisecond gives away a private session.
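The measurement loop itself is simple. The sketch below is a Python analogue against an ordinary temp file, not the browser technique: navigator.storage is not reachable from Python, so os.fsync stands in for the storage flush, and the 0.1 ms threshold is just the rough figure quoted above. Absolute numbers will vary a lot by machine and filesystem.

```python
import os
import statistics
import tempfile
import time
from typing import Optional

RAM_BACKED_THRESHOLD_MS = 0.1  # "roughly a tenth of a millisecond", per the text


def median_flush_ms(samples: int = 9, directory: Optional[str] = None) -> float:
    """Write and flush one byte several times; return the median latency in ms."""
    timings = []
    with tempfile.NamedTemporaryFile(dir=directory) as f:
        for _ in range(samples):
            start = time.perf_counter()
            f.write(b"\0")
            f.flush()              # push Python's buffer to the OS
            os.fsync(f.fileno())   # ask the OS to push it to the backing store
            timings.append((time.perf_counter() - start) * 1000.0)
    # The median discards the occasional slow outlier caused by unrelated I/O.
    return statistics.median(timings)


def looks_ram_backed(flush_ms: float) -> bool:
    """Classify a flush latency against the assumed threshold."""
    return flush_ms < RAM_BACKED_THRESHOLD_MS
```

Pointing the directory argument at a tmpfs mount such as /dev/shm and then at a disk-backed path is a quick way to see the gap the detector is relying on.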

The lesson is broader than incognito: a detector does not need to read a property you spoofed, it can measure a physical consequence of your environment that no fingerprint patch touches. Timing is not on most people's spoofing checklist.

Implementation leak · IndexedDB ordering

When An Undefined Edge Case Becomes A Fingerprint

A 2026 Firefox flaw is the cleanest example of how a tiny implementation detail turns into an identifier. The browser returned IndexedDB database names in hash-table iteration order rather than a canonical sorted order. That ordering was stable and unique enough to correlate a user across sessions, across private windows, even across a Tor "New Identity" reset, all without a single cookie.

The takeaway for anyone building scrapers: stealth is not a setting you flip on. Any API behaviour, any implementation quirk, any undefined edge case can become a fingerprint. You cannot enumerate them in advance, which is why coherence across the whole environment matters more than patching individual tells.

The countermeasure that actually holds

Isolate At The Process Level

Both signals above survive the usual defence of a fresh browser profile per identity, because they leak from shared process state, not from the profile. Memory-state artifacts, timing baselines, and hash-table orderings live below the profile boundary. If several scraping identities share one browser process, those hidden signals quietly become a correlation key tying your "separate" identities together.

The fix is to stop isolating at the profile level and isolate at the process level: one identity, one process, and ideally one host with a coherent fingerprint, rather than many personas multiplexed through a single long-lived browser. It costs more to run. It is also the only isolation the deeper signals respect.

Hardware timing · No permission needed

Disk Latency As A Tracking Vector

The timing family goes deeper than storage backends. A technique making the rounds measures the microscopic latencies of the machine's SSD itself. Every site and active tab generates a slightly different load pattern on the disk subsystem, and ordinary JavaScript can read those timing variations through default web APIs. In principle a script can infer activity in other tabs from the disk-contention signature alone.

The two properties that make this matter for a scraper: it needs zero permissions, browsers do not prompt for disk access, and neither an ad blocker nor a private window mitigates it. Until vendors fuzz or round these timings there is no clean client-side fix. The general lesson repeats: hardware leaves a signature your fingerprint patch never touches, and a farm of identical VMs can look too identical at the disk layer.

Spoof contradiction · Client Hints

sec-ch-ua Is Deterministic, Not Random

The sec-ch-ua header looks like noise: "Chromium";v="149", "Not A Brand";v="24". It is not random, it is a pure function of the Chromium major version. The brand ordering, the GREASE-style "Not A Brand" label, and the version permutation are all seeded from the major version number over fixed lookup tables, with no rand() and no timestamp. Every machine on the same Chromium version produces an identical header.

That determinism is a trap if you spoof carelessly. Hand-roll a sec-ch-ua that does not match the algorithm for the Chrome version you are claiming, and you have manufactured a contradiction a detector can check with one lookup. If you set the version, derive the header from it rather than copying one from a different build.
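The one-lookup check a detector can run is sketched below. It only covers the simplest contradiction, a Chromium brand version that disagrees with the Chrome major in the User-Agent. It does not reproduce Chromium's brand-ordering or GREASE-label algorithm, and the function names are hypothetical.

```python
import re
from typing import Optional


def parse_sec_ch_ua(header: str) -> dict[str, int]:
    """Turn '"Chromium";v="149", "Not A Brand";v="24"' into {brand: version}."""
    pairs = re.findall(r'"([^"]+)";\s*v="(\d+)"', header)
    return {brand: int(version) for brand, version in pairs}


def ua_major(user_agent: str) -> Optional[int]:
    """Extract the Chrome major version from a User-Agent string, if present."""
    match = re.search(r"Chrome/(\d+)", user_agent)
    return int(match.group(1)) if match else None


def client_hints_contradict(user_agent: str, sec_ch_ua: str) -> bool:
    """True when the two headers cannot both come from the same Chromium build."""
    major = ua_major(user_agent)
    chromium = parse_sec_ch_ua(sec_ch_ua).get("Chromium")
    return major is None or chromium is None or major != chromium
```

A fuller check would also compare brand order and the GREASE label against the values the claimed version is known to emit, which is the part a copied header from a different build gets wrong.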

The pattern across all of these

Why These Keep Appearing

Storage timing, disk latency, IndexedDB ordering, Client Hints seeding: none of these is on a typical spoofing checklist, and that is exactly why they work. They are second-order signals, derived from physics or from an implementation detail rather than from a property you thought to override.

You cannot enumerate them all in advance. The defensible posture is not to chase each new tell, it is to keep the whole environment internally consistent and let real hardware speak for itself wherever you can, rather than presenting a hand-assembled identity that has to be right on every one of a thousand axes at once.

Layer 2.5, WebAssembly Fingerprinting: The Layer Below Your Stealth Browser

Build-time leak · Nobody patches this

Hyphenation Dictionary Detection

Chromium needs a hyphenation dictionary bundled at build time on Windows and Linux (Android and macOS handle it at OS level). Most people forking Chromium don't know this. The benefit for detection: many stealth Chromium forks literally cannot hyphenate words, and that is visible from JavaScript.

The probe: set hyphens: auto on a narrow container, render a known word like "hyphenation", read the rendered width or screenshot via Canvas. A real Chrome on Windows produces hy-phen-ation. A custom fork without the dictionary produces no break, or the wrong break.

Affected stealth browsers: anything built from a custom Chromium source that skipped the hyphenation step, which is most of them. Real CloakBrowser and properly built forks include it; hand-rolled patches usually don't.

Mitigation: confirm your build ships the dictionary for every language you claim to support, or run a real Chrome binary under XVFB. Verify with the live PoC: joe12387.github.io/hyphenation-dictionary-poc

CPU fingerprinting · DataDome internal research

WASM SIMD probes the CPU itself

WebAssembly SIMD (Single Instruction Multiple Data) gives browsers access to 128-bit vector operations that map directly to CPU instructions. Anti-bots ship tiny WASM modules that execute SIMD ops in deterministic patterns and time them. The results reveal vector register width, NEON vs SSE vs AVX availability, and microarchitecture quirks unique to the CPU model.

Why this matters: stealth browsers like Camoufox, CloakBrowser, PatchRight patch what the browser reports. WASM SIMD probes the actual CPU. A real Mac with M2 chip can't be spoofed to look like an Intel laptop because the SIMD timing fingerprint is generated by the silicon, not by the browser.

Source: Anthony Manikhouth (DataDome bot detection engineer), blog.azerpas.com, May 2026.

High-resolution timing · The enabling primitive

SharedArrayBuffer via WASM gives 17× timer precision

Chrome floors performance.now() at 100µs on non-isolated pages to prevent Spectre-style timing attacks. But one line of JavaScript breaks that: new WebAssembly.Memory({shared:true}).buffer returns a real SharedArrayBuffer on any page, no special headers required.

Paired with a MessageChannel ping-pong loop in a hidden iframe driving Atomics.add(), you get a counter incrementing on the order of 100,000 Hz, distinguishing steps around 6µs. That's roughly 17× finer (100µs / 6µs ≈ 16.7) than the timer Chrome intends you to have.

Why anti-bots love this: micro-timing patterns (canvas render time, JS jitter, animation frame variance) differ between humans and bots at sub-millisecond scale. WASM shared memory makes that measurable on every page, not just cross-origin-isolated ones. Reported to Chrome as crbug 40057687, marked Won't Fix.

Source: Manuel (brokenbrowser.com).

Bypass implications

Why your stealth browser still leaks

WASM fingerprinting sits in a blind spot of most stealth tools. The signal flow:

1. Anti-bot ships a WASM module with SIMD ops and a high-resolution timer built from WebAssembly.Memory({shared:true}).
2. The module runs natively, no JS hooks to intercept, no Function.toString() traces to leak.
3. CPU microarchitecture + timing patterns are POSTed back as part of the bot scoring payload, often alongside the canvas hash.

What this defeats: Camoufox (Firefox C++ patches), CloakBrowser (49 Chromium patches), PatchRight, undetected-chromedriver, Nodriver, Pydoll. All of them patch JS APIs and binary internals, but none patches the WASM execution layer.

What still works: real hardware diversity. Different physical machines produce different SIMD fingerprints naturally. The future of stealth scraping is less about better lies and more about real hardware in real consumer locations, which is exactly what residential proxies on real ISP IPs already approximate.

CVE-2026-6770 · April 2026 · Process memory leak

IndexedDB Iteration Order: The Fingerprint Nobody Patched

Firefox stored IndexedDB database names using internal UUID mappings in a global hash table shared across all origins within the same browser process. When a site calls indexedDB.databases(), the names come back in hash table iteration order, which is deterministic and stable for the lifetime of that process. Two unrelated sites see the same ordering and can use it to silently link a user's activity across domains — no cookies, no shared storage, no user interaction required.

The fingerprint persisted across reloads, new private windows, and even Tor Browser's "New Identity" resets. Only a full browser restart cleared it. Fixed in Firefox 150 / Tor Browser 15.0.10 (April 21, 2026).
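The mechanism has a close analogue in Python, which makes it easy to see why the ordering works as an identifier. String hashing in CPython is randomised per process by default, so a set of names iterates in an order that is stable inside one process and can differ in the next. This is only an analogy for the Firefox behaviour described above, not a reproduction of it, and the database names are made up.

```python
def observed_order(names: list[str]) -> list[str]:
    """Iteration order of a hash-based container: stable per process, not canonical."""
    return list(set(names))


def canonical_order(names: list[str]) -> list[str]:
    """Sorting before returning removes the per-process signal entirely."""
    return sorted(set(names))


if __name__ == "__main__":
    dbs = ["orders", "cache", "profile", "drafts"]  # hypothetical database names
    # Run the script twice: the first line usually changes between runs,
    # the second never does.
    print(observed_order(dbs))
    print(canonical_order(dbs))
```

Any API that hands back hash-table order is leaking a little process state; returning a sorted view is what closes that kind of leak.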

Scraper implication: If you run multiple scraping identities inside the same browser process (shared Camoufox instance, same Firefox PID), an anti-bot can correlate them using this ordering as a stable session token — regardless of proxy rotation, cookie isolation, or fingerprint patching. The signal is below every stealth layer.

Rule: Isolate scraping identities at the process level, not just the profile level. One identity = one browser process. Verify your Camoufox build is on Firefox 150+.
Ref: CVE-2026-6770 · mfsa2026-30 · SecurityAffairs writeup

Layer 3, Network Identity: The Five Vectors That Must Agree

This is the layer most people underinvest in. Beginners pour effort into header spoofing and fingerprint patching while running through a flagged IP, and then wonder why nothing works. It is backwards. If your IP reputation is bad, no amount of header spoofing will save your requests: the request is scored down before your carefully crafted headers are ever read. Infrastructure beats code here: a plain HTTP client on a clean residential IP routinely outperforms a perfectly patched browser on a flagged datacenter IP. Get the network identity right first, then worry about everything above it.

Primary signal

IP Reputation & ASN

Anti-bots check your IP against ASN databases. AWS (AS16509), GCP (AS15169), and Azure (AS8075) are flagged immediately. DigitalOcean, Linode, and Vultr are all known too. Even "residential" proxy pools that actually sit on datacenter ranges (for example 24.105.x.x) are flagged if the ASN belongs to a known proxy provider. Genuine ISP residential or 4G carrier IPs are the only reliably clean option.

Browser API · Often overlooked

WebRTC IP Leak

JavaScript can query WebRTC ICE candidates, which reveal your real local and public IP even through a proxy. If your browser exits through a US proxy but WebRTC reveals an address in Pakistan, or the ICE candidate is from a different subnet than the HTTP request IP, that's an immediate flag. Camoufox's geoip=True aligns WebRTC candidates with the proxy exit country.

All five must agree

The Coherence Test

Anti-bots run a coherence check across five vectors: IP country, timezone, Accept-Language, WebRTC candidate, and DNS resolver location. A US proxy with Accept-Language: ur-PK fails immediately. All five must tell a consistent geographic story. This is why setting geoip=True in Camoufox is critical: it auto-configures all five to match the proxy's exit country.
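The five-way check can be sketched as a small function. This is a minimal illustration under loud assumptions: the TIMEZONE_COUNTRY table is a tiny hypothetical stand-in for a real timezone database, the region subtag of Accept-Language is treated as the country, and anything that cannot be resolved is treated as unknown rather than as a contradiction.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, deliberately tiny lookup table for illustration only.
TIMEZONE_COUNTRY = {
    "America/New_York": "US",
    "America/Chicago": "US",
    "Asia/Karachi": "PK",
    "Europe/Berlin": "DE",
}


def language_country(accept_language: str) -> Optional[str]:
    """Region subtag of the first Accept-Language entry, e.g. 'ur-PK' -> 'PK'."""
    first = accept_language.split(",")[0].split(";")[0].strip()
    parts = first.split("-")
    return parts[1].upper() if len(parts) > 1 else None


@dataclass
class NetworkIdentity:
    ip_country: str
    timezone: str
    accept_language: str
    webrtc_country: str
    dns_resolver_country: str


def incoherent_vectors(identity: NetworkIdentity) -> list[str]:
    """Names of the vectors whose country disagrees with the IP's country."""
    observed = {
        "timezone": TIMEZONE_COUNTRY.get(identity.timezone),
        "accept_language": language_country(identity.accept_language),
        "webrtc": identity.webrtc_country,
        "dns_resolver": identity.dns_resolver_country,
    }
    return [
        name
        for name, country in observed.items()
        if country is not None and country != identity.ip_country
    ]
```

Running this against your own session before sending traffic is a cheap way to catch the US-proxy-with-ur-PK class of mistake described above.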

Layer 3.5, DOM Honeypots: The Trap Doesn't Care About Your Fingerprint

Hidden DOM elements

Honeypot Fields and Links

Invisible form fields and hidden links that humans never see but bots fill in or click. Triggered = bot detected = IP banned. Common patterns: display:none, visibility:hidden, opacity:0, zero-dimension elements, off-screen positioning, fields with tabindex="-1", or links placed after the closing </body> tag.
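The patterns above translate into a simple pre-interaction filter. The sketch below is illustrative only: it inspects an inline style string, an attribute dict, and a bounding box shaped like getBoundingClientRect() output (x, y, width, height). A real check would need computed styles from the browser, since hiding rules usually live in stylesheets, and tabindex="-1" alone is a weak signal because legitimate widgets use it too.

```python
def parse_style(style: str) -> dict[str, str]:
    """Parse an inline style attribute into lower-cased {property: value}."""
    declarations = {}
    for part in style.split(";"):
        if ":" in part:
            prop, value = part.split(":", 1)
            declarations[prop.strip().lower()] = value.strip().lower()
    return declarations


def honeypot_reasons(style: str, attrs: dict, rect: tuple) -> list[str]:
    """List every hidden-element pattern this element matches."""
    x, y, width, height = rect
    css = parse_style(style)
    reasons = []
    if css.get("display") == "none":
        reasons.append("display:none")
    if css.get("visibility") == "hidden":
        reasons.append("visibility:hidden")
    try:
        if float(css.get("opacity", "1")) == 0.0:
            reasons.append("opacity:0")
    except ValueError:
        pass  # non-numeric opacity such as 'inherit': cannot judge from here
    if width == 0 or height == 0:
        reasons.append("zero-dimension")
    if x + width < 0 or y + height < 0:
        reasons.append("off-screen")
    if attrs.get("tabindex") == "-1":
        reasons.append("tabindex=-1")
    return reasons
```

An empty list means none of the listed patterns matched, not that the element is safe; treat it as a first filter before any form fill or click.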

Data poisoning

Fake Data Served to Suspected Bots

More dangerous than blocking: sites detect a scraper and silently serve different prices, fake reviews, or wrong stock counts. You think you're scraping successfully, but your dataset is corrupted. Defence: compare scrapes from 2+ different IP profiles for the same URL. Mismatched data = poisoning. Always check element visibility (getBoundingClientRect()) before interacting.
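The cross-profile comparison is a few lines once the records are parsed. A minimal sketch, assuming each scrape of the same URL has already been reduced to a flat dict of fields; the profile labels and field names in the usage example are hypothetical.

```python
def poisoned_fields(scrapes: dict[str, dict]) -> dict[str, list[str]]:
    """Map each field that differs between IP profiles to the values seen.

    `scrapes` maps a profile label (e.g. 'residential') to the record that
    profile received for the same URL.
    """
    fields = set()
    for record in scrapes.values():
        fields.update(record.keys())

    mismatched = {}
    for field in sorted(fields):
        # repr() keeps unhashable values (lists, dicts) comparable.
        values = {repr(record.get(field)) for record in scrapes.values()}
        if len(values) > 1:
            mismatched[field] = sorted(values)
    return mismatched
```

For example, poisoned_fields({"residential": {"price": "19.99"}, "datacenter": {"price": "24.99"}}) reports the price field. Fields that legitimately vary per request (timestamps, session tokens) should be dropped before comparing, or they will drown out the signal.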

Layer 4, Behavioural ML: You Can't Fake Being Human

Physics-based · Gaussian jitter

Mouse Movement Curves

Human mouse movements follow Bezier curves with Gaussian noise applied to velocity. The mouse decelerates as it approaches a target (Fitts's Law), overshoots slightly, then corrects. Scrapers that click elements directly (teleporting the mouse to x,y) create a trajectory signature that's statistically impossible for a human. DataDome's 35-signal behavioural model catches this immediately. Botasaurus generates physically realistic curves with randomised velocity profiles.
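Seen from the detector's side, the teleport signature is the easiest one to test for. The sketch below flags a pointer trail that has too few samples before a click or that jumps an implausible distance between two consecutive samples. Both thresholds are hypothetical values picked for illustration; this is not DataDome's model or any vendor's.

```python
import math


def looks_teleported(
    path: list[tuple[float, float, float]],
    min_samples: int = 5,
    max_step_px: float = 300.0,
) -> bool:
    """path is an ordered list of (t_ms, x, y) pointer samples ending at a click."""
    # A human approach produces many intermediate move events.
    if len(path) < min_samples:
        return True
    # A single huge step between samples means the cursor skipped the screen.
    for (_, x0, y0), (_, x1, y1) in zip(path, path[1:]):
        if math.hypot(x1 - x0, y1 - y0) > max_step_px:
            return True
    return False
```

A production model would go well beyond this and score the velocity and deceleration profile, but even this crude check separates a direct click at x,y from a continuous trail.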

Sub-millisecond precision · ML scored

Timing Analysis

Transformer ML models trained on millions of sessions measure: time between page load and first interaction, scroll acceleration curves, inter-keystroke timing variance, navigation dwell time, and micro-timing of JS event handlers at <1ms precision. A scraper that immediately calls document.querySelector() after DOMContentLoaded looks nothing like a human who reads the page for 2.3 seconds first. Warm-up navigation (visiting homepage before target) significantly improves behavioural scores.

A sharp edge most stealth setups forget: behaviour is judged server-side too, from request timing alone, with no JavaScript and no fingerprint involved. The behavioural layer is usually discussed as client-side mouse and keyboard telemetry, which is why a real browser feels like a safe haven. But a server watching only the sequence and timing of your requests can still catch automation. A self-hosted open-source detector demonstrated this by flagging a genuine Chrome being driven by an LLM over the Chrome DevTools Protocol, with no client-side fingerprinting and no TLS interception in play. The tell was subtle: the automated session requested resources in an order and rhythm that did not match its own cache state, fetching things as though the cache were cold while it was actually warm. Real browsing has a characteristic shape, conditional requests, parallelism, think-time, asset ordering driven by the renderer, that a request sequencer reproduces imperfectly.
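A server-side check for the cache mismatch described above can be sketched from a request log alone. This is an illustration under stated assumptions, not the detector referred to in the text: it assumes the resources were served with validators (ETag or Last-Modified), so a browser with a warm cache would either skip the request or send a conditional one, and the log format is hypothetical.

```python
def cache_incoherent_urls(requests: list[dict]) -> list[str]:
    """Flag URLs re-fetched in one session without a conditional header.

    Each entry is {'url': str, 'headers': dict}, in arrival order.
    """
    seen = set()
    flagged = []
    for req in requests:
        url = req["url"]
        header_names = {name.lower() for name in req.get("headers", {})}
        conditional = (
            "if-none-match" in header_names or "if-modified-since" in header_names
        )
        # Second fetch of the same URL, behaving as if the cache were cold.
        if url in seen and not conditional:
            flagged.append(url)
        seen.add(url)
    return flagged
```

No JavaScript and no fingerprint are involved, which is the point: the only input is the sequence of requests the session actually sent.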

Why it survives a perfect fingerprint

Timing is orthogonal to identity

TLS, JA4, canvas, and Client Hints describe what you are. Request timing describes how you behave over time, and the two are independent. You can present a flawless real-Chrome fingerprint and still emit a request rhythm no human produces. A patient detector does not even need to be certain on request one: it can lower confidence, watch the next few requests more closely, and converge, which is exactly what a behavioural memory model is built to do.

What it means for your crawler

Make the cache and the cadence honest

Drive real navigations rather than firing a hand-built request list: let the browser fetch sub-resources in its natural order, honour caching so a warm session does not re-request what it already holds, and keep think-time and parallelism in a human range. The same logic that beats client behavioural ML, move like a person, applies one layer down at the request-sequence level. Server-side timing is cheap to run, needs no client cooperation, and is the layer a fingerprint patch never touches.

The 2026 direction of travel: continuous session-long behavioural validation. The behavioural checks above are mostly evaluated in bursts, at a challenge or on a scored request. The newer enterprise approach (Cloudflare shipped a productised version in mid-2026) drops that model and validates behaviour continuously across the whole session. A dynamically injected script quietly collects interaction signals as you use the site, cursor motion, scroll dynamics, typing cadence, clipboard actions, how long the page is actually visible, and streams them back to be scored in real time, with the verdict compounding as the session goes on. The design premise is precise: modern automation can execute JavaScript, run a real browser, and pass a single CAPTCHA without raising a flag, so a point-in-time check catches less and less. What stays hard is producing consistent human behaviour over an entire visit.

Why bursts stop working

You cannot reset a session by refreshing

A per-request or per-challenge check is a door you pass once; a session-long model is a companion that never leaves. Because the score compounds with context across the visit, refreshing the page or navigating away does not wipe the behavioural signature, the session carries forward, so the cheap escape hatches (reload on block, spin a new page) stop helping. For automation this changes the target completely: it is no longer enough to look human for the one interaction being scored, the whole arc of the visit has to hold together, which is far more expensive to simulate and far less reliable to keep up at scale.

Where synthetic behaviour cracks

Human movement is not just "noisy"

The common way to fake human input is to add Gaussian noise or uniform random delays to mouse paths and keystroke timing. That defeats a naive "is this too perfect" check but fails a model that knows the real texture of human motion, which has structure noise does not, the acceleration and correction profile of a real cursor, the rhythm of real typing. The other half is cross-signal coherence: does pointer activity line up with when the page was actually visible, is a text field genuinely focused during the typing events it claims. Random jitter satisfies none of those joint constraints. The defensive lesson mirrors the offensive one from the session-stickiness section, coherence across signals over time is the wall, and a bolt-on layer of randomness is exactly what it is built to catch.

The story so far: You now understand the full detection stack, TLS fingerprints at the network layer, JS interrogation in the browser, IP reputation checks per request, and ML behavioural analysis across the session, including the server-side request-timing variant that needs no client signal at all. The next sections show you exactly which anti-bot vendors use which combination of these layers, and the specific bypass strategies for each.

Layer 5, Fingerprint Replay: The Game Stopped Being About Spoofing

Harvested assets · Not synthetic spoofs

Your Fingerprint Can Be Harvested And Replayed

For years the model was simple: anti-bots read your fingerprint, you spoof the attributes they read. That model is now incomplete. There is a parallel economy that collects real browser fingerprints from genuine user traffic and replays them inside automation. Disposable email sites, CAPTCHA farms, and sneaker bots quietly embed collectors that pull canvas, WebGL, audio, fonts, and device attributes from every real visitor.

Real fingerprints trade as operational assets. I have seen forum threads pricing them around five dollars per thousand, and one ecosystem scan found roughly one in eight bot-adjacent sites running a fingerprinting collector. The implication for you: a perfectly coherent fingerprint is not necessarily a spoofed one. It may be a real environment, lifted from a real person, replayed at scale.

Remote render · Inject real output

Canvas Rendered On Real Hardware

The most interesting harvesting technique splits fingerprint generation from the automation client. Instead of synthetically faking a canvas hash (which leaves detectable artifacts), the canvas is rendered on a real remote machine and the output injected into the automation browser. The hash you present came from genuine hardware, so the usual "this canvas looks computed" tells disappear.

The harvested collectors I have looked at do not stop at raw attributes. They mirror vendor-specific challenge logic: Akamai-style feature bitmasks, PerimeterX canvas markers, payment-provider canvas seeds, hashed feature keys aligned with named CAPTCHA vendors. The dataset is dual layer: raw device signals plus vendor-shaped outputs. That is what makes replay viable against a specific wall rather than just "looking human."

The defensive countermove

Why This Changes Your Strategy

If real fingerprints can be replayed, defenders stop trusting any single client-side signal in isolation. The countermeasures they reach for are exactly the ones that break naive scrapers: rotate the fingerprinting logic itself so harvested payloads go stale, randomise payload schemas so an offline-generated payload no longer matches, bind payloads to per-session nonces so a replayed body is invalid, and cross-check redundant signals across main page, iframe, and worker so a partially patched environment contradicts itself.
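Nonce binding is the easiest of these countermeasures to show in code. The sketch below is a generic HMAC construction, not any vendor's scheme: SERVER_KEY stands in for whatever secret derivation the real client script performs, and the payload fields in the usage example are made up. What it demonstrates is that a body captured under one session nonce fails verification under another.

```python
import hashlib
import hmac
import json
import secrets

SERVER_KEY = secrets.token_bytes(32)  # hypothetical per-deployment secret


def issue_nonce() -> str:
    """Fresh random nonce handed out at the start of each session."""
    return secrets.token_hex(16)


def sign_payload(payload: dict, nonce: str, key: bytes = SERVER_KEY) -> str:
    """Tag that binds the fingerprint payload to one session nonce."""
    body = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(key, nonce.encode() + b"." + body, hashlib.sha256).hexdigest()


def verify_payload(payload: dict, nonce: str, tag: str, key: bytes = SERVER_KEY) -> bool:
    """Constant-time check that the tag matches this payload and this nonce."""
    return hmac.compare_digest(sign_payload(payload, nonce, key), tag)
```

Signing the same payload under two nonces yields two different tags, so a harvested body replayed into a new session is rejected even though every attribute in it is genuine.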

The takeaway I keep coming back to: coherence and freshness now matter more than any one spoofed attribute. A clean canvas hash from last week, replayed today against a vendor that rotated its logic, fails harder than no spoofing at all. Treat your fingerprint as a living thing that has to agree with itself across every context, every request.

Now you know the detection layers, four signal families plus the replay economy sitting on top of them. Every vendor below is just a different weighting of those same signals, some prioritise TLS, others behaviour, others network identity. Knowing the layer tells you which tool to pick. Here are the six walls.

03 The vendors

Six companies built the walls.
Here's every key.

Each vendor applies the detection layers differently, different weights, different signals, different architectures. What bypasses Cloudflare has zero effect on Kasada. You need to know exactly which wall you're facing before you choose a tool.

Step 0, Before anything else

Identify which anti-bot you're facing

Wrong strategy on the wrong vendor wastes hours. Before writing a single line of code, spend 30 seconds identifying exactly what's protecting the target.

1 Wappalyzer Chrome Extension

Visit the target site and click the Wappalyzer icon in your toolbar. It instantly shows all detected technologies, including the anti-bot vendor: Akamai, Cloudflare, DataDome, PerimeterX, Kasada and more, with a single click.

2 Check response cookies

_abck: Akamai

bm_sz: Akamai

cf_clearance: Cloudflare

datadome: DataDome

dd_cookie_test: DataDome

_px3: PerimeterX

x-kpsdk-ct: Kasada

_fs_ch_st_: Fastly

reese84: F5 Shape

Open DevTools → Application → Cookies. Match any cookie name to identify the vendor. Multiple vendors can run on the same site. For CLI scanning at scale: wafw00f https://target.com identifies WAF + anti-bot vendor in one command.
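The cookie lookup is easy to script once you have the cookie names from a response. A minimal sketch using only the markers listed above; matching by prefix is an assumption made here because some of these names carry suffixes in practice, and the function name is hypothetical.

```python
# Cookie-name markers and vendors, taken from the list above.
COOKIE_VENDORS = {
    "_abck": "Akamai",
    "bm_sz": "Akamai",
    "cf_clearance": "Cloudflare",
    "datadome": "DataDome",
    "dd_cookie_test": "DataDome",
    "_px3": "PerimeterX",
    "x-kpsdk-ct": "Kasada",
    "_fs_ch_st_": "Fastly",
    "reese84": "F5 Shape",
}


def identify_vendors(cookie_names: list[str]) -> list[str]:
    """Return the sorted set of vendors whose markers appear in the cookie names."""
    found = set()
    for name in cookie_names:
        lowered = name.lower()
        for marker, vendor in COOKIE_VENDORS.items():
            if lowered.startswith(marker):  # prefix match: an assumption
                found.add(vendor)
    return sorted(found)
```

Because the result is a list, a site running more than one vendor shows up as such, which matters when choosing a strategy.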

3 Check response headers

DevTools → Network → any request → Response Headers. Look for x-datadome, server: cloudflare, x-akamai-request-id, or challenge redirect URLs containing vendor names.

🔍 Wappalyzer

Free Chrome + Firefox extension. One click on any site shows:

Anti-bot / security vendor
CDN provider
CMS, framework, analytics
Server technology


Also useful

wappalyzer.com · builtwith.com · whatcms.org · wafw00f (CLI) · WhatWaf (CLI)

01/06 · Airlines · Banks · ~30% Fortune 500

Akamai

Bot Manager v3+ injects sensor.js (~512KB, fully obfuscated) into every protected page. Unlike Cloudflare, which checks at the CDN edge, Akamai runs its full fingerprint suite inside your browser via this script. It collects 500+ signals over multiple requests, so trust accumulates across the session, not just on the first hit. The critical 2026 signal: 60 chrome-extension:// URL probes. Zero passing = instant bot score regardless of all other signals. JA4+ is checked at the EdgeWorker before HTML is served.

_abck cookie · bm_sz · 60 ext probes · Battery API · Multi-req scoring

Bypass strategy

Step 1: Check for GraphQL/XHR API first, a direct endpoint bypasses HTML anti-bot entirely

curl_cffi impersonate="chrome124" handles TLS + HTTP/2 layer

CloakBrowser with 49 C++ patches handles sensor.js interrogation

Load Bitwarden + 1Password extensions to pass 60 extension probes

ISP/static residential proxy, never rotate mid-session (trust accumulates)

Homepage warm-up → 2–3s human dwell → scroll → navigate to target

Script size: ~512KB (re-obfuscated per rotation)

Ext probes: 60 (zero passing = instant block)

Fortune 500: ~30% (retail, airlines, finance)

Scoring: multi-req (trust builds across session)

02/06 · 20% of all internet traffic · 200+ countries

Cloudflare

Cloudflare's uniqueness is infrastructure-level deployment. JA4 is computed in a Rust crate running on every Cloudflare edge node, your request is fingerprinted before it reaches any application server. The ML bot score (1–99) is trained on Cloudflare's view of 20% of all internet traffic, giving it an unmatched baseline for what "real" browsers look like. Turnstile (their CAPTCHA replacement) submits a 79-parameter POST including Canvas hash, font measurements, SHA-256 proof-of-work, and TEA-encrypted timing data.

cf_clearance · __cf_bm · JA4 Rust edge · Turnstile 79 params · ML score 1–99

Bypass strategy

Origin IP bypass: check SecurityTrails DNS history, many sites had Cloudflare added later, origin IP is in old A records

Camoufox with geoip=True, 100% pass rate Mar 2026 on Instagram, Reddit, X, LinkedIn

Scrapling's StealthyFetcher solves Turnstile natively and automatically

Turnstile HTTP bypass possible: solve the PoW + Canvas hash without a browser in ~0.27s

Camoufox uses Juggler (not CDP), zero CDP timing artifacts that Cloudflare's ML scores heavily

Web coverage: 20% (all internet traffic)

Turnstile params: 79 (Canvas + PoW + TEA crypto)

Camoufox: 100% (pass rate Mar 2026)

ML training: global (20% of all traffic)

03/06 · 5 trillion signals/day · 1,200+ clients

DataDome

DataDome's architecture is fundamentally different from the others: it deploys 85,000 separate ML models, one per protected site. There is no universal bypass. What works on Grainger.com may not work on Le Monde. It runs at the application server level (not CDN), so origin IP bypass is impossible. The WASM boring_challenge is a Rust-compiled state machine that cannot be emulated; it requires actual browser execution to produce valid tokens. IP reputation alone accounts for 25–30% of the total trust score.

datadome cookie · WASM boring_challenge · Picasso device FP · 35+ behavioural · 85K per-site models

Confirmed bypass, Grainger.com ✓

Always try first: find __NEXT_DATA__ in the HTML source. Grainger had 110KB of product data in it, which sidesteps DataDome entirely

curl_cffi chrome124 + residential proxy → confirmed 200 OK (Grainger.com)

Mobile carrier IP (T-Mobile, Vodafone 4G): highest trust score, hardest to flag

Camoufox + geoip=True aligns all 5 identity vectors with proxy exit country

2ms real-time response means every request is independently scored
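The "always try first" step is easy to check mechanically. Below is a minimal sketch, using only the standard library, that pulls the `__NEXT_DATA__` JSON out of a page's HTML. The sample markup and its `props.pageProps.product` shape are hypothetical; real pages nest their data differently, so treat the key path as an assumption to verify per site.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the page has none."""
    parser = NextDataParser()
    parser.feed(html)
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None


# Hypothetical page fragment, for illustration only.
sample = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"product": {"sku": "X1", "price": 9.5}}}}'
    "</script></body></html>"
)
data = extract_next_data(sample)
print(data["props"]["pageProps"]["product"])
```

If this returns the data you need from the first HTML response, there is nothing left to bypass for that page.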

ML models: 85,000 (one per protected site)

Response: 2ms (real-time, app server)

IP weight: 25–30% of total trust score

Universal bypass: none (per-site models)

04/06 · HUMAN Security · 3 billion devices

PerimeterX

After merging with HUMAN Security, PerimeterX gained the most powerful network effect in anti-bot. It verifies 15 trillion interactions per week across 3 billion devices. The critical risk: get detected on any one of 29,650+ protected sites and your fingerprint is flagged across the entire network. Nike, Walmart, Zillow, StubHub all share reputation data. Its 5-vector unified score (TLS + IP + HTTP headers + JS fingerprint + Behaviour) requires all five to pass simultaneously; fixing only one vector has zero effect.

_px3 cookie · _pxde cookie · 5-vector score · 29,650 site network · Human Challenge

Bypass strategy

All 5 vectors must pass simultaneously: Camoufox + residential proxy addresses all of them

Generate a fresh fingerprint per session; never reuse fingerprints across different target domains

SeleniumWire can intercept the _px3 token generation flow for token replay

Scrapfly's ASP flag handles all 5 layers automatically at managed API level

Never use burned IPs: the network effect means reputation follows you across sites

Sites: 29,650+ (Nike, Walmart, Zillow)

Weekly verifications: 15T (3B devices/month)

Vectors: 5/5, all must pass

Network effect: global, reputation shared

05/06 · No CAPTCHA · Gatekeeper proxy architecture

Kasada

Kasada operates as a gatekeeper proxy: every request flows through it before reaching origin. Its JavaScript (ips.js, renamed polymorphically each deployment) issues proof-of-work challenges that require real CPU cycles and browser APIs to solve. There are no CAPTCHAs; failures are silent 403s or 429s with no explanation. The critical 2026 fact: Kasada specifically fingerprints playwright-stealth by calling Function.prototype.toString() on patched native functions. The patch signatures are catalogued.

x-kpsdk-ct · x-kpsdk-cd · ips.js PoW · polymorphic JS · toString() inspection

Bypass strategy

Never use playwright-stealth: Kasada has its toString() signatures catalogued and blocks it outright

PatchRight patches at Python source level, so there is nothing in the JS runtime to inspect via toString()

SeleniumBase UC mode: removes webdriver flag and auto-handles PoW challenges

Residential proxy essential: datacenter IPs receive near-zero trust regardless of browser

PoW tokens are single-use: never replay, always generate fresh per request

Block style: silent (403, no explanation)

playwright-stealth: detected (catalogued signatures)

Challenge: JS PoW (real CPU required)

JS file: polymorphic (renamed each deploy)

06/06 · $1 billion acquisition · Most sophisticated

F5 Shape

F5 acquired Shape Security for $1 billion in 2020, and the price reflects what they built. Shape runs a custom JavaScript virtual machine. The bytecode that executes in the browser is not standard JavaScript; it's a proprietary instruction set that you cannot reverse-engineer with standard tooling. Session tokens expire in minutes. The challenge payload is re-generated with every rotation. For production scraping at scale, DIY bypass is economically irrational: the engineering cost of maintaining a bypass exceeds the cost of Bright Data's API within weeks.

reese84 cookie · TS cookie · custom JS VM · minute-cadence rotation · $rsc= params

Bypass strategy

First: check if the mobile app uses a weaker backend. Shape is often only on the web frontend

Only reliable option at scale: Bright Data (98.44%) or Zyte (93.14%) managed APIs

DIY reverse engineering: deobfuscating the VM bytecode takes weeks per rotation

Cost-justify: >2 days/month of maintenance time → managed API is cheaper

The custom VM produces tokens that can be replayed for a few minutes, so session pooling can reduce API costs
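The cost-justify rule is plain arithmetic, so it is worth writing down once. A rough sketch follows; every number in it (day rate, request volume, price per 1k requests) is a hypothetical placeholder, not a quote from any vendor.

```python
def compare_monthly_cost(maintenance_days, day_rate, monthly_requests, api_price_per_1k):
    """Return (cheaper_option, diy_cost, managed_cost) for one month."""
    diy_cost = maintenance_days * day_rate
    managed_cost = monthly_requests / 1000 * api_price_per_1k
    cheaper = "managed API" if managed_cost < diy_cost else "DIY"
    return cheaper, diy_cost, managed_cost


# Hypothetical inputs: 3 engineer-days at 600/day vs 500k requests at 2.00 per 1k.
print(compare_monthly_cost(3, 600, 500_000, 2.0))
```

Plug in your own figures; the break-even point moves a lot with request volume, which is why the ">2 days/month" threshold is a rule of thumb rather than a constant.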

Acquisition: $1B (F5 Networks, 2020)

Token expiry: minutes (tight rotation cadence)

VM type: custom (proprietary bytecode)

DIY viability: none (use managed API)

Forter

Fraud / Behavioural

Focuses on behavioural analysis and device fingerprinting for fraud prevention. Monitors checkout speed, typing rhythm, and device profile. Common on e-commerce checkouts. Bypass: headless browser with randomised timings, diverse residential proxy pool, replay real user interaction sequences.

Behavioural · Device FP · Checkout fraud

Riskified

Fraud / Behavioural

Monitors shopping and payment behaviour alongside device fingerprinting. Flags anomalies in purchase flow, typing patterns, and system details. Bypass: Playwright Stealth with realistic interaction replay, residential proxies, maintain full session cookies across the purchase flow.

Behavioural · Device FP · Payment flows

Imperva Incapsula

WAF · IP reputation · JS challenges

Enterprise WAF used by Fortune 500 financial and government sites. Focuses on IP reputation databases + JavaScript challenges + behavioural analysis. Less aggressive than DataDome on TLS but harsh on flagged IPs. Bypass: residential proxies (datacenter IPs nuked instantly), Camoufox or fortified browser, slow request pacing.

IP reputation · JS challenge · Enterprise/finance

AWS WAF

Cloud-native · Bot Control · Captcha

Amazon's managed WAF with Bot Control add-on. Three protection levels: Common (signature-based), Targeted (behaviour + JS challenge), Custom rules. Used by AWS-hosted apps. Bypass: rotate residential IPs (Common tier blocks AWS IPs themselves), browser automation for Targeted tier, request rate ≤ 5/sec to avoid trigger thresholds.

AWS-native · Bot Control · CAPTCHA

Anubis

Open-source · PoW · Anti-AI scraper

Self-hosted Web AI Firewall (15k+ stars on GitHub, MIT licensed, written in Go by Xe Iaso/TecharoHQ). Sits as a reverse proxy and issues JavaScript proof-of-work challenges before serving requests. Built specifically against AI scrapers that ignore robots.txt. Used by Codeberg, FFmpeg, the Linux kernel source, Sourcehut, and most non-Cloudflare FOSS projects. Recognisable by its anime "Anubis" mascot illustration during the challenge. Bypass: headless Chromium with JS enabled (it'll solve the PoW naturally, just slower), or persist the verification cookie across requests. Codeberg confirmed in mid-2025 that AI scrapers already learned to solve Anubis challenges, so it slows scraping but doesn't stop a determined operator.

Proof-of-work · Self-hosted · AI-targeted · FOSS
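To make "proof-of-work challenge" concrete, here is a generic hashcash-style sketch: the server hands out a challenge string and a difficulty, the client burns CPU until it finds a nonce whose SHA-256 digest starts with that many zero hex digits, and the server verifies with a single hash. This illustrates the general mechanic only. The challenge format, difficulty encoding, and function names are assumptions, not Anubis's or any vendor's actual scheme.

```python
import hashlib
from itertools import count


def solve_pow(challenge, difficulty):
    """Find the smallest nonce whose SHA-256 hex digest has `difficulty` leading zeros."""
    target = "0" * difficulty
    for nonce in count():
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce, digest


def verify_pow(challenge, nonce, difficulty):
    """Server side: one hash is enough to check the client's work."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)


# Difficulty 3 needs roughly 16**3 = 4096 hashes on average.
nonce, digest = solve_pow("example-challenge", 3)
print(nonce, digest[:12], verify_pow("example-challenge", nonce, 3))
```

The asymmetry is the whole point: solving costs thousands of hashes, verifying costs one. That is why PoW slows bulk scraping without stopping a client that is willing to pay the CPU.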

Fastly Bot Management

CDN-native · JS Proof-of-Work · Next-Gen WAF

Fastly runs bot detection at its CDN edge, layered on the Next-Gen WAF (the former Signal Sciences engine). It does the now-standard stack, JA3/JA4 TLS fingerprinting and HTTP header-order analysis before any HTML, IP reputation against datacenter and proxy ranges, and behavioural scoring, but its defining mechanic is the dynamic client challenge. A non-interactive JavaScript Proof-of-Work challenge tests that the client is a real JS-executing browser, escalating to an interactive CAPTCHA only on suspicion, and it can also issue Private Access Tokens. The tell is the cookie pair issued from the customer's own domain: _fs_ch_st_* marks a challenge starting and _fs_ch_cp_* marks it solved, with _fs_cd_cp_* appearing when advanced client-side detection is enabled. A solved challenge yields a token cookie (default one hour) that later requests must carry. Because the challenge is a JS PoW rather than a heavy obfuscated VM, a real JS-executing browser engine clears it where a plain HTTP client cannot, which puts it closer to Anubis in difficulty than to Kasada. The usual caveat holds: identify it from the _fs_ch_* cookies first, then match a real browser stack end to end (TLS, header order, and JS execution) rather than reaching for a heavier tool than the challenge needs.

_fs_ch_st_ · _fs_ch_cp_ · JS PoW · JA3/JA4 · Header order · PAT

Quick identification reference

What you see	Anti-bot	Key cookie/header	Detection method
"Pardon Our Interruption" page	Akamai block	`_abck`	Wappalyzer · response body
CF-Ray header · Turnstile iframe	Cloudflare challenge	`cf_clearance`	Response header `CF-Ray`
JSON with `datadome` key	DataDome block	`datadome`	Response header `x-datadome`
`_px3` or `_pxde` set	PerimeterX block	`_px3`	Cookie inspection
Silent 403 · no body	Kasada silent	`x-kpsdk-ct`	Response headers · `ips.js` in source
`reese84` or `TS` cookie	F5 Shape block	`reese84`	Cookie names · Shape JS reference
Anime mascot "weighing your soul" page	Anubis challenge	`techaro.lol-anubis-auth`	JS PoW challenge · Anubis HTML title
302 redirect to a virtual waiting room	Queue-It queue	Queue-it token cookie	X-Queueit-Connector header · queue-it.net redirect
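The reference table above can be turned into a first-pass classifier. The sketch below checks one response's headers, cookie names, and body against the markers listed in this section. It is a heuristic, not a detector: `CF-Ray` appears on every Cloudflare-proxied site whether or not a challenge fired, Kasada's script is renamed per deployment so the `ips.js` check is weak, and the `TS` cookie is left out because this section does not give its exact name. The function name and return shape are my own.

```python
def identify_antibot(headers, cookies, body=""):
    """Return every vendor whose markers appear in one HTTP response."""
    h = {k.lower(): v for k, v in headers.items()}
    names = set(cookies)
    found = []

    if "_abck" in names or "Pardon Our Interruption" in body:
        found.append("Akamai")
    if "cf-ray" in h or "cf_clearance" in names:
        found.append("Cloudflare")
    if "x-datadome" in h or "datadome" in names:
        found.append("DataDome")
    if names & {"_px3", "_pxde"}:
        found.append("PerimeterX")
    if any(k.startswith("x-kpsdk-") for k in h) or "ips.js" in body:
        found.append("Kasada")
    if "reese84" in names:
        found.append("F5 Shape")
    if "techaro.lol-anubis-auth" in names:
        found.append("Anubis")
    if any(n.startswith("_fs_ch_") for n in names):
        found.append("Fastly")
    if "x-queueit-connector" in h or "queue-it.net" in h.get("location", ""):
        found.append("Queue-It")
    return found


# Hypothetical response metadata, for illustration only.
print(identify_antibot({"CF-Ray": "8f2a-LHR"}, {"cf_clearance": "..."}))
print(identify_antibot({"x-kpsdk-ct": "token"}, {"_px3": "value"}))
```

Returning a list rather than a single name is deliberate: stacked protection (a CDN-level vendor in front of an application-level one) is common enough that the first match is not always the one blocking you.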

The through-line: Every anti-bot vendor is defending against the same thing: automated access that looks like a machine. The difference is which layer they weight most heavily. Akamai weights browser execution (sensor.js). Cloudflare weights TLS + global ML. DataDome weights per-site behaviour + IP. PerimeterX weights the network effect. Kasada weights PoW + JS integrity. F5 Shape weights token validity via a proprietary VM. The tools in the next section exist as direct countermeasures to each of these specific approaches.

One thing worth holding in mind before the tools: "99% of bots blocked" is the most comforting line in any security stack, and it is the one to be most suspicious of. Real bot protection is not a checkbox bundled next to caching and DDoS. It is a research field. Application-layer detection alone means shipping obfuscated challenge code inside a custom VM and staying ahead of the people who devirtualise it for a living, knowing which canvas and timing artifacts separate real Chrome from automation and which an attacker can cheaply fake. Below it sit entirely separate specialisms: TLS fingerprinting, HTTP/2 frame ordering, reading the OS and tunnelling from raw TCP. Real detection holds all of them current at once against an adversary who ships patches in hours, and the moment a vendor leans on a single layer, that is the layer that breaks. The practical consequence for a scraper is the inverse of the marketing: a great deal of deployed protection was built once and quietly stopped being touched. A toggle implies the problem is solved and static; it is neither. When a defence feels immovable, the useful question is how recently anyone actually worked on it.

Six walls. Now the tools. Every library below exists as a direct response to one of those six systems: curl_cffi was built because JA4 broke Python's TLS, Camoufox because CDP leaks signal automation, PatchRight because Kasada fingerprints JS patches. The arms race made this arsenal.

04 Field notes

How I approached real-world bypasses

The theory above tells you what anti-bots do. These notes tell you what I did when I hit them on a production job. Each is a full day or two of work distilled to: what I tried, why it failed, what finally worked, and the decision tree I'd use next time.

Read these as snapshots, not recipes. Each case is dated and reflects what worked on a specific target at a specific time. Anti-bot systems are probabilistic and adapt continuously, so the exact stack that worked for me may behave differently for you, on a different target, today. The durable value is the reasoning, not the recipe. See Legal & Reality for the full caveat.

Production bypass · ~1 day · No browser · 0 blocks in 500+ requests · 24 req/min sustained

Akamai v3 in 2026: cracking it without a browser

Field notes from a production scraping job. The story of what I tried, why each thing failed, and the exact approach that finally got clean 200 responses with zero browser overhead.

The core problem · _abck ~-1~ won't flip

Akamai's _abck cookie has two states. ~-1~ means unvalidated, full bot score, blocked. ~0~ means validated, trust granted. The cookie is set immediately on any page load, but only flips to ~0~ after sensor.js (a 512KB obfuscated fingerprinting script) executes, collects signals, and POSTs them to /_bm/data.

Signals that matter most: canvas fingerprint (pixel-level hash of GPU-rendered shapes and text), WebGL renderer (exact GPU model via WEBGL_debug_renderer_info), AudioContext (floating-point sine wave through a compressor node), Chrome extension probes (60 chrome-extension:// URLs fetched via fetch(), zero passing = instant bot score), mouse/scroll trajectory physics, and navigator properties cross-checked against the fingerprint.

The kicker: validation is multi-request. Trust accumulates across the session, not just on the first hit.
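When debugging, it helps to read the state out of the cookie programmatically instead of eyeballing it. The sketch below assumes the flag is the second `~`-delimited field of the cookie value. That layout is my assumption based on cookies I have looked at, so confirm it against your own target before relying on it. The sample values are made up.

```python
def abck_state(cookie_value):
    """Classify an _abck cookie value as validated, unvalidated, or unknown."""
    fields = cookie_value.split("~")
    if len(fields) < 2:
        return "unknown"
    flag = fields[1]  # assumed position of the validation flag
    if flag == "0":
        return "validated"
    if flag == "-1":
        return "unvalidated"
    return "unknown"


# Made-up cookie values, for illustration only.
print(abck_state("ABC123~-1~YAAQdata~-1~-1~-1"))
print(abck_state("ABC123~0~YAAQdata~-1~-1~-1"))
```

Parsing by field position rather than substring search matters here because later fields can also contain `-1`, which would make a naive `"~-1~" in value` check report a validated cookie as blocked.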

What failed (and why)

× Attempt 1 — Headless Chrome variants (undetected-chromedriver, Pydoll)

Standard starting point. Tried Chrome headless with undetected-chromedriver (uc) routed through a Comcast ISP proxy. Then switched to Pydoll — CDP automation without the usual webdriver flags. Both behaved identically. _abck set immediately as ~-1~, never flips. Waited 60 seconds, scrolled, dispatched JS mouse events. Nothing.

Why: headless Chrome has no GPU. canvas.getContext('webgl') returns null. Sensor.js sees zero WebGL context and assigns maximum bot score before the session even starts.

× Attempt 2 — SwiftShader software GPU

Tried --use-angle=swiftshader --use-gl=angle. WebGL works. Canvas renders. AudioContext works. Renderer: ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device (Subzero) (0x0000C0DE))).

Why: 0x0000C0DE is SwiftShader's device ID, in public lists of virtual GPU IDs. Akamai checks the unmasked renderer against a blocklist. SwiftShader is on it. The canvas hash it produces is also deterministic and known.

× Attempt 3 — Camoufox (Firefox with C++ patches)

Camoufox is excellent for Cloudflare. Real canvas hash, coherent device profile, no CDP artefacts because it uses Mozilla's Juggler protocol. geoip=True aligns WebRTC, DNS, and timezone with the proxy exit country. Set up a session, pointed it at the target, ran a few warm-up requests.

Why: Camoufox patches the browser, not the network. The TLS fingerprint it produces is Firefox, but on this particular Akamai deployment, even Firefox's real JA4 wasn't enough — the IP reputation scoring on my exit IP and the multi-request trust accumulation Akamai uses meant validation never completed. Camoufox does great work where Cloudflare scoring is the gate; here the gate was earlier and harder.

× Attempt 4 — JS prototype patching via CDP

Inject before page load via Page.addScriptToEvaluateOnNewDocument: patch WebGLRenderingContext.prototype.getParameter to return "Intel Iris OpenGL Engine". Patch navigator.platform to "MacIntel", deviceMemory to 8, battery API, chrome.runtime. Result: 2327-byte error page before sensor.js runs.

Why: Akamai's EdgeWorker fires at the TLS layer before HTML is served. JS injection patches don't affect the TLS handshake. The site also detects the timing signature left by Page.addScriptToEvaluateOnNewDocument. The prototype tampering itself is detectable via Function.prototype.toString().

× Attempt 5 — CDP UA override only

Use Network.setUserAgentOverride with full userAgentMetadata to spoof macOS Chrome 148. No JS injection. Same error page.

Why: the UA override changes HTTP headers and what navigator.userAgent returns, but not the TLS ClientHello fingerprint. Akamai's EdgeWorker sees the JA4 hash (still Linux Python automation) and blocks at the network layer before the page loads.

× Attempt 6 — Xvfb virtual display + non-headless Chrome

The hypothesis: headless detection is at the GPU level. Give Chrome a virtual display, it thinks it has a real screen. xvfb-run -a, Chrome launches in headless=False. Pages load fully (1.1MB real HTML, all images, category navigation).

Why: Xvfb has no GPU either. glxinfo shows Mesa software rasterizer. Canvas hash from Mesa llvmpipe is different from SwiftShader but still a known server software renderer, also flagged. _abck stays ~-1~ for 60+ seconds regardless of scrolling.

× Attempt 7 — Inject real Mac canvas hashes via CDP

If the server's canvas hash is flagged, why not inject a real Mac hash? Requires Page.addScriptToEvaluateOnNewDocument again. Same problem as Attempt 4.

Why: the injection itself is the signal, regardless of what values we inject. Akamai detects the patch through CDP timing artefacts and Function.prototype.toString() inspection.

The breakthrough · I was fixing the wrong layer

Every failed attempt above tried to fix the browser layer. The fundamental insight: most Akamai-protected sites never reach the deep sensor.js evaluation if the request looks like real Chrome at the network layer first.

Akamai scores in five layers:

Layer 1 · TLS JA4: fires before HTML is served

Layer 2 · HTTP/2 SETTINGS: HEADER_TABLE_SIZE, WINDOW_SIZE

Layer 3 · ALPN + header order: h2 vs h3, Chrome order

Layer 4 · sensor.js: canvas, WebGL, audio, extensions

Layer 5 · Behavioural: mouse Bezier, scroll timing

Python's requests, httpx, even curl_cffi with a wrong impersonation profile all fail at Layer 1. The JA4 hash doesn't match Chrome 148's actual ClientHello. Fix Layers 1-3 correctly and you often never reach Layer 4.

✓ What worked · a Go library reimplementing Chrome's TLS stack

A Go library, akamai-v3-sensor, reimplements Chrome's exact TLS stack at the C level: cipher suite order, GREASE values, extension ordering, ALPN, HTTP/2 SETTINGS frames, HTTP/3 QUIC parameters. The JA4 fingerprint it produces is indistinguishable from real Chrome 148 because it is Chrome 148's cipher suite, implemented in Go.

// One session, one proxy, one request
ctx := context.Background()
s := sensor.NewSession("chrome-148",
    sensor.WithSessionProxy("http://user:pass@comcast-ip:port"),
    sensor.WithSessionTimeout(30*time.Second),
)
resp, err := s.Get(ctx, "https://target-site.com/")
if err != nil {
    log.Fatal(err) // handshake or proxy failure
}
// resp: Status 200, Protocol h2, _abck: ~0~ (validated)

// Then GraphQL directly on the same session
// (req and payload are built elsewhere)
gql, err := s.DoWithBody(ctx, req, bytes.NewReader(payload))
if err != nil {
    log.Fatal(err)
}
// gql: Status 200, 30KB product data, zero blocks

No browser process. No GPU. No canvas hash. No sensor.js execution. Just a TLS handshake that matches Chrome 148 exactly because it uses Chrome 148's cipher suites.

Production architecture

Scrapy spider
  → GoProxyMiddleware (urllib, ~35ms round trip)
      → Go HTTP server :8765 (4-session pool)
          → Go TLS library sessions
              → ISP proxy (Comcast AS7015, static residential)
                  → Target site

Session rotation logic: 206 or GenericError triggers the next session in the pool. Three errors on one session triggers a background re-warm (new TLS handshake, new session state). All 4 sessions blocked returns 503; Python middleware waits 5s and retries up to 3× before falling back to curl_cffi.
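The rotation rules are easier to reason about as code. This is a Python model of the logic only, not the Go server itself: `send` is a caller-supplied function standing in for a real request, a raised exception stands in for GenericError, and a "re-warm" is reduced to resetting a counter where the real thing does a new TLS handshake in the background. Class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    name: str
    errors: int = 0
    rewarms: int = 0


class SessionPool:
    def __init__(self, size=4, max_errors=3):
        self.sessions = [Session(f"s{i}") for i in range(size)]
        self.max_errors = max_errors
        self.cursor = 0

    def request(self, url, send):
        """Try each session once, starting at the cursor; 503 if all are blocked."""
        for _ in range(len(self.sessions)):
            session = self.sessions[self.cursor]
            try:
                status = send(session, url)
            except Exception:
                status = None  # stands in for the GenericError case
            if status is not None and status != 206:
                return status  # cursor stays on the session that worked
            session.errors += 1
            if session.errors >= self.max_errors:
                session.errors = 0  # placeholder for the background re-warm
                session.rewarms += 1
            self.cursor = (self.cursor + 1) % len(self.sessions)
        return 503


# Stub transport: the first two sessions are blocked, the rest pass.
pool = SessionPool(size=4)
print(pool.request("https://target-site.com/", lambda s, u: 206 if s.name in ("s0", "s1") else 200))
print(pool.cursor)
```

Keeping the cursor on the last working session is what preserves per-session trust: requests keep flowing through one validated session until it fails, rather than round-robining across all four.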

24 req/min sustained

0 blocks in 500+ requests

0 browser processes

4 session pool

Key takeaways

Canvas fingerprinting cannot be fixed at the JS layer. Patching toDataURL() or getParameter() in JavaScript is detectable via Function.prototype.toString(). The only real fix sits below JavaScript: a real GPU, C++-level patches, or a library that bypasses the browser entirely.

SwiftShader's 0x0000C0DE device ID is permanently flagged. Don't bother. It's in Akamai's blocklist and the deterministic canvas hash is also known. Same for Mesa llvmpipe.

Page.addScriptToEvaluateOnNewDocument is itself a signal. Akamai's EdgeWorker detects the timing gap left by CDP's Runtime.enable command. The injection runs, but the metadata around it is visible.

The TLS layer is the one that matters first. Fix JA4, HTTP/2, ALPN, and header order before worrying about canvas or WebGL. Most deployments never even reach sensor.js if the TLS fingerprint doesn't match.

A clean ISP proxy IP matters. Comcast AS7015 static residential is what worked. Datacenter IPs fail at the IP reputation layer regardless of TLS quality. Rotating residential proxies break session trust accumulation, because Akamai scores per session, not per request.

Practical decision tree · Akamai in 2026

01 Find the mobile / GraphQL API first. Often zero anti-bot. Same data, no sensor.js. Look for /graphql, /api/v1/, mobile traffic intercepted via HTTP Toolkit.

02 curl_cffi chrome131 + ISP residential proxy. Works on ~60% of Akamai targets where sensor.js scoring is light.

03 Go TLS library (akamai-v3-sensor) + ISP proxy. For targets with heavier sensor.js where curl_cffi's impersonation doesn't pass JA4 at the EdgeWorker.

04 CloakBrowser (49 C++ patches, loads real extensions). For targets requiring a real canvas hash and passing the 60-extension probe. The kill-switch for sites where TLS spoofing alone is not enough.

05 Managed API (Scrapfly, Bright Data). For Bot Manager Premier targets running pixel challenges. Engineering cost exceeds managed API cost.
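The decision tree is an escalation ladder: try the cheapest tier, move up only when it is blocked. A minimal sketch of that control flow follows. The tier names mirror the steps above, but the fetch functions are stubs; wiring each tier to a real client is left out on purpose, and the function name is my own.

```python
def escalate(url, tiers, ok=lambda status: status == 200):
    """Try (name, fetch) tiers in order; return (name, status, attempts)."""
    attempts = []
    for name, fetch in tiers:
        try:
            status = fetch(url)
        except Exception:
            status = None  # treat transport errors like a block
        attempts.append((name, status))
        if status is not None and ok(status):
            return name, status, attempts
    return None, None, attempts


# Stub tiers for illustration: the first two are blocked, the third passes.
tiers = [
    ("mobile/GraphQL API", lambda url: 404),
    ("curl_cffi + ISP proxy", lambda url: 403),
    ("Go TLS library + ISP proxy", lambda url: 200),
    ("CloakBrowser", lambda url: 200),
]
print(escalate("https://target-site.com/", tiers))
```

Recording every attempt, not just the winner, is what lets you see per target which tier is actually doing the work and stop paying for heavier ones you never reach.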

05 The arsenal

Every tool built to fight every wall we just described.

Now that you understand the detection stack and the six anti-bot vendors, every library below makes sense in context. curl_cffi exists because of JA4. Camoufox exists because of CDP leaks. PatchRight exists because of Kasada's toString() inspection. The arsenal wasn't built randomly: each tool is a direct countermeasure to a specific detection innovation.

One distinction decides your tool choice more than any other: which layer does your target actually gate on? There are three separable surfaces, and most stealth tooling only solves the first two. TLS fingerprinting keys on the handshake; a Firefox-shaped TLS via Camoufox or a Chrome-shaped TLS via curl_cffi both answer it. JavaScript-layer detection reads navigator properties after the page loads, where a current unpatched Chromium already passes most panels. The third surface is the one that quietly defeats expensive tooling: automation-protocol fingerprinting, which detects how the browser is being driven rather than what it claims to be.

This is the layer that does not care how good your fingerprint patches are. Anything driving Chrome through Playwright leaves a recognisable shape in the control protocol at startup: the Runtime.enable and Target.setAutoAttach handshake sequence. A fork can rewrite navigator properties all day without touching it. The tools that clear this layer are the ones that remove the standard automation framework from the control plane entirely: nodriver drives system Chrome over a direct CDP connection with no Playwright shim, which is why it walks through Cloudflare Turnstile gates that every patched Playwright fork fails. The practical lesson from running these side by side: identify the gate's layer first, then pick the cheapest tool that covers it. A twenty-line curl_cffi wrapper can match a 130MB patched Chromium fork on a TLS-and-JS target, and lose entirely on an automation-protocol target where only a non-Playwright control plane gets through. Patches are not the lever you think they are; the control plane is.

Before the table · pick the right language first

Scraping is no longer Python-only.

Python still dominates the open-source ecosystem (Scrapy, curl_cffi, Camoufox), but the hardest 10% of targets in 2026 reach for Go, TypeScript, or Rust. Here's when each language earns its place, and why mixing them in one pipeline is the production-grade move.

Python Default

Largest ecosystem. Scrapy + curl_cffi + Camoufox cover 80% of targets. Best for data engineers already running pandas, Airflow, dbt downstream.

Use when: default choice, fast prototyping, ML pipelines, when team is Python-native

Key libs: Scrapy, curl_cffi, Camoufox, PatchRight, Crawlee-python, scrapy-stealth

Go TLS & concurrency

Closest language to OS-level TLS control. Akamai-grade JA4 spoofing is a Go specialty. Native goroutines beat asyncio at 10,000+ concurrent connections. Single binary deploys.

Use when: Akamai v3, F5 Shape, TLS fingerprint precision, 10K+ concurrent crawls, edge deployments

Key libs: akamai-v3-sensor, tls-client, colly, surf, rod, cycletls

TypeScript / Node.js

First-class Playwright and Puppeteer. Better browser automation primitives than Python ports. Strongest ecosystem for AI agents and Chrome extension work. Apify's Crawlee was Node-first for a reason.

Use when: browser automation is the bottleneck, AI agents, Computer Use Agents, working with Chrome DevTools deeply, full-stack JS team

Key libs: Crawlee, Playwright (TS), Puppeteer, Stagehand, Browser Use, ScrapingBee SDK, got-scraping

Rust Emerging

Lower-level than Go, even more control over TLS internals and memory. Used where you need both performance and Chrome-level fingerprint precision. 67% fewer tokens than equivalent Python in some benchmarks. Steeper learning curve.

Use when: webclaw, custom TLS work, MCP servers, performance is critical, you already have Rust on the team

Key libs: webclaw, rquest, rnet, scraper, fantoccini

The pragmatic move in 2026:

Run Python for orchestration (Scrapy is still the best framework for crawl logic, queues, deduplication, and ML downstream). Drop down to Go via a sidecar HTTP service only for the requests that need true Chrome JA4 (Akamai v3, F5 Shape). Use Node + Crawlee or Playwright (TS) when the work is genuinely browser-automation-heavy, especially for AI agents. The Akamai case study above shows this exact pattern: Scrapy spider in Python, calling a Go server via urllib for the protected requests only.
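On the Python side, the sidecar pattern comes down to one small wrapper: call the Go service, wait and retry on 503, then fall back. A sketch with the standard library follows. Only port 8765, the 5-second wait, and the three retries come from the case study; the `/fetch?url=` endpoint shape is hypothetical, and `fallback` is whatever second client you choose (curl_cffi in my setup).

```python
import time
import urllib.error
import urllib.parse
import urllib.request

SIDECAR = "http://127.0.0.1:8765/fetch"  # hypothetical path; only the port is from the case study


def sidecar_fetch(url, timeout=30):
    """Ask the Go sidecar to fetch `url`; return (status, body)."""
    query = urllib.parse.urlencode({"url": url})
    try:
        with urllib.request.urlopen(f"{SIDECAR}?{query}", timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as exc:
        return exc.code, exc.read()


def fetch_with_fallback(url, primary, fallback, retries=3, wait=5.0, sleep=time.sleep):
    """Call primary; on 503 wait and retry up to `retries` times, then use fallback."""
    for attempt in range(retries + 1):
        status, body = primary(url)
        if status != 503:
            return status, body
        if attempt < retries:
            sleep(wait)
    return fallback(url)


# Stubbed demo (no network): the sidecar reports 503 once, then succeeds.
responses = iter([(503, b""), (200, b"ok")])
print(fetch_with_fallback("https://target-site.com/", lambda u: next(responses),
                          lambda u: (200, b"fallback"), sleep=lambda s: None))
```

In production, `primary` would be `sidecar_fetch`. Passing `primary`, `fallback`, and `sleep` in as arguments keeps the retry policy testable without a running sidecar, and lets the same wrapper sit inside a Scrapy downloader middleware.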

Master comparison table, all 75 libraries & tools

Library	Type	Lang	TLS spoof	Handles	Stars
curl_cffi ⚡	HTTP	Python	Chrome JA4+	Akamai, DataDome	–
Under the hood: libcurl C library with custom TLS patches. Emits the exact Chrome/Safari/Firefox TLS ClientHello at the C level: cipher suites, extensions, ALPN, and GREASE all match real browsers. ✓ Pros: Fastest HTTP option (pure HTTP speed, no browser overhead); Confirmed DataDome + Akamai bypass in 2026; Asyncio support via AsyncSession; Simple requests-compatible API. ✗ Cons: No JavaScript execution, so useless for JS-rendered pages; Cannot solve CAPTCHA or Turnstile challenges; TLS fingerprint only, no behaviour or canvas spoofing.
Scrapling ⚡	HTTP	Python	Chrome TLS	Cloudflare Turnstile	38k
Under the hood: Wraps curl_cffi for stealth HTTP and integrates Camoufox for browser mode. StealthyFetcher uses a real patched Firefox when needed. ✓ Pros: StealthyFetcher solves Cloudflare Turnstile natively; Async spider (v0.4), pause/resume, per-domain throttling; Dual-mode: HTTP for speed, browser for hard targets; Active development, 38K stars. ✗ Cons: Higher complexity than plain curl_cffi; Browser mode adds Camoufox overhead when triggered.
webclaw ⚡	HTTP	Rust	Chrome TLS	Medium targets	–
Under the hood: Rust HTTP client with TLS fingerprint spoofing. Emits browser TLS signatures from Rust; fast and low-memory. ✓ Pros: Rust speed, very low CPU/memory overhead; TLS fingerprinting at Rust level; Good for high-volume HTTP scraping. ✗ Cons: Rust only, no Python API; Less widespread adoption; No JS rendering.
httpx ⚡	HTTP	Python	None	Unprotected only	–
Under the hood: Modern Python HTTP library with async support and HTTP/2. ✓ Pros: Async + sync in one library; HTTP/2 support, unlike requests; Type hints, modern API. ✗ Cons: TLS fingerprint still Python-default, detectable; Not as stealthy as curl_cffi without patching.
requests ⚡	HTTP	Python	None	Unprotected only	52k
Under the hood: Pure Python HTTP library. Sends HTTP/1.1 requests with standard Python TLS. ✓ Pros: Simple API, universally known; Synchronous, easy to debug. ✗ Cons: TLS fingerprint is instantly detectable (Python urllib3); No async, slow for concurrent scraping; No anti-bot capability.
tls-client ⚡	HTTP	Go/Py	Chrome/Firefox TLS	Akamai, DataDome	–
Under the hood: Go/Python wrapper around a Go TLS client that mimics browser fingerprints. Predecessor to cycle-tls. ✓ Pros: Python bindings available; Bypasses JA3/JA4 fingerprinting; Lighter than curl_cffi. ✗ Cons: Less actively maintained than curl_cffi; No JS rendering; Binary dependency.
Playwright 🌐	Browser	Py/JS	CDP (detectable)	Medium (CDP leaks)	68k
Under the hood: Chrome DevTools Protocol (CDP) for Chromium; Firefox and WebKit are driven through Playwright's own patched browser builds. Microsoft-maintained. ✓ Pros: Best JS execution support, renders any SPA; 68K stars, massive ecosystem and docs; Cross-browser: Chrome, Firefox, Safari (WebKit); Screenshot, PDF, network intercept built-in. ✗ Cons: CDP is detectable, needs a hardened wrapper (PatchRight/Camoufox); Heavy: launches a full browser process per session; Slow vs HTTP, ~10× more memory per concurrent task.
Camoufox 🌐	Browser	Python	C++ Firefox Juggler	Cloudflare 100%, Akamai	–
Under the hood: Forked Firefox with C++ binary patches to Juggler protocol (below CDP). Patches navigator, canvas, WebGL, fonts, window.chrome at binary level. ✓ Pros: 100% Cloudflare pass rate as of March 2026; geoip=True aligns all 5 identity vectors automatically; Below-CDP, invisible to JS-level detection; Async context manager, drop-in Playwright replacement. ✗ Cons: Firefox only, no Chrome/Safari; Heavier than curl_cffi; Occasional site-specific quirks with Firefox fingerprint.
⚠ CVE-2026-6770: Process-Level Fingerprint Leak (patched in Firefox 150). Firefox below v150 returned IndexedDB database names in hash table iteration order rather than sorted order. Because the hash table is shared across all origins within the same browser process, the ordering became a stable, high-entropy process-lifetime fingerprint, consistent across tabs, sites, private windows, and even Tor Browser's "New Identity" resets. Anti-bot systems could use this to correlate multiple scraping identities running in the same browser process, regardless of proxy rotation or profile switching. Fix: Use Camoufox built on Firefox 150+ (patched April 2026). Verify your version. Scraper lesson: Always isolate identities at the process level, not just the profile level. Multiple sessions in one browser process can be correlated through memory-state artifacts like this even when fingerprint patching is otherwise perfect. Ref: CVE-2026-6770 · mfsa2026-30 · Fixed Firefox 150 / Tor Browser 15.0.10
CloakBrowser 🌐	Browser	Python	49 C++ patches	Akamai, reCAPTCHA v3 0.9	–
Under the hood: 49+ C++ binary patches to Chromium itself. Patches webdriver, chrome object, plugins, permissions, WebGL, Canvas, AudioContext, and extension probe responses at the C++ level, not JavaScript. Repo: github.com/CloakHQ/CloakBrowser. ✓ Pros: reCAPTCHA v3 score 0.9, the highest of any tool I have tested; Passes Akamai's 60-URL extension probe (loads real 1Password / Bitwarden / LastPass profiles); Real extension fingerprint database built-in, no manual setup; C++ level patches survive `Function.prototype.toString()` inspection (the kill-switch for JS stealth tools like puppeteer-stealth and patchright); Solves the hard problems Camoufox and PatchRight cannot: canvas hash at the C++ rendering layer, AudioContext at the audio pipeline layer, real extension probe responses; Now open source on GitHub, active development. ✗ Cons: Higher resource cost than HTTP-only tools (200MB+ per instance); Chromium only, no Firefox variant; Newer than Camoufox, smaller community; Overkill for sites that curl_cffi + ISP proxy already handle.
PatchRight 🌐	Browser	Python	Py source patches	Kasada, Cloudflare	–
Under the hood: Patches Playwright Python source files at install time. Removes CDP signatures, webdriver property, and stealth tells from the JS layer. ✓ Pros: Open source, free, Kasada bypass confirmed; Drop-in Playwright replacement, zero API changes; Patches JS layer without C++ recompilation. ✗ Cons: JS-level patches only, so a determined adversary can detect it at binary level; Less robust than Camoufox on Cloudflare 5-second challenge; Requires Playwright to be installed first.
Puppeteer 🌐	Browser	Node	CDP (detectable)	Medium targets	89k
Under the hood: Node.js CDP driver for Chromium. Google-maintained. The original headless browser automation library. ✓ Pros: 89K stars, largest ecosystem; Native Google product, Chromium compatibility guaranteed; Good for CI/CD screenshot and PDF generation. ✗ Cons: CDP is easily detectable (webdriver=true, window.chrome absent); Node.js only, no Python; No anti-bot stealth built-in.
Selenium 🌐	Browser	Multi	webdriver=true	Weak (legacy)	29k
🌐 Browser Under the hood: WebDriver protocol (W3C standard). Drives any browser via standardised JSON protocol. The original browser automation framework. ✓ Pros Multi-language: Python, Java, C#, Ruby, JS Supports all browsers including IE and Safari Huge ecosystem, well-documented ✗ Cons navigator.webdriver=true is trivially detectable Slowest option, WebDriver adds round-trip latency Requires ChromeDriver binary management
SeleniumBase UC 🌐	Browser	Python	UC removes WD flag	Kasada, general stealth	10k
🌐 Browser Under the hood: SeleniumBase with undetected-chromedriver mode. Patches Chrome binary to remove webdriver flag and CDP signatures. ✓ Pros UC mode removes webdriver=true flag Passes basic Cloudflare and PerimeterX Built-in test framework, good for QA teams ✗ Cons Not as strong as Camoufox/PatchRight on hard targets Chrome binary patches can break on updates Slower than Playwright equivalent
Selenium-Driverless 🌐	Browser	Python	CDP no WebDriver	Medium targets	–
🌐 Browser Under the hood: Direct CDP connection without ChromeDriver binary, no webdriver flag set. Async Python API. ✓ Pros No ChromeDriver binary needed No webdriver=true flag Async Python native ✗ Cons Newer, less battle-tested than nodriver Chrome only Some CDP signatures still detectable
nodriver 🌐	Browser	Python	Raw CDP async	Medium targets	–
🌐 Browser Under the hood: Controls Chrome via its internal DevTools socket without using CDP's standard automation flag. Chrome doesn't know it's being driven. ✓ Pros Chrome does not set automation flags Passes many sites that detect standard CDP Lightweight, lower overhead than full Playwright ✗ Cons Relatively new, less battle-tested Python only Some sites still detect via other JS signals
pydoll 🌐	Browser	Python	Async CDP	Medium targets	–
🌐 Browser Under the hood: Pure Python browser automation using Chrome DevTools Protocol directly. No external driver. ✓ Pros No ChromeDriver dependency Fast startup, no driver process Pure Python, easy to install ✗ Cons CDP still potentially detectable Less mature than Playwright Smaller community
Botright 🌐	Browser	Python	CAPTCHA solving	CAPTCHA targets	–
🌐 Browser Under the hood: Playwright wrapper focused on CAPTCHA solving and stealth. Uses AI to solve CAPTCHAs during automation. ✓ Pros Auto-solves reCAPTCHA and hCAPTCHA inline Stealth patches on top of Playwright Good for CAPTCHA-heavy targets ✗ Cons Heavier than raw Playwright CAPTCHA AI may be rate-limited Less control over fingerprinting
Botasaurus 🌐	Browser	Python	Gaussian mouse	DataDome behaviour	–
🌐 Browser Under the hood: Playwright wrapper that adds Gaussian mouse movement, realistic typing, scroll physics, and session management. ✓ Pros Gaussian mouse curves, passes behavioural ML checks Handles DataDome behavioural scoring Session persistence and rotating profiles built-in ✗ Cons Browser-based overhead Overkill for targets without behavioural analysis Less control than raw Playwright
rayobrowse 🌐	Browser	Py/Docker	Real device FP DB	Hard targets	–
🌐 Browser Under the hood: Docker-based stealth Chromium browser from Rayobyte. C++ level patches (not JS-level), exposed via CDP so Playwright/Puppeteer/Selenium can connect natively. Self-hosted = free and unlimited; managed Cloud version available. ✓ Pros: Free and unlimited self-hosted (Docker), Cloud version managed · C++ level patches survive Function.toString() inspection · Coherent device profile: UA, WebGL, Canvas, AudioContext, fonts all match · Native CDP, drop-in for Playwright/Puppeteer/Selenium · Used by Rayobyte to scrape millions of pages/day in production ✗ Cons: Still in beta, results vary by target site · Windows + Android profiles strongest, macOS/Linux less mature · Closed source (license restricts certain organizations) · Canvas/WebGL FP coverage still evolving
undetected-chromedriver 🌐	Browser	Python	Removes WD flag	Medium targets	5k
🌐 Browser Under the hood: Patches ChromeDriver binary to remove webdriver=true and CDP automation flags at binary level. ✓ Pros Removes most obvious webdriver signals Simple: just replace webdriver.Chrome with uc.Chrome ✗ Cons Chrome binary patches break on updates frequently Not as robust as Camoufox on modern Cloudflare Maintenance has slowed
⭐ Scrapy ⚡	Framework	Python	Via curl_cffi mw	Medium (with middleware)	52k
⚡ HTTP Under the hood: Twisted-based async Python framework. Pure HTTP, sends requests, receives responses, parses with XPath/CSS. No browser. ✓ Pros 52K stars, production standard for HTTP scraping Massive ecosystem: scrapy-redis, scrapy-playwright, scrapyd Async by default, hundreds of concurrent requests Mature: pipelines, middlewares, extensions all built-in ✗ Cons No JS rendering by default (need playwright middleware) Pure HTTP, detectable by TLS fingerprint without curl_cffi middleware Steeper learning curve than requests
Crawlee 🌐	Framework	Node/Py	Playwright-based	Medium targets	15k
🌐 Browser Under the hood: Apify's unified Node.js framework. Wraps both HTTP (got-scraping) and Playwright/Puppeteer. Handles retries, deduplication, storage. ✓ Pros Dual HTTP+browser mode in one framework 15K stars, actively maintained by Apify Built-in dataset storage, request queue, proxy rotation ✗ Cons Node.js primary (Python port is newer, less mature) More opinionated than Scrapy, harder to customise Heavier dependency footprint
scrapy-camoufox ⚡	Framework	Python	Camoufox integration	Hard targets	–
⚡ HTTP Under the hood: Scrapy middleware that routes requests through Camoufox browser for stealth. Best of Scrapy + Camoufox. ✓ Pros Scrapy pipeline management + Camoufox stealth Per-request browser decision (HTTP vs browser) Good for mixed protection targets ✗ Cons Camoufox overhead on browser requests Requires both Scrapy and Camoufox installed
scrapy-nodriver ⚡	Framework	Python	nodriver integration	Medium targets	–
⚡ HTTP Under the hood: Scrapy middleware using nodriver for browser requests, Chrome without CDP flags. ✓ Pros Scrapy framework + Chrome without automation flags Good for Cloudflare-protected targets Use Scrapy architecture you know ✗ Cons nodriver overhead per browser request Less control than raw nodriver
scrapy-stealth ⚡	Framework	Python	Browser TLS + HTTP/2	Cloudflare, Akamai	v0.4 (2026)
⚡ HTTP Under the hood: Pluggable Scrapy DOWNLOADER_MIDDLEWARE with three drivers: `basic` and `turbo` (TLS fingerprint + HTTP/2 impersonation, no browser), and `browser` (real Chrome via CDP for JS-heavy targets). Per-request engine switching via `request.meta["stealth"]`. Repo: github.com/fawadss1/scrapy-stealth. Author Fawad ships frequent updates. ✓ Pros: Built-in TLS fingerprint spoofing, which scrapy-playwright/scrapy-splash/scrapy-selenium do not have · Per-request engine switching: keep light HTTP for easy URLs, browser only for protected ones · Built-in proxy + fingerprint rotation (no separate middleware needed) · Native Cloudflare and Akamai detection via status + body keyword checks · Browser profiles like `chrome_147`, `safari_ios_18_1_1` kept current · MIT license, active development (v0.4 May 2026) ✗ Cons: Project is new with limited GitHub adoption (low star count) · Less battle-tested than scrapy-playwright in production at scale · Browser driver takes 5-15s per page, use selectively for JS-protected URLs only · Requires Python 3.11+ and Scrapy 2.15+
Firecrawl ⚡	AI	API	FIRE-1 engine	Hard via managed	111k
⚡ HTTP Under the hood: API service that converts any URL to clean Markdown or structured JSON for LLM consumption. FIRE-1 agent for multi-page crawls. ✓ Pros 111K stars, most popular LLM scraping tool Outputs clean Markdown, 67% fewer tokens for LLMs MCP server for Claude/Cursor/LangChain Handles JS rendering and auth flows ✗ Cons API cost at scale Less control over request details vs raw scraping Data goes through third-party servers
Crawl4AI 🌐	AI	Python	Playwright-based	Medium targets	60k
🌐 Browser Under the hood: Local Playwright wrapper optimised for LLM output. Runs locally, converts pages to clean Markdown with BM25 relevance filtering. ✓ Pros Fully local, no API cost BM25 filter reduces LLM context bloat LLM extraction schema definition MIT license, commercial friendly ✗ Cons Playwright overhead per page Less anti-bot bypass than Camoufox No managed infrastructure
ScrapeGraphAI ⚡	AI	Python	NL graph pipeline	Light protection	18k
⚡ HTTP Under the hood: LLM-powered extraction that builds a graph pipeline from a natural language prompt. Local or API. ✓ Pros Natural language extraction definition Open source, self-hostable Graph pipeline handles multi-step extractions ✗ Cons LLM inference cost/latency per extraction Less deterministic than CSS/XPath selectors Newer, less battle-tested at scale
Jina Reader API ⚡	AI	API	Built-in rendering	Medium targets	–
⚡ HTTP Under the hood: REST API: prefix r.jina.ai/ to any URL to get clean Markdown back. Zero setup. ✓ Pros Simplest possible API, one URL prefix Good JS rendering Free tier available ✗ Cons Data goes through Jina servers Less control than local scraping Rate limited on free tier
Steel 🌐	AI	API	Docker browser	Medium targets	–
🌐 Browser Under the hood: Self-hosted browser API with MCP server. AI agents call it as a tool to browse the web. ✓ Pros Self-hosted, data stays local MCP server for AI agent integration Docker deployment ✗ Cons Newer product, smaller community Setup overhead vs managed services
Bright Data ⚡	Managed	API	Full enterprise stack	All incl. F5 Shape	–
⚡ HTTP Under the hood: 72M+ IP network + scraping API. Managed infrastructure handles anti-bot, JS rendering, proxy rotation. ✓ Pros 98.44% success rate, highest benchmark Covers F5 Shape (only managed service that does) Residential + ISP + datacenter + mobile IPs Dataset marketplace for pre-scraped data ✗ Cons Most expensive option Data goes through third-party Overkill for simple targets
Zyte ⚡	Managed	API	Full stack	All targets	–
⚡ HTTP Under the hood: Scrapy company's managed scraping platform. Zyte API + AutoExtract for structured data. ✓ Pros #1 Proxyway benchmark 2025 AutoExtract returns structured product/article data Built by the Scrapy maintainers Smart proxy rotation built-in ✗ Cons Expensive at scale AutoExtract less flexible than custom extraction
Apify ⚡	Managed	API	10K+ Actors	Medium-hard	–
⚡ HTTP Under the hood: 10,000+ pre-built Actors on serverless cloud. Crawlee at core. MCP server for AI agents. ✓ Pros Biggest marketplace of pre-built scrapers MCP server: AI agents call Actors as tools Free $5/mo credit for casual use Crawlee open source available locally ✗ Cons CU pricing can escalate Data goes through Apify cloud Less control over anti-bot approach in Actors
ScrapingBee ⚡	Managed	API	Managed rendering	Medium targets	–
⚡ HTTP Under the hood: Managed scraping API. Handles JS rendering, CAPTCHA, proxies via simple REST call. ✓ Pros Dead simple: one API call, get HTML back Free tier available Handles most modern JS rendering ✗ Cons Less anti-bot strength than Zyte or Bright Data Per-call pricing Less control over request details
SerpAPI ⚡	Managed	API	SERP JSON API	Search engine data	–
⚡ HTTP Under the hood: Managed API that abstracts Google, Bing, Baidu, Yandex, Yahoo, DuckDuckGo and 80+ other engines behind a single REST endpoint. Returns fully parsed, normalised JSON (organic results, ads, featured snippets, knowledge graphs, local packs, shopping, images, news) without you touching a proxy or a headless browser. ✓ Pros: 80+ search engines including Google, Bing, Baidu, Yandex, Yahoo, DuckDuckGo, Google Maps, Google Shopping, Google Scholar · Richest parsed output: 15+ SERP element types including AI Overviews, PAA, video carousels, local pack · Advanced geo-targeting and device simulation per request · Free tier: 100 searches/month, no credit card required · Well-documented, widely adopted, largest mindshare in SERP API category ✗ Cons: Premium pricing: $75/mo for 5K, $150/mo for 15K searches, expensive vs alternatives like Scrape.do (21x cheaper per request) · SERP-specific only, not a general-purpose scraping API · Data routed through SerpAPI servers
ScrapeBadger ⚡	Managed	API	Smart billing + AI extract	Cloudflare, DataDome, hard targets	–
⚡ HTTP Under the hood: Newer managed scraping API built natively for modern anti-bot stacks. Key differentiator: smart billing. If you enable JS rendering and anti-bot bypass but the target doesn't need them, ScrapeBadger auto-downgrades the request and charges you less. Also ships an MCP server for Twitter/X scraping (profiles, tweets, trends) for AI agent workflows. ✓ Pros: Pay-only-for-success model, failed requests don't cost you · Smart auto-downgrade: enables anti-bot features only when needed, reducing cost automatically · Native Cloudflare Turnstile and DataDome bypass, 99%+ reported success on Zillow, Amazon, LinkedIn · AI extraction mode: pass a plain-English prompt, get structured JSON back without writing selectors · MCP server for Twitter/X: profiles, tweets, trends for AI agent pipelines ✗ Cons: Newer entrant, less battle-tested at scale than Bright Data or Zyte · Smaller community and ecosystem vs established providers · Data routed through third-party servers
Oxylabs ⚡	Managed	API	OxyCopilot AI	Hard targets	–
⚡ HTTP Under the hood: 102M+ IP network with OxyCopilot AI extraction and scraper APIs. ✓ Pros Largest IP pool (102M+) OxyCopilot: AI-powered extraction Strong residential + datacenter options ✗ Cons Enterprise pricing Data through third-party
Browserbase 🌐	Managed	API	Managed browser	Hard targets	–
🌐 Browser Under the hood: Managed Playwright cloud. Run Playwright scripts remotely without managing browser infrastructure. ✓ Pros No browser infra to manage Scales automatically Playwright API unchanged, zero code changes ✗ Cons 42% success rate on anti-bot benchmark (vs 81% Browser Use) Per-session pricing Less stealth than self-hosted Camoufox
chompjs ⚡	Parser	Python	N/A	Parser only	–
⚡ HTTP Under the hood: Python library to parse JavaScript objects embedded in HTML pages. Converts JS literals to Python dicts. ✓ Pros Handles malformed JSON that json.loads rejects Extracts __NEXT_DATA__ and embedded JS objects Zero dependencies ✗ Cons Parsing only, not a scraping framework Narrow use case
Parsel ⚡	Parser	Python	N/A	Parser only	–
⚡ HTTP Under the hood: Scrapy's HTML/XML parser library. XPath and CSS selectors with a clean Python API. ✓ Pros XPath + CSS in one library Used inside Scrapy, familiar API Faster than BeautifulSoup for selection ✗ Cons Parsing only, no HTTP requests Less beginner-friendly than BS4
BeautifulSoup4 ⚡	Parser	Python	N/A	Parser only	10k
⚡ HTTP Under the hood: Python HTML/XML parser. Wraps lxml or html.parser. Builds a parse tree from raw HTML strings. ✓ Pros Simple, readable API, beginner-friendly Works on any HTML string regardless of source No network requests, pure parsing ✗ Cons Not a scraping framework, needs requests/httpx separately Slow on large documents vs selectolax/lxml No anti-bot capability whatsoever
mitmproxy ⚡	RE Tool	Python	N/A	RE / intercept	37k
⚡ HTTP Under the hood: Python-based HTTPS proxy. Intercepts, inspects, and modifies HTTP/HTTPS traffic between client and server. ✓ Pros Full request/response visibility and modification Script intercepted traffic with Python Good for understanding anti-bot request patterns ✗ Cons Requires certificate trust on device SSL pinning blocks it on hardened apps For analysis/RE, not production scraping
HTTPToolkit ⚡	RE Tool	Any	N/A	Mobile API intercept	–
⚡ HTTP Under the hood: HTTPS intercepting proxy for development and mobile API discovery. Open source. ✓ Pros Intercepts HTTPS without SSL pinning (with rooted device) Beautiful UI for inspecting requests Works with Android emulators via ADB ✗ Cons For analysis only, not for production scraping Requires rooted device for mobile apps
Frida ⚡	RE Tool	Py/JS	N/A	SSL hooks	–
⚡ HTTP Under the hood: Dynamic instrumentation toolkit. Injects JavaScript into running processes. Used to hook native functions and bypass SSL pinning. ✓ Pros Bypass SSL pinning in any Android/iOS app Hook any native function at runtime Essential for mobile app API extraction ✗ Cons Requires rooted/jailbroken device Complex setup, not for beginners App-specific scripts needed per target
rebrowser-patches 🌐	Browser	Python	Chrome source patches	Medium targets	–
🌐 Browser Under the hood: JavaScript patches injected into Playwright/Puppeteer pages to mask automation signals. ✓ Pros Removes navigator.webdriver and CDP signals at JS level Works with any Playwright version Easy to integrate ✗ Cons JS-level only, binary signals still present Less robust than C++ patches
cycle-tls ⚡	HTTP	Go/JS	Chrome/Firefox TLS	Akamai, DataDome	–
⚡ HTTP Under the hood: Node.js/Go TLS client that cycles through browser fingerprints. Sends real JA3 hashes per request. ✓ Pros Node.js TLS fingerprint spoofing Per-request fingerprint rotation Good for JS pipeline scraping ✗ Cons Node.js only, no Python Less robust than curl_cffi on hard targets
GoLogin 🌐	Browser	Cloud	Antidetect profiles	Hard multi-account	–
🌐 Browser Under the hood: Cloud anti-detect browser. Manages browser profiles with unique fingerprints stored in cloud. Multi-account management. ✓ Pros Profile fingerprint management at scale Good for multi-account scraping operations Team sharing of browser profiles ✗ Cons Paid product, cloud-dependent Not suitable for automated pipeline scraping Designed for manual browsing, not scripted crawling
Multilogin 🌐	Browser	Cloud	Antidetect profiles	Hard multi-account	–
🌐 Browser Under the hood: Commercial anti-detect browser with managed profile fingerprints. Team collaboration on browser profiles. ✓ Pros Professional multi-account management Managed fingerprint database Team profile sharing ✗ Cons Very expensive Designed for manual use, not automated crawling Data in cloud
ScraperAPI ⚡	Managed	API	Full stack	All incl. Walmart	–
⚡ HTTP Under the hood: Simple proxy rotation + JS rendering API. Handles geo-targeting and header rotation. ✓ Pros Simple integration, just prepend URL Free tier with 1000 calls/mo Geo-targeting built in ✗ Cons Weaker on hard anti-bot targets Basic anti-bot handling vs Zyte/Bright Data
Decodo ⚡	Managed	API	Full stack	All targets	–
⚡ HTTP Under the hood: Smartproxy's new brand. Residential, datacenter, and mobile proxy network. ✓ Pros Affordable residential proxies Pay-as-you-go pricing Good for mid-scale scraping ✗ Cons Less powerful than Bright Data on hard targets Smaller IP pool
CapSolver ⚡	CAPTCHA	API	N/A	reCAPTCHA/hCaptcha	–
⚡ HTTP Under the hood: AI-powered CAPTCHA solving service. Uses computer vision to solve reCAPTCHA v2/v3, hCAPTCHA, Cloudflare Turnstile. ✓ Pros Solves reCAPTCHA v3, hCAPTCHA, Turnstile, ImageCAPTCHA Fast: under 10 seconds for most CAPTCHA types API-based, works with any language ✗ Cons Cost per solve (~$0.001–0.002) reCAPTCHA v3 score may be low vs C++ browser Solving is symptomatic, better to avoid triggering CAPTCHA
2captcha ⚡	CAPTCHA	API	N/A	All CAPTCHA types	–
⚡ HTTP Under the hood: Human + AI hybrid CAPTCHA solving service. One of the oldest in the market. ✓ Pros Solves almost any CAPTCHA type including custom ones Human fallback for unusual CAPTCHAs Large API ecosystem ✗ Cons Slowest option, human solving adds latency Cost per solve Less automated than CapSolver
Anti-Captcha ⚡	CAPTCHA	API	N/A	reCAPTCHA/image	–
⚡ HTTP Under the hood: Human + AI CAPTCHA solving service. Competitor to 2captcha. ✓ Pros Solves all major CAPTCHA types Competitive pricing API compatible with 2captcha ✗ Cons Human solving latency Cost per solve Better to avoid triggering CAPTCHA in the first place
Scrapyd ⚡	Framework	Python	Via middleware	Scrapy deploy tool	–
⚡ HTTP Under the hood: Daemon that deploys and runs Scrapy spiders via JSON API. Port 6800. Process-based job queue. ✓ Pros Zero cloud cost, runs on any server ScrapydWeb provides visual dashboard Simple deploy: scrapyd-deploy -p project ✗ Cons Single node by default, no horizontal scaling No built-in monitoring or alerting Job isolation is process-level only
scrapy-redis ⚡	Framework	Python	N/A	Distributed Scrapy	–
⚡ HTTP Under the hood: Scrapy extension connecting spiders to a Redis shared URL queue. Enables distributed crawling. ✓ Pros Horizontal scale: add workers without code change Redis deduplicates URLs across all workers One codebase, N machines ✗ Cons Redis is a new SPOF No built-in job scheduling Requires Redis infrastructure
scrapy-cluster ⚡	Framework	Python	N/A	Enterprise Scrapy	–
⚡ HTTP Under the hood: Distributed Scrapy cluster using Redis + Kafka + Zookeeper. Enterprise-scale distributed crawling. ✓ Pros True enterprise-scale distributed crawling Kafka for message durability Multi-project support ✗ Cons Complex infra: Redis + Kafka + Zookeeper Overkill for most use cases High ops overhead
scrapy-poet ⚡	Framework	Python	N/A	Page Object pattern	–
⚡ HTTP Under the hood: Dependency injection framework for Scrapy spiders. Cleaner spider code with page objects. ✓ Pros Cleaner code via page objects pattern Works with zyte-spider and AutoExtract Testable spider logic ✗ Cons Adds abstraction overhead Learning curve for Scrapy veterans
Splash 🌐	Browser	Docker	Lua scripting	Light protection	–
🌐 Browser Under the hood: Lua-scriptable browser for JS rendering, runs in Docker. Integrates with Scrapy via scrapy-splash. ✓ Pros Docker-based, easy to deploy Lua scripting for complex interactions Good for Scrapy integration on JS sites ✗ Cons Outdated, Playwright has superseded it Lua scripting adds complexity Less stealth than Camoufox
selectolax ⚡	Parser	Python	N/A	Fast HTML parser	–
⚡ HTTP Under the hood: C-based HTML parser (lexbor engine). 10–100× faster than BeautifulSoup for pure parsing tasks. ✓ Pros Extremely fast, C engine vs Python in BS4 CSS selectors with clean Python API Low memory footprint ✗ Cons CSS selectors only, no XPath Less forgiving on malformed HTML than BS4 Smaller community/docs
lxml ⚡	Parser	Python	N/A	XPath + CSS parser	–
⚡ HTTP Under the hood: C-based XML/HTML parser. Fastest Python HTML parsing option. ✓ Pros Fastest Python HTML parser by far Full XPath 1.0 support Handles massive documents efficiently ✗ Cons Stricter on malformed HTML than BS4 C dependency, occasional install issues Verbose API vs BS4
w3lib ⚡	Parser	Python	N/A	URL/text utils	–
⚡ HTTP Under the hood: Web-related utility functions. URL normalisation, encoding handling. Used internally by Scrapy. ✓ Pros URL cleaning and normalisation Encoding detection and conversion Scrapy internals, very stable ✗ Cons Utility library only, not a scraper Most devs use it via Scrapy, not directly
SwiftShadow ⚡	Proxy	Python	N/A	Proxy pool manager	–
⚡ HTTP Under the hood: Free proxy pool manager. Fetches, validates and rotates free proxies automatically. ✓ Pros Free, zero proxy cost Auto-validates and rotates on failure 2 lines of code integration ✗ Cons Free proxies are low quality, high failure rate Not for hard anti-bot targets IP reputation usually poor
requests-ip-rotator ⚡	Proxy	Python	N/A	AWS API Gateway IPs	–
⚡ HTTP Under the hood: Rotates requests through AWS API Gateway endpoints to get rotating IPs. ✓ Pros Free if AWS free tier available AWS IPs have good reputation Works with requests library ✗ Cons AWS API Gateway has rate limits Setup requires AWS account Limited rotation speed
Colly ⚡	Framework	Go	Go TLS	Medium targets	15k
⚡ HTTP Under the hood: Go HTTP scraping framework. Fast, concurrent, clean API. ✓ Pros Very fast, Go concurrency model Low memory vs Python Good for high-throughput HTTP scraping ✗ Cons Go only, no Python Smaller ecosystem than Scrapy No browser support
Katana ⚡	Framework	Go	Go TLS + Chromium	Medium targets	8k
⚡ HTTP Under the hood: Go-based web crawler by ProjectDiscovery. Designed for security research and recon. ✓ Pros Extremely fast Go crawler Headless mode with Playwright integration Built for large-scale URL discovery ✗ Cons Security/recon focus, not a data scraping framework Go only Less data extraction tooling than Scrapy
playwright-go 🌐	Browser	Go	CDP (detectable)	Medium targets	–
🌐 Browser Under the hood: Go bindings for Playwright. Same Playwright API in Go. ✓ Pros Go concurrency for browser scraping Same Playwright API and capabilities Lower memory than Python for concurrent sessions ✗ Cons Less mature than Python Playwright Smaller community No stealth patches yet
Charles Proxy ⚡	RE Tool	Any	N/A	Mobile API intercept	–
⚡ HTTP Under the hood: Commercial HTTPS proxy for request inspection and debugging. GUI-based. ✓ Pros GUI-based, easy to use for non-developers SSL proxying with certificate install Session recording and replay ✗ Cons Paid product For debugging only, not automated scraping Less powerful than mitmproxy for scripting
Selenoid ⚡	HTTP	Go (Docker)	Browser-as-a-service	Medium targets	2.6k
⚡ HTTP Under the hood: Docker containers running headless Chrome/Firefox in parallel, Aerokube's Go-based Selenium grid replacement. ✓ Pros Run dozens of browsers in parallel from one host Lower memory than Selenium Grid Built-in video recording per session Drop-in replacement for Selenium Grid ✗ Cons Browsers still detectable as headless without stealth patches Older project, slower release cadence Requires Docker infrastructure
noble-tls ⚡	HTTP	Python	Chrome JA3/JA4	Cloudflare, DataDome	–
⚡ HTTP Under the hood: Python port of uTLS via custom TLS handshake stack, emits browser-matching ClientHello. ✓ Pros Bypasses JA3/JA4 fingerprinting Pure Python, no C compilation Lighter than curl_cffi for simple cases Easy install via pip ✗ Cons Smaller community than curl_cffi Fewer browser impersonation profiles Less battle-tested in production
hrequests ⚡	HTTP	Python	Browser-grade TLS	DataDome, Cloudflare	900
⚡ HTTP Under the hood: Drop-in requests replacement with TLS impersonation, header order matching, and optional Playwright browser mode. ✓ Pros requests-compatible API with stealth built in Header order mimics real Chrome Optional browser mode for JS rendering Built-in async support ✗ Cons Smaller ecosystem than curl_cffi Fewer impersonate profiles Newer project, some edge cases
crawlee-python 🌐	Browser	Python	Via curl_cffi backend	Most targets	6.2k
🌐 Browser Under the hood: Python port of Apify Crawlee, wraps curl_cffi for HTTP and Playwright for browser modes in a unified framework. ✓ Pros Mix HTTP and browser workers in one crawler Auto-scaling and proxy rotation built in Storage abstraction for results Strong production patterns from Apify ✗ Cons Larger than plain Scrapy Newer than Node.js Crawlee, some features lag Opinionated framework
estela ⚡	Framework	Python (K8s)	Spider-dependent	Distributed Scrapy	90
⚡ HTTP Under the hood: Kubernetes orchestrator for Scrapy, schedules and runs spiders as K8s jobs with auto-scaling. ✓ Pros Open source alternative to Zyte Cloud Elastic scaling on Kubernetes Built-in monitoring and stats UI Multi-tenant by design ✗ Cons Requires Kubernetes infrastructure Heavyweight for small projects Smaller community than Scrapyd
fake-useragent ⚡	HTTP	Python	UA strings only	Lightweight only	3.8k
⚡ HTTP Under the hood: Curated database of real-world User-Agent strings, sampled from browser telemetry sources. ✓ Pros Realistic UA strings ready out of the box Filter by browser family or OS Updated database Tiny dependency ✗ Cons UA alone is trivially detectable in 2026 Not enough for any modern anti-bot Database can become stale
grequests ⚡	HTTP	Python	requests + gevent	Unprotected APIs	4.4k
⚡ HTTP Under the hood: gevent-monkey-patched requests, fires hundreds of HTTP calls in parallel via greenlets. ✓ Pros Drop-in async for requests users Simpler than asyncio for bulk fetches Battle-tested gevent under the hood ✗ Cons Monkey-patching can conflict with other libs No HTTP/2 support Newer code should use httpx instead
Scrapoxy ⚡	Framework	Node.js	Proxy manager	Self-hosted rotation	2.1k
⚡ HTTP Under the hood: Self-hosted proxy pool manager, provisions proxies on AWS, Azure, GCP and rotates IPs automatically. ✓ Pros Free open-source alternative to Bright Data's proxy manager Auto-provision and tear down cloud IPs Ban detection and auto-rotation built in Multi-tenant ✗ Cons Cloud provider costs add up at scale Self-hosting infrastructure complexity Cloud IPs flagged faster than residential
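
The parser rows above mention pulling embedded data such as `__NEXT_DATA__` out of a page. As a rough sketch of that idea using only the standard library: Next.js embeds this payload as strict JSON, so `json.loads` is enough here, while a tool like chompjs is aimed at JavaScript literals that are not valid JSON. The sample HTML, the `extract_next_data` helper, and the product fields are all made up for illustration.

```python
import json
import re
from typing import Optional

# Matches the <script id="__NEXT_DATA__" ...> tag and captures its body.
NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict]:
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical page fragment, for illustration only.
sample_html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"product": {"name": "Widget", "price": 19.99}}}}
</script>
</body></html>
"""

data = extract_next_data(sample_html)
product = data["props"]["pageProps"]["product"]
print(product["name"], product["price"])
```

When this works, it often removes the need for CSS or XPath selectors entirely, because the page's own state object already holds the structured data.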


Browser engines, deep dive

Critical 2026 fact: CDP (Chrome DevTools Protocol) is itself detectable. Runtime.enable timing, execution context leaks, and binding exposure all signal automation. Camoufox drives Firefox through Mozilla's Juggler protocol rather than CDP, so it has no CDP leaks. playwright-stealth patches JS at runtime, but Function.prototype.toString() exposes the patch.

Microsoft 2020

Playwright ★ 68k

Chromium + Firefox + WebKit

The 2026 standard framework. Powers Firecrawl, Crawl4AI, Browserbase. CDP is detectable, so use C++ wrappers above Playwright. Auto-wait, network interception, multi-browser.

pip install playwright && playwright install

C++ Firefox · Juggler

Camoufox ★ 100%

Zero CDP exposure · geoip alignment

Mozilla Juggler below CDP level: zero CDP leaks. Near-zero fingerprint surface. 100% pass rate Mar 2026 on Cloudflare, Instagram, Reddit, X. Note: Firefox ~3% market share.

from camoufox.sync_api import Camoufox

Stealth Chromium

CloakBrowser

49+ C++ binary patches

Binary patches: Canvas, WebGL, Battery API, AudioContext, CDP input. reCAPTCHA v3 score 0.9. Passes Akamai's 60 extension probes with real extension loading. Best for Akamai-targeted Chromium sites.

Playwright source fork

PatchRight

No JS signatures anywhere

Patches Playwright Python source, not JS injection. Kasada fingerprints playwright-stealth via toString(). PatchRight leaves nothing in the runtime to inspect.

pip install patchright

Google · Node.js

Puppeteer ★ 89k

Chrome DevTools Protocol

Google's original CDP automation. puppeteer-stealth plugin patches common detection points. CDP signature still visible at protocol level. Better for rendering tasks than hard anti-bot targets.

Multi-language · WebDriver

Selenium ★ 29k

Legacy, navigator.webdriver=true

navigator.webdriver=true detectable in 2 JS lines. Use SeleniumBase UC mode to remove. Stock Selenium is dead against Akamai in 2026. Still valid for non-protected targets.

Python · UC Mode

SeleniumBase ★ 10k

Undetected Chrome Mode

UC mode removes navigator.webdriver. Auto-solves many CAPTCHAs. Good for Kasada, medium targets. Not production-safe against Akamai at scale.

from seleniumbase import Driver

Raw CDP · Async Python

nodriver / pydoll

Direct Chrome DevTools Protocol

Direct CDP without WebDriver overhead. Used with Botright for CAPTCHA solving. scrapy-nodriver integrates with Scrapy directly. Lighter than full Playwright for medium targets.

Human behaviour simulation

Botasaurus

Gaussian mouse physics

Physically realistic mouse curves via Gaussian jitter. Combines with Patchright for protocol-level evasion + human behaviour. Effective against DataDome's 35-signal behavioural analysis.

pip install botasaurus
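
To make "Gaussian jitter" concrete, here is a minimal standard-library sketch of a curved, jittered pointer path. It is not Botasaurus's implementation: the `mouse_path` helper, the quadratic Bezier curve, the smoothstep easing, and every parameter value are assumptions chosen for illustration.

```python
import random


def mouse_path(start, end, steps=30, jitter=2.0, seed=None):
    """Return (x, y) points from start to end along a curved, jittered path."""
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    # A randomly offset control point bends the path away from a straight line.
    cx = (x0 + x1) / 2 + rng.gauss(0, abs(x1 - x0) * 0.2 + 1)
    cy = (y0 + y1) / 2 + rng.gauss(0, abs(y1 - y0) * 0.2 + 1)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Smoothstep easing: slow start, fast middle, slow finish.
        t = t * t * (3 - 2 * t)
        # Quadratic Bezier interpolation between start, control point, and end.
        x = (1 - t) ** 2 * x0 + 2 * (1 - t) * t * cx + t ** 2 * x1
        y = (1 - t) ** 2 * y0 + 2 * (1 - t) * t * cy + t ** 2 * y1
        # Gaussian jitter on interior points only, so the endpoints stay exact.
        if 0 < i < steps:
            x += rng.gauss(0, jitter)
            y += rng.gauss(0, jitter)
        points.append((x, y))
    return points


path = mouse_path((100, 200), (640, 480), steps=30, seed=42)
print(len(path), path[0], path[-1])
```

Each point could then be fed to a browser's mouse-move call with a small delay between steps. Behavioural models also look at timing, scrolling, and typing, so a path generator like this covers only one of the signals.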

python

from browserforge.fingerprints import Screen
from camoufox.sync_api import Camoufox

# geoip=True: auto-aligns IP, timezone, locale, WebRTC simultaneously
# (geoip support is an optional extra: pip install "camoufox[geoip]")
with Camoufox(
    geoip=True,        # align all 5 identity vectors to proxy exit country
    humanize=True,     # Gaussian mouse jitter
    proxy={"server": "http://proxy.provider.com:8011",
           "username": "user", "password": "pass"},
    # screen takes a browserforge Screen constraint, not a plain dict
    screen=Screen(max_width=1920, max_height=1080),
) as browser:
    page = browser.new_page()
    # Warm up, never go directly to target URL
    page.goto("https://www.google.com")
    page.wait_for_timeout(2000)
    page.goto("https://cloudflare-protected.com")
    page.wait_for_load_state("networkidle")
    print(page.content()[:500])

The tools above solve the access problem. But once you have the raw HTML or JSON, you still need to extract meaning from it. That is where AI-native scraping changes everything. In 2026 the bottleneck is not access. It is the extraction layer.

06 AI & LLM Scraping

Describe, don't select

AI-native scraping replaces CSS selectors with natural language. A 2025 NEXT-EVAL benchmark showed LLMs hit F1 > 0.95 on structured extraction when input is properly formatted.

2026 Market Shift

Why AI scraping matters now

Firecrawl's markdown output uses 67% fewer tokens than raw HTML, a saving that compounds significantly across thousands of pages in RAG pipelines. AI web scraping market: $7.5B → $38B by 2034 (CAGR 19.93%). LangChain, LlamaIndex, and CrewAI all have native integrations. Claude and Cursor can scrape the web via MCP tools with zero code. A server like the Decodo MCP is the worked example: it gives the model a scraping tool that returns Markdown, JSON, or screenshots for JS-heavy pages with no proxy rotation or anti-bot handling on your side, which is exactly the shape a RAG or research pipeline wants.

Firecrawl ★ 111k

Managed · Self-hostable · FIRE-1 · MCP

Send URL → clean Markdown/JSON. No selectors. MCP server: Claude scrapes via natural language. FIRE-1 agent autonomously navigates. /interact endpoint clicks, fills forms, extracts behind dynamic content. Used by SAP, Zapier, Deloitte.

app.scrape(url) | app.crawl(site) | app.search("query")

✓ LangChain + LlamaIndex native · 500 free/mo

Crawl4AI ★ 60k

Open-source · Local LLM · Full control · MIT

"Scrapy for the LLM era." Runs on your infrastructure: data never leaves your servers. Adaptive crawling learns selectors over time. BM25 content filter. Plug in Ollama for local models or OpenAI/Deepseek.

result = await crawler.arun(url)

✓ Full data sovereignty, free, MIT license

ScrapeGraphAI ★ 18k

NL prompts · Graph pipeline · Self-healing

Describe what you want, LLM builds and executes a graph-based extraction pipeline. Self-healing: site structure changes, re-describe and it adapts. No selectors ever written. Supports OpenAI, Claude, local.

SmartScraperGraph(prompt="...", source=url)

✓ Best for: schema-free exploration, prototyping

webclaw

Rust · Chrome TLS · 10 MCP tools · 95.1% accuracy

Rust-native scraper built for AI agent integration. 10 MCP tools: Claude and Cursor can call it directly via natural language. 95.1% success rate on bot-protected sites. Zero Python overhead, runs as subprocess or HTTP service. Chrome-level TLS fingerprinting baked in.

pip install webclaw

✓ Best for: AI agents needing high-performance scraping + MCP integration

Jina Reader API

URL → clean text · Zero code

Simplest LLM scraping tool. r.jina.ai/{url} is the entire API. Returns clean Markdown. Dynamic content handled via built-in rendering. Free tier available, paid ~$0.002–$0.01/page.

✓ Best for: text extraction, no-code integration

Steel

Open-source · Docker · MCP · AI agents

Self-hostable headless browser API for AI agents. MCP server: Claude controls browsers directly. Session persistence + CAPTCHA auto-solve. <1s session start. LangChain/CrewAI integration.

✓ Best for: AI agents needing browser control

Browserbase

Managed cloud · $300M valuation

50M sessions in 2025. Playwright/Puppeteer drop-in, one endpoint swap. Session recordings + CAPTCHA auto-solve. Used by AI agent frameworks as the browser layer. From $50/mo.

✓ Best for: AI agent infrastructure at scale

python

import asyncio
import json
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import LLMExtractionStrategy
from pydantic import BaseModel

# Define exactly what you want, LLM extracts it, no selectors needed
class Product(BaseModel):
    name: str
    price: float
    model_number: str
    brand: str

async def extract(url):
    # Newer Crawl4AI releases may expect these arguments via LLMConfig /
    # CrawlerRunConfig instead; check the version you have installed.
    strategy = LLMExtractionStrategy(
        provider="openai/gpt-4o-mini",
        schema=Product.model_json_schema(),
        extraction_type="schema",
        instruction="Extract all products with prices and model numbers",
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=strategy)
        # extracted_content is a JSON string; parse it into Python objects
        return json.loads(result.extracted_content)

# Run with: asyncio.run(extract("https://example.com/products"))
# F1 > 0.95 on well-structured pages, NEXT-EVAL benchmark 2025

MCP Server · 55 production tools · MIT

Crawilfy github.com/razavioo/crawilfy-mcp-server

The full scraping stack as an MCP server — gives Claude Code, Cursor, and Codex 55 production tools for the complete crawling pipeline.

What it ships:
• Stealth via curl_cffi TLS impersonation + rotating proxies
• Auto-discover REST + GraphQL endpoints on any site
• Record a flow once, export it as a runnable Python crawler
• Smart extraction with any OpenAI-compatible LLM (free tiers + local Ollama work)
• MIT licensed

One command: uvx crawilfy-mcp-server

Why this matters: at $0.002–0.01 per request, commercial scraping APIs compound fast on any non-trivial AI agent. Crawilfy brings the full stack in-process: TLS impersonation, proxy rotation, LLM extraction, all from within your IDE. The alternative is paying per-request at scale.

GitHub ↗ PyPI ↗

From extraction to production-grade data

Crawl4AI and Firecrawl get you semantic understanding out of an LLM. But ask an LLM for a price across 10,000 articles and you will get $40, 40 dollars, 40 USD, "forty dollars", and occasionally null. Production pipelines cannot ingest that. The fix is to separate the two concerns LLMs conflate: semantic understanding and structural guarantees.

The pattern · Pydantic + Instructor + LLM

Schema-validated LLM extraction

Define a Pydantic schema for what you want. Pass it to the LLM via Instructor (which patches OpenAI, Anthropic, Mistral clients to return validated schema objects, not free text). Instructor handles retries when the LLM returns malformed output, and rejects responses that fail validation before they enter your pipeline.

# pip install instructor pydantic anthropic
from pydantic import BaseModel, Field
import instructor, anthropic

class JobPosting(BaseModel):
    title: str
    company: str
    salary_min_usd: int | None = Field(description="Floor of salary range in USD")
    salary_max_usd: int | None
    years_experience_min: int
    location: str
    remote: bool

scraped_html = "..."  # placeholder: the pre-cleaned page text from your scraper

client = instructor.from_anthropic(anthropic.Anthropic())
result = client.messages.create(
    model="claude-sonnet-4",  # substitute the exact model ID available to your account
    max_tokens=1024,  # required by the Anthropic Messages API
    response_model=JobPosting,
    messages=[{"role": "user", "content": scraped_html}],
    max_retries=3,
)
# result is a validated JobPosting object, not a string
# If LLM hallucinates "competitive" for salary, Instructor retries

Why this beats raw LLM calls: normalises currencies, units, and phrasings ("just under two percent" → 1.8); rejects hallucinated dates that don't fit the schema; retries automatically on type errors; gives you a real Python object downstream.

When classical NLP still wins

spaCy, NLTK, Stanford NLP at scale

LLMs cost $0.001–0.01 per article on Claude or GPT-5. Classical NLP costs effectively zero after model load. For extraction across 10M+ documents where the schema is narrow and the domain is stable, spaCy NER + dependency parsing still wins on cost.

Use classical NLP when: scraping millions of consistent documents (e-commerce, classifieds), schema is fixed, domain doesn't shift, latency budget is <5ms per doc.

Use LLM + Instructor when: messy heterogeneous sources (news, newsletters, job boards), context disambiguation matters ("Apple" the company vs the fruit), schema may evolve, semantic equivalences need resolving ("FTE" = "full-time" = "permanent" = "direct hire").

Hybrid in production: classical NLP pre-filters and tags. LLM resolves only the ambiguous cases. This is what Bloomberg, Reuters Refinitiv, and FactSet actually do, not pure LLM pipelines.
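The hybrid split can be sketched on the price example from earlier in this section. This is a minimal illustration, not a production normaliser: the regex covers only simple USD forms, and `stub_llm` is a hypothetical stand-in for a schema-validated LLM call such as the Instructor client above.

```python
import re
from typing import Callable, Optional

# Simple USD forms only: "$40", "40 dollars", "40 USD", "$1,299.50".
_PRICE_RE = re.compile(
    r"^\s*\$?\s*(\d{1,3}(?:,\d{3})*(?:\.\d+)?|\d+(?:\.\d+)?)\s*(?:usd|dollars?)?\s*$",
    re.IGNORECASE,
)


def normalise_price(
    raw: Optional[str],
    llm_fallback: Callable[[str], Optional[float]],
) -> Optional[float]:
    """Return a USD float, resolving cheap cases locally and the rest via the LLM."""
    if raw is None or not raw.strip():
        return None
    match = _PRICE_RE.match(raw)
    if match:
        # Classical path: deterministic, no tokens spent.
        return float(match.group(1).replace(",", ""))
    # Ambiguous path: only these strings reach the expensive model.
    return llm_fallback(raw)


def stub_llm(text: str) -> Optional[float]:
    """Hypothetical stand-in for a schema-validated LLM call."""
    return {"forty dollars": 40.0}.get(text.strip().lower())


prices = [
    normalise_price(p, stub_llm)
    for p in ["$40", "40 dollars", "40 USD", "forty dollars", None]
]
print(prices)
```

Only "forty dollars" reaches the fallback here; the other inputs are resolved by the regex, which is the cost argument for pre-filtering in miniature.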

Sources and further reading: Federico Trotta, The Web Scraping Club, May 2026; Instructor library.

The production failure mode nobody warns you about: LLMs hallucinate plausibly. If your scraped article doesn't mention a publication date, an LLM will sometimes invent one that fits the article's tone. A Pydantic date | None field with Instructor's retry logic catches this, the LLM has to either find a real date or return None. Without schema validation, fabricated dates pass into your database as facts.

The framing to keep your head straight: AI agents did not replace scraping, they became its biggest new client. When half the industry is writing the eulogy for web scraping, it is worth noticing that an agent which browses the web is doing exactly what a scraper does. It makes HTTP requests, it hits anti-bot walls, it gets rate limited, it needs proxies and a believable fingerprint. The only thing that changed is who writes the prompt. The tell is in the spend: teams shipping agents bought more proxy and unblocking infrastructure through 2025, not less, because every agent loop ends in the same place a scraper's does, at a server that would rather not serve it. Everything in this guide applies to your agent the moment it leaves the sandbox.

Making LLM extraction production-reliable: the parts that are engineering, not prompting

Asking a model to read a page and return the fields is the easy 20 percent. A team running AI-generated scrapers across hundreds of partner sites reported the number that matters: first-attempt LLM selectors fail to yield any data roughly 30 to 40 percent of the time. The model is strong at pattern recognition and has no way to check its own guess against the real DOM, so the reliability lives entirely in the scaffolding around the model: validation, diagnosis, cleaning, and a bounded retry loop. Three pieces of that scaffolding are worth stealing outright.

1 · Steer the model toward durable selectors

The selector-stability hierarchy

Left alone a model reaches for whatever selector matches, which is often a hashed layout class (div.qElViY from a Wix or Chakra build) that changes on the next redeploy. Instruct it to prefer selectors in order of durability: (1) JSON-LD structured data, a declared schema contract, (2) data-testid attributes, added for automation and rarely changed, (3) id attributes, (4) semantic elements (h1, time, address, article), (5) itemprop / schema.org attributes that are part of a public SEO contract, (6) named platform class prefixes (tn-, ot_), and only last (7) visual styling classes, the first thing to break on a redesign. The higher the selector sits, the longer it survives, so a self-healing loop that prefers the top of this list heals far less often.
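One way to enforce the hierarchy in code, rather than only in the prompt, is to rank the model's candidate selectors and keep the most durable one that still matches. The sketch below is a heuristic under stated assumptions: the regex patterns and the names `durability_rank` and `most_durable` are illustrative, and the platform prefixes are just the two examples named above.

```python
import re

# Ordered most durable first, following the hierarchy described above.
_TIERS = [
    ("json-ld", re.compile(r"application/ld\+json")),
    ("data-testid", re.compile(r"\[data-testid")),
    ("id", re.compile(r"#[A-Za-z_][\w-]*|\[id[=\]]")),
    ("semantic", re.compile(r"^(h1|time|address|article)\b")),
    ("itemprop", re.compile(r"\[(itemprop|itemtype|itemscope)")),
    ("platform-prefix", re.compile(r"\.(tn-|ot_)")),
]


def durability_rank(selector: str) -> int:
    """Return 1 (most durable) to 7 (visual styling class or unknown)."""
    for rank, (_, pattern) in enumerate(_TIERS, start=1):
        if pattern.search(selector):
            return rank
    return len(_TIERS) + 1


def most_durable(candidates):
    """Pick the candidate selector that sits highest in the hierarchy."""
    return min(candidates, key=durability_rank)


candidates = ["div.qElViY", "h1.title", '[data-testid="price"]']
print(most_durable(candidates))
```

The ranking is pattern matching on the selector string, so it can be fooled by edge cases such as a `#` inside an attribute value. Treat it as a tie-breaker between selectors that already match the page.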

2 · Do not pour the whole DOM into context

A cleaned page, then a DOM-exploration agent

A single listing page is often 200 to 400 thousand characters, and sending that on every retry is slow, expensive, and sometimes larger than the context window. First pre-clean: strip <style>, <svg>, <noscript>, and inline scripts (but keep <script type="application/ld+json">, that is data, not code), which typically shrinks the HTML three to five times. When even the clean page will not fit, do not truncate blindly, give the model targeted tools over the DOM instead, downloading the HTML to a file and exposing helpers like count_selector, dom_excerpt, and find_repeating_blocks so it can answer questions about structure without loading the whole document. And separate the two jobs: the model produces the selectors, your deterministic code does the actual field extraction, so the expensive model never has to see every row it is pulling.
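The pre-clean step can be sketched with the standard library alone. This is a rough regex-based version under simplifying assumptions: well-formed tags, and all non-JSON-LD `<script>` elements removed, not only inline ones. The name `preclean_html` is illustrative, and a real HTML parser is the safer choice in production.

```python
import re

_LD_JSON = re.compile(r"""type\s*=\s*["']application/ld\+json["']""", re.IGNORECASE)
_SCRIPT = re.compile(r"<script\b([^>]*)>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
_NOISE = re.compile(
    r"<(style|svg|noscript)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)


def preclean_html(html: str) -> str:
    """Drop style/svg/noscript and code scripts, but keep JSON-LD data blocks."""
    html = _NOISE.sub("", html)

    def keep_ld_json(match: re.Match) -> str:
        # group(1) holds the <script> tag's attributes.
        return match.group(0) if _LD_JSON.search(match.group(1)) else ""

    return _SCRIPT.sub(keep_ld_json, html)


sample = (
    "<html><head><style>a{}</style><script>var x=1;</script>"
    '<script type="application/ld+json">{"name":"A"}</script></head>'
    "<body><svg><path/></svg><h1>T</h1><noscript>n</noscript></body></html>"
)
cleaned = preclean_html(sample)
print(len(sample), "->", len(cleaned))
```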

3 · Diagnose the failure before you retry

The worst response to a selector that returned nothing is to resend the same prompt and hope. Most extraction failures fall into a few recognisable classes, and naming the class turns a blind retry into a targeted fix: extracted too little (selector too specific, matched a fragment), extracted too much (matched a container with surrounding noise), wrong DOM region (matched a real element, but the venue name in the footer instead of the event header), attribute-versus-text confusion (pulled an href when you needed the link text), and template mismatch (the sample pages the model saw do not represent every template on the site). Feeding back "the selector is too narrow, find a parent element" produces a far better second attempt than "try again", and it is the diagnostic layer that the circuit-breaker and correction-loop from the architecture section depend on. The same source also flags two JSON-LD traps worth pre-empting. First, server-side frameworks sometimes inject HTML comment markers (<!-- and -->) inside the script tag; these break json.loads, and the resulting error is easy to swallow silently, so strip comments before parsing. Second, JSON-LD can be present on some pages of a site and absent on others, so a CSS fallback is mandatory rather than optional.
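Both JSON-LD traps can be handled in a few lines. The sketch below uses only the standard library: it strips HTML comments before parsing and falls back to a caller-supplied CSS extractor when no usable JSON-LD is found. The function names and the `css_fallback` callable are hypothetical, and stripping comments this way assumes no JSON string value legitimately contains `<!--`.

```python
import json
import re

_LD_BLOCK = re.compile(
    r"""<script\b[^>]*type\s*=\s*["']application/ld\+json["'][^>]*>(.*?)</script\s*>""",
    re.IGNORECASE | re.DOTALL,
)
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def parse_json_ld(html: str) -> list:
    """Return every JSON-LD block that parses after HTML comments are removed."""
    records = []
    for raw in _LD_BLOCK.findall(html):
        cleaned = _HTML_COMMENT.sub("", raw).strip()
        try:
            records.append(json.loads(cleaned))
        except json.JSONDecodeError:
            continue  # malformed block: skip it and let the caller fall back
    return records


def extract_name(html: str, css_fallback):
    """Prefer the JSON-LD name; use the CSS-based extractor when it is absent."""
    for record in parse_json_ld(html):
        if isinstance(record, dict) and record.get("name"):
            return record["name"]
    return css_fallback(html)


page = (
    '<script type="application/ld+json">'
    '<!--$-->{"@type":"Event","name":"Gig"}<!--/$-->'
    "</script>"
)
print(extract_name(page, css_fallback=lambda html: None))
```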

Serving scraped data to an agent: the naming problem replaces the scraping problem

There is a structural shift worth noting for anyone wiring web data into an agent rather than a pipeline. The old pattern was to write a tool wrapper around each scraper, hand-write its schema, teach the model when to call it, and keep fixing the wrapper every time a site changed. Exposing the scraping capability as an MCP server collapses that layer: the agent connects once and asks in plain language, and the model selects the right endpoint out of a large catalogue and gets structured JSON back, with no bespoke wrapper to break. The non-obvious consequence, reported by people who have built these, is that the hard problem moves. It stops being "how do I scrape this" and becomes "which of these hundreds of endpoints does the model reach for", which means clear, disambiguating tool names and descriptions start mattering more than the scraping code underneath. It is the same lesson as the selector hierarchy one level up the stack: when a model is choosing, the quality of what it chooses from, and how legibly it is labelled, is the thing you actually engineer.

Q2 2026 Landscape · mapped by Massive

The agentic browser stack is 8 layers, not one product

"Browser agent" sounds like a single tool. It's a stack. Most teams building AI agents account for one or two layers; the reliable ones map all eight. When an agent fails a task, the cause is usually not the framework everyone debates, it's one of the other seven layers nobody mapped. The proxy layer sits at the bottom and every layer above it still has to reach the live site.

Layer 1 · Cloud Browser Platforms

Hosted headless browsers at scale

Managed Chrome/Chromium in the cloud, no infra to run. Browserbase, Kernel, Notte, Anchor Browser, Browserless, Hyperbrowser.

Layer 2 · Agent Frameworks & SDKs

Natural-language task → browser actions

Browser Use, Stagehand, Skyvern, AgentQL, Dendrite, Nanobrowser. These translate "log in and download my invoices" into clicks and form fills.

Layer 3 · Browser Automation

The execution primitives

Playwright, Puppeteer, Selenium, Crawlee (by Apify), BrowserMCP. Every layer above eventually calls down to one of these.

Layer 4 · Computer Use Agents

Models that see the screen and act

Anthropic Computer Use, OpenAI Operator, Fellou, Twin, MultiOn. Vision-driven, they screenshot the page and decide the next click rather than reading the DOM.

Layer 5 · Stealth & Anti-Detection

Where most agent tasks silently fail

CloakBrowser, Steel, Lightpanda, Pydoll, Camoufox. A flawless agent framework with a detected browser fingerprint still gets blocked.

Layer 6 · Data Extraction & Enrichment

Page → structured records

Diffbot, ScrapingBee, ScraperAPI, Zyte, ScrapeGraphAI. Turn the rendered page into clean JSON.

Layer 7 · LLM-Optimized Crawling

Crawl output formatted for models

Crawl4AI, Firecrawl, Jina AI, Apify, LLMScraper, Scrapy. Markdown/clean-text output that costs fewer tokens downstream.

Layer 8 · Network & Proxy Layer

The foundation everything depends on

Massive (affiliate), Bright Data, Oxylabs, Smartproxy, NetNut, IPRoyal. A perfect stack with a flagged datacenter IP fails at the first request.

How to use this map when debugging: When an agent fails, don't start with the framework (Layer 2), that's where everyone wastes time. Check Layer 5 (is the browser fingerprint detected?) and Layer 8 (is the IP flagged?) first. The boring layers fail more often than the clever ones. Credit: landscape mapped by Massive, Q2 2026.

When DIY cost exceeds platform cost, these services handle the heavy lifting. Each solves a specific problem; choosing the right one depends on which wall you are facing and at what scale.

The other way to slice it: the four layers of the runtime loop

The eight-layer map above is a catalogue of tool categories, useful for asking "which product do I need for this job." There is a second, orthogonal way to read a browser agent that is more useful for reasoning about latency and token cost: the four layers its runtime loop actually passes through on every single action. A model decides, something executes that decision against a browser, the result comes back, repeat. Those are four distinct layers, and many products blur several at once, which is exactly why they are hard to compare.

The four layers

Model, harness, driver, engine

The model reads context and picks the next action; it cannot fetch a page or click on its own. The harness is the agent program wrapped around it (Claude Code, Codex, and other generalists) that holds the conversation, exposes tools, and feeds results back, with the browser as one tool among many. The browser driver gives the harness hands on the browser (agent-browser, browser-use, Stagehand, and Playwright or Puppeteer as libraries your code imports), translating an intent like "click" into the protocol the engine speaks. The engine is the browser itself, loading pages and running their JavaScript to expose a DOM. Naming a product's layer tells you what it does for you and what you still have to supply around it.

Why the shape matters

Every boundary is a translation tax

Each layer boundary is a place where a call is serialised, crossed between processes, and deserialised on the way down, with the result climbing back up the same way. A typical Chromium setup runs the automation layer in its own Node or Python process and the browser in another beside it, so every action pays that translation tax in latency, CPU, and memory, and everything that reaches the model costs tokens. Two design moves attack this. Collapsing layers into one process (an engine that speaks its driver and loop natively, with no CDP hop) removes the cross-process cost. And compacting what the model sees: returning a compact accessibility-tree snapshot with short element refs runs a few hundred tokens against the several thousand you would spend dumping raw DOM, the same "don't pour the DOM into context" lesson from the extraction section, applied to the agent loop.
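The compaction idea can be illustrated without any browser. The sketch below assumes the accessibility tree has already been fetched as nested dicts with `role`, `name`, and `children` keys (a hypothetical shape, not any driver's actual format) and flattens it into short lines with element refs that the model can cite back.

```python
from typing import Dict, List, Tuple


def compact_snapshot(root: dict) -> Tuple[str, Dict[str, dict]]:
    """Flatten a nested accessibility tree into short, ref-tagged lines."""
    lines: List[str] = []
    refs: Dict[str, dict] = {}

    def walk(node: dict, depth: int) -> None:
        role = node.get("role", "generic")
        name = node.get("name", "")
        # Unnamed generic containers carry no meaning for the model: skip them.
        keep = role != "generic" or bool(name)
        if keep:
            ref = f"e{len(refs) + 1}"
            refs[ref] = node
            label = f' "{name}"' if name else ""
            lines.append(f"{'  ' * depth}[{ref}] {role}{label}")
        for child in node.get("children", []):
            walk(child, depth + 1 if keep else depth)

    walk(root, 0)
    return "\n".join(lines), refs


tree = {
    "role": "form",
    "name": "Login",
    "children": [
        {
            "role": "generic",
            "children": [
                {"role": "textbox", "name": "Email"},
                {"role": "button", "name": "Sign in"},
            ],
        }
    ],
}
text, refs = compact_snapshot(tree)
print(text)
```

The model sees only `text` and answers with a ref such as `e3`; the harness looks that ref up in `refs` to find the element to act on, so the full DOM never enters the context.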

One consequence is worth its own line because it changes the cost model. If the loop is deterministic once the model has solved a task, you can run the reasoning at build time, once, record the resolved sequence of actions as a plain script, and replay that script afterwards with no model in the loop at all. The reasoning is paid for a single time; every replay after that costs zero tokens. For a recurring browsing job (a scheduled price check, a nightly login-and-export) this is the difference between paying an LLM on every run and paying it once. The record-and-replay pattern only holds while the target's flow is stable, a layout change invalidates the script and sends you back through the model once, which is simply the self-healing trigger from the architecture section wearing a different hat.
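A minimal sketch of record-and-replay, assuming actions are plain dicts and that `execute` is whatever function your driver layer exposes for performing one action (both are illustrative choices, not a specific product's API). During the model-driven run every successful action is recorded; afterwards the saved script replays with no model involved.

```python
import json
from typing import Callable, Dict, List

Action = Dict[str, str]


class Recorder:
    """Wraps the driver during the model-driven run and logs what succeeded."""

    def __init__(self, execute: Callable[[Action], None]) -> None:
        self._execute = execute
        self.actions: List[Action] = []

    def act(self, action: Action) -> None:
        self._execute(action)         # raises if the action fails
        self.actions.append(action)   # record only actions that succeeded

    def dump(self) -> str:
        return json.dumps(self.actions)


def replay(script: str, execute: Callable[[Action], None]) -> int:
    """Re-run a recorded script with no model in the loop; returns actions run."""
    actions = json.loads(script)
    for action in actions:
        # An exception here signals the flow changed: re-record with the model.
        execute(action)
    return len(actions)


log: List[Action] = []
recorder = Recorder(log.append)  # stand-in driver that just logs each action
recorder.act({"op": "goto", "target": "https://example.com/login"})
recorder.act({"op": "click", "target": "[data-testid='export']"})
script = recorder.dump()

replayed: List[Action] = []
print(replay(script, replayed.append))
```

An exception raised during `replay` is the natural trigger for sending the task back through the model once and recording a fresh script.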

The CAPTCHA reality check: the LLM in your agent is almost never the thing solving it

Every agentic browser ships a "solves CAPTCHAs" bullet point. Open the code of the open-source ones and a different story appears: almost none of them use the LLM to solve a visible challenge. They either decline to try or try and fail. One popular open agent's entire CAPTCHA file just waits for a solver running in the vendor's cloud and blocks the loop until it answers; run it locally with your own model and there is nothing to wait for. Another, asked directly how its open release solves CAPTCHAs, ships a handler that logs "please solve the captcha, you have 30 seconds", sleeps for thirty seconds, and returns success no matter what happened. The honest maintainers say so outright: the anti-bot pieces are kept in the closed cloud product on purpose.

So when a product says its agent "solves hCaptcha", it is doing one of four things, and a model reading the page is not on the list. Either it stays stealthy enough that the visible challenge never fires (clean residential proxies plus fingerprint shaping, so the score stays low and no puzzle appears), or it declares itself an allowlisted bot and is waved through by a business arrangement with the CAPTCHA vendor (the verified-bot path, the same idea as Web Bot Auth), or it ships its own solver for some challenge types and documents a third-party solver service for the rest, or it runs an in-house solver (vision model or classical computer vision) that owns the whole stack. The marketing word "AI" sits on top of all four, but the actual work is proxies, partnerships, or a dedicated solver, not the agent's language model puzzling out the image.

The detail that decides what is even possible

The same-origin boundary around the widget

hCaptcha renders inside an iframe whose origin is hcaptcha.com, cross-origin to the host page. The solve happens entirely inside that frame, and when it finishes the widget writes a token into a hidden textarea[name="h-captcha-response"] the host form submits. An agent driving the outer page over Playwright or CDP has full control of the host but only limited reach inside the widget frame, because the same-origin policy blocks it. A browser extension, which runs with cross-origin privileges, can observe and click inside the widget frame. That asymmetry, not model intelligence, is why extension-based solvers work where a page-driving agent stalls.

What this means for your pipeline

Plan around it, do not buy the bullet point

The practical reading: invisible challenges (reCAPTCHA v3, Turnstile) are a fingerprint and proxy problem, solve those upstream and the puzzle never shows. Visible image challenges (hCaptcha, reCAPTCHA v2) still demand an answer, and in 2026 that answer comes from a dedicated solver service or an in-house vision model wired in through an extension or API, not from your agent's LLM reading the screen. Budget and architect for that explicitly rather than assuming the model handles it, because the open-source evidence says it does not.

07 Managed platforms

When DIY cost exceeds platform cost

If you are spending more than 2 engineer-days/month on anti-bot maintenance, a managed platform is cheaper. The crossover typically hits when facing F5 Shape or Kasada at scale.

Bright Data 98.44%

Enterprise · 72M+ IPs · Scrape.do #1 2025

Highest success rate in Scrape.do 2025. 100% on Indeed, Zillow, Capterra. 72M+ residential IPs. GDPR, ISO 27001. Scraping Browser for JS-heavy targets. $1.50/1K requests.

✓ Best for: F5 Shape, hard targets at scale brightdata.com ↗

Zyte 93.14%

#1 Proxyway 2025 · Fastest API · Scrapy

#1 Proxyway 2025, 93.14% success rate. Fastest API response. Smart Proxy auto-selects type. GPTE AI generates parsers from natural language. scrapy-zyte-smartproxy integration.

✓ Best for: Scrapy pipelines, speed zyte.com ↗

Firecrawl ★ 111k

AI scraping · Self-hostable · MCP

URL → Markdown/JSON. No selectors. MCP server, Claude scrapes via natural language. LangChain + LlamaIndex native. Used by SAP, Zapier, Deloitte. 500 free/mo.

✓ Best for: RAG pipelines, AI agents firecrawl.dev ↗

Crawl4AI ★ 60k

Open-source · Local LLM · Free

89.7% OOTB success rate. Runs on your infrastructure. Local LLM support (Ollama). MIT license. Adaptive crawling. Full data sovereignty, data never leaves your servers.

✓ Best for: privacy, open-source, cost $0 crawl4ai.com ↗

Apify FEATURED · NO-CODE OPTION

Serverless cloud · 10,000+ Actors · MCP · LangChain

The best option if you don't want to build scrapers yourself. Apify is a cloud platform where scraping is already done for you, 10,000+ community-built Actors cover almost every major website: Amazon, LinkedIn, Instagram, Google Maps, TikTok, Zillow, Twitter/X, Google Search, and thousands more. You pick an Actor, give it a URL, and get back clean JSON. No Python, no proxies, no infrastructure.

🎭 What is an Actor?

An Actor is a serverless scraping programme that runs on Apify's cloud. Think of it like a function: you pass it inputs (URL, keywords, max results) and it returns data. Someone else wrote the spider, handles the anti-bot bypass, manages proxies, and keeps it updated. You just call it.

💰 How pricing works

You pay in Compute Units (CU). One CU = 1 GB of memory for 1 hour. Most Actors use 0.1–0.5 CU per 1,000 results. Free tier: $5/mo credit, enough for casual use. Paid plans from $49/mo. You can also run your own code as Actors and monetise them on the marketplace.
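To make those figures concrete, here is the arithmetic as a small sketch. It assumes the approximate $0.25/CU rate quoted elsewhere in this card and the 0.1–0.5 CU per 1,000 results range above; actual Actor consumption varies, so treat the output as an order-of-magnitude estimate.

```python
from decimal import Decimal

PRICE_PER_CU = Decimal("0.25")   # approximate USD rate per Compute Unit (assumed)
FREE_CREDIT = Decimal("5")       # monthly free-tier credit in USD
CU_PER_1K = (Decimal("0.1"), Decimal("0.5"))  # typical usage per 1,000 results


def cost_per_1k(cu_per_1k: Decimal) -> Decimal:
    """USD cost of 1,000 results at a given CU consumption."""
    return cu_per_1k * PRICE_PER_CU


def free_results(cu_per_1k: Decimal) -> int:
    """How many results the free credit covers at that consumption."""
    free_cu = FREE_CREDIT / PRICE_PER_CU
    return int(free_cu / cu_per_1k * 1000)


for cu in CU_PER_1K:
    print(f"{cu} CU/1k -> ${cost_per_1k(cu)} per 1k, {free_results(cu):,} free results")
```

On these assumptions the free credit covers roughly 40,000 to 200,000 results a month, at $0.025 to $0.125 per 1,000 results.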

🤖 Apify + AI agents

Apify has a native MCP server: plug it directly into Claude, Cursor, or any LangChain agent. Your AI agent can call "scrape this URL", "search Google for X", or "get all reviews for this product" as natural language tool calls. No code required on the LLM side.

🏗️ For engineers who do build

Apify's open-source Crawlee library (formerly Apify SDK) is the core of many Actors. You can build your own Actor locally with Crawlee, push it to Apify, and run it on their infrastructure with built-in proxy rotation, auto-scaling, and a dataset API. 15K+ GitHub stars.

✓ Use Apify when

• You need data from a well-known site quickly
• You don't want to maintain scrapers long-term
• You're building an AI agent that needs live web data
• You want someone else to handle anti-bot bypasses
• You need to scale without managing infrastructure

✗ Build your own when

• Your target site has no existing Actor
• You need custom data transformation logic
• You're scraping at very high volume (cost)
• You need full control over request patterns
• Data stays internal and can't touch third-party cloud

Quick start: apify.com/store · Crawlee (open source) · MCP server · ~$0.25/CU · Free $5/mo tier · Python + Node.js SDK

✓ Best for: ready-made scrapers, LangChain

Oxylabs

Enterprise · 100M+ IPs · OxyCopilot AI

100M+ IPs, 195 countries. OxyCopilot AI generates parser code from natural language. Owns ScrapingBee (acquired 2025). ISO 27001 + GDPR. From $49/mo.

ScrapingBee

Headless · Managed rendering

Handles JS rendering, CAPTCHAs and proxies. Simple REST API, pass a URL, get back HTML or screenshots. Good for teams that want managed scraping without infrastructure. Free tier available.

scrapingbee.com ↗

Scrapfly

Anti-bot · AI extraction · Monitoring

Premium scraping API with built-in anti-bot bypass, JS rendering, and AI-powered data extraction. Strong on hard targets. Includes scraping monitoring and scheduling out of the box.

scrapfly.io ↗

Diffbot

AI extraction · Knowledge graph

Uses computer vision and AI to automatically extract structured data from any webpage, no CSS selectors, no XPath. Builds a knowledge graph from scraped content. Best for unstructured web data that needs AI parsing.

diffbot.com ↗

WebScraper.io

No-code · Chrome extension

Point-and-click scraping via a Chrome extension, select elements visually, define pagination, export to CSV. No coding required. Cloud version runs scrapers on schedule. Best for non-technical users.

webscraper.io ↗

Browse.ai

No-code · Monitor · Robots

Train a robot to scrape any website in 2 minutes by clicking on the data you want. Monitors for changes, sends alerts. Handles login flows, pagination, and dynamic sites. No code needed at all.

browse.ai ↗

Browser Use

AI agent · LLM-controlled browser

Open-source library that lets LLMs control a real browser. The AI agent navigates, clicks, fills forms and extracts data from instructions in natural language. 81% success rate on anti-bot benchmarks. GitHub ↗

browser-use.com ↗

Stagehand v3 · OCT 2025

AI Browser SDK · Browserbase · Open source

Browserbase's AI browser automation framework. Four primitives: act(), extract(), observe(), agent(). Write browser flows in plain English ("click submit button") that survive page redesigns via runtime LLM resolution. Built on CDP, supports OpenAI/Anthropic/Gemini. 65% Mind2Web benchmark. Self-healing + auto-caching. TypeScript and Python.

browserbase.com/stagehand ↗

Kadoa

AI · Zero-config · Auto-adapt

AI-powered scraping that requires zero configuration, no selectors, no rules. Understands page structure automatically and adapts when sites change. Ideal for scraping at scale without maintaining spider code.

kadoa.com ↗

ScrapeGraphAI

LLM · Graph pipeline · Open source

Builds a graph-based extraction pipeline from a natural language prompt. Describe what data you want, it generates the scraping logic. Open source and self-hostable. Good for rapid prototyping of complex extractions.

GitHub ↗

TinyFish

AI · Structured extraction · Fast

AI-native scraping API focused on speed and structured data output. Pass a URL and a schema, get back clean typed JSON. Handles JS rendering and basic anti-bot. Good fit for feeding structured data into AI pipelines.

tinyfish.io ↗

Nimble

AI · Structured · E-commerce

AI-powered web data platform with pre-built pipelines for e-commerce, SERP, and social. Returns structured data with no parsing needed. Built-in proxy network. Strong for retail intelligence and price monitoring.

nimbleway.com ↗

NetNut

ISP · Residential · Scraping API

ISP-level residential proxy network with a built-in scraping API. Direct carrier connections for lower detection risk. Strong for e-commerce and SERP scraping where IP freshness and session stability matter.

netnut.io ↗

Infatica

Scraper API · SERP · P2B sourcing

Singapore-based provider with both a P2B residential proxy network and a managed Scraper API (POST URL → HTML/JSON, auto-retry, optional render mode for JS-heavy pages, dedicated SERP endpoint with Google AI Overview support). The differentiator vs Zyte/Bright Data/ScraperAPI: explicit peer-to-business IP sourcing, mandatory KYC on buyers, ISO 27001/27701/22301/20000-1 certified. Smaller pool than the giants but priced below them, with a 5,000-request trial. Worth a look when ethics and certification matter as much as raw scale.

infatica.io ↗

Scraping Robot BY RAYOBYTE

Scraping API · 5,000 free/month · JSON output

Plug-and-play scraping API from Rayobyte. Returns clean JSON, handles cookies + headers + browser attributes automatically. 5,000 free scrapes/month on signup, paid tiers from $5/GB. Built on Rayobyte's proxy network and rayobrowse stealth browser. Lower entry barrier than Bright Data or Zyte for teams wanting "scraping as a service" without infrastructure.

scrapingrobot.com ↗

✓ Oxylabs · Best for: enterprise scale, AI-generated parsers oxylabs.io ↗

The next CAPTCHA frontier is liveness, and it is being defeated the same week it ships

The reason the visible puzzle is fading is that behavioural scoring has run into a wall. When most of your traffic is automated and the best of it imitates a person convincingly, a test built on watching for human signals stops separating anyone from anyone. So the frontier is moving to liveness: proof that a real, live body is present at the moment of the check. In June 2026 Google began testing a hand-gesture reCAPTCHA inside Google Cloud Fraud Defense. It asks for camera permission, records a short clip of you waving or holding up an open palm, and uses a hand-tracking model to extract 21 knuckle-landmark coordinates, deleting the video afterward. On paper it is a much harder challenge for a script than reading warped text.

In practice it was passed with a still image within days of surfacing. A journalist and independent testers fed a plain stock photo of a hand through a virtual camera (OBS Virtual Camera presents any video file or image to the browser as if it were a real webcam) and the check accepted it. No deepfake, no generated animation, no AI bot reading the page, just a static picture routed through the camera input, and the whole thing automatable in minutes with a small script. The lesson is the one this section keeps returning to: the challenge is only as strong as the integrity of the channel it arrives on, and a browser cannot vouch that the pixels on its camera input came from a physical lens rather than a file.

Why a static image gets through

The attack is injection, not forgery

Liveness systems fail at the point where the media enters the pipeline, not at the recognition step. A virtual-camera driver sits between any media file and the browser's getUserMedia stream, so the model never sees a lens, it sees whatever frames you hand it. This is a presentation or injection attack, and it is the same class of bypass that has dogged face liveness for years (masks, screen replays, injected deepfake video). A hand is, if anything, easier to synthesise than a face, with fewer micro-expressions to get right. Reported virtual-camera and deepfake liveness-bypass attempts rose sharply through 2025-2026 as the tooling commoditised, with some injection kits priced around the cost of a coffee.

What it means for a scraper

The friction grows faster than the gap closes

For most scraping you will never trip a camera challenge, because the way to beat the visible test is to never summon it: keep the trust score high with clean sessions and a coherent fingerprint so the hard challenge stays dormant. Liveness raises the ceiling of pain for the cases that do escalate, but it does not close the underlying gap, and it adds real friction and privacy cost for genuine humans. The structural pressure points the other way entirely: bots increasingly skip the rendered interface and hit the JSON API directly (a measurable and rising share of automated traffic in 2025-2026), where there is no camera prompt to inject into at all. The verification ritual gets longer at the front door while the data keeps leaving through the side.

One caveat worth stating plainly: the vendor here is uniquely placed to harden this over time, owning the dominant browser, the hand-tracking models, the reCAPTCHA scoring layer, and a mobile OS, which is exactly the stack you would need to detect virtual-camera injection through capture-integrity and device-attestation signals. The current stock-photo bypass is an early-rollout gap, not proof the approach is permanently hopeless. The durable point is the principle: a liveness check is only as trustworthy as its ability to prove the media came from a real sensor, and that proof is the hard, unfinished part.

5b Adjacent category

Computer Use Agents when scraping isn't enough

A new category emerged in 2025: AI agents that don't just scrape, they log in as the user, navigate any UI (web apps, legacy portals, desktop software), handle MFA and CAPTCHAs, and return structured JSON. They differ from scrapers because the user grants permission: think "Plaid for any website." If your problem is utility bills, payroll exports, e-commerce backends, or any portal without a public API, this is the category.

Deck FEATURED · $25M RAISED

Computer Use Agents · Credential Vault · SOC 2

Plaid-ifies any website. Provisions isolated desktop VMs, encrypts credentials in Deck Vault, runs AI agents that log in, navigate, and return schema-validated JSON. Founded by the team behind Flinks (Canadian open-banking, acquired for $150M by National Bank). Connects to 100,000+ utility providers across 40+ countries. Handles MFA, CAPTCHA, device fingerprinting, audit-logged sessions. Strong on regulated portals with no public API.

deck.co ↗

Skyvern

Open source · LLM + Computer Vision · 85.8% WebVoyager

YC-backed open-source agent that uses LLMs and computer vision (no XPath or CSS selectors) to operate any browser workflow. State-of-the-art 85.8% on the WebVoyager benchmark. Used for invoice retrieval, job applications, government forms, insurance quotes. Available cloud-hosted or as a self-hostable SDK with Playwright integration.

skyvern.com ↗

Bytebot

SDK · AI browser automation

SDK-first computer use agent platform. Lighter footprint than full VM solutions, integrates into existing apps. Targets developer workflows where you want agentic browser actions without managing browser pools yourself.

bytebot.ai ↗

CloudCruise

Browser automation · Web agents

Developer platform for creating and managing web agents. Focuses on production-grade browser automation infrastructure. Competes with Deck and Browserbase on the infra layer.

cloudcruise.ai ↗

Autotab

Enterprise AI agent · Data + form automation

General-purpose AI agent for enterprise: data collection, form filling, and executing actions across business apps. Pitched at operations teams rather than developers.

autotab.ai ↗

Browserless

Managed headless Chrome · CDP-as-a-service

Chrome-as-a-service over WebSocket and REST. Foundation layer that other agent platforms build on. Strong for teams that want managed browser pools without the agent reasoning layer on top.

browserless.io ↗

When to pick this category over scraping: if the data lives behind a login the user owns (their utility bill, their bank statement, their payroll), Computer Use Agents are the right answer: the user-permission model gives you a clean legal posture and access to data that scraping cannot legally reach. If the data is public-facing (e-commerce listings, SERPs, social), traditional scraping is faster and cheaper.

Platforms sort out the browser and the fingerprint. But every request still needs an IP address, and the type of IP matters as much as any other signal in your stack.

08 Proxy strategy

IP type matters
more than provider

Rotating proxies is table stakes. The real variable is IP type: datacenter IPs score near-zero on DataDome and PerimeterX regardless of fingerprint quality.

Datacenter

Trust: Very Low

AWS/GCP/Azure ranges. Instantly flagged by DataDome and PerimeterX. Cheapest (~$0.01/GB). Use only on non-protected public data. Never for Akamai or PerimeterX targets.

Residential

Trust: High

Real home ISP addresses. Passes most trust checks. Confirmed: curl_cffi + residential bypasses DataDome on Grainger.com. Rotate per session, not per request: mid-session rotation = Akamai block.

Mobile / 4G

Trust: Highest

T-Mobile, Vodafone, O2 carrier IPs. Highest trust score on DataDome and PerimeterX. Shared tower IPs, hard to flag. Mobile IPs get DataDome 200 OK where residential fails. ~$10–15/GB.

ISP / Static

Trust: High

Static residential range. Akamai multi-request scoring rewards consistent IPs, trust accumulates from same ISP IP. Best for long sessions on Akamai sites. Never rotate mid-session.

NetNut

ISP Direct

ISP-based infrastructure with direct carrier connections. Lower detection risk than pooled residential. Fast and stable, good for e-commerce targets that check IP freshness and session age.

IPRoyal

Residential

Ethically sourced residential + datacenter. Pay-as-you-go pricing, no long-term commitment. Good entry point before scaling to enterprise contracts with Oxylabs or Bright Data.

Massive I use this Ethical · Founded 2018

Residential · ISP · Web Access API · Web Render API · MCP Server

Try Massive with my referral ↗ (affiliate)

"I've tested a lot of proxy providers across my 7 years in scraping. Massive stands out for two reasons: the ethics are real (not marketing), and the performance numbers hold up under actual load. 99.87% US success rate and 0.52s response time aren't made up, my production runs match that. If you care about running a clean, compliant operation, this is where I'd start." Asad Ikram, Data Engineer

Residential Proxies

1.6M+ IPs, 195+ countries. 99.87% US success rate. 0.52s response time. GDPR + CCPA compliant, AppEsteem certified. From $4.9/GB.

ISP Proxies

Static residential IPs for sticky, session-bound workflows. 100% success rate, 0.09s response (US). From $1.8/IP. Best for continuous monitoring.

Web Render API

Full JavaScript rendering with anti-bot bypass at scale. From $8/mo. Handles Cloudflare-protected pages without you managing browsers.

MCP Server ✦ new

Official MCP server. Use Massive directly from Claude, Cursor, or any MCP client. Geo-targeted search, bulk extraction, SERP analysis without leaving your AI workflow.

99.87% US success rate · 0.52s response time · 195+ countries · 99.9% uptime · 100% ethically sourced · <20% fraud score (US)

Verified target success rates

Instagram 96% · Amazon 94% · Google 88% · ISP 100%

Trusted by Snowflake · Shopee · Tavily

Startups get 1TB free for 3 months, no equity required. 24/7 live support. GDPR + CCPA compliant. AppEsteem certified.

Join with my referral ↗ (affiliate)

Rayobyte

Ethical · Multi-type

"America's #1 proxy provider" (formerly Blazing SEO, est. 2014). 40M+ residential IPs across 100+ countries, plus ISP, datacenter, and mobile. Non-expiring bandwidth sets them apart, $3.50/GB residential dropping to $0.50/GB at 5TB+. Ethically sourced via Cash Raven consent-based proxyware. Ships rayobrowse stealth browser too. Hands-on technical CEO, partners with EWDCI for ethics standards.

Scrapoxy

Open Source Manager

Self-hosted proxy manager that pools and rotates proxies across AWS, Azure, GCP. Routes requests through different IPs automatically. Free alternative to commercial proxy managers. github.com/fabienvauchelles/scrapoxy

Byteful

Residential · ISP

Residential and ISP proxies pitched at scraping engineers who understand the full detection stack. Honest positioning: "proxies solve IP reputation, but TLS fingerprint, header order, browser behaviour, and request cadence all have to line up independently." Good entry point for teams that already have their TLS and browser layers handled.

Infatica

Residential · Mobile · P2B network

Singapore-based provider that runs an explicit peer-to-business (P2B) sourcing model: they pay app developers to embed their SDK on idle user devices instead of buying SDK installs through opaque intermediary chains. Smaller pool than the giants (~15M IPs vs Bright Data's ~150M, Oxylabs' ~175M) but with city, ZIP, and ASN filtering you can combine, and they claim to not resell the pool to other providers. Distinctive for the market: mandatory KYC on buyers, which is unusual and reduces abuse on the network. Pricing sits below the premium tier (subscriptions from $0.30/GB, pay-as-you-go from $8/GB). Worth a look when you want ethical sourcing and stricter buyer vetting without paying Bright Data prices, accepting that filter coverage by country is thinner than the larger networks.

WebRTC coherence rule: Proxy IP country, WebRTC ICE candidate, DNS resolver, timezone, and Accept-Language must all agree. US residential proxy + Pakistani DNS = flagged by every major anti-bot. Use geoip=True in Camoufox to align all five vectors automatically.
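The coherence rule above can be checked before a session is ever used. What follows is a minimal sketch, not a Camoufox feature: `SessionProfile`, `coherence_problems`, and the tiny `TIMEZONE_COUNTRY` table are hypothetical names introduced for illustration, and the country codes for the WebRTC and DNS vectors are assumed to have been resolved already by whatever GeoIP lookup you use.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical, deliberately tiny lookup table for illustration only.
TIMEZONE_COUNTRY = {
    "America/New_York": "US",
    "America/Chicago": "US",
    "Europe/London": "GB",
    "Asia/Karachi": "PK",
}


@dataclass
class SessionProfile:
    proxy_country: str    # country of the proxy exit IP
    webrtc_country: str   # country of the IP exposed in WebRTC ICE candidates
    dns_country: str      # country of the DNS resolver in use
    timezone: str         # IANA timezone reported by the browser
    accept_language: str  # raw Accept-Language header value


def language_region(accept_language: str) -> Optional[str]:
    """Region of the first language tag: 'en-US,en;q=0.9' gives 'US'."""
    first_tag = accept_language.split(",")[0].split(";")[0].strip()
    parts = first_tag.split("-")
    return parts[1].upper() if len(parts) > 1 else None


def coherence_problems(profile: SessionProfile) -> List[str]:
    """Return the vectors that disagree with the proxy exit country."""
    expected = profile.proxy_country.upper()
    problems = []
    if profile.webrtc_country.upper() != expected:
        problems.append("webrtc")
    if profile.dns_country.upper() != expected:
        problems.append("dns")
    # A timezone missing from the table is treated as a mismatch.
    if TIMEZONE_COUNTRY.get(profile.timezone) != expected:
        problems.append("timezone")
    region = language_region(profile.accept_language)
    if region is not None and region != expected:
        problems.append("accept-language")
    return problems
```

An empty list means all five vectors agree. The US-proxy-plus-Pakistani-DNS case from the rule above comes back as `["dns"]`, which is the cue to fix the session rather than send it.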

Crawlera/Zyte proxy bug: Port 8011 speaks plain HTTP. Both http:// and https:// keys must use http:// scheme. Using https:// causes BoringSSL WRONG_VERSION_NUMBER (TLS-over-TLS failure). Fix: "https": "http://key:@proxy.crawlera.com:8011/"

python

import random
import time

from curl_cffi import requests

# One session keeps the same impersonated Chrome TLS fingerprint across requests.
session = requests.Session(impersonate="chrome124")

# Crawlera/Zyte: BOTH keys use http://, never https://
PROXIES = {
    "http":  "http://apikey:@proxy.crawlera.com:8011",
    "https": "http://apikey:@proxy.crawlera.com:8011",  # http:// not https://
}

def fetch(url, retries=3):
    """Return the first 200 response, or None after `retries` attempts."""
    for i in range(retries):
        try:
            r = session.get(url, proxies=PROXIES,
                            timeout=30, verify=False)  # verify=False: proxy cert
            if r.status_code == 200:
                return r
            if r.status_code in (403, 429):
                # Blocked or rate-limited: exponential backoff with jitter.
                time.sleep(2**i + random.uniform(0, 1))
        except Exception as e:
            print(f"Error: {e}")
    return None

Rotate sessions, not IP addresses: stickiness is the strategy

The most expensive rotation mistake is the one that feels most thorough: buy the biggest IP pool you can, rotate the address on every request, randomise the User-Agent, and assume you are now invisible. What actually happens is that the blocks arrive around request 150 to 300, because modern anti-bot systems (Cloudflare, Akamai, DataDome, PerimeterX) do not treat the IP as the primary signal. They model the session, and per-request IP rotation produces a session that no human could generate: the same logged-in user apparently teleporting between cities mid-visit. More IPs do not fix that, they amplify it. The unit you rotate should be the session object, not the address, where a session bundles one exit IP with one coherent set of cookies, headers, and geo, held stable for as long as a real visit would last.

Where a mid-session IP change gives you away

Coherence is the layer most scrapers fail

Three flows punish an IP swap in the middle. Authenticated sessions: change the IP between the login request and the next data request and the server sees a user who signed in from Paris asking for data from Seoul, so it re-verifies or invalidates. Stateful pagination: many apps tie server-side page state to your IP, so changing address between page 3 and page 4 loses the cursor and returns a block or, worse, silently wrong data. Multi-step forms and checkouts: the state established at step one is carried forward, and a mid-flow IP change is an instant red flag. The rule that falls out is simple: all requests belonging to one logical session must exit through the same IP, or at minimum the same subnet, ASN, and geo.

How to build the rotation layer

Sticky assignment, honest expiry, careful retries

Key each session deterministically (job plus target domain, not random) so a crawl reuses one session until it expires, and expire on both a TTL and a request count (refresh around every ten minutes or a couple hundred requests, so a long-lived session does not become its own statistical fingerprint). Keep one header profile per session: no swapping the sec-ch-ua or Accept-Language between requests that claim to be the same browser. Between sessions, rotate within the same geo and ASN cluster, since a jump from a Tokyo residential IP to a Toronto one reads as a VPN switch, not a returning user. And separate your retry cases: a transient timeout retries on the same session after a short jittered backoff; only a clear block (403, 429, a CAPTCHA) flags the session and forces a fresh one. At fifty machines this needs a shared store (a Redis-backed pool) so every worker sticks to the same session instead of each inventing its own.
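The rules above (deterministic keys, expiry on both TTL and request count, separate retry cases) fit in a small pool object. This is a minimal single-process sketch under stated assumptions: `SessionPool`, `StickySession`, and the returned action strings are hypothetical names, an in-memory dict stands in for the shared Redis-backed store, and the default block statuses (403, 429) mirror the `fetch` example earlier in this section rather than any vendor's documentation.

```python
import hashlib
import time
from typing import Callable, Dict, Iterable, Optional


class StickySession:
    """One exit IP plus one header profile, identified by a session id."""

    def __init__(self, session_id: str, created_at: float) -> None:
        self.session_id = session_id  # e.g. passed to the provider as a sticky token
        self.created_at = created_at
        self.requests = 0
        self.flagged = False


class SessionPool:
    def __init__(self, ttl_seconds: float = 600, max_requests: int = 200,
                 block_statuses: Iterable[int] = (403, 429),
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.max_requests = max_requests
        self.block_statuses = frozenset(block_statuses)
        self.clock = clock
        self._sessions: Dict[str, StickySession] = {}
        self._generation: Dict[str, int] = {}

    @staticmethod
    def key_for(job: str, domain: str) -> str:
        # Deterministic: the same job + domain always maps to the same slot.
        return hashlib.sha256(f"{job}|{domain}".encode()).hexdigest()[:16]

    def _expired(self, s: StickySession) -> bool:
        return (s.flagged
                or self.clock() - s.created_at >= self.ttl_seconds
                or s.requests >= self.max_requests)

    def acquire(self, job: str, domain: str) -> StickySession:
        """Reuse the current session for this job/domain, or start a fresh one."""
        key = self.key_for(job, domain)
        current = self._sessions.get(key)
        if current is None or self._expired(current):
            gen = self._generation.get(key, 0) + 1
            self._generation[key] = gen
            current = StickySession(f"{key}-{gen}", self.clock())
            self._sessions[key] = current
        current.requests += 1
        return current

    def report(self, session: StickySession, status: Optional[int] = None,
               captcha: bool = False, timed_out: bool = False) -> str:
        """Classify an outcome: only a clear block burns the session."""
        if captcha or status in self.block_statuses:
            session.flagged = True
            return "new_session"
        if timed_out:
            return "retry_same_session"  # caller adds a short jittered backoff
        return "ok"
```

A worker calls `acquire("prices", "example.com")` before each request and `report(...)` after it. Swapping the two dicts for a shared store is what lets fifty machines agree on the same session.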

The division of labour that scales best: let a provider that offers sticky sessions guarantee the exit-IP consistency (a session token with a location baked in, held for up to some minutes or hours), and keep your own layer responsible for the cookies, headers, and navigation coherence on top. Building and maintaining a distributed session pool yourself is real, ongoing work that grows every time a target updates its detection. IP rotation is a tactic; session architecture is the strategy, and it is the difference between firefighting blocks and running a scraper that stays a reliable data source as the detectors evolve.

The mirror-image mistake is just as fatal: rotating IPs under one fixed fingerprint. Sticky sessions solve the case where the IP changes mid-visit, but the opposite pairing is its own tell. If you rotate through a dozen addresses while the browser fingerprint stays byte-identical across all of them, you have not diversified, you have announced that one client is hiding behind twelve IPs, which is behaviour no real user or device produces. Once the fingerprint is part of the identity, an unchanging fingerprint spread across many networks links those sessions together more tightly than a single sticky IP ever would have, so the rotation you added for cover becomes the thing that convicts you. The rule that resolves both failures is the same: one fingerprint travels with one IP for the life of a session, and when you rotate, you rotate both together, the way a real device and its connection move as a unit.
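One way to catch the mirror-image mistake is to audit your own request log the way a detector would. A minimal sketch, assuming you record a fingerprint hash and exit IP per request; `linked_fingerprints` is a hypothetical helper, and the IPs below are documentation-range placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def linked_fingerprints(events: Iterable[Tuple[str, str]],
                        max_ips: int = 1) -> Dict[str, List[str]]:
    """Return fingerprints that appeared on more than `max_ips` distinct IPs.

    `events` is an iterable of (fingerprint_hash, exit_ip) pairs.
    """
    ips_by_fingerprint = defaultdict(set)
    for fingerprint, ip in events:
        ips_by_fingerprint[fingerprint].add(ip)
    return {
        fingerprint: sorted(ips)
        for fingerprint, ips in ips_by_fingerprint.items()
        if len(ips) > max_ips
    }


# One fixed fingerprint behind three IPs: the pattern that links sessions.
fixed = [("fpA", "203.0.113.1"), ("fpA", "203.0.113.2"), ("fpA", "203.0.113.3")]
# Fingerprint and IP rotated together: nothing to link.
paired = [("fpA", "203.0.113.1"), ("fpB", "203.0.113.2"), ("fpA", "203.0.113.1")]
```

An empty result for your own traffic means each fingerprint travelled with exactly one IP; any entry is a set of sessions a detector could tie together.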

The scaling tension worth naming: real hardware wins precisely because it is not infrastructure, and scaling drags you back toward infrastructure. A real phone on a real mobile connection passes the hardest checks not by spoofing anything but by being genuine at every layer at once, the network, the TLS stack, the sensors, the timing all agree because one real device produced them. That is the cleanest version of the coherence principle this guide keeps returning to. The catch is economic, not technical: one real device does not scale, and the moment you add machines to go faster you are back on servers with datacenter IPs whose Layer-4 (TCP/IP) fingerprint gives you away before a single byte of HTTP is read. Every fix from there reopens the same wound, so the honest way to hold this is as a permanent trade rather than a solved problem: the closer a setup sits to a real device on a real network, the higher its trust and the lower its throughput, and scaling is the deliberate act of spending some of that trust for volume. Naming the trade is what stops you from expecting a single configuration to be both maximally stealthy and maximally fast.

Know where your IPs come from, and remember you are a guest

The detail that separates providers is not the price per gigabyte, it is sourcing: how those residential and ISP addresses end up in the pool. Residential IPs are typically gathered through proxyware, SDKs paid to sit on idle consumer devices, ideally with informed consent (the ethical end of the market) and not so ideally bundled silently into free apps (the part to avoid). ISP proxies are a different animal: addresses registered to consumer ISPs but hosted in datacentres, which is why they combine residential reputation with datacentre stability, and why the way a provider acquires that ISP space is worth understanding before you trust it. Knowing the sourcing model is not a compliance nicety, it directly predicts pool health: consent-based, well-managed pools get burned less and last longer than pools stitched together through opaque intermediary chains.

Two practical stances follow. Pick providers whose sourcing you can actually explain, the reputable ones gatekeep their networks and let only legitimate users on, which is what keeps the IPs clean for you. And carry the right posture onto both networks you touch: you are a guest on the proxy infrastructure and a guest, often an unwanted one, on the target. Behave politely and leave as few footprints as possible. That is not only an ethics point, it is the same discipline that keeps a pool and a target usable for longer.

The risk that runs the other way: your IP is liable for the whole pool

Sourcing is not only about pool health, it is about what your exit IP is on the hook for. When you route through a residential pool, your address is shared infrastructure: at the same moment your scraper reads a public page, other tenants of that pool are sending traffic that exits through addresses just like yours. An investigation that joined one such network as a node and recorded over ten million transiting requests found the mix went well past data collection, into ad-fraud and affiliate-attribution abuse, mass account registration, ticketing and scalping automation, and email spam. The provider's marketing said web data collection and market research; the wire said otherwise.

The uncomfortable part for a buyer is that strong vetting does not remove the risk, it only narrows it. A provider can run KYC on every customer, throttle the SDK, and inspect traffic on the network, and still be one bad edge case away from your IP carrying something you would never send. A trusted customer's credentials leak. A sensitive destination (a bank, a government service) is not classified correctly and so is not blocked. The provider never anticipated what a local LAN address looks like from inside a residential node. You are trusting that their vetting and security hold, on an address that resolves to a real person's home, which is also why the supply side is uncomfortable. One 2026 scan of 6,038 LG and Samsung smart-TV apps found proxy SDKs in 2,058 of them, 42.5 percent of LG webOS apps and 26.9 percent on Samsung Tizen, fish-tank screensavers and solitaire clones quietly turning the television into an always-on exit node, often with consent reduced to a single setup prompt navigated by remote. Cheap streaming boxes have shipped with dormant proxy software preinstalled. The risk is not only that your traffic shares an address with strangers: because a smart TV sits on the home LAN beside routers, NAS drives, and cameras, a failure in the provider's private-range filtering can turn that exit node into a foothold inside the network. The US FBI issued a 2026 advisory on exactly this, warning that when criminal traffic exits through a residential IP, the innocent homeowner is the one whose address is on record. Those same residential IPs feed the pools sold as ethically sourced. Two practical takeaways: prefer providers whose sourcing and abuse controls you can actually inspect (a verifiable partner list and audit trail, not a landing-page adjective), and treat the residential layer as something to use deliberately and sparingly, not as a default you leave running, because every hour you are on it your address is vouching for strangers.

You now have the full picture: detection layers, six anti-bots, sixty libraries, managed platforms, proxy types. This section collapses all of it into a single decision tree you can follow for any target site.

09 Decision playbook

Walk this in order.
Stop at first win.

Each step adds complexity, cost, and maintenance. Most production scraping is solved at steps 1–3. Never start at step 5.

Lowest friction · Asad's priority #1

Find the mobile API

Mobile apps hit the same backend with far weaker bot protection. HTTPToolkit intercepts all HTTPS traffic from an Android emulator. Frida hooks into SSL_read/SSL_write directly. If you find the API endpoint, every HTML anti-bot becomes irrelevant.

HTTPToolkit

Frida

mitmproxy

Burp Suite

XHR reverse engineering

Find the GraphQL or REST endpoint

Chrome DevTools → Network → Fetch/XHR. Many SPAs load from one undocumented JSON endpoint. Confirmed in production: a direct GraphQL endpoint bypassed all Akamai HTML protection.

Chrome DevTools

Burp Suite

webclaw CLI

JSON in HTML · No requests needed

Look for embedded state

Next.js embeds full state in __NEXT_DATA__. React SPAs often ship a >50KB script containing all the data. Confirmed: on Grainger.com (DataDome-protected), a 110KB JS state blob bypasses DataDome entirely because it's in the initial HTML.

chompjs

Parsel

BeautifulSoup4
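Pulling the embedded state out of a page you have already fetched takes very little code. The sketch below uses only the standard library so it runs anywhere; `extract_next_data` is a hypothetical helper name, and it assumes the usual Next.js layout where the state sits in a `<script id="__NEXT_DATA__">` tag as strict JSON. For state embedded as a JavaScript object literal rather than strict JSON, a tolerant parser such as chompjs is the better fit.

```python
import json
from html.parser import HTMLParser
from typing import Any, List, Optional


class _NextDataParser(HTMLParser):
    """Collects the text inside <script id="__NEXT_DATA__">."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_next_data(html: str) -> Optional[Any]:
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is absent."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    raw = "".join(parser.chunks).strip()
    return json.loads(raw) if raw else None
```

In a typical Next.js page the page-level data sits under `props.pageProps` in the returned dict, so the fields you need are usually a couple of key lookups away, with no selectors involved.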

HTTP scraping · No browser

curl_cffi + Scrapy

Identify the anti-bot with Wappalyzer. curl_cffi with JA4 impersonation gets past most Akamai and DataDome deployments at the HTTP layer. Add a residential proxy. If __NEXT_DATA__ appears in the response, extract it with chompjs.

curl_cffi

Scrapy

Scrapling

Browser automation · C++ level only

Camoufox or CloakBrowser

JS injection patches leave signatures. Camoufox: 100% pass rate Mar 2026. CloakBrowser: 49 C++ patches, reCAPTCHA v3 score 0.9, Akamai extension probes pass. PatchRight for Kasada specifically, no JS signatures.

Camoufox ★

CloakBrowser

PatchRight

Last resort · F5 Shape only viable path

Managed platform API

F5 Shape's custom VM makes DIY impractical. Token expiry in minutes, payload changes every rotation. At scale: engineer maintenance cost > platform cost. One API flag handles everything. Cost-justify: >2 days/month maintenance → managed API wins.

Bright Data

Zyte

Firecrawl
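The cost-justification rule in this step is plain arithmetic, and writing it down makes the threshold explicit. A minimal sketch with hypothetical function names; the day rate and platform price in the usage note are made-up figures, chosen so that the break-even lands at two maintenance days per month.

```python
def monthly_diy_cost(maintenance_days: float, engineer_day_rate: float) -> float:
    """Engineer time spent keeping a DIY bypass alive, per month."""
    return maintenance_days * engineer_day_rate


def managed_api_wins(maintenance_days: float, engineer_day_rate: float,
                     platform_monthly_cost: float) -> bool:
    """True when DIY maintenance costs more than the managed platform."""
    return monthly_diy_cost(maintenance_days, engineer_day_rate) > platform_monthly_cost
```

With an illustrative day rate of 600 and a platform bill of 1200 per month, three maintenance days cost 1800 and the managed API wins, while exactly two days cost 1200 and it does not.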

Quick reference cheat sheet

Anti-bot	Primary vector	Steps 1–2 viable?	Best tool	Key note
Akamai	JA4+ + sensor.js + extension probes	Often	curl_cffi + CloakBrowser	Find mobile/GraphQL first
Cloudflare	JA4 Rust edge + Turnstile	Sometimes	Camoufox	Origin IP via SecurityTrails
DataDome	85K ML + WASM boring_challenge	Yes	curl_cffi + mobile IP	Check __NEXT_DATA__ first
PerimeterX	5-vector score	Sometimes	Camoufox + residential	Fresh session per domain
Kasada	Polymorphic JS PoW	Rarely	PatchRight + residential	Never playwright-stealth
F5 Shape	Custom VM + minute expiry	No	Managed API	DIY not practical

10 From the community

What practitioners are
actually shipping in 2026

Fresh insights from engineers actively solving these problems in production. Shared publicly on LinkedIn.


Pattern / Multi-agent architecture

Scraping as a Bee Colony: Multi-Agent Orchestration via Claude Skills

Zyte published a sharp piece reframing production scraping as multi-agent orchestration, not single-shot spiders. The metaphor: honeybees solving distributed coordination for 100 million years. The architecture: many small specialist agents (each one a Claude Skill) writing to a shared append-only blackboard, coordinated by a weak orchestrator that routes attention, not commands.

💡 One Claude Skill = one bee with one specialist role

The architecture organises around three connected loops, not a linear pipeline. (1) Discovery loop: intake skill turns a fuzzy "scrape this site" request into a structured spec (fields, page types, refresh, cost, fallback); scout skills (detail-page-finder, page-downloader, site-explorer, link-classifier) reduce uncertainty in parallel before any code is written. (2) Build loop: selector-analyzer proposes candidates with confidence scores; code-synthesizer generates extraction; test-runner is the heartbeat; repair-agent fixes the broken field surgically without rewriting the project. Selector confidence becomes first-class metadata (price=medium, availability=fragile-with-fallback) so monitoring knows where to look. (3) Ops loop: drift-detector watches item count, field fill-rate, ban rate, response codes, schema mismatches; the right repair gets routed to the right loop, not a panic rewrite. The non-obvious piece is the blackboard: a structured project directory (.scrape/site/) with append-only authorship, timestamps, source, and reason on every observation. Three months later when the price field goes null, you read backwards through the trail to see which selector was used, which samples supported it, when it last passed, what changed on the site. The orchestrator is deliberately weak: it senses state, routes work, protects approval gates, and asks for human input on the schema/material-tradeoff decisions only, not every CSS path. Evidence-weighted consensus beats popularity-weighted ("another agent agreed" is the weakest form of review). Full Zyte writeup by Neha Setia Nagpal
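The blackboard idea is simple to prototype. The following is a minimal sketch and not Zyte's implementation: the `Blackboard` class, the `observations.jsonl` file name, and the record fields are assumptions, chosen to carry the authorship, timestamp, source, and reason the writeup describes.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, List, Optional


class Blackboard:
    """Append-only observation log stored as JSON Lines in a project directory."""

    def __init__(self, root: Path) -> None:
        self.path = Path(root) / "observations.jsonl"  # hypothetical file name
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def append(self, author: str, field: str, value: Any, source: str,
               reason: str, timestamp: Optional[float] = None) -> Dict[str, Any]:
        entry = {
            "timestamp": time.time() if timestamp is None else timestamp,
            "author": author,  # which skill wrote the observation
            "field": field,
            "value": value,
            "source": source,  # sample or page that supports it
            "reason": reason,
        }
        # Mode "a" only ever adds lines; nothing is rewritten or deleted.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(entry) + "\n")
        return entry

    def history(self, field: str) -> List[Dict[str, Any]]:
        """Observations for one field, newest first, for reading the trail backwards."""
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as fh:
            entries = [json.loads(line) for line in fh if line.strip()]
        return [e for e in reversed(entries) if e["field"] == field]
```

When the price field goes null months later, `history("price")` returns the selectors that were tried, who proposed them, and why, in reverse order of writing.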

2026 landscape / AI browser stack

The AI Browser Stack: 3 New Layers Nobody Talks About Yet

The AI browser race has the headlines (Atlas, Comet, Dia, new contender every month). But the winner gets decided by what sits underneath. A fresh map of the stack the AI browsers actually run on shows three categories that did not exist a year ago: AI-native browsers, anti-detect platforms, and publisher licensing/tolls.

💡 Publisher licensing is the economic answer to "AI crawlers are scrapers too"

Three additions worth tracking, beyond the agentic browser map already in this guide. (1) AI-native browsers as a top layer: ChatGPT Atlas, Perplexity Comet, The Browser Company's Dia. Not wrappers, full browsers built around an LLM as the primary user interface. They drive the lower layers as agents on behalf of real users, which is part of why Microsoft Edge and V8 quietly stopped enforcing automation-transparency flags (see the "Vendors That Wrote the Detection Rules" card). (2) Anti-detect & fingerprinting platforms as a distinct category: Multilogin, AdsPower, Kameleo. These are not scraping libraries, they are productised browser-profile managers (each profile = a coherent fingerprint with its own canvas/WebGL/timezone/IP), originally for multi-account ecommerce and affiliate work but increasingly used by serious scrapers. Worth knowing about because the techniques inside them are the same fingerprint-coherence rules this guide describes, just packaged with a UI. Some are repositioning around scraping outright: Kameleo 5.0 (2026) dropped the signup wall so you can run a few lines against its local API and test its two in-house stealth browsers on your target before creating an account, an acknowledgement that the centre of gravity for these tools has moved from multi-account management to undetected automation for scraping teams. (3) Publisher licensing & tolls, the genuinely new layer: TollBit, Cloudflare Pay-Per-Crawl, ProRata.ai. The premise: instead of fighting AI crawlers with detection, charge them per access. Publishers expose a paid API for LLMs and agents, with rate-limited free tiers and metered paid ones. This is the economic acknowledgment that AI crawlers are scrapers and that some categories of access are worth paying for. If you build agentic browser systems, this is the layer that may eventually replace bot-detection-as-defense for content-heavy sites. 
The detection layer is designed to tell bots from people, but the AI browsers on top and the residential networks on the bottom both look the part, so this line is blurring fast. Credit: landscape framing by Massive (Q4 2026 update).

2026 reality / Stack composition

Defense in Depth Is Table Stakes Now (26-Site Scan)

A scan of 26 e-commerce and ticketing sites found the same pattern everywhere: the defense is no longer a vendor, it is a stack. Home Depot's sign-in loads Akamai Bot Manager, PerimeterX/HUMAN, Forter, AND reCAPTCHA Enterprise simultaneously. Beating one is tractable. Beating the composition is not.

💡 If you are still planning around a single vendor, you are two layers behind

Five patterns kept appearing across all 26 sites. (1) Stack not vendor: multi-vendor compositions are now the norm, not the exception. (2) Brand names lie: Ticketmaster calls their system "EPS", but open the bundle and you find iamNotaRobot.js, abuse-component.js, aps.js — it is PerimeterX rebranded and served from their own domain. Trust the code, not the label. (3) First-party cloaking: Home Depot serves PX-shaped scripts under random first-party filenames. You cannot identify defenders by checking hostnames anymore, you have to watch how the script behaves at runtime. (4) Lazy-loaded defenses: Ticketmaster ships a reCAPTCHA site key in the homepage JSON but the SDK only loads on login or checkout. Probing only the homepage misses everything. Multi-hop traversal (homepage → login → cart) is the minimum recon bar now. (5) The fingerprint dictates the budget: PerimeterX + behavioral biometrics + reCAPTCHA Enterprise on one page tells you exactly what tier of browser, what kind of proxy, and how slow your automation has to be. Recon is upstream of every other decision. The takeaway: stop asking "which vendor does this site use" and start asking "which stack does this site compose, and where on each layer do I look for cracks."
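Multi-hop recon can start as a static first pass over pages you have already saved. A minimal sketch under loud assumptions: `recon` and the `MARKERS` table are hypothetical, the PerimeterX filenames are simply the ones named in the scan above, and a static match like this will miss scripts served under randomised first-party names, which still need runtime observation.

```python
import re
from typing import Dict, Iterable, Tuple

# Illustrative markers only, not a complete or authoritative signature set.
MARKERS = {
    "perimeterx": re.compile(r"\b(?:iamNotaRobot|abuse-component|aps)\.js"),
    "recaptcha_enterprise": re.compile(r"recaptcha/enterprise\.js"),
}


def recon(hops: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Map each detected defense to the first hop where its marker appears.

    `hops` is an ordered iterable of (hop_name, html) pairs,
    for example homepage, then login, then cart.
    """
    first_seen: Dict[str, str] = {}
    for hop_name, html in hops:
        for vendor, pattern in MARKERS.items():
            if vendor not in first_seen and pattern.search(html):
                first_seen[vendor] = hop_name
    return first_seen
```

Running it on the homepage alone versus the full homepage, login, cart path shows how much a single-page probe misses when defenses are lazy-loaded.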

2026 shift / Browser vendors

The Vendors That Wrote the Detection Rules Are Quietly Breaking Them

Microsoft Edge now returns navigator.webdriver = false when an AI agent drives Playwright. Google patched out CDP detection in V8. Neither change was announced. The signals every anti-bot tool relied on to flag automation just became officially unreliable, because AI agents browsing on behalf of real users broke the human-vs-automation binary.

💡 Detection has to move up the stack to behaviour, intent, and network identity

For years the W3C spec required automated browsers to flag themselves via navigator.webdriver = true. That was the easiest detection signal in the industry, and most public anti-bot stacks were built around it (along with CDP-detection tricks for Chrome's DevTools Protocol). Two undocumented changes in 2026 have just made those signals soft: (1) Microsoft Edge returns navigator.webdriver = false when Playwright is driven by an AI agent on behalf of a user; (2) Google patched out the most common CDP-detection technique in V8. No release notes, no announcements. The reason is rational from the browser vendors' side: agentic browsing is now a legitimate use case (Anthropic Computer Use, OpenAI Operator, Browser Use, etc.) and the old binary doesn't apply. The implication for the guide and for production scrapers: any detection or bypass strategy that pivots on these flags needs to assume they are no longer reliable as either signal or counter-signal. Detection has to move up the stack, to behavioural ML, intent patterns, and network-identity layers (TLS JA4, IP reputation, WebRTC, DNS coherence), all of which are far harder to remove from the inside. DataDome's threat-research team published a longer breakdown if you want the technical specifics.

Detection vector / Hardware side-channel

"Frost": Tracking Users via SSD Timing, Zero Permissions Required

A new side-channel technique called Frost measures microscopic latencies in the SSD subsystem to fingerprint users and infer activity in other tabs. It requires no permissions, no APIs that prompt the user, and ad-blockers + incognito cannot mitigate it. Standard JavaScript on any page can read these timing variations.

💡 The fingerprinting frontier is now below the JS sandbox layer

Every active tab and every site generates a unique load pattern on the local disk subsystem. By timing common disk-touching operations through standard browser APIs, a script can probabilistically infer what other tabs the user has open and what they are doing in them, without ever requesting filesystem access. Not seen in the wild yet, but the technique class is what matters here: it joins WASM SIMD CPU probes and hyphenation-dictionary checks in a growing family of hardware-level fingerprinting that the browser sandbox does not stop. There is no direct fix until browser vendors add fuzzing or rounding to disk-operation timing, which has a real performance cost they have to weigh. For scraping: any stealth strategy that depends on JS-level patches stops working against this class of probe by design, because the signal is below the JS layer. The medium-term answer is the same one that defeats WASM SIMD probes: real hardware via real devices, not patched browsers in datacenters.

Detection vector / Timing attack

Incognito Detection by Timing a Single Byte Write

detectIncognito v1.7 (Nov 2026) integrates a side-channel that detects Chromium incognito mode by writing one byte to navigator.storage and timing the flush. Incognito routes storage to RAM, normal mode hits disk. RAM is faster. That is the entire vulnerability.

💡 <0.1ms flush time = RAM = incognito, no API prompts needed

The technique writes one byte three times to navigator.storage and measures the flush time. Under 0.1ms indicates RAM (incognito), above indicates disk (normal mode). No permission prompts, no API quirks the user can disable, runs in standard JavaScript on any page. Known caveats: RAM disks (used by some privacy-conscious users) trigger false positives unless the threshold is tuned (~0.01ms separates RAM-disk from incognito on test hardware); slow HDDs do not produce false positives because the technique detects suspiciously fast writes, not slow ones. For anti-bot: detecting incognito is a useful behavioural signal (legitimate buyers rarely shop in incognito; scrapers and abuse traffic over-index on it). For scraping: if your stealth stack uses incognito or per-session ephemeral storage to keep contexts clean, you are leaking a signal that is now trivially detectable. The fix is the same as for the broader timing-attack class: use persistent profile directories that hit real disk, accept the cookie/storage management overhead.

Build pattern / Self-hosted infra

Browser-as-a-Service: Separate the Control Library from the Binary

John Watson Rooney built a self-hosted stealth-Chrome scraping service and documented every gotcha. The key mental model: Playwright, Puppeteer, and Selenium are control libraries, not browsers. They speak CDP to whatever binary you point them at. Separate the two: run a persistent patched browser on a dedicated machine, connect to it over WebSocket from many scripts.

💡 chromium.connect("ws://...") = one browser service, many clients

The architecture: instead of playwright.launch() spinning up a local browser per script, run a persistent Playwright server on a dedicated box exposing a WebSocket endpoint, and have every scraping script connect("ws://host:3000") as a client. The page can't tell the difference, the API is identical. Five hard-won lessons from the build: (1) Binary choice matters as much as library choice. JS overrides of navigator.webdriver are themselves detectable (wrong property descriptor, wrong prototype chain); source-patched binaries like CloakBrowser remove the signal instead of masking it. (2) Headed via Xvfb beats headless, the virtual framebuffer means nothing looks headless because it isn't. (3) The two-slot trap: Playwright keeps TWO Chromium directories, a full build and a stripped chrome-headless-shell. Replace only the full slot with your patched binary and Playwright silently launches the untouched headless shell instead, you get 403s and the wrong version string with nothing in the logs explaining why. You must replace both slots and rename the headless one to chrome-headless-shell. (4) supervisord inside Docker manages the multi-process reality (Xvfb priority 10, Playwright server priority 20 with startsecs delay). (5) Concurrency = contexts, not instances. One browser, a pool of isolated contexts (separate cookies/storage), workers pull from an async queue, a 403 requeues with backoff and the worker grabs the next job. Proxy creds go per-context so a bad IP just retries with a fresh one. 16 concurrent contexts ran fine on a Ryzen 4650G mini-desktop. Full writeup · github.com/jhnwr/browser-service · YouTube

Learning / Mobile API reversing

The Mobile-API Reversing Toolchain: Frida + JADX + Ghidra

This guide's #1 rule is "find the mobile API first": apps often hit the same backend with zero anti-bot. But intercepting a modern app means defeating its own protections (cert pinning, signed headers). The toolchain: Frida to hook and modify Java and native C functions at runtime, JADX to decompile the APK back to Java, and Ghidra to decompile the native C, all on a rooted Android emulator.

💡 MobileHackingLab ships a free Frida course with certificate

This is the practical skill that makes "scrape the mobile API" actually work in 2026, when apps defend their endpoints. The workflow: (1) JADX decompiles the APK to readable Java so you can find where the signed header or auth token is generated. (2) If that logic is in a native .so library (common for the sensitive bits), Ghidra decompiles the C/C++ to understand the algorithm. (3) Frida hooks those functions at runtime via injected JavaScript, so you can log the inputs/outputs, bypass certificate pinning, or call the signing function directly to mint valid headers, no need to fully reverse the algorithm if you can just invoke it. (4) Run all of this on a rooted Android emulator from Android Studio for a controlled, disposable lab. Pair with HTTPToolkit or mitmproxy to capture the now-decrypted traffic and recover the API contract. MobileHackingLab offers a free Android Frida course with a certificate and CTF-style challenges, the fastest way in if the toolchain feels intimidating. The payoff: once you can mint the app's signed headers, you call its clean JSON API directly and skip every browser-layer anti-bot entirely. MobileHackingLab free Frida course

Tool / Proxy ops

Stop Round-Robining Dead Proxies: Bayesian Selection

Round-robin proxy rotation has a dumb flaw: it keeps sending requests through proxies that are already dead or banned. ProxyOps uses Thompson Sampling to learn which proxies are working right now and route around the rest. Real data over 549,114 requests in 7 days: 76% success with Bayesian selection vs 36% with round-robin. Same proxies, same targets, more than double the success rate.

💡 Treat proxy selection as a multi-armed bandit, not a queue

The core insight: proxy health is non-stationary, a proxy that worked a minute ago may be banned now, and a dead one may recover. Round-robin ignores this entirely and gives every proxy equal traffic regardless of recent performance. Thompson Sampling (a Bayesian multi-armed-bandit method) models each proxy's success probability as a distribution, samples from those distributions to pick the next proxy, and updates beliefs after every request. Good proxies get more traffic, failing ones get probed occasionally to check for recovery but mostly avoided. The benchmark is striking: 76% vs 36% success on identical proxies and targets is not a marginal optimisation, it's the difference between a viable scrape and a failing one. ProxyOps ships this as a full open-source tool, multi-provider inventory, pluggable rotation strategies, per-bot proxy groups, comparison dashboards, on FastAPI + Vue + PostgreSQL, Dockerized (docker compose up). MIT licensed. github.com/Paulo-H/proxyops
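
To make the bandit framing concrete, here is a minimal sketch of Thompson Sampling over a proxy pool. It is not ProxyOps's implementation: the class name, the Beta(1, 1) prior, and the decay factor (one simple way to handle non-stationary proxy health) are all assumptions chosen for illustration.

```python
import random


class ThompsonProxySelector:
    """Pick proxies by sampling each one's estimated success probability."""

    def __init__(self, proxies, decay=0.99, seed=None):
        self._rng = random.Random(seed)
        self._decay = decay  # assumed forgetting factor; 1.0 disables forgetting
        # Per proxy: [decayed successes, decayed failures]
        self._stats = {proxy: [0.0, 0.0] for proxy in proxies}

    def pick(self):
        # Draw one sample from each proxy's Beta posterior and take the best draw.
        def draw(proxy):
            successes, failures = self._stats[proxy]
            return self._rng.betavariate(1.0 + successes, 1.0 + failures)

        return max(self._stats, key=draw)

    def update(self, proxy, success):
        # Shrink all evidence a little so old results fade and idle proxies
        # drift back toward the prior, which gets them probed again.
        for counts in self._stats.values():
            counts[0] *= self._decay
            counts[1] *= self._decay
        self._stats[proxy][0 if success else 1] += 1.0
```

Usage is a two-call loop: proxy = selector.pick(), send the request, then selector.update(proxy, succeeded). Because every pick is a random draw from the posterior, a failing proxy is still tried occasionally, which is how recovery gets noticed.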

Tool / Concurrency

Validating Millions of Public Proxies: Why Python Wasn't Enough

Public proxy lists are mostly dead, slow, or pre-blocked. Validating them at scale before a scrape becomes its own engineering problem. One builder hit the wall with Python threads checking proxies for Google scraping, then rebuilt the validator in Go. The lesson: Go is exceptional for I/O-heavy concurrent workloads where Python's threading model bottlenecks.

💡 Validate the pool before the scrape, in Go, not inline in Python

The bottleneck most people don't anticipate: if you rotate through huge lists of free/public proxies to avoid rate limits, you spend more time discovering which proxies are alive than actually scraping. Doing this inline with Python threads doesn't scale, the GIL and thread overhead cap you well below the concurrency a proxy-check workload needs (it's almost pure network I/O with tiny CPU cost, the ideal case for lightweight concurrency). Go's goroutines and channels handle tens of thousands of concurrent connection checks on modest hardware, with far better memory efficiency than Python threads or even asyncio for this specific pattern. The architecture takeaway generalises: separate proxy validation into its own high-throughput service (in Go or Rust), maintain a continuously-refreshed pool of known-good proxies, and have your scraper pull only from validated proxies rather than checking inline. Useful well beyond scraping, the same pattern applies to OSINT, network reconnaissance, and distributed systems. github.com/harshit-singh-ai/proxy-validator

Case study / CDN routing gap

Bypassing a Queue-It Virtual Waiting Room via a CDN Gap

FlySafair put its whole site behind a Queue-It virtual waiting room (a Cloudflare Worker checking for a queue token) during a flash sale. But two paths, /check-in and /manage, were served from a separate AWS CloudFront origin that was never routed through the Worker. And that origin served the entire SPA bundle.

💡 If any path serves the SPA bundle, it must enforce the same queue

The bypass chain is a masterclass in CDN-gap exploitation. (1) Recon: probe the public sitemap, log response headers per path. Most paths returned Cloudflare headers and redirected to the queue. Two paths (/check-in, /manage) returned AWS CloudFront headers, a completely separate origin. (2) That CloudFront origin didn't serve a lightweight stub, it served the full single-page-app bundle, the same JavaScript that powers the booking flow. (3) The bundle still runs Queue-It's client-side connector on load, which calls assets.queue-it.net and redirects if no token. But that check fires AFTER the browser has the full bundle. Block assets.queue-it.net in DevTools and the check never fires. (4) Because it's an SPA, once the bundle is running every navigation is client-side. Two console lines, history.pushState({}, '', '/') then dispatch a popstate event, render the home page with zero new server requests. Queue entirely bypassed. The root cause isn't a Queue-It bug, it's a CDN routing gap: most of the site went through Cloudflare with the Worker active, but two paths on a separate CloudFront origin served the same SPA without enforcement. Full writeup on GitHub
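
The recon step (1) can be sketched as a small header classifier. This is an illustration rather than the tooling from the writeup: the function names are hypothetical, and it relies on the usual edge markers (cf-ray and server: cloudflare for Cloudflare, x-amz-cf-id and a CloudFront via entry for AWS CloudFront). It takes already-captured response headers, so how you fetch them is up to you.

```python
def classify_edge(headers):
    """Guess which CDN answered, based on response headers."""
    lowered = {key.lower(): str(value).lower() for key, value in headers.items()}
    if "cf-ray" in lowered or lowered.get("server") == "cloudflare":
        return "cloudflare"
    if "x-amz-cf-id" in lowered or "cloudfront" in lowered.get("via", ""):
        return "cloudfront"
    return "unknown"


def find_routing_gaps(observations, expected="cloudflare"):
    """Return the paths whose responses did not come from the expected edge.

    observations maps a path to the response headers captured for it.
    """
    return sorted(
        path
        for path, headers in observations.items()
        if classify_edge(headers) != expected
    )
```

Any path this returns is worth a second look: it is being served from somewhere the queue enforcement may not reach.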

Benchmarks / 2026 data

Distributed Browsers: The "Residential Proxy Moment" for Stealth

The Web Scraping Insider benchmarked stealth browser APIs (April 2026): Scrapeless 90.95, Bright Data 89.05, Oxylabs 85.71, then a cliff. The bigger idea from Driver.dev: as anti-bots fingerprint GPU and hardware entropy, cloud stealth browsers may become the new datacenter proxies, detectable by default, pushing the field toward real-device browser networks.

💡 Cloud browsers → real-device browser networks, mirroring DC → residential

Three data points worth internalising from the April 2026 benchmarks. (1) Stealth browser APIs are measurably different: Scrapeless (90.95), Bright Data Scraping Browser (89.05), Oxylabs Headless (85.71) lead, then most providers fall off a cliff for the same reason, automation signals leak, and once that happens proxy rotation doesn't save you. (2) Cloudflare bypass has no single winner: across 8 approaches tested on 20 protected sites, only 3 had broad coverage, smart proxy APIs, TLS impersonation (curl_cffi), and browser APIs. Different domains use different protections, so what works on one target fails on another. (3) The structural prediction: anti-bots increasingly fingerprint the whole environment (browser, GPU, hardware entropy, OS quirks), which means a stealth browser running in a datacenter is detectable simply by being in a datacenter, regardless of how good its fingerprint patches are. The likely evolution mirrors proxies exactly: datacenter proxies gave way to residential proxies, and cloud stealth browsers may give way to distributed browser networks running across real consumer devices. The fingerprint isn't patched, it's genuinely real, because it's a real device.

Concept / Shift in thinking

The Web Is Becoming a Real-Time Database for AI Crawlers

Classic scraping: crawl on a schedule, store in a database, query the database. With cheap residential proxies, a new pattern emerged: don't store anything. The target website is your database. Scrape on demand, treat every page load as a query. For AI crawlers especially, why save data when you can fetch it live whenever you want?

💡 Resproxies made scraping cheap enough to skip storage entirely

This is a genuine shift in how scraping infrastructure gets designed. The old model assumed scraping was expensive and risky, so you scraped once, stored aggressively, and served queries from your own database. That meant stale data, storage costs, and sync logic. Residential proxies changed the economics: scraping became cheap and reliable enough that for many use cases you can just hit the live site every time a user (or an AI agent) asks. The website hosts the data, you treat it as a remote database, and the latency tradeoff is acceptable because the customer is willing to wait a few seconds for fresh answers instead of getting instant but stale results. For AI crawlers and agentic workflows this is even more pronounced, there's no point caching a price or a flight time when the agent can re-fetch it at query time. The implication for anti-bot teams: scraping volume is no longer bounded by "how often do they refresh their DB", it's bounded by "how often does a user ask", which can be far higher and far spikier.

Landscape / Agentic browsers

"Browser Agent" Is Not One Product — It's 8 Layers

Massive mapped the agentic browser infrastructure landscape for Q2 2026. The insight: most teams building AI agents think about one or two layers. The reliable ones account for all eight. When an agent fails a task, the cause is usually not the framework everyone debates — it's one of the other seven layers nobody mapped.

💡 Every layer above the network still has to reach the live site

The eight layers of the 2026 agentic browser stack: (1) Cloud Browser Platforms — Browserbase, Kernel, Notte, Anchor Browser, Browserless, Hyperbrowser. (2) Agent Frameworks & SDKs — Browser Use, Stagehand, Skyvern, AgentQL, Dendrite, Nanobrowser. (3) Browser Automation — Playwright, Puppeteer, Selenium, Crawlee, BrowserMCP. (4) Computer Use Agents — Anthropic CUA, OpenAI Operator, Fellou, Twin, MultiOn. (5) Stealth & Anti-Detection — CloakBrowser, Steel, Lightpanda, Pydoll, Camoufox. (6) Data Extraction & Enrichment — Diffbot, ScrapingBee, ScraperAPI, Zyte, ScrapeGraphAI. (7) LLM-Optimized Crawling — Crawl4AI, Firecrawl, Jina AI, Apify, LLMScraper, Scrapy. (8) Network & Proxy Layer — Massive, Bright Data, Oxylabs, Smartproxy, NetNut, IPRoyal. The proxy layer sits at the bottom and everything above it depends on it: a perfect agent framework with a flagged datacenter IP still fails. Most agent debugging focuses on layer 2 (the framework) when the actual failure is layer 5 (detection) or layer 8 (the IP). Map all eight before you debug one.

Architecture / Resilience

Pub/Sub: Why Sequential Crawlers Don't Scale

A threat-intel crawler hitting 1000+ sources started as one service crawling sequentially. Two failures emerged: latency grew linearly with each new source, and one broken source took down the entire pipeline. The fix wasn't more compute — it was event-driven pub/sub architecture.

💡 If N sequential steps can each kill the rest, that's an architecture problem

The pattern: a publisher pushes one message per source to a broker (SQS, Kafka, RabbitMQ, Google Pub/Sub). Independent worker subscribers pull messages and crawl in parallel, completely isolated from each other. When one source breaks — a timeout, a layout change, a 500 — it fails alone in its own worker. The rest of the pipeline never notices. Three properties fall out of this for free: (1) ingestion latency stays flat regardless of source count, because workers run concurrently not sequentially; (2) failures are isolated by design, with a dead-letter queue capturing the broken messages for retry; (3) the system scales horizontally — add more subscriber workers, get more throughput, no code change. This is the difference between a script that crawls 50 sources and a platform that crawls 50,000. I added a full architecture diagram for this pattern in the Architecture section (tab: Pub/Sub Event-Driven). The takeaway generalises beyond scraping: any pipeline where one slow or broken step blocks all the others is one message broker away from being resilient.
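
A minimal in-process sketch of the same shape, using asyncio.Queue as a stand-in for the broker. In production the queue would be SQS, Kafka, RabbitMQ, or Google Pub/Sub; run_pipeline and the crawl callable are hypothetical names, and the dead-letter "queue" here is just a list.

```python
import asyncio


async def run_pipeline(sources, crawl, workers=4):
    """Publish one message per source and let isolated workers consume them."""
    broker = asyncio.Queue()
    results = {}
    dead_letters = []  # (source, error) pairs kept for later retry

    async def subscriber():
        while True:
            source = await broker.get()
            try:
                results[source] = await crawl(source)
            except Exception as exc:
                # A broken source fails alone and never stops the other workers.
                dead_letters.append((source, repr(exc)))
            finally:
                broker.task_done()

    tasks = [asyncio.create_task(subscriber()) for _ in range(workers)]
    for source in sources:  # the publisher side
        broker.put_nowait(source)

    await broker.join()  # wait until every message has been handled
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results, dead_letters
```

Scaling out is the workers argument here and "add more subscriber processes" with a real broker. The crawl code itself does not change.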

Detection / OS fingerprinting

The Detection Vector Nobody Patches: Hyphenation Dictionaries

Chromium on Windows and Linux requires a hyphenation dictionary to be bundled at build time. Most custom Chromium forks ship without one. The result: many stealth browsers literally cannot hyphenate words, a signal anti-bots can probe by setting hyphens: auto in CSS and measuring the rendered output.

💡 Camoufox, PatchRight, undetected forks — check yours with the PoC

When CSS hyphens: auto is set and text overflows a container, the browser inserts soft hyphens at language-specific break points (so "hyphenation" becomes "hy-phen-ation"). The dictionary that drives this is OS-level on Android and macOS, but Chromium on Windows and Linux must bundle it at build time. Most people forking Chromium don't know this — the build artifact is large and the feature is invisible until you specifically test it. Joe (joe12387) demonstrated this is a detection vector: anti-bot scripts can render a known word in a known-width container with hyphens: auto, screenshot via Canvas, and compare the hyphenation positions against expected values for the claimed OS. A custom Chromium fork that fails to hyphenate at all (or hyphenates wrong) reveals itself instantly. Mitigation: ensure your build includes the hyphenation dictionary for the languages you claim to support, or run real Chromium binaries (not forks) under XVFB instead. Live PoC · github.com/joe12387

Workflow / AI-assisted recon

Burp Suite MCP + Claude Code = Anti-Bot Recon in Minutes

PortSwigger shipped an MCP server for Burp Suite. Point it at Claude Code and the hours-long ritual of tracing which cookie unlocks which route, when the sensor payload fires, what gets re-validated on POST — collapses into a single prompt. The bar for what counts as anti-bot recon just moved.

💡 Build a burp-antibot-recon SKILL.md and replay it across targets

The classic anti-bot recon workflow: capture a session through Burp, scroll the HTTP history one request at a time, manually trace cookie flows (_abck, datadome, cf_clearance, reese84), identify sensor.js challenge endpoints, figure out which requests trigger re-validation. For a moderately complex target like nike.com, this takes hours per session. With Burp's MCP server pointed at Claude Code, you capture the same session and prompt: "trace the _abck cookie lifecycle from home page through add-to-cart, identify all sensor payload endpoints, and explain the validation flow." Claude reads Burp's full history directly and produces the analysis in seconds. The pattern scales: build a reusable burp-antibot-recon Skill once, replay it across Akamai/DataDome/Cloudflare targets. If you work in this space and haven't wired it up, this is the unlock. github.com/PortSwigger/mcp-server
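
For reference, the manual cookie-tracing step looks roughly like this when done by hand on an exported capture. It is a sketch that does not depend on Burp or its MCP server: the input shape (a list of URL and Set-Cookie pairs) and the function name are assumptions, and the vendor labels are the commonly associated ones (_abck with Akamai, reese84 with Imperva).

```python
# Cookie names that usually identify the anti-bot vendor behind a site.
VENDOR_BY_COOKIE = {
    "_abck": "Akamai Bot Manager",
    "datadome": "DataDome",
    "cf_clearance": "Cloudflare",
    "reese84": "Imperva",
}


def cookie_timeline(exchanges):
    """List (url, cookie, vendor) for each anti-bot cookie set, in capture order.

    exchanges is an iterable of (url, [Set-Cookie header values]) pairs.
    """
    timeline = []
    for url, set_cookie_headers in exchanges:
        for header in set_cookie_headers:
            # The cookie name is everything before the first "=".
            name = header.split("=", 1)[0].strip()
            vendor = VENDOR_BY_COOKIE.get(name)
            if vendor is not None:
                timeline.append((url, name, vendor))
    return timeline
```

A cookie name that shows up again later in the timeline marks a request that re-issued it, which is a good place to start looking for re-validation.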

Trending / Browser stealth

CloakBrowser Just Crossed +9.1K Stars in One Week

The fastest-growing GitHub repo this week is a scraping tool. CloakBrowser, the stealth Chromium with source-level fingerprint patches, hit +9.1K stars (Nov 2026), passing 30/30 anti-bot tests. Confirms what production scrapers already knew: real browsers with patched binaries beat stealth plugins.

💡 Drop-in Playwright replacement, no JS patches to detect

Stealth plugins (puppeteer-extra-plugin-stealth, playwright-stealth) overwrite JS properties at runtime, which is itself a detection signal: anti-bot scripts can see the override happen. CloakBrowser patches Chromium's C++ source directly so the fingerprint is real at the binary level. There's no override to detect. The recent velocity (+9.1K stars in one week) was driven by it passing every public anti-bot test suite, including bot.sannysoft.com, Antoine Vastel's headless-detection tests, and CreepJS. Its closest analogue is Camoufox (Firefox-based, same approach with Juggler), but CloakBrowser uses Chromium so it works with existing Playwright pipelines. The tradeoff: maintenance burden. When Chrome ships a new version, CloakBrowser has to patch and rebuild. The team has been keeping up so far. github.com/CloakHQ/CloakBrowser

FOSS Defense / Anti-AI

Anubis: The Anime-Mascot Firewall Protecting FOSS

Self-hosted Web AI Firewall (15k+ stars) built by Xe Iaso. Used by Codeberg, FFmpeg, the Linux kernel docs, Sourcehut. Issues a JS proof-of-work challenge with a furry-eared mascot. Codeberg admitted in mid-2025 that AI bots already learned to solve it.

💡 PoW slows scrapers but headless Chromium solves it naturally

Anubis is the new category of anti-scraper protection most guides miss: self-hosted, FOSS-targeted, AI-scraper-focused. Unlike Cloudflare and Akamai (enterprise SaaS), Anubis runs as a reverse proxy on the same server as the protected site. The challenge is pure client-side JavaScript proof-of-work — the browser hashes until it finds a nonce matching a difficulty target. For real users this is invisible (~1-3 seconds on modern hardware, painful on old phones). For scrapers using plain requests or curl_cffi, the challenge is unsolvable without JS execution. The bypass is mundane: any headless browser (Playwright, Camoufox, Patchright) with JS enabled will solve it automatically. Persist the auth cookie (techaro.lol-anubis-auth) and reuse it across requests. The political angle: Anubis exists because AI scrapers (OpenAI, Anthropic, Common Crawl, ByteDance) were DDoSing small FOSS projects by ignoring robots.txt. It's a community response, not a commercial product. github.com/TecharoHQ/anubis

TLS / Anti-bot

Cloudflare Turnstile Solved Without a Browser

Solvable with pure HTTP, no browser needed. Reverse-engineer the POST payload: 79 parameters covering Canvas, WebGL, Timing and crypto hashes. Status 200 in 0.27s.

💡 Turnstile PoW is solvable in under 1s via plain HTTP

The Turnstile POST payload contains 79 parameters. Key groups: Fingerprint (Canvas hash via OffscreenCanvas, WebGL renderer, AudioContext output), Browser Environment (navigator properties, screen dimensions, timezone), Interaction sequence (mouse path, click timing), and Crypto (custom SHA-256 + TEA encryption of the challenge nonce). The Sitekey is extracted automatically from the page source. Algorithms used: Custom SHA-256, TEA block cipher. Token format: Encrypted_Data-Timestamp-Version-Checksum. Full flow: extract Sitekey → initiate challenge → construct responses → generate 95-char token. Result: cf_clearance accepted in 0.27s. No browser process needed.

TLS / Proxies

Why "Just Get Better Proxies" Stopped Working

The problem is your TLS handshake, not your IP. Cipher suites, TLS extensions, and ALPN values are hashed into a JA4 fingerprint (GREASE values are ignored). A clean residential IP still fails if the fingerprint exposes you.

💡 The residential IP passed. The fingerprint gave it away.

TLS detection happens at the ClientHello level, before any HTTP is exchanged. The JA4 fingerprint hashes: cipher suite list (sorted), TLS extensions (sorted, GREASE removed), ALPN protocols. Python's requests library offers a different set of cipher suites and extensions than Chrome. httpx is different again. Even with a clean residential IP, if your ClientHello does not match Chrome's, you are identified before the server processes a single header. Fix: use curl_cffi with impersonate="chrome124"; it emits Chrome's exact TLS ClientHello. Also watch HTTP/2 SETTINGS frames, which contain window sizes and header table parameters that vary per client.
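
The "sorted, GREASE removed" part is what makes the fingerprint survive extension-order randomisation, and a simplified sketch shows why. This is deliberately not the full JA4 specification (it omits the protocol and SNI markers, the signature-algorithm list, and the special handling of the SNI and ALPN extensions), and ja4_like is a hypothetical name.

```python
import hashlib


def is_grease(value):
    """GREASE code points look like 0x0a0a, 0x1a1a, ... 0xfafa (RFC 8701)."""
    return (value & 0x0F0F) == 0x0A0A and (value >> 8) == (value & 0xFF)


def _hash12(values):
    # Sort before hashing so the order in the ClientHello does not matter.
    joined = ",".join(f"{value:04x}" for value in sorted(values))
    return hashlib.sha256(joined.encode()).hexdigest()[:12]


def ja4_like(ciphers, extensions, alpn):
    """Build a JA4-style string: counts and ALPN, cipher hash, extension hash."""
    ciphers = [c for c in ciphers if not is_grease(c)]
    extensions = [e for e in extensions if not is_grease(e)]
    prefix = f"{len(ciphers):02d}{len(extensions):02d}{alpn}"
    return f"{prefix}_{_hash12(ciphers)}_{_hash12(extensions)}"
```

Shuffling the input lists or sprinkling in GREASE values leaves the result unchanged. Adding or dropping a single real cipher or extension changes it, which is why a client with a different TLS stack cannot hide behind header spoofing.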

Network Identity

The WebRTC Trap: Your Browser Is Leaking Your Real Location

Proxy says US. WebRTC says elsewhere. It leaks your real IP via STUN and creates geo mismatches. Anti-bots check that IP + WebRTC + timezone + DNS + Accept-Language all agree.

💡 Quick test: browserleaks.com/webrtc, check before blaming your proxy

WebRTC uses the STUN protocol to discover network paths. During ICE candidate gathering, the browser contacts a STUN server and reports: your real public IP, your local LAN IP (e.g. 192.168.x.x), and all network interface addresses. Your proxy only routes HTTP/HTTPS traffic, WebRTC bypasses it entirely. Anti-bots cross-check: proxy exit IP vs WebRTC public IP vs DNS resolver location vs Accept-Language vs timezone. All five must agree. Fix with Camoufox: set geoip=True and it automatically aligns all five vectors. Do not simply disable WebRTC, it removes a feature that 99% of real users have, which itself becomes a bot signal.
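
The five-way cross-check is easy to run against your own sessions before a target does it for you. The sketch below is a rough self-audit under stated assumptions: the EXPECTED table is a tiny illustrative sample (a timezone prefix is a crude proxy for "plausible in this country"), and coherence_issues is a hypothetical helper that takes values you have already collected.

```python
# Illustrative expectations per proxy exit country; extend for real use.
EXPECTED = {
    "US": {"languages": {"en", "es"}, "tz_prefix": "America/"},
    "DE": {"languages": {"de", "en"}, "tz_prefix": "Europe/"},
}


def coherence_issues(proxy_country, webrtc_country, dns_country,
                     accept_language, timezone):
    """Return a list of mismatches between the five identity vectors."""
    issues = []
    if webrtc_country != proxy_country:
        issues.append(f"WebRTC IP is in {webrtc_country}, proxy exit is in {proxy_country}")
    if dns_country != proxy_country:
        issues.append(f"DNS resolver is in {dns_country}, proxy exit is in {proxy_country}")

    expected = EXPECTED.get(proxy_country)
    if expected is None:
        issues.append(f"no expectations defined for {proxy_country}")
        return issues

    # "en-US,en;q=0.9" -> "en"
    primary = accept_language.split(",")[0].split(";")[0].split("-")[0].strip().lower()
    if primary not in expected["languages"]:
        issues.append(f"Accept-Language '{primary}' is unusual for {proxy_country}")
    if not timezone.startswith(expected["tz_prefix"]):
        issues.append(f"timezone {timezone} is unusual for {proxy_country}")
    return issues
```

An empty list means the session tells one consistent story. Anything else is worth fixing before blaming the proxy.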

Benchmarks · 2026

The 30-Point Gap: Browser Scraping Success Rates Are Not Equal

71 protected sites tested: Browser Use Cloud hit 81%, Browserbase hit 42%. That gap is no longer marginal, it is the difference between a working pipeline and a broken one.

💡 A 30-point gap in success rate = the difference between a working pipeline and a broken one

The benchmark tested 71 sites protected by Cloudflare, Akamai, PerimeterX, and DataDome. Methodology: each provider was given identical target lists and measured on first-request success (no retries). Browser Use Cloud succeeded on 81%, achieved via custom Chrome patches at the C++ binary level plus coordinated fingerprint management. Browserbase succeeded on 42%, detected primarily via CDP timing signatures and canvas hash consistency. The gap exists because basic scraping (fetch URL, parse HTML) is commoditised. The data worth having in 2026 sits behind login walls, search interfaces, and multi-step authenticated flows requiring actual browser interaction. Cheap providers are adequate for unprotected targets; they fail silently on protected ones.

Architecture

Browsers as a Session Layer, Not a Scraping Product

HTTP is fast, browsers are expensive. Right architecture: browser for session warmup and hard challenges only, then lightweight HTTP workers for bulk collection.

💡 Browser for session warmup → HTTP for bulk collection

The architectural insight: most scraping pipelines use browsers for everything, which is expensive. But you only actually need a browser for two things: (1) session establishment: generating valid cookies and session tokens that a protected site will accept, and (2) hard challenge pages: Akamai sensor.js, Cloudflare Turnstile, DataDome WASM challenges. Once you have a valid session cookie, the rest of the data collection can happen via lightweight HTTP requests at 10-100× the speed and 1/100th the memory. Implementation: use camoufox or rayobrowse to generate sessions, then curl_cffi with the extracted cookies for bulk collection. Rotate sessions every 30-50 requests.
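
The rotation logic itself is small enough to sketch without committing to a browser or HTTP library. In the sketch below, warm_up and send are injected callables and purely hypothetical: in a real pipeline warm_up would drive the browser and return its cookies, and send would be the lightweight HTTP client. The default of 40 requests simply sits inside the 30-50 range above.

```python
class RotatingSession:
    """Reuse browser-minted cookies for cheap HTTP calls, renewing periodically."""

    def __init__(self, warm_up, send, max_requests=40):
        self._warm_up = warm_up  # () -> cookies; the expensive browser step
        self._send = send        # (url, cookies) -> response; the cheap HTTP step
        self._max_requests = max_requests
        self._cookies = None
        self._used = 0

    def get(self, url):
        # Pay for a browser only when there is no session or it is worn out.
        if self._cookies is None or self._used >= self._max_requests:
            self._cookies = self._warm_up()
            self._used = 0
        self._used += 1
        return self._send(url, self._cookies)
```

With the default budget, 100 requests cost three browser launches instead of 100, which is where the speed and memory savings come from.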

Python Framework

Scrapling v0.4: The Biggest Python Scraping Update Yet

New async spider: concurrent crawling, mix HTTP and stealth sessions, pause/resume from checkpoint, stream items live. Thread-safe ProxyRotator built in. Handles Turnstile natively.

💡 pip install scrapling --upgrade

The async spider framework uses a Scrapy-like API: define a Spider class, set start_urls, implement parse(). Key differentiators from Scrapy: mixed session types in one spider (HTTP fetchers, headless Camoufox, stealth browser); checkpoint/resume (Ctrl+C saves state, restart continues from the last position); per-domain throttling (set different rates per target). ProxyRotator: thread-safe, works across all fetcher types, supports custom rotation strategies, per-request override. Parser improvements: blocked_domains list to block tracking/CDN requests in headless mode, automatic proxy-aware retry on network errors, Response.follow() for easy link chaining. Install: pip install scrapling --upgrade.

Proxies

SwiftShadow: Free Proxy Rotation Without the Headaches

Grabs free proxies, validates, rotates automatically, filters by country. Built-in caching, auto-switches on failure. ~300 stars, actively maintained.

💡 pip install swiftshadow, from swiftshadow import QuickProxy

SwiftShadow maintains a pool of free proxies sourced from multiple public lists. On initialisation it validates all proxies (checks response time and anonymity level) and caches the working set. When a proxy fails mid-request, it automatically switches to the next validated proxy in the pool, no intervention needed. The QuickProxy(countries=["FR","DE"]) API filters by exit country. The built-in cache means it does not hit proxy list APIs on every request. Usage: from swiftshadow import QuickProxy; proxy = QuickProxy(); session.proxies = {"http": str(proxy), "https": str(proxy)}. Important: free proxies have high failure rates and low anonymity, do not use for Akamai, DataDome, or PerimeterX targets. Best for scraping open/unprotected sites at scale without cost.

RAG / LLM Pipelines

Keep Your LLM Context Fresh: Incremental Indexing

Scraped data goes stale fast. CocoIndex builds a continuously updated vector index: only changed rows re-run. Pgvector, LanceDB, Neo4j targets. #1 GitHub Trending on launch.

💡 github.com/cocoindex-io/cocoindex, incremental RAG for LLM agents

The core problem: you scrape a site, embed it into a vector store, and 48 hours later 30% of the content has changed. Traditional batch re-indexing re-processes everything. CocoIndex solves this with a Rust-based delta engine: it tracks byte-level lineage per document, and when you re-run it only processes changed chunks. Target vector stores: Pgvector (PostgreSQL), LanceDB (local), Neo4j (graph). Python with Rust core means the delta calculation is very fast even on large corpora. The LLM integration: your agent always queries a fresh index, so answers reflect current scraped data. Setup: pip install cocoindex, configure sources (files, URLs, S3), define your chunker and embedding model, then run cocoindex.build(). Done in under 10 minutes.

API-First Scraping

Skip the HTML. Hit the API. 50× Faster.

Open DevTools Network → Fetch/XHR before writing any code. Half the time the page calls a JSON API directly. 50× faster, 100× less memory, zero browsers launched.

💡 Rule: open DevTools Network tab before writing any code

The technique: open Chrome DevTools → Network tab → filter by Fetch/XHR. Reload the page. Look for requests returning JSON. Right-click → Copy → Copy as cURL. Run that cURL command. If you get the same data back, you have found the internal API. What to look for: GraphQL endpoints (POST to /graphql or /api/graphql), REST endpoints (GET to /api/v2/products, etc.), __NEXT_DATA__ (Next.js embeds full page state in a JSON script tag, no request needed, just parse the HTML). Benefits: bypasses most anti-bot because APIs typically have weaker protection than HTML endpoints, returns clean structured data instead of HTML you need to parse, no browser needed, runs at full HTTP speed. When this fails: auth cookies required, the API uses rotating tokens, or the site detects API scraping specifically.
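
The __NEXT_DATA__ case needs no extra request at all, and a standard-library sketch is enough to pull it out of HTML you already have. The helper names are hypothetical. The props.pageProps nesting in the usage note is the usual Next.js layout, but the keys under it are site-specific.

```python
import json
from html.parser import HTMLParser


class _NextDataParser(HTMLParser):
    """Collect the text inside <script id="__NEXT_DATA__">."""

    def __init__(self):
        super().__init__()
        self._capturing = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script":
            self._capturing = False


def extract_next_data(html):
    """Return the embedded page state as a dict, or None if the tag is absent."""
    parser = _NextDataParser()
    parser.feed(html)
    parser.close()
    if not parser.chunks:
        return None
    return json.loads("".join(parser.chunks))
```

Typical use is state = extract_next_data(page_html) followed by a lookup under state["props"]["pageProps"]. Check for None first, since not every page on a Next.js site embeds the tag.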

What to watch next: the new QUERY method (RFC 10008, June 2026). One reason so many JSON APIs are POST endpoints is that the read carries a filter object too large for a URL, so developers drop it in a POST body. GraphQL does this for every query. The cost is that a POST is opaque to caches and intermediaries: it cannot be cached or safely retried, and nothing in the method tells you it was only a read. HTTP now has a verb for exactly this, QUERY, which carries a body like POST but is declared safe and idempotent like GET, with a matching Accept-Query response header advertising which query formats a resource speaks. It is a Proposed Standard, not yet widespread (framework support is landing, e.g. an open Spring PR), but two things matter for a scraper. First, as targets adopt it, a QUERY endpoint is an even cleaner signal than a POST that you have found a read-only data API. Second, because QUERY responses are explicitly cacheable, a polite scraper that respects cache headers can cut fetches the way it never could against POST. Watch the Allow header for QUERY alongside GET and HEAD.
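Watching the Allow header is a one-liner once you have the response headers as a mapping. A small sketch, assuming you obtained the headers from an OPTIONS request or a 405 response with whatever HTTP client you use; the function names are hypothetical.

```python
def allowed_methods(allow_header):
    """Split an Allow header value such as 'GET, HEAD, QUERY' into a set of methods."""
    return {m.strip() for m in allow_header.split(",") if m.strip()}


def supports_query(headers):
    """True if the response headers advertise the QUERY method in Allow."""
    # Header names are case-insensitive, so normalise the keys first.
    normalised = {k.lower(): v for k, v in headers.items()}
    return "QUERY" in allowed_methods(normalised.get("allow", ""))
```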

When the internal API answers in an obfuscated shape: ProtoJSON. Finding the endpoint is sometimes the easy half. Large sites (Google properties are the textbook case) do not return clean labelled JSON, they serialise Protocol Buffers as deeply nested, positional JSON arrays with no keys, often prefixed with an anti-hijacking junk string such as )]}' that you must strip before parsing. The data is all there, but a field is addressed by its path through the array, not by a name, so a star rating might live at P[7][1][15] and an English translation at P[7][2][15][1][0]. Two things make this tractable. First, treat the leading junk prefix as a known quantity and slice it off before json.loads. Second, map the indices once and encode them as named constants, and this is a task an LLM is genuinely good at: hand it a sample response alongside the rendered page and ask it to align visible values to array paths, then freeze the mapping into a parser. The indices are effectively a private schema, so they can shift without notice, which means the index map belongs behind the same field-coverage monitoring and self-healing trigger as any brittle selector. It is the backend-API strategy carried to its conclusion: you traded HTML parsing for array-path parsing, which is faster and more stable, but it is still a contract the site can change.
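Both steps (strip the prefix, then address fields by frozen array paths) fit in a few lines. This is a sketch: the two paths are the illustrative ones from the paragraph above, not a verified schema for any real endpoint, and the helper names are hypothetical.

```python
import json

XSSI_PREFIX = ")]}'"  # anti-hijacking junk string to slice off before parsing

# Index map frozen as named constants (illustrative paths, not a real schema).
RATING_PATH = (7, 1, 15)
TRANSLATION_PATH = (7, 2, 15, 1, 0)


def parse_protojson(body):
    """Strip the junk prefix if present, then parse the positional JSON."""
    text = body.lstrip()
    if text.startswith(XSSI_PREFIX):
        text = text[len(XSSI_PREFIX):]
    return json.loads(text)


def get_path(payload, path):
    """Walk nested lists by index; return None when the shape has shifted."""
    node = payload
    for index in path:
        if not isinstance(node, list) or index >= len(node):
            return None
        node = node[index]
    return node
```

Returning None instead of raising lets the field-coverage monitor count missing fields and trigger the self-healing path when the private schema moves.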

Library Analysis

Scrapling Hit 200K Views, Invest in Your Network Layer

Detection vendors update, bypass libraries break, new ones ship. This cycle repeats every few months. What never depreciates: your proxy infrastructure. Invest there first.

💡 Invest more in your network layer, it depreciates slower than your library

The library lifecycle in scraping works like this: a new bypass technique is discovered, someone publishes a library implementing it, the library becomes popular, detection vendors add the library's fingerprints to their models, the library gets blocked, repeat. This cycle runs on a 2-4 month cadence for fast-moving targets. Your proxy setup operates on a different timeline: a well-configured residential proxy pool with good IP diversity and correct session management continues to work across multiple library generations. The specific libraries come and go but the network signals (IP reputation, ASN, session behaviour, timing patterns) remain consistent requirements. Conclusion: spend more engineering time on proxy quality, session management, and IP pool diversity than on tracking the latest bypass library.

Learning Path

Scraping Tutorials Teach the Wrong Things First

Most start with BeautifulSoup on static HTML. Real scraping is JS-rendered, sessions, rate limits, dynamic APIs. Better: DevTools → XHR replication → Scrapy → anti-bot.

💡 DevTools → XHR replication → Scrapy → anti-bot, in that order

The typical tutorial sequence: install Beautiful Soup, parse static HTML, extract data. This teaches the wrong mental model. Real production scraping involves: JavaScript rendering (most modern sites build their UI client-side, so the HTML you fetch is an empty shell); sessions and auth (cookies, CSRF tokens, login flows); rate limiting and backoff (exponential backoff, per-domain limits); dynamic selectors (sites change their HTML structure, so you need adaptive extraction). The right learning sequence: DevTools Network tab → understand how data flows between client and server → learn to replicate XHR requests with requests/curl_cffi → Scrapy for structure and scale → fingerprinting and anti-bot bypass last. Understanding what actually happens when a browser loads a page is more valuable than memorising BeautifulSoup APIs.
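Exponential backoff is the piece of that list tutorials most often skip. A minimal sketch of the common "full jitter" variant; the base and cap values are illustrative defaults, not recommendations from the source.

```python
import random


def backoff_delays(retries, base=1.0, cap=60.0, rng=random):
    """Full-jitter exponential backoff.

    Delay for attempt i is drawn uniformly from [0, min(cap, base * 2**i)],
    so retries spread out instead of hammering the target in lockstep.
    """
    return [rng.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(retries)]
```

Sleep for each delay in turn between retries of a failed request, and keep a separate budget per domain so one slow target does not stall the rest.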

IoT / Edge

A Microcontroller Scraping Live Weather Data via API

An ESP32 calling a scraping API, parsing JSON, displaying on TFT screen. The abstraction is now clean enough for devices with no Python. Scraping as real infrastructure.

💡 Scraping APIs are now clean enough for Arduino, #esp8266 #esp32

An ESP8266- or ESP32-class microcontroller running Arduino firmware makes an HTTPS request to a scraping API endpoint. The API (Zyte) handles: TLS negotiation with the target, JavaScript rendering if needed, anti-bot bypass, data extraction. The microcontroller receives clean JSON back and renders it on a TFT display. This demonstrates that scraping has become a proper infrastructure layer: just as you would call a weather API, you can now call a scraping API from any HTTP-capable device. The broader implication: scraping is no longer just a Python script on a server. It is a data access layer that any application can use. The complexity of browser fingerprinting, proxy rotation, and anti-bot evasion is fully abstracted behind a simple API call.

Debugging

Your Scraper Is Blocked Because of Behaviour, Not Code

Identical headers. Machine-speed intervals. No session state. Datacenter IPs. Fix: rotate headers, random.uniform(1.8, 4.3) delays, requests.Session(), residential proxies.

💡 sleep(random.uniform(1.8, 4.3)) beats sleep(2) every time

The signals that get you blocked, in order of detection speed: TLS fingerprint (detected before first HTTP byte), HTTP/2 SETTINGS frames (detected at connection), Request headers (User-Agent, Accept-Language, Sec-CH-UA, checked immediately), Request timing (identical intervals are machine-like), Session patterns (no cookies accumulated, no referrer chain), IP reputation (ASN, datacenter range). Fix each layer: curl_cffi for TLS, full Chrome headers via httpx or curl_cffi, random.uniform(1.8, 4.3) delays, requests.Session() for cookie accumulation, residential/mobile proxies for IP. Check your current fingerprint at tls.browserleaks.com/json.
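The timing fix is the easiest layer to sketch. The helper below draws the delay from the random.uniform(1.8, 4.3) range quoted above; the function name polite_pause and the injectable rng and sleep parameters are assumptions added so the behaviour can be tested without waiting.

```python
import random
import time


def polite_pause(low=1.8, high=4.3, rng=random, sleep=time.sleep):
    """Sleep for a random, human-ish interval and return the delay used."""
    delay = rng.uniform(low, high)
    sleep(delay)
    return delay
```

Call polite_pause() between requests made through a single requests.Session() so cookies accumulate across the same run; timing jitter alone does nothing for the TLS or IP layers.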

Mental Model

Understanding Beats Tools Every Time

Not curl_cffi, not Playwright, not a $300/mo plan. Understanding how detection works is the real advantage. Tools change. Detection evolves. Understanding transfers.

💡 "Tools change. Detection evolves. Understanding is what transfers."

The mental model shift: most scrapers think in terms of tools ("which library bypasses Cloudflare?"). Experienced scrapers think in terms of signals ("which signals is my scraper leaking that Cloudflare can detect?"). The difference: tool-thinkers update their library when it breaks. Signal-thinkers understand why the library broke and can fix it themselves or identify the correct replacement. Signals Cloudflare checks: JA4 TLS fingerprint, HTTP/2 SETTINGS frames, navigator properties (webdriver, plugins, languages), Canvas hash, WebGL renderer, timing patterns, IP reputation. If you know which signal you are leaking, you can fix it regardless of which library you are using. This understanding also transfers to new anti-bots, the signals are similar across vendors even though the implementations differ.

AI Visibility · 2026

Top 10 Scraping APIs per ChatGPT + Perplexity + Gemini + Google

All four AI models queried simultaneously. Consensus: 1. Bright Data · 2. Zyte · 3. ScrapingBee · 4. Firecrawl · 5. Scrape.do. Half of B2B buyers now start research in AI chatbots.

💡 AI search is a real channel, Bright Data #1 across all four models

The research methodology: a scraping API was used to query four AI systems simultaneously from a San Francisco IP address (to simulate a US-based B2B buyer). Prompt: "best web scraping API 2026". Results aggregated by occurrence and ranking position. Full ranking: 1. Bright Data (98.44% success, 72M+ IPs), 2. Zyte (93.14%, #1 Proxyway benchmark), 3. ScrapingBee, 4. Firecrawl (111K GitHub stars, LLM-optimised), 5. Scrape.do, 6. ScraperAPI, 7. Apify, 8. Scrapingdog, 9. Oxylabs, 10. Scrapfly. The AI search SEO implication: if you are building a scraping product, being in AI training data and AI search indexes is now a primary distribution channel. The buyers searching "best scraping API" increasingly ask an AI chatbot, not Google.

Browser stack / Serverless

XVFB + Headed Chrome + Nodriver Even on Serverless

The real breakthrough is not header spoofing. It is running a real browser in a real headed environment, even on serverless. Modern anti-bots are trained to detect machines pretending to be browsers, so the answer is to actually be one.

💡 The future belongs to systems that execute like real humans from the ground up

A few years ago, simple HTTP requests were enough. Today Cloudflare, DataDome, Akamai, and HUMAN Security analyse hundreds of browser, network, and behavioural signals simultaneously. The stack you choose now matters more than the proxy you point it at.

Detection risk by stack (lowest is best):

Stack	Detection Risk	Limitation
`requests` / `httpx`	Very High	No browser rendering
`Scrapy`	Very High	No behavioural realism
Headless Browser	High	Headless traces (WebGL=null, missing extensions)
Stealth Headless	Medium	Partial spoofing, JS patches detectable
XVFB + Headed Browser	Lowest	Higher data consumption

The full stack:
✓ XVFB virtual display (real X11 server, not headless flag)
✓ Fully headed Chrome (no --headless anywhere)
✓ Nodriver for CDP without webdriver artefacts (or Camoufox for Firefox)
✓ Authentic TLS / HTTP-2 behaviour (the browser handles this for free)
✓ Humanised interactions (Bezier-curve mouse, variable scroll timing)
✓ Residential proxies, sticky session for trust accumulation
✓ Fingerprint coherence (UA + WebRTC + DNS + timezone all match exit IP)

Why XVFB beats --headless even with stealth patches: headless Chrome reports HeadlessChrome in the user agent (fixable), lacks extensions (probe-able), and has zero GPU context (the real killer). With XVFB you get a real display, Chrome runs in headed mode, extensions load normally, and the GPU stack is whatever your server provides. JS patches still leave Function.prototype.toString() traces; XVFB does not.

The serverless angle: the conventional wisdom is that serverless cannot run a real browser. The trick is provisioning an X11 socket inside the container (Xvfb :99 &, DISPLAY=:99 chrome ...) so Chrome runs headed on a virtual display. Lambda has hit memory limits historically, but ECS Fargate, Cloud Run, and Modal handle this comfortably with ~1GB memory per browser instance. The result: serverless infrastructure behaving like real users, not automation.
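The X11 provisioning step can be wrapped in two small helpers. This is a sketch under assumptions: Xvfb is installed in the container image, display :99 is free, and the screen geometry is an arbitrary choice; the function names are hypothetical.

```python
import os


def xvfb_command(display=":99", screen="1920x1080x24"):
    """Argument list that starts a virtual X11 display with one screen."""
    return ["Xvfb", display, "-screen", "0", screen]


def headed_chrome_env(display=":99", base_env=None):
    """Copy of the environment with DISPLAY pointing at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env
```

In a container entrypoint you would pass xvfb_command() to subprocess.Popen, wait briefly for the socket to appear, then launch Chrome (or Nodriver) with env=headed_chrome_env() and no --headless flag.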

What still beats XVFB: C++ patched browsers like Camoufox (canvas, WebGL, audio at the binary level) and CloakBrowser (real extension probe profiles) close the remaining 10-20%. But for the 80-90% of targets where XVFB + Nodriver gets you in, the cost difference is significant. Camoufox: 200MB+ per instance. XVFB headed Chrome: the same memory footprint, but it works on any Chromium binary.

Modern anti-bot systems are trained to detect machines pretending to be browsers. The path forward is not better lies, it is fewer lies.

Detection / WASM

The Detection Layer Your Stealth Browser Cannot Patch

WebAssembly SIMD probes the actual CPU. WebAssembly shared memory gives anti-bots a 17× higher-resolution timer than performance.now(). Both run below the JS hooks Camoufox, CloakBrowser, and PatchRight patch.

💡 The arms race of better JS patches has a ceiling. WASM fingerprinting is past it.

The DataDome engineering team published in May 2026 a method for fingerprinting CPUs from the browser using WebAssembly SIMD. Vector operations on 128-bit registers map directly to CPU instructions (NEON on ARM, SSE/AVX on x86), and their timing reveals the actual silicon underneath, not whatever the browser claims.

The enabling primitive arrived in 2024 from Manuel at brokenbrowser.com: a one-liner that gets you a real SharedArrayBuffer on any page, no special headers, by calling new WebAssembly.Memory({shared:true}).buffer. Drive a MessageChannel ping-pong with Atomics.add() inside it and you have a counter ticking at 100,000 Hz, micro-timing precision around 6 microseconds. Chrome marked it Won't Fix.

What this defeats:
× Camoufox (Firefox C++ patches at the browser layer)
× CloakBrowser (49 Chromium binary patches)
× PatchRight, undetected-chromedriver, Nodriver, Pydoll
× Every JS prototype patch (Function.toString detection is irrelevant when nothing JS is touched)

What still works: real hardware diversity. The future of stealth scraping is real consumer machines on real ISP IPs, which is essentially what high-quality residential proxy networks like Massive, Bright Data, and Oxylabs already provide. As detection moves into the CPU layer, the value of actually being real compounds.

This is also why akamai-v3-sensor works on Akamai v3: it never executes the WASM at all because it never reaches sensor.js. By bypassing at the TLS layer, you skip every detection layer above it.

Sources: Anthony Manikhouth (DataDome) and Manuel (brokenbrowser.com).

Persistence / Evercookie tradition

The Cookie That Refuses to Die

Anti-bots increasingly persist tracking state across cookie clears using a chain of fallback storage: localStorage, IndexedDB, Service Workers, Cache API, FileSystem. Clear one, the others restore it. The 2010 Evercookie idea is back, in 2026 form.

💡 If clearing cookies does not reset your session, the detection layer is not in the cookie

The unclearable-cookie repo demonstrates the modern version of Samy Kamkar's 2010 Evercookie technique: when a user clears cookies in DevTools, the cookie immediately respawns from a copy held in localStorage, IndexedDB, or a Service Worker cache. The visual is hypnotic, you delete it, refresh, it is back.

Why this matters for scrapers: if you are rotating cookies between requests to look like a fresh visitor, but the target site is reading your localStorage entry from the previous session, your rotation does nothing. Anti-bot vendors like Forter and Riskified have shipped variants of this for years. Cloudflare's cf_clearance cookie now has localStorage backup in some configurations.

Storage layers a real reset has to clear:
✓ Cookies (HTTP and JS)
✓ localStorage and sessionStorage
✓ IndexedDB (every database)
✓ Service Worker registrations and Cache API entries
✓ FileSystem API (legacy but still works)
✓ Web SQL (deprecated but persists on older Chromium)
✓ ETag / If-Modified-Since headers cached at HTTP layer
✓ HSTS pin database (yes, browsing data can be encoded in HSTS pins, this is real)

Practical implication for scrapers: when you rotate sessions, do not just clear cookies. Either spin up a fresh browser profile each session (Playwright context.close() + new context, or a fresh Camoufox BrowserContext), or run in an entirely isolated container. Half-measures leak state.
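A fresh context per session is easy to enforce with a small wrapper. The sketch assumes a Playwright-style sync browser object exposing new_context(), and a context exposing new_page() and close(); run_in_fresh_context and the task callback are hypothetical names.

```python
def run_in_fresh_context(browser, task):
    """Run task(page) inside a brand-new browser context, then discard it.

    Each call gets its own cookies, localStorage, IndexedDB and service
    workers, so nothing from the previous session can respawn.
    """
    context = browser.new_context()
    try:
        page = context.new_page()
        return task(page)
    finally:
        # Always close, even if the task raises, so no state lingers.
        context.close()
```

Context isolation covers browser storage only; IP, TLS and hardware signals still need their own rotation.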

For the curious: the original Evercookie by Samy Kamkar in 2010 used 13 storage mechanisms. Modern browsers have removed several (Flash LSO, Silverlight, Java applets), but added more (Service Workers, BroadcastChannel, OPFS). The trick is alive and well, just modernised.

Reality check · robots.txt

robots.txt Works Perfectly. On the Bots That Were Going to Comply Anyway.

A publisher blocked GPTBot, ClaudeBot, and PerplexityBot in robots.txt. Logs confirmed zero traffic from those User-Agents. The scraping continued. It just stopped identifying itself.

💡 User-Agent is a string the client chooses to send. Polite bots send a real one.

A publisher blocked GPTBot, ClaudeBot, and PerplexityBot in robots.txt. Logs confirmed it: zero traffic from any of those User-Agents. They thought they had solved the problem.

What was actually happening: the scraping continued. The traffic that used to say "I am GPTBot" was now saying "I am Chrome 124 on macOS." Same content destinations, same fetch patterns, different label.

User-Agent is a string the client chooses to send. Polite scrapers send a real one. The scrapers you are actually worried about — the ones running at commercial scale on behalf of paying customers — send whatever string gets through.
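Python's standard-library robots.txt parser shows how thin this gate is. The sketch below uses a made-up robots.txt that blocks only GPTBot; the helper name is_allowed and the example URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a single named crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Ask the parser whether this self-declared User-Agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

The same rules return "blocked" for GPTBot and "allowed" for a Chrome string, because the only input is the label the client chose to send.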

robots.txt works on:
✓ Search engine and academic crawlers (Googlebot, Bingbot, academic research bots)
✓ Large AI labs (OpenAI, Anthropic, Google) that have reputational incentives to comply
✓ Hobbyist scrapers who read the rules and care

robots.txt does not work on:
× Commercial data brokers sending Chrome User-Agents
× Competitive intelligence tools running at scale behind residential proxies
× AI startups that have not publicly announced themselves
× Anyone whose business depends on data you do not want them to have

The implication for anti-bot systems: blocking by User-Agent is the weakest possible signal. Cloudflare, Akamai, and DataDome do not read robots.txt. They score TLS fingerprints, canvas hashes, behavioural timing, and IP reputation because those signals are harder to fake. User-Agent string matching is not a detection layer. It is a flag for voluntary compliance.

For scrapers reading this: if a target blocks GPTBot in robots.txt but has no real anti-bot scoring, the robots.txt is the only gate. Respect it. If they have Akamai or Cloudflare deployed, the robots.txt is decorative. The actual gate is the JA4 hash, the canvas probe, the IP reputation check. That is where this guide comes in.

Via a publisher conversation, May 2026.

Ecosystem shift

Web Scraping Is Becoming Distributed Adversarial Systems Engineering

Most scraping frameworks still assume the web is static. Modern websites are adversarial runtime environments. The shift from HTML parsing to infrastructure engineering is well underway.

💡 The future of autonomous AI agents may depend on infrastructure that can operate reliably across hostile, dynamically changing web environments.

The conventional framing: web scraping is a tool for extracting data from websites.

The 2026 framing: web scraping is distributed adversarial systems engineering.

What modern infrastructure has to operate against:
• Fingerprinting systems (TLS JA4, canvas, WebGL, WASM SIMD)
• Behavioural detection (mouse physics, scroll timing, inter-request jitter)
• Anti-bot orchestration (Akamai EdgeWorker, Cloudflare Worker, DataDome middleware)
• Cloudflare interstitials and Turnstile challenges
• Dynamic runtime rendering (SPAs, hydration, lazy loading, service workers)
• Session-aware defences (trust accumulation, per-session scoring, unclearable cookies)

This is what makes tools like Scrapling architecturally interesting beyond "another Python scraper." It combines stealth browser execution, TLS fingerprint impersonation, adaptive element tracking that survives DOM changes, session-aware orchestration, proxy rotation, and MCP-based AI extraction. Not a scraper. A runtime.

The systems properties that matter now:
• Runtime orchestration (not just retries, state-aware crawl management)
• Observability (what failed, at which layer, on which request)
• Adaptive recovery (selector drift, DOM changes, anti-bot updates)
• State persistence (session trust, cookie chains, cross-request identity)
• Stealth execution (not a flag, an architecture)
• Infrastructure resilience (circuit breakers, session warmup, fallback tiers)

Once AI agents start interacting with the web autonomously at scale, reliability becomes a systems problem first and a parsing problem second. The pipeline that Firecrawl, Crawl4AI, Stagehand, and Scrapling are converging on is not "scraper plus LLM." It is a resilient extraction runtime with LLM as one processing layer among many.

Framing via D4Vinci (Scrapling author), May 2026.

Research · Castle Intelligence · April 2026

Your Anti-Bot Fingerprint Is Probably for Sale Right Now

Castle analysed 811 bot-adjacent sites and found browser fingerprints being collected, packaged, and sold as operational assets. 12.5% deployed fingerprinting scripts consistent with harvesting. Services openly advertise "comprehensive TLS, HTTP/2, and JavaScript fingerprint collection."

💡 Detection systems must assume replay. Client-side signals are untrusted input, not proof of identity.

Castle Research (Antoine Vastel, April 2026) analysed 811 bot-adjacent websites, including proxy providers, CAPTCHA farms, engagement manipulation services, and sneaker bots. The findings document a structured, commercialised layer of the bot ecosystem that most scraping engineers have not thought about.

What they found:

12.5% of analysed sites deployed fingerprinting-related scripts consistent with harvesting. A subset replicated vendor-specific telemetry from PerimeterX, Incapsula, Akamai, Adyen, and hCaptcha, not to defend themselves, but to collect and replay the same signals against those vendors.

The mechanics:

Services like impersonate[.]pro openly advertise "comprehensive TLS, HTTP/2, HTTP/3, and JavaScript fingerprint collection." In Discord and Telegram communities, bot developers discuss embedding custom JavaScript on real websites specifically to harvest fingerprints from genuine visitors. The goal: build inventories of authentic device profiles that can be injected into automated sessions.

The PerfectCanvas mechanism from Bablosoft is the clearest example. Their documentation describes exactly the pattern:
• Render canvas on a real remote machine with a real GPU
• Send the canvas output to the automation server
• Inject it into the headless browser's response to the canvas probe

This is the harvesting-and-replay model made explicit. Instead of spoofing canvas values (detectable via inconsistency), you replay values from a real Mac. The fingerprint is genuine. It just came from a different device.

Genesis Marketplace established the precedent: ~323,000 compromised browser environments for sale, each bundled with a real device fingerprint and a custom Chromium extension that injected the victim's browser profile into attacker sessions. F5 Labs and Europol both documented this. Castle's report shows the same approach is now commercialised at scale for bot traffic, not just account takeover.

What this means for scrapers:

The arms race has a new dimension. Anti-bots are scoring fingerprints. Bot services are buying real fingerprints to replay. Defenders are now building for replay conditions, not just spoofing conditions. This is why:

• Canvas/WebGL probes are increasingly paired with behavioural and timing signals (harder to replay than static values)
• WASM SIMD CPU probes (above) are valuable precisely because they are harder to harvest and replay than JS-layer fingerprints
• Anti-bots are introducing controlled variability in their own client-side scripts so that even valid payloads can't be reverse-engineered and replayed reliably

The implication for this guide: when a stealth browser passes the canvas probe, it may not be because it spoofed the hash well. It may be because it replayed a real hash that was never flagged. The distinction matters because vendors will move toward replay-resistant probes, making the harvest-and-replay model progressively harder. WASM SIMD (which requires real hardware timing) is an early example of a replay-resistant signal.

Source: Fingerprint Harvesting in the Bot Ecosystem, Castle Research, Antoine Vastel, April 2026.

Pattern · Self-healing scrapers · June 2026

The LLM as Compiler and Oracle, Not Runtime

A self-healing scraper design where the LLM compiles a cheap deterministic crawler once, then your pages run at zero model cost. When a redesign breaks the crawler, the model serves that one request live to avoid an outage while it regenerates a new crawler in the background. The hard part is not the regenerating, any model can rewrite a broken scraper. The hard part is trusting the rewrite before it ships.

💡 The moat is not the heal. It is the proof that the heal is right.

The architecture treats the LLM as two roles at once. As a compiler it turns a target into cheap deterministic extraction code that runs for free. As an oracle it is the source of truth a regenerated crawler must agree with before it is trusted. Between them sits the piece that makes the whole thing safe to run unattended: a promotion gauntlet.

How a regenerated crawler earns promotion:
• It must agree with the LLM oracle on every item, not on average. A high mean is not good enough; one disagreement fails the batch.
• It must return the exact same record count as the oracle pass.
• It must hold across at least three independent samples, so a lucky single page cannot promote a broken crawler.
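The three rules above translate almost directly into a predicate. This is a sketch of the idea, not Crawloop's code: it assumes each sample is a pair of record lists (candidate crawler output, oracle output) in the same order, and the name passes_gauntlet is hypothetical.

```python
MIN_SAMPLES = 3  # at least three independent samples, per the rules above


def passes_gauntlet(samples, min_samples=MIN_SAMPLES):
    """Decide whether a regenerated crawler may be promoted.

    samples: list of (candidate_records, oracle_records) pairs, one per page.
    """
    if len(samples) < min_samples:
        return False  # a lucky single page cannot promote a crawler
    for candidate, oracle in samples:
        if len(candidate) != len(oracle):
            return False  # record counts must match exactly
        if any(c != o for c, o in zip(candidate, oracle)):
            return False  # one disagreement fails the whole batch
    return True
```

There is deliberately no averaging anywhere: the gate is all-or-nothing, which is what makes it safe to run unattended.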

Tiered model escalation keeps the cost sane. A cheap model runs the regeneration by default; the loop only escalates to a stronger model when nothing clears the gauntlet. Whatever gets promoted is still free deterministic code, so the one-time model cost amortises across every later run.
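The escalation loop can be sketched as a cheapest-first search. The regenerate and gauntlet callables are placeholders for whatever produces and verifies a crawler; none of these names come from the source.

```python
def heal(models, regenerate, gauntlet):
    """Try models from cheapest to strongest.

    Returns (model, crawler) for the first crawler that clears the
    gauntlet, or None if every tier fails.
    """
    for model in models:
        crawler = regenerate(model)
        if gauntlet(crawler):
            return model, crawler
    return None
```

A None result is the signal to keep serving requests through the live oracle path rather than promoting anything.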

This matters because "self-healing scraper" demos are easy and trustworthy self-healing is hard. Any model can produce a plausible rewrite. The engineering is in the verification that decides whether the rewrite is correct before it touches production data, which is the same adversarial-verification idea this guide's AI Workflow is built around. It pairs naturally with the agentic reverse-engineering shift in the next card: the model drives the toolchain, but a gauntlet, not the model's confidence, decides what is true.

The proof that separates it from a demo: the author ran the whole loop live against a real public site over real HTTP, not a saved fixture. The cheap model could not clear the gauntlet, so the loop auto-escalated to a stronger model, promoted a crawler, and that promoted crawler then extracted a fresh page (title, price, stock, URL, cover image) for zero cost with zero further model calls. That is the claim that matters: not "an LLM fixed my scraper" but "the system decided, on its own, that the fix was trustworthy enough to run unattended, and it was right." The honest weak spot is in the open too: holding a greedy oracle to exact agreement across a twenty-item listing is hard, and that is where it currently strains.

Pattern and live proof of concept shared publicly by the author of Crawloop (Apache-2.0 alpha POC, github.com/Jimmynycu/Crawloop), June 2026. Honest status from the README: a working POC without the managed proxies, scale, or dashboards the funded self-healing tools (Kadoa, ScrapeGraphAI, and others) ship.

Shift · Agentic reverse engineering · 2026

Agents Now Drive the Disassembler, Not Just Read It

The interesting change in LLM-assisted reverse engineering is not that models suddenly read obfuscated code perfectly. It is that, wired to real tooling, they coordinate the whole pipeline: query disassemblers and decompilers, inspect cross-references, generate scripts, patch binaries, rerun analysis, and refine hypotheses in a loop. The shift is from a smart helper sitting next to the analyst to an agent driving the toolchain.

💡 The lever is orchestration of the workflow, not raw comprehension of the code.

For scraping this lands squarely on the hardest target type: native mobile request-signing. The manual version of that work (open the .so in Ghidra, trace the call chain, confirm with Frida, rebuild in Python) is exactly the kind of multi-tool loop an agent can now coordinate. You point a coding agent at a target, it spins up specialist sub-agents (engines, impersonation, detection, architecture, fingerprints), runs them in parallel, then synthesises and stress-tests the result.

The honest caveat is the same one the reverse-engineering community draws: agentic workflows compress the process, they do not dissolve every defence. Some obfuscation classes (heavy virtualisation, bytecode VMs) stay resilient, which is precisely when you fall back to the oracle approach from the mobile section rather than a clean offline rebuild. Treat the agent as a force multiplier on a method you already understand, not a replacement for understanding it.

And the obfuscation side is adapting to the agents specifically. The same practitioners who teach automated deobfuscation (SMT solving, symbolic execution, MBA simplification, program synthesis to recover VM handlers and bytecode) now report protections deliberately engineered to break those pipelines: anti-agentic patterns. Analysis traps that detect a symbolic-execution or instrumentation harness and change behaviour under it, and runtime-bound semantics where a function's meaning depends on live state an offline solver cannot supply, are built to make an automated loop stall or draw a confidently wrong conclusion. The arms race did not end when agents could drive the disassembler; it moved up a level, and the counter to a trap you cannot automate around is still a human who understands what the loop was trying to do.

A concrete cold-start case. A researcher pointed an agent that pairs a model with a sandboxed VM (Manus AI) at a live Akamai deployment on a real luxury store, with no prior notes and no internal wiki, and asked only that it study how the protection works, deobfuscate the client sensor, and enumerate which parameters feed the score. The agent fetched and instrumented the actual live script in its VM, then returned a structured map of the anti-hook layer, the challenge flow, and the TLS gate. The point is not a finished bypass (none was shipped) but that the expensive, human-gated part, reading minified obfuscated telemetry code and rebuilding its logic, was done by the agent running and checking its own work rather than a person spending days renaming variables.

The guardrail gap is the real story. Ask a guarded chat assistant to deobfuscate a production anti-bot sensor and enumerate its scoring signals and you hit a refusal, because that is squarely inside cybersecurity guardrails. An agent product wired to a VM took the same task and ran it. The economic consequence is what matters for the arms race: the cost that kept most protections standing was the human reverse-engineering hours, and an agent that verifies its own deobfuscation moves that gate. Defenders should now assume sensor logic is cheaper to map than it used to be, and lean harder on the layers that do not live in the client script.

Framing from "Deobfuscation in the Age of Agentic Reverse Engineering" (REcon 2026), practitioner demos of multi-agent RE pipelines, and a documented cold-start agentic mapping of a live Akamai sensor (The Web Scraping Club, Lab #108, June 2026). The agent's specific findings are reportage, not independently re-verified, and operational specifics were redacted at the source.

Tooling · Agent-native fetching · 2026

The Browser Is Getting Lighter Because Agents Do Not Need Pixels

A human needs a rendered page. An agent needs the meaning of the page. That gap is producing a new class of fetchers that throw away the visual rendering stack agents never use: headless engines measured in tens of megabytes instead of two hundred, and tools that hand back clean Markdown instead of a DOM you have to parse. When the consumer is an LLM, the heavy browser is mostly overhead.

💡 Match the fetcher to the consumer. An LLM wants structure and speed, not a painted viewport.

Three tools sketch the direction, and the pattern matters more than any one of them:

• Obscura is a headless browser written in Rust that ships its own V8, speaks the Chrome DevTools Protocol, and runs as a single binary with no Chrome and no Node.js. Roughly 30MB resident against 200MB-plus for headless Chrome, with per-session fingerprint randomisation and a DOM-to-Markdown mode for feeding pages straight to a model. Be honest about maturity: it is an early v0.1.0 with self-reported numbers, so star it and benchmark it yourself rather than betting production on it today.

• Crawl4AI turns selected pages into clean Markdown or structured data built for agents, RAG, and pipelines. It is the extract-and-shape half of the stack.

• SearXNG is a self-hosted search layer that finds the candidate URLs in the first place. It is the discover half.

Together they describe a small, controllable agent web-context loop: discover, fetch, extract, cache, cite, with each stage owned by a light tool you can host yourself rather than a heavy browser doing all four jobs badly. The takeaway is not "switch to these tools." It is that when an agent is the consumer, the cost of rendering pixels nobody looks at is pure waste, and the tooling is starting to reflect that.

Tools surfaced publicly by practitioners in 2026: Obscura (Rust headless, Apache-2.0/MIT, early), Crawl4AI (Apache-2.0, 68k+ stars), SearXNG (self-hosted metasearch).

Defence experiment · Behavioural CAPTCHA · 2026

A CAPTCHA That Watches How You Tell a Story

A researcher built a deliberately awkward CAPTCHA as a behavioural-biometrics experiment: it shows a random prompt, asks you to type a short story about it, then rate the experience from 1 to 10. The trick is hidden in the interaction order. If you rate the CAPTCHA before you hit submit, you are probably a bot, because a human reads the story task first and rates last. The signal is not the answer, it is the sequence and rhythm of how you produced it.

💡 The frontier of human-checking is behaviour over time, not a single correct response.

This is worth studying for what it tells you about where bot detection is heading, on both sides.

Why the idea is clever. A traditional CAPTCHA asks for an answer a script can compute or outsource to a solving farm. A behavioural CAPTCHA scores how the answer was produced: typing cadence, edit and pause patterns, the order in which UI elements were touched, time spent reading versus writing. Those are expensive to fake convincingly because they are emergent properties of a real person working through a task, not a field you can fill in. The out-of-order tell (rating before submitting) is a neat tripwire: it catches automation that fills every field it sees without modelling the human workflow the form implies.

Why it still falls. Within a day of the public demo, another engineer bypassed it consistently with an LLM plus Playwright. The sequence is familiar from the rest of this guide: first attempt scored too low and was rejected, the approach was tuned, the second and third attempts passed, including a live run. Behavioural scoring raises the cost of automation, it does not create a wall, because a scripted agent can be taught to produce human-shaped timings and to touch the form in the order a person would. The lesson cuts both ways: if you defend, behavioural signals are a strong layer but not a final one, and you must assume they will be modelled; if you scrape, the modern bar is not "submit the right value" but "reproduce the human process that produced it," which is exactly the territory automation-protocol and interaction-timing detection already live in.

The healthy norm on display. Both the defence and the bypass were published openly, demo and method in the open, framed as understanding security rather than breaking it. That is the same posture this guide takes: the techniques are dual-use, and studying them in public is how both sides get sharper.

StoryCaptcha by Tyler Richards (stackedqueries), a stated proof-of-concept and not production-ready by the author's own note; public AI-plus-Playwright bypass write-up, 2026.

Pattern · Tooling ergonomics · 2026

The curl Command as a Front-End to a Real Browser

A recurring 2026 pattern wraps a real browser behind the interface developers already know: you write an ordinary curl command and a tool parses it, replays the same method, headers, body, cookies, and auth inside an actual Chromium driven over the automation protocol, clears the browser-side friction (Cloudflare's managed challenge, a Turnstile widget), and hands back the final response. The point is not a new bypass, it is that the browser-grade request is expressed as the one-liner you would have written anyway.

💡 The ergonomic win is meeting people at the curl command, not asking them to rewrite their workflow as browser automation.

This is worth noting as a shape, not a single product. The reference implementation in this lineage is CurlWright (open source, github.com/seifreed/Curlwright): it parses a curl invocation, supports the common flags (-X, -H, -d/--data-*, -b cookies, -u auth, -x proxy, -L, -k, --max-time), and runs it through a genuine Chrome rather than a spoofed fingerprint. Two design choices generalise beyond it.

Stealth below the automation layer. Rather than faking a browser with injected JavaScript, the current generation of these wrappers drives the real Chrome binary through a patched automation stack (Patchright-style) so the automation tells (the Runtime.enable CDP leak, navigator.webdriver, headless markers) are neutralised at the protocol level, and falls back to a CDP-free driver (nodriver-style) for the hardened "Just a moment" managed challenge. This is the same lesson the detection and library sections of this guide keep arriving at: on protocol-fingerprinting targets, driving a real browser without the standard automation surface beats patching a headless one. Persisting and re-importing the solved-cookie session (so a warmed-up cf_clearance can be reused across calls) is what makes the curl ergonomics practical for more than one request.

Machine-readable output for pipelines. The other generalisable idea is emitting structured JSON and even SARIF alongside the response, so a browser-grade fetch can drop straight into CI and security tooling that already understand those formats. It is a small thing that quietly moves browser-based fetching from an interactive task to a scriptable pipeline step. The honest constraint of the whole pattern is the one its own authors flag: because it drives a real Chrome, it needs that browser present on the host and carries the cost and weight of a full browser per protected request, so it earns its place on the hard targets, not as a default replacement for a plain HTTP client on everything.
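The parsing half of this pattern can be sketched with the standard library alone. This is a minimal illustration of turning a curl one-liner into a request description a browser driver could replay, not CurlWright's actual implementation; the function name `parse_curl` and the returned dictionary keys are assumptions made for the example, and only the flags listed above are handled.

```python
import argparse
import shlex


def parse_curl(command: str) -> dict:
    """Parse a curl one-liner into a plain request description (hypothetical helper)."""
    tokens = shlex.split(command)
    if tokens and tokens[0] == "curl":
        tokens = tokens[1:]

    parser = argparse.ArgumentParser(prog="curl", add_help=False)
    parser.add_argument("url")
    parser.add_argument("-X", "--request", dest="method")
    parser.add_argument("-H", "--header", dest="headers", action="append", default=[])
    parser.add_argument("-d", "--data", dest="data")
    parser.add_argument("-b", "--cookie", dest="cookies")
    parser.add_argument("-u", "--user", dest="auth")
    parser.add_argument("-x", "--proxy", dest="proxy")
    parser.add_argument("-L", "--location", dest="follow", action="store_true")
    parser.add_argument("-k", "--insecure", dest="insecure", action="store_true")
    parser.add_argument("--max-time", dest="timeout", type=float)
    args = parser.parse_args(tokens)

    # Split each "Name: value" header on the first colon only.
    headers = {}
    for raw in args.headers:
        name, _, value = raw.partition(":")
        headers[name.strip()] = value.strip()

    return {
        "url": args.url,
        # Like curl, default to POST when a body is present, otherwise GET.
        "method": args.method or ("POST" if args.data is not None else "GET"),
        "headers": headers,
        "data": args.data,
        "cookies": args.cookies,
        "auth": args.auth,
        "proxy": args.proxy,
        "follow_redirects": args.follow,
        "verify_tls": not args.insecure,
        "timeout": args.timeout,
    }
```

The resulting dictionary is the hand-off point: a wrapper in this style would replay it inside the browser rather than through an HTTP client.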

Pattern reference: CurlWright (seifreed, open source), 2026, plus the Patchright and nodriver engines it builds on. Described here as a tooling-ergonomics pattern, not an endorsement of any one tool.

11 Testing tools

Check your own fingerprint first

Before you bypass anything, you need to know what your setup is leaking. These tools show exactly what anti-bots see when your scraper connects. Run your scraper through them, not just your browser.

Gold standard, most detailed

BrowserLeaks

browserleaks.com

The most comprehensive fingerprint testing suite online. Tests WebRTC IP leak, Canvas hash, WebGL renderer, JA3/JA4 fingerprint, HTTP/2 Akamai hash, Chrome extension detection, fonts, geolocation, JavaScript environment, battery status. Essential for verifying your scraper identity stack.

TLS specific, generates JA3/JA4

BrowserLeaks TLS

browserleaks.com/tls

Tests your TLS ClientHello. Shows cipher suites, TLS extensions, key exchange groups, JA3 and JA4 hashes. Run Python requests, curl_cffi, and real Chrome through this and compare. JA4 is what Cloudflare and Akamai check at edge before serving any HTML.

IP leak, proxy coherence test

BrowserLeaks WebRTC

browserleaks.com/webrtc

Reveals your real IP even through a proxy. Shows local IP, public IP via STUN, and ICE candidates. If your proxy exit is US but WebRTC shows a local Pakistani address, every anti-bot flags you immediately. The most commonly overlooked leak.

JSON API, use directly in code

TLS JSON API

tls.browserleaks.com/json

Returns your TLS fingerprint as raw JSON including ja3, ja4, akamai hash, HTTP/2 settings. Call this directly from your scraper to compare fingerprints against real Chrome. One plain requests.get() next to one curl_cffi requests.get(..., impersonate=...) tells you everything about the difference.

Quick pass/fail validation

BrowserScan

browserscan.net

Higher-level green/red check for automation detection, timezone coherence, WebRTC status, canvas fingerprint uniqueness. Good for quick pre-deployment validation before hitting a protected target.

EFF built, uniqueness score

Cover Your Tracks

coveryourtracks.eff.org

Built by the Electronic Frontier Foundation. Tells you how unique your fingerprint is among all visitors. A fingerprint too unique is as bad as one that looks like a bot. Scrapers need to look like the middle of the distribution.

Bot detection simulation

Pixelscan

pixelscan.net

Simulates what anti-fraud systems see. Identifies inconsistencies in timezone, IP, language, and WebRTC that would trigger detection. Fast pass/fail for operational teams before deploying at scale.

Advanced, behavioral and hardware signals

CreepJS

abrahamjuliot.github.io

The most advanced fingerprint tester available. Simulates what modern anti-fraud systems actually detect, including behavioral and hardware-level signals far beyond surface tests. Use this for deep audits of browser configurations.

Diff your bot against a real browser

xray-scanner + script2builtins

catalog-driven fingerprint diff

A different kind of test: instead of scoring your browser, it shows you exactly which surfaces leak. The toolchain reverse-engineers bot-detection JavaScript (seeing through string-array tables, JSF-ck, and reflective getters) into a catalog of several hundred fingerprint surfaces, then gives you a test page to point your bot at, so you can diff its JSON output against a real browser's. You stop guessing which attribute betrayed you and read it off a list. Pair it with a daily benchmark run so you notice when a detector quietly changes what it probes.

Workflow: Fetch tls.browserleaks.com/json from both your scraper and real Chrome. Compare ja4 hashes. If they differ, fix TLS first with curl_cffi. Then check WebRTC at browserleaks.com/webrtc. Then headers. Work from layer 1 outward.
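The first step of that workflow can be sketched as follows. The JSON key names (`ja4`, `ja3_hash`, `akamai_hash`) are assumptions about the endpoint's response shape and should be checked against what it actually returns; the helper names are hypothetical, and the reference fingerprint is assumed to be the JSON saved from a real Chrome visit to the same URL.

```python
TLS_ENDPOINT = "https://tls.browserleaks.com/json"


def fetch_with_requests(url: str = TLS_ENDPOINT) -> dict:
    """Fingerprint of Python's default TLS stack."""
    import requests  # imported lazily so the diff helper works without it
    return requests.get(url, timeout=15).json()


def fetch_with_curl_cffi(url: str = TLS_ENDPOINT, profile: str = "chrome131") -> dict:
    """Fingerprint of curl_cffi impersonating a Chrome build."""
    from curl_cffi import requests as cffi_requests
    return cffi_requests.get(url, impersonate=profile, timeout=15).json()


def diff_fingerprints(scraper: dict, reference: dict,
                      keys=("ja4", "ja3_hash", "akamai_hash")) -> dict:
    """Return {key: (scraper_value, reference_value)} for every key that differs."""
    return {
        key: (scraper.get(key), reference.get(key))
        for key in keys
        if scraper.get(key) != reference.get(key)
    }
```

An empty result from `diff_fingerprints(fetch_with_curl_cffi(), chrome_reference)` means the TLS layer matches on the compared keys, and the next layer (WebRTC, then headers) is worth checking.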

One habit worth building: re-run these checks on a schedule, not once. Detection scripts are not static. A vendor that probed one set of surfaces last week probes a different set this week, so a fingerprint audit you did a month ago is already stale. Treat your own fingerprint as something to monitor continuously, the same way the other side treats yours.

12 Architecture

How production scrapers are actually built

From a single Scrapyd daemon to multi-region ECS clusters. Twelve real pipeline architectures, from simple to enterprise-scale, with every component and data flow mapped out.

The simplest production setup. One server, Scrapyd managing spiders via JSON API, ScrapydWeb as UI. Good for <50 spiders and teams without Kubernetes. Deploy with scrapyd-deploy, schedule via /schedule.json, monitor at port 6800.

✓ Pros

Zero infrastructure overhead, one server, done
ScrapydWeb gives full UI: logs, job history, schedule
Deploy new spiders in seconds with scrapyd-deploy
Great for teams without DevOps expertise

✗ Cons

Single point of failure, server down = scrapers down
Limited to one machine's CPU and memory
No auto-scaling, manual capacity planning
Spider isolation is process-level only

↑ Scale up

Add more Scrapyd nodes → ScrapydWeb manages cluster from one UI. Next step: scrapy-redis for shared URL queue.

Stack: Scrapyd :6800, ScrapydWeb :5000, scrapyd-client (deploy), APScheduler or cron, Gerapy (alt UI)
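Scheduling a run through Scrapyd's JSON API is a single POST to /schedule.json. A minimal sketch using only the standard library, assuming Scrapyd listens on localhost:6800; the project and spider names are placeholders, and the two helper functions are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

SCRAPYD = "http://localhost:6800"  # assumed local Scrapyd daemon


def schedule_request(project: str, spider: str, base_url: str = SCRAPYD,
                     **spider_args: str) -> Request:
    """Build the POST for /schedule.json; extra keywords become spider arguments."""
    payload = {"project": project, "spider": spider, **spider_args}
    return Request(
        f"{base_url}/schedule.json",
        data=urlencode(payload).encode(),
        method="POST",
    )


def send(request: Request) -> dict:
    """Send the request and decode Scrapyd's JSON reply."""
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

Calling `send(schedule_request("shop", "products"))` from cron or APScheduler is enough to drive the single-server setup described above.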

★ Featured architecture

The AI Workflow
graph + adversarial + five algorithms

A genuine AI-agent scraping system, designed from scratch. Two agents working through a queryable graph of vendor and selector history, with six mathematical concepts each replacing a heuristic that would otherwise rule the system. Below the conceptual diagram, you will find what this looks like when you actually run it in production: real services, real protocols, real bottlenecks, and the cost numbers it should add up to.

① Graph theory

PageRank · Louvain · BFS

Memory as a queryable graph. PageRank ranks vendors by operational criticality. Louvain community detection auto-distinguishes failure modes (structural change vs vendor flip vs isolated drift). BFS bootstraps new URLs from their nearest known-working neighbour.

replaces · flat-file selector history
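The BFS bootstrap can be sketched without a graph library. This assumes the memory is available as a plain adjacency dictionary and that "known-working" is a set of URL nodes; the node naming scheme and the function name are illustrative only.

```python
from collections import deque


def nearest_working(graph: dict, start: str, working: set):
    """Breadth-first search from `start`; return (node, hops) for the closest
    known-working node, or None if none is reachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node != start and node in working:
            return node, hops
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, hops + 1))
    return None
```

A new URL that shares a vendor node with a working URL is two hops away from it, so its fetch strategy and selectors are the first candidates to try.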

② Bayesian inference

Beta(α, β) confidence

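Every selector and every URL carries a Beta distribution, not a scalar. Beta(95, 5) corresponds to roughly 95 successes and 5 failures: very confident. Three successes from three tries cannot be written as Beta(3, 0), because both parameters must be positive; starting from a uniform Beta(1, 1) prior it becomes Beta(4, 1), a high point estimate with a wide credible interval. Routing weighs evidence, not just point estimates.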

replaces · single confidence float
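A minimal sketch of that per-selector state, assuming a uniform Beta(1, 1) prior and using the closed-form mean and standard deviation instead of a full credible interval. The class name is hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class BetaBelief:
    alpha: float = 1.0  # prior pseudo-count of successes (uniform prior)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, success: bool) -> None:
        """Conjugate update: one observation bumps one parameter."""
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def std(self) -> float:
        n = self.alpha + self.beta
        return sqrt(self.alpha * self.beta / (n * n * (n + 1)))


veteran = BetaBelief(95, 5)   # long track record
newcomer = BetaBelief()       # three tries, all successful
for _ in range(3):
    newcomer.update(True)
```

Both beliefs have a high mean, but the newcomer's standard deviation is several times larger, which is the information a single confidence float throws away.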

③ Thompson sampling

multi-armed bandit on rungs

For each URL, the four rungs (curl_cffi / browser / browser+pacing / managed) are arms of a bandit. Thompson sampling balances exploration (try the cheap rung occasionally) and exploitation (use what worked yesterday). It is the same math as ProxyOps, covered elsewhere in this guide.

replaces · fixed rung choice per URL
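The rung selection can be sketched in a few lines: keep a Beta posterior per rung, draw one sample from each, and pick the largest draw. This is plain Thompson sampling on success probability only; weighting by rung cost, which a real router would need, is deliberately left out. The class name is hypothetical.

```python
import random

RUNGS = ["curl_cffi", "browser", "browser+pacing", "managed"]


class RungBandit:
    def __init__(self, rungs=RUNGS, rng=None):
        self.rng = rng or random.Random()
        # [alpha, beta] per rung, starting from a uniform Beta(1, 1) prior.
        self.state = {rung: [1.0, 1.0] for rung in rungs}

    def choose(self) -> str:
        """Sample each posterior once and pick the rung with the highest draw."""
        draws = {
            rung: self.rng.betavariate(alpha, beta)
            for rung, (alpha, beta) in self.state.items()
        }
        return max(draws, key=draws.get)

    def record(self, rung: str, success: bool) -> None:
        """Update the chosen rung's posterior with the observed outcome."""
        self.state[rung][0 if success else 1] += 1
```

Because the choice is a random draw rather than an argmax over means, a rung with little evidence still gets tried occasionally, which is the exploration the card describes.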

④ Information theory

KL divergence drift

Compute KL divergence between today's value distribution and a rolling 30-day baseline, per field. A field can have 100% fill rate and still drift, prices all shift up 5%, titles all gain a suffix. KL catches what fill-rate misses. This is how silent poisoning becomes visible before it pollutes silver.

replaces · fill-rate as drift signal
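A minimal sketch of the drift check, assuming values are bucketed into a histogram by a caller-supplied function and that a small additive smoothing term keeps empty buckets from producing a division by zero. The bucketing choice and any alert threshold are assumptions to tune per field.

```python
from collections import Counter
from math import log


def kl_divergence(today, baseline, bucket, eps: float = 1e-6) -> float:
    """KL(today || baseline) over the union of observed buckets."""
    p_counts = Counter(bucket(v) for v in today)
    q_counts = Counter(bucket(v) for v in baseline)
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(keys)
    q_total = sum(q_counts.values()) + eps * len(keys)
    divergence = 0.0
    for key in keys:
        p = (p_counts[key] + eps) / p_total
        q = (q_counts[key] + eps) / q_total
        divergence += p * log(p / q)
    return divergence


baseline_prices = [20.0] * 50 + [40.0] * 50
shifted_prices = [price * 1.05 for price in baseline_prices]  # every price up 5%
drift = kl_divergence(shifted_prices, baseline_prices, bucket=round)
```

Both lists are 100% filled, so a fill-rate monitor sees nothing, while the divergence jumps well above zero because the mass has moved to different buckets.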

⑤ Control theory

PID concurrency loop

Ban rate is the error signal, request concurrency is the actuator. A PID controller adjusts concurrency continuously to hold ban rate at target (e.g. <0.5%). Proportional reacts to magnitude, integral to persistent drift, derivative dampens overshoot. Used in production at Stripe, Cloudflare. Heuristics here cost real money.

replaces · fixed concurrency & manual backoff
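The loop can be sketched as a small discrete PID controller. The gains below are illustrative and untuned, the 0.5% target comes from the example above, and anti-windup on the integral term is omitted for brevity; the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConcurrencyPID:
    target_ban_rate: float = 0.005   # hold bans at or below 0.5%
    kp: float = 200.0                # illustrative gains, not tuned values
    ki: float = 20.0
    kd: float = 50.0
    min_concurrency: float = 1.0
    max_concurrency: float = 256.0
    concurrency: float = 32.0
    _integral: float = 0.0
    _prev_error: float = 0.0

    def step(self, observed_ban_rate: float, dt: float = 1.0) -> int:
        """Feed in the latest ban rate, get back the new concurrency."""
        # Error is negative when bans exceed the target, which lowers concurrency.
        error = self.target_ban_rate - observed_ban_rate
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        self.concurrency += (
            self.kp * error + self.ki * self._integral + self.kd * derivative
        )
        self.concurrency = max(
            self.min_concurrency, min(self.max_concurrency, self.concurrency)
        )
        return int(self.concurrency)
```

Calling `step()` once per measurement window with the observed ban rate replaces both the fixed concurrency number and the hand-written backoff.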

⑥ Reservoir sampling

unbiased audit sample

Maintain a 100-row ground-truth audit sample drawn uniformly from a 50M-row stream, in one pass, in O(k) memory for a k-row sample, independent of the stream length. Vitter's Algorithm R. Without this you end up auditing the easy rows from the start of each batch and missing the hard rows at scale.

replaces · first-N or last-N sampling
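Algorithm R itself is short enough to show in full. A sketch over any iterable, with the random generator injectable so a run can be reproduced:

```python
import random


def reservoir_sample(stream, k: int, rng=None) -> list:
    """Uniform sample of k items from a stream of unknown length, in one pass."""
    rng = rng or random.Random()
    sample = []
    for index, row in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row number index+1 replaces a random slot with probability k/(index+1).
            slot = rng.randint(0, index)  # inclusive on both ends
            if slot < k:
                sample[slot] = row
    return sample
```

Only the k-row reservoir is ever held in memory, so the same function works on a generator that yields rows straight from the pipeline.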

Concepts deliberately rejected as theatre for this scale: GNN training (graph too small), max-flow/min-cut (no mapping), causal do-calculus (impractical in production), game-theoretic minimax (no concrete operation), Markov chains for traversal (overkill), learned embeddings (cost > value).

The architecture · conceptual view

A genuine AI-agent scraping architecture, built around three ideas you won't find in published frameworks. (1) Graph memory. URLs, selectors, vendors and outcomes are nodes, connected by edges. When one vendor changes its sensor, every URL on that vendor inherits the learning automatically. (2) Adversarial verification instead of self-healing. The scraper agent ships an answer + a confidence score, the verifier agent's job is to disprove it. Only outputs that survive the disproof attempt reach gold storage. The detail that makes this work is keeping the verifier context-blind: run it in a fresh session with no memory of how the answer was produced, so it judges the output on its merits instead of inheriting the generator's rationalisations. A reviewer that sat through the generation will quietly accept the same shortcut that produced the bug; a reviewer handed only the result, the schema, and the live page has to re-derive correctness from scratch, which is exactly the property you want. The same applies when an agent rewrites a broken scraper: benchmark and review the regenerated code from a clean context, not inside the conversation that wrote it. (3) Graph-theoretic intelligence on top of the memory. PageRank ranks vendors by operational criticality, community detection auto-distinguishes structural failures from selector drift, shortest-path traversal bootstraps strategies for new URLs from their nearest known-working neighbours. Most of the system runs cheap, verification runs only on the hard cases.

Stack: Neo4j or DuckDB-graph (knowledge graph memory), NetworkX (PageRank · Louvain · BFS), Small LLM as Scraper agent, Larger LLM as Verifier agent, Any frontier LLM API · prompt: "prove this is wrong", Confidence routing (≥0.9 / 0.6-0.9 / <0.6), Pub/Sub queue (URL fan-out), curl_cffi + CloakBrowser (rung-aware fetch), Vendor-level learning propagation, Schema-drift Slack alerts (human gate), Parquet bronze → silver → gold

What this looks like in production

Real services, real protocols, real bottlenecks. Drawn for 10M-50M URLs/day. Each rung in the Scraper agent is annotated with the actual scraping technique that does the work, not just the box that runs it. Color-coded dots mark each flow direction so you can trace any path through the system.

★ stack philosophy

Two agents, no framework, model-agnostic. The two agent roles are scraper (small fast model, runs on every row) and verifier (larger careful model, runs on ~12% of rows). Either role can be served by any frontier LLM with structured output: an Anthropic model, an OpenAI model, a Gemini model, or an open-weights model behind vLLM. No LangChain, LangGraph, AutoGen, CrewAI, Haystack, or LlamaIndex. The architecture has six math operations, none of which needs a framework's abstraction: a direct API call for the model, pydantic for structured output, networkx for the three graph algorithms, scipy.stats.beta for Bayesian state, small custom services for Thompson sampling, KL drift, and the PID loop, and boto3 for SQS/Kinesis/S3. Frameworks shine when you have 30 chains and 10 tool integrations to coordinate. We have 2 agents and a queue.

Legend: scrape traffic (~85%) · verifier-flagged (~12%) · verifier verdict (written back) · learning feedback (updates control plane)

What this production design handles automatically · no on-call required

vendor sensor flips

PID + bandits re-route within minutes. ALL URLs on that vendor inherit the new escalation through the graph.

selector drift

Verifier disagreement N times raises a new selector candidate. Graph propagates to related URLs. Seamless.

cost creep

PID tightens concurrency as bans rise. Bandit shifts traffic to cheaper rungs. Spend self-corrects.

silent value poisoning

KL divergence catches distribution shifts the row count and 200-OK rate cannot see. The hardest failure mode, now visible.

What it routes to a human · by design: schema changes (new field · removed field · type mismatch) · the shape of data is governed, not learned.

✦ Pattern

Self-Healing Scraper powered by Claude

Scrapy spiders break when sites change their HTML. Instead of manually fixing selectors, this architecture uses Claude to detect failures, analyse the new page structure, and write corrected selectors automatically, without human intervention.

Spider detects failure

Item count drops to zero or below threshold. A Scrapy extension hook fires immediately, no waiting for the next run.

Claude analyses the broken page

The full page HTML is sent to Claude with the old selectors and a prompt: "The selectors below stopped working. Examine the HTML and write corrected CSS selectors for the same data fields."

New selectors written and applied

Claude returns structured JSON with corrected selectors. The updater patches the spider config or YAML file. No code deployment needed.

Spider retries and confirms

The spider re-runs with new selectors. If items come back, healed. A Slack notification logs what changed. If it fails again, escalates to human review.

Claude prompt pattern

You are a web scraping expert. A Scrapy spider broke because the site changed its HTML.

Old selectors (no longer working):
  title:  h1.product-title::text
  price:  span.price-now::text
  image:  img.main-image::attr(src)

New page HTML (truncated):
{{ page_html[:8000] }}

Return ONLY valid JSON with corrected selectors:
{"title": "...", "price": "...", "image": "..."}
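The prompt above can be filled in and its reply validated with a few lines of standard-library code. This sketch covers only the two ends of the exchange (building the prompt, checking the returned JSON); the model call itself is left to whichever SDK is in use, and the helper names are hypothetical.

```python
import json

FIELDS = ("title", "price", "image")

PROMPT_TEMPLATE = """You are a web scraping expert. A Scrapy spider broke because the site changed its HTML.

Old selectors (no longer working):
{selectors}

New page HTML (truncated):
{html}

Return ONLY valid JSON with corrected selectors:
{{"title": "...", "price": "...", "image": "..."}}"""


def build_prompt(old_selectors: dict, page_html: str, limit: int = 8000) -> str:
    """Fill the template, capping the HTML at `limit` characters."""
    lines = "\n".join(f"  {name}:  {sel}" for name, sel in old_selectors.items())
    return PROMPT_TEMPLATE.format(selectors=lines, html=page_html[:limit])


def parse_selectors(reply: str, fields=FIELDS) -> dict:
    """Accept the reply only if it is a JSON object with a non-empty string per field."""
    data = json.loads(reply)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [
        f for f in fields
        if not isinstance(data.get(f), str) or not data[f].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty selectors: {missing}")
    return {f: data[f] for f in fields}
```

Rejecting a malformed reply here, before anything is written to the selector store, keeps a bad heal from replacing a selector that merely needed a retry.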

✓ Why it works

Claude reads raw HTML better than regex, handles minified, dynamic, and obfuscated markup
Zero downtime, spider heals mid-run, not on next deployment
Works across 50+ spiders from a single Claude integration
Selector changes are the most common spider failure, this covers 80% of breakages

⚙ Implementation notes

Use a Haiku-tier Claude model (for example claude-3-haiku-20240307) for speed and cost, ~$0.0003 per heal
Cap page HTML at 8K chars before sending, beyond that Claude doesn't need more
Store selector history in a YAML file versioned in Git for auditability
Add a confidence check, if healed items look wrong, escalate to human

↑ Extend it

Add a second Claude call to validate the healed output against a schema. Use computer-use to handle JavaScript-rendered pages where HTML alone isn't enough. Log all heals to build a fine-tuning dataset.

Stack: Scrapy extension hook, Claude API (Haiku), YAML selector store, Slack webhook, Anthropic SDK

The lab-to-production gap: running thousands of pages unattended for days

There is a wide gap between running one to three spiders on your laptop and running thousands of pages in production, unattended, for days. Most tutorials stop where that gap begins. The crawl that works perfectly in a terminal session fails in ways you never see interactively: a connection resets at hour nine, a target quietly tightens its rate limits, a selector that matched yesterday returns nothing today, and nobody finds out until the morning. Closing that gap is less about clever bypasses and more about a few unglamorous habits.

Throttling and retries

Speed is a tradeoff, not a setting

Tune throttling for the real tradeoff between crawl speed and getting banned rather than hammering at a fixed rate. Scrapy's AutoThrottle adapts the delay to the server's observed latency, which both behaves better and survives longer than a hardcoded concurrency number.

Handle failures at the right layer. A try/except around your parse code never sees a network-level failure, because the request fails before your callback runs. You need an errback to catch DNS errors, timeouts, and connection resets, and a helper like get_retry_request to requeue with proper accounting instead of silently dropping the URL.
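The throttling half of this advice maps onto a handful of Scrapy settings. The setting names are Scrapy's own; the values are illustrative starting points under the assumption of a moderately protected target, not recommendations.

```python
# settings.py (fragment): adaptive throttling plus bounded retries

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0           # ceiling when the server slows down
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote site
AUTOTHROTTLE_DEBUG = False              # set True to log every delay adjustment

CONCURRENT_REQUESTS_PER_DOMAIN = 8      # hard cap that AutoThrottle works beneath
DOWNLOAD_TIMEOUT = 30                   # fail fast instead of hanging for minutes

RETRY_ENABLED = True
RETRY_TIMES = 3                         # retries per request on top of the first try
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]
```

These settings only cover responses and timeouts the retry middleware sees; the errback described above is still needed for failures that should be requeued with custom accounting.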

Structure and monitoring

Catch a bad run in minutes, not the next morning

Give the project a structure that scales past a handful of spiders without becoming unmanageable, so shared logic lives in one place rather than copy-pasted across files.

Then monitor outcomes, not just uptime. A tool like Spidermon validates each run against expectations (item counts, field coverage, error rates) and alerts when a run drifts, so a half-broken crawl is caught in minutes instead of discovered the next day. For deployment, the ladder runs from Docker to Scrapyd to a managed cloud, and scrapy-redis earns its place only once you genuinely need a distributed, shared request queue across workers, not before.

A concrete case worth internalising. A public Reddit scraper ran at a 92% success rate, then over thirty days, with zero code changes, fell to 61%. The logs showed the same thing on every failure: HTTP 403 on every retry, every proxy, every subreddit. The retry logic was fine, proxy rotation was fine, the user-agent set was varied. The cause was upstream of all of it: the shared residential proxy pool had been fingerprinted and burned against www.reddit.com, and you cannot out-rotate a poisoned pool. The fix was a single hostname change to old.reddit.com, the same JSON API with the same response shape, but served by a subdomain whose bot-detection thresholds are far looser because most human traffic left it years ago. Success snapped back to 92% with zero retries.

Three lessons compound out of that one incident. First, the architectural fix (change the endpoint) beats the tactical fix (add more retries) almost every time a whole proxy pool is burned. Second, before fighting a target, check whether it exposes an old. or m. subdomain whose WAF rules differ from the main host. Third, and most important for unattended systems, the platform never sends you a deprecation notice: your failure-rate chart is the notice. The same scraper later survived Reddit removing its public JSON endpoints entirely by falling back to RSS feeds behind a circuit breaker, with the degraded fields tagged honestly in the output rather than silently passed off as complete. Monitoring is what turns a silent collapse into a five-minute alert and a planned fallback.

A quiet throughput fact when a real browser is your fetch layer: connections are pooled and capped per origin. A browser does not open a fresh TCP and TLS connection for every request. It keeps idle connections alive in a per-origin pool and reuses them, and Chromium caps concurrency at six simultaneous connections to a single origin (a limit inherited from the HTTP/1.1 era and still enforced there; HTTP/2 multiplexes many streams over one connection instead). For a scraper this has two practical consequences worth designing around rather than discovering by surprise. First, it bounds your real parallelism against one host: against an HTTP/1.1 origin, firing fifty page loads through one browser context does not give you fifty concurrent fetches, it gives you six at a time with the rest queued (over HTTP/2 the requests share one multiplexed connection instead), so genuine scale comes from spreading work across origins, contexts, or workers, not from asking one context to do more. Second, reused connections carry state, cookies, cached negotiation, and warmed TLS, which is part of what makes a coherent session look human (the session-stickiness section leans on exactly this), but it also means a connection you keep open is a connection whose earlier context travels with it, so a clean per-session boundary means letting the pool cycle rather than forcing unrelated work down a socket that is still mid-conversation. The practical rule is simply to respect the browser's own connection model instead of fighting it: parallelise across origins, keep one logical session on its own pool, and let keep-alive do the work it is good at.
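One way to design around the cap is to make the scheduler honour it explicitly, so queued work for one origin never starves another. A sketch using asyncio semaphores keyed by origin; the limit of six mirrors the figure above, and `fetch` stands in for whatever coroutine actually loads the page.

```python
import asyncio
from urllib.parse import urlsplit

PER_ORIGIN_LIMIT = 6


class OriginLimiter:
    """Allow at most `limit` in-flight fetches per scheme://host origin."""

    def __init__(self, limit: int = PER_ORIGIN_LIMIT):
        self._limit = limit
        self._semaphores = {}

    @staticmethod
    def origin(url: str) -> str:
        parts = urlsplit(url)
        return f"{parts.scheme}://{parts.netloc}"

    async def run(self, url: str, fetch):
        key = self.origin(url)
        if key not in self._semaphores:
            self._semaphores[key] = asyncio.Semaphore(self._limit)
        async with self._semaphores[key]:
            return await fetch(url)
```

Fifty URLs on one origin run six at a time through `limiter.run(url, fetch)`, while URLs on other origins proceed in parallel on their own semaphores.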

Structure the scraper so an agent can fix one selector, not rewrite the project

The self-healing pattern earlier in this guide focused on regenerating a crawler and proving the regeneration is correct. There is a quieter architectural choice that decides how cheap and how safe that healing can be: how the spider is laid out in the first place. A scraper written as one long callback, where fetching, pagination, and field extraction are tangled together, forces any repair (by a human or an agent) to reason about the whole thing at once. Break the same logic into Page Objects, one small class per page type whose only job is to turn a response into fields, and a broken site becomes a broken method, not a broken project.

Why agents love this shape

A small blast radius is a fixable blast radius

When a redesign moves the price selector, the failure is localised to one Page Object's price field. An agent (or a person) can be handed just that class, its test, and the new HTML, and asked to repair that one extractor, with no risk of collaterally rewriting pagination or the request logic that still works. The Scrapy ecosystem packages this as web-poet and scrapy-poet: Page Objects plus dependency injection, so each page type is an isolated, swappable unit. The smaller the unit a fix touches, the easier it is to trust the fix, which is the same principle the promotion-gauntlet card argues from the verification side.
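The shape can be shown without any framework. The sketch below is a plain-Python Page Object, not the web-poet API: one class per page type, one method per field, with crude regular expressions standing in for real selectors purely to keep the example self-contained. The class and field names are illustrative.

```python
import re
from dataclasses import dataclass


@dataclass
class ProductPage:
    """Turns one product page's HTML into fields, and does nothing else."""
    html: str

    def title(self):
        match = re.search(
            r'<h1[^>]*class="product-title"[^>]*>(.*?)</h1>', self.html, re.S
        )
        return match.group(1).strip() if match else None

    def price(self):
        match = re.search(
            r'<span[^>]*class="price-now"[^>]*>\s*([^<]+?)\s*</span>', self.html
        )
        return match.group(1) if match else None

    def to_item(self) -> dict:
        return {"title": self.title(), "price": self.price()}
```

When a redesign renames the price class, only `price()` starts returning None; the title extractor, pagination, and request logic are untouched, so the repair request can be scoped to that one method.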

Tests are the trigger and the guardrail

Per-field tests turn healing into a closed loop

Pair each Page Object with tests that assert the shape of what it should return, and monitor field coverage in production (what fraction of expected fields actually came back populated). A coverage drop on one field is both the alarm that triggers healing and the acceptance test the fix has to pass before it ships. That closes the loop: detect the specific broken field, repair the one extractor that owns it, prove it green against the test, deploy. Without the structure, healing means regenerating everything and hoping; with it, healing is a targeted edit a monitor can request and a test can sign off.
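Field coverage is simple enough to compute directly. The sketch below is a standalone illustration rather than Spidermon's API; the function names and the per-field thresholds are assumptions.

```python
def field_coverage(items: list, fields) -> dict:
    """Fraction of items in which each expected field came back populated."""
    total = len(items)
    coverage = {}
    for name in fields:
        filled = sum(1 for item in items if item.get(name) not in (None, "", []))
        coverage[name] = filled / total if total else 0.0
    return coverage


def broken_fields(coverage: dict, thresholds: dict) -> list:
    """Fields whose coverage fell below their minimum: the healing work list."""
    return sorted(
        name for name, minimum in thresholds.items()
        if coverage.get(name, 0.0) < minimum
    )
```

The same thresholds serve twice: a field that drops below its minimum triggers the repair, and the repaired extractor has to bring that field back above it before it ships.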

The throughline with the rest of this section: agentic maintenance does not replace good engineering, it rewards it. The more modular and tested the scraper, the smaller and more trustworthy each automated repair, and the less often a broken selector turns into a rewritten project or a silent gap in your data.

From prompting to loops: the self-correcting maintenance cycle

There is a mental-model shift behind all of the self-healing material in this guide that is worth stating plainly, because it changes how you build rather than just which tool you reach for. The first instinct with an LLM is to prompt it: describe the broken scraper, paste the HTML, ask for a fix, copy the answer back. That is a one-shot transaction with a human in the middle of every cycle. The more durable pattern is to close the loop: wire generation, execution, benchmarking, and review into a cycle that runs on its own and only surfaces to a human when it cannot resolve something. "Write a loop, not a prompt" is the compressed version, and it is the difference between an assistant you operate and a system that maintains itself.

What the loop actually contains

Generate, run, benchmark, review, repeat

A working self-healing loop is more than a model that rewrites code. It generates a candidate fix, runs it against the live target, benchmarks the result against what a known-good run produced (record counts, field coverage, value sanity), and only then lets a separate reviewer pass judgement, with the failed cases feeding back in as the next iteration's input. The model that writes is not the authority on whether the write is correct; the benchmark and the reviewer are. This is the same promotion-gauntlet idea from the AI Workflow, expressed as a continuously running cycle rather than a one-time check, and it pairs with the context-blind reviewer point from that section: the agent that scores the regenerated code should not be the agent that wrote it.
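The cycle can be written as a small driver with every stage injected as a callable. This is a structural sketch only: `generate`, `run`, `benchmark`, and `review` are hypothetical hooks for the model call, the live execution, the comparison against a known-good run, and the separate reviewer.

```python
from typing import Callable, Optional


def heal(
    generate: Callable[[Optional[list]], str],
    run: Callable[[str], dict],
    benchmark: Callable[[dict], list],
    review: Callable[[str, dict], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Return the first candidate that passes benchmark and review, else None."""
    feedback = None
    for _ in range(max_rounds):
        candidate = generate(feedback)   # the writer proposes a fix
        output = run(candidate)          # execute against the live target
        failures = benchmark(output)     # empty list means the benchmark passed
        if not failures and review(candidate, output):
            return candidate
        # Failed checks become the next round's input.
        feedback = failures or ["reviewer rejected the candidate"]
    return None
```

The writer never decides whether its own output ships: acceptance lives entirely in `benchmark` and `review`, and a `None` result is the signal to escalate.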

Spidermon as the loop's trigger and gate

Monitoring is what makes the loop autonomous

A loop needs a signal to fire on and a bar to clear, and in the Scrapy world Spidermon supplies both. As monitoring it watches each run for the drift that should wake the loop up (item counts collapsing, a field's coverage dropping, error rates climbing); as a validation suite it is the acceptance test a regenerated scraper has to pass before its output is trusted. That is what turns "an agent can fix scrapers" into a system that notices it needs fixing, attempts the fix, proves the fix, and ships it without a human kicking off each step. The boring monitoring layer is not a footnote to the clever self-healing, it is the part that makes the self-healing autonomous at all.

Hardening the loop: what a production self-healer needs that a demo does not

The architecture above describes the happy path: detect drift, regenerate, verify, ship. A version you can leave running unattended needs four more things, each of which maps to a lesson elsewhere in this guide. The gap between a self-healer that demos well and one that survives a quarter in production is almost entirely in how it behaves when it cannot heal, when the fetch itself is the problem, and when the loop could run forever.

1 · Circuit breakers

The loop must be able to give up

An autonomous regeneration loop without a stop condition is a way to spend an unbounded amount of money on a target that has simply decided to block you. Every healing attempt needs a bounded budget: a maximum number of regeneration rounds per URL, a cost ceiling per heal (model calls plus fetches), and a wall-clock deadline. When a breaker trips, the loop does not keep trying, it freezes the last known-good extractor, marks the URL degraded in the graph, and escalates to the human gate with the diff it could not resolve. A self-healer that cannot give up is not autonomous, it is just expensive.
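The three limits named above fit in one small object. A sketch with illustrative defaults; the class name, the dollar ceiling, and the ten-minute deadline are assumptions, not values from a real deployment.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HealBudget:
    max_rounds: int = 3
    max_cost_usd: float = 0.50
    deadline_s: float = 600.0
    rounds: int = 0
    spent_usd: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record one regeneration round and what it cost (model calls plus fetches)."""
        self.rounds += 1
        self.spent_usd += cost_usd

    def tripped(self) -> Optional[str]:
        """Name of the limit that was hit, or None if healing may continue."""
        if self.rounds >= self.max_rounds:
            return "rounds"
        if self.spent_usd >= self.max_cost_usd:
            return "cost"
        if time.monotonic() - self.started >= self.deadline_s:
            return "deadline"
        return None
```

The loop checks `tripped()` before each round; any non-None answer means freeze the last known-good extractor, mark the URL degraded, and hand the unresolved diff to a human.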

2 · Distinguish a broken selector from a blocked fetch

Do not regenerate code to fix a 403

The single most common way a self-healer wastes money is rewriting a perfectly good parser because the page it parsed was a block page, a challenge interstitial, or a soft 404. Before the loop is ever allowed to regenerate extraction code, it must classify why the run failed: empty fields on a 200 that is really a CAPTCHA wall are a fetch problem (escalate the rung via the Thompson bandit), not a selector problem. Wiring the status-code and soft-404 signals from the post-extraction section into the loop's entry condition stops it from healing the wrong layer. The community-detection step already separates structural change from vendor flip from drift; this adds the fourth case, not-actually-broken, which is the one that burns the most compute when missed.
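A rough sketch of that entry condition follows. The marker strings, the category names, and the function `classify_failure` are assumptions for illustration; a real classifier would use whatever block-page and soft-404 signals your pipeline already collects.

```python
# Substrings that suggest a block page or a not-found page. Illustrative only.
BLOCK_MARKERS = ("captcha", "verify you are human", "access denied", "unusual traffic")
NOT_FOUND_MARKERS = ("page not found", "no longer available")


def classify_failure(status: int, body: str, filled_fields: int, expected_fields: int) -> str:
    """Decide which layer failed before any extraction code is regenerated."""
    text = body.lower()
    if status in (401, 403, 429) or any(m in text for m in BLOCK_MARKERS):
        return "blocked_fetch"      # escalate the fetch rung, leave the parser alone
    if status in (404, 410) or any(m in text for m in NOT_FOUND_MARKERS):
        return "not_found"          # the page is gone; nothing to heal
    if status >= 500:
        return "server_error"       # retry later
    if filled_fields < expected_fields:
        return "selector_drift"     # the only case that justifies regeneration
    return "ok"
```

Only a `selector_drift` result should let the loop spend model calls on new extraction code; every other outcome is routed elsewhere.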

3 · Verify the fetch, not just the output

A clean payload over a leaky session still fails tomorrow

The adversarial verifier checks whether the extracted data is correct. It should also check whether the way the data was fetched is sustainable, because a row that arrived today through a session that looks automated is a row that will arrive as a block next week. Two cheap signals feed back into the control plane. The first is rung-level coherence: the TLS profile, HTTP/2 settings, and User-Agent must name the same current browser, and a stale or mismatched fingerprint is treated as a soft failure even when it returned data. The second is request-cadence sanity: a session whose timing and cache behaviour do not resemble a real browser is flagged, because that is the same server-side timing signal a detector keys on. Feeding fetch-health back means the bandit demotes a rung that is technically returning 200s but accumulating risk, before that risk becomes a ban-rate spike the PID loop has to chase.
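The coherence half of that check can be approximated with string parsing alone. The sketch below assumes impersonation profiles are named in the `chrome131` style and that the caller supplies the current stable major version; the function names and the `max_lag` default are hypothetical. Profile names it cannot parse are treated as incoherent rather than guessed at.

```python
import re
from typing import Optional, Tuple


def parse_user_agent(user_agent: str) -> Optional[Tuple[str, int]]:
    """Extract (family, major version) from a Chrome or Firefox User-Agent."""
    m = re.search(r"(Chrome|Firefox)/(\d+)", user_agent)
    return (m.group(1).lower(), int(m.group(2))) if m else None


def parse_profile(profile: str) -> Optional[Tuple[str, int]]:
    """Extract (family, major version) from a profile name such as 'chrome131'."""
    m = re.fullmatch(r"(chrome|firefox)(\d+)", profile)
    return (m.group(1), int(m.group(2))) if m else None


def fetch_is_coherent(profile: str, user_agent: str, current_major: int, max_lag: int = 4) -> bool:
    """True when TLS profile and User-Agent name the same, reasonably current browser."""
    tls = parse_profile(profile)
    ua = parse_user_agent(user_agent)
    if tls is None or ua is None or tls != ua:
        return False                       # mismatched or unrecognised fingerprint
    return current_major - tls[1] <= max_lag  # stale versions count as a soft failure
```

A `False` result would be logged as a soft failure against the rung even if the request returned data.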

4 · Assume the target is adapting to you

Traps, poisoning, and the anti-agentic page

A mature defender does not just change its layout, it engineers pages that behave differently under automation: honeypot fields, content that mutates when it detects instrumentation, and values seeded to poison a scraper that does not validate. The loop needs two defences. The KL-divergence drift signal already catches silent value poisoning before it reaches gold. Against traps that make the loop chase its own tail (a page that looks broken only to an agent, so every regeneration fails for reasons the model cannot see), the circuit breaker from point one is the backstop: if N context-blind regenerations all fail the benchmark in different ways, the correct move is not an N+1th attempt but escalation to a human who can recognise a trap the loop structurally cannot. Agentic maintenance does not end the arms race, it moves it up a level.

There is also a branch the conceptual diagram leaves out: not every target is HTML. When a URL resolves to a PDF, a scanned document, or an image, the same control plane applies but the extraction node changes shape. The rung is not curl-versus-browser, it is the document-parsing hybrid from the post-extraction section, a deterministic OCR-plus-rules pass and an LLM or vision pass run together, with their agreement as the confidence score the verifier consumes. The graph memory, the Beta-distribution confidence, the KL drift check, and the human gate all work unchanged on that branch; only the thing inside the fetch box differs.

The throughline across all four: a production self-healer spends most of its engineering not on the clever regeneration but on knowing when not to trust itself, when the failure is the fetch and not the parser, when an extraction is poisoned rather than correct, when a fingerprint is decaying, and when to stop and call a human. The math operations give it the signals; the circuit breakers and the fetch-health feedback are what keep those signals from being used to confidently automate the wrong thing at scale.

13 Cost & economics

Build vs buy:
the number that decides

The most common mistake is treating "can I bypass it" as the only question. The real production question is "what does each successful record cost, and is rolling my own cheaper than paying someone else." Here is the honest math, with the caveat that exact prices move constantly, so treat these as orders of magnitude, not quotes.

Self-hosted stealth stack vs managed API

Approach	Monthly cost driver	Rough cost (USD)	Best when
Self-hosted: HTTP + curl_cffi + residential proxies	Proxy bandwidth (the dominant cost), small server	Proxy bandwidth at roughly 3 to 8 USD/GB, plus ~50 to 200 USD/mo compute	High volume on targets that yield to TLS impersonation, where you control bandwidth use
Self-hosted: Camoufox/CloakBrowser cluster + residential proxies	Server CPU/RAM for browsers, plus proxy bandwidth (browsers burn far more GB)	~200 to 1,200 USD/mo compute, plus heavy proxy bandwidth; engineering time to maintain	Hardened targets that need a real browser, at volumes where per-request API fees would exceed infra cost
Managed API (ScraperAPI, Zyte, Bright Data Web Unlocker, Scrapfly)	Per successful request, anti-bot handling included	Roughly 1 to 5 USD per 1,000 requests (more for JS-render / hard targets)	Low-to-medium volume, or hard targets where engineering time is worth more than the per-request fee

The break-even rule of thumb: Managed APIs win until volume gets high enough that per-request fees dwarf the cost of running your own infra. A worked example: at 1 USD per 1,000 requests, 10 million requests/month is ~10,000 USD on a managed API. A self-hosted stack handling the same volume might run a few hundred USD in compute plus 1,000 to 3,000 USD in proxy bandwidth, call it 2,000 to 4,000 USD all-in, if you have the engineering time to build and babysit it. Below roughly 1 to 2 million requests/month, the managed API is almost always cheaper once you price in your own hours. Above 10 million, self-hosting usually wins on raw cost. In between, it depends on how hardened the target is and how much your time costs.
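The worked example can be turned into a few lines of arithmetic so you can plug in your own numbers. The figures used in the usage note are taken from the ranges in this section where they exist; the engineer-day rate is purely an assumption, and the function names are illustrative.

```python
import math


def managed_cost(requests_per_month: int, usd_per_1k: float) -> float:
    """Monthly bill on a per-request managed API."""
    return requests_per_month / 1000 * usd_per_1k


def self_hosted_cost(compute_usd: float, proxy_usd: float,
                     engineer_days: float, usd_per_engineer_day: float) -> float:
    """Monthly self-hosted bill, including the maintenance time people forget."""
    return compute_usd + proxy_usd + engineer_days * usd_per_engineer_day


def break_even_requests(fixed_usd: float, self_usd_per_1k: float, managed_usd_per_1k: float) -> float:
    """Monthly volume above which self-hosting is cheaper.

    fixed_usd is the self-hosted cost that does not scale with volume
    (compute plus engineering time); self_usd_per_1k is the variable
    proxy cost per 1,000 requests.
    """
    margin_per_request = (managed_usd_per_1k - self_usd_per_1k) / 1000
    if margin_per_request <= 0:
        return math.inf  # the managed API is cheaper at every volume
    return fixed_usd / margin_per_request
```

For instance, `managed_cost(10_000_000, 1.0)` gives the 10,000 USD from the example. With 300 USD of compute, four engineer-days at an assumed 800 USD each, and proxy spend of 0.2 USD per 1,000 requests, `break_even_requests(3500, 0.2, 1.0)` lands at roughly 4.4 million requests a month, inside the "it depends" band described above.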

The cost everyone forgets: maintenance. A managed API absorbs every anti-bot change for you. A self-hosted Camoufox cluster does not: when Akamai ships a new sensor build or your fingerprint starts leaking, that is your weekend. Price your own time into the comparison. A stack that is 3,000 USD/mo cheaper but eats four engineer-days a month is not actually cheaper. For many teams the right answer is hybrid: managed API for the hard or low-volume targets, self-hosted for the high-volume easy ones.

The bill is mostly waste, not price

Most people model scraping cost as proxy price times bandwidth and stop there. The real bill is hiding in inefficiency, and it is usually the bigger number. The mental shift that saves the most money is to stop thinking in cost per gigabyte and start thinking in cost per successful record. A pipeline with a low success rate pays for every failure twice: once in the wasted request, once in the retry.

Driver 1 · Retries and rotation

Failure compounds quietly

Unnecessary retries and a blunt rotation strategy open a steady gap between the cost you expected and the cost you actually pay. A blocked request that triggers two retries, both of which also fail, has tripled its bandwidth for zero data. Before buying more proxy, fix the success rate; it is almost always cheaper to stop failing than to pay for the failures.
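To make "cost per successful record" concrete, here is a small model under one loud assumption: every attempt succeeds independently with the same probability. Real block rates are correlated (a burned IP keeps failing), so treat this as a lower bound on waste rather than a forecast. The function names are illustrative.

```python
def expected_requests(success_rate: float, max_retries: int) -> float:
    """Average requests sent per record when failures are retried up to max_retries times."""
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    fail = 1.0 - success_rate
    return sum(fail ** k for k in range(max_retries + 1))


def eventual_success(success_rate: float, max_retries: int) -> float:
    """Probability a record is obtained within the retry budget."""
    return 1.0 - (1.0 - success_rate) ** (max_retries + 1)


def cost_per_successful_record(cost_per_request: float, success_rate: float, max_retries: int) -> float:
    """Spend on all attempts, divided by the records actually obtained."""
    return (cost_per_request * expected_requests(success_rate, max_retries)
            / eventual_success(success_rate, max_retries))
```

Under this model the result simplifies to `cost_per_request / success_rate`, so a pipeline at a 50% success rate pays double per record no matter how the retries are arranged. Raising the success rate moves the number; adding retries does not.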

Driver 2 · Proxy-type mismatch

Mobile IPs on an easy target

Each proxy type is a different point on the speed, cost, and trust curve. Mobile IPs are expensive precisely because they are highly trusted, so they earn their price on a hardened target. Pointing them at a low-protection site burns money for no extra success. Match the proxy tier to how hard the wall actually is, and do not run one universal strategy across every target.

Driver 3 · Loading the whole page

You are paying for images and fonts

When a headless browser loads a page it also pulls images, scripts, video, and fonts you never parse. If the data lives in structured JSON, use an HTTP request and skip the render entirely. Reserve the browser for pages that genuinely need JavaScript. Blocking heavy resource types on the requests you do render is one of the cheapest wins available.

Driver 4 · Third-party bleed

One GB on target, five on its trackers

A modern page calls a pile of third-party services as it loads. I have seen a single GB spent on the real target drag in several more GB of analytics, ad, and widget traffic routed through the same proxy. Block the domains you do not need, and give each website its own proxy credentials so a runaway bill shows up as one obvious outlier instead of hiding in a shared total.
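Drivers 3 and 4 come down to the same mechanism: a request filter in front of the browser. The sketch below keeps the decision in a pure function and shows one way to wire it into a Playwright sync-API page via `page.route`. The set of blocked resource types and the host allow-list are assumptions you would tune per target; stylesheets and scripts are left alone here because blocking them can break pages that need them to render.

```python
from urllib.parse import urlsplit

# Resource types that cost bandwidth but are rarely parsed. Illustrative choice.
HEAVY_TYPES = frozenset({"image", "media", "font"})


def should_block(resource_type: str, url: str, allowed_hosts: frozenset) -> bool:
    """Block heavy resource types and anything not served by an allowed host."""
    host = urlsplit(url).hostname or ""
    first_party = any(host == h or host.endswith("." + h) for h in allowed_hosts)
    return resource_type in HEAVY_TYPES or not first_party


def install_blocking(page, allowed_hosts: frozenset) -> None:
    """Attach the filter to a Playwright page (assumes Playwright is installed)."""
    def handle(route):
        request = route.request
        if should_block(request.resource_type, request.url, allowed_hosts):
            route.abort()       # never leaves the machine, so no proxy bandwidth
        else:
            route.continue_()
    page.route("**/*", handle)
```

Calling `install_blocking(page, frozenset({"shop.example"}))` before `page.goto(...)` would drop images, video, fonts, and all third-party traffic for that hypothetical target.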

The largest cost of all never appears on an invoice, because it is salary. The four drivers above are the ones you can see in a bill. The one you cannot see is the senior engineer who keeps resurrecting the same scraper because it breaks every other week. That time is real money, it is usually the biggest line, and it shows up as headcount rather than spend, so it escapes the comparison entirely. The honest unit is cost per usable document: every dollar of proxy, compute, retries, and human maintenance divided by the count of records you could actually trust and use. Model it that way and the rankings often invert. The source that looked cheapest by the gigabyte turns out to eat an engineer alive in babysitting, while the one that looked expensive runs untouched for months and quietly wins on cost per document. You cannot manage what you have not measured, and the thing most teams have never measured is the one costing them the most.

A perspective worth holding while you read the rest of this guide

This guide is largely about how to get through defences, so it is worth surfacing the strongest argument against treating that as the main event. The case goes like this: bypassing anti-bot systems is a treadmill. You ship a workaround, the target adapts, and you spend the next sprint on infrastructure that produced no business value, running hard to stay in place, and the day a site flips its detection stack overnight you are back to the start. By this view the teams genuinely winning at data collection are not the ones with the cleverest bypasses, they are the ones who abstracted that complexity away so their best engineers spend their time on what the data is for, not on how to fetch it without getting blocked. The pointed version of the question: when did "how do we bypass this?" become more important to your team than "what are we actually going to do with this data?"

It is a fair challenge and mostly correct as a default. It is also not absolute, which is the honest other half. Sometimes the bypass is the business: the data is not available any other way, no managed provider covers the target, the margin only works if you control the cost per document yourself, or the capability is the moat. The resolution is not "always build" or "always buy," it is to decide deliberately which hard problems deserve your strongest people. For most teams, on most targets, the answer is to abstract the fetching away (a managed API, a maintained library, the cheapest rung that works) and spend the saved attention on the data itself. Reserve the deep stealth engineering for the targets where access is genuinely the differentiator. The rest of this guide gives you the depth for when you need it; this note is the reminder to be honest about how often that actually is.

A warning about the one clean number: cost per thousand requests

It is tempting to reduce a target to a single figure, what an operator pays per thousand requests, carried out to a fraction of a cent. It looks authoritative and it makes comparisons easy. It is also quietly dishonest, and it is worth understanding why before you trust one. The inputs that feed that number (solver prices, proxy costs, the detection tier a site happens to be running) are opaque and swing constantly. Doing precise arithmetic on inputs like that produces a figure whose decimal places are invented: false precision dressed up as measurement.

The FAIR Institute, which argues about exactly this in cyber-risk quantification, has the cleanest framing: many "quantitative" scores are really ordinal judgements with numbers painted on, where the digits could be swapped for colours or the words high and low and nothing would change. You cannot legitimately multiply red by yellow, and running multiplication and division over solver-price guesses and proxy-cost estimates is the same move, math on a colour scale. The practical takeaway is not to stop estimating cost, it is to estimate it honestly: prefer a range over a false-precision point value, treat any per-request figure as a snapshot that decays the moment a vendor changes pricing or a target changes its detection tier, and keep the number anchored to the one thing you can actually measure, the cost per validated payload on your target, from your own runs. A wide honest range beats a narrow invented one, because a single confident decimal invites decisions the underlying data cannot support.

14 After the bypass

What happens to
50 million rows

Bypassing detection is the part everyone writes about. But getting the data is only the start. The questions that actually decide whether a scraping operation survives are about what you do next: where the data lands, how you avoid storing the same thing twice, and how you notice when a site quietly starts feeding you garbage.

Storage: stop dumping JSON into a folder

Format

Parquet, not CSV or JSON

For anything past a few hundred thousand rows, columnar Parquet beats row formats decisively: 5 to 10x smaller on disk through column compression, and analytics engines (Athena, DuckDB, Spark) only read the columns a query touches. A 50M-row crawl that is 40GB as JSON is often under 5GB as Parquet, and queries run an order of magnitude faster. Partition by crawl date or source so you can prune whole files without scanning them.

Layout

A simple medallion layout

Land raw responses untouched in a bronze layer (S3/GCS, exactly as scraped, your audit trail and replay source). Clean and normalise into silver (typed, deduplicated, schema-validated Parquet). Build query-ready aggregates in gold (the tables analysts and APIs actually hit). When a parser bug ships, you re-run silver from bronze without re-scraping, which on a hardened target can save you weeks and a lot of proxy spend.

Deduplication: the same item will arrive many times

URL-level

Seen-URL sets & Bloom filters

A large crawl revisits the same URL constantly through pagination loops, cross-links, and retries. Holding every seen URL in an in-process Python set works until the set outgrows memory, and it cannot be shared between workers. A Bloom filter checks membership in fixed memory with a small, tunable false-positive rate. scrapy-redis's RFPDupeFilter takes a different route: it stores request fingerprints in a Redis set, which is exact but grows with the crawl. For distributed crawls, either structure held in shared Redis keeps every worker honest so two workers do not fetch the same page.
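For intuition, here is a minimal in-process Bloom filter using the standard sizing formulas. It is a teaching sketch under simple assumptions (SHA-256 with double hashing, one filter per process); in production you would more likely reach for a maintained library or a Redis-backed structure.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-memory set membership with a tunable false-positive rate."""

    def __init__(self, capacity: int, error_rate: float = 0.001):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.size = max(8, math.ceil(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.hashes):
            yield (h1 + i * h2) % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```

Sized for 10,000 URLs at a 0.1% error rate, the filter occupies under 20 KB regardless of URL length. A "no" answer is always correct; a "yes" is wrong at roughly the configured rate, which for a crawler means occasionally skipping a page it had not actually seen.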

Content-level

Hash the content, not the URL

The harder duplicate is the same product reachable via three different URLs (tracking params, locale prefixes, canonical vs vanity). URL dedup misses these. Compute a stable hash over the meaningful fields (a normalised product ID, or a hash of the cleaned record), and dedup on that. Keep a last_seen timestamp so you can tell a genuine update apart from a re-scrape, and upsert rather than blindly insert.
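A sketch of content-level dedup with an upsert, using a plain dict as the store so it runs anywhere. The field names (`product_id`, `price`, `title`) and the function names are assumptions for the example; the same shape maps onto a database upsert keyed by the content hash.

```python
import hashlib
import json


def content_key(record: dict, fields: tuple) -> str:
    """Stable hash over the chosen fields, insensitive to case and stray whitespace."""
    normalised = {f: str(record.get(f, "")).strip().lower() for f in fields}
    payload = json.dumps(normalised, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def upsert(store: dict, record: dict, key_fields: tuple, value_fields: tuple, seen_at: str) -> str:
    """Insert a new item, refresh last_seen on a re-scrape, or apply a genuine update."""
    key = content_key(record, key_fields)            # identity: which item is this?
    fingerprint = content_key(record, value_fields)  # state: has anything changed?
    existing = store.get(key)
    if existing is None:
        store[key] = {"record": record, "fingerprint": fingerprint,
                      "first_seen": seen_at, "last_seen": seen_at}
        return "inserted"
    existing["last_seen"] = seen_at
    if existing["fingerprint"] != fingerprint:
        existing["record"] = record
        existing["fingerprint"] = fingerprint
        return "updated"
    return "unchanged"
```

The same product arriving from three different URLs produces one row: the URL is not part of the key, so tracking parameters and locale prefixes stop mattering.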

Data poisoning: when the bypass succeeds but the data is fake

Passing the bouncer does not mean the data is real. Sophisticated sites detect scraper-like behaviour after the initial check and respond not by blocking you but by quietly serving poisoned data: subtly wrong prices, shuffled listings, fabricated rows, or stale snapshots. You get clean 200 responses and a healthy-looking dataset that is silently corrupt. This is more dangerous than a block, because a block is obvious and poisoned data flows straight into your decisions.

Detection

How to catch poisoning

Maintain a small hand-verified ground-truth sample (a few dozen records you check manually) and diff every crawl against it. Watch for statistical anomalies: prices that are suspiciously round, distributions that shift overnight, or the same record varying between two near-simultaneous requests from different IPs. Cross-check a sample against the mobile API or a logged-in view. If two clean sessions disagree on the same field, suspect poisoning before you trust the data.
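Two of those checks are simple enough to sketch directly: diffing a crawl against the hand-verified sample, and comparing two sessions that fetched the same records. The data shapes (dicts keyed by record ID) and function names are assumptions for illustration, and the ground-truth sample itself needs periodic re-verification because real values change too.

```python
from typing import Dict, List, Tuple


def diff_against_truth(crawl: Dict[str, dict], truth: Dict[str, dict],
                       fields: Tuple[str, ...]) -> List[Tuple[str, str, object, object]]:
    """Return (record_id, field, expected, actual) for every mismatch or missing record."""
    mismatches = []
    for record_id, expected in truth.items():
        got = crawl.get(record_id)
        for name in fields:
            actual = None if got is None else got.get(name)
            if actual != expected.get(name):
                mismatches.append((record_id, name, expected.get(name), actual))
    return mismatches


def sessions_disagree(session_a: Dict[str, dict], session_b: Dict[str, dict],
                      fields: Tuple[str, ...]) -> List[Tuple[str, str]]:
    """Return (record_id, field) pairs where two near-simultaneous sessions differ."""
    return [(record_id, name)
            for record_id in sorted(set(session_a) & set(session_b))
            for name in fields
            if session_a[record_id].get(name) != session_b[record_id].get(name)]
```

A non-empty result from either function is a reason to hold the batch out of silver until someone has looked at it.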

Cause

Why it happens to fast scrapers

Poisoning is usually triggered by behaviour, not fingerprint: requesting pages far faster than a human could, hitting endpoints in a non-human order, or ignoring the rendering layer entirely. The fix is the same boring discipline that prevents bans: pace requests, vary timing, follow realistic navigation paths, and do not request the whole catalogue in five minutes. Slow, human-shaped scraping gets real data; greedy scraping gets fed lies.

A clean 200 is not the same as a clean dataset. Modern firewalls rarely bother to hard-block you anymore. They return 200 OK with a degraded body: an empty list, sanitised prices, a tarpit that drips bytes forever, or yesterday's snapshot frozen in place. If your scraper treats the status code as the success signal, you will log thousands of green requests and ship corrupt data. Validate the shape and the statistics, not the status. The question is never "did I get a 200", it is "did I get roughly N records like yesterday, with prices in the range I expect, and the fields I depend on populated". Wire that check into the pipeline so a silent drop to ten percent fill rate pages you instead of flowing downstream.
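"Validate the shape and the statistics, not the status" can start as a single function run on every batch before it moves downstream. The thresholds, the `price` field name, and the problem labels below are assumptions; the point is that the check returns a list of named problems your alerting can page on.

```python
from typing import List, Sequence, Tuple


def validate_batch(records: Sequence[dict], required_fields: Sequence[str],
                   expected_count: int, price_range: Tuple[float, float],
                   min_fill: float = 0.9, count_tolerance: float = 0.5) -> List[str]:
    """Return a list of problems; an empty list means the batch looks like yesterday's."""
    problems = []
    # 1. Did we get roughly the number of records we expected?
    if expected_count and abs(len(records) - expected_count) / expected_count > count_tolerance:
        problems.append("record_count")
    # 2. Are the fields we depend on actually populated?
    for name in required_fields:
        filled = sum(1 for r in records if r.get(name) not in (None, ""))
        rate = filled / len(records) if records else 0.0
        if rate < min_fill:
            problems.append("fill_rate:" + name)
    # 3. Are numeric values inside the range we expect for this target?
    low, high = price_range
    prices = [r["price"] for r in records if isinstance(r.get("price"), (int, float))]
    if any(p < low or p > high for p in prices):
        problems.append("price_out_of_range")
    return problems
```

A batch of green 200s that returns `["record_count", "fill_rate:title"]` is exactly the silent degradation described above, caught before it ships.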

One operational habit that pays for itself: keep a small Targets table as the single source of truth for every endpoint you depend on, recording the endpoint path, required headers, cookie shape, the schema version you last validated against, and the expected record count. When a site renames /api/v2/ to /api/v3/ or reshapes a field, you change one row instead of hunting through scraper code. It costs half a day to write and saves you the morning you would otherwise spend discovering, after the fact, that a fortnight of data is empty.

The status code is a two-way signal: soft 404s, crawl budget, and hallucinated URLs

The earlier point was that a 200 OK can hide failure on the way in, when a tarpit feeds you poisoned data. The same dishonesty runs the other way too, and it is worth understanding from both chairs because it shapes how crawlers (yours, Google's, and an LLM's) treat a URL.

Many JavaScript frameworks, by default, answer a request for a URL that does not exist by rendering a "not found" view and still returning 200 OK in the header. The page says the content is gone; the server says everything is fine. Google calls that mismatch a soft 404, and it is not harmless. A real 404 or 410 tells a crawler to stop coming back. A soft 404 keeps the dead URL in the queue, because the server never admits the page is gone, so the crawler keeps re-fetching nothing.

The attack your own default config enables

Crawl-budget drain as negative SEO

If a framework hands a 200 to every undefined route, an attacker does not need to touch the site at all. Point bots at thousands of made-up URLs on the domain; each one renders, returns 200, and consumes crawl resources that should have gone to real pages. On a large site the real content gets discovered slower. That is a negative technical SEO attack made possible by the server never saying "gone." The fix is boring, which is the point: return a real 404 or 410 for routes that do not resolve, and verify the rendered status in Search Console rather than how the page looks in a browser. The browser hides the header; the crawler reads it.

The newer, stranger layer

When a 200 confirms a hallucinated URL

Large language models invent URLs the same way they invent facts. One audit of model-suggested login links for major brands found roughly a third pointed at domains the brands did not own, many unregistered and waiting to be claimed. The HTTP status is the honest answer in this exchange. A proper 404 tells the machine the page never existed. When a server answers a hallucinated URL with 200 OK, it does the opposite: it confirms the fiction and tells the model its guess was right. For a scraper consuming model-suggested targets, the lesson is direct: a 404 or 410 is reliable evidence that a URL does not exist, but a 200 is not evidence that it does, because the status line is exactly what a soft 404 gets wrong. Confirm the target from the body, not from a page that merely returns 200 and looks populated.
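One way a scraper can apply this to model-suggested URLs is to trust a 404 or 410 outright and accept a 200 only when the body contains something it expected to find. In the sketch below, `fetch` is an injected callable returning `(status, body)` so the logic is independent of any HTTP client; the function name, the three labels, and the `must_contain` check are assumptions for illustration.

```python
from typing import Callable, Tuple


def check_target(url: str, fetch: Callable[[str], Tuple[int, str]], must_contain: str) -> str:
    """Classify a candidate URL as 'missing', 'present', or 'unverified'."""
    status, body = fetch(url)
    if status in (404, 410):
        return "missing"        # the server says it is gone; believe it
    if status == 200 and must_contain.lower() in body.lower():
        return "present"        # a 200 plus the content we came for
    return "unverified"         # a 200 alone proves nothing; so do 3xx, 403, 5xx
```

Anything that comes back `unverified` stays out of the crawl queue until a stronger signal confirms it.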

When the data is not HTML: parsing documents

A large share of the data people actually need does not live in a DOM you can select. It is locked inside PDFs, scanned invoices, receipts, statements, and image-only files, where there is no clean JSON API to hit and no selector to write. This is its own extraction problem that sits downstream of fetching, and it is worth treating as a first-class part of the pipeline rather than an afterthought, because it is where a surprising amount of real-world value (and error) lives. The naive approach (run OCR, then parse the raw text with regex) breaks the moment a layout shifts, a scan is skewed, or a total wraps onto two lines.

The durable pattern

Hybrid validation: two parsers, one confidence score

The pattern that holds up on messy real documents runs two extractors and cross-checks them rather than trusting either alone. A deterministic OCR-plus-rules parser (an engine like EasyOCR or Tesseract, then field rules) gives a cheap, reproducible baseline with per-token confidence. An LLM or vision-language model reads the same document and returns structured JSON against your schema, handling the messy cases rules cannot: skewed scans, odd layouts, missing labels. You then compare the two outputs and derive a confidence score from how much they agree. Where they match, you trust the value; where they diverge or both score low, you flag it. Neither half is reliable on its own (the rules are brittle and the model can hallucinate a plausible total), but the agreement between them is a far better signal than either confidence in isolation.

Why it matters downstream

Confidence routing beats silent failure

The point of scoring every extraction is what you do with the score: route low-confidence documents to human review instead of letting a wrong number flow silently into your dataset. This is the document-parsing version of the data-poisoning lesson above: a model that confidently returns grand_total: 369963 from a blurry receipt is its own kind of soft 404, output that looks clean and is wrong. A schema-consistent JSON contract, a numeric confidence on every field, and an explicit human-in-the-loop lane for the low-confidence tail are what turn document parsing from a demo into something you can trust at volume. The same discipline applies whether the input is a scraped PDF, an emailed invoice, or a photographed receipt.
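The agree-or-flag rule and the routing step fit in one function. This sketch assumes each extractor hands back `field -> (value, confidence)` with confidence in [0, 1]; the normalisation, the `min_conf` threshold, and the `accept`/`review` labels are illustrative choices, not a prescribed scoring scheme.

```python
from typing import Dict, Tuple


def normalise(value: object) -> str:
    """Compare values loosely: ignore case, whitespace, and thousands separators."""
    return "".join(str(value).split()).lower().replace(",", "")


def route_fields(rules: Dict[str, Tuple[object, float]],
                 model: Dict[str, Tuple[object, float]],
                 min_conf: float = 0.6) -> Dict[str, Tuple[str, object]]:
    """Accept a field only when both extractors agree and at least one is confident."""
    decisions = {}
    for name in sorted(set(rules) | set(model)):
        r_val, r_conf = rules.get(name, (None, 0.0))
        m_val, m_conf = model.get(name, (None, 0.0))
        agree = (r_val is not None and m_val is not None
                 and normalise(r_val) == normalise(m_val))
        both_low = max(r_conf, m_conf) < min_conf
        if agree and not both_low:
            decisions[name] = ("accept", r_val)
        else:
            decisions[name] = ("review", None)   # human-in-the-loop lane
    return decisions
```

With the blurry-receipt case from the text, a rules pass that reads `3699.63` and a model that returns `369963` disagree, so `grand_total` goes to review no matter how confident the model claims to be.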

15 Mobile API Scraping

Intercept mobile app traffic
before it hits any anti-bot

Mobile APIs often serve the same data as the web with weaker protection: frequently no Cloudflare challenge and no browser fingerprinting. Intercept the traffic once, then replicate the call for as long as the endpoint and its auth stay unchanged.

Why mobile APIs? The same data served to a mobile app often sits behind a simpler auth layer than the web. No browser fingerprinting, just a clean JSON endpoint you can call directly from Python.

Install Android Studio + create a Virtual Device

Open Android Studio → Virtual Device Manager → Create Device. Pick any phone that shows the Play Store icon. For the system image, choose any API level above 28. Do not choose Android 9 (Pie / API 28); the rooting script does not support it. API 30 (Android 11) is a safe default.

💡 Any Android 10+ image works. Start the AVD and confirm it boots before proceeding.

Root the AVD using rootAVD

AVDs are not rooted by default. Root access lets HTTP Toolkit install its CA certificate into the system trust store, which is what allows it to intercept HTTPS traffic. The rootAVD script handles everything in one command.

git clone https://github.com/newbit1/rootAVD.git
cd rootAVD

# Verify AVD is accessible
adb shell

# List your AVDs
./rootAVD.sh ListAllAVDs

# Copy the first command from the output and run it
# e.g: ./rootAVD.sh system-images/android-30/google_apis_playstore/x86_64/ramdisk.img

💡 adb not found? Add to ~/.zshrc: alias adb='/Users/$USER/Library/Android/sdk/platform-tools/adb'

Confirm root: Magisk appears in the app drawer

After rootAVD finishes, the AVD reboots automatically. Once it's back up, open the app drawer and look for the Magisk app; its presence confirms root is working. Zygisk does not need to be enabled.

💡 If the AVD didn't reboot itself, reboot it manually. No Magisk = root failed, re-run the script.

Install HTTP Toolkit and connect via ADB

Download from httptoolkit.com or install via Homebrew. Open it → Intercept tab → "Android device via ADB". HTTP Toolkit detects your running AVD, and the AVD prompts you to grant superuser rights. Grant them.

# macOS
brew install --cask http-toolkit

💡 "System trust disabled" warning? Disconnect and reconnect in HTTP Toolkit, or reboot the AVD.

Install the target app and capture its requests

Sign into Google Play on the AVD and install the app, or download the APK from apk.support and drag-drop it onto the emulator. Open the app, navigate through it (lists, detail pages, search) while HTTP Toolkit runs. Switch to the View tab: every request the app makes is captured in real time.

💡 Use the filter bar: you'll see 400+ requests but only ~10 are the data endpoints. Filter by the target domain name.

Replicate the API call in Python

Click any intercepted request to see full headers, auth tokens, and query parameters. Test in Postman first to confirm it returns data, then replicate in Python. Mobile APIs return clean JSON, no HTML parsing needed.

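from curl_cffi import requests

# The endpoint, header names, and values below are placeholders. Copy the real
# ones from the request you captured in HTTP Toolkit.
resp = requests.get(
    "https://api.targetapp.com/v2/listings",
    headers={
        "Authorization": "Bearer <token_from_http_toolkit>",
        "X-App-Version": "4.2.1",
        "User-Agent": "TargetApp/4.2.1 (Android 11; SDK 30)",
        "Accept": "application/json",
    },
    # Sends a Chrome TLS/HTTP2 fingerprint. The real app most likely uses its own
    # HTTP client, so this will not match the app's handshake if the API checks TLS.
    impersonate="chrome120",
    timeout=30,
)
resp.raise_for_status()  # fail loudly on 401/403 instead of parsing an error body
data = resp.json()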

💡 Tokens expire. Check whether the app refreshes its token on login, and build a token refresh step into your scraper.
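A token refresh step can be as small as a cache with an expiry. In this sketch the `refresh` callable is hypothetical: it stands in for whatever login or refresh request you observed the app making, and is assumed to return `(token, lifetime_seconds)`. The class name and the 60-second safety margin are likewise illustrative.

```python
import time
from typing import Callable, Optional, Tuple


class TokenManager:
    """Cache a bearer token and renew it shortly before it expires."""

    def __init__(self, refresh: Callable[[], Tuple[str, float]],
                 skew_seconds: float = 60.0,
                 clock: Callable[[], float] = time.time):
        self._refresh = refresh          # hypothetical: replays the app's login/refresh call
        self._skew = skew_seconds        # renew this long before the real expiry
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, refreshing if it is missing or about to expire."""
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            token, lifetime = self._refresh()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token

    def invalidate(self) -> None:
        """Call after a 401 so the next get() fetches a fresh token."""
        self._token = None
```

The scraper would build its `Authorization` header from `manager.get()` on every request and call `invalidate()` whenever the API answers 401.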

✓ Works well for

Property portals, classifieds, marketplaces
Apps where the web version is heavily protected
Data only available in the mobile app
Targets using simple Bearer token auth
Any app that doesn't pin SSL certificates

✗ Limitations

Apps with SSL pinning block interception
Some apps crash on rooted devices
ARM-only apps may not run on x86 emulators
Tokens expire, need refresh logic in scraper
App updates can silently change endpoints

↑ SSL Pinning bypass

If the app blocks interception it likely uses SSL pinning. Use Frida or objection to bypass it at runtime, or use Burp Suite with the Xposed + TrustMeAlready module for a more permanent bypass.

Stack: Android Studio, rootAVD (github.com/newbit1/rootAVD), Magisk, HTTP Toolkit, apk.support, Postman, curl_cffi

When the signature lives in native code

Interception gets you the requests. On a hardened app it does not get you the signatures. Replay a captured call against an app like Shopee and the backend answers with an error, because every request carries anti-fraud headers built by code you cannot read in JADX. The signing method is there in name only, declared native, with its body on the far side in compiled ARM inside a .so library. This is where most people give up on the mobile route. It is also where the mobile route gets genuinely durable, because once you reproduce the signing you are no longer tied to a running app at all.

The decision that shapes everything: can you rebuild the signer, or do you have to borrow it? Read the native library and you land in one of two cases. If it is readable crypto, you reverse it and reimplement it in Python, then sign offline at any volume with no app in the loop. If it is a bytecode virtual machine you cannot practically rewrite, you keep the app running and drive its own signer as an oracle your scraper calls. Both beat the amateur route of a rooted phone you have to babysit, because a signer or an oracle runs on a server inside your scraper. The first is cleaner; the second always works. You pick based on what the library actually is, which means you have to open it first.
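To show what "reimplement it in Python" tends to look like in the readable-crypto case, here is an entirely hypothetical scheme: HMAC-SHA256 over a canonical string, Base64-encoded into a header. Nothing here describes any real app. The canonical layout, the header names `X-Timestamp` and `X-Signature`, and the key are all invented for the example; the actual recipe is whatever you read out of the native library.

```python
import base64
import hashlib
import hmac
from typing import Dict


def canonical_string(method: str, path: str, params: Dict[str, str],
                     timestamp: int, body: bytes) -> bytes:
    """Hypothetical layout: method, path, sorted query, timestamp, body hash.

    Query values are joined without URL-encoding to keep the sketch short.
    """
    query = "&".join("%s=%s" % (k, params[k]) for k in sorted(params))
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, query, str(timestamp), body_hash]).encode("utf-8")


def sign_request(key: bytes, method: str, path: str, params: Dict[str, str],
                 timestamp: int, body: bytes = b"") -> Dict[str, str]:
    """Return the signature headers for one request, computed offline."""
    message = canonical_string(method, path, params, timestamp, body)
    mac = hmac.new(key, message, hashlib.sha256).digest()
    return {
        "X-Timestamp": str(timestamp),                          # hypothetical header name
        "X-Signature": base64.b64encode(mac).decode("ascii"),   # hypothetical header name
    }
```

Once a function like this reproduces the app's output byte for byte on a handful of captured requests, signing no longer needs a device or an emulator in the loop.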

Four tools, one job each

The native-reversing toolchain

androguard is fast static recon. It lists the .so libraries an app ships and finds which classes declare native methods. Structure you can script, not readable source.

JADX decompiles Dalvik back to Java. It is how you read the managed side and find the exact class and method that crosses into native code. It stops at the native keyword, the handoff point.

Ghidra is the NSA's open framework. It disassembles a .so and decompiles it to pseudo C. It is the only tool here that reads native code, so the work centers on it. Run it headless so the workflow scripts cleanly and repeats exactly.

Frida injects a JS engine into the running process so you can hook and call functions live, and confirm your static reading against what the app actually does.

Two tells worth knowing before you start

Reading the library

Model the layers first. Managed Java and Kotlin code builds the request and calls into native methods. A set of .so libraries does the sensitive work. Interceptors attach the anti-fraud headers. Name the layers before you open anything, so you target the one library that matters instead of all of them.

A small library is a good sign. A signing lib of a few hundred kilobytes has little room for a heavy obfuscator, and auto analysis that finishes fast with zero decompile failures tells you the binary is not packed or virtualised. C++ symbols that survived give you the function names for free.

Some values are not in the file. A fixed AES IV can live in .bss, which is zero-filled on disk and only set at runtime. Hash, HMAC, and Base64 modes do not care because their output is fully determined by input and key. The AES modes do, so you read that one value from the live process once and bake it in.

The payoff scales past the one app. Most apps that protect their API at all push the work into a native library, and a large share of those turn out to be plain, readable crypto you can rebuild byte for byte. Work it out once and it is the same move on the next app you open. Pair this with an agent that can drive the disassembler and you compress days of manual tracing into a guided loop, which is the next section's territory.

16 Plain English

Scraping jargon
in simple terms

Every term that makes scraping documentation confusing, explained with an analogy.

Network layer

TLS Fingerprint

How your browser "shakes hands" when connecting securely. Chrome and Firefox shake hands differently, so a server can tell them apart before you send a single header. Analogy: recognising someone by the way they shake hands, firm, soft, awkward.

Network layer

HTTP Fingerprint

The order and style of your HTTP headers. A bot might say "I'm Chrome" but forget to include headers Chrome always sends. Analogy: like a boarding pass, if your name and flight number don't match the expected pattern, it's suspicious.

Network layer

TCP/IP Fingerprint

Looks at how your computer sends and receives internet packets. Windows and Linux send packets with subtle differences. Analogy: recognising someone's hometown by their accent, you didn't ask, they just gave it away by how they talk.

Browser layer

Canvas Fingerprint

Website secretly asks your browser to draw a hidden picture. Each GPU renders it slightly differently, that difference is your fingerprint. Analogy: asking 10 artists to draw the same tree, each drawing is unique even with the same instructions.

Browser layer

WebGL Fingerprint

Uses 3D graphics rendering to identify your GPU and driver. Same browser, different hardware, different fingerprint. Analogy: recognising a car engine by its sound, same model, but each engine has subtle variations you can hear.

Browser layer

Device Fingerprint

Collects OS, fonts, screen size, timezone, plugins, and battery level: everything about your setup, combined into a unique profile. Analogy: identifying someone by their full outfit, hairstyle, voice, and habits together. Change one thing and the combination is still unique.

Behavioural

Behavioural Analysis

Watching how you type, scroll, and move your mouse. Bots move in straight lines at constant speed; humans are messy and inconsistent. DataDome runs 35 behavioural signals in real time. Analogy: a security guard watching body language: not what you say, but how you move.

Challenge

Dynamic Challenges

The website throws mini-tests to check if you're real: CAPTCHA, Turnstile, proof-of-work puzzles. Kasada changes them constantly so you can't pre-solve them. Analogy: a teacher changing exam questions mid-test to catch cheaters.

Network layer

IP Reputation

Whether your IP address is associated with known bots, VPNs, datacenters, or abuse. Datacenter IPs are instantly flagged. Residential IPs from real ISPs get the highest trust. Analogy: your home address appearing on a blacklist. The doorman knows before you knock.

Compliance signal

robots.txt

A text file at /robots.txt telling crawlers which paths to skip. Works on voluntary compliance only. Googlebot and GPTBot respect it. Commercial scrapers send a Chrome User-Agent and walk straight past it. Analogy: a "staff only" sign. Anyone who cares about signs obeys it. Anyone who does not care walks in anyway.
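For the crawlers that do choose to comply, the check itself is small. The sketch below uses Python's standard `urllib.robotparser` against a made-up robots.txt; the file contents, the `my-crawler` user agent, and the `allowed` helper are assumptions for illustration. A real crawler would load the file from the target host with `set_url()` and `read()` instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example needs no network.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: GPTBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, path: str) -> bool:
    """Return True if the parsed rules let this user agent fetch the path."""
    return parser.can_fetch(user_agent, path)


print(allowed("my-crawler", "/products/1"))
print(allowed("my-crawler", "/private/report"))
print(allowed("GPTBot", "/products/1"))
print(parser.crawl_delay("my-crawler"))
```

Nothing in this check is enforced by the server. It only changes what a cooperative client decides to request, which is the whole point of the "staff only" analogy.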

17 Community

Where scrapers
talk to each other

The best scraping techniques rarely come from documentation; they come from people who've already hit the same wall you're hitting. These communities are where the real knowledge lives.

18 Legal, ethics & reality

The part most guides
conveniently skip

Bypassing detection is only half the question. The other half is whether you should, whether it is legal where you operate, and whether your approach survives contact with reality. This section is deliberately honest about the limits of everything above.

This is not legal advice, but you need to think about these

Legal exposure

Scraping publicly available data has been broadly upheld in some jurisdictions (e.g. the US hiQ v LinkedIn line of cases), but violating a site's Terms of Service, bypassing an authentication wall, or accessing non-public data can expose you to breach-of-contract or computer-misuse claims (CFAA in the US, the Computer Misuse Act in the UK, and equivalents elsewhere). "Technically possible" and "legally safe" are different questions. Logging in and then scraping is a materially different legal posture than scraping anonymous public pages.

Privacy law

GDPR, CCPA & personal data

If your scrape collects personal data of EU/UK residents, GDPR applies regardless of where you operate, and you likely become a data controller with obligations (lawful basis, retention limits, subject rights). The same logic applies to CCPA in California and a growing list of regional laws. Scraping names, profiles, reviews, or contact details is not the same as scraping product prices. Treat personal data as radioactive unless you have a clear lawful basis.

Ethics & sustainability

robots.txt, rate limits & not being a jerk

robots.txt is not legally binding in most places, but ignoring it is a signal of intent and increasingly cited in disputes. Beyond law: hammering a small site with thousands of requests can degrade service for real users and is the fastest way to get your IP ranges burned across an entire provider. Pace your requests, cache aggressively, scrape during off-peak hours, and take only what you need. Sustainable scraping is slow scraping.

Jurisdiction

Where you are matters

The legality of an identical scrape can differ between the country you operate from, the country the target is hosted in, and the country whose residents' data you collect. A technique that is low-risk in one jurisdiction can carry real exposure in another. If you are operating commercially or at scale, this is a question for an actual lawyer in the relevant jurisdictions, not a guide on the internet.
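The pacing and caching advice in the ethics card above can be sketched in a few lines. This is a minimal illustration, not a production client: `PoliteFetcher`, its parameters, and the stand-in fetch function are all hypothetical names, and the clock and sleep functions are injectable only so the demo runs without waiting or touching the network.

```python
import time
from typing import Callable, Dict, List, Optional


class PoliteFetcher:
    """Serve repeat URLs from an in-memory cache and space out real requests."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        min_interval: float = 2.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._cache: Dict[str, str] = {}
        self._last_request: Optional[float] = None

    def get(self, url: str) -> str:
        if url in self._cache:  # cache hit: no request, no delay
            return self._cache[url]
        if self._last_request is not None:
            wait = self._min_interval - (self._clock() - self._last_request)
            if wait > 0:
                self._sleep(wait)  # keep at least min_interval between requests
        body = self._fetch(url)
        self._last_request = self._clock()
        self._cache[url] = body
        return body


# Demo with a stand-in fetch function and a fake clock.
calls: List[str] = []
slept: List[float] = []
now = [0.0]


def fake_fetch(url: str) -> str:
    calls.append(url)
    return f"<html>{url}</html>"


def fake_sleep(seconds: float) -> None:
    slept.append(seconds)
    now[0] += seconds


fetcher = PoliteFetcher(fake_fetch, min_interval=2.0, clock=lambda: now[0], sleep=fake_sleep)
fetcher.get("https://example.com/a")
fetcher.get("https://example.com/a")  # served from cache
fetcher.get("https://example.com/b")  # waits before the second real request
print(calls, slept)
```

In real use the cache would usually be persistent and keyed on more than the URL, and the interval would follow any crawl delay the site publishes. The structure stays the same: check the cache first, then wait, then fetch.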

The line is moving: agents, robots.txt, and the end of a clean binary

For twenty years the ethical and technical model rested on a tidy split: humans on one side, bots on the other, with robots.txt as the polite fence between them. That split is dissolving, and it changes how you should think about both the ethics and the defences. The scale of the shift is no longer hypothetical: in June 2026 Cloudflare reported that automated traffic had crossed the majority line for the first time in the web's history, at 57.5% of HTTP requests to HTML content against 42.5% from humans, driven mostly by agentic AI. Read the number carefully: it measures crawlable web content rather than every packet. The direction is still unambiguous: the web is now a machine-to-machine environment as much as a human one, which is the whole reason this skill set stopped being niche.

There is a second number underneath the first that matters even more for anyone who scrapes. Cloudflare's crawl-to-refer ratio measures how many pages a platform crawls for every visitor it sends back. In 2026 the AI crawlers sit at extraordinary imbalances: Anthropic's ratio is reported in the thousands to tens of thousands of pages crawled per referral, and other AI operators sit in the hundreds to low thousands, against a traditional search engine like Google at roughly five to one. Read these as dated, volatile snapshots: the published figure for a single operator has swung from six figures to low five figures inside a year and moves month to month, so treat any specific ratio as a reading rather than a constant. The structure is what is stable: the dominant readers of the web now consume vastly more than they return, which is exactly why so much of the web is hardening against automated reading at the same time as automated reading becomes the majority of traffic. That tension (more machines reading, more defences raised against them) is the backdrop the rest of this guide operates in.

Start with the uncomfortable truth about robots.txt: it was never a security control. It is a request, and only the bots that choose to listen ever obeyed it. That was a workable social contract when the only things crawling the web were search engines and the occasional scraper. It breaks the moment an autonomous agent with a set of tools and a goal is involved. Give an agent the instruction to gather something and a browser to do it with, and a blocked default user agent does not stop it. It problem-solves around the block, picks a tool that blends in, and keeps going, without ever being told to evade. The obedience was always voluntary, and agents do not share the assumptions that made it hold.

The binary is breaking

"Human or bot" no longer maps cleanly

When an agent browses on behalf of a real person, which is it? The platforms themselves are blurring the line: browser and automation vendors have started shipping agent modes where navigator.webdriver reports false, the same value a human-operated browser returns. The signal that reliably meant "automated" for a decade is being switched off from inside the platform, not defeated from outside. A defender can no longer treat the presence of automation as proof of bad intent, and a scraper can no longer assume that "looking automated" is what gets it blocked.

What defence now requires

Two layers, not one fence

If you sit on the defending side of this, a single fence is no longer enough. Protecting content from agentic collection now takes two layers working together: traditional anti-automation (fingerprinting, rate limiting, browser challenges) and behavioural controls that reason about intent rather than identity. Either one alone leaves a gap, and an agent with options is built to find gaps. The honest framing is that this is an adversarial pressure problem, not a classification problem with a clean answer.

Why this matters for you, on either side. The adversarial dynamic has a cost that lands on real people. When sophisticated operators mimic legitimate browsers well enough to poison a scoring system, the classifier does not just start failing on bots, it starts failing on genuine users too: the Firefox visitor handed an unsolvable puzzle, the real customer who quietly gives up and leaves. The bot did not cost that business the sale; the over-tuned defence did. Whichever side of this you build on, aim for proportionality: collect what you have a defensible reason to collect, defend in a way that does not punish the humans you are trying to serve, and treat the agentic shift as a reason to be more careful, not less.

The legislative response: transparency, not technical blocking

Because robots.txt was never enforceable, the pressure has moved from the robots.txt file to the statute book and the contract. Two 2026 developments mark the direction. In the United States, New York passed a Stealth Crawler Prohibition Act (through the state Senate and Assembly, awaiting the governor's signature at the time of writing), which targets bots that scrape news content while evading detection. It would make it an offence to damage, impair, or burden the operation of a covered news site, let aggrieved publishers subpoena a service provider to identify an alleged violator, and allow them to seek injunctions and damages. In the United Kingdom, a Private Members' Bill, the Automated Online Software (Access and Transparency) Bill, backed by the News Media Association, takes a deliberately narrower line. It does not try to regulate AI models or dictate behaviour; it requires that a bot accessing a site and taking content disclose who it is and what it will do with what it takes. Private Members' Bills rarely become law without government backing, but they shape the bills that follow.

The throughline matters more than any single bill. The thing being criminalised or regulated is not scraping; it is deception: a bot that hides its identity or its purpose, that impersonates a human or a legitimate crawler (a fake Googlebot), or that retrieves paywalled articles it was never granted. The same report behind the majority-machine number found that a large share of bot traffic is unwanted or unverifiable, and publishers are not waiting for legislation: some have started replacing robots.txt bans with search-only contracts in their terms of service, so they can invoice per article scraped and pursue it as a contract matter rather than fight a long copyright case. For a scraper the practical lesson is concrete and forward-looking: identifying yourself honestly, respecting an explicit licence or paywall, and being able to say what you collect and why is moving from etiquette toward a legal posture. The operators most exposed to the coming rules are precisely the stealth-crawler pattern this guide describes how to detect, not the ones who scrape openly and within terms.
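On the defending side, the fake-Googlebot case is one of the few impersonations with a cheap, well-known check: forward-confirmed reverse DNS. The sketch below is a minimal version under stated assumptions. The function name, the injectable resolver parameters, and the demo addresses (from the 192.0.2.0/24 documentation range) are hypothetical, the accepted suffixes are limited to `googlebot.com` and `google.com`, and the default resolvers handle IPv4 only.

```python
import socket
from typing import Callable, Dict, List

# Assumption: hostnames of genuine Googlebot addresses end in one of these.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(
    ip: str,
    reverse_lookup: Callable[[str], str] = lambda addr: socket.gethostbyaddr(addr)[0],
    forward_lookup: Callable[[str], List[str]] = lambda host: socket.gethostbyname_ex(host)[2],
) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm the forward lookup."""
    try:
        host = reverse_lookup(ip).rstrip(".").lower()
        if not host.endswith(GOOGLE_SUFFIXES):
            return False
        # A PTR record alone can be forged; the hostname must resolve back.
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False


# Demo with fake resolvers so the example runs offline.
PTR: Dict[str, str] = {
    "192.0.2.10": "crawl-192-0-2-10.googlebot.com",
    "192.0.2.20": "crawl-192-0-2-20.googlebot.com",  # forged PTR record
    "192.0.2.99": "host.example.net",
}
A: Dict[str, List[str]] = {
    "crawl-192-0-2-10.googlebot.com": ["192.0.2.10"],
    "crawl-192-0-2-20.googlebot.com": ["198.51.100.7"],
}

genuine = is_verified_googlebot("192.0.2.10", PTR.__getitem__, A.__getitem__)
forged = is_verified_googlebot("192.0.2.20", PTR.__getitem__, A.__getitem__)
impostor = is_verified_googlebot("192.0.2.99", PTR.__getitem__, A.__getitem__)
print(genuine, forged, impostor)
```

The User-Agent header plays no part in the check, which is why a scraper claiming to be Googlebot from an unrelated network fails it regardless of how well the rest of its fingerprint is tuned.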

The three tiers of AI access, and where the pressure actually is

The publisher-facing side of this has stratified into three layers worth knowing, because they explain who the new rules actually touch. At the top are declared "good bots" that identify themselves and pass through infrastructure gatekeepers (Cloudflare, Fastly, Akamai). In the middle are mixed-use crawlers, bots used for both ordinary search and AI training or agentic tasks, historically waved through as "search" because that was their original purpose. At the bottom, and by volume the largest, is the gray scraping economy: traffic that does not declare itself and does not play by robots.txt at all.

Two shifts in 2026 matter for anyone operating here. First, the gatekeepers are moving against the middle tier: from September 2026 Cloudflare's defaults for mixed crawlers are set to allow search but block AI-training and agent use on pages carrying ads, ending the "free pass" that let a search crawler quietly double as a training crawler. Since roughly a fifth of the web sits behind Cloudflare, even leaving those defaults untouched raises the marginal cost of crawling and puts a soft price on access. Second, and more pointed for this guide, defenders have stopped pretending robots.txt is the battleground. One practitioner who reverse-engineers scraping tools estimates that on major news brands, while 20 to 30 percent of traffic is identified crawlers, roughly a quarter of all traffic is stealth crawlers mimicking human users, traffic most publishers never even classify as bots.

The defensive playbook that follows is explicitly an economic one, and it is the mirror image of everything in this guide. Rather than tweaking robots.txt and hoping, the advice to publishers is to turn on the highest-friction defences available: client-side human-check challenges on first page load, systematic blocking of non-essential crawlers, and hard benchmarking of any vendor claiming over 90 percent detection. The stated aim is to make scraping unreliable and expensive enough that the intermediaries are forced into licensing talks. That is the strategic backdrop to the whole detection stack in this guide: the gray economy exists because access is worth more than the friction currently costs, and both sides are now deliberately trying to move that balance. Reading the room, the durable position for a scraper is the same as the legal section's: declared, within-terms access is the tier the rules are being written to protect, and the stealth tier is the one they are being written to squeeze.

Read this before you copy anything above

Anti-bot systems are probabilistic, not deterministic. Modern systems (Akamai, Cloudflare, DataDome, Kasada, HUMAN) are machine-learning systems running adaptive scoring, per-customer configurations, and continuous experiments. That means: the same fingerprint can pass today and fail tomorrow, two identical sessions can get different treatment, and a technique that worked on one customer of a vendor can fail on the next. Nothing in this guide is a guaranteed bypass. It is a description of what worked, on specific targets, at a specific time.

This advice has a shelf life. The more widely a bypass technique spreads, the faster vendors adapt to it, so the most public techniques are often the first to die. Treat dated case studies (each is timestamped) as a snapshot, not a permanent recipe. The durable value here is the detection theory and the decision process, not any single tool or fingerprint. Understanding why a layer fires lets you adapt when the specific recipe stops working. Copying a fingerprint without understanding it is cargo-cult scraping, and it ages badly.

You probably need less than this guide implies. This guide is optimised for hardened targets (Akamai v3, Kasada, F5 Shape). Most scraping jobs are not that. A huge number of sites still yield to a plain HTTP client with proper headers, session reuse, and sensible pacing, with no patched browser, no mobile proxies, and no JA4 spoofing. Always start at the cheapest rung of the ladder (the decision flow near the top of this guide) and only escalate when a target actually forces you to. The expensive stealth stack is a last resort, not a default. Your results will also vary enormously by target category: a strategy tuned for ecommerce can fail completely on airline pricing, ticketing, sneaker drops, or social platforms.
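The "cheapest rung first" rule is easy to encode so that escalation is a decision the code makes per target rather than a default. The sketch below is illustrative only: `fetch_with_escalation`, the rung names, and the three stand-in fetchers are hypothetical, and each real rung would wrap whatever client you actually use at that level.

```python
from typing import Callable, List, Optional, Tuple

# A fetcher returns the page body, or None when the target blocks it.
Fetcher = Callable[[str], Optional[str]]


def fetch_with_escalation(url: str, ladder: List[Tuple[str, Fetcher]]) -> Tuple[str, str]:
    """Try each rung from cheapest to most expensive; stop at the first success."""
    for name, fetch in ladder:
        body = fetch(url)
        if body is not None:
            return name, body
    raise RuntimeError(f"every rung failed for {url}")


# Stand-in fetchers that record which rungs were actually used.
calls: List[str] = []


def plain_http(url: str) -> Optional[str]:
    calls.append("plain_http")
    return None  # pretend the target blocks a bare HTTP client


def tls_client(url: str) -> Optional[str]:
    calls.append("tls_client")
    return "<html>ok</html>"  # pretend TLS impersonation is enough here


def patched_browser(url: str) -> Optional[str]:
    calls.append("patched_browser")
    return "<html>ok</html>"


ladder = [
    ("plain HTTP client", plain_http),
    ("TLS-impersonating client", tls_client),
    ("patched browser", patched_browser),
]
rung, body = fetch_with_escalation("https://example.com/", ladder)
print(rung, calls)
```

Recording the winning rung per target is worth the extra line: it tells you which sites really need the expensive stack and which ones you have been overpaying for.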

19 The arms race

From IP bans
to transformer ML

Every bypass technique was born as a direct response to a specific detection innovation. The escalation explains why each tool exists.

2004

Selenium born for QA testing. WebDriver protocol. Zero anti-bot thinking; scraping wasn't a threat concept yet.

2017

Puppeteer launches. First CDP Chrome automation. Anti-bots respond with IP reputation + User-Agent checks. Defeated by: proxy rotation + fake UA strings.

2018–2020

JS fingerprinting era. Canvas hash, WebGL, navigator.webdriver=true. playwright-stealth emerges. Playwright 2020, Microsoft, cross-browser. F5 acquires Shape Security for $1 billion.

2021–2022

TLS fingerprinting mainstream. JA3 at CDN edges. Python httpx identified by hash. curl_cffi emerges. DataDome + Akamai add behavioural scoring. PerimeterX + HUMAN Security merge, creating a 29,650-site network.

2023–2024

JA4+ standardisation. Cloudflare Rust crate. Chrome randomises extensions, breaks JA3. Camoufox and CloakBrowser emerge as first C++ binary solutions. DataDome WASM boring_challenge. Scrapling 38K stars.

2025

ML + OS-level signals. HTTP/2 SETTINGS frames. TCP TTL. Transformer ML on micro-timing at <1ms. DataDome 85K models. Browserbase $40M at $300M valuation. Firecrawl 60K stars. ScrapingBee acquired by Oxylabs.

2026

C++ binary patches are the baseline. JS injection is obsolete: toString() exposes all patches. Camoufox 100% pass rate. Firecrawl 111K stars. webclaw Rust + 10 MCP tools. JA4+ universal. Market: $7.5B → $38B by 2034. The arms race is AI vs AI.

★ You made it

Thank you for reading.

This is everything I know about web scraping in 2026: every detection layer, every anti-bot system, every library, every architecture I've actually built or used in production over the last seven years.

If even one section saved you a late night of debugging, that's why I wrote it.

Build something interesting with this. And if you do, I'd genuinely love to hear about it.

Asad Ikram

Data Engineer · Scraping specialist · Lahore, PK


