They built walls.
I spent 7 years finding doors.
I started scraping in 2018. Since then I have worked across five companies, built hundreds of production spiders, and fought every major anti-bot system that exists. This guide is everything that actually worked.
The scraping
decision flow
Walk the steps in order and stop at the first win. Complexity and cost increase from left to right. Most production scraping is solved at steps 1–3.
_abck, cf_clearance, datadome, reese84), identifies sensor payload endpoints, and tells you which step from the flow below will actually work for this target. What used to be a 4-hour manual walk through HTTP history is now a 2-minute prompt.
Frida · mitmproxy
chompjs · Parsel
Scrapy · Scrapling
CloakBrowser
Zyte · Firecrawl
The six steps above tell you what order to try. But to know which step to stop at, and why skipping ahead costs you days, you first need to understand how the detection actually works. Let's go deeper.
Before you send a single byte,
you've already been judged.
The moment your scraper opens a TCP connection to a CDN, a fingerprinting pipeline triggers. By the time your HTTP request body arrives, four independent scoring systems have already assigned you a trust score. Here's exactly what each one measures, and why defeating just one is never enough.
Layer 1, TLS Fingerprinting: The Handshake That Betrays You
This fires before a single HTTP byte is exchanged. Understanding it is non-negotiable.
TLS Version + Cipher Suites + Extensions + Elliptic Curves + Curve Formats
This produced a stable 32-char hex hash. Python's requests library has always had the same JA3 hash. Every major anti-bot catalogued it. By 2021, your Python scraper was identifiable before the first HTTP header.
JA3's weakness: Chrome started randomising TLS extension order in 2022. Same browser, different JA3 every session. The fingerprint became unstable and unreliable.
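To make the weakness concrete, here is a minimal sketch of how a JA3-style hash is derived and why reordering extensions changes it. The numeric values are illustrative placeholders, not a real client's ClientHello, and `ja3_hash` is a helper name chosen for this example.

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Join the five ClientHello fields JA3-style and return the MD5 hex digest."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    return hashlib.md5(",".join(fields).encode("ascii")).hexdigest()


# Illustrative values only (assumption): not captured from any real browser.
version = 771
ciphers = [4865, 4866, 4867, 49195, 49199]
extensions = [0, 23, 65281, 10, 11, 35, 16, 5, 13, 43, 45, 51]
curves = [29, 23, 24]
point_formats = [0]

original = ja3_hash(version, ciphers, extensions, curves, point_formats)

# Same extensions in a different order, as a randomising browser would send them.
reordered = ja3_hash(version, ciphers, list(reversed(extensions)), curves, point_formats)

# Sorting before hashing removes the dependence on order.
sorted_a = ja3_hash(version, ciphers, sorted(extensions), curves, point_formats)
sorted_b = ja3_hash(version, ciphers, sorted(reversed(extensions)), curves, point_formats)

print(original != reordered)  # order-sensitive
print(sorted_a == sorted_b)   # order-insensitive once sorted
```

The sorted variant is only there to show the idea of an order-independent fingerprint; it is not a JA4 implementation.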
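JA4 format: t13d1516h2_8daaf6152771_b0da82dd1658
t = TCP transport (q = QUIC), 13 = TLS 1.3, d = SNI is a domain name (i = IP address), 15 = number of cipher suites, 16 = number of extensions, h2 = ALPN (HTTP/2). The second block is a truncated hash of the sorted cipher list and the third is a truncated hash of the sorted extensions plus signature algorithms, which is why shuffling extension order no longer changes the fingerprint.
JA4+ extends this with: JA4H (HTTP header fingerprint), JA4X (X.509 certificate), JA4SSH (SSH handshake), JA4T (TCP window + options). Cloudflare deployed it in a Rust crate at CDN edge. Akamai in an EdgeWorker. Both fire before your request reaches origin.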
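As a reading aid, the sketch below splits a JA4 string into its parts, following the field layout in the published JA4 specification (transport, TLS version, SNI flag, cipher count, extension count, ALPN, then two truncated hashes). `Ja4Prefix` and `parse_ja4` are names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Ja4Prefix:
    transport: str
    tls_version: str
    sni: str
    cipher_count: int
    extension_count: int
    alpn: str


def parse_ja4(fingerprint: str):
    """Split a JA4 string into its readable prefix and its two hash blocks."""
    prefix, cipher_hash, extension_hash = fingerprint.split("_")
    if len(prefix) != 10:
        raise ValueError(f"unexpected JA4 prefix length: {prefix!r}")
    parsed = Ja4Prefix(
        transport={"t": "TCP", "q": "QUIC"}.get(prefix[0], prefix[0]),
        tls_version=prefix[1:3],
        sni={"d": "domain", "i": "ip"}.get(prefix[3], prefix[3]),
        cipher_count=int(prefix[4:6]),
        extension_count=int(prefix[6:8]),
        alpn=prefix[8:10],
    )
    return parsed, cipher_hash, extension_hash


parsed, cipher_hash, extension_hash = parse_ja4("t13d1516h2_8daaf6152771_b0da82dd1658")
print(parsed)
print(cipher_hash, extension_hash)
```

Comparing the parsed prefix of your scraper against a real browser's is a quick way to spot a mismatch before looking at the hash blocks.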
HEADER_TABLE_SIZE, MAX_CONCURRENT_STREAMS, INITIAL_WINDOW_SIZE, MAX_FRAME_SIZE, MAX_HEADER_LIST_SIZE
Chrome's exact values are documented. Python's httpx sends different values. curl sends different values. The ordering of these settings, the window update frame sizes, and the HPACK compression decisions all create a secondary fingerprint that cannot be spoofed without rewriting the HTTP/2 client, which is exactly what curl_cffi does.
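One common way to write this secondary fingerprint down is a pipe-separated string: SETTINGS pairs, the WINDOW_UPDATE increment, priority frames, and pseudo-header order. The sketch below builds such a string. The setting IDs come from the HTTP/2 RFC; the two sets of values are assumptions for illustration (browser-like and library-like), not measurements.

```python
# SETTINGS identifiers as defined by the HTTP/2 specification.
SETTING_IDS = {
    "HEADER_TABLE_SIZE": 1,
    "ENABLE_PUSH": 2,
    "MAX_CONCURRENT_STREAMS": 3,
    "INITIAL_WINDOW_SIZE": 4,
    "MAX_FRAME_SIZE": 5,
    "MAX_HEADER_LIST_SIZE": 6,
}


def http2_fingerprint(settings, window_update, priority_frames, pseudo_headers):
    """Build a 'settings|window_update|priority|pseudo-header order' string."""
    settings_part = ";".join(f"{SETTING_IDS[name]}:{value}" for name, value in settings)
    priority_part = ",".join(priority_frames) if priority_frames else "0"
    # ":method" -> "m", ":authority" -> "a", and so on.
    headers_part = ",".join(header[1] for header in pseudo_headers)
    return "|".join([settings_part, str(window_update), priority_part, headers_part])


# Assumed, browser-like values: order of the settings matters as much as the numbers.
browser_like = http2_fingerprint(
    settings=[
        ("HEADER_TABLE_SIZE", 65536),
        ("ENABLE_PUSH", 0),
        ("INITIAL_WINDOW_SIZE", 6291456),
        ("MAX_HEADER_LIST_SIZE", 262144),
    ],
    window_update=15663105,
    priority_frames=[],
    pseudo_headers=[":method", ":authority", ":scheme", ":path"],
)

# Assumed, library-like values: different settings, different pseudo-header order.
library_like = http2_fingerprint(
    settings=[
        ("HEADER_TABLE_SIZE", 4096),
        ("ENABLE_PUSH", 0),
        ("MAX_CONCURRENT_STREAMS", 100),
        ("INITIAL_WINDOW_SIZE", 65535),
    ],
    window_update=16777216,
    priority_frames=[],
    pseudo_headers=[":method", ":path", ":authority", ":scheme"],
)

print(browser_like)   # 1:65536;2:0;4:6291456;6:262144|15663105|0|m,a,s,p
print(library_like)
```

Even with an identical TLS handshake, the two strings differ in every section, which is the mismatch a server-side check would look for.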
Chrome's QUIC stack differs from libcurl's QUIC implementation, which in turn differs from Python's aioquic. Each leaves a unique signature in the Initial packets.
Current status: JA4+ covers QUIC. Cloudflare has begun collecting QUIC fingerprints. Not yet widely enforced for blocking, but the infrastructure is live. Tools like curl_cffi are actively implementing QUIC parity.
    # Test your actual JA4 fingerprint against tls.browserleaks.com
    import requests
    from curl_cffi import requests as cffi

    # ❌ requests exposes the Python/urllib3 JA4, which is catalogued and blocked
    r1 = requests.get("https://tls.browserleaks.com/json")
    print(r1.json()["ja4"])  # the JA4 the server saw for plain requests

    # ✓ curl_cffi emits Chrome 124's JA4 hash, HTTP/2 frames, and cipher order
    r2 = cffi.get(
        "https://tls.browserleaks.com/json",
        impersonate="chrome124",  # also: chrome110, chrome107, safari17_0
    )
    print(r2.json()["ja4"])  # compare against a real Chrome 124 session

    # Also check the HTTP/2 fingerprint (Chrome's exact SETTINGS frame values)
    print(r2.json().get("http2"))  # None if the endpoint reports it under another key
All the JA4+ research is academic until you ship it. Three tiers of solution, in order of how often you should reach for each:
curl_cffi (Python), tls-client (Go), noble-tls, hrequests. One line of code, exact Chrome/Firefox JA4. Drop-in replacement for requests.
curl_cffi.requests.get(url, impersonate="chrome131")
meta={"stealth": {"profile": "chrome_147"}}
Camoufox, rayobrowse, or CloakBrowser. C++ binary patches ship a real-browser TLS stack along with everything else.
Cost: 200MB+ memory per browser instance
urllib3, you flag faster than no spoofing at all: the mismatch is the signal.
2. Forgetting HTTP/2 SETTINGS frames. Even perfect JA4 fails if your HTTP/2 SETTINGS (header table size, max concurrent streams, initial window size) do not match the browser you claim to be. curl_cffi and tls-client handle this; rolling your own usually does not.
3. Using stale impersonation profiles. Chrome 120 fingerprints in 2026 are themselves suspicious, because real users rolled forward. Keep impersonate="chrome131" or newer.
Layer 2, JavaScript Fingerprinting: The Page That Interrogates You
navigator.webdriver = false for AI-agent-driven Playwright sessions, and Google patched out the most common CDP-detection technique in V8. The browser vendors that wrote the automation-transparency rules quietly stopped enforcing them. Any bypass or detection strategy pivoting on these flags should treat them as unreliable. See the Innovation Feed card "The Vendors That Wrote the Detection Rules" for details.
Once your TLS passes, the page loads its anti-bot script. This is a 500KB+ obfuscated interrogation that runs dozens of tests in parallel.
canvas.getContext('2d') then calls canvas.toDataURL(). The exact pixel output varies by:
- GPU manufacturer and model (NVIDIA vs AMD vs Intel)
- Driver version and sub-pixel rendering
- OS-level font rendering (Windows ClearType vs macOS CoreText)
- Canvas size and DPI scaling
A headless Chromium with no GPU produces a software-rendered canvas with a known hash. Botasaurus and CloakBrowser spoof this at the C++ level by injecting slight noise into the pixel values before toDataURL() returns, enough to vary the hash while remaining visually identical.
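The effect of that noise injection can be modelled in a few lines. This is a toy model in Python, not how any stealth browser implements it (those patches live in the browser's C++ rendering code): a fake RGBA buffer stands in for the canvas, and `add_noise` is a helper invented for the example.

```python
import hashlib
import random


def canvas_hash(pixels: bytes) -> str:
    """Hash the raw pixel buffer, standing in for hashing toDataURL() output."""
    return hashlib.sha256(pixels).hexdigest()


def add_noise(pixels: bytes, seed: int, changes: int = 8) -> bytes:
    """Nudge a handful of channel values by exactly one step."""
    rng = random.Random(seed)
    noisy = bytearray(pixels)
    for index in rng.sample(range(len(noisy)), changes):
        value = noisy[index]
        noisy[index] = value + 1 if value < 255 else value - 1
    return bytes(noisy)


# A synthetic 32x32 RGBA "canvas" (assumption: any deterministic buffer works here).
baseline = bytes(
    (x * 7 + y * 13) % 256 for y in range(32) for x in range(32) for _ in range(4)
)

noisy = add_noise(baseline, seed=42)
max_delta = max(abs(a - b) for a, b in zip(baseline, noisy))

print(canvas_hash(baseline) != canvas_hash(noisy))  # the fingerprint changes
print(max_delta)                                    # but no channel moved by more than 1
```

A per-session seed keeps the hash stable within one identity while differing between identities, which is the property the spoofing relies on.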
gl.getParameter(gl.RENDERER) and gl.getParameter(gl.VENDOR). Real Chrome returns something like ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0).
Headless Chrome returns a generic string or crashes on WebGL entirely. Anti-bots cross-reference: if WebGL says "Intel UHD 620" but Canvas hash shows software rendering, that's a contradiction, and you're flagged.
WebGL extensions list is also fingerprinted. Real GPUs expose 30–40 extensions. Software renderers expose a different subset. The exact combination is GPU-specific and stable across sessions.
AudioContext generates a sine wave through an OscillatorNode, runs it through a DynamicsCompressorNode, and reads the output buffer values. The floating-point output depends on:
- CPU architecture (x86 vs ARM floating-point precision)
- Operating system audio stack
- Audio driver implementation
Headless environments often return 0.0 across the buffer (no audio context), or a software-emulated value that differs from hardware. CloakBrowser patches this at the Chromium C++ audio rendering layer.
When JS patches a native function, for example navigator.webdriver, it replaces the getter with a custom function. Calling Function.prototype.toString.call(getter) on the patched function returns function () { [custom code] } instead of function () { [native code] }.
Kasada specifically tests dozens of native functions this way. playwright-stealth patches them in JavaScript, so toString() reveals the patch. PatchRight fixes this at the Python source level, before Chrome even starts. There's no JS to inspect.
fetch('chrome-extension://[id]/manifest.json'). Real Chrome browsers have at least a few extensions installed (ad blockers, password managers, etc.).
A headless browser returns net::ERR_FAILED on all 60 requests simultaneously, a statistically impossible result for a real user. The extension IDs probed include:
- cjpalhdlnbpafiamejdnhcphjbkeiagm (uBlock Origin)
- hdokiejnpimakedhajhdlcegeplioahd (LastPass)
- nngceckbapebfimnlniiiahkandclblb (Bitwarden)
Fix: CloakBrowser loads real extension profiles. You install 1Password or Bitwarden into it so some probes return real manifest data.
navigator.webdriver
CDP-controlled browsers expose themselves through subtler signals:
Timing: CDP's Runtime.enable command leaves a timing gap between page parse and script execution that doesn't exist in real Chrome.
Execution context: window.cdc_adoQpoasnfa76pfcZLmcfl_Array and similar artifacts left by ChromeDriver are checked.
Permission API: Real Chrome returns realistic permission states. ChromeDriver returns defaults inconsistent with a "normal" browser.
Plugins: Headless Chrome has zero plugins. Real Chrome always has at least the PDF viewer plugin.
Camoufox's solution: Uses Mozilla's Juggler protocol, which sits below CDP entirely, none of these artifacts exist.
Layer 2.5, WebAssembly Fingerprinting: The Layer Below Your Stealth Browser
The probe: set hyphens: auto on a narrow container, render a known word like "hyphenation", read the rendered width or screenshot via Canvas. A real Chrome on Windows produces hy-phen-ation. A custom fork without the dictionary produces no break, or the wrong break.
Affected stealth browsers: anything built from a custom Chromium source that skipped the hyphenation step, which is most of them. Real CloakBrowser and properly-built forks include it; hand-rolled patches usually don't.
Mitigation: confirm your build ships the dictionary for every language you claim to support, or run a real Chrome binary under XVFB. Verify with the live PoC: joe12387.github.io/hyphenation-dictionary-poc · github source
Why this matters: stealth browsers like Camoufox, CloakBrowser, PatchRight patch what the browser reports. WASM SIMD probes the actual CPU. A real Mac with M2 chip can't be spoofed to look like an Intel laptop because the SIMD timing fingerprint is generated by the silicon, not by the browser.
Source: Anthony Manikhouth (DataDome bot detection engineer), blog.azerpas.com, May 2026.
performance.now() at 100µs on non-isolated pages to prevent Spectre-style timing attacks. But one line of JavaScript breaks that: new WebAssembly.Memory({shared:true}).buffer returns a real SharedArrayBuffer on any page, no special headers required.
Paired with a MessageChannel ping-pong loop in a hidden iframe driving Atomics.add(), you get a counter incrementing at 100,000 Hz, distinguishing steps around 6µs. That's 17× finer than the timer Chrome intends you to have.
Why anti-bots love this: micro-timing patterns (canvas render time, JS jitter, animation frame variance) differ between humans and bots at sub-millisecond scale. WASM shared memory makes that measurable on every page, not just cross-origin-isolated ones. Reported to Chrome as crbug 40057687, marked Won't Fix.
Source: Manuel (brokenbrowser.com).
1. Anti-bot ships a WASM module with SIMD ops and a high-resolution timer built from WebAssembly.Memory({shared:true}).
2. The module runs natively, no JS hooks to intercept, no Function.toString() traces to leak.
3. CPU microarchitecture + timing patterns are POSTed back as part of the bot scoring payload, often alongside the canvas hash.
What this defeats: Camoufox (Firefox C++ patches), CloakBrowser (49 Chromium patches), PatchRight, undetected-chromedriver, Nodriver, Pydoll. All of them patch JS APIs and binary internals, but none patches the WASM execution layer.
What still works: real hardware diversity. Different physical machines produce different SIMD fingerprints naturally. The future of stealth scraping is less about better lies and more about real hardware in real consumer locations, which is exactly what residential proxies on real ISP IPs already approximate.
indexedDB.databases(), the names come back in hash table iteration order, which is deterministic and stable for the lifetime of that process. Two unrelated sites see the same ordering and can use it to silently link a user's activity across domains: no cookies, no shared storage, no user interaction required.
The fingerprint persisted across reloads, new private windows, and even Tor Browser's "New Identity" resets. Only a full browser restart cleared it. Fixed in Firefox 150 / Tor Browser 15.0.10 (April 21, 2026).
Scraper implication: If you run multiple scraping identities inside the same browser process (shared Camoufox instance, same Firefox PID), an anti-bot can correlate them using this ordering as a stable session token — regardless of proxy rotation, cookie isolation, or fingerprint patching. The signal is below every stealth layer.
Rule: Isolate scraping identities at the process level, not just the profile level. One identity = one browser process. Verify your Camoufox build is on Firefox 150+.
Ref: CVE-2026-6770 · mfsa2026-30 · SecurityAffairs writeup
Layer 3, Network Identity: The Five Vectors That Must Agree
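A minimal sketch of the kind of cross-check this layer implies: given the five vectors (IP country, timezone, Accept-Language, WebRTC candidate country, DNS resolver country), list the ones that disagree with the proxy's exit country. The lookup table is a hypothetical stand-in; a real check would use GeoIP and timezone databases.

```python
from dataclasses import dataclass

# Hypothetical reference data (assumption), deliberately tiny.
COUNTRY_PROFILES = {
    "US": {
        "timezones": {"America/New_York", "America/Chicago", "America/Los_Angeles"},
        "languages": {"en-US"},
    },
    "PK": {"timezones": {"Asia/Karachi"}, "languages": {"ur-PK", "en-PK"}},
}


@dataclass
class Identity:
    ip_country: str
    timezone: str
    accept_language: str
    webrtc_country: str
    dns_country: str


def find_mismatches(identity: Identity) -> list[str]:
    """Return the vectors that contradict the IP's country."""
    profile = COUNTRY_PROFILES[identity.ip_country]
    primary_language = identity.accept_language.split(",")[0].split(";")[0].strip()
    mismatches = []
    if identity.timezone not in profile["timezones"]:
        mismatches.append("timezone")
    if primary_language not in profile["languages"]:
        mismatches.append("Accept-Language")
    if identity.webrtc_country != identity.ip_country:
        mismatches.append("WebRTC candidate")
    if identity.dns_country != identity.ip_country:
        mismatches.append("DNS resolver")
    return mismatches


leaky = Identity("US", "America/New_York", "ur-PK,ur;q=0.9", "US", "US")
aligned = Identity("US", "America/Chicago", "en-US,en;q=0.9", "US", "US")

print(find_mismatches(leaky))    # ['Accept-Language']
print(find_mismatches(aligned))  # []
```

Running a check like this against your own session config before the first request is cheaper than discovering the mismatch from a block page.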
geoip=True aligns WebRTC candidates with the proxy exit country.
IP country, timezone, Accept-Language, WebRTC candidate, DNS resolver location. A US proxy with Accept-Language: ur-PK fails immediately. All five must tell a consistent geographic story. This is why setting geoip=True in Camoufox is critical: it auto-configures all five to match the proxy's exit country.
Layer 3.5, DOM Honeypots: The Trap Doesn't Care About Your Fingerprint
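For HTTP-only spiders, a first-pass honeypot filter can be sketched with the standard library alone. The version below skips links hidden through inline styles, the hidden attribute, tabindex="-1", a hidden ancestor, or placement after the closing body tag. It is a partial sketch under stated limits: it cannot see CSS-class hiding, zero-size elements, or off-screen positioning, which need computed layout from a real browser. `VisibleLinkParser` is a name invented for this example.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|opacity\s*:\s*0(?![.\d])", re.I
)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleLinkParser(HTMLParser):
    """Collect hrefs of links that are not obviously hidden from a human."""

    def __init__(self):
        super().__init__()
        self.visible_links = []
        self._hidden_stack = []   # one flag per open element: hidden (self or ancestor)?
        self._after_body = False  # links placed after </body> are treated as traps

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        hidden = (
            self._after_body
            or bool(self._hidden_stack and self._hidden_stack[-1])
            or "hidden" in attributes
            or attributes.get("tabindex") == "-1"
            or bool(HIDDEN_STYLE.search(attributes.get("style") or ""))
        )
        if tag == "a" and attributes.get("href") and not hidden:
            self.visible_links.append(attributes["href"])
        if tag not in VOID_TAGS:
            self._hidden_stack.append(hidden)

    def handle_endtag(self, tag):
        if tag == "body":
            self._after_body = True
        if tag not in VOID_TAGS and self._hidden_stack:
            self._hidden_stack.pop()


SAMPLE = """
<html><body>
<a href="/real">Real</a>
<a href="/trap1" style="display:none">x</a>
<div style="visibility: hidden"><a href="/trap2">y</a></div>
<a href="/trap3" tabindex="-1">z</a>
<a href="/faded" style="opacity:0.5">ok</a>
</body><a href="/trap4">late</a></html>
"""

parser = VisibleLinkParser()
parser.feed(SAMPLE)
print(parser.visible_links)  # ['/real', '/faded']
```

Treat the result as a candidate list, not a guarantee: anything this filter keeps can still be a trap that only layout inspection would reveal.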
display:none, visibility:hidden, opacity:0, zero-dimension elements, off-screen positioning, fields with tabindex="-1", or links placed after the closing </body> tag.
getBoundingClientRect()) before interacting.
Layer 4, Behavioural ML: You Can't Fake Being Human
document.querySelector() after DOMContentLoaded looks nothing like a human who reads the page for 2.3 seconds first. Warm-up navigation (visiting homepage before target) significantly improves behavioural scores.
Now you know the four detection layers. Every vendor below is just a different weighting of those same four signals: some prioritise TLS, others behaviour, others network identity. Knowing the layer tells you which tool to pick. Here are the six walls.
Six companies built the walls.
Here's every key.
Each vendor applies the detection layers differently, different weights, different signals, different architectures. What bypasses Cloudflare has zero effect on Kasada. You need to know exactly which wall you're facing before you choose a tool.
Identify which anti-bot you're facing
Wrong strategy on the wrong vendor wastes hours. Before writing a single line of code, spend 30 seconds identifying exactly what's protecting the target.
Visit the target site, click the Wappalyzer icon in your toolbar. It instantly shows all detected technologies, including the anti-bot vendor. Shows Akamai, Cloudflare, DataDome, PerimeterX, Kasada and more with a single click.
Open DevTools → Application → Cookies. Match any cookie name to identify the vendor. Multiple vendors can run on the same site. For CLI scanning at scale: wafw00f https://target.com identifies WAF + anti-bot vendor in one command.
DevTools → Network → any request → Response Headers. Look for x-datadome, server: cloudflare, x-akamai-request-id, or challenge redirect URLs containing vendor names.
Free Chrome + Firefox extension. One click on any site shows:
- Anti-bot / security vendor
- CDN provider
- CMS, framework, analytics
- Server technology
curl_cffi impersonate="chrome124" handles TLS + HTTP/2 layer
geoip=True, 100% pass rate Mar 2026 on Instagram, Reddit, X, LinkedIn
StealthyFetcher solves Turnstile natively and automatically
boring_challenge is a Rust-compiled state machine that cannot be emulated; it requires actual browser execution to produce valid tokens. IP reputation alone accounts for 25–30% of the total trust score.
__NEXT_DATA__ in HTML source, Grainger had 110KB of product data in it, bypassing DataDome entirely
curl_cffi chrome124 + residential proxy → confirmed 200 OK (Grainger.com)
geoip=True aligns all 5 identity vectors with proxy exit country
_px3 token generation flow for token replay
ips.js renamed polymorphically each deployment) issues proof-of-work challenges that require real CPU cycles and browser APIs to solve. There are no CAPTCHAs, failures are silent 403s or 429s with no explanation. The critical 2026 fact: Kasada specifically fingerprints playwright-stealth by calling Function.prototype.toString() on patched native functions. The patch signatures are catalogued.
robots.txt. Used by Codeberg, FFmpeg, the Linux kernel source, Sourcehut, and most non-Cloudflare FOSS projects. Recognisable by its anime "Anubis" mascot illustration during the challenge. Bypass: headless Chromium with JS enabled (it'll solve the PoW naturally, just slower), or persist the verification cookie across requests. Codeberg confirmed in mid-2025 that AI scrapers already learned to solve Anubis challenges, so it slows scraping but doesn't stop a determined operator.
Quick identification reference
| What you see | Anti-bot | Key cookie/header | Detection method |
|---|---|---|---|
| "Pardon Our Interruption" page | Akamai block | _abck | Wappalyzer · response body |
| CF-Ray header · Turnstile iframe | Cloudflare challenge | cf_clearance | Response header CF-Ray |
| JSON with datadome key | DataDome block | datadome | Response header x-datadome |
| _px3 or _pxde set | PerimeterX block | _px3 | Cookie inspection |
| Silent 403 · no body | Kasada silent | x-kpsdk-ct | Response headers · ips.js in source |
| reese84 or TS cookie | F5 Shape block | reese84 | Cookie names · Shape JS reference |
| Anime mascot "weighing your soul" page | Anubis challenge | techaro.lol-anubis-auth | JS PoW challenge · Anubis HTML title |
| 302 redirect to a virtual waiting room | Queue-It queue | Queue-it token cookie | X-Queueit-Connector header · queue-it.net redirect |
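The table translates directly into a lookup you can run on any response. The sketch below mirrors the cookie and header markers listed above; `detect_vendors` is a helper name chosen for this example, and it only checks names, so treat a match as a hint rather than proof.

```python
from typing import Iterable, Mapping

# Markers copied from the identification table above.
COOKIE_VENDORS = {
    "_abck": "Akamai",
    "cf_clearance": "Cloudflare",
    "datadome": "DataDome",
    "_px3": "PerimeterX",
    "_pxde": "PerimeterX",
    "reese84": "F5 Shape",
    "techaro.lol-anubis-auth": "Anubis",
}
HEADER_VENDORS = {
    "cf-ray": "Cloudflare",
    "x-datadome": "DataDome",
    "x-kpsdk-ct": "Kasada",
    "x-queueit-connector": "Queue-It",
}


def detect_vendors(cookie_names: Iterable[str], headers: Mapping[str, str]) -> set[str]:
    """Return every vendor whose cookie or header marker appears in the response."""
    found = set()
    for name in cookie_names:
        vendor = COOKIE_VENDORS.get(name.lower())
        if vendor:
            found.add(vendor)
    lowered = {name.lower(): value for name, value in headers.items()}
    for marker, vendor in HEADER_VENDORS.items():
        if marker in lowered:
            found.add(vendor)
    if lowered.get("server", "").lower() == "cloudflare":
        found.add("Cloudflare")
    return found


vendors = detect_vendors(
    cookie_names=["_abck", "session_id"],
    headers={"CF-Ray": "example-ray-id", "Content-Type": "text/html"},
)
print(sorted(vendors))  # ['Akamai', 'Cloudflare']
```

Returning a set rather than a single name matters because, as noted earlier, several vendors can protect the same site.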
Six walls. Now the tools. Every library below exists as a direct response to one of those six systems, curl_cffi was built because JA4 broke Python's TLS. Camoufox because CDP leaks signal automation. PatchRight because Kasada fingerprints JS patches. The arms race made this arsenal.
How I approached real-world bypasses
The theory above tells you what anti-bots do. These notes tell you what I did when I hit them on a production job. Each is a full day or two of work distilled to: what I tried, why it failed, what finally worked, and the decision tree I'd use next time.
Akamai v3 in 2026: cracking it without a browser
Field notes from a production scraping job. The story of what I tried, why each thing failed, and the exact approach that finally got clean 200 responses with zero browser overhead.
_abck ~-1~ won't flip
Akamai's _abck cookie has two states. ~-1~ means unvalidated, full bot score, blocked. ~0~ means validated, trust granted. The cookie is set immediately on any page load, but only flips to ~0~ after sensor.js (a 512KB obfuscated fingerprinting script) executes, collects signals, and POSTs them to /_bm/data.
Signals that matter most: canvas fingerprint (pixel-level hash of GPU-rendered shapes and text), WebGL renderer (exact GPU model via WEBGL_debug_renderer_info), AudioContext (floating-point sine wave through a compressor node), Chrome extension probes (60 chrome-extension:// URLs fetched via fetch(), zero passing = instant bot score), mouse/scroll trajectory physics, and navigator properties cross-checked against the fingerprint.
The kicker: validation is multi-request. Trust accumulates across the session, not just on the first hit.
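Checking which state a session is in is worth automating, since every attempt below hinges on it. This is a minimal sketch that assumes the state marker is the first tilde-delimited field after the cookie's leading token; the sample values are made up for illustration.

```python
def abck_state(cookie_value: str) -> str:
    """Classify an _abck cookie as validated, unvalidated, or unknown."""
    parts = cookie_value.split("~")
    if len(parts) < 2:
        return "unknown"
    marker = parts[1]
    if marker == "0":
        return "validated"
    if marker == "-1":
        return "unvalidated"
    return "unknown"


# Hypothetical cookie values, shortened for readability.
fresh = "A1B2C3D4~-1~YAAQexample~-1~-1~-1"
trusted = "A1B2C3D4~0~YAAQexample~-1~-1~-1"

print(abck_state(fresh))    # unvalidated
print(abck_state(trusted))  # validated
```

Logging this after every request in a session makes the multi-request trust build-up visible instead of guessing from status codes.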
undetected-chromedriver (uc) routed through a Comcast ISP proxy. Then switched to Pydoll, CDP automation without the usual webdriver flags. Both behaved identically. _abck set immediately as ~-1~, never flips. Waited 60 seconds, scrolled, dispatched JS mouse events. Nothing.
gl.getContext('webgl') returns null. Sensor.js sees zero WebGL context and assigns maximum bot score before the session even starts.
--use-angle=swiftshader --use-gl=angle. WebGL works. Canvas renders. AudioContext works. Renderer: ANGLE (Google, Vulkan 1.3.0 (SwiftShader Device (Subzero) (0x0000C0DE))).
0x0000C0DE is SwiftShader's device ID, in public lists of virtual GPU IDs. Akamai checks the unmasked renderer against a blocklist. SwiftShader is on it. The canvas hash it produces is also deterministic and known.
geoip=True aligns WebRTC, DNS, and timezone with the proxy exit country. Set up a session, pointed it at the target, ran a few warm-up requests.
Page.addScriptToEvaluateOnNewDocument: patch WebGLRenderingContext.prototype.getParameter to return "Intel Iris OpenGL Engine". Patch navigator.platform to "MacIntel", deviceMemory to 8, battery API, chrome.runtime. Result: 2327-byte error page before sensor.js runs.
Page.addScriptToEvaluateOnNewDocument. The prototype tampering itself is detectable via Function.prototype.toString().
Network.setUserAgentOverride with full userAgentMetadata to spoof macOS Chrome 148. No JS injection. Same error page.
navigator.userAgent returns, but not the TLS ClientHello fingerprint. Akamai's EdgeWorker sees the JA4 hash (still Linux Python automation) and blocks at the network layer before the page loads.
xvfb-run -a, Chrome launches in headless=False. Pages load fully (1.1MB real HTML, all images, category navigation).
glxinfo shows Mesa software rasterizer. Canvas hash from Mesa llvmpipe is different from SwiftShader but still a known server software renderer, also flagged.
_abck stays ~-1~ for 60+ seconds regardless of scrolling.
Page.addScriptToEvaluateOnNewDocument again. Same problem as Attempt 3.
Function.toString() inspection.
Every failed attempt above tried to fix the browser layer. The fundamental insight: most Akamai-protected sites never reach the deep sensor.js evaluation if the request looks like real Chrome at the network layer first.
Akamai scores in five layers:
Python's requests, httpx, even curl_cffi with a wrong impersonation profile all fail at Layer 1. The JA4 hash doesn't match Chrome 148's actual ClientHello. Fix Layers 1-3 correctly and you often never reach Layer 4.
A Go library, akamai-v3-sensor, reimplements Chrome's exact TLS stack at the C level: cipher suite order, GREASE values, extension ordering, ALPN, HTTP/2 SETTINGS frames, HTTP/3 QUIC parameters. The JA4 fingerprint it produces is indistinguishable from real Chrome 148 because it is Chrome 148's cipher suite, implemented in Go.
    // One session, one proxy, one request
    s := sensor.NewSession("chrome-148",
        sensor.WithSessionProxy("http://user:pass@comcast-ip:port"),
        sensor.WithSessionTimeout(30*time.Second),
    )
    resp, _ := s.Get(context.Background(), "https://target-site.com/")
    // Status: 200, Protocol: h2, _abck: ~0~ (validated)

    // Then GraphQL directly on the same session
    gql, _ := s.DoWithBody(ctx, req, bytes.NewReader(payload))
    // Status: 200, 30KB product data, zero blocks
No browser process. No GPU. No canvas hash. No sensor.js execution. Just a TLS handshake that matches Chrome 148 exactly because it uses Chrome 148's cipher suites.
Scrapy spider
→ GoProxyMiddleware (urllib, ~35ms round trip)
→ Go HTTP server :8765 (4-session pool)
→ Go TLS library sessions
→ ISP proxy (Comcast AS7015, static residential)
→ Target site
Session rotation logic: 206 or GenericError triggers the next session in the pool. Three errors on one session triggers a background re-warm (new TLS handshake, new session state). All 4 sessions blocked returns 503; Python middleware waits 5s and retries up to 3× before falling back to curl_cffi.
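The rotation rules can be sketched in a few lines of Python. This is an illustration of the logic, not the Go server's code: `SessionPool`, `GenericError`, and the `send` callback are stand-ins invented here, the re-warm is recorded in a list instead of running in the background, and "all 4 sessions blocked" is interpreted as every session failing for the current request.

```python
from dataclasses import dataclass

POOL_SIZE = 4
REWARM_AFTER_ERRORS = 3


class GenericError(Exception):
    """Stand-in for the transport-level error named in the rotation rules."""


@dataclass
class Session:
    name: str
    errors: int = 0


class SessionPool:
    def __init__(self, size: int = POOL_SIZE) -> None:
        self.sessions = [Session(name=f"session-{i}") for i in range(size)]
        self.cursor = 0
        self.rewarm_requested: list[str] = []

    def fetch(self, url, send):
        """Try each session once; 206 or GenericError moves on to the next one."""
        for _ in range(len(self.sessions)):
            session = self.sessions[self.cursor]
            try:
                status = send(session.name, url)
            except GenericError:
                status = None
            if status is not None and status != 206:
                return status, session.name
            session.errors += 1
            if session.errors == REWARM_AFTER_ERRORS:
                # A real pool would start a new TLS handshake in the background here.
                self.rewarm_requested.append(session.name)
                session.errors = 0
            self.cursor = (self.cursor + 1) % len(self.sessions)
        return 503, None  # every session failed; the caller waits and retries


def flaky_send(session_name, url):
    return 206 if session_name == "session-0" else 200


def dead_send(session_name, url):
    raise GenericError("blocked")


pool = SessionPool()
first = pool.fetch("https://example.com/", flaky_send)

dead_pool = SessionPool()
results = [dead_pool.fetch("https://example.com/", dead_send) for _ in range(3)]

print(first)                       # (200, 'session-1')
print(results[-1])                 # (503, None)
print(dead_pool.rewarm_requested)  # all four sessions after three failed rounds
```

The 5-second wait, the three retries, and the curl_cffi fallback belong in the calling middleware, which keeps the pool itself free of sleeps.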
toDataURL() or getParameter() in JavaScript is detectable via Function.prototype.toString(). The only real fix is at the C++ level, either a real GPU or a library that bypasses the browser entirely.
0x0000C0DE device ID is permanently flagged.
Don't bother. It's in Akamai's blocklist and the deterministic canvas hash is also known. Same for Mesa llvmpipe.
Page.addScriptToEvaluateOnNewDocument is itself a signal.
Akamai's EdgeWorker detects the timing gap left by CDP's Runtime.enable command. The injection runs, but the metadata around it is visible.
/graphql, /api/v1/, mobile traffic intercepted via HTTP Toolkit.
Every tool built to fight
every wall we just described.
Now that you understand the detection stack and the six anti-bot vendors, every library below makes sense in context. curl_cffi exists because of JA4. Camoufox exists because of CDP leaks. PatchRight exists because of Kasada's toString() inspection. The arsenal wasn't built randomly, each tool is a direct countermeasure to a specific detection innovation.
Scraping is no longer Python-only.
Python still dominates the open-source ecosystem (Scrapy, curl_cffi, Camoufox), but the hardest 10% of targets in 2026 reach for Go, TypeScript, or Rust. Here's when each language earns its place, and why mixing them in one pipeline is the production-grade move.
urllib for the protected requests only.
Master comparison table, all 73 libraries & tools
| Library (click to expand) | Type | Lang | JS render | TLS spoof | TLS detail | Anti-bot target | MCP | Stars |
|---|---|---|---|---|---|---|---|---|
| curl_cffi ⚡ | HTTP | Python | Chrome JA4+ | Akamai, DataDome | – | |||
|
⚡ HTTP
Under the hood: libcurl C library with custom TLS patches. Emits exact Chrome/Safari/Firefox TLS ClientHello at the C level, cipher suites, extensions, ALPN, GREASE all match real browsers.
✓ Pros
✗ Cons
|
||||||||
| Scrapling ⚡ | HTTP | Python | Chrome TLS | Cloudflare Turnstile | 38k | |||
|
⚡ HTTP
Under the hood: Wraps curl_cffi for stealth HTTP + integrates Camoufox for browser mode. StealthyFetcher uses a real patched Firefox under the hood when needed.
✓ Pros
✗ Cons
|
||||||||
| webclaw ⚡ | HTTP | Rust | Chrome TLS | Medium targets | – | |||
|
⚡ HTTP
Under the hood: Rust HTTP client with TLS fingerprint spoofing. Emits browser TLS signatures from Rust, fast and low-memory.
✓ Pros
✗ Cons
|
||||||||
| httpx ⚡ | HTTP | Python | None | Unprotected only | – | |||
|
⚡ HTTP
Under the hood: Modern Python HTTP library with async support and HTTP/2.
✓ Pros
✗ Cons
|
||||||||
| requests ⚡ | HTTP | Python | None | Unprotected only | 52k | |||
|
⚡ HTTP
Under the hood: Pure Python HTTP library. Sends HTTP/1.1 requests with standard Python TLS.
✓ Pros
✗ Cons
|
||||||||
| tls-client ⚡ | HTTP | Go/Py | Chrome/Firefox TLS | Akamai, DataDome | – | |||
|
⚡ HTTP
Under the hood: Go/Python wrapper around a Go TLS client that mimics browser fingerprints. Predecessor to cycle-tls.
✓ Pros
✗ Cons
|
||||||||
| Playwright 🌐 | Browser | Py/JS | CDP (detectable) | Medium (CDP leaks) | 68k | |||
|
🌐 Browser
Under the hood: Chromium DevTools Protocol (CDP). Microsoft-maintained. Drives real Chromium, Firefox, or WebKit browsers over CDP socket.
✓ Pros
✗ Cons
|
||||||||
| Camoufox 🌐 | Browser | Python | C++ Firefox Juggler | Cloudflare 100%, Akamai | – | |||
|
🌐 Browser
Under the hood: Forked Firefox with C++ binary patches to Juggler protocol (below CDP). Patches navigator, canvas, WebGL, fonts, window.chrome at binary level.
✓ Pros
✗ Cons
⚠ CVE-2026-6770 — Process-Level Fingerprint Leak (patched Firefox 150)
Firefox below v150 returned IndexedDB database names in hash table iteration order rather than sorted order. Because the hash table is shared across all origins within the same browser process, the ordering became a stable, high-entropy process-lifetime fingerprint — consistent across tabs, sites, private windows, and even Tor Browser's "New Identity" resets. Anti-bot systems could use this to correlate multiple scraping identities running in the same browser process, regardless of proxy rotation or profile switching.
Fix: Use Camoufox built on Firefox 150+ (patched April 2026). Verify your version. Scraper lesson: Always isolate identities at the process level, not just the profile level. Multiple sessions in one browser process can be correlated through memory-state artifacts like this even when fingerprint patching is otherwise perfect. Ref: CVE-2026-6770 · mfsa2026-30 · Fixed Firefox 150 / Tor Browser 15.0.10 |
||||||||
| CloakBrowser 🌐 | Browser | Python | 49 C++ patches | Akamai, reCAPTCHA v3 0.9 | – | |||
|
🌐 Browser
Under the hood: 49+ C++ binary patches to Chromium itself. Patches webdriver, chrome object, plugins, permissions, WebGL, Canvas, AudioContext, and extension probe responses at the C++ level, not JavaScript. Repo: github.com/CloakHQ/CloakBrowser.
✓ Pros
✗ Cons
|
||||||||
| PatchRight 🌐 | Browser | Python | Py source patches | Kasada, Cloudflare | – | |||
|
🌐 Browser
Under the hood: Patches Playwright Python source files at install time. Removes CDP signatures, webdriver property, and stealth tells from the JS layer.
✓ Pros
✗ Cons
|
||||||||
| Puppeteer 🌐 | Browser | Node | CDP (detectable) | Medium targets | 89k | |||
|
🌐 Browser
Under the hood: Node.js CDP driver for Chromium. Google-maintained. The original headless browser automation library.
✓ Pros
✗ Cons
|
||||||||
| Selenium 🌐 | Browser | Multi | webdriver=true | Weak (legacy) | 29k | |||
|
🌐 Browser
Under the hood: WebDriver protocol (W3C standard). Drives any browser via standardised JSON protocol. The original browser automation framework.
✓ Pros
✗ Cons
|
||||||||
| SeleniumBase UC 🌐 | Browser | Python | UC removes WD flag | Kasada, general stealth | 10k | |||
|
🌐 Browser
Under the hood: SeleniumBase with undetected-chromedriver mode. Patches Chrome binary to remove webdriver flag and CDP signatures.
✓ Pros
✗ Cons
|
||||||||
| Selenium-Driverless 🌐 | Browser | Python | CDP no WebDriver | Medium targets | – | |||
|
🌐 Browser
Under the hood: Direct CDP connection without ChromeDriver binary, no webdriver flag set. Async Python API.
✓ Pros
✗ Cons
|
||||||||
| nodriver 🌐 | Browser | Python | Raw CDP async | Medium targets | – | |||
|
🌐 Browser
Under the hood: Controls Chrome via its internal DevTools socket without using CDP's standard automation flag. Chrome doesn't know it's being driven.
✓ Pros
✗ Cons
|
||||||||
| pydoll 🌐 | Browser | Python | Async CDP | Medium targets | – | |||
|
🌐 Browser
Under the hood: Pure Python browser automation using Chrome DevTools Protocol directly. No external driver.
✓ Pros
✗ Cons
|
||||||||
| Botright 🌐 | Browser | Python | CAPTCHA solving | CAPTCHA targets | – | |||
|
🌐 Browser
Under the hood: Playwright wrapper focused on CAPTCHA solving and stealth. Uses AI to solve CAPTCHAs during automation.
✓ Pros
✗ Cons
|
||||||||
| Botasaurus 🌐 | Browser | Python | Gaussian mouse | DataDome behaviour | – | |||
|
🌐 Browser
Under the hood: Playwright wrapper that adds Gaussian mouse movement, realistic typing, scroll physics, and session management.
✓ Pros
✗ Cons
|
||||||||
| rayobrowse 🌐 | Browser | Py/Docker | Real device FP DB | Hard targets | – | |||
|
🌐 Browser
Under the hood: Docker-based stealth Chromium browser from Rayobyte. C++ level patches (not JS-level), exposed via CDP so Playwright/Puppeteer/Selenium can connect natively. Self-hosted = free and unlimited; managed Cloud version available.
✓ Pros
✗ Cons
|
||||||||
| undetected-chromedriver 🌐 | Browser | Python | Removes WD flag | Medium targets | 5k | |||
|
🌐 Browser
Under the hood: Patches ChromeDriver binary to remove webdriver=true and CDP automation flags at binary level.
✓ Pros
✗ Cons
|
||||||||
| ⭐ Scrapy ⚡ | Framework | Python | Via curl_cffi mw | Medium (with middleware) | 52k | |||
|
⚡ HTTP
Under the hood: Twisted-based async Python framework. Pure HTTP, sends requests, receives responses, parses with XPath/CSS. No browser.
✓ Pros
✗ Cons
|
||||||||
| Crawlee 🌐 | Framework | Node/Py | Playwright-based | Medium targets | 15k | |||
|
🌐 Browser
Under the hood: Apify's unified Node.js framework. Wraps both HTTP (got-scraping) and Playwright/Puppeteer. Handles retries, deduplication, storage.
✓ Pros
✗ Cons
|
||||||||
| scrapy-camoufox ⚡ | Framework | Python | Camoufox integration | Hard targets | – | |||
|
⚡ HTTP
Under the hood: Scrapy middleware that routes requests through Camoufox browser for stealth. Best of Scrapy + Camoufox.
✓ Pros
✗ Cons
|
||||||||
| scrapy-nodriver ⚡ | Framework | Python | nodriver integration | Medium targets | – | |||
|
⚡ HTTP
Under the hood: Scrapy middleware using nodriver for browser requests, Chrome without CDP flags.
✓ Pros
✗ Cons
|
||||||||
| scrapy-stealth ⚡ | Framework | Python | Browser TLS + HTTP/2 | Cloudflare, Akamai | v0.4 (2026) | |||
|
⚡ HTTP
Under the hood: Pluggable Scrapy DOWNLOADER_MIDDLEWARE with three drivers: basic + turbo (TLS fingerprint + HTTP/2 impersonation, no browser), and browser (real Chrome via CDP for JS-heavy targets). Per-request engine switching via request.meta["stealth"]. Repo: github.com/fawadss1/scrapy-stealth. Author Fawad ships frequent updates.
✓ Pros
✗ Cons
|
||||||||
| Firecrawl ⚡ | AI | API | FIRE-1 engine | Hard via managed | 111k | |||
|
⚡ HTTP
Under the hood: API service that converts any URL to clean Markdown or structured JSON for LLM consumption. FIRE-1 agent for multi-page crawls.
✓ Pros
✗ Cons
|
||||||||
| Crawl4AI 🌐 | AI | Python | Playwright-based | Medium targets | 60k | |||
|
🌐 Browser
Under the hood: Local Playwright wrapper optimised for LLM output. Runs locally, converts pages to clean Markdown with BM25 relevance filtering.
✓ Pros
✗ Cons
|
||||||||
| ScrapeGraphAI ⚡ | AI | Python | NL graph pipeline | Light protection | 18k | |||
|
⚡ HTTP
Under the hood: LLM-powered extraction that builds a graph pipeline from a natural language prompt. Local or API.
✓ Pros
✗ Cons
|
||||||||
| Jina Reader API ⚡ | AI | API | Built-in rendering | Medium targets | – | |||
|
⚡ HTTP
Under the hood: REST API: prefix r.jina.ai/ to any URL to get clean Markdown back. Zero setup.
✓ Pros
✗ Cons
|
||||||||
| Steel 🌐 | AI | API | Docker browser | Medium targets | – | |||
|
🌐 Browser
Under the hood: Self-hosted browser API with MCP server. AI agents call it as a tool to browse the web.
✓ Pros
✗ Cons
|
||||||||
| Bright Data ⚡ | Managed | API | Full enterprise stack | All incl. F5 Shape | – | |||
|
⚡ HTTP
Under the hood: 72M+ IP network + scraping API. Managed infrastructure handles anti-bot, JS rendering, proxy rotation.
✓ Pros
✗ Cons
|
||||||||
| Zyte ⚡ | Managed | API | Full stack | All targets | – | |||
|
⚡ HTTP
Under the hood: Scrapy company's managed scraping platform. Zyte API + AutoExtract for structured data.
✓ Pros
✗ Cons
|
||||||||
| Apify ⚡ | Managed | API | 10K+ Actors | Medium-hard | – | |||
|
⚡ HTTP
Under the hood: 10,000+ pre-built Actors on serverless cloud. Crawlee at core. MCP server for AI agents.
✓ Pros
✗ Cons
|
||||||||
| ScrapingBee ⚡ | Managed | API | Managed rendering | Medium targets | – | |||
|
⚡ HTTP
Under the hood: Managed scraping API. Handles JS rendering, CAPTCHA, proxies via simple REST call.
✓ Pros
✗ Cons
|
||||||||
| SerpAPI ⚡ | Managed | API | SERP JSON API | Search engine data | – | |||
|
⚡ HTTP
Under the hood: Managed API that abstracts Google, Bing, Baidu, Yandex, Yahoo, DuckDuckGo and 80+ other engines behind a single REST endpoint. Returns fully parsed, normalised JSON — organic results, ads, featured snippets, knowledge graphs, local packs, shopping, images, news — without you touching a proxy or a headless browser.
✓ Pros
✗ Cons
|
||||||||
| ScrapeBadger ⚡ | Managed | API | Smart billing + AI extract | Cloudflare, DataDome, hard targets | – | |||
|
⚡ HTTP
Under the hood: Newer managed scraping API built natively for modern anti-bot stacks. Key differentiator: smart billing — if you enable JS rendering and anti-bot bypass but the target doesn't need them, ScrapeBadger auto-downgrades the request and charges you less. Also ships an MCP server for Twitter/X scraping (profiles, tweets, trends) for AI agent workflows.
✓ Pros
✗ Cons
|
||||||||
| Oxylabs ⚡ | Managed | API | OxyCopilot AI | Hard targets | – | |||
|
⚡ HTTP
Under the hood: 102M+ IP network with OxyCopilot AI extraction and scraper APIs.
✓ Pros
✗ Cons
|
||||||||
| Browserbase 🌐 | Managed | API | Managed browser | Hard targets | – | |||
|
🌐 Browser
Under the hood: Managed Playwright cloud. Run Playwright scripts remotely without managing browser infrastructure.
✓ Pros
✗ Cons
|
||||||||
| chompjs ⚡ | Parser | Python | N/A | Parser only | – | |||
|
⚡ HTTP
Under the hood: Python library to parse JavaScript objects embedded in HTML pages. Converts JS literals to Python dicts.
✓ Pros
✗ Cons
|
||||||||
| Parsel ⚡ | Parser | Python | N/A | Parser only | – | |||
|
⚡ HTTP
Under the hood: Scrapy's HTML/XML parser library. XPath and CSS selectors with a clean Python API.
✓ Pros
✗ Cons
|
||||||||
| BeautifulSoup4 ⚡ | Parser | Python | N/A | Parser only | 10k | |||
|
⚡ HTTP
Under the hood: Python HTML/XML parser. Wraps lxml or html.parser. Builds a parse tree from raw HTML strings.
✓ Pros
✗ Cons
|
||||||||
| mitmproxy ⚡ | RE Tool | Python | N/A | RE / intercept | 37k | |||
|
⚡ HTTP
Under the hood: Python-based HTTPS proxy. Intercepts, inspects, and modifies HTTP/HTTPS traffic between client and server.
✓ Pros
✗ Cons
|
||||||||
| HTTPToolkit ⚡ | RE Tool | Any | N/A | Mobile API intercept | – | |||
|
⚡ HTTP
Under the hood: HTTPS intercepting proxy for development and mobile API discovery. Open source.
✓ Pros
✗ Cons
|
||||||||
| Frida ⚡ | RE Tool | Py/JS | N/A | SSL hooks | – | |||
|
⚡ HTTP
Under the hood: Dynamic instrumentation toolkit. Injects JavaScript into running processes. Used to hook native functions and bypass SSL pinning.
✓ Pros
✗ Cons
|
||||||||
| rebrowser-patches 🌐 | Browser | Python | Chrome source patches | Medium targets | – | |||
|
🌐 Browser
Under the hood: JavaScript patches injected into Playwright/Puppeteer pages to mask automation signals.
✓ Pros
✗ Cons
|
||||||||
| cycle-tls ⚡ | HTTP | Go/JS | Chrome/Firefox TLS | Akamai, DataDome | – | |||
|
⚡ HTTP
Under the hood: Node.js/Go TLS client that cycles through browser fingerprints. Sends real JA3 hashes per request.
✓ Pros
✗ Cons
|
||||||||
| GoLogin 🌐 | Browser | Cloud | Antidetect profiles | Hard multi-account | – | |||
|
🌐 Browser
Under the hood: Cloud anti-detect browser. Manages browser profiles with unique fingerprints stored in cloud. Multi-account management.
✓ Pros
✗ Cons
|
||||||||
| Multilogin 🌐 | Browser | Cloud | Antidetect profiles | Hard multi-account | – | |||
|
🌐 Browser
Under the hood: Commercial anti-detect browser with managed profile fingerprints. Team collaboration on browser profiles.
✓ Pros
✗ Cons
|
||||||||
| ScraperAPI ⚡ | Managed | API | Full stack | All incl. Walmart | – | |||
|
⚡ HTTP
Under the hood: Simple proxy rotation + JS rendering API. Handles geo-targeting and header rotation.
✓ Pros
✗ Cons
|
||||||||
| Decodo ⚡ | Managed | API | Full stack | All targets | – | |||
|
⚡ HTTP
Under the hood: Smartproxy's new brand. Residential, datacenter, and mobile proxy network.
✓ Pros
✗ Cons
|
||||||||
| CapSolver ⚡ | CAPTCHA | API | N/A | reCAPTCHA/hCaptcha | – | |||
|
⚡ HTTP
Under the hood: AI-powered CAPTCHA solving service. Uses computer vision to solve reCAPTCHA v2/v3, hCAPTCHA, Cloudflare Turnstile.
✓ Pros
✗ Cons
|
||||||||
| 2captcha ⚡ | CAPTCHA | API | N/A | All CAPTCHA types | – | |||
|
⚡ HTTP
Under the hood: Human + AI hybrid CAPTCHA solving service. One of the oldest in the market.
✓ Pros
✗ Cons
|
||||||||
| Anti-Captcha ⚡ | CAPTCHA | API | N/A | reCAPTCHA/image | – | |||
|
⚡ HTTP
Under the hood: Human + AI CAPTCHA solving service. Competitor to 2captcha.
✓ Pros
✗ Cons
|
||||||||
| Scrapyd ⚡ | Framework | Python | Via middleware | Scrapy deploy tool | – | |||
|
⚡ HTTP
Under the hood: Daemon that deploys and runs Scrapy spiders via JSON API. Port 6800. Process-based job queue.
✓ Pros
✗ Cons
|
||||||||
| scrapy-redis ⚡ | Framework | Python | N/A | Distributed Scrapy | – | |||
|
⚡ HTTP
Under the hood: Scrapy extension connecting spiders to a Redis shared URL queue. Enables distributed crawling.
✓ Pros
✗ Cons
|
||||||||
| scrapy-cluster ⚡ | Framework | Python | N/A | Enterprise Scrapy | – | |||
|
⚡ HTTP
Under the hood: Distributed Scrapy cluster using Redis + Kafka + Zookeeper. Enterprise-scale distributed crawling.
✓ Pros
✗ Cons
|
||||||||
| scrapy-poet ⚡ | Framework | Python | N/A | Page Object pattern | – | |||
|
⚡ HTTP
Under the hood: Dependency injection framework for Scrapy spiders. Cleaner spider code with page objects.
✓ Pros
✗ Cons
|
||||||||
| Splash 🌐 | Browser | Docker | Lua scripting | Light protection | – | |||
|
🌐 Browser
Under the hood: Lua-scriptable browser for JS rendering, runs in Docker. Integrates with Scrapy via scrapy-splash.
✓ Pros
✗ Cons
|
||||||||
| selectolax ⚡ | Parser | Python | N/A | Fast HTML parser | – | |||
|
⚡ HTTP
Under the hood: C-based HTML parser (lexbor engine). 10–100× faster than BeautifulSoup for pure parsing tasks.
✓ Pros
✗ Cons
|
||||||||
| lxml ⚡ | Parser | Python | N/A | XPath + CSS parser | – | |||
|
⚡ HTTP
Under the hood: C-based XML/HTML parser. Fastest Python HTML parsing option.
✓ Pros
✗ Cons
|
||||||||
| w3lib ⚡ | Parser | Python | N/A | URL/text utils | – | |||
|
⚡ HTTP
Under the hood: Web-related utility functions. URL normalisation, encoding handling. Used internally by Scrapy.
✓ Pros
✗ Cons
|
||||||||
| SwiftShadow ⚡ | Proxy | Python | N/A | Proxy pool manager | – | |||
|
⚡ HTTP
Under the hood: Free proxy pool manager. Fetches, validates and rotates free proxies automatically.
✓ Pros
✗ Cons
|
||||||||
| requests-ip-rotator ⚡ | Proxy | Python | N/A | AWS API Gateway IPs | – | |||
|
⚡ HTTP
Under the hood: Rotates requests through AWS API Gateway endpoints to get rotating IPs.
✓ Pros
✗ Cons
|
||||||||
| Colly ⚡ | Framework | Go | Go TLS | Medium targets | 15k | |||
|
⚡ HTTP
Under the hood: Go HTTP scraping framework. Fast, concurrent, clean API.
✓ Pros
✗ Cons
|
||||||||
| Katana ⚡ | Framework | Go | Go TLS + Chromium | Medium targets | 8k | |||
|
⚡ HTTP
Under the hood: Go-based web crawler by ProjectDiscovery. Designed for security research and recon.
✓ Pros
✗ Cons
|
||||||||
| playwright-go 🌐 | Browser | Go | CDP (detectable) | Medium targets | – | |||
|
🌐 Browser
Under the hood: Go bindings for Playwright. Same Playwright API in Go.
✓ Pros
✗ Cons
|
||||||||
| Charles Proxy ⚡ | RE Tool | Any | N/A | Mobile API intercept | – | |||
|
⚡ HTTP
Under the hood: Commercial HTTPS proxy for request inspection and debugging. GUI-based.
✓ Pros
✗ Cons
|
||||||||
| Selenoid ⚡ | HTTP | Go (Docker) | Browser-as-a-service | Medium targets | 2.6k | |||
|
⚡ HTTP
Under the hood: Docker containers running headless Chrome/Firefox in parallel, Aerokube's Go-based Selenium grid replacement.
✓ Pros
✗ Cons
|
||||||||
| noble-tls ⚡ | HTTP | Python | Chrome JA3/JA4 | Cloudflare, DataDome | – | |||
|
⚡ HTTP
Under the hood: Python port of uTLS via custom TLS handshake stack, emits browser-matching ClientHello.
✓ Pros
✗ Cons
|
||||||||
| hrequests ⚡ | HTTP | Python | Browser-grade TLS | DataDome, Cloudflare | 900 | |||
|
⚡ HTTP
Under the hood: Drop-in requests replacement with TLS impersonation, header order matching, and optional Playwright browser mode.
✓ Pros
✗ Cons
|
||||||||
| crawlee-python 🌐 | Browser | Python | Via curl_cffi backend | Most targets | 6.2k | |||
|
🌐 Browser
Under the hood: Python port of Apify's Crawlee. Wraps curl_cffi for HTTP and Playwright for browser modes in a unified async framework with built-in retry, dedup, and storage.
✓ Pros
✗ Cons
|
||||||||
| estela ⚡ | Framework | Python (K8s) | Spider-dependent | Distributed Scrapy | 90 | |||
|
⚡ HTTP
Under the hood: Kubernetes orchestrator for Scrapy, schedules and runs spiders as K8s jobs with auto-scaling.
✓ Pros
✗ Cons
|
||||||||
| fake-useragent ⚡ | HTTP | Python | UA strings only | Lightweight only | 3.8k | |||
|
⚡ HTTP
Under the hood: Curated database of real-world User-Agent strings, sampled from browser telemetry sources.
✓ Pros
✗ Cons
|
||||||||
| grequests ⚡ | HTTP | Python | requests + gevent | Unprotected APIs | 4.4k | |||
|
⚡ HTTP
Under the hood: gevent-monkey-patched requests, fires hundreds of HTTP calls in parallel via greenlets.
✓ Pros
✗ Cons
|
||||||||
| Scrapoxy ⚡ | Framework | Node.js | Proxy manager | Self-hosted rotation | 2.1k | |||
|
⚡ HTTP
Under the hood: Self-hosted proxy pool manager, provisions proxies on AWS, Azure, GCP and rotates IPs automatically.
✓ Pros
✗ Cons
|
||||||||
Browser engines, deep dive
Runtime.enable timing, execution context leaks, and binding exposure all signal automation. Camoufox uses Mozilla's Juggler protocol below CDP, no CDP leaks. playwright-stealth patches JS at runtime but Function.toString() exposes the patch.
pip install playwright && playwright install
from camoufox.sync_api import Camoufox
pip install patchright
puppeteer-stealth plugin patches common detection points. CDP signature still visible at protocol level. Better for rendering tasks than hard anti-bot targets.
navigator.webdriver=true detectable in 2 JS lines. Use SeleniumBase UC mode to remove. Stock Selenium is dead against Akamai in 2026. Still valid for non-protected targets.
navigator.webdriver. Auto-solves many CAPTCHAs. Good for Kasada, medium targets. Not production-safe against Akamai at scale.
from seleniumbase import Driver
scrapy-nodriver integrates with Scrapy directly. Lighter than full Playwright for medium targets.
pip install botasaurus

    # Camoufox's documented entry point is the Camoufox context manager
    from browserforge.fingerprints import Screen
    from camoufox.sync_api import Camoufox

    # geoip=True: auto-aligns IP, timezone, locale, WebRTC simultaneously
    with Camoufox(
        geoip=True,      # align all 5 identity vectors to proxy exit country
        humanize=True,   # Gaussian mouse jitter
        proxy={
            "server": "http://proxy.provider.com:8011",
            "username": "user",
            "password": "pass",
        },
        # screen expects a browserforge Screen constraint, not a plain dict
        screen=Screen(max_width=1920, max_height=1080),
    ) as browser:
        page = browser.new_page()
        # Warm up, never go directly to target URL
        page.goto("https://www.google.com")
        page.wait_for_timeout(2000)
        page.goto("https://cloudflare-protected.com")
        page.wait_for_load_state("networkidle")
        print(page.content()[:500])

The tools above solve the access problem. But once you have the raw HTML or JSON, you still need to extract meaning from it. That is where AI-native scraping changes everything. In 2026 the bottleneck is not access. It is the extraction layer.
Describe, don't
select
AI-native scraping replaces CSS selectors with natural language. A 2025 NEXT-EVAL benchmark showed LLMs hit F1 > 0.95 on structured extraction when input is properly formatted.
/interact endpoint clicks, fills forms, extracts behind dynamic content. SAP, Zapier, Deloitte.
app.scrape(url) | app.crawl(site) | app.search("query")
result = await crawler.arun(url)
SmartScraperGraph(prompt="...", source=url)
pip install webclaw
r.jina.ai/{url} is the entire API. Returns clean Markdown. Dynamic content handled via built-in rendering. Free tier available, paid ~$0.002–$0.01/page.

    import asyncio
    import json

    from crawl4ai import AsyncWebCrawler
    from crawl4ai.extraction_strategy import LLMExtractionStrategy
    from pydantic import BaseModel

    # Define exactly what you want, LLM extracts it, no selectors needed
    class Product(BaseModel):
        name: str
        price: float
        model_number: str
        brand: str

    async def extract(url):
        strategy = LLMExtractionStrategy(
            provider="openai/gpt-4o-mini",
            schema=Product.model_json_schema(),
            extraction_type="schema",
            instruction="Extract all products with prices and model numbers",
        )
        async with AsyncWebCrawler() as crawler:
            result = await crawler.arun(url=url, extraction_strategy=strategy)
            return json.loads(result.extracted_content)

    # Run with: asyncio.run(extract(url))
    # F1 > 0.95 on well-structured pages, NEXT-EVAL benchmark 2025

What it ships:
• Stealth via curl_cffi TLS impersonation + rotating proxies
• Auto-discover REST + GraphQL endpoints on any site
• Record a flow once, export it as a runnable Python crawler
• Smart extraction with any OpenAI-compatible LLM (free tiers + local Ollama work)
• MIT licensed
One command:
uvx crawilfy-mcp-server
Why this matters: at $0.002–0.01 per request, commercial scraping APIs compound fast on any non-trivial AI agent. Crawilfy brings the full stack in-process: TLS impersonation, proxy rotation, LLM extraction, all from within your IDE. The alternative is paying per-request at scale.
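To make the compounding concrete, here is a back-of-the-envelope sketch. The per-request range is the one quoted above; the monthly volumes and the function name are hypothetical illustrations, not measurements from any provider.

```python
# Rough monthly spend for a pay-per-request scraping API.
# PRICE_LOW / PRICE_HIGH come from the $0.002-0.01 range quoted above;
# the request volumes below are made-up examples.
PRICE_LOW = 0.002
PRICE_HIGH = 0.01


def monthly_cost_range(requests_per_month: int) -> tuple[float, float]:
    """Return (low, high) monthly spend in USD for a given request volume."""
    return (
        round(requests_per_month * PRICE_LOW, 2),
        round(requests_per_month * PRICE_HIGH, 2),
    )


if __name__ == "__main__":
    for volume in (10_000, 100_000, 1_000_000):
        low, high = monthly_cost_range(volume)
        print(f"{volume:>9,} requests/month -> ${low:,.2f} to ${high:,.2f}")
```

An agent that makes a million requests a month lands between $2,000 and $10,000 at those rates, which is the gap an in-process stack is trying to close.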
From extraction to production-grade data
Crawl4AI and Firecrawl get you semantic understanding out of an LLM. But ask an LLM for a price across 10,000 articles and you will get $40, 40 dollars, 40 USD, "forty dollars", and occasionally null. Production pipelines cannot ingest that. The fix is to separate the two concerns LLMs conflate: semantic understanding and structural guarantees.

    # pip install instructor pydantic anthropic
    import anthropic
    import instructor
    from pydantic import BaseModel, Field

    class JobPosting(BaseModel):
        title: str
        company: str
        salary_min_usd: int | None = Field(description="Floor of salary range in USD")
        salary_max_usd: int | None
        years_experience_min: int
        location: str
        remote: bool

    client = instructor.from_anthropic(anthropic.Anthropic())

    # scraped_html: the page text you fetched in an earlier step
    result = client.messages.create(
        model="claude-sonnet-4",
        max_tokens=1024,  # the Anthropic Messages API requires max_tokens
        response_model=JobPosting,
        messages=[{"role": "user", "content": scraped_html}],
        max_retries=3,
    )
    # result is a validated JobPosting object, not a string
    # If LLM hallucinates "competitive" for salary, Instructor retries

Why this beats raw LLM calls: normalises currencies, units, and phrasings ("just under two percent" → 1.8); rejects hallucinated dates that don't fit the schema; retries automatically on type errors; gives you a real Python object downstream.
Use classical NLP when: scraping millions of consistent documents (e-commerce, classifieds), schema is fixed, domain doesn't shift, latency budget is <5ms per doc.
Use LLM + Instructor when: messy heterogeneous sources (news, newsletters, job boards), context disambiguation matters ("Apple" the company vs the fruit), schema may evolve, semantic equivalences need resolving ("FTE" = "full-time" = "permanent" = "direct hire").
Hybrid in production: classical NLP pre-filters and tags. LLM resolves only the ambiguous cases. This is what Bloomberg, Reuters Refinitiv, and FactSet actually do, not pure LLM pipelines.
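A minimal sketch of that hybrid split, using the price example from earlier in this section. The regex, the function names, and the `llm_fallback` callable are all hypothetical; in a real pipeline the fallback would be the Instructor call shown above, and the classical pass would be whatever NLP tagger you already trust.

```python
import re
from typing import Callable, Optional

# Cheap classical pass: handles "$40", "$1,299.50", "40 USD", "40 dollars".
_PRICE_RE = re.compile(
    r"\$\s*(?P<prefixed>\d[\d,]*(?:\.\d+)?)"
    r"|(?P<suffixed>\d[\d,]*(?:\.\d+)?)\s*(?:USD|dollars?)",
    re.IGNORECASE,
)


def classical_price(text: str) -> Optional[float]:
    """Return a price in USD if the text matches a known pattern, else None."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    raw = match.group("prefixed") or match.group("suffixed")
    return float(raw.replace(",", ""))


def extract_price(
    text: str, llm_fallback: Callable[[str], Optional[float]]
) -> Optional[float]:
    """Try the cheap path first; only ambiguous inputs reach the LLM."""
    price = classical_price(text)
    if price is not None:
        return price
    return llm_fallback(text)
```

The point of the design is cost control: the fallback is only invoked for inputs like "forty dollars" that the pattern cannot resolve, so LLM spend scales with ambiguity rather than with document count.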
Sources and further reading: Federico Trotta, The Web Scraping Club, May 2026; Instructor library.
date | None field with Instructor's retry logic catches this: the LLM has to either find a real date or return None. Without schema validation, fabricated dates pass into your database as facts.
The agentic browser stack is 8 layers, not one product
"Browser agent" sounds like a single tool. It's a stack. Most teams building AI agents account for one or two layers; the reliable ones map all eight. When an agent fails a task, the cause is usually not the framework everyone debates, it's one of the other seven layers nobody mapped. The proxy layer sits at the bottom and every layer above it still has to reach the live site.
When DIY cost exceeds platform cost, these services handle the heavy lifting. Each solves a specific problem, choosing the right one depends on which wall you are facing and at what scale.
When DIY cost
exceeds platform cost
If you spend more than 2 engineer-days/month on anti-bot maintenance, a managed platform is cheaper. The crossover typically hits when you face F5 Shape or Kasada at scale.
The best option if you don't want to build scrapers yourself. Apify is a cloud platform where scraping is already done for you, 10,000+ community-built Actors cover almost every major website: Amazon, LinkedIn, Instagram, Google Maps, TikTok, Zillow, Twitter/X, Google Search, and thousands more. You pick an Actor, give it a URL, and get back clean JSON. No Python, no proxies, no infrastructure.
- You need data from a well-known site quickly
- You don't want to maintain scrapers long-term
- You're building an AI agent that needs live web data
- You want someone else to handle anti-bot bypasses
- You need to scale without managing infrastructure
- Your target site has no existing Actor
- You need custom data transformation logic
- You're scraping at very high volume (cost)
- You need full control over request patterns
- Data stays internal and can't touch third-party cloud
act(), extract(), observe(), agent(). Write browser flows in plain English ("click submit button") that survive page redesigns via runtime LLM resolution. Built on CDP, supports OpenAI/Anthropic/Gemini. 65% Mind2Web benchmark. Self-healing + auto-caching. TypeScript and Python.
Computer Use Agents when scraping isn't enough
A new category emerged in 2025: AI agents that don't just scrape, they log in as the user, navigate any UI (web apps, legacy portals, desktop software), handle MFA and CAPTCHAs, and return structured JSON. Different from scrapers because the user grants permission, "Plaid for any website." If your problem is utility bills, payroll exports, e-commerce backends, or any portal without a public API, this is the category.
Platforms sort out the browser and the fingerprint. But every request still needs an IP address, and the type of IP matters as much as any other signal in your stack.
IP type matters
more than provider
Rotating proxies is table stakes. The real variable is IP type: datacenter IPs score near-zero on DataDome and PerimeterX regardless of fingerprint quality.
github.com/fabienvauchelles/scrapoxy
geoip=True in Camoufox to align all five vectors automatically.
http:// and https:// keys must use http:// scheme. Using https:// causes BoringSSL WRONG_VERSION_NUMBER (TLS-over-TLS failure). Fix: "https": "http://key:@proxy.crawlera.com:8011/"

    import random
    import time

    from curl_cffi import requests

    session = requests.Session(impersonate="chrome124")

    # Crawlera/Zyte: BOTH keys use http://, never https://
    PROXIES = {
        "http": "http://apikey:@proxy.crawlera.com:8011",
        "https": "http://apikey:@proxy.crawlera.com:8011",  # http:// not https://
    }

    def fetch(url, retries=3):
        for i in range(retries):
            try:
                # verify=False: proxy cert
                r = session.get(url, proxies=PROXIES, timeout=30, verify=False)
                if r.status_code == 200:
                    return r
                if r.status_code in (403, 429):
                    # exponential backoff with jitter before the next attempt
                    time.sleep(2**i + random.uniform(0, 1))
            except Exception as e:
                print(f"Error: {e}")
        return None

You now have the full picture: detection layers, six anti-bots, sixty libraries, managed platforms, proxy types. This section collapses all of it into a single decision tree you can follow for any target site.
Walk this in order.
Stop at first win.
Each step adds complexity, cost, and maintenance. Most production scraping is solved at steps 1–3. Never start at step 5.
SSL_read/SSL_write directly. If you find the API endpoint, every HTML anti-bot becomes irrelevant.
__NEXT_DATA__. React SPAs often have >50KB script containing all data. Confirmed: Grainger.com (DataDome-protected), 110KB JS state blob bypasses DataDome entirely because it's in initial HTML.
curl_cffi with JA4 impersonation resolves most Akamai and DataDome at HTTP layer. Add residential proxy. If __NEXT_DATA__ appears in response, extract it with chompjs.
Quick reference cheat sheet
| Anti-bot | Primary vector | Steps 1–2 viable? | Best tool | Key note |
|---|---|---|---|---|
| Akamai | JA4+ + sensor.js + extension probes | Often | curl_cffi + CloakBrowser | Find mobile/GraphQL first |
| Cloudflare | JA4 Rust edge + Turnstile | Sometimes | Camoufox | Origin IP via SecurityTrails |
| DataDome | 85K ML + WASM boring_challenge | Yes | curl_cffi + mobile IP | Check __NEXT_DATA__ first |
| PerimeterX | 5-vector score | Sometimes | Camoufox + residential | Fresh session per domain |
| Kasada | Polymorphic JS PoW | Rarely | PatchRight + residential | Never playwright-stealth |
| F5 Shape | Custom VM + minute expiry | No | Managed API | DIY not practical |
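The "check __NEXT_DATA__ first" note in the table can be sketched with the standard library alone. This is a minimal illustration, not a hardened parser: the sample HTML and its `product` fields are invented, and the regex assumes the script tag carries `id="__NEXT_DATA__"` with double quotes. Because `__NEXT_DATA__` is strict JSON, `json.loads` is enough here; chompjs earns its place when the embedded state is a JavaScript object literal that is not valid JSON.

```python
import json
import re
from typing import Any, Optional

# Matches <script id="__NEXT_DATA__" ...>{...}</script> in the initial HTML.
_NEXT_DATA_RE = re.compile(
    r'<script[^>]*\bid="__NEXT_DATA__"[^>]*>(.*?)</script>',
    re.DOTALL,
)


def extract_next_data(html: str) -> Optional[dict[str, Any]]:
    """Return the parsed __NEXT_DATA__ blob, or None if absent or malformed."""
    match = _NEXT_DATA_RE.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


# Hypothetical page used only to exercise the function.
SAMPLE_HTML = """
<html><body>
<div id="__next">rendered markup</div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"product": {"name": "Widget", "price": 19.99}}}}
</script>
</body></html>
"""

if __name__ == "__main__":
    data = extract_next_data(SAMPLE_HTML)
    print(data["props"]["pageProps"]["product"])
```

If the function returns a dict on the first plain HTTP response, you can usually stop there and never launch a browser for that target.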
What practitioners are
actually shipping in 2026
Fresh insights from engineers actively solving these problems in production. Shared publicly on LinkedIn.
iamNotaRobot.js, abuse-component.js, aps.js: it is PerimeterX rebranded and served from their own domain. Trust the code, not the label. (3) First-party cloaking: Home Depot serves PX-shaped scripts under random first-party filenames. You cannot identify defenders by checking hostnames anymore, you have to watch how the script behaves at runtime. (4) Lazy-loaded defenses: Ticketmaster ships a reCAPTCHA site key in the homepage JSON but the SDK only loads on login or checkout. Probing only the homepage misses everything. Multi-hop traversal (homepage → login → cart) is the minimum recon bar now. (5) The fingerprint dictates the budget: PerimeterX + behavioral biometrics + reCAPTCHA Enterprise on one page tells you exactly what tier of browser, what kind of proxy, and how slow your automation has to be. Recon is upstream of every other decision. The takeaway: stop asking "which vendor does this site use" and start asking "which stack does this site compose, and where on each layer do I look for cracks."
navigator.webdriver = false when an AI agent drives Playwright. Google patched out CDP detection in V8. Neither change was announced. The signals every anti-bot tool relied on to flag automation just became officially unreliable, because AI agents browsing on behalf of real users broke the human-vs-automation binary.
navigator.webdriver = true. That was the easiest detection signal in the industry, and most public anti-bot stacks were built around it (along with CDP-detection tricks for Chrome's DevTools Protocol). Two undocumented changes in 2026 have just made those signals soft: (1) Microsoft Edge returns navigator.webdriver = false when Playwright is driven by an AI agent on behalf of a user; (2) Google patched out the most common CDP-detection technique in V8. No release notes, no announcements. The reason is rational from the browser vendors' side: agentic browsing is now a legitimate use case (Anthropic Computer Use, OpenAI Operator, Browser Use, etc.)
and the old binary doesn't apply. The implication for the guide and for production scrapers: any detection or bypass strategy that pivots on these flags needs to assume they are no longer reliable as either signal or counter-signal. Detection has to move up the stack, to behavioural ML, intent patterns, and network-identity layers (TLS JA4, IP reputation, WebRTC, DNS coherence), all of which are far harder to remove from the inside. DataDome's threat-research team published a longer breakdown if you want the technical specifics.
navigator.storage and timing the flush. Incognito routes storage to RAM, normal mode hits disk. RAM is faster. That is the entire vulnerability.
navigator.storage and measures the flush time. Under 0.1ms indicates RAM (incognito), above indicates disk (normal mode). No permission prompts, no API quirks the user can disable, runs in standard JavaScript on any page. Known caveats: RAM disks (used by some privacy-conscious users) trigger false positives unless the threshold is tuned (~0.01ms separates RAM-disk from incognito on test hardware); slow HDDs do not produce false positives because the technique detects suspiciously fast writes, not slow ones. For anti-bot: detecting incognito is a useful behavioural signal (legitimate buyers rarely shop in incognito; scrapers and abuse traffic over-index on it). For scraping: if your stealth stack uses incognito or per-session ephemeral storage to keep contexts clean, you are leaking a signal that is now trivially detectable. The fix is the same as for the broader timing-attack class: use persistent profile directories that hit real disk, accept the cookie/storage management overhead.
playwright.launch() spinning up a local browser per script, run a persistent Playwright server on a dedicated box exposing a WebSocket endpoint, and have every scraping script connect("ws://host:3000") as a client. The page can't tell the difference, the API is identical.
Five hard-won lessons from the build: (1) Binary choice matters as much as library choice. JS overrides of navigator.webdriver are themselves detectable (wrong property descriptor, wrong prototype chain); source-patched binaries like CloakBrowser remove the signal instead of masking it. (2) Headed via Xvfb beats headless, the virtual framebuffer means nothing looks headless because it isn't. (3) The two-slot trap: Playwright keeps TWO Chromium directories, a full build and a stripped chrome-headless-shell. Replace only the full slot with your patched binary and Playwright silently launches the untouched headless shell instead, you get 403s and the wrong version string with nothing in the logs explaining why. You must replace both slots and rename the headless one to chrome-headless-shell. (4) supervisord inside Docker manages the multi-process reality (Xvfb priority 10, Playwright server priority 20 with startsecs delay). (5) Concurrency = contexts, not instances. One browser, a pool of isolated contexts (separate cookies/storage), workers pull from an async queue, a 403 requeues with backoff and the worker grabs the next job. Proxy creds go per-context so a bad IP just retries with a fresh one. 16 concurrent contexts ran fine on a Ryzen 4650G mini-desktop. Full writeup · github.com/jhnwr/browser-service · YouTube
.so library (common for the sensitive bits), Ghidra decompiles the C/C++ to understand the algorithm. (3) Frida hooks those functions at runtime via injected JavaScript, so you can log the inputs/outputs, bypass certificate pinning, or call the signing function directly to mint valid headers, no need to fully reverse the algorithm if you can just invoke it. (4) Run all of this on a rooted Android emulator from Android Studio for a controlled, disposable lab. Pair with HTTPToolkit or mitmproxy to capture the now-decrypted traffic and recover the API contract.
MobileHackingLab offers a free Android Frida course with a certificate and CTF-style challenges, the fastest way in if the toolchain feels intimidating. The payoff: once you can mint the app's signed headers, you call its clean JSON API directly and skip every browser-layer anti-bot entirely. MobileHackingLab free Frida course
hyphens: auto CSS and measure rendered output.
hyphens: auto is set and text overflows a container, the browser inserts soft hyphens at language-specific break points (so "hyphenation" becomes "hy-phen-ation"). The dictionary that drives this is OS-level on Android and macOS, but Chromium on Windows and Linux must bundle it at build time. Most people forking Chromium don't know this: the build artifact is large and the feature is invisible until you specifically test it. Joe (joe12387) demonstrated this is a detection vector: anti-bot scripts can render a known word in a known-width container with hyphens: auto, screenshot via Canvas, and compare the hyphenation positions against expected values for the claimed OS. A custom Chromium fork that fails to hyphenate at all (or hyphenates wrong) reveals itself instantly. Mitigation: ensure your build includes the hyphenation dictionary for the languages you claim to support, or run real Chromium binaries (not forks) under XVFB instead. Live PoC · github.com/joe12387
_abck, datadome, cf_clearance, reese84), identify sensor.js challenge endpoints, figure out which requests trigger re-validation. For a moderately complex target like nike.com, this takes hours per session. With Burp's MCP server pointed at Claude Code, you capture the same session and prompt: "trace the _abck cookie lifecycle from home page through add-to-cart, identify all sensor payload endpoints, and explain the validation flow." Claude reads Burp's full history directly and produces the analysis in seconds. The pattern scales: build a reusable burp-antibot-recon Skill once, replay it across Akamai/DataDome/Cloudflare targets.
If you work in this space and haven't wired it up, this is the unlock. github.com/PortSwigger/mcp-server
requests or curl_cffi, the challenge is unsolvable without JS execution. The bypass is mundane: any headless browser (Playwright, Camoufox, Patchright) with JS enabled will solve it automatically. Persist the auth cookie (techaro.lol-anubis-auth) and reuse it across requests. The political angle: Anubis exists because AI scrapers (OpenAI, Anthropic, Common Crawl, ByteDance) were DDoSing small FOSS projects by ignoring robots.txt. It's a community response, not a commercial product. github.com/TecharoHQ/anubis
requests library sends a different cipher suite order than Chrome. httpx is different again. Even with a clean residential IP, if your cipher ordering does not match Chrome's, you are identified before the server processes a single header. Fix: use curl_cffi with impersonate="chrome124"; it emits Chrome's exact TLS ClientHello. Also watch HTTP/2 SETTINGS frames, they contain window sizes and header table parameters that vary per client.
geoip=True and it automatically aligns all five vectors. Do not simply disable WebRTC, it removes a feature that 99% of real users have, which itself becomes a bot signal.
camoufox or rayobrowse to generate sessions, then curl_cffi with the extracted cookies for bulk collection. Rotate sessions every 30-50 requests.
blocked_domains list to block tracking/CDN requests in headless mode, automatic proxy-aware retry on network errors, Response.follow() for easy link chaining. Install: pip install scrapling --upgrade.
QuickProxy(countries=["FR","DE"]) API filters by exit country. The built-in cache means it does not hit proxy list APIs on every request. Usage: from swiftshadow import QuickProxy; proxy = QuickProxy(); session.proxies = {"http": str(proxy), "https": str(proxy)}. Important: free proxies have high failure rates and low anonymity, do not use for Akamai, DataDome, or PerimeterX targets.
Best for scraping open/unprotected sites at scale without cost.pip install cocoindexconfigure sources (files, URLs, S3), define your chunker and embedding model, run cocoindex.build()done in under 10 minutes.curl_cffi for TLS, full Chrome headers via httpx or curl_cffi, random.uniform(1.8, 4.3) delays, requests.Session() for cookie accumulation, residential/mobile proxies for IP. Check your current fingerprint at tls.browserleaks.com/json.Detection risk by stack (lowest is best):
| Stack | Detection Risk | Limitation |
|---|---|---|
| requests / httpx | Very High | No browser rendering |
| Scrapy | Very High | No behavioural realism |
| Headless Browser | High | Headless traces (WebGL=null, missing extensions) |
| Stealth Headless | Medium | Partial spoofing, JS patches detectable |
| XVFB + Headed Browser | Lowest | Higher data consumption |
✓ XVFB virtual display (real X11 server, not headless flag)
✓ Fully headed Chrome (no --headless anywhere)
✓ Nodriver for CDP without webdriver artefacts (or Camoufox for Firefox)
✓ Authentic TLS / HTTP-2 behaviour (the browser handles this for free)
✓ Humanised interactions (Bezier-curve mouse, variable scroll timing)
✓ Residential proxies, sticky session for trust accumulation
✓ Fingerprint coherence (UA + WebRTC + DNS + timezone all match exit IP)
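To make the "Bezier-curve mouse" item in the checklist above concrete, here is a minimal sketch of a curved pointer path generator. The function names, the `spread` parameter, and the delay range are illustrative assumptions, not values taken from any anti-bot vendor or library.

```python
import random


def bezier_path(start, end, steps=30, spread=0.25, rng=None):
    """Return steps+1 points on a cubic Bezier curve from start to end.

    The two control points sit a third and two thirds of the way along the
    straight line, each nudged by a random offset (assumed heuristic).
    """
    rng = rng or random.Random()
    (x0, y0), (x3, y3) = start, end
    dx, dy = x3 - x0, y3 - y0
    offset = ((dx * dx + dy * dy) ** 0.5) * spread

    def control(t):
        return (
            x0 + dx * t + rng.uniform(-offset, offset),
            y0 + dy * t + rng.uniform(-offset, offset),
        )

    (x1, y1), (x2, y2) = control(1 / 3), control(2 / 3)
    points = []
    for i in range(steps + 1):
        t = i / steps
        # Bernstein weights of a cubic Bezier curve
        a, b, c, d = (1 - t) ** 3, 3 * (1 - t) ** 2 * t, 3 * (1 - t) * t ** 2, t ** 3
        points.append((a * x0 + b * x1 + c * x2 + d * x3,
                       a * y0 + b * y1 + c * y2 + d * y3))
    return points


def step_delays(count, low=0.008, high=0.03, rng=None):
    """Uneven pauses (in seconds) between successive pointer moves."""
    rng = rng or random.Random()
    return [rng.uniform(low, high) for _ in range(count)]
```

Each point can then be handed to whatever driver moves the pointer (for example Playwright's `page.mouse.move(x, y)`), sleeping for the matching delay between moves.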
Why XVFB beats --headless even with stealth patches: headless Chrome reports HeadlessChrome in the user agent (fixable), missing extensions (probe-able), and zero GPU context (the real killer). With XVFB you get a real display, Chrome runs in headed mode, extensions load normally, and the GPU stack is whatever your server provides. JS patches still leave Function.prototype.toString() traces; XVFB does not.
The serverless angle: the conventional wisdom is that serverless cannot run a real browser. The trick is provisioning an X11 socket inside the container (Xvfb :99 &, DISPLAY=:99 chrome ...) so Chrome runs headed on a virtual display. Lambda has hit memory limits historically, but ECS Fargate, Cloud Run, and Modal handle this comfortably with ~1GB memory per browser instance. The result: serverless infrastructure behaving like real users, not automation.
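The Xvfb plus DISPLAY trick can be sketched as a small container entrypoint. This is an illustration under assumptions: the `google-chrome` binary name, the profile path, the screen geometry, and the one-second wait are placeholders to adapt to your own image.

```python
import os
import shutil
import subprocess
import time


def xvfb_command(display=":99", screen="1920x1080x24"):
    """Command line for a virtual X11 display (no physical screen needed)."""
    return ["Xvfb", display, "-screen", "0", screen]


def chrome_command(binary="google-chrome", profile_dir="/tmp/chrome-profile"):
    """Headed Chrome: there is deliberately no --headless flag here."""
    return [binary, f"--user-data-dir={profile_dir}", "--no-first-run"]


def headed_env(display=":99", base_env=None):
    """Copy of the environment with DISPLAY pointing at the virtual display."""
    env = dict(os.environ if base_env is None else base_env)
    env["DISPLAY"] = display
    return env


def launch(display=":99"):
    """Start Xvfb, then Chrome on top of it. Returns both processes."""
    if shutil.which("Xvfb") is None:
        raise RuntimeError("Xvfb is not installed in this image")
    xvfb = subprocess.Popen(xvfb_command(display))
    time.sleep(1.0)  # crude wait for the X socket; a real setup would poll
    chrome = subprocess.Popen(chrome_command(), env=headed_env(display))
    return xvfb, chrome
```

Keeping the command builders separate from `launch()` makes it easy to log or unit-test exactly what the container will run before any process starts.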
What still beats XVFB: C++ patched browsers like Camoufox (canvas, WebGL, audio at the binary level) and CloakBrowser (real extension probe profiles) close the remaining 10%. But for the 80-90% of targets where XVFB + Nodriver gets you in, the cost difference is significant. Camoufox: 200MB+ per instance. XVFB headed Chrome: same memory but works on any Chromium binary.
Modern anti-bot systems are trained to detect machines pretending to be browsers. The path forward is not better lies, it is fewer lies.
performance.now(). Both run below the JS hooks Camoufox, CloakBrowser, and PatchRight patch.
The enabling primitive arrived in 2024 from Manuel at brokenbrowser.com: a one-liner that gets you a real SharedArrayBuffer on any page, no special headers, by calling new WebAssembly.Memory({shared:true}).buffer. Drive a MessageChannel ping-pong with Atomics.add() inside it and you have a counter ticking at 100,000 Hz, micro-timing precision around 6 microseconds. Chrome marked it Won't Fix.
What this defeats:
× Camoufox (Firefox C++ patches at the browser layer)
× CloakBrowser (49 Chromium binary patches)
× PatchRight, undetected-chromedriver, Nodriver, Pydoll
× Every JS prototype patch (Function.toString detection is irrelevant when nothing JS is touched)
What still works: real hardware diversity. The future of stealth scraping is real consumer machines on real ISP IPs, which is essentially what high-quality residential proxy networks like Massive, Bright Data, and Oxylabs already provide. As detection moves into the CPU layer, the value of actually being real compounds.
This is also why akamai-v3-sensor works on Akamai v3: it never executes the WASM at all because it never reaches sensor.js. By bypassing at the TLS layer, you skip every detection layer above it.
Sources: Anthony Manikhouth (DataDome) and Manuel (brokenbrowser.com).
Why this matters for scrapers: if you are rotating cookies between requests to look like a fresh visitor, but the target site is reading your localStorage entry from the previous session, your rotation does nothing. Anti-bot vendors like Forter and Riskified have shipped variants of this for years. Cloudflare's cf_clearance cookie now has localStorage backup in some configurations.
Storage layers a real reset has to clear:
✓ Cookies (HTTP and JS)
✓ localStorage and sessionStorage
✓ IndexedDB (every database)
✓ Service Worker registrations and Cache API entries
✓ FileSystem API (legacy but still works)
✓ Web SQL (deprecated but persists on older Chromium)
✓ ETag / If-Modified-Since headers cached at HTTP layer
✓ HSTS pin database (yes, browsing data can be encoded in HSTS pins, this is real)
Practical implication for scrapers: when you rotate sessions, do not just clear cookies. Either spin up a fresh browser profile each session (Playwright context.close() + new context, or a fresh Camoufox BrowserContext), or run in an entirely isolated container. Half-measures leak state.
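One way to avoid half-measures is to give every session a throwaway profile directory, so every storage layer in the list above starts empty and disappears afterwards. A minimal sketch; the `fresh_profile` helper and its prefix are hypothetical names.

```python
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def fresh_profile(prefix="scrape-profile-"):
    """Yield an empty, unique profile directory and delete it afterwards.

    Everything the browser persists to disk (cookies, localStorage,
    IndexedDB, service workers, caches) lives under this directory, so
    removing it resets all of those layers at once.
    """
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

The yielded path can be passed to Chrome as `--user-data-dir=<path>`, or to Playwright's `launch_persistent_context(user_data_dir=...)`, once per session.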
For the curious: the original Evercookie by Samy Kamkar in 2010 used 13 storage mechanisms. Modern browsers have removed several (Flash LSO, Silverlight, Java applets), but added more (Service Workers, BroadcastChannel, OPFS). The trick is alive and well, just modernised.
What was actually happening: the scraping continued. The traffic that used to say "I am GPTBot" was now saying "I am Chrome 124 on macOS." Same content destinations, same fetch patterns, different label.
User-Agent is a string the client chooses to send. Polite scrapers send a real one. The scrapers you are actually worried about — the ones running at commercial scale on behalf of paying customers — send whatever string gets through.
robots.txt works on:
✓ Search engine and academic crawlers (Googlebot, Bingbot, academic research bots)
✓ Large AI labs (OpenAI, Anthropic, Google) that have reputational incentives to comply
✓ Hobbyist scrapers who read the rules and care
robots.txt does not work on:
× Commercial data brokers sending Chrome User-Agents
× Competitive intelligence tools running at scale behind residential proxies
× AI startups that have not publicly announced themselves
× Anyone whose business depends on data you do not want them to have
The implication for anti-bot systems: blocking by User-Agent is the weakest possible signal. Cloudflare, Akamai, and DataDome do not read robots.txt. They score TLS fingerprints, canvas hashes, behavioural timing, and IP reputation because those signals are harder to fake. User-Agent string matching is not a detection layer. It is a flag for voluntary compliance.
For scrapers reading this: if a target blocks GPTBot in robots.txt but has no real anti-bot scoring, the robots.txt is the only gate. Respect it. If they have Akamai or Cloudflare deployed, the robots.txt is decorative. The actual gate is the JA4 hash, the canvas probe, the IP reputation check. That is where this guide comes in.
Via a publisher conversation, May 2026.
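For the case where robots.txt is the only gate and should be respected, the check itself is a few lines with Python's standard library. A minimal sketch: the sample rules, the `MyCrawler` agent name, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: GPTBot is blocked everywhere, everyone else
# is only kept out of /private/.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Note that the answer depends entirely on the User-Agent string passed in, which is the point made above: the file only constrains clients that identify themselves honestly.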
The 2026 framing: web scraping is distributed adversarial systems engineering.
What modern infrastructure has to operate against:
• Fingerprinting systems (TLS JA4, canvas, WebGL, WASM SIMD)
• Behavioural detection (mouse physics, scroll timing, inter-request jitter)
• Anti-bot orchestration (Akamai EdgeWorker, Cloudflare Worker, DataDome middleware)
• Cloudflare interstitials and Turnstile challenges
• Dynamic runtime rendering (SPAs, hydration, lazy loading, service workers)
• Session-aware defences (trust accumulation, per-session scoring, unclearable cookies)
This is what makes tools like Scrapling architecturally interesting beyond "another Python scraper." It combines stealth browser execution, TLS fingerprint impersonation, adaptive element tracking that survives DOM changes, session-aware orchestration, proxy rotation, and MCP-based AI extraction. Not a scraper. A runtime.
The systems properties that matter now:
• Runtime orchestration (not just retries, state-aware crawl management)
• Observability (what failed, at which layer, on which request)
• Adaptive recovery (selector drift, DOM changes, anti-bot updates)
• State persistence (session trust, cookie chains, cross-request identity)
• Stealth execution (not a flag, an architecture)
• Infrastructure resilience (circuit breakers, session warmup, fallback tiers)
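The "circuit breakers" and "fallback tiers" in the list above can be sketched in plain Python. This is a simplified illustration, not how Scrapling or any other named tool implements it; the class, the thresholds, and the tier names are assumptions.

```python
import time


class CircuitBreaker:
    """Stop calling a tier after repeated failures, retry after a cooldown."""

    def __init__(self, max_failures=3, cooldown=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: close the circuit and let traffic through again
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def fetch_with_tiers(url, tiers, breakers):
    """Try each (name, fetch) tier in order, skipping tiers whose breaker is open."""
    for name, fetch in tiers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = fetch(url)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all tiers failed or are cooling down")
```

Ordering the tiers cheapest first (plain HTTP, then a stealth browser, then a managed API) mirrors the decision flow at the top of this guide, and the returned tier name gives you the "which layer handled it" observability signal for free.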
Once AI agents start interacting with the web autonomously at scale, reliability becomes a systems problem first and a parsing problem second. The pipeline that Firecrawl, Crawl4AI, Stagehand, and Scrapling are converging on is not "scraper plus LLM." It is a resilient extraction runtime with LLM as one processing layer among many.
Framing via D4Vinci (Scrapling author), May 2026.
What they found:
12.5% of analysed sites deployed fingerprinting-related scripts consistent with harvesting. A subset replicated vendor-specific telemetry from PerimeterX, Incapsula, Akamai, Adyen, and hCaptcha, not to defend themselves, but to collect and replay the same signals against those vendors.
The mechanics:
Services like impersonate[.]pro openly advertise "comprehensive TLS, HTTP/2, HTTP/3, and JavaScript fingerprint collection." In Discord and Telegram communities, bot developers discuss embedding custom JavaScript on real websites specifically to harvest fingerprints from genuine visitors. The goal: build inventories of authentic device profiles that can be injected into automated sessions.
The PerfectCanvas mechanism from Bablosoft is the clearest example. Their documentation describes exactly the pattern:
• Render canvas on a real remote machine with a real GPU
• Send the canvas output to the automation server
• Inject it into the headless browser's response to the canvas probe
This is the harvesting-and-replay model made explicit. Instead of spoofing canvas values (detectable via inconsistency), you replay values from a real Mac. The fingerprint is genuine. It just came from a different device.
Genesis Marketplace established the precedent: ~323,000 compromised browser environments for sale, each bundled with a real device fingerprint and a custom Chromium extension that injected the victim's browser profile into attacker sessions. F5 Labs and Europol both documented this. Castle's report shows the same approach is now commercialised at scale for bot traffic, not just account takeover.
What this means for scrapers:
The arms race has a new dimension. Anti-bots are scoring fingerprints. Bot services are buying real fingerprints to replay. Defenders are now building for replay conditions, not just spoofing conditions. This is why:
• Canvas/WebGL probes are increasingly paired with behavioural and timing signals (harder to replay than static values)
• WASM SIMD CPU probes (above) are valuable precisely because they are harder to harvest and replay than JS-layer fingerprints
• Anti-bots are introducing controlled variability in their own client-side scripts so that even valid payloads can't be reverse-engineered and replayed reliably
The implication for this guide: when a stealth browser passes the canvas probe, it may not be because it spoofed the hash well. It may be because it replayed a real hash that was never flagged. The distinction matters because vendors will move toward replay-resistant probes, making the harvest-and-replay model progressively harder. WASM SIMD (which requires real hardware timing) is an early example of a replay-resistant signal.
Source: Fingerprint Harvesting in the Bot Ecosystem, Castle Research, Antoine Vastel, April 2026.
Check your own
fingerprint first
Before you bypass anything, you need to know what your setup is leaking. These tools show exactly what anti-bots see when your scraper connects. Run your scraper through them, not just your browser.
How production scrapers
are actually built
From a single Scrapyd daemon to multi-region ECS clusters. Eleven real pipeline architectures, from simple to enterprise-scale, with every component and data flow mapped out.
The simplest production setup. One server, Scrapyd managing spiders via JSON API, ScrapydWeb as UI. Good for <50 spiders and teams without Kubernetes. Deploy with scrapyd-deploy, schedule via /schedule.json, monitor at port 6800.
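Scheduling a run through /schedule.json is a form-encoded POST. A minimal sketch using only the standard library; the project name, spider name, and `category` argument in the usage note are placeholders, and the base URL assumes Scrapyd on its default port.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_schedule_request(base_url, project, spider, **spider_args):
    """Build the POST that asks Scrapyd to queue one spider run.

    Extra keyword arguments are sent as form fields, which Scrapyd
    forwards to the spider as spider arguments.
    """
    form = urlencode({"project": project, "spider": spider, **spider_args})
    return Request(
        f"{base_url.rstrip('/')}/schedule.json",
        data=form.encode(),
        method="POST",
    )


def schedule(base_url, project, spider, **spider_args):
    """Send the request and return Scrapyd's JSON reply as a dict."""
    request = build_schedule_request(base_url, project, spider, **spider_args)
    with urlopen(request, timeout=10) as response:
        return json.load(response)
```

Usage would look like `schedule("http://localhost:6800", "shop", "products", category="shoes")`, run from cron or any scheduler you already have.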
Self-Healing Scraper
powered by Claude
Scrapy spiders break when sites change their HTML. Instead of manually fixing selectors, this architecture uses Claude to detect failures, analyse the new page structure, and write corrected selectors automatically, without human intervention.
You are a web scraping expert. A Scrapy spider broke because the site changed its HTML.
Old selectors (no longer working):
title: h1.product-title::text
price: span.price-now::text
image: img.main-image::attr(src)
New page HTML (truncated):
{{ page_html[:8000] }}
Return ONLY valid JSON with corrected selectors:
{"title": "...", "price": "...", "image": "..."}
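Around that prompt, the pipeline needs two small pieces of glue: filling in the template, and refusing to deploy a reply that is not the JSON it asked for. A minimal sketch, with the model call itself left out; `build_prompt` and `parse_selectors` are hypothetical helper names.

```python
import json


def build_prompt(old_selectors, page_html, max_chars=8000):
    """Fill the repair prompt with the broken selectors and truncated HTML."""
    old = "\n".join(f"{field}: {sel}" for field, sel in old_selectors.items())
    shape = ", ".join(f'"{field}": "..."' for field in old_selectors)
    return (
        "You are a web scraping expert. A Scrapy spider broke because "
        "the site changed its HTML.\n"
        f"Old selectors (no longer working):\n{old}\n"
        f"New page HTML (truncated):\n{page_html[:max_chars]}\n"
        f"Return ONLY valid JSON with corrected selectors:\n{{{shape}}}"
    )


def parse_selectors(reply, required_fields):
    """Validate the model's reply before any selector reaches a spider."""
    data = json.loads(reply)  # raises ValueError on non-JSON replies
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = [
        field for field in required_fields
        if not isinstance(data.get(field), str) or not data[field].strip()
    ]
    if missing:
        raise ValueError(f"missing or empty selectors: {missing}")
    return {field: data[field] for field in required_fields}
```

A sensible next step, before swapping selectors in production, is to run the returned selectors against the same HTML and check that each one actually matches something.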
Build vs buy:
the number that decides
The most common mistake is treating "can I bypass it" as the only question. The real production question is "what does each successful record cost, and is rolling my own cheaper than paying someone else." Here is the honest math, with the caveat that exact prices move constantly, so treat these as orders of magnitude, not quotes.
Self-hosted stealth stack vs managed API
| Approach | Monthly cost driver | Rough cost | Best when |
|---|---|---|---|
| Self-hosted: HTTP + curl_cffi + residential proxies | Proxy bandwidth (the dominant cost), small server | Proxy GB at roughly 3 to 8 USD/GB, plus ~50 to 200/mo compute | High volume on targets that yield to TLS impersonation, where you control bandwidth use |
| Self-hosted: Camoufox/CloakBrowser cluster + residential proxies | Server CPU/RAM for browsers, plus proxy bandwidth (browsers burn far more GB) | ~200 to 1,200/mo compute, plus heavy proxy GB; engineering time to maintain | Hardened targets that need a real browser, at volumes where per-request API fees would exceed infra cost |
| Managed API (ScraperAPI, Zyte, Bright Data Web Unlocker, Scrapfly) | Per successful request, anti-bot handling included | Roughly 1 to 5 USD per 1,000 requests (more for JS-render / hard targets) | Low-to-medium volume, or hard targets where engineering time is worth more than the per-request fee |
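The comparison in the table reduces to simple arithmetic once you fix a page size. A rough sketch: the 100 KB per record and the specific prices in the usage note are assumptions picked from inside the ranges above, and engineering time is deliberately left out.

```python
def self_hosted_cost_per_1k(records, kb_per_record, usd_per_gb, compute_usd):
    """Monthly proxy bandwidth plus compute, expressed per 1,000 records."""
    gigabytes = records * kb_per_record / 1_000_000
    total = gigabytes * usd_per_gb + compute_usd
    return total / records * 1000


def break_even_records(kb_per_record, usd_per_gb, compute_usd, api_usd_per_1k):
    """Monthly volume above which self-hosting is cheaper than the API.

    Returns None when proxy bandwidth alone costs more per record than
    the API fee, in which case self-hosting never wins on price.
    """
    proxy_per_record = kb_per_record / 1_000_000 * usd_per_gb
    api_per_record = api_usd_per_1k / 1000
    if api_per_record <= proxy_per_record:
        return None
    return compute_usd / (api_per_record - proxy_per_record)
```

With 1,000,000 records a month at an assumed 100 KB each, 5 USD/GB, and 100 USD of compute, `self_hosted_cost_per_1k` comes to about 0.60 USD per 1,000 records. Against an API at 2 USD per 1,000, `break_even_records` puts the crossover at roughly 67,000 records a month. Browser-based stacks move that number sharply because each record burns far more bandwidth.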
What happens to
50 million rows
Bypassing detection is the part everyone writes about. But getting the data is only the start. The questions that actually decide whether a scraping operation survives are about what you do next: where the data lands, how you avoid storing the same thing twice, and how you notice when a site quietly starts feeding you garbage.
Storage: stop dumping JSON into a folder
Deduplication: the same item will arrive many times
RFPDupeFilter backed by Redis) checks membership in constant memory with a tiny, tunable false-positive rate. For distributed crawls, a shared Redis set keeps every worker honest so two workers never fetch the same page.
last_seen timestamp so you can tell a genuine update apart from a re-scrape, and upsert rather than blindly insert.
Data poisoning: when the bypass succeeds but the data is fake
Intercept mobile app traffic
before it hits any anti-bot
Mobile APIs serve the same data as the web, but with weaker protection. No Cloudflare, no JA4 fingerprinting. Intercept the traffic once, replicate the call forever.
git clone https://github.com/newbit1/rootAVD.git
cd rootAVD
# Verify AVD is accessible
adb shell
# List your AVDs
./rootAVD.sh ListAllAVDs
# Copy the first command from the output and run it
# e.g: ./rootAVD.sh system-images/android-30/google_apis_playstore/x86_64/ramdisk.img
adb not found? Add to ~/.zshrc: alias adb='/Users/$USER/Library/Android/sdk/platform-tools/adb'
# macOS
brew install --cask http-toolkit
# Replay the captured API call outside the app
import curl_cffi.requests as requests

resp = requests.get(
    "https://api.targetapp.com/v2/listings",
    headers={
        # Header values copied from the request captured in HTTP Toolkit
        "Authorization": "Bearer <token_from_http_toolkit>",
        "X-App-Version": "4.2.1",
        "User-Agent": "TargetApp/4.2.1 (Android 11; SDK 30)",
        "Accept": "application/json",
    },
    impersonate="chrome120",
)
resp.raise_for_status()  # fail loudly on 401/403 instead of parsing an error body
data = resp.json()
- Property portals, classifieds, marketplaces
- Apps where the web version is heavily protected
- Data only available in the mobile app
- Targets using simple Bearer token auth
- Any app that doesn't pin SSL certificates
- Apps with SSL pinning block interception
- Some apps crash on rooted devices
- ARM-only apps may not run on x86 emulators
- Tokens expire, need refresh logic in scraper
- App updates can silently change endpoints
If the app blocks interception it likely uses SSL pinning. Use Frida or objection to bypass it at runtime, or use Burp Suite with the Xposed + TrustMeAlready module for a more permanent bypass.
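The "tokens expire" limitation above usually comes down to a small wrapper that refreshes the Bearer token shortly before it lapses. A minimal sketch: `fetch_token` is a hypothetical callable you would write against whatever login or refresh endpoint you observed in the intercepted traffic, and the 30-second safety margin is an assumption.

```python
import time


class TokenManager:
    """Cache a Bearer token and refresh it shortly before it expires."""

    def __init__(self, fetch_token, skew=30.0, clock=time.time):
        # fetch_token() must return (token, lifetime_in_seconds)
        self._fetch_token = fetch_token
        self._skew = skew
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, refreshing it if it is missing or near expiry."""
        if self._token is None or self._clock() >= self._expires_at - self._skew:
            self._token, lifetime = self._fetch_token()
            self._expires_at = self._clock() + lifetime
        return self._token

    def auth_headers(self):
        """Headers to merge into each API request."""
        return {"Authorization": f"Bearer {self.get()}"}
```

Merging `auth_headers()` into the request headers on every call keeps the refresh logic out of the spider itself.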
Scraping jargon
in simple terms
Every term that makes scraping documentation confusing, explained with an analogy.
/robots.txt telling crawlers which paths to skip. Works on voluntary compliance only. Googlebot and GPTBot respect it. Commercial scrapers send a Chrome User-Agent and walk straight past it. Analogy: a "staff only" sign. Anyone who cares about signs obeys it. Anyone who does not care walks in anyway.
Where scrapers
talk to each other
The best scraping techniques rarely come from documentation, they come from people who've already hit the same wall you're hitting. These communities are where the real knowledge lives.
Discord servers
Reddit communities
Newsletters worth reading
Resources from The Web Scraping Club
YouTube channels worth following
The part most guides
conveniently skip
Bypassing detection is only half the question. The other half is whether you should, whether it is legal where you operate, and whether your approach survives contact with reality. This section is deliberately honest about the limits of everything above.
This is not legal advice, but you need to think about these
Read this before you copy anything above
From IP bans
to transformer ML
Every bypass technique was born as a direct response to a specific detection innovation. The escalation explains why each tool exists.
navigator.webdriver=true. playwright-stealth emerges. Playwright 2020, Microsoft, cross-browser. F5 acquires Shape Security for $1 billion.
Thank you for reading.
This is everything I know about web scraping in 2026, every detection layer, every anti-bot system,
every library, every architecture I've actually built or used in production over the last seven years.
If even one section saved you a late night of debugging, that's why I wrote it.
Build something interesting with this. And if you do, I'd genuinely love to hear about it.