
Building a Universal Web Scraper with n8n and Firecrawl

Nur Ikhwan Idris · 6 min read

I needed a way to scrape any URL and get back clean, structured markdown — the kind of output you can feed directly into an LLM for summarization, Q&A, or RAG pipelines. Not raw HTML full of nav bars and cookie banners. Clean text. Ideally from a single API call I could trigger from anywhere.

The result is a self-hosted webhook that accepts a URL via POST and returns LLM-ready markdown. It runs entirely on my home server using two Docker containers: n8n for orchestration and Firecrawl for the actual scraping. This post walks through how I built it, the gotchas I hit, and what it looks like in production.


1. Why n8n + Firecrawl

There are plenty of scraping tools out there, but most either give you raw HTML (useless for LLM consumption) or require you to write custom selectors for every site. I wanted something universal — point it at any URL and get readable content back.

Firecrawl handles the hard scraping part. It uses Playwright under the hood, which means it actually renders JavaScript. It handles anti-bot measures, strips navigation and boilerplate, and returns clean markdown. It's the "give me the article text, not the page source" tool I'd been looking for.

n8n handles everything around the scrape. It gives me webhooks (so I can trigger scrapes via HTTP POST from any tool), retry logic, scheduling, response formatting, and a visual workflow editor for when I want to chain scrapes with other automations. It's the glue.

Both are open source and self-hostable. No API keys to manage, no per-request billing, no data leaving my network.


2. Self-Hosted Setup

Both services run as Docker containers on the same home server that hosts everything else. The compose files live in /opt/n8n/ and /opt/firecrawl/ respectively.

The n8n container is straightforward:

# /opt/n8n/docker-compose.yml
services:
  n8n:
    image: n8nio/n8n:latest
    container_name: n8n
    restart: unless-stopped
    ports:
      - "127.0.0.1:5678:5678"
    volumes:
      - /mnt/storage/n8n:/home/node/.n8n
    environment:
      - N8N_HOST=n8n.nurikhwanidris.my
      - N8N_PROTOCOL=https
      - WEBHOOK_URL=https://n8n.nurikhwanidris.my/
    networks:
      - scraper-net

networks:
  scraper-net:
    external: true

Firecrawl needs a bit more configuration since it bundles Playwright and a few worker processes:

# /opt/firecrawl/docker-compose.yml
services:
  firecrawl:
    image: mendableai/firecrawl:latest
    container_name: firecrawl
    restart: unless-stopped
    ports:
      - "127.0.0.1:3002:3002"
    environment:
      - PORT=3002
      - NUM_WORKERS_PER_QUEUE=2
    networks:
      - scraper-net

networks:
  scraper-net:
    external: true

The shared scraper-net Docker network is critical — I'll explain why in a moment.


3. The Docker Networking Gotcha

This one cost me about an hour of debugging. Inside the n8n workflow, I initially configured the HTTP Request node to call Firecrawl at http://127.0.0.1:3002/v1/scrape. It worked perfectly when I tested from the host machine with curl. But from inside the n8n container, it timed out every time.

The reason is obvious in hindsight: 127.0.0.1 inside a Docker container refers to that container's own loopback interface, not the host. n8n was trying to connect to itself on port 3002, where nothing was listening.

The fix: connect both containers to the same Docker network and use the container name as the hostname.

# Create the shared network
docker network create scraper-net

# In n8n workflow, use this URL instead:
http://firecrawl:3002/v1/scrape

Docker's built-in DNS resolves firecrawl to the container's IP on the shared network. No port mapping needed, no host networking hacks. This is the correct way to do container-to-container communication.


4. The Webhook Workflow

The n8n workflow is simple — four nodes in a chain:

  1. Webhook node — listens for POST requests with a JSON body containing a url field
  2. HTTP Request node — sends the URL to Firecrawl's /v1/scrape endpoint
  3. Function node — extracts the markdown content and metadata from Firecrawl's response
  4. Respond to Webhook node — returns the formatted JSON to the caller
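The Function node body itself is a few lines of JavaScript inside n8n; the equivalent transformation, sketched here in Python and assuming Firecrawl's usual response shape (`data.markdown`, `data.metadata.title`), looks like this:

```python
from datetime import datetime, timezone


def format_response(firecrawl_response: dict, url: str) -> dict:
    """Mirror of the Function node's logic: pull the markdown and metadata
    out of Firecrawl's response and shape the webhook reply."""
    data = firecrawl_response.get("data", {})
    markdown = data.get("markdown", "")
    metadata = data.get("metadata", {})
    return {
        "success": bool(markdown),
        "url": url,
        "title": metadata.get("title", ""),
        "markdown": markdown,
        "word_count": len(markdown.split()),
        "scraped_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
```

The word count is a simple whitespace split — rough, but good enough for gauging how much content a scrape returned.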

Here's what a call looks like from the outside:

# Scrape a single URL
curl -X POST https://n8n.nurikhwanidris.my/webhook/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/blog/some-article"}'

# Response
{
  "success": true,
  "url": "https://example.com/blog/some-article",
  "title": "Some Article Title",
  "markdown": "# Some Article Title\n\nThe full article content in clean markdown...",
  "word_count": 1247,
  "scraped_at": "2026-03-22T10:30:00Z"
}

The Firecrawl API call inside n8n sends a POST to http://firecrawl:3002/v1/scrape with the target URL and a few options:

{
  "url": "{{ $json.url }}",
  "formats": ["markdown"],
  "waitFor": 3000,
  "timeout": 30000
}

The formats field tells Firecrawl to return markdown (it can also do HTML, screenshots, and structured data extraction). The waitFor and timeout fields are where things get interesting for modern web apps.
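When debugging, it helps to hit Firecrawl directly from the host and bypass n8n entirely. A minimal Python sketch — the `scrape_payload` helper is my own wrapper, not part of either tool, and it assumes the host-side port mapping on localhost:3002:

```python
import requests

FIRECRAWL = "http://localhost:3002"  # from inside n8n, use http://firecrawl:3002


def scrape_payload(url: str, wait_for: int = 3000, timeout_ms: int = 30000) -> dict:
    """Build the same request body the n8n HTTP Request node sends."""
    return {
        "url": url,
        "formats": ["markdown"],
        "waitFor": wait_for,
        "timeout": timeout_ms,
    }


def scrape_direct(url: str, **opts) -> dict:
    """POST straight to Firecrawl's scrape endpoint, skipping the webhook."""
    resp = requests.post(
        f"{FIRECRAWL}/v1/scrape",
        json=scrape_payload(url, **opts),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```

If the direct call works but the webhook doesn't, the problem is in the n8n workflow, not in Firecrawl — that split saved me time during the networking debugging above.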


5. Handling SPA Sites: The waitFor Trick

Most marketing sites and documentation portals today are built with Next.js, Nuxt, or plain React. They serve a minimal HTML shell and render everything client-side with JavaScript. A naive HTTP request gets you an empty <div id="root"></div> and nothing else.

Firecrawl uses Playwright to render the page in a real browser, which handles most of this automatically. But some SPAs are slow to hydrate — the initial JS bundle loads, then it fetches data from an API, then it renders. If Firecrawl captures the page too early, you get a loading spinner in your markdown.

The waitFor parameter tells Playwright to wait a specified number of milliseconds after the page load event before capturing the content. For most sites, 3000ms (3 seconds) is enough. For heavier SPAs — the kind with animated loading screens and multiple API calls — I bump it to 5000ms.

# For a heavy SPA site
curl -X POST https://n8n.nurikhwanidris.my/webhook/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://heavy-spa-site.com/products",
    "waitFor": 5000
  }'

The n8n workflow passes waitFor through to Firecrawl if provided in the request body, defaulting to 3000ms otherwise. This gives callers control without requiring them to understand Firecrawl internals.
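That defaulting logic is tiny. A Python sketch of it — the clamp to an upper bound is my own addition for safety against absurd caller values, not something the source workflow necessarily does:

```python
DEFAULT_WAIT_MS = 3000
MAX_WAIT_MS = 10_000  # assumed cap; keeps one bad caller from stalling a worker


def resolve_wait_for(body: dict) -> int:
    """Use the caller-supplied waitFor if present and sane, else the default."""
    try:
        wait = int(body.get("waitFor", DEFAULT_WAIT_MS))
    except (TypeError, ValueError):
        return DEFAULT_WAIT_MS  # non-numeric input falls back silently
    return max(0, min(wait, MAX_WAIT_MS))
```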


6. Batch Scraping: Crawl an Entire Site

Single-page scraping is useful, but the real power comes from batch scraping — discovering all pages on a site and scraping each one. I wrote a Python script that does this in two phases: first it crawls the site to discover URLs, then it scrapes each one through the webhook.

#!/usr/bin/env python3
"""batch_scrape.py - Discover and scrape all pages from a website."""

import requests
import json
import time
import sys
from urllib.parse import urljoin, urlparse

WEBHOOK_URL = "https://n8n.nurikhwanidris.my/webhook/scrape"
FIRECRAWL_URL = "http://localhost:3002"


def discover_urls(base_url):
    """Use Firecrawl's map endpoint to discover all URLs on a site."""
    resp = requests.post(
        f"{FIRECRAWL_URL}/v1/map",
        json={"url": base_url},
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()
    urls = data.get("links", [])

    # Filter to same domain only
    base_domain = urlparse(base_url).netloc
    same_domain = [u for u in urls if urlparse(u).netloc == base_domain]

    # Remove duplicates and fragments
    cleaned = list(set(u.split("#")[0] for u in same_domain))
    return sorted(cleaned)


def scrape_url(url, wait_for=3000):
    """Scrape a single URL through the n8n webhook."""
    resp = requests.post(
        WEBHOOK_URL,
        json={"url": url, "waitFor": wait_for},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()


def batch_scrape(base_url, output_file="scraped_pages.json", wait_for=3000):
    """Discover all URLs and scrape each one."""
    print(f"Discovering URLs on {base_url}...")
    urls = discover_urls(base_url)
    print(f"Found {len(urls)} pages.")

    results = []
    for i, url in enumerate(urls, 1):
        print(f"  [{i}/{len(urls)}] Scraping {url}")
        try:
            result = scrape_url(url, wait_for)
            results.append(result)
            word_count = result.get("word_count", 0)
            print(f"           -> {word_count} words")
        except Exception as e:
            print(f"           -> FAILED: {e}")
            results.append({"url": url, "success": False, "error": str(e)})
        time.sleep(1)  # Be polite

    total_words = sum(r.get("word_count", 0) for r in results if r.get("success"))
    successful = sum(1 for r in results if r.get("success"))

    print(f"\nDone. {successful}/{len(urls)} pages scraped, {total_words} total words.")

    with open(output_file, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Results saved to {output_file}")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python batch_scrape.py <base_url> [wait_ms]")
        sys.exit(1)

    base = sys.argv[1]
    wait = int(sys.argv[2]) if len(sys.argv) > 2 else 3000
    batch_scrape(base, wait_for=wait)

The script uses Firecrawl's /v1/map endpoint for URL discovery, which is essentially a sitemap crawler. It finds all internal links reachable from the homepage, then scrapes each one with a 1-second delay between requests to avoid hammering the target.


7. Real Results: Scraping fwd.com.my

To test the system end-to-end, I pointed it at fwd.com.my — a Next.js insurance site with heavy client-side rendering, dynamic content, and multiple nested pages.

python batch_scrape.py https://www.fwd.com.my 5000

Results:

  • Pages discovered: 28
  • Pages successfully scraped: 22
  • Total words extracted: 14,065
  • Average words per page: ~639
  • Total time: ~4 minutes (with the 5-second wait per page)

The 6 failed pages were mostly PDF links and redirects — not actual HTML pages. The successfully scraped pages came back as clean markdown with proper headings, paragraph breaks, and list formatting. No nav bars, no footers, no cookie consent text. Exactly what you'd want to feed into an LLM.

The waitFor: 5000 was necessary here. At 3 seconds, about half the pages came back with incomplete content because Next.js hadn't finished hydrating. At 5 seconds, everything rendered fully.


8. The Cloudflare Problem

Not everything went smoothly. Some Malaysian government sites (and a few banks) sit behind Cloudflare's aggressive bot protection — the kind that shows a "Checking your browser" interstitial and requires JavaScript challenges to pass through.

Even with Playwright rendering, Firecrawl sometimes gets stuck on these challenges. The page loads, the Cloudflare check spins, and eventually the request times out. The scraped content comes back as the Cloudflare challenge page itself — "Please wait while we verify your browser" in markdown. Not useful.

What I've found works in some cases:

  • Longer waits: bumping waitFor to 8-10 seconds sometimes gives the Cloudflare challenge enough time to resolve on its own
  • Retry logic: n8n's built-in retry mechanism sometimes works because subsequent requests from the same IP get through after the first one warms the session
  • Alternative approaches: for heavily protected sites, I fall back to using the site's RSS feed, public API, or structured data (most sites expose JSON-LD that's easier to parse)
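The JSON-LD fallback is easy to script. A minimal sketch — the regex-based extractor is illustrative only; a real HTML parser would be more robust against unusual markup:

```python
import json
import re

# Matches <script type="application/ld+json"> blocks and captures their body
LDJSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_json_ld(html: str) -> list:
    """Pull structured-data blocks out of a page's raw HTML."""
    blocks = []
    for match in LDJSON_RE.findall(html):
        try:
            blocks.append(json.loads(match))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole page
    return blocks
```

For product and article pages, the JSON-LD block often contains exactly the fields you wanted from the scrape anyway — title, description, author, price — without fighting the bot protection at all.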

This is an inherent limitation of automated scraping. If a site really doesn't want to be scraped, there's only so much you can do from the technical side — and you should respect that.


9. The Production Endpoint

The webhook is live and accessible. Here's the complete interface:

# Basic scrape
curl -X POST https://n8n.nurikhwanidris.my/webhook/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

# With custom wait time for SPAs
curl -X POST https://n8n.nurikhwanidris.my/webhook/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://spa-site.com", "waitFor": 5000}'

# Response format
{
  "success": true,
  "url": "https://example.com",
  "title": "Page Title",
  "markdown": "# Page Title\n\nClean content...",
  "word_count": 842,
  "scraped_at": "2026-03-22T10:30:00Z"
}

The endpoint sits behind nginx and the Cloudflare Tunnel, same as every other service on the server. n8n binds to 127.0.0.1:5678, nginx proxies it, and cloudflared routes public traffic in.

I use this endpoint from Claude Code (via curl in Bash), from Python scripts for batch processing, and from other n8n workflows that need web content as input. It's become one of those utilities I reach for constantly — any time I need "what does this page say?" as structured data, it's one HTTP call away.


10. What's Next

A few things I'm planning to add:

  • Caching layer: store scraped results in SQLite with a TTL, so repeated scrapes of the same URL don't hit the target site again within a configurable window
  • Scheduled scrapes: n8n supports cron triggers — I want to monitor a few sites for content changes and get notified
  • Direct RAG integration: pipe scraped content directly into a vector database for retrieval-augmented generation with the local Ollama setup
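The caching layer could be as small as one table. A sketch of what I have in mind — the `ScrapeCache` class and its schema are hypothetical, not yet implemented:

```python
import json
import sqlite3
import time


class ScrapeCache:
    """Planned cache: scraped results keyed by URL with a TTL, in SQLite."""

    def __init__(self, path: str = ":memory:", ttl_seconds: int = 86400):
        self.ttl = ttl_seconds
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache "
            "(url TEXT PRIMARY KEY, result TEXT, fetched_at REAL)"
        )

    def get(self, url: str):
        """Return the cached result, or None on a miss or expired entry."""
        row = self.db.execute(
            "SELECT result, fetched_at FROM cache WHERE url = ?", (url,)
        ).fetchone()
        if row is None or time.time() - row[1] > self.ttl:
            return None
        return json.loads(row[0])

    def put(self, url: str, result: dict):
        """Store a result, replacing any stale entry for the same URL."""
        self.db.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (url, json.dumps(result), time.time()),
        )
        self.db.commit()
```

The webhook workflow would check the cache before calling Firecrawl and only scrape on a miss — one extra node at the front of the chain.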

The core system is solid, though. Two containers, one shared network, a four-node workflow. It turns any URL into clean markdown with a single POST request. If you're running n8n already, adding Firecrawl alongside it is a 10-minute setup — and if you're doing anything with LLMs, having a reliable "URL to text" pipeline is genuinely useful infrastructure.

Questions or want to try the endpoint? Reach out via the contact section.