Firecrawl

v1.0.0 · 4.4k calls

Turn any website into clean, AI-ready data.

A Model Context Protocol (MCP) server that exposes Firecrawl's API for scraping, crawling, mapping, searching, parsing documents, browser automation, and academic research.

Overview

The Firecrawl MCP Server provides powerful web data extraction and research capabilities:

- Scrape individual pages or crawl entire websites into markdown, HTML, JSON, and more
- Search the web, map site structures, and run autonomous agent-based data extraction
- Automate browsers with code or natural language, parse documents, and search academic papers and GitHub

Perfect for:

- AI assistants that need to fetch and process live web content
- Automating structured data extraction and research pipelines
- Building competitive intelligence, literature review, and site auditing workflows

Tools

Scrape

`scrape_url` scrapes a single URL and returns its content in the requested formats, such as markdown, HTML, a screenshot, links, or a summary. For public document URLs (PDF, DOCX), Firecrawl auto-detects and parses them. The response includes `data.metadata.scrapeId`, which can be passed to `browser_interact` to continue interacting with the same live browser session.

Inputs:

- `url` (string, required) — Full URL to scrape, including https://.
- `formats` (list[string], optional, default: ["markdown"]) — Output formats to request: markdown (default), html, rawHtml, links, screenshot, summary, json, audio, video, branding, product, menu. Use ['markdown'] for text content, add 'screenshot' for visual capture.
- `only_main_content` (bool, optional, default: true) — Strip navigation, headers, footers, and ads — keep the article/content body.
- `wait_for` (int, optional, default: 0) — Milliseconds to wait after page load before capturing (0–30000). Use for JS-rendered pages.
- `timeout_ms` (int, optional, default: 30000) — Maximum time the page load may take in milliseconds (1000–300000).
- `mobile` (bool, optional, default: false) — Emulate a mobile viewport.
- `proxy` (string, optional, default: "auto") — Proxy tier: 'auto' (default), 'basic', or 'enhanced' (stealth, higher credit cost).
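The documented ranges can be checked on the client side before a call is made. The following is a minimal sketch, assuming the arguments reach `scrape_url` as a plain dictionary; the helper name `build_scrape_args` is hypothetical and not part of the server.

```python
def build_scrape_args(url, formats=None, only_main_content=True,
                      wait_for=0, timeout_ms=30000, mobile=False, proxy="auto"):
    """Build an argument dict for scrape_url (hypothetical helper).

    Enforces the ranges listed in the inputs above before anything is sent.
    """
    # Assumption: any http(s) URL is acceptable; the docs only say "including https://".
    if not url.startswith(("http://", "https://")):
        raise ValueError("url must be a full URL including the scheme")
    if not 0 <= wait_for <= 30000:
        raise ValueError("wait_for must be between 0 and 30000 ms")
    if not 1000 <= timeout_ms <= 300000:
        raise ValueError("timeout_ms must be between 1000 and 300000 ms")
    if proxy not in ("auto", "basic", "enhanced"):
        raise ValueError("proxy must be 'auto', 'basic', or 'enhanced'")
    return {
        "url": url,
        "formats": list(formats) if formats else ["markdown"],
        "only_main_content": only_main_content,
        "wait_for": wait_for,
        "timeout_ms": timeout_ms,
        "mobile": mobile,
        "proxy": proxy,
    }


# Text plus a visual capture of a JS-rendered page, waiting 2 s after load.
args = build_scrape_args("https://example.com",
                         formats=["markdown", "screenshot"], wait_for=2000)
```

Validating locally keeps an out-of-range value from costing a round trip; the server remains the authority on what it accepts.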

output data schema:

{
  markdown: string | null;
  summary: string | null;
  html: string | null;
  rawHtml: string | null;
  screenshot: string | null;
  links: string[] | null;
  metadata: {
    title: string | null

`batch_scrape_urls` starts an async batch scrape job for a list of URLs and returns a job ID immediately. Use `get_batch_scrape_status` to poll for completion and retrieve the scraped content. Ideal for scraping 5–1000 URLs in parallel without blocking.

Inputs:

- `urls` (list[string], required) — List of URLs to scrape.
- `formats` (list[string], optional, default: ["markdown"]) — Output formats to request: markdown (default), html, rawHtml, links, screenshot, summary, json, audio, video, branding, product, menu. Use ['markdown'] for text content, add 'screenshot' for visual capture.
- `only_main_content` (bool, optional, default: true) — Strip navigation, headers, footers, and ads from each page.
- `proxy` (string, optional, default: "auto") — Proxy tier: 'auto' (default), 'basic', or 'enhanced' (stealth, higher credit cost).
- `block_ads` (bool, optional, default:

`get_batch_scrape_status` polls the status of a batch scrape job started by `batch_scrape_urls`. Returns the status (scraping/completed/failed), progress counters, and the scraped pages when done. If `data.next` is present in the response, call again with the same `job_id` to get the next page of results.

Inputs:

- `job_id` (string, required) — Batch scrape job ID returned by `batch_scrape_urls`.
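The poll-then-paginate behaviour described above could be sketched as follows. This assumes a hypothetical `call_tool(tool_name, arguments)` callable supplied by your MCP client that returns the response envelope as a dict, and it assumes every call succeeds; the 5-second poll interval is also an assumption, not a documented value.

```python
import time


def collect_batch_results(call_tool, job_id, poll_seconds=5.0, sleep=time.sleep):
    """Poll get_batch_scrape_status until the job finishes, then follow data.next.

    call_tool is a hypothetical callable: (tool_name, arguments) -> envelope dict.
    Returns the job status and a list of `data` payloads, one per result page.
    """
    args = {"job_id": job_id}
    data = call_tool("get_batch_scrape_status", args)["data"]
    # Keep polling while the job is still running.
    while data["status"] == "scraping":
        sleep(poll_seconds)
        data = call_tool("get_batch_scrape_status", args)["data"]
    final_status = data["status"]
    pages = [data]
    # As documented: while data.next is present, call again with the same job_id.
    while data.get("next"):
        data = call_tool("get_batch_scrape_status", args)["data"]
        pages.append(data)
    return final_status, pages
```

The function returns whole `data` payloads rather than picking out a field, because the full output schema is not shown here.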

output data schema:

{
  status: string;
  total: number | null

DESTRUCTIVE — REQUIRES EXPLICIT USER CONFIRMATION BEFORE CALLING. Stops a running batch scrape job. All in-progress scraping is terminated and any unfinished results are permanently lost — this cannot be undone. NEVER call this tool autonomously or as part of an automated flow. You MUST stop, tell the user exactly which batch scrape job will be cancelled and that unfinished results will be permanently lost, and wait for their explicit written confirmation before proceeding.

Inputs:

- `job_id` (string, required) — Batch scrape job ID to cancel.

output data schema:

{
  status: string;
}
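One way to honour the confirmation requirement in client code is to route every destructive call through a gate. This is a sketch only: `call_tool` and `confirm` are hypothetical callables you would supply, and the tool name is passed in as a parameter rather than hard-coded.

```python
def call_destructive(call_tool, tool_name, arguments, confirm):
    """Run a destructive tool only after the user explicitly confirms.

    call_tool: hypothetical callable (tool_name, arguments) -> envelope dict.
    confirm:   hypothetical callable that shows a message and returns the
               user's typed reply (for a terminal app, the built-in input).
    """
    message = (
        f"About to call {tool_name} with {arguments}. "
        "Unfinished results will be permanently lost and this cannot be undone. "
        "Type 'yes' to proceed: "
    )
    reply = confirm(message)
    if reply.strip().lower() != "yes":
        return None  # Nothing was called; the job or session is untouched.
    return call_tool(tool_name, arguments)
```

Anything other than an explicit "yes" leaves the job running, so a missing or ambiguous reply fails safe.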

Crawl

`crawl_url` starts an async crawl job from a seed URL, following internal links up to the specified depth and page limit, and returns a job ID immediately. Use `get_crawl_status` to poll for progress and results. Use `include_paths`/`exclude_paths` regex patterns to control which URLs are visited. Ideal for extracting all content from a site, documentation, or blog.
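Before starting a large crawl, it can help to preview locally which URLs a set of patterns would keep. The sketch below is an approximation under stated assumptions: it matches patterns against the URL path with `re.search` and lets exclusions win over inclusions. The server's actual matching rules may differ, and `would_crawl` is a hypothetical helper.

```python
import re
from urllib.parse import urlparse


def would_crawl(url, include_paths=None, exclude_paths=None):
    """Approximate whether a URL passes include/exclude regex filters.

    Assumptions (not confirmed by the docs): patterns are tested against the
    URL path using re.search, and an exclude match overrides an include match.
    """
    path = urlparse(url).path
    if exclude_paths and any(re.search(p, path) for p in exclude_paths):
        return False
    if include_paths:
        # Only URLs matching at least one include pattern are kept.
        return any(re.search(p, path) for p in include_paths)
    return True


candidates = [
    "https://example.com/docs/intro",
    "https://example.com/docs/v1/old",
    "https://example.com/blog/post",
]
kept = [u for u in candidates if would_crawl(u, [r"^/docs/"], [r"/v1/"])]
```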

Inputs:

- `url` (string, required) — Seed URL to start crawling from.
- `limit` (int, optional, default: 10000) — Maximum number of pages to crawl (1–10000).
- `max_discovery_depth` (int, optional) — Maximum link depth from the seed URL. Omit for unlimited.
- `include_paths` (list[string], optional) — Regex patterns — only URLs matching at least one pattern are crawled.
- `exclude_paths` (list[string

`get_crawl_status` polls the status of a crawl job started by `crawl_url`. Returns the status (scraping/completed/failed/cancelled), progress counters, and crawled pages. If `data.next` is present, call again to retrieve the next page of results.

Inputs:

- `job_id` (string, required) — Crawl job ID returned by `crawl_url`.

output data schema:

{
  status: string;
  total: number | null;

DESTRUCTIVE — REQUIRES EXPLICIT USER CONFIRMATION BEFORE CALLING. Stops a running crawl job. All in-progress crawling is terminated and any unfinished pages are permanently lost — this cannot be undone. NEVER call this tool autonomously or as part of an automated flow. You MUST stop, tell the user which crawl job will be cancelled and that unfinished pages will be permanently lost, and wait for their explicit written confirmation before proceeding.

Inputs:

- `job_id` (string, required) — Crawl job ID to cancel.

output data schema:

{
  status: string;
}

Discover

Discovers all URLs on a website without scraping their content. Returns a list of links with title and description. Use before crawl_url to understand site structure, or pass search to filter URLs by relevance to a topic. Much faster and cheaper than crawling when you only need the URL list.

Inputs:

- `url` (string, required) — Root URL of the site to map.
- `search` (string, optional) — Filter and rank URLs by relevance to this search query.
- `sitemap` (string, optional, default: "include") — 'include' (sitemap + crawl), 'skip' (crawl only), 'only' (sitemap only).
- `include_subdomains` (bool, optional, default: true) — Include URLs from subdomains of the root URL.
- `ignore_query_parameters` (bool, optional, default: true) — Deduplicate URLs that differ only in query parameters.
- `ignore_cache` (bool, optional, default: false) — Bypass sitemap cache to get the freshest URL list.
- `limit` (int, optional, default:

Searches the web and optionally scrapes the full content of each result. Returns web pages, images, or news depending on sources. Set scrape_formats to ['markdown'] to get full page content alongside each result — omit to get only title, description, and URL. Supports operator syntax: site:, filetype:, intitle:, -exclude, "exact phrase".
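The operators listed above are plain text inside the query string, so a small helper can compose them. This is an illustrative sketch; `build_query` is a hypothetical name, and the argument dictionary at the end only uses parameters named in this section.

```python
def build_query(terms, site=None, filetype=None, intitle=None, exclude=(), exact=None):
    """Compose a search query string from the supported operators."""
    parts = [terms]
    if exact:
        parts.append(f'"{exact}"')          # "exact phrase"
    if site:
        parts.append(f"site:{site}")        # restrict to one domain
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f"intitle:{intitle}")
    parts.extend(f"-{word}" for word in exclude)  # -exclude
    return " ".join(parts)


query = build_query("web scraping", site="example.com", filetype="pdf",
                    exclude=["ads"], exact="rate limit")

# Requesting markdown makes each result carry full page content.
search_args = {"query": query, "limit": 5, "scrape_formats": ["markdown"]}
```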

Inputs:

- `query` (string, required) — Search query. Supports operators: site:domain.com, filetype:pdf, intitle:keyword, -exclude, "exact phrase", related:domain.com.
- `limit` (int, optional, default: 10) — Number of results to return (1–100).
- `sources` (list[string], optional, default: ["web"]) — Result types to return: 'web', 'images', 'news'. Combine as needed.
- `categories` (list[string

Parse

Parses a local or private document (PDF, DOCX, XLSX, HTML, and more) into clean markdown or structured data. Use when the file is not publicly accessible by URL — for public URLs use scrape_url instead. The file must be provided as base64-encoded bytes, making this suitable for workflow chains where a previous step fetches and encodes the file content.

Inputs:

- `file_content_b64` (string, required) — Base64-encoded file bytes to parse.
- `file_name` (string, required) — Filename including extension (e.g. 'report.pdf', 'data.docx'). Extension determines parser.
- `formats` (list[string], optional, default: ["markdown"]) — Output formats: markdown, html, rawHtml, links, summary.
- `only_main_content` (bool, optional, default: true) — Strip headers, footers, and decorative content.
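Preparing the inputs for a parse call mostly comes down to base64-encoding the bytes and keeping the extension on the file name. A minimal sketch, with `build_parse_args` as a hypothetical helper:

```python
import base64
import os


def build_parse_args(file_bytes, file_name, formats=None, only_main_content=True):
    """Build the argument dict for a document parse call (hypothetical helper)."""
    # The extension determines the parser, so refuse names without one.
    if not os.path.splitext(file_name)[1]:
        raise ValueError("file_name must include an extension, e.g. 'report.pdf'")
    return {
        "file_content_b64": base64.b64encode(file_bytes).decode("ascii"),
        "file_name": file_name,
        "formats": list(formats) if formats else ["markdown"],
        "only_main_content": only_main_content,
    }


# In a workflow chain, file_bytes would come from a previous step,
# for example open("report.pdf", "rb").read().
parse_args = build_parse_args(b"%PDF-1.4 demo bytes", "report.pdf")
```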

Agent

`run_agent` starts an autonomous web research agent that searches, navigates, and extracts data based on a natural language prompt. No URLs are required; the agent finds them. Use `schema` to get structured JSON output. Returns a job ID; use `get_agent_status` to poll. Use spark-1-mini (default, 60% cheaper) for most tasks and spark-1-pro for complex multi-domain research. Set `max_credits` to cap spending; the job fails without charges if the limit is hit.

Inputs:

- `prompt` (string, required) — Natural language description of the data to find (max 10000 chars). Be specific: 'Find the 5 most-funded AI startups in 2024

`get_agent_status` polls the status of an agent job started by `run_agent`. Returns the status (processing/completed/failed/cancelled), the extracted data when done, and credit usage. Poll every 15–30 seconds; jobs typically complete in 1–5 minutes.

Inputs:

- `job_id` (string, required) — Agent job ID returned by `run_agent`.
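The polling guidance above could be turned into a bounded wait loop. In this sketch, `call_tool` is a hypothetical callable provided by your MCP client, the 20-second interval sits inside the recommended 15–30 second window, and the 10-minute ceiling is an assumption chosen to exceed the typical 1–5 minute run time.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_agent(call_tool, job_id, interval=20.0, max_wait=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_agent_status until the job reaches a terminal status.

    call_tool is a hypothetical callable: (tool_name, arguments) -> envelope dict.
    Raises TimeoutError if the job is still processing when max_wait runs out.
    """
    deadline = clock() + max_wait
    while True:
        data = call_tool("get_agent_status", {"job_id": job_id})["data"]
        if data["status"] in TERMINAL_STATUSES:
            return data
        # Stop before sleeping if the next poll would land past the deadline.
        if clock() + interval > deadline:
            raise TimeoutError(f"agent job {job_id} still processing after {max_wait}s")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting.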

output data schema:

{
  id: string | null;
  status: string | null;
  data

DESTRUCTIVE — REQUIRES EXPLICIT USER CONFIRMATION BEFORE CALLING. Requests cancellation of a running agent job. Any in-progress reasoning steps complete before the job transitions to cancelled — credits for completed steps may still be charged and cannot be recovered. NEVER call this tool autonomously or as part of an automated flow. You MUST stop, tell the user which agent job will be cancelled and the credit implications, and wait for their explicit written confirmation before proceeding.

Inputs:

- `job_id` (string, required) — Agent job ID to cancel.

output data schema:

{
  status: string;
}

Browser

`browser_interact` executes code or a natural language prompt in the live browser session bound to a previous scrape job. The `scrape_id` comes from `data.metadata.scrapeId` in a `scrape_url` response. The first call creates the browser session at the same page state as the scrape; subsequent calls with the same `scrape_id` reuse the live session. Provide either `code` (Playwright/Node/Python/Bash to run) or `prompt_text` (AI-driven navigation), not both. Returns the CDP URL, live view URL, stdout, and AI output. Call `browser_close` when done to release the session.

Inputs:

- `scrape_id` (string, required) — Scrape job ID from `data.metadata.scrapeId` in a `scrape_url` response.
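The "either `code` or `prompt_text`, not both" rule is easy to enforce before the call. A minimal sketch, with `build_interact_args` as a hypothetical helper name:

```python
def build_interact_args(scrape_id, code=None, prompt_text=None):
    """Build browser_interact arguments, requiring exactly one of code or prompt_text."""
    # True when both are missing or both are supplied; either case is invalid.
    if (code is None) == (prompt_text is None):
        raise ValueError("provide exactly one of code or prompt_text")
    args = {"scrape_id": scrape_id}
    if code is not None:
        args["code"] = code
    else:
        args["prompt_text"] = prompt_text
    return args


# The scrape_id value here is a placeholder; a real one comes from
# data.metadata.scrapeId in a scrape_url response.
interact_args = build_interact_args("example-scrape-id",
                                    prompt_text="Click the 'Next' button")
```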

DESTRUCTIVE — REQUIRES EXPLICIT USER CONFIRMATION BEFORE CALLING. Destroys the browser session attached to a scrape job. All browser state, cookies, and session data are permanently lost and the session cannot be resumed — this cannot be undone. Always call this when done interacting to avoid leaking browser resources and credits. NEVER call this tool autonomously or as part of an automated flow. You MUST stop, confirm with the user that the browser session is no longer needed, and wait for their explicit written confirmation before proceeding.

Inputs:

- `scrape_id` (string, required) — Scrape job ID whose browser session to close (same ID used in browser_interact).

output data schema:

{
  status: string;
}

Research

`search_papers` searches Firecrawl's academic research index by topic, method, benchmark, or author. Returns ranked papers with paperId, title, abstract, and relevance score. Use the paperId from the results to call `get_paper` or `find_related_papers`. Supports filtering by author name substring, category (e.g. 'cs.LG'), and date range.

Inputs:

- `query` (string, required) — Natural language search query (e.g. 'diffusion models image synthesis').
- `k` (int, optional, default: 40) — Maximum number of ranked papers to return (1–500).
- `authors` (string, optional) — Filter by author name substring (e.g. 'LeCun'). Comma-separate for multiple.
- `categories` (string, optional) — Filter by paper category (e.g. 'cs.LG', 'cs.CV'). Comma-separate for multiple.
- `from_date` (string, optional) — Inclusive lower bound on paper date in YYYY-MM-DD format (e.g. '2023-01-01

`get_paper` retrieves full details for a specific research paper by its ID. Returns title, abstract, authors, categories, and dates. The `paper_id` can be a canonical paperId (e.g. '2014215642691656232') or a source-prefixed ID (e.g. 'arxiv:2105.05233') from `search_papers` results.

Inputs:

- `paper_id` (string, required) — Paper ID — either canonical paperId or source-prefixed ID like 'arxiv:2105.05233'.
- `k` (int, optional) — Number of related papers to include alongside the paper details.

output data schema:

`find_related_papers` finds papers related to a seed paper, ranked by semantic relevance to an intent. Use `mode` to choose the expansion strategy: 'similar' (semantically close), 'citers' (papers that cite the seed), or 'references' (papers cited by the seed). Returns ranked results with relevance scores. Ideal for literature review workflows: search_papers → find_related_papers → get_paper.

Inputs:

- `paper_id` (string, required) — Seed paper ID (canonical paperId or 'arxiv:XXXX.XXXXX').
- `intent` (string, required) — Natural language ranking intent (e.g. 'applications in medical imaging').
- `mode` (string, optional, default: "similar") — Expansion mode: 'similar' (default), 'citers', or 'references'.
- `k` (int, optional, default: 40) — Maximum number of related papers to return (1–500).
- `rerank` (bool, optional, default: false) — Apply an additional reranking pass over the fused candidate set.
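The literature review chain (search_papers → find_related_papers → get_paper) might be wired together as below. Two things are assumptions: `call_tool` is a hypothetical callable from your MCP client, and because the response schemas are not shown here, the caller supplies `pick_ids`, a hypothetical function that extracts paper IDs from a `data` payload.

```python
def literature_review(call_tool, pick_ids, query, intent, k=5):
    """Search for seed papers, expand each by similarity, then fetch details.

    call_tool: hypothetical callable (tool_name, arguments) -> envelope dict.
    pick_ids:  hypothetical callable (data payload) -> list of paper IDs;
               its implementation depends on the actual response schema.
    """
    found = call_tool("search_papers", {"query": query, "k": k})["data"]
    details = []
    for seed_id in pick_ids(found):
        related = call_tool(
            "find_related_papers",
            {"paper_id": seed_id, "intent": intent, "mode": "similar", "k": k},
        )["data"]
        for paper_id in pick_ids(related):
            details.append(call_tool("get_paper", {"paper_id": paper_id})["data"])
    return details
```

Keeping `k` small matters here: the number of `get_paper` calls grows with the product of seeds and related papers.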

Searches GitHub issue history, pull requests, discussions, and repository READMEs using natural language. Returns matched content with repository metadata, URLs, and markdown snippets. Useful for researching how a bug was fixed, what a library's maintainers have said, or finding prior art in open source projects.

Inputs:

- `query` (string, required) — Natural language query (e.g. 'race condition in worker shutdown firecrawl').
- `k` (int, optional, default: 20) — Maximum number of results to return (1–100).

output data schema:

{
  results: {

API Parameters Reference

Every tool returns the same top-level envelope. Only data varies per tool.

// Success
{
  success: true;
  statusCode: number;
  retriable: false;
  retry_after_seconds: null;
  error: null;
  data: { ... };   // schema shown per tool above
}
 
// Error
{
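Because every tool shares this envelope, one helper can unwrap it for all of them. The sketch below assumes the error envelope carries the same keys as the success envelope (`success`, `statusCode`, `retriable`, `retry_after_seconds`, `error`); `call_tool`, `ToolError`, the three-attempt limit, and the 1-second fallback delay are all hypothetical choices.

```python
import time


class ToolError(Exception):
    """Raised when a tool call fails and should not (or can no longer) be retried."""


def call_with_retry(call_tool, tool_name, arguments, max_attempts=3, sleep=time.sleep):
    """Return envelope['data'] on success, retrying only when the envelope says so.

    call_tool is a hypothetical callable: (tool_name, arguments) -> envelope dict.
    """
    for attempt in range(1, max_attempts + 1):
        envelope = call_tool(tool_name, arguments)
        if envelope["success"]:
            return envelope["data"]
        if not envelope.get("retriable") or attempt == max_attempts:
            raise ToolError(
                f"{tool_name} failed ({envelope.get('statusCode')}): {envelope.get('error')}"
            )
        # Honour the server's hint when present; otherwise wait 1 second (assumption).
        sleep(envelope.get("retry_after_seconds") or 1)
```

Destructive tools should not be passed through an automatic retry loop like this one, since each of them requires explicit user confirmation.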

Getting Your Firecrawl API Key

Troubleshooting
