Crawl API

POST https://api.whizo.ai/v1/crawl
curl --request POST \
  --url https://api.whizo.ai/v1/crawl \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "url": "<string>",
  "maxPages": 123,
  "maxDepth": 123,
  "format": "<string>",
  "sameDomain": true,
  "includeScreenshot": true,
  "includePdf": true,
  "includeLinks": true,
  "includeImages": true,
  "includeMetadata": true,
  "onlyMainContent": true,
  "mobile": true,
  "waitTime": 123,
  "timeout": 123,
  "removeAds": true,
  "removeScripts": true,
  "removeStyles": true,
  "userAgent": "<string>",
  "priority": "<string>",
  "headers": {},
  "stealth": true,
  "useProxy": true,
  "maxAge": 123,
  "authentication": {
    "authentication.username": "<string>",
    "authentication.password": "<string>",
    "authentication.token": "<string>"
  },
  "extract": {
    "extract.schema": {},
    "extract.prompt": "<string>"
  }
}
'
{
  "success": true,
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "queueJobId": "12345",
  "status": "queued",
  "startUrl": "https://example.com",
  "maxPages": 5,
  "maxDepth": 2,
  "estimatedTime": "3 minutes",
  "statusUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000",
  "creditsUsed": 5,
  "processingMode": "async"
}
The Crawl API automatically discovers and scrapes content from an entire website or a specific section of it: starting from a single URL, it follows links and extracts structured data from every page it visits.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY
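For example, with Python's requests library (a minimal sketch; substitute your real API key for the placeholder):

import requests

API_KEY = "whizo_YOUR-API-KEY"  # placeholder: use your real API key

response = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={
        "Authorization": f"Bearer {API_KEY}",  # Bearer token, as described above
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com"},
)
print(response.json()["jobId"])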

Request Body

url
string
required
The starting URL to begin crawling. Must be a valid HTTP/HTTPS URL.
maxPages
number
default:"10"
Maximum number of pages to crawl (1-1000)
maxDepth
number
default:"2"
Maximum crawling depth from the starting URL (1-10)
format
string
default:"markdown"
Output format for scraped content
  • markdown - Clean markdown format (default)
  • html - Raw HTML content
  • text - Plain text only
  • json - Structured JSON format
sameDomain
boolean
default:"true"
Only crawl pages within the same domain as the starting URL
includeScreenshot
boolean
default:"false"
Capture screenshots of each page (+1 credit per page)
includePdf
boolean
default:"false"
Generate PDFs of each page (+1 credit per page)
includeLinks
boolean
Extract links from each page for further crawling
includeImages
boolean
default:"false"
Extract image URLs and metadata
includeMetadata
boolean
default:"true"
Include page metadata (title, description, etc.)
onlyMainContent
boolean
default:"false"
Extract only main content, excluding navigation and sidebars
mobile
boolean
default:"false"
Use mobile viewport for rendering pages
waitTime
number
default:"1000"
Time to wait after page load in milliseconds (0-30000)
timeout
number
default:"30000"
Maximum time to wait for each page load in milliseconds (5000-120000)
removeAds
boolean
default:"false"
Automatically remove advertisements and tracking scripts
removeScripts
boolean
default:"false"
Remove JavaScript from HTML output
removeStyles
boolean
default:"false"
Remove CSS styling from HTML output
userAgent
string
Custom user agent string for requests
priority
string
default:"low"
Job processing priority
  • low - Standard processing
  • normal - Higher priority
  • high - Highest priority (Pro+ plans)
headers
object
Custom HTTP headers to send with each request
stealth
boolean
default:"false"
Enable stealth mode with automatic escalation on bot detection for all crawled pages. Automatically upgrades from basic → advanced → maximum stealth if bot detection is encountered (+5 credits per page)
useProxy
boolean
default:"false"
Enable premium proxy rotation from a pool of 10+ rotating proxies for all crawled pages (+2 credits per page)
maxAge
number
default:"30"
Cache max age in days (1-365). Set to 0 to bypass cache.
authentication
object
Authentication credentials for protected pages
extract
object
AI-powered data extraction configuration
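As a sketch of how these two nested objects fit together from the Python SDK (the field names mirror the request body above; whether the SDK accepts them verbatim inside options is an assumption based on the examples further down this page):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl a login-protected site and extract structured fields from each page.
# Credential and extraction field names mirror the request body above.
result = client.crawl(
    url="https://docs.example.com",
    options={
        "maxPages": 20,
        "authentication": {          # credentials for protected pages
            "username": "demo-user",
            "password": "demo-pass",
        },
        "extract": {                 # AI-powered extraction config
            "prompt": "Extract the page title and a one-sentence summary",
        },
    },
)
print(result["jobId"])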

Response

success
boolean
Indicates if the crawl was initiated successfully
jobId
string
Unique identifier for tracking the crawl job
queueJobId
string
Queue system job ID for monitoring (if using async processing)
status
string
Current job status: queued or running
startUrl
string
The initial URL that crawling started from
maxPages
number
Maximum pages configured to crawl
maxDepth
number
Maximum crawling depth configured
estimatedTime
string
Estimated completion time (for queued jobs)
statusUrl
string
URL to check job status and progress
creditsUsed
number
Estimated credits that will be consumed
processingMode
string
Processing method: async (queued) or direct (immediate)
results
array
Crawl results (only for direct processing mode)

Examples

Basic Website Crawl

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl an entire website
result = client.crawl(
    url="https://example.com",
    options={
        "maxDepth": 2,
        "maxPages": 10,
        "format": "markdown",
        "excludePaths": ["/admin", "/login"]
    }
)

print(f"Job ID: {result['jobId']}")
print(f"Status: {result['status']}")
{
  "success": true,
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "queueJobId": "12345",
  "status": "queued",
  "startUrl": "https://example.com",
  "maxPages": 5,
  "maxDepth": 2,
  "estimatedTime": "3 minutes",
  "statusUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000",
  "creditsUsed": 5,
  "processingMode": "async"
}

Advanced Crawl with Screenshots and AI Extraction

curl -X POST "https://api.whizo.ai/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://ecommerce-site.com",
    "maxPages": 20,
    "maxDepth": 3,
    "format": "json",
    "includeScreenshot": true,
    "sameDomain": true,
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "productName": {"type": "string"},
          "price": {"type": "string"},
          "description": {"type": "string"},
          "availability": {"type": "string"}
        }
      }
    }
  }'
{
  "success": true,
  "jobId": "750e8400-e29b-41d4-a716-446655440001",
  "status": "queued",
  "startUrl": "https://ecommerce-site.com",
  "maxPages": 20,
  "maxDepth": 3,
  "estimatedTime": "15 minutes",
  "statusUrl": "/v1/jobs/750e8400-e29b-41d4-a716-446655440001",
  "creditsUsed": 60,
  "processingMode": "async"
}

Monitor Crawl Progress

Once your crawl is initiated, monitor its progress:
curl -H "Authorization: Bearer YOUR_API_KEY" \
     https://api.whizo.ai/v1/jobs/550e8400-e29b-41d4-a716-446655440000/status
{
  "success": true,
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "running",
    "progress": 60,
    "currentStep": "Crawling page 3 of 5",
    "currentUrl": "https://example.com/page-3",
    "pagesCompleted": 3,
    "pagesTotal": 5,
    "creditsUsed": 3,
    "estimatedTimeRemaining": 120000
  }
}
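A minimal polling sketch in Python, reusing the field names from the responses above (the five-second interval and the assumption that any status other than queued or running is terminal are illustrative choices, not documented behavior):

import time
import requests

BASE_URL = "https://api.whizo.ai"
HEADERS = {"Authorization": "Bearer whizo_YOUR-API-KEY"}

def wait_for_crawl(job):
    """Poll a crawl job until it leaves the queued/running states."""
    if job.get("processingMode") == "direct":
        return job.get("results", [])          # direct mode returns results inline

    status_url = f"{BASE_URL}/v1/jobs/{job['jobId']}/status"  # same path as the curl example above
    while True:
        data = requests.get(status_url, headers=HEADERS).json()["data"]
        print(f"{data['status']}: {data.get('currentStep', '')} ({data.get('progress', 0)}%)")
        if data["status"] not in ("queued", "running"):
            return data                        # assumption: any other status is terminal
        time.sleep(5)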

Crawl Patterns

Sitemap-based Crawling

For more efficient crawling, the API automatically detects and uses sitemaps when available:
{
  "url": "https://example.com",
  "maxPages": 100,
  "priority": "high"
}

Filtered Crawling

Use patterns to include or exclude specific URL patterns:
{
  "url": "https://blog.example.com",
  "maxPages": 50,
  "includePatterns": ["/blog/*", "/articles/*"],
  "excludePatterns": ["/admin/*", "*.pdf"]
}
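The same filters can be passed from the Python SDK (a sketch; it assumes includePatterns and excludePatterns are accepted inside options exactly as in the JSON above):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl only the blog and articles sections, skipping admin pages and PDFs.
result = client.crawl(
    url="https://blog.example.com",
    options={
        "maxPages": 50,
        "includePatterns": ["/blog/*", "/articles/*"],   # same glob patterns as above
        "excludePatterns": ["/admin/*", "*.pdf"],
    },
)
print(result["jobId"], result["status"])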

Credit Costs

Crawling costs vary based on features used:
Feature                                   Cost per Page
Basic crawling (markdown/html/text)       1 credit
JSON format                               1 credit
Screenshot capture                        +1 credit
PDF generation                            +1 credit
AI extraction                             +2 credits
Stealth mode with auto-escalation         +5 credits
Premium proxy rotation                    +2 credits
Examples:
  • Crawling 10 pages with screenshots = 20 credits (10 × 2)
  • Crawling 10 pages with stealth + proxy = 80 credits (10 × (1 + 5 + 2))
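A small helper that reproduces this arithmetic (a hypothetical illustration, not part of the API or SDK):

# Hypothetical helper: estimates crawl cost from the per-page table above.
FEATURE_COSTS = {
    "includeScreenshot": 1,
    "includePdf": 1,
    "extract": 2,
    "stealth": 5,
    "useProxy": 2,
}

def estimate_credits(pages: int, options: dict) -> int:
    per_page = 1  # base cost for any format (markdown/html/text/json)
    per_page += sum(cost for opt, cost in FEATURE_COSTS.items() if options.get(opt))
    return pages * per_page

print(estimate_credits(10, {"includeScreenshot": True}))          # 20
print(estimate_credits(10, {"stealth": True, "useProxy": True}))  # 80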

Error Responses

error
object

Common Errors

Status Code   Error Code              Description
400           invalid_url             The starting URL is invalid or unreachable
400           invalid_options         Crawl parameters are invalid
401           unauthorized            Invalid or missing API key
403           insufficient_credits    Not enough credits for the crawl
429           rate_limited            Rate limit exceeded
500           crawl_failed            Crawling process failed
{
  "success": false,
  "error": {
    "code": "invalid_url",
    "message": "The starting URL is not accessible",
    "details": {
      "url": "https://invalid-site.com",
      "statusCode": 404
    }
  }
}
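A minimal error-handling sketch (it assumes the error envelope shown above is returned for every failure; the handling branches are illustrative):

import requests

response = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={"Authorization": "Bearer whizo_YOUR-API-KEY"},
    json={"url": "https://example.com"},
)
body = response.json()

if not body.get("success", False):
    error = body["error"]
    # Branch on the documented error codes.
    if error["code"] == "insufficient_credits":
        print("Top up credits before retrying.")
    elif error["code"] == "rate_limited":
        print("Back off and retry later (see Rate Limits below).")
    else:
        print(f"Crawl failed ({response.status_code}): {error['message']}")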

Rate Limits

Crawl rate limits by plan:
  • Free: 2 crawls per hour, 10 per day
  • Starter: 10 crawls per hour, 50 per day
  • Pro: 50 crawls per hour, 200 per day
  • Enterprise: Custom limits
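When a limit is exceeded the API returns 429 rate_limited (see Common Errors above). A simple exponential-backoff sketch; the retry count and delays are assumptions:

import time
import requests

def crawl_with_backoff(body, api_key, retries=5):
    """Retry a crawl request with exponential backoff on 429 responses."""
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(retries):
        response = requests.post("https://api.whizo.ai/v1/crawl",
                                 headers=headers, json=body)
        if response.status_code != 429:
            return response.json()
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError("Still rate limited after retries")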

Best Practices

  • Start with depth 1-2 for initial exploration
  • Increase depth gradually based on site structure
  • Use sitemap URLs when available for efficiency
  • Use shorter timeouts (10-15s) for fast sites
  • Increase timeout (30s+) for slow-loading pages
  • Consider mobile viewport for mobile-first sites
  • Use sameDomain: true to avoid external links
  • Set reasonable maxPages limits
  • Monitor progress via status endpoint
  • Use priority queuing for time-sensitive crawls
  • Define clear JSON schemas for consistent extraction
  • Use specific prompts for better AI extraction
  • Test extraction on a few pages before large crawls
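Pulled together, a request that follows most of these recommendations might look like this (a sketch; the values are illustrative, not prescriptive):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Conservative first crawl: shallow depth, bounded pages, same-domain only,
# main content extracted, with a clear schema for consistent AI extraction.
result = client.crawl(
    url="https://example.com",
    options={
        "maxDepth": 2,
        "maxPages": 25,
        "sameDomain": True,
        "onlyMainContent": True,
        "timeout": 15000,            # shorter timeout for a fast site
        "priority": "normal",
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "summary": {"type": "string"},
                },
            },
        },
    },
)
print(f"Track progress at {result['statusUrl']}")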