Crawl API

POST https://api.whizo.ai/v1/crawl
curl --request POST \
  --url https://api.whizo.ai/v1/crawl \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "url": "<string>",
  "maxPages": 123,
  "maxDepth": 123,
  "format": "<string>",
  "sameDomain": true,
  "includeScreenshot": true,
  "includePdf": true,
  "includeLinks": true,
  "includeImages": true,
  "includeMetadata": true,
  "onlyMainContent": true,
  "mobile": true,
  "waitTime": 123,
  "timeout": 123,
  "removeAds": true,
  "removeScripts": true,
  "removeStyles": true,
  "userAgent": "<string>",
  "priority": "<string>",
  "headers": {},
  "stealth": true,
  "useProxy": true,
  "maxAge": 123,
  "authentication": {
    "authentication.username": "<string>",
    "authentication.password": "<string>",
    "authentication.token": "<string>"
  },
  "extract": {
    "extract.schema": {},
    "extract.prompt": "<string>"
  }
}
'
{
  "success": true,
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "queueJobId": "12345",
  "status": "queued",
  "startUrl": "https://example.com",
  "maxPages": 5,
  "maxDepth": 2,
  "estimatedTime": "3 minutes",
  "statusUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000",
  "creditsUsed": 5,
  "processingMode": "async"
}
The Crawl API automatically discovers and scrapes content from an entire website or a specific section of it: starting from a single URL, it follows links and extracts structured data from every page it visits.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY
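For example, with Python's requests library (a minimal sketch; substitute your real API key for the placeholder):

import requests

API_KEY = "whizo_YOUR-API-KEY"  # placeholder: use your real API key

response = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={
        "Authorization": f"Bearer {API_KEY}",  # Bearer token, as described above
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com"},
)
print(response.json()["jobId"])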

Request Body

url
string
required
The starting URL to begin crawling. Must be a valid HTTP/HTTPS URL.
maxPages
number
default:"10"
Maximum number of pages to crawl (1-1000)
maxDepth
number
default:"2"
Maximum crawling depth from the starting URL (1-10)
format
string
default:"markdown"
Output format for scraped content
  • markdown - Clean markdown format (default)
  • html - Raw HTML content
  • text - Plain text only
  • json - Structured JSON format
sameDomain
boolean
default:"true"
Only crawl pages within the same domain as the starting URL
includeScreenshot
boolean
default:"false"
Capture screenshots of each page (+1 credit per page)
includePdf
boolean
default:"false"
Generate PDFs of each page (+1 credit per page)
includeLinks
boolean
Extract links from each page for further crawling
includeImages
boolean
default:"false"
Extract image URLs and metadata
includeMetadata
boolean
default:"true"
Include page metadata (title, description, etc.)
onlyMainContent
boolean
default:"false"
Extract only main content, excluding navigation and sidebars
mobile
boolean
default:"false"
Use mobile viewport for rendering pages
waitTime
number
default:"1000"
Time to wait after page load in milliseconds (0-30000)
timeout
number
default:"30000"
Maximum time to wait for each page load in milliseconds (5000-120000)
removeAds
boolean
default:"false"
Automatically remove advertisements and tracking scripts
removeScripts
boolean
default:"false"
Remove JavaScript from HTML output
removeStyles
boolean
default:"false"
Remove CSS styling from HTML output
userAgent
string
Custom user agent string for requests
priority
string
default:"low"
Job processing priority
  • low - Standard processing
  • normal - Higher priority
  • high - Highest priority (Pro+ plans)
headers
object
Custom HTTP headers to send with each request
stealth
boolean
default:"false"
Enable stealth mode with automatic escalation on bot detection for all crawled pages. Automatically upgrades from basic → advanced → maximum stealth if bot detection is encountered (+5 credits per page)
useProxy
boolean
default:"false"
Enable premium proxy rotation from a pool of 10+ rotating proxies for all crawled pages (+2 credits per page)
maxAge
number
default:"30"
Cache max age in days (1-365). Set to 0 to bypass cache.
authentication
object
Authentication credentials for protected pages
extract
object
AI-powered data extraction configuration
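As a sketch of how these two nested objects fit together from the Python SDK (the field names mirror the request body above; whether the SDK accepts them verbatim inside options is an assumption based on the examples further down this page):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl a login-protected site and extract structured fields from each page.
# Credential and extraction field names mirror the request body above.
result = client.crawl(
    url="https://docs.example.com",
    options={
        "maxPages": 20,
        "authentication": {          # credentials for protected pages
            "username": "demo-user",
            "password": "demo-pass",
        },
        "extract": {                 # AI-powered extraction config
            "prompt": "Extract the page title and a one-sentence summary",
        },
    },
)
print(result["jobId"])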

Response

success
boolean
Indicates if the crawl was initiated successfully
jobId
string
Unique identifier for tracking the crawl job
queueJobId
string
Queue system job ID for monitoring (if using async processing)
status
string
Current job status: queued or running
startUrl
string
The initial URL that crawling started from
maxPages
number
Maximum pages configured to crawl
maxDepth
number
Maximum crawling depth configured
estimatedTime
string
Estimated completion time (for queued jobs)
statusUrl
string
URL to check job status and progress
creditsUsed
number
Estimated credits that will be consumed
processingMode
string
Processing method: async (queued) or direct (immediate)
results
array
Crawl results (only for direct processing mode)

Examples

Basic Website Crawl

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl an entire website
result = client.crawl(
    url="https://example.com",
    options={
        "maxDepth": 2,
        "maxPages": 10,
        "format": "markdown",
        "excludePaths": ["/admin", "/login"]
    }
)

print(f"Job ID: {result['jobId']}")
print(f"Status: {result['status']}")
{
  "success": true,
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "queueJobId": "12345",
  "status": "queued",
  "startUrl": "https://example.com",
  "maxPages": 5,
  "maxDepth": 2,
  "estimatedTime": "3 minutes",
  "statusUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000",
  "creditsUsed": 5,
  "processingMode": "async"
}

Advanced Crawl with Screenshots and AI Extraction

curl -X POST "https://api.whizo.ai/v1/crawl" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://ecommerce-site.com",
    "maxPages": 20,
    "maxDepth": 3,
    "format": "json",
    "includeScreenshot": true,
    "sameDomain": true,
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "productName": {"type": "string"},
          "price": {"type": "string"},
          "description": {"type": "string"},
          "availability": {"type": "string"}
        }
      }
    }
  }'
{
  "success": true,
  "jobId": "750e8400-e29b-41d4-a716-446655440001",
  "status": "queued",
  "startUrl": "https://ecommerce-site.com",
  "maxPages": 20,
  "maxDepth": 3,
  "estimatedTime": "15 minutes",
  "statusUrl": "/v1/jobs/750e8400-e29b-41d4-a716-446655440001",
  "creditsUsed": 60,
  "processingMode": "async"
}

Monitor Crawl Progress

Once your crawl is initiated, monitor its progress:
curl -H "Authorization: Bearer YOUR_API_KEY" \
     https://api.whizo.ai/v1/jobs/550e8400-e29b-41d4-a716-446655440000/status
{
  "success": true,
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "running",
    "progress": 60,
    "currentStep": "Crawling page 3 of 5",
    "currentUrl": "https://example.com/page-3",
    "pagesCompleted": 3,
    "pagesTotal": 5,
    "creditsUsed": 3,
    "estimatedTimeRemaining": 120000
  }
}
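A minimal polling sketch in Python, reusing the field names from the responses above (the five-second interval and the assumption that any status other than queued or running is terminal are illustrative choices, not documented behavior):

import time
import requests

BASE_URL = "https://api.whizo.ai"
HEADERS = {"Authorization": "Bearer whizo_YOUR-API-KEY"}

def wait_for_crawl(job):
    """Poll a crawl job until it leaves the queued/running states."""
    if job.get("processingMode") == "direct":
        return job.get("results", [])          # direct mode returns results inline

    status_url = f"{BASE_URL}/v1/jobs/{job['jobId']}/status"  # same path as the curl example above
    while True:
        data = requests.get(status_url, headers=HEADERS).json()["data"]
        print(f"{data['status']}: {data.get('currentStep', '')} ({data.get('progress', 0)}%)")
        if data["status"] not in ("queued", "running"):
            return data                        # assumption: any other status is terminal
        time.sleep(5)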

Crawl Patterns

Sitemap-based Crawling

For more efficient crawling, the API automatically detects and uses sitemaps when available:
{
  "url": "https://example.com",
  "maxPages": 100,
  "priority": "high"
}

Filtered Crawling

Use patterns to include or exclude specific URL patterns:
{
  "url": "https://blog.example.com",
  "maxPages": 50,
  "includePatterns": ["/blog/*", "/articles/*"],
  "excludePatterns": ["/admin/*", "*.pdf"]
}
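The same filters can be passed from the Python SDK (a sketch; it assumes includePatterns and excludePatterns are accepted inside options exactly as in the JSON above):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl only the blog and articles sections, skipping admin pages and PDFs.
result = client.crawl(
    url="https://blog.example.com",
    options={
        "maxPages": 50,
        "includePatterns": ["/blog/*", "/articles/*"],   # same glob patterns as above
        "excludePatterns": ["/admin/*", "*.pdf"],
    },
)
print(result["jobId"], result["status"])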

Credit Costs

Crawling costs vary based on features used:
Feature                                   Cost per Page
Basic crawling (markdown/html/text)       1 credit
JSON format                               1 credit
Screenshot capture                        +1 credit
PDF generation                            +1 credit
AI extraction                             +2 credits
Stealth mode with auto-escalation         +5 credits
Premium proxy rotation                    +2 credits
Examples:
  • Crawling 10 pages with screenshots = 20 credits (10 × 2)
  • Crawling 10 pages with stealth + proxy = 80 credits (10 × (1 + 5 + 2))
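A small helper that reproduces this arithmetic (a hypothetical illustration, not part of the API or SDK):

# Hypothetical helper: estimates crawl cost from the per-page table above.
FEATURE_COSTS = {
    "includeScreenshot": 1,
    "includePdf": 1,
    "extract": 2,
    "stealth": 5,
    "useProxy": 2,
}

def estimate_credits(pages: int, options: dict) -> int:
    per_page = 1  # base cost for any format (markdown/html/text/json)
    per_page += sum(cost for opt, cost in FEATURE_COSTS.items() if options.get(opt))
    return pages * per_page

print(estimate_credits(10, {"includeScreenshot": True}))          # 20
print(estimate_credits(10, {"stealth": True, "useProxy": True}))  # 80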

Error Responses

error
object

Common Errors

Status Code   Error Code              Description
400           invalid_url             The starting URL is invalid or unreachable
400           invalid_options         Crawl parameters are invalid
401           unauthorized            Invalid or missing API key
403           insufficient_credits    Not enough credits for the crawl
429           rate_limited            Rate limit exceeded
500           crawl_failed            Crawling process failed
{
  "success": false,
  "error": {
    "code": "invalid_url",
    "message": "The starting URL is not accessible",
    "details": {
      "url": "https://invalid-site.com",
      "statusCode": 404
    }
  }
}
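A minimal error-handling sketch (it assumes the error envelope shown above is returned for every failure; the handling branches are illustrative):

import requests

response = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={"Authorization": "Bearer whizo_YOUR-API-KEY"},
    json={"url": "https://example.com"},
)
body = response.json()

if not body.get("success", False):
    error = body["error"]
    # Branch on the documented error codes.
    if error["code"] == "insufficient_credits":
        print("Top up credits before retrying.")
    elif error["code"] == "rate_limited":
        print("Back off and retry later (see Rate Limits below).")
    else:
        print(f"Crawl failed ({response.status_code}): {error['message']}")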

Rate Limits

Crawl rate limits by plan:
  • Free: 2 crawls per hour, 10 per day
  • Starter: 10 crawls per hour, 50 per day
  • Pro: 50 crawls per hour, 200 per day
  • Enterprise: Custom limits
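When a limit is exceeded the API returns 429 rate_limited (see Common Errors above). A simple exponential-backoff sketch; the retry count and delays are assumptions:

import time
import requests

def crawl_with_backoff(body, api_key, retries=5):
    """Retry a crawl request with exponential backoff on 429 responses."""
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(retries):
        response = requests.post("https://api.whizo.ai/v1/crawl",
                                 headers=headers, json=body)
        if response.status_code != 429:
            return response.json()
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... between attempts
    raise RuntimeError("Still rate limited after retries")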

Best Practices

  • Start with depth 1-2 for initial exploration
  • Increase depth gradually based on site structure
  • Use sitemap URLs when available for efficiency
  • Use shorter timeouts (10-15s) for fast sites
  • Increase timeout (30s+) for slow-loading pages
  • Consider mobile viewport for mobile-first sites
  • Use sameDomain: true to avoid external links
  • Set reasonable maxPages limits
  • Monitor progress via status endpoint
  • Use priority queuing for time-sensitive crawls
  • Define clear JSON schemas for consistent extraction
  • Use specific prompts for better AI extraction
  • Test extraction on a few pages before large crawls
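Pulled together, a request that follows most of these recommendations might look like this (a sketch; the values are illustrative, not prescriptive):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Conservative first crawl: shallow depth, bounded pages, same-domain only,
# main content extracted, with a clear schema for consistent AI extraction.
result = client.crawl(
    url="https://example.com",
    options={
        "maxDepth": 2,
        "maxPages": 25,
        "sameDomain": True,
        "onlyMainContent": True,
        "timeout": 15000,            # shorter timeout for a fast site
        "priority": "normal",
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "summary": {"type": "string"},
                },
            },
        },
    },
)
print(f"Track progress at {result['statusUrl']}")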