The Crawl API enables you to automatically discover and scrape content from an entire website or specific sections by following links and extracting data from multiple pages in a structured manner.
Authentication
Authenticate every request with a bearer token containing your API key: Authorization: Bearer YOUR_API_KEY
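The examples below use the Python SDK, but the same token can be sent with any HTTP client. A minimal sketch using the requests library (the endpoint path follows the curl examples later on this page):

import requests

API_KEY = "whizo_YOUR-API-KEY"

# Every request carries the API key as a Bearer token in the Authorization header.
response = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com"},
)
print(response.status_code)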
Request Body
The starting URL to begin crawling. Must be a valid HTTP/HTTPS URL.
Maximum number of pages to crawl (1-1000)
Maximum crawling depth from the starting URL (1-10)
Output format for scraped content:
markdown - Clean markdown format (default)
html - Raw HTML content
text - Plain text only
json - Structured JSON format
Only crawl pages within the same domain as the starting URL
Capture screenshots of each page (+1 credit per page)
Generate PDFs of each page (+1 credit per page)
Extract links from each page for further crawling
Extract image URLs and metadata
Include page metadata (title, description, etc.)
Extract only main content, excluding navigation and sidebars
Use mobile viewport for rendering pages
Time to wait after page load in milliseconds (0-30000)
Maximum time to wait for each page load in milliseconds (5000-120000)
Automatically remove advertisements and tracking scripts
Remove JavaScript from HTML output
Remove CSS styling from HTML output
Custom user agent string for requests
Job processing priority:
low - Standard processing
normal - Higher priority
high - Highest priority (Pro+ plans)
Custom HTTP headers to send with each request
Enable stealth mode with automatic escalation for all crawled pages. Escalates from basic → advanced → maximum stealth when bot detection is encountered (+5 credits per page).
Enable premium proxy rotation from a pool of 10+ rotating proxies for all crawled pages (+2 credits per page)
Cache max age in days (1-365). Set to 0 to bypass cache.
Authentication credentials for protected pages
Bearer token for API authentication
AI-powered data extraction configuration
JSON schema for structured data extraction
Custom prompt for AI extraction
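Putting several of these options together, here is a sketch of a crawl request over raw HTTP. It uses only option names that appear in the examples on this page (maxPages, maxDepth, format, sameDomain, includeScreenshot, priority, excludePatterns); the other request-body fields described above are omitted because their exact names are not shown here.

import requests

payload = {
    "url": "https://example.com",
    "maxPages": 25,
    "maxDepth": 2,
    "format": "markdown",
    "sameDomain": True,          # stay within the starting domain
    "includeScreenshot": False,  # screenshots add +1 credit per page
    "priority": "normal",
    "excludePatterns": ["/admin/*", "*.pdf"],
}

resp = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=payload,
)
resp.raise_for_status()
job = resp.json()
print(job["jobId"], job["processingMode"])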
Response
Indicates if the crawl was initiated successfully
Unique identifier for tracking the crawl job
Queue system job ID for monitoring (if using async processing)
Current job status: queued or running
The initial URL that crawling started from
Maximum pages configured to crawl
Maximum crawling depth configured
Estimated completion time (for queued jobs)
URL to check job status and progress
Estimated credits that will be consumed
Processing method: async (queued) or direct (immediate)
Crawl results (returned only in direct processing mode). Each result includes:
Extracted content in the specified format
Page metadata including title, description, status code
Links discovered on this page
Images found on this page (if enabled)
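Because the response shape depends on the processing mode, here is a small sketch of branching on processingMode (the "data" key used for inline results is an assumption, since the field name is not spelled out above):

def handle_crawl_response(job: dict) -> None:
    """job is the parsed JSON body returned by POST /v1/crawl."""
    if job["processingMode"] == "direct":
        # Direct mode returns results inline; "data" is an assumed field name.
        for page in job.get("data", []):
            print(page.get("metadata", {}).get("title"))
    else:
        # Async mode returns a status URL to poll instead of inline results.
        print("Poll", job["statusUrl"], "until the crawl completes")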
Examples
Basic Website Crawl
from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Crawl an entire website
result = client.crawl(
    url="https://example.com",
    options={
        "maxDepth": 2,
        "maxPages": 5,
        "format": "markdown",
        "excludePaths": ["/admin", "/login"]
    }
)

print(f"Job ID: {result['jobId']}")
print(f"Status: {result['status']}")
{
  "success": true,
  "jobId": "550e8400-e29b-41d4-a716-446655440000",
  "queueJobId": "12345",
  "status": "queued",
  "startUrl": "https://example.com",
  "maxPages": 5,
  "maxDepth": 2,
  "estimatedTime": "3 minutes",
  "statusUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000",
  "creditsUsed": 5,
  "processingMode": "async"
}
Advanced Crawl with Screenshots and AI Extraction
curl -X POST "https://api.whizo.ai/v1/crawl" \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"url": "https://ecommerce-site.com",
"maxPages": 20,
"maxDepth": 3,
"format": "json",
"includeScreenshot": true,
"sameDomain": true,
"extract": {
"schema": {
"type": "object",
"properties": {
"productName": {"type": "string"},
"price": {"type": "string"},
"description": {"type": "string"},
"availability": {"type": "string"}
}
}
}
}'
{
  "success": true,
  "jobId": "750e8400-e29b-41d4-a716-446655440001",
  "status": "queued",
  "startUrl": "https://ecommerce-site.com",
  "maxPages": 20,
  "maxDepth": 3,
  "estimatedTime": "15 minutes",
  "statusUrl": "/v1/jobs/750e8400-e29b-41d4-a716-446655440001",
  "creditsUsed": 60,
  "processingMode": "async"
}
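The same request through the Python SDK, assuming the SDK passes these options through to the REST body unchanged:

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Equivalent of the curl call above; option names are assumed to match the REST body.
result = client.crawl(
    url="https://ecommerce-site.com",
    options={
        "maxPages": 20,
        "maxDepth": 3,
        "format": "json",
        "includeScreenshot": True,
        "sameDomain": True,
        "extract": {
            "schema": {
                "type": "object",
                "properties": {
                    "productName": {"type": "string"},
                    "price": {"type": "string"},
                    "description": {"type": "string"},
                    "availability": {"type": "string"}
                }
            }
        }
    }
)
print(f"Job ID: {result['jobId']}")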
Monitor Crawl Progress
Once your crawl is initiated, monitor its progress:
curl -H "Authorization: Bearer YOUR_API_KEY" \
https://api.whizo.ai/v1/jobs/550e8400-e29b-41d4-a716-446655440000/status
{
  "success": true,
  "data": {
    "id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "running",
    "progress": 60,
    "currentStep": "Crawling page 3 of 5",
    "currentUrl": "https://example.com/page-3",
    "pagesCompleted": 3,
    "pagesTotal": 5,
    "creditsUsed": 3,
    "estimatedTimeRemaining": 120000
  }
}
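A sketch of polling this endpoint until the job leaves the queued/running states (terminal status names such as completed or failed are assumptions, since only queued and running are documented above):

import time
import requests

def wait_for_crawl(job_id: str, api_key: str, poll_seconds: int = 10) -> dict:
    """Poll /v1/jobs/{id}/status until the crawl is no longer queued or running."""
    url = f"https://api.whizo.ai/v1/jobs/{job_id}/status"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        data = requests.get(url, headers=headers).json()["data"]
        print(f"{data['status']}: {data.get('progress', 0)}% - {data.get('currentStep', '')}")
        if data["status"] not in ("queued", "running"):
            return data  # assumed terminal states, e.g. completed or failed
        time.sleep(poll_seconds)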
Crawl Patterns
Sitemap-based Crawling
For more efficient crawling, the API automatically detects and uses sitemaps when available:
{
  "url": "https://example.com",
  "maxPages": 100,
  "priority": "high"
}
Filtered Crawling
Use include and exclude patterns to control which URLs are crawled:
{
  "url": "https://blog.example.com",
  "maxPages": 50,
  "includePatterns": ["/blog/*", "/articles/*"],
  "excludePatterns": ["/admin/*", "*.pdf"]
}
Credit Costs
Crawling costs vary based on features used:
| Feature | Cost per Page |
| --- | --- |
| Basic crawling (markdown/html/text) | 1 credit |
| JSON format | 1 credit |
| Screenshot capture | +1 credit |
| PDF generation | +1 credit |
| AI extraction | +2 credits |
| Stealth mode with auto-escalation | +5 credits |
| Premium proxy rotation | +2 credits |
Examples:
Crawling 10 pages with screenshots = 20 credits (10 × 2)
Crawling 10 pages with stealth + proxy = 80 credits (10 × (1 + 5 + 2))
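For planning, a small helper that mirrors the table above; it is illustrative only, and the creditsUsed estimate returned by the API is authoritative:

def estimate_crawl_credits(
    pages: int,
    screenshot: bool = False,
    pdf: bool = False,
    ai_extraction: bool = False,
    stealth: bool = False,
    premium_proxy: bool = False,
) -> int:
    """Estimate credits using the per-page costs from the table above."""
    per_page = 1  # basic crawling, any output format
    per_page += 1 if screenshot else 0
    per_page += 1 if pdf else 0
    per_page += 2 if ai_extraction else 0
    per_page += 5 if stealth else 0
    per_page += 2 if premium_proxy else 0
    return pages * per_page

print(estimate_crawl_credits(10, screenshot=True))                    # 20
print(estimate_crawl_credits(10, stealth=True, premium_proxy=True))   # 80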
Error Responses
Failed requests return an error object with a machine-readable code and a human-readable error message.
Common Errors
| Status Code | Error Code | Description |
| --- | --- | --- |
| 400 | invalid_url | The starting URL is invalid or unreachable |
| 400 | invalid_options | Crawl parameters are invalid |
| 401 | unauthorized | Invalid or missing API key |
| 403 | insufficient_credits | Not enough credits for the crawl |
| 429 | rate_limited | Rate limit exceeded |
| 500 | crawl_failed | Crawling process failed |
{
  "success": false,
  "error": {
    "code": "invalid_url",
    "message": "The starting URL is not accessible",
    "details": {
      "url": "https://invalid-site.com",
      "statusCode": 404
    }
  }
}
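A sketch of mapping these error codes to handling logic in a client:

import requests

resp = requests.post(
    "https://api.whizo.ai/v1/crawl",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://example.com"},
)

if not resp.ok:
    error = resp.json().get("error", {})
    code = error.get("code")
    if code == "insufficient_credits":
        print("Top up credits before retrying:", error.get("message"))
    elif code == "rate_limited":
        print("Back off and retry later")
    else:
        print(f"Crawl request failed ({code}): {error.get('message')}")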
Rate Limits
Crawl rate limits by plan:
Free: 2 crawls per hour, 10 per day
Starter: 10 crawls per hour, 50 per day
Pro: 50 crawls per hour, 200 per day
Enterprise: Custom limits
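Requests beyond these limits return 429 rate_limited. A simple retry-with-backoff sketch; the Retry-After header is an assumption, so the code falls back to an exponential delay when it is absent:

import time
import requests

def crawl_with_backoff(payload: dict, api_key: str, max_retries: int = 3) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    for attempt in range(max_retries):
        resp = requests.post("https://api.whizo.ai/v1/crawl", headers=headers, json=payload)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Respect Retry-After if present (assumed header); otherwise back off exponentially.
        delay = int(resp.headers.get("Retry-After", 30 * 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError("Rate limit still exceeded after retries")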
Best Practices
Start with depth 1-2 for initial exploration
Increase depth gradually based on site structure
Use sitemap URLs when available for efficiency
Configure appropriate page load timeouts and wait times
Use sameDomain: true to avoid external links
Set reasonable maxPages limits
Monitor progress via status endpoint
Use priority queuing for time-sensitive crawls
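A conservative starting configuration that reflects the recommendations above (option names are taken from the examples on this page; adjust as you learn the site's structure):

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

starter_options = {
    "maxDepth": 2,           # start shallow, increase gradually
    "maxPages": 25,          # keep the page budget modest at first
    "sameDomain": True,      # avoid wandering into external links
    "format": "markdown",
    "priority": "normal",    # reserve "high" for time-sensitive crawls
}

result = client.crawl(url="https://example.com", options=starter_options)
print(result["statusUrl"])   # monitor progress via the status endpoint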