POST https://api.whizo.ai/v1/batch
Batch Scraping API (Deprecated)
curl --request POST \
  --url https://api.whizo.ai/v1/batch \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '
{
  "urls": [
    {}
  ],
  "format": "<string>",
  "includeScreenshot": true,
  "includePdf": true,
  "includeLinks": true,
  "includeImages": true,
  "includeMetadata": true,
  "onlyMainContent": true,
  "removeAds": true,
  "removeScripts": true,
  "removeStyles": true,
  "mobile": true,
  "waitTime": 123,
  "timeout": 123,
  "concurrency": 123,
  "retryCount": 123,
  "delayBetweenRequests": 123,
  "priority": "<string>",
  "userAgent": "<string>",
  "headers": {},
  "authentication": {
    "authentication.username": "<string>",
    "authentication.password": "<string>",
    "authentication.token": "<string>"
  },
  "extract": {
    "extract.schema": {},
    "extract.prompt": "<string>"
  }
}
'
{
  "success": true,
  "jobId": "batch_abc123xyz",
  "queueJobId": "queue_456def",
  "status": "queued",
  "urlCount": 5,
  "estimatedCredits": 5,
  "estimatedTime": "30-60 seconds",
  "statusUrl": "https://api.whizo.ai/v1/jobs/batch_abc123xyz"
}
Deprecated as of v2.2.1 (January 16, 2025)

The separate batch endpoint has been deprecated in favor of the unified /v1/scrape endpoint. Use the urls array parameter in /v1/scrape for multi-URL processing. This provides the same functionality with a cleaner API design.

Migration Guide:
  • Change POST /v1/batch to POST /v1/scrape
  • Replace urls array in request body (same format)
  • All options and features remain supported
  • Historical batch jobs remain accessible
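
For example, a minimal migration sketch in Python using requests (assuming, per the guide above, that /v1/scrape accepts the same request body and returns the same job fields):

import requests

API_KEY = "whizo_YOUR-API-KEY"

payload = {
    "urls": [
        "https://example.com/page1",
        "https://example.com/page2"
    ],
    "format": "markdown"
}

# Old (deprecated): POST https://api.whizo.ai/v1/batch
# New:              POST https://api.whizo.ai/v1/scrape, same body
response = requests.post(
    "https://api.whizo.ai/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload
)
job = response.json()
print(job["jobId"], job["status"])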
The Batch API allows you to scrape multiple URLs in a single request with automatic parallelization, retry logic, and rate limiting. Perfect for processing large lists of URLs efficiently.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY

Request Body

urls
array
required
Array of URLs to scrape (1-100 URLs per request)
[
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3"
]

Output Options

format
string
default:"json"
Output format for scraped content
  • markdown - Clean markdown format
  • html - HTML content
  • text - Plain text only
  • json - Structured JSON format (default)
includeScreenshot
boolean
default:"false"
Capture screenshots for each URL (+1 credit per page)
includePdf
boolean
default:"false"
Generate PDFs for each URL (+1 credit per page)
includeLinks
boolean
default:"false"
Extract hyperlinks from content
includeImages
boolean
default:"false"
Extract image URLs and metadata
includeMetadata
boolean
default:"true"
Include page metadata (title, description, etc.)

Content Processing

onlyMainContent
boolean
default:"false"
Extract only main content, removing headers/footers/sidebars
removeAds
boolean
default:"false"
Remove advertisements and tracking scripts
removeScripts
boolean
default:"false"
Remove JavaScript from HTML output
removeStyles
boolean
default:"false"
Remove CSS styling from HTML output

Browser Options

mobile
boolean
default:"false"
Use mobile viewport for rendering
waitTime
number
default:"1000"
Time to wait after page load in milliseconds (0-30000)
timeout
number
default:"30000"
Maximum time to wait for each page load in milliseconds (5000-120000)

Batch Processing Options

concurrency
number
default:"3"
Number of URLs to process simultaneously (1-10). Higher values = faster processing but more resources.
retryCount
number
default:"2"
Number of retry attempts for failed URLs (0-5)
delayBetweenRequests
number
default:"1000"
Delay between requests in milliseconds (0-10000). Useful for respecting rate limits on target websites.
priority
string
default:"low"
Job processing priority
  • low - Standard processing
  • normal - Higher priority
  • high - Highest priority (Pro+ plans)

Request Customization

userAgent
string
Custom user agent string for all requests
headers
object
Custom HTTP headers to send with each request
{
  "Accept-Language": "en-US",
  "Custom-Header": "value"
}
authentication
object
Authentication credentials for protected pages
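The credential fields match those in the request example above; supply whichever the target site requires (which fields are accepted together is an assumption based on the request schema):
{
  "username": "<string>",
  "password": "<string>",
  "token": "<string>"
}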

AI Extraction

extract
object
AI-powered data extraction configuration (applied to all URLs)
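A typical shape, matching the AI extraction example later on this page:
{
  "schema": {
    "companyName": "string",
    "foundedYear": "number"
  },
  "prompt": "Extract company information from the page"
}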

Response

success
boolean
Indicates if the batch job was initiated successfully
jobId
string
Unique identifier for tracking the batch job
queueJobId
string
Queue system job ID (if using async processing)
status
string
Current job status: queued, running, or completed
urlCount
number
Total number of URLs in the batch
estimatedCredits
number
Estimated total credits that will be consumed
estimatedTime
string
Estimated completion time
statusUrl
string
URL for polling job status and retrieving results
data
object
Batch scraping results (only for synchronous/direct processing)

Examples

Basic Batch Scraping

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Batch scrape multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

result = client.batch_scrape(
    urls=urls,
    options={
        "format": "markdown",
        "includeScreenshot": False
    }
)

# Results will be processed asynchronously
print(f"Job ID: {result['jobId']}")
print(f"Status: {result['status']}")
print(f"Total URLs: {len(urls)}")
{
  "success": true,
  "jobId": "batch_abc123xyz",
  "queueJobId": "queue_456def",
  "status": "queued",
  "urlCount": 3,
  "estimatedCredits": 3,
  "estimatedTime": "30-60 seconds",
  "statusUrl": "https://api.whizo.ai/v1/jobs/batch_abc123xyz"
}

Advanced Batch with Concurrency Control

curl -X POST "https://api.whizo.ai/v1/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://site1.com",
      "https://site2.com",
      "https://site3.com"
    ],
    "format": "json",
    "concurrency": 5,
    "retryCount": 3,
    "delayBetweenRequests": 500,
    "includeMetadata": true,
    "onlyMainContent": true,
    "removeAds": true
  }'

Batch Scraping with AI Extraction

curl -X POST "https://api.whizo.ai/v1/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://company1.com/about",
      "https://company2.com/about",
      "https://company3.com/about"
    ],
    "format": "json",
    "extract": {
      "schema": {
        "companyName": "string",
        "foundedYear": "number",
        "employees": "string",
        "revenue": "string",
        "description": "string"
      },
      "prompt": "Extract company information from the about page"
    }
  }'
{
  "success": true,
  "jobId": "batch_extract_789",
  "status": "queued",
  "urlCount": 3,
  "estimatedCredits": 9,
  "message": "Batch job queued. Use the statusUrl to track progress.",
  "statusUrl": "https://api.whizo.ai/v1/jobs/batch_extract_789"
}

Direct Processing Response (Synchronous)

When queue processing is disabled, the batch API returns results immediately, grouped into results, successful, and failed arrays alongside a processing summary:
{
  "success": true,
  "data": {
    "results": [
      {
        "url": "https://example.com/page1",
        "title": "Page 1 Title",
        "description": "Description of page 1",
        "markdown": "# Page 1 Content\n\nThis is the content...",
        "content": "# Page 1 Content\n\nThis is the content...",
        "scrapedAt": "2025-01-15T14:30:00Z",
        "status": "success",
        "wordCount": 450,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 1234
        }
      },
      {
        "url": "https://example.com/page2",
        "title": "Page 2 Title",
        "description": "Description of page 2",
        "markdown": "# Page 2 Content\n\nAnother page...",
        "content": "# Page 2 Content\n\nAnother page...",
        "scrapedAt": "2025-01-15T14:30:02Z",
        "status": "success",
        "wordCount": 320,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 987
        }
      },
      {
        "url": "https://example.com/page3",
        "status": "failed",
        "error": "Request timeout after 30000ms",
        "scrapedAt": "2025-01-15T14:30:35Z",
        "metadata": {
          "statusCode": 0,
          "error": "Request timeout after 30000ms"
        }
      }
    ],
    "successful": [
      {
        "url": "https://example.com/page1",
        "title": "Page 1 Title",
        "description": "Description of page 1",
        "markdown": "# Page 1 Content\n\nThis is the content...",
        "content": "# Page 1 Content\n\nThis is the content...",
        "scrapedAt": "2025-01-15T14:30:00Z",
        "status": "success",
        "wordCount": 450,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 1234
        }
      },
      {
        "url": "https://example.com/page2",
        "title": "Page 2 Title",
        "description": "Description of page 2",
        "markdown": "# Page 2 Content\n\nAnother page...",
        "content": "# Page 2 Content\n\nAnother page...",
        "scrapedAt": "2025-01-15T14:30:02Z",
        "status": "success",
        "wordCount": 320,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 987
        }
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page3",
        "status": "failed",
        "error": "Request timeout after 30000ms",
        "scrapedAt": "2025-01-15T14:30:35Z",
        "metadata": {
          "statusCode": 0,
          "error": "Request timeout after 30000ms"
        }
      }
    ],
    "summary": {
      "total": 3,
      "successful": 2,
      "failed": 1,
      "successRate": 67,
      "processingTime": 35420,
      "averageTimePerUrl": 11807,
      "creditsUsed": 2,
      "creditsPerUrl": 1,
      "concurrency": 3
    }
  },
  "warning": "1 of 3 URLs failed to scrape"
}

Error Responses

error
object
Error details returned when success is false, containing code, message, and optional details fields (see the example below)

Common Errors

Status Code   Error Code             Description
400           invalid_urls           One or more URLs are invalid
400           too_many_urls          Exceeded maximum of 100 URLs per batch
400           invalid_options        Invalid batch processing options
401           unauthorized           Invalid or missing API key
402           insufficient_credits   Not enough credits for batch job
429           rate_limited           Rate limit exceeded
500           batch_failed           Batch job failed to initialize
{
  "success": false,
  "error": {
    "code": "too_many_urls",
    "message": "Maximum 100 URLs allowed per batch request",
    "details": {
      "provided": 150,
      "maximum": 100
    }
  }
}
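
A minimal error-handling sketch in Python, keyed off the documented success flag and error.code values (the chunking and backoff responses are illustrative, not prescribed by the API):

import requests

response = requests.post(
    "https://api.whizo.ai/v1/batch",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"urls": ["https://example.com/page1"]}
)
body = response.json()

if not body.get("success"):
    error = body["error"]
    if error["code"] == "too_many_urls":
        # Split the URL list into chunks of error["details"]["maximum"] and resubmit
        print(f"Batch too large: {error['message']}")
    elif error["code"] == "rate_limited":
        print("Rate limited - retry after a backoff")
    else:
        print(f"{error['code']}: {error['message']}")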

Credit Costs

Credit costs are calculated per URL in the batch:
Feature           Cost Per URL
Basic scraping    1 credit
Screenshot        +1 credit
PDF generation    +1 credit
AI extraction     +2-3 credits
Examples:
  • 10 URLs, markdown format: 10 × 1 = 10 credits
  • 20 URLs with screenshots: 20 × (1 + 1) = 40 credits
  • 5 URLs with AI extraction: 5 × (1 + 3) = 20 credits
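
The arithmetic is simple enough to script; a hypothetical helper that reproduces the examples above (the AI figure uses the 3-credit upper bound from the table):

def estimate_credits(url_count, screenshot=False, pdf=False, ai_extract=False):
    """Estimate batch cost from the per-URL pricing table."""
    per_url = 1                          # basic scraping
    per_url += 1 if screenshot else 0    # +1 credit per screenshot
    per_url += 1 if pdf else 0           # +1 credit per PDF
    per_url += 3 if ai_extract else 0    # upper bound; actual is 2-3
    return url_count * per_url

assert estimate_credits(10) == 10
assert estimate_credits(20, screenshot=True) == 40
assert estimate_credits(5, ai_extract=True) == 20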

Rate Limits

Rate limits vary by plan:
  • Free: Max 10 URLs per batch, 2 concurrent batches
  • Starter: Max 50 URLs per batch, 5 concurrent batches
  • Pro: Max 100 URLs per batch, 10 concurrent batches
  • Enterprise: Custom limits

Best Practices

  1. Group similar URLs together in a batch for consistent processing
  2. Use appropriate concurrency - higher isn’t always better
  3. Set reasonable timeouts based on target site performance
  4. Enable retries for unreliable sites (retryCount: 2-3)
  5. Add delays when scraping the same domain to respect rate limits
  6. Monitor job progress using the statusUrl or /v1/jobs/:id endpoint
  7. Handle partial failures - some URLs may succeed while others fail (see the sketch after this list)
  8. Use authentication when scraping protected pages
  9. Start small - test with 5-10 URLs before scaling up
  10. Cache results - reuse batch results when possible
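
A sketch of partial-failure handling, continuing the Python SDK example above and assuming batch_scrape returns the synchronous response documented earlier as a parsed dict:

result = client.batch_scrape(urls=urls, options={"format": "markdown"})

# Synchronous responses group results into successful and failed lists
data = result.get("data", {})
for page in data.get("successful", []):
    print(f"OK   {page['url']} ({page['wordCount']} words)")
for page in data.get("failed", []):
    print(f"FAIL {page['url']}: {page['error']}")

summary = data.get("summary", {})
print(f"{summary.get('successful')}/{summary.get('total')} succeeded")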

Monitoring Batch Jobs

After submitting a batch, use the Jobs API to monitor progress:
# Check job status
curl -X GET "https://api.whizo.ai/v1/jobs/{jobId}" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get job results
curl -X GET "https://api.whizo.ai/v1/jobs/{jobId}/results" \
  -H "Authorization: Bearer YOUR_API_KEY"

Performance Tips

Optimal Concurrency Settings

Batch Size     Recommended Concurrency
1-10 URLs      3-5
11-25 URLs     5-7
26-50 URLs     7-10
51-100 URLs    10 (max)

Processing Time Estimates

  • Lightweight scraping: ~1-2 seconds per URL
  • JavaScript rendering: ~3-5 seconds per URL
  • With screenshots: ~4-6 seconds per URL
  • With AI extraction: ~5-10 seconds per URL
Example: Batch of 50 URLs with markdown format and concurrency=5
  • Sequential time: 50 × 2s = 100 seconds
  • Parallel time: (50 / 5) × 2s = 20 seconds
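
The same back-of-envelope math as a hypothetical helper (real timings vary with target sites and options):

import math

def estimate_seconds(url_count, per_url_seconds, concurrency):
    """Rough wall-clock estimate: URLs are processed in waves of size concurrency."""
    return math.ceil(url_count / concurrency) * per_url_seconds

print(estimate_seconds(50, 2, 1))  # sequential: 100 seconds
print(estimate_seconds(50, 2, 5))  # concurrency 5: 20 seconds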