POST https://api.whizo.ai/v1/batch
Batch Scraping API (Deprecated)
curl --request POST \
  --url https://api.whizo.ai/v1/batch \
  --header 'Authorization: Bearer YOUR_API_KEY' \
  --header 'Content-Type: application/json' \
  --data '
{
  "urls": [
    {}
  ],
  "format": "<string>",
  "includeScreenshot": true,
  "includePdf": true,
  "includeLinks": true,
  "includeImages": true,
  "includeMetadata": true,
  "onlyMainContent": true,
  "removeAds": true,
  "removeScripts": true,
  "removeStyles": true,
  "mobile": true,
  "waitTime": 123,
  "timeout": 123,
  "concurrency": 123,
  "retryCount": 123,
  "delayBetweenRequests": 123,
  "priority": "<string>",
  "userAgent": "<string>",
  "headers": {},
  "authentication": {
    "authentication.username": "<string>",
    "authentication.password": "<string>",
    "authentication.token": "<string>"
  },
  "extract": {
    "extract.schema": {},
    "extract.prompt": "<string>"
  }
}
'
{
  "success": true,
  "jobId": "batch_abc123xyz",
  "queueJobId": "queue_456def",
  "status": "queued",
  "urlCount": 5,
  "estimatedCredits": 5,
  "estimatedTime": "30-60 seconds",
  "statusUrl": "https://api.whizo.ai/v1/jobs/batch_abc123xyz"
}
Deprecated as of v2.2.1 (January 16, 2025)

The separate batch endpoint has been deprecated in favor of the unified /v1/scrape endpoint. Use the urls array parameter in /v1/scrape for multi-URL processing. This provides the same functionality with a cleaner API design.

Migration Guide:
  • Change POST /v1/batch to POST /v1/scrape
  • Replace urls array in request body (same format)
  • All options and features remain supported
  • Historical batch jobs remain accessible
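
For example, a minimal migration sketch in Python using requests (assuming, per the guide above, that /v1/scrape accepts the same request body and returns the same job fields):

import requests

API_KEY = "whizo_YOUR-API-KEY"

payload = {
    "urls": [
        "https://example.com/page1",
        "https://example.com/page2"
    ],
    "format": "markdown"
}

# Old (deprecated): POST https://api.whizo.ai/v1/batch
# New:              POST https://api.whizo.ai/v1/scrape, same body
response = requests.post(
    "https://api.whizo.ai/v1/scrape",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload
)
job = response.json()
print(job["jobId"], job["status"])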
The Batch API allows you to scrape multiple URLs in a single request with automatic parallelization, retry logic, and rate limiting. Perfect for processing large lists of URLs efficiently.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY

Request Body

urls
array
required
Array of URLs to scrape (1-100 URLs per request)
[
  "https://example.com/page1",
  "https://example.com/page2",
  "https://example.com/page3"
]

Output Options

format
string
default:"json"
Output format for scraped content
  • markdown - Clean markdown format
  • html - HTML content
  • text - Plain text only
  • json - Structured JSON format (default)
includeScreenshot
boolean
default:"false"
Capture screenshots for each URL (+1 credit per page)
includePdf
boolean
default:"false"
Generate PDFs for each URL (+1 credit per page)
includeLinks
boolean
default:"false"
Extract hyperlinks from content
includeImages
boolean
default:"false"
Extract image URLs and metadata
includeMetadata
boolean
default:"true"
Include page metadata (title, description, etc.)

Content Processing

onlyMainContent
boolean
default:"false"
Extract only main content, removing headers/footers/sidebars
removeAds
boolean
default:"false"
Remove advertisements and tracking scripts
removeScripts
boolean
default:"false"
Remove JavaScript from HTML output
removeStyles
boolean
default:"false"
Remove CSS styling from HTML output

Browser Options

mobile
boolean
default:"false"
Use mobile viewport for rendering
waitTime
number
default:"1000"
Time to wait after page load in milliseconds (0-30000)
timeout
number
default:"30000"
Maximum time to wait for each page load in milliseconds (5000-120000)

Batch Processing Options

concurrency
number
default:"3"
Number of URLs to process simultaneously (1-10). Higher values = faster processing but more resources.
retryCount
number
default:"2"
Number of retry attempts for failed URLs (0-5)
delayBetweenRequests
number
default:"1000"
Delay between requests in milliseconds (0-10000). Useful for respecting rate limits on target websites.
priority
string
default:"low"
Job processing priority
  • low - Standard processing
  • normal - Higher priority
  • high - Highest priority (Pro+ plans)

Request Customization

userAgent
string
Custom user agent string for all requests
headers
object
Custom HTTP headers to send with each request
{
  "Accept-Language": "en-US",
  "Custom-Header": "value"
}
authentication
object
Authentication credentials for protected pages
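The credential fields match those in the request example above; supply whichever the target site requires (which fields are accepted together is an assumption based on the request schema):
{
  "username": "<string>",
  "password": "<string>",
  "token": "<string>"
}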

AI Extraction

extract
object
AI-powered data extraction configuration (applied to all URLs)
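A typical shape, matching the AI extraction example later on this page:
{
  "schema": {
    "companyName": "string",
    "foundedYear": "number"
  },
  "prompt": "Extract company information from the page"
}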

Response

success
boolean
Indicates if the batch job was initiated successfully
jobId
string
Unique identifier for tracking the batch job
queueJobId
string
Queue system job ID (if using async processing)
status
string
Current job status: queued, running, or completed
urlCount
number
Total number of URLs in the batch
estimatedCredits
number
Estimated total credits that will be consumed
estimatedTime
string
Estimated completion time
statusUrl
string
URL for polling job status and retrieving results
data
object
Batch scraping results (only for synchronous/direct processing)

Examples

Basic Batch Scraping

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Batch scrape multiple URLs
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
]

result = client.batch_scrape(
    urls=urls,
    options={
        "format": "markdown",
        "includeScreenshot": False
    }
)

# Results will be processed asynchronously
print(f"Job ID: {result['jobId']}")
print(f"Status: {result['status']}")
print(f"Total URLs: {len(urls)}")
{
  "success": true,
  "jobId": "batch_abc123xyz",
  "queueJobId": "queue_456def",
  "status": "queued",
  "urlCount": 3,
  "estimatedCredits": 3,
  "estimatedTime": "30-60 seconds",
  "statusUrl": "https://api.whizo.ai/v1/jobs/batch_abc123xyz"
}

Advanced Batch with Concurrency Control

curl -X POST "https://api.whizo.ai/v1/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://site1.com",
      "https://site2.com",
      "https://site3.com"
    ],
    "format": "json",
    "concurrency": 5,
    "retryCount": 3,
    "delayBetweenRequests": 500,
    "includeMetadata": true,
    "onlyMainContent": true,
    "removeAds": true
  }'

Batch Scraping with AI Extraction

curl -X POST "https://api.whizo.ai/v1/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://company1.com/about",
      "https://company2.com/about",
      "https://company3.com/about"
    ],
    "format": "json",
    "extract": {
      "schema": {
        "companyName": "string",
        "foundedYear": "number",
        "employees": "string",
        "revenue": "string",
        "description": "string"
      },
      "prompt": "Extract company information from the about page"
    }
  }'
{
  "success": true,
  "jobId": "batch_extract_789",
  "status": "queued",
  "urlCount": 3,
  "estimatedCredits": 9,
  "message": "Batch job queued. Use the statusUrl to track progress.",
  "statusUrl": "https://api.whizo.ai/v1/jobs/batch_extract_789"
}

Direct Processing Response (Synchronous)

When queue processing is disabled, the batch API returns results immediately, grouped into results, successful, and failed arrays alongside a processing summary:
{
  "success": true,
  "data": {
    "results": [
      {
        "url": "https://example.com/page1",
        "title": "Page 1 Title",
        "description": "Description of page 1",
        "markdown": "# Page 1 Content\n\nThis is the content...",
        "content": "# Page 1 Content\n\nThis is the content...",
        "scrapedAt": "2025-01-15T14:30:00Z",
        "status": "success",
        "wordCount": 450,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 1234
        }
      },
      {
        "url": "https://example.com/page2",
        "title": "Page 2 Title",
        "description": "Description of page 2",
        "markdown": "# Page 2 Content\n\nAnother page...",
        "content": "# Page 2 Content\n\nAnother page...",
        "scrapedAt": "2025-01-15T14:30:02Z",
        "status": "success",
        "wordCount": 320,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 987
        }
      },
      {
        "url": "https://example.com/page3",
        "status": "failed",
        "error": "Request timeout after 30000ms",
        "scrapedAt": "2025-01-15T14:30:35Z",
        "metadata": {
          "statusCode": 0,
          "error": "Request timeout after 30000ms"
        }
      }
    ],
    "successful": [
      {
        "url": "https://example.com/page1",
        "title": "Page 1 Title",
        "description": "Description of page 1",
        "markdown": "# Page 1 Content\n\nThis is the content...",
        "content": "# Page 1 Content\n\nThis is the content...",
        "scrapedAt": "2025-01-15T14:30:00Z",
        "status": "success",
        "wordCount": 450,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 1234
        }
      },
      {
        "url": "https://example.com/page2",
        "title": "Page 2 Title",
        "description": "Description of page 2",
        "markdown": "# Page 2 Content\n\nAnother page...",
        "content": "# Page 2 Content\n\nAnother page...",
        "scrapedAt": "2025-01-15T14:30:02Z",
        "status": "success",
        "wordCount": 320,
        "metadata": {
          "statusCode": 200,
          "contentType": "text/html",
          "loadTime": 987
        }
      }
    ],
    "failed": [
      {
        "url": "https://example.com/page3",
        "status": "failed",
        "error": "Request timeout after 30000ms",
        "scrapedAt": "2025-01-15T14:30:35Z",
        "metadata": {
          "statusCode": 0,
          "error": "Request timeout after 30000ms"
        }
      }
    ],
    "summary": {
      "total": 3,
      "successful": 2,
      "failed": 1,
      "successRate": 67,
      "processingTime": 35420,
      "averageTimePerUrl": 11807,
      "creditsUsed": 2,
      "creditsPerUrl": 1,
      "concurrency": 3
    }
  },
  "warning": "1 of 3 URLs failed to scrape"
}

Error Responses

error
object
Error details returned when success is false, containing code, message, and optional details fields (see the example below)

Common Errors

Status Code   Error Code             Description
400           invalid_urls           One or more URLs are invalid
400           too_many_urls          Exceeded maximum of 100 URLs per batch
400           invalid_options        Invalid batch processing options
401           unauthorized           Invalid or missing API key
402           insufficient_credits   Not enough credits for batch job
429           rate_limited           Rate limit exceeded
500           batch_failed           Batch job failed to initialize
{
  "success": false,
  "error": {
    "code": "too_many_urls",
    "message": "Maximum 100 URLs allowed per batch request",
    "details": {
      "provided": 150,
      "maximum": 100
    }
  }
}
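
A minimal error-handling sketch in Python, keyed off the documented success flag and error.code values (the chunking and backoff responses are illustrative, not prescribed by the API):

import requests

response = requests.post(
    "https://api.whizo.ai/v1/batch",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"urls": ["https://example.com/page1"]}
)
body = response.json()

if not body.get("success"):
    error = body["error"]
    if error["code"] == "too_many_urls":
        # Split the URL list into chunks of error["details"]["maximum"] and resubmit
        print(f"Batch too large: {error['message']}")
    elif error["code"] == "rate_limited":
        print("Rate limited - retry after a backoff")
    else:
        print(f"{error['code']}: {error['message']}")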

Credit Costs

Credit costs are calculated per URL in the batch:
Feature           Cost Per URL
Basic scraping    1 credit
Screenshot        +1 credit
PDF generation    +1 credit
AI extraction     +2-3 credits
Examples:
  • 10 URLs, markdown format: 10 × 1 = 10 credits
  • 20 URLs with screenshots: 20 × (1 + 1) = 40 credits
  • 5 URLs with AI extraction: 5 × (1 + 3) = 20 credits
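
The arithmetic is simple enough to script; a hypothetical helper that reproduces the examples above (the AI figure uses the 3-credit upper bound from the table):

def estimate_credits(url_count, screenshot=False, pdf=False, ai_extract=False):
    """Estimate batch cost from the per-URL pricing table."""
    per_url = 1                          # basic scraping
    per_url += 1 if screenshot else 0    # +1 credit per screenshot
    per_url += 1 if pdf else 0           # +1 credit per PDF
    per_url += 3 if ai_extract else 0    # upper bound; actual is 2-3
    return url_count * per_url

assert estimate_credits(10) == 10
assert estimate_credits(20, screenshot=True) == 40
assert estimate_credits(5, ai_extract=True) == 20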

Rate Limits

Rate limits vary by plan:
  • Free: Max 10 URLs per batch, 2 concurrent batches
  • Starter: Max 50 URLs per batch, 5 concurrent batches
  • Pro: Max 100 URLs per batch, 10 concurrent batches
  • Enterprise: Custom limits

Best Practices

  1. Group similar URLs together in a batch for consistent processing
  2. Use appropriate concurrency - higher isn’t always better
  3. Set reasonable timeouts based on target site performance
  4. Enable retries for unreliable sites (retryCount: 2-3)
  5. Add delays when scraping the same domain to respect rate limits
  6. Monitor job progress using the statusUrl or /v1/jobs/:id endpoint
  7. Handle partial failures - some URLs may succeed while others fail (see the sketch after this list)
  8. Use authentication when scraping protected pages
  9. Start small - test with 5-10 URLs before scaling up
  10. Cache results - reuse batch results when possible
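
A sketch of partial-failure handling, continuing the Python SDK example above and assuming batch_scrape returns the synchronous response documented earlier as a parsed dict:

result = client.batch_scrape(urls=urls, options={"format": "markdown"})

# Synchronous responses group results into successful and failed lists
data = result.get("data", {})
for page in data.get("successful", []):
    print(f"OK   {page['url']} ({page['wordCount']} words)")
for page in data.get("failed", []):
    print(f"FAIL {page['url']}: {page['error']}")

summary = data.get("summary", {})
print(f"{summary.get('successful')}/{summary.get('total')} succeeded")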

Monitoring Batch Jobs

After submitting a batch, use the Jobs API to monitor progress:
# Check job status
curl -X GET "https://api.whizo.ai/v1/jobs/{jobId}" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get job results
curl -X GET "https://api.whizo.ai/v1/jobs/{jobId}/results" \
  -H "Authorization: Bearer YOUR_API_KEY"

Performance Tips

Optimal Concurrency Settings

Batch Size     Recommended Concurrency
1-10 URLs      3-5
11-25 URLs     5-7
26-50 URLs     7-10
51-100 URLs    10 (max)

Processing Time Estimates

  • Lightweight scraping: ~1-2 seconds per URL
  • JavaScript rendering: ~3-5 seconds per URL
  • With screenshots: ~4-6 seconds per URL
  • With AI extraction: ~5-10 seconds per URL
Example: Batch of 50 URLs with markdown format and concurrency=5
  • Sequential time: 50 × 2s = 100 seconds
  • Parallel time: (50 / 5) × 2s = 20 seconds
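
The same back-of-envelope math as a hypothetical helper (real timings vary with target sites and options):

import math

def estimate_seconds(url_count, per_url_seconds, concurrency):
    """Rough wall-clock estimate: URLs are processed in waves of size concurrency."""
    return math.ceil(url_count / concurrency) * per_url_seconds

print(estimate_seconds(50, 2, 1))  # sequential: 100 seconds
print(estimate_seconds(50, 2, 5))  # concurrency 5: 20 seconds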