Scrape API

The Scrape API allows you to extract clean, structured content from any webpage in multiple formats including markdown, HTML, and JSON. It’s perfect for content extraction, data mining, and web automation tasks.

Base URL

https://api.whizo.ai/v1

Authentication

All requests require authentication using your API key in the Authorization header:
Authorization: Bearer YOUR_API_KEY
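
For example, attaching the key to a request with fetch (YOUR_API_KEY is a placeholder for your real key):

// Every call carries the same Authorization header.
const response = await fetch("https://api.whizo.ai/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ url: "https://example.com" }),
});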

Single Page Scraping

POST /v1/scrape

Extract content from a single webpage with customizable options.

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL to scrape |
| options | object | No | Scraping configuration options |

Options Object

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| format | string | markdown | Output format: markdown, html, text, json, structured |
| engine | string | lightweight | Scraping engine: lightweight, playwright, puppeteer |
| includeScreenshot | boolean | false | Capture a screenshot of the page |
| includePdf | boolean | false | Generate a PDF of the page |
| mobile | boolean | false | Use mobile user agent |
| waitTime | number | 0 | Time to wait before scraping (0-30 seconds) |
| javascript | boolean | false | Enable JavaScript rendering |
| cookies | object | {} | Custom cookies to send with request |
| headers | object | {} | Custom headers to send with request |
| timeout | number | 30 | Request timeout in seconds (5-120) |
| useCache | boolean | false | Use cached results if available |
| cacheTtl | number | 300 | Cache time-to-live in seconds |
| webhook | string | - | Webhook URL for completion notification |
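
As an illustration, an options object that renders JavaScript with the playwright engine, sends a custom header, and caches the result for ten minutes (the values are illustrative, drawn from the table above):

{
  "format": "html",
  "engine": "playwright",
  "javascript": true,
  "headers": { "Accept-Language": "en-US" },
  "timeout": 60,
  "useCache": true,
  "cacheTtl": 600
}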

Response

{
  "success": true,
  "data": {
    "content": "# Page Title\n\nPage content in markdown format...",
    "metadata": {
      "title": "Example Page Title",
      "description": "Page meta description",
      "url": "https://example.com",
      "statusCode": 200,
      "contentType": "text/html",
      "extractedAt": "2025-01-15T10:30:00Z",
      "processingTime": 1250,
      "creditsUsed": 1
    },
    "screenshots": ["https://storage.whizo.ai/screenshots/abc123.png"],
    "pdf": "https://storage.whizo.ai/pdfs/abc123.pdf",
    "files": []
  }
}

Code Examples

const response = await fetch("https://api.whizo.ai/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    options: {
      format: "markdown",      // output format (see Options table)
      includeScreenshot: true, // capture a screenshot (+1 credit)
      javascript: true,        // enable JavaScript rendering (+1 credit)
      waitTime: 5,             // wait 5 seconds before scraping
    },
  }),
});

const data = await response.json();
console.log(data.data.content);

Batch Scraping

POST /v1/scrape/batch

Scrape multiple URLs simultaneously for efficient bulk operations.

Request Body

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| urls | array | Yes | Array of URLs to scrape (max 100) |
| options | object | No | Global scraping options |
| webhook | string | No | Webhook URL for batch completion |
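
A minimal batch request, following the same pattern as the single-page example (the URLs and webhook address are placeholders):

const response = await fetch("https://api.whizo.ai/v1/scrape/batch", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    urls: ["https://example.com/a", "https://example.com/b"],
    options: { format: "markdown" }, // applied to every URL in the batch
    webhook: "https://yourapp.example/hooks/whizo", // optional completion callback
  }),
});

const { data } = await response.json();
console.log(data.jobId, data.status); // e.g. "batch_abc123", "processing"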

Response

{
  "success": true,
  "data": {
    "jobId": "batch_abc123",
    "status": "processing",
    "totalUrls": 10,
    "estimatedCompletionTime": "2025-01-15T10:35:00Z",
    "creditsEstimate": 10
  }
}

Error Handling

HTTP Status Codes

| Code | Description |
| --- | --- |
| 200 | Success |
| 400 | Bad Request - Invalid parameters |
| 401 | Unauthorized - Invalid API key |
| 402 | Payment Required - Insufficient credits |
| 429 | Too Many Requests - Rate limit exceeded |
| 500 | Internal Server Error |

Error Response Format

{
  "success": false,
  "error": {
    "code": "INVALID_URL",
    "message": "The provided URL is not valid",
    "details": {
      "url": "invalid-url",
      "reason": "malformed_url"
    }
  }
}
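
Because every error shares this envelope, a client can surface the code and message directly. A sketch of a small wrapper that does so (the wrapper itself is an assumption, not part of any SDK):

async function scrape(url, options = {}) {
  const response = await fetch("https://api.whizo.ai/v1/scrape", {
    method: "POST",
    headers: {
      Authorization: "Bearer YOUR_API_KEY",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ url, options }),
  });
  const body = await response.json();
  if (!body.success) {
    // body.error follows the documented error response format
    throw new Error(`Scrape failed (${response.status} ${body.error.code}): ${body.error.message}`);
  }
  return body.data;
}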

Rate Limits

Rate limits vary by plan:
| Plan | Requests/Hour | Requests/Day | Concurrent Jobs |
| --- | --- | --- | --- |
| Free | 10 | 100 | 1 |
| Starter | 50 | 500 | 3 |
| Pro | 200 | 2,000 | 10 |
| Enterprise | 1,000 | 10,000 | 50 |
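
When a request returns 429, back off before retrying. A simple exponential-backoff sketch (the retry schedule is an arbitrary choice, not something the API prescribes):

async function fetchWithBackoff(url, init, maxRetries = 4) {
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const response = await fetch(url, init);
    if (response.status !== 429) return response;
    // Wait 1s, 2s, 4s, 8s... before the next attempt.
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
  }
  throw new Error("Rate limit: retries exhausted");
}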

Credit Costs

| Feature | Credits |
| --- | --- |
| Basic scraping | 1 credit per page |
| JavaScript rendering | +1 credit |
| Screenshot capture | +1 credit |
| PDF generation | +1 credit |
| AI extraction | +2 credits |
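
Costs are additive per page: for example, the single-page request shown earlier (basic scrape plus JavaScript rendering plus a screenshot) costs 1 + 1 + 1 = 3 credits.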

Use Cases

Content Aggregation

Perfect for news sites, blogs, and content platforms that need to aggregate content from multiple sources.

Market Research

Extract product information, pricing data, and competitor analysis from e-commerce sites.

SEO Analysis

Scrape meta tags, headings, and content structure for SEO optimization and analysis.

Lead Generation

Extract contact information and business data from directories and websites.

Best Practices

  1. Respect robots.txt - Check and honor each site's robots.txt file before scraping it
  2. Use appropriate delays - Set reasonable wait times between requests
  3. Handle errors gracefully - Implement proper error handling and retry logic (see the backoff sketch under Rate Limits)
  4. Cache when possible - Use caching to reduce API calls and costs (see the sketch after this list)
  5. Monitor rate limits - Track your usage to avoid hitting rate limits
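
A caching sketch for item 4, using the useCache and cacheTtl options from the table above (the one-hour TTL is an illustrative choice):

const response = await fetch("https://api.whizo.ai/v1/scrape", {
  method: "POST",
  headers: {
    Authorization: "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com",
    options: {
      useCache: true, // reuse a cached result if one is available
      cacheTtl: 3600, // keep cached results for one hour
    },
  }),
});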

Webhooks

Configure webhooks to receive notifications when scraping jobs complete:
{
  "event": "scrape.completed",
  "jobId": "job_abc123",
  "url": "https://example.com",
  "status": "completed",
  "creditsUsed": 3,
  "completedAt": "2025-01-15T10:35:00Z",
  "results": {
    "content": "...",
    "metadata": {...}
  }
}
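
A minimal receiver sketch using Node with Express (Express, the port, and the /hooks/whizo path are assumptions; any HTTPS endpoint that accepts JSON will work):

const express = require("express");
const app = express();
app.use(express.json());

// Register this URL as the `webhook` option when creating jobs.
app.post("/hooks/whizo", (req, res) => {
  const { event, jobId, status, creditsUsed } = req.body;
  if (event === "scrape.completed" && status === "completed") {
    console.log(`Job ${jobId} finished, used ${creditsUsed} credits`);
    // req.body.results.content holds the extracted content
  }
  res.sendStatus(200); // acknowledge receipt
});

app.listen(3000);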