Scrape API

POST https://api.whizo.ai/v1/scrape
curl --request POST \
  --url https://api.whizo.ai/v1/scrape \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "urls": [
    "<string>"
  ],
  "url": "<string>",
  "formats": [
    "<string>"
  ],
  "format": "<string>",
  "onlyMainContent": true,
  "includeMetadata": true,
  "includeLinks": true,
  "includeImages": true,
  "removeAds": true,
  "removeScripts": true,
  "removeStyles": true,
  "includeTags": [
    "<string>"
  ],
  "excludeTags": [
    "<string>"
  ],
  "includeScreenshot": true,
  "screenshotType": "<string>",
  "includePdf": true,
  "parsePdf": true,
  "engine": "<string>",
  "javascript": true,
  "stealth": true,
  "mobile": true,
  "viewport": {
    "width": 123,
    "height": 123
  },
  "waitFor": 123,
  "waitTime": 123,
  "timeout": 123,
  "waitForSelector": "<string>",
  "actions": [
    {
      "type": "<string>",
      "selector": "<string>",
      "text": "<string>",
      "value": "<string>",
      "milliseconds": 123,
      "key": "<string>",
      "direction": "<string>",
      "pixels": 123,
      "description": "<string>"
    }
  ],
  "authentication": {
    "type": "<string>",
    "username": "<string>",
    "password": "<string>",
    "token": "<string>"
  },
  "extract": {
    "schema": {},
    "systemPrompt": "<string>",
    "prompt": "<string>"
  },
  "agent": {
    "model": "<string>",
    "prompt": "<string>"
  },
  "headers": {},
  "userAgent": "<string>",
  "location": {
    "country": "<string>",
    "languages": [
      "<string>"
    ]
  },
  "useProxy": true,
  "llmOptimization": true,
  "skipTlsVerification": true,
  "executeJS": "<string>",
  "replaceAllPathsWithAbsolutePaths": true,
  "priority": "<string>",
  "maxAge": 123,
  "htmlFormat": "<string>",
  "stripHtml": true,
  "removeTags": [
    "<string>"
  ],
  "onlyIncludeTags": [
    "<string>"
  ],
  "enableJsonExtraction": true,
  "jsonSchema": "<string>"
}
'
The Scrape API allows you to extract content from any webpage and convert it to your preferred format. It supports multiple scraping engines, JavaScript rendering, browser automation actions, AI extraction, and various output formats.

🧠 Smart Auto-Detection

WhizoAI automatically detects the content type from the URL and uses the appropriate extractor:
  • 🎥 YouTube Videos → Automatically extracts transcripts with timestamps
  • 📄 Google Docs/Sheets/Slides → Automatically parses document content
  • 📊 Excel/CSV Files → Automatically parses spreadsheet data to JSON
  • 📂 PDF/JSON Files → Automatically extracts structured content
  • 🌐 Regular Webpages → Smart browser-based scraping
No manual selection needed - just pass any URL to /v1/scrape and we handle the rest!
// YouTube transcript - automatically detected
await scrape({ url: "https://youtube.com/watch?v=dQw4w9WgXcQ" });

// Google Sheets - automatically parsed
await scrape({ url: "https://docs.google.com/spreadsheets/d/abc123" });

// Excel file - automatically converted to JSON
await scrape({ url: "https://example.com/data.xlsx" });

// Regular webpage - smart browser scraping
await scrape({ url: "https://example.com" });

📦 Multi-URL Scraping

New in v2.2.1: Native multi-URL support directly in the scrape endpoint! Process multiple URLs in a single request with automatic parallelization and progress tracking. Replaces the deprecated /v1/batch endpoint.
Scrape multiple URLs simultaneously by passing an array of URLs. The API handles parallel processing, rate limiting, and progress tracking automatically.

Key Features

  • Parallel Processing: Multiple URLs scraped simultaneously
  • Progress Tracking: Real-time SSE updates via /v1/jobs/:id/stream
  • Automatic Retry: Failed URLs automatically retried
  • Rate Limiting: Respects your plan’s concurrency limits
  • Cost Efficient: Same 1 credit per URL pricing

Multi-URL Request

urls
string[]
Array of URLs to scrape (1-100 URLs per request). When provided, the url field is ignored.
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "options": {
    "format": "markdown"
  }
}
import { WhizoAI } from 'whizoai';

const client = new WhizoAI({ apiKey: 'YOUR_API_KEY' });

// Multi-URL scraping
const job = await client.scrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  options: {
    format: 'markdown',
    includeScreenshot: true
  }
});

// Monitor progress
await client.jobs.waitFor(job.id, {
  onProgress: (progress) => {
    console.log(`Progress: ${progress.percentage}% - ${progress.pagesCompleted}/${progress.totalPages}`);
  }
});

// Get results
const results = await client.jobs.getResults(job.id);

Multi-URL Response

When using multiple URLs, the API returns a job ID immediately and processes URLs in the background:
{
  "success": true,
  "data": {
    "jobId": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "totalUrls": 3,
    "estimatedCredits": 3,
    "streamUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000/stream"
  }
}
Monitor progress via SSE stream or polling the job endpoint. See Job Management for details.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY

Request Body

url
string
required
The URL of the webpage to scrape. Must be a valid HTTP/HTTPS URL. Ignored when the urls array is provided.

Output Formats

formats
array
Array of output formats to generate (supports multiple formats in one request)
  • markdown - Clean markdown format (default)
  • html - Cleaned HTML content
  • rawHtml - Raw HTML without processing
  • text - Plain text only
  • json - Structured JSON format
  • structured - AI-enhanced structured data
  • links - Extract all links
  • screenshot - Capture screenshot
  • screenshot@fullPage - Full page screenshot
  • pdf - Generate PDF
  • extract - AI-powered extraction
format
string
default:"markdown"
Legacy single format parameter (use formats array for multiple outputs)
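For example, a minimal TypeScript sketch requesting several formats in one call (fetch on Node 18+; YOUR_API_KEY and the URL are placeholders):
// Request markdown, extracted links, and a screenshot together.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com',
    formats: ['markdown', 'links', 'screenshot'] // screenshot adds +1 credit
  })
});
const { success, data } = await res.json();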

Content Processing Options

onlyMainContent
boolean
default:"false"
Extract only main content, removing headers, footers, and sidebars
includeMetadata
boolean
default:"true"
Include page metadata (title, description, author, etc.)
includeLinks
boolean
Include hyperlinks in the extracted content
includeImages
boolean
default:"true"
Include image URLs in the extracted content
removeAds
boolean
default:"true"
Automatically remove advertisements and tracking scripts
removeScripts
boolean
default:"true"
Remove JavaScript from HTML output
removeStyles
boolean
default:"false"
Remove CSS styling from HTML output
includeTags
array
Array of HTML tags to include (whitelist)
excludeTags
array
Array of HTML tags to exclude (blacklist)
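A sketch combining these options to isolate article content (placeholder URL and key; the tag list is illustrative):
// Keep only the main content and drop common boilerplate elements.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/article',
    onlyMainContent: true,
    excludeTags: ['nav', 'footer', 'aside'] // blacklist layout chrome
  })
});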

Screenshot & PDF Options

includeScreenshot
boolean
default:"false"
Capture a screenshot of the page (+1 credit)
screenshotType
string
default:"viewport"
Screenshot type:
  • viewport - Visible viewport only
  • fullpage - Entire scrollable page
includePdf
boolean
default:"false"
Generate a PDF of the page (+1 credit)
parsePdf
boolean
default:"false"
Parse PDF content if URL points to a PDF file
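A sketch capturing a full-page screenshot and a PDF in the same request (each adds +1 credit; placeholder values):
// Screenshot the entire scrollable page and also render a PDF.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com',
    includeScreenshot: true,
    screenshotType: 'fullpage',
    includePdf: true
  })
});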

Engine & Browser Options

engine
string
Scraping engine to use:
  • lightweight - Fast HTTP-based scraping (default)
  • playwright - Full browser with JavaScript support
  • puppeteer - Alternative browser engine
  • stealth - Anti-detection browser mode (+4 credits)
javascript
boolean
default:"false"
Enable JavaScript rendering for dynamic content
stealth
boolean
default:"false"
Enable stealth mode with automatic escalation on bot detection. Automatically upgrades from basic → advanced → maximum stealth if bot detection is encountered (+5 credits)
mobile
boolean
default:"false"
Use mobile viewport for rendering
viewport
object
Custom viewport dimensions
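A sketch rendering a JavaScript-heavy page in a full browser with a custom viewport (dimension values are illustrative):
// Use the playwright engine with JS rendering and a desktop-sized viewport.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/app',
    engine: 'playwright',
    javascript: true,
    viewport: { width: 1280, height: 800 }
  })
});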

Timing Options

waitFor
number
default:"0"
Time to wait after page load in milliseconds (0-30000)
waitTime
number
default:"0"
Legacy: Same as waitFor, in seconds (0-30)
timeout
number
default:"30000"
Maximum time to wait for page load in milliseconds (1000-120000)
waitForSelector
string
CSS selector to wait for before considering page loaded
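A sketch tuning these timing options for a slow, dynamic page (the selector and durations are illustrative):
// Wait for the results container, give scripts 3s to settle, cap the load at 60s.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/dashboard',
    javascript: true,
    waitForSelector: '#results',
    waitFor: 3000,
    timeout: 60000
  })
});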

Browser Automation Actions

actions
array
Array of browser automation actions to perform (max 20 actions)

Authentication

authentication
object
Authentication credentials for protected pages
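The object accepts type, username, password, and token fields. A hedged sketch for a page behind HTTP basic auth (the "basic" type value is an assumption inferred from the username/password fields; credentials are placeholders):
// Scrape a protected page with basic-auth credentials.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/private',
    authentication: {
      type: 'basic', // assumption: exact type values are not listed in this section
      username: '<username>',
      password: '<password>'
    }
  })
});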

AI Extraction

extract
object
AI-powered data extraction configuration
agent
object
Whizo Agent configuration for advanced AI extraction
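The extract block is demonstrated in the AI-Powered Extraction example below. For the agent block, a minimal sketch using its documented model and prompt fields (the model value is a placeholder; supported identifiers are not listed on this page):
// Drive extraction with the Whizo Agent.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/product',
    agent: {
      model: '<model-id>', // placeholder, not a documented value
      prompt: 'List the product name, price, and availability'
    }
  })
});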

Request Customization

headers
object
Custom HTTP headers to send with the request
{
  "User-Agent": "CustomBot/1.0",
  "Accept-Language": "en-US"
}
userAgent
string
Custom user agent string
location
object
Geolocation and language settings
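A sketch combining custom headers, a user agent, and geolocation hints (the country and language value formats are assumptions; only the field names come from the request body schema):
// Customize the outgoing request and suggest a locale for the scrape.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com',
    headers: { 'Accept-Language': 'en-US' },
    userAgent: 'CustomBot/1.0',
    location: { country: 'US', languages: ['en'] } // value formats assumed
  })
});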

Advanced Options

useProxy
boolean
default:"false"
Enable premium proxy rotation from a pool of 10+ rotating proxies for better reliability and IP diversity (+2 credits)
llmOptimization
boolean
default:"false"
Optimize content format for LLM consumption
skipTlsVerification
boolean
default:"false"
Skip TLS/SSL certificate verification (use with caution)
executeJS
string
Custom JavaScript code to execute on the page
replaceAllPathsWithAbsolutePaths
boolean
default:"false"
Convert all relative URLs to absolute URLs
priority
string
default:"medium"
Job priority: low, medium, or high
maxAge
number
default:"30"
Cache max age in days (0-365). Set to 0 to bypass cache.
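A sketch for a frequently re-scraped page: proxy rotation for reliability plus a short cache window (values are illustrative):
// Rotate proxies (+2 credits) and reuse cached results for up to 7 days.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/pricing',
    useProxy: true,
    maxAge: 7,
    priority: 'high'
  })
});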

HTML Processing

htmlFormat
string
default:"cleaned"
HTML output format:
  • cleaned - Cleaned and formatted HTML
  • raw - Raw HTML without processing
stripHtml
boolean
default:"false"
Remove all HTML tags, leaving only text
removeTags
array
HTML tags to remove from output
onlyIncludeTags
array
Only include these HTML tags (whitelist)
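A sketch trimming HTML output down to the tags you care about (the tag list is illustrative):
// Return cleaned HTML restricted to article-level markup.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/article',
    format: 'html',
    htmlFormat: 'cleaned',
    onlyIncludeTags: ['article', 'h1', 'h2', 'p']
  })
});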

JSON Extraction

enableJsonExtraction
boolean
default:"false"
Enable automatic JSON extraction from page
jsonSchema
string
JSON schema for structured extraction
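Since jsonSchema is typed as a string, pass a stringified schema. A sketch, assuming a standard JSON Schema dialect (the docs only say "JSON schema"):
// Enable JSON extraction against a simple stringified schema.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/product',
    enableJsonExtraction: true,
    jsonSchema: JSON.stringify({
      type: 'object',
      properties: { title: { type: 'string' }, price: { type: 'number' } }
    })
  })
});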

Response

success
boolean
Indicates if the request was successful
data
object
The scraped content and metadata. The exact shape depends on the requested formats and options (see the examples below).

Examples

Basic Scraping

curl -X POST https://api.whizo.ai/v1/scrape \
  -H "Authorization: Bearer whizo_YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "format": "markdown",
      "includeScreenshot": false
    }
  }'
Response
{
  "success": true,
  "data": {
    "content": "# Example Page\n\nThis is the scraped content in markdown format...",
    "metadata": {
      "title": "Example Domain",
      "description": "Example page description",
      "url": "https://example.com",
      "statusCode": 200,
      "contentType": "text/html",
      "extractedAt": "2025-01-15T10:30:00Z",
      "processingTime": 1234,
      "creditsUsed": 1
    }
  }
}

Advanced: JavaScript Rendering & Screenshots

curl -X POST https://api.whizo.ai/v1/scrape \
  -H "Authorization: Bearer whizo_YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "format": "markdown",
      "includeScreenshot": true,
      "includePdf": true,
      "javascript": true,
      "waitTime": 5,
      "mobile": false,
      "headers": {
        "User-Agent": "CustomBot/1.0"
      }
    }
  }'

Browser Automation with Actions

curl -X POST "https://api.whizo.ai/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type": application/json" \
  -d '{
    "url": "https://example.com/search",
    "actions": [
      {
        "type": "input",
        "selector": "#search-input",
        "text": "web scraping"
      },
      {
        "type": "click",
        "selector": "#search-button"
      },
      {
        "type": "wait",
        "milliseconds": 2000
      },
      {
        "type": "screenshot",
        "description": "Search results page"
      }
    ]
  }'

AI-Powered Extraction

curl -X POST "https://api.whizo.ai/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "name": "string",
        "price": "number",
        "description": "string",
        "inStock": "boolean",
        "rating": "number"
      },
      "prompt": "Extract product information from this page"
    }
  }'
{
  "success": true,
  "data": {
    "extracted": {
      "name": "Example Product",
      "price": 29.99,
      "description": "High-quality product description...",
      "inStock": true,
      "rating": 4.5
    },
    "metadata": {
      "creditsUsed": 3
    }
  }
}

Comprehensive Metadata Extraction

WhizoAI extracts 61 comprehensive metadata fields from every page, providing industry-leading metadata coverage that exceeds competitors like Firecrawl.
curl -X POST "https://api.whizo.ai/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "format": "markdown",
    "engine": "stealth"
  }'
{
  "success": true,
  "data": {
    "content": "# Article Title\n\nArticle content in markdown...",
    "metadata": {
      // Engine Information (4 fields)
      "engine": "stealth",
      "parserName": "stealth",
      "contentSource": "browser",
      "processingTime": 8276,

      // Basic Metadata (10 fields)
      "title": "Example Article - Latest News",
      "description": "A comprehensive article about web scraping and metadata extraction",
      "language": "en",
      "keywords": "web scraping, metadata, SEO, content extraction",
      "robots": "index, follow",
      "canonical": "https://example.com/article",
      "author": "John Doe",
      "publisher": "Example Publishing",
      "favicon": "https://example.com/favicon.ico",

      // Open Graph (9 fields)
      "og": {
        "title": "Example Article - Latest News",
        "description": "A comprehensive article about web scraping",
        "url": "https://example.com/article",
        "image": "https://example.com/og-image.jpg",
        "type": "article",
        "siteName": "Example News",
        "locale": "en_US",
        "audio": null,
        "video": null
      },

      // Twitter Cards (7 fields)
      "twitter": {
        "card": "summary_large_image",
        "site": "@examplenews",
        "creator": "@johndoe",
        "title": "Example Article - Latest News",
        "description": "A comprehensive article about web scraping",
        "image": "https://example.com/twitter-image.jpg",
        "imageAlt": "Article cover image"
      },

      // Article Metadata (7 fields)
      "article": {
        "publishedTime": "2025-01-15T10:30:00Z",
        "modifiedTime": "2025-01-16T14:20:00Z",
        "expirationTime": null,
        "author": "John Doe",
        "section": "Technology",
        "tags": ["web scraping", "metadata", "SEO"]
      },

      // Structured Data (14 types)
      "structuredData": {
        "jsonLd": [
          {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": "Example Article",
            "author": {
              "@type": "Person",
              "name": "John Doe"
            }
          }
        ],
        "microdata": [],
        "rdfa": [],
        "openGraph": {},
        "schemaTypes": ["Article", "Person", "Organization"]
      },

      // Content Quality (29 metrics)
      "contentQuality": {
        "wordCount": 1250,
        "sentenceCount": 45,
        "characterCount": 6842,
        "readabilityScore": 65.5,
        "fleschKincaidGrade": 8.2,
        "averageSentenceLength": 152.04,
        "averageWordsPerSentence": 27.78,
        "contentDensity": 0.87,
        "estimatedReadingTimeMinutes": 6,
        "isContentRich": true,
        "headingCount": 8,
        "paragraphCount": 22,
        "linkCount": 15,
        "imageCount": 4,
        "listCount": 3,
        "tableCount": 1,
        "quoteCount": 2,
        "codeBlockCount": 0,
        "contentType": "article",
        "readingLevel": "medium",
        "lexicalDiversity": 0.68,
        "headingDistribution": {
          "h1": 1,
          "h2": 4,
          "h3": 3,
          "h4": 0,
          "h5": 0,
          "h6": 0
        }
      },

      // Counts (4 fields)
      "wordCount": 1250,
      "characterCount": 6842,
      "linksCount": 15,
      "imagesCount": 4,

      // Request Information
      "url": "https://example.com/article",
      "statusCode": 200,
      "creditsUsed": 3
    }
  }
}
Industry-Leading Metadata Coverage: WhizoAI extracts 61 comprehensive fields including:
  • Engine information (4 fields)
  • Basic metadata (10 fields)
  • Open Graph protocol (9 fields)
  • Twitter Cards (7 fields)
  • Article metadata (7 fields)
  • Structured data (14 types)
  • Content quality metrics (29 metrics)
  • Counts and statistics (4 fields)
This comprehensive extraction surpasses competitors like Firecrawl, providing superior SEO analysis, content quality assessment, and social media optimization data.

Error Responses

error
object
Error details returned when success is false, containing a code, a message, and optional details.

Common Errors

| Status Code | Error Code | Description |
| --- | --- | --- |
| 400 | invalid_url | The provided URL is invalid or malformed |
| 400 | invalid_options | One or more options are invalid |
| 400 | invalid_actions | Browser automation actions are invalid |
| 401 | unauthorized | Invalid or missing API key |
| 402 | insufficient_credits | Not enough credits to complete the request |
| 429 | rate_limited | Rate limit exceeded |
| 500 | scraping_failed | Failed to scrape the webpage |
| 500 | timeout | Request timed out |
| 500 | extraction_failed | AI extraction failed |
{
  "success": false,
  "error": {
    "code": "invalid_url",
    "message": "The provided URL is not valid",
    "details": {
      "url": "invalid-url"
    }
  }
}

Credit Costs

| Feature | Cost |
| --- | --- |
| Basic scraping | 1 credit |
| JavaScript rendering | +1 credit |
| Screenshot (viewport) | +1 credit |
| Screenshot (fullpage) | +1 credit |
| PDF generation | +1 credit |
| Structured format | +1 credit |
| AI extraction | +2-3 credits |
| Stealth mode with auto-escalation | +5 credits |
| Premium proxy rotation | +2 credits |
Note: Costs are cumulative. For example, scraping with JavaScript + screenshot + PDF = 1 + 1 + 1 + 1 = 4 credits total.

Rate Limits

Rate limits vary by plan:
  • Free: 10 requests per hour, 100 per day
  • Starter: 50 requests per hour, 500 per day
  • Pro: 200 requests per hour, 2000 per day
  • Enterprise: Custom limits

Best Practices

  1. Use lightweight engine for static pages to save credits
  2. Enable javascript only when page uses dynamic content
  3. Set appropriate waitFor times for JavaScript-heavy pages
  4. Use actions for interactive pages requiring user input
  5. Cache results using maxAge for frequently accessed pages
  6. Use multi-URL scraping (the urls array) to process multiple URLs efficiently; the standalone /v1/batch endpoint is deprecated
  7. Implement retry logic for failed requests (see the sketch after this list)
  8. Monitor credit usage to avoid running out during critical operations
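A minimal retry sketch for practice 7, assuming fetch on Node 18+ and retrying only rate limits and transient server errors (YOUR_API_KEY is a placeholder):
// Retry /v1/scrape with exponential backoff on 429 and 5xx responses.
async function scrapeWithRetry(url: string, attempts = 3): Promise<unknown> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch('https://api.whizo.ai/v1/scrape', {
      method: 'POST',
      headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
      body: JSON.stringify({ url })
    });
    if (res.ok) return res.json();
    if (res.status === 429 || res.status >= 500) {
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 1000)); // 1s, 2s, 4s
      continue;
    }
    throw new Error(`Scrape failed with status ${res.status}`); // other 4xx: do not retry
  }
  throw new Error('Scrape failed after retries');
}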