Scrape API

POST https://api.whizo.ai/v1/scrape
curl --request POST \
  --url https://api.whizo.ai/v1/scrape \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "urls": [
    "<string>"
  ],
  "url": "<string>",
  "formats": [
    "<string>"
  ],
  "format": "<string>",
  "onlyMainContent": true,
  "includeMetadata": true,
  "includeLinks": true,
  "includeImages": true,
  "removeAds": true,
  "removeScripts": true,
  "removeStyles": true,
  "includeTags": [
    "<string>"
  ],
  "excludeTags": [
    "<string>"
  ],
  "includeScreenshot": true,
  "screenshotType": "<string>",
  "includePdf": true,
  "parsePdf": true,
  "engine": "<string>",
  "javascript": true,
  "stealth": true,
  "mobile": true,
  "viewport": {
    "width": 123,
    "height": 123
  },
  "waitFor": 123,
  "waitTime": 123,
  "timeout": 123,
  "waitForSelector": "<string>",
  "actions": [
    {
      "type": "<string>",
      "selector": "<string>",
      "text": "<string>",
      "value": "<string>",
      "milliseconds": 123,
      "key": "<string>",
      "direction": "<string>",
      "pixels": 123,
      "description": "<string>"
    }
  ],
  "authentication": {
    "type": "<string>",
    "username": "<string>",
    "password": "<string>",
    "token": "<string>"
  },
  "extract": {
    "schema": {},
    "systemPrompt": "<string>",
    "prompt": "<string>"
  },
  "agent": {
    "model": "<string>",
    "prompt": "<string>"
  },
  "headers": {},
  "userAgent": "<string>",
  "location": {
    "country": "<string>",
    "languages": [
      "<string>"
    ]
  },
  "useProxy": true,
  "llmOptimization": true,
  "skipTlsVerification": true,
  "executeJS": "<string>",
  "replaceAllPathsWithAbsolutePaths": true,
  "priority": "<string>",
  "maxAge": 123,
  "htmlFormat": "<string>",
  "stripHtml": true,
  "removeTags": [
    "<string>"
  ],
  "onlyIncludeTags": [
    "<string>"
  ],
  "enableJsonExtraction": true,
  "jsonSchema": "<string>"
}
'
The Scrape API allows you to extract content from any webpage and convert it to your preferred format. It supports multiple scraping engines, JavaScript rendering, browser automation actions, AI extraction, and various output formats.

🧠 Smart Auto-Detection

WhizoAI automatically detects the content type from the URL and uses the appropriate extractor:
  • 🎥 YouTube Videos → Automatically extracts transcripts with timestamps
  • 📄 Google Docs/Sheets/Slides → Automatically parses document content
  • 📊 Excel/CSV Files → Automatically parses spreadsheet data to JSON
  • 📂 PDF/JSON Files → Automatically extracts structured content
  • 🌐 Regular Webpages → Smart browser-based scraping
No manual selection needed - just pass any URL to /v1/scrape and we handle the rest!
// YouTube transcript - automatically detected
await scrape({ url: "https://youtube.com/watch?v=dQw4w9WgXcQ" });

// Google Sheets - automatically parsed
await scrape({ url: "https://docs.google.com/spreadsheets/d/abc123" });

// Excel file - automatically converted to JSON
await scrape({ url: "https://example.com/data.xlsx" });

// Regular webpage - smart browser scraping
await scrape({ url: "https://example.com" });

📦 Multi-URL Scraping

New in v2.2.1: Native multi-URL support directly in the scrape endpoint! Process multiple URLs in a single request with automatic parallelization and progress tracking. Replaces the deprecated /v1/batch endpoint.
Scrape multiple URLs simultaneously by passing an array of URLs. The API handles parallel processing, rate limiting, and progress tracking automatically.

Key Features

  • Parallel Processing: Multiple URLs scraped simultaneously
  • Progress Tracking: Real-time SSE updates via /v1/jobs/:id/stream
  • Automatic Retry: Failed URLs automatically retried
  • Rate Limiting: Respects your plan’s concurrency limits
  • Cost Efficient: Same 1 credit per URL pricing

Multi-URL Request

urls
string[]
Array of URLs to scrape (1-100 URLs per request). When provided, the url field is ignored.
{
  "urls": [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3"
  ],
  "options": {
    "format": "markdown"
  }
}
import { WhizoAI } from 'whizoai';

const client = new WhizoAI({ apiKey: 'YOUR_API_KEY' });

// Multi-URL scraping
const job = await client.scrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  options: {
    format: 'markdown',
    includeScreenshot: true
  }
});

// Monitor progress
await client.jobs.waitFor(job.id, {
  onProgress: (progress) => {
    console.log(`Progress: ${progress.percentage}% - ${progress.pagesCompleted}/${progress.totalPages}`);
  }
});

// Get results
const results = await client.jobs.getResults(job.id);

Multi-URL Response

When using multiple URLs, the API returns a job ID immediately and processes URLs in the background:
{
  "success": true,
  "data": {
    "jobId": "550e8400-e29b-41d4-a716-446655440000",
    "status": "queued",
    "totalUrls": 3,
    "estimatedCredits": 3,
    "streamUrl": "/v1/jobs/550e8400-e29b-41d4-a716-446655440000/stream"
  }
}
Monitor progress via SSE stream or polling the job endpoint. See Job Management for details.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY

Request Body

url
string
required
The URL of the webpage to scrape. Must be a valid HTTP/HTTPS URL. Ignored when the urls array is provided.

Output Formats

formats
array
Array of output formats to generate (supports multiple formats in one request)
  • markdown - Clean markdown format (default)
  • html - Cleaned HTML content
  • rawHtml - Raw HTML without processing
  • text - Plain text only
  • json - Structured JSON format
  • structured - AI-enhanced structured data
  • links - Extract all links
  • screenshot - Capture screenshot
  • screenshot@fullPage - Full page screenshot
  • pdf - Generate PDF
  • extract - AI-powered extraction
format
string
default:"markdown"
Legacy single format parameter (use formats array for multiple outputs)
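For example, a minimal TypeScript sketch requesting several formats in one call (fetch on Node 18+; YOUR_API_KEY and the URL are placeholders):
// Request markdown, extracted links, and a screenshot together.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com',
    formats: ['markdown', 'links', 'screenshot'] // screenshot adds +1 credit
  })
});
const { success, data } = await res.json();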

Content Processing Options

onlyMainContent
boolean
default:"false"
Extract only main content, removing headers, footers, and sidebars
includeMetadata
boolean
default:"true"
Include page metadata (title, description, author, etc.)
includeLinks
boolean
Include hyperlinks in the extracted content
includeImages
boolean
default:"true"
Include image URLs in the extracted content
removeAds
boolean
default:"true"
Automatically remove advertisements and tracking scripts
removeScripts
boolean
default:"true"
Remove JavaScript from HTML output
removeStyles
boolean
default:"false"
Remove CSS styling from HTML output
includeTags
array
Array of HTML tags to include (whitelist)
excludeTags
array
Array of HTML tags to exclude (blacklist)
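A sketch combining these options to isolate article content (placeholder URL and key; the tag list is illustrative):
// Keep only the main content and drop common boilerplate elements.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/article',
    onlyMainContent: true,
    excludeTags: ['nav', 'footer', 'aside'] // blacklist layout chrome
  })
});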

Screenshot & PDF Options

includeScreenshot
boolean
default:"false"
Capture a screenshot of the page (+1 credit)
screenshotType
string
default:"viewport"
Screenshot type:
  • viewport - Visible viewport only
  • fullpage - Entire scrollable page
includePdf
boolean
default:"false"
Generate a PDF of the page (+1 credit)
parsePdf
boolean
default:"false"
Parse PDF content if URL points to a PDF file
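A sketch capturing a full-page screenshot and a PDF in the same request (each adds +1 credit; placeholder values):
// Screenshot the entire scrollable page and also render a PDF.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com',
    includeScreenshot: true,
    screenshotType: 'fullpage',
    includePdf: true
  })
});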

Engine & Browser Options

engine
string
Scraping engine to use:
  • lightweight - Fast HTTP-based scraping (default)
  • playwright - Full browser with JavaScript support
  • puppeteer - Alternative browser engine
  • stealth - Anti-detection browser mode (+4 credits)
javascript
boolean
default:"false"
Enable JavaScript rendering for dynamic content
stealth
boolean
default:"false"
Enable stealth mode with automatic escalation on bot detection. Automatically upgrades from basic → advanced → maximum stealth if bot detection is encountered (+5 credits)
mobile
boolean
default:"false"
Use mobile viewport for rendering
viewport
object
Custom viewport dimensions
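A sketch rendering a JavaScript-heavy page in a full browser with a custom viewport (dimension values are illustrative):
// Use the playwright engine with JS rendering and a desktop-sized viewport.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/app',
    engine: 'playwright',
    javascript: true,
    viewport: { width: 1280, height: 800 }
  })
});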

Timing Options

waitFor
number
default:"0"
Time to wait after page load in milliseconds (0-30000)
waitTime
number
default:"0"
Legacy: Same as waitFor, in seconds (0-30)
timeout
number
default:"30000"
Maximum time to wait for page load in milliseconds (1000-120000)
waitForSelector
string
CSS selector to wait for before considering page loaded
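A sketch tuning these timing options for a slow, dynamic page (the selector and durations are illustrative):
// Wait for the results container, give scripts 3s to settle, cap the load at 60s.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/dashboard',
    javascript: true,
    waitForSelector: '#results',
    waitFor: 3000,
    timeout: 60000
  })
});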

Browser Automation Actions

actions
array
Array of browser automation actions to perform (max 20 actions)

Authentication

authentication
object
Authentication credentials for protected pages
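The object accepts type, username, password, and token fields. A hedged sketch for a page behind HTTP basic auth (the "basic" type value is an assumption inferred from the username/password fields; credentials are placeholders):
// Scrape a protected page with basic-auth credentials.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/private',
    authentication: {
      type: 'basic', // assumption: exact type values are not listed in this section
      username: '<username>',
      password: '<password>'
    }
  })
});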

AI Extraction

extract
object
AI-powered data extraction configuration
agent
object
Whizo Agent configuration for advanced AI extraction
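The extract block is demonstrated in the AI-Powered Extraction example below. For the agent block, a minimal sketch using its documented model and prompt fields (the model value is a placeholder; supported identifiers are not listed on this page):
// Drive extraction with the Whizo Agent.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/product',
    agent: {
      model: '<model-id>', // placeholder, not a documented value
      prompt: 'List the product name, price, and availability'
    }
  })
});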

Request Customization

headers
object
Custom HTTP headers to send with the request
{
  "User-Agent": "CustomBot/1.0",
  "Accept-Language": "en-US"
}
userAgent
string
Custom user agent string
location
object
Geolocation and language settings
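A sketch combining custom headers, a user agent, and geolocation hints (the country and language value formats are assumptions; only the field names come from the request body schema):
// Customize the outgoing request and suggest a locale for the scrape.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com',
    headers: { 'Accept-Language': 'en-US' },
    userAgent: 'CustomBot/1.0',
    location: { country: 'US', languages: ['en'] } // value formats assumed
  })
});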

Advanced Options

useProxy
boolean
default:"false"
Enable premium proxy rotation from a pool of 10+ rotating proxies for better reliability and IP diversity (+2 credits)
llmOptimization
boolean
default:"false"
Optimize content format for LLM consumption
skipTlsVerification
boolean
default:"false"
Skip TLS/SSL certificate verification (use with caution)
executeJS
string
Custom JavaScript code to execute on the page
replaceAllPathsWithAbsolutePaths
boolean
default:"false"
Convert all relative URLs to absolute URLs
priority
string
default:"medium"
Job priority: low, medium, or high
maxAge
number
default:"30"
Cache max age in days (0-365). Set to 0 to bypass cache.
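A sketch for a frequently re-scraped page: proxy rotation for reliability plus a short cache window (values are illustrative):
// Rotate proxies (+2 credits) and reuse cached results for up to 7 days.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/pricing',
    useProxy: true,
    maxAge: 7,
    priority: 'high'
  })
});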

HTML Processing

htmlFormat
string
default:"cleaned"
HTML output format:
  • cleaned - Cleaned and formatted HTML
  • raw - Raw HTML without processing
stripHtml
boolean
default:"false"
Remove all HTML tags, leaving only text
removeTags
array
HTML tags to remove from output
onlyIncludeTags
array
Only include these HTML tags (whitelist)
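A sketch trimming HTML output down to the tags you care about (the tag list is illustrative):
// Return cleaned HTML restricted to article-level markup.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/article',
    format: 'html',
    htmlFormat: 'cleaned',
    onlyIncludeTags: ['article', 'h1', 'h2', 'p']
  })
});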

JSON Extraction

enableJsonExtraction
boolean
default:"false"
Enable automatic JSON extraction from page
jsonSchema
string
JSON schema for structured extraction
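Since jsonSchema is typed as a string, pass a stringified schema. A sketch, assuming a standard JSON Schema dialect (the docs only say "JSON schema"):
// Enable JSON extraction against a simple stringified schema.
const res = await fetch('https://api.whizo.ai/v1/scrape', {
  method: 'POST',
  headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
  body: JSON.stringify({
    url: 'https://example.com/product',
    enableJsonExtraction: true,
    jsonSchema: JSON.stringify({
      type: 'object',
      properties: { title: { type: 'string' }, price: { type: 'number' } }
    })
  })
});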

Response

success
boolean
Indicates if the request was successful
data
object
The scraped content and metadata. The exact shape depends on the requested formats and options (see the examples below).

Examples

Basic Scraping

curl -X POST https://api.whizo.ai/v1/scrape \
  -H "Authorization: Bearer whizo_YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "format": "markdown",
      "includeScreenshot": false
    }
  }'
Response
{
  "success": true,
  "data": {
    "content": "# Example Page\n\nThis is the scraped content in markdown format...",
    "metadata": {
      "title": "Example Domain",
      "description": "Example page description",
      "url": "https://example.com",
      "statusCode": 200,
      "contentType": "text/html",
      "extractedAt": "2025-01-15T10:30:00Z",
      "processingTime": 1234,
      "creditsUsed": 1
    }
  }
}

Advanced: JavaScript Rendering & Screenshots

curl -X POST https://api.whizo.ai/v1/scrape \
  -H "Authorization: Bearer whizo_YOUR-API-KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "format": "markdown",
      "includeScreenshot": true,
      "includePdf": true,
      "javascript": true,
      "waitTime": 5,
      "mobile": false,
      "headers": {
        "User-Agent": "CustomBot/1.0"
      }
    }
  }'

Browser Automation with Actions

curl -X POST "https://api.whizo.ai/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type": application/json" \
  -d '{
    "url": "https://example.com/search",
    "actions": [
      {
        "type": "input",
        "selector": "#search-input",
        "text": "web scraping"
      },
      {
        "type": "click",
        "selector": "#search-button"
      },
      {
        "type": "wait",
        "milliseconds": 2000
      },
      {
        "type": "screenshot",
        "description": "Search results page"
      }
    ]
  }'

AI-Powered Extraction

curl -X POST "https://api.whizo.ai/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "name": "string",
        "price": "number",
        "description": "string",
        "inStock": "boolean",
        "rating": "number"
      },
      "prompt": "Extract product information from this page"
    }
  }'
{
  "success": true,
  "data": {
    "extracted": {
      "name": "Example Product",
      "price": 29.99,
      "description": "High-quality product description...",
      "inStock": true,
      "rating": 4.5
    },
    "metadata": {
      "creditsUsed": 3
    }
  }
}

Comprehensive Metadata Extraction

WhizoAI extracts 61 comprehensive metadata fields from every page, providing industry-leading metadata coverage that exceeds competitors like Firecrawl.
curl -X POST "https://api.whizo.ai/v1/scrape" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "format": "markdown",
    "engine": "stealth"
  }'
{
  "success": true,
  "data": {
    "content": "# Article Title\n\nArticle content in markdown...",
    "metadata": {
      // Engine Information (4 fields)
      "engine": "stealth",
      "parserName": "stealth",
      "contentSource": "browser",
      "processingTime": 8276,

      // Basic Metadata (10 fields)
      "title": "Example Article - Latest News",
      "description": "A comprehensive article about web scraping and metadata extraction",
      "language": "en",
      "keywords": "web scraping, metadata, SEO, content extraction",
      "robots": "index, follow",
      "canonical": "https://example.com/article",
      "author": "John Doe",
      "publisher": "Example Publishing",
      "favicon": "https://example.com/favicon.ico",

      // Open Graph (9 fields)
      "og": {
        "title": "Example Article - Latest News",
        "description": "A comprehensive article about web scraping",
        "url": "https://example.com/article",
        "image": "https://example.com/og-image.jpg",
        "type": "article",
        "siteName": "Example News",
        "locale": "en_US",
        "audio": null,
        "video": null
      },

      // Twitter Cards (7 fields)
      "twitter": {
        "card": "summary_large_image",
        "site": "@examplenews",
        "creator": "@johndoe",
        "title": "Example Article - Latest News",
        "description": "A comprehensive article about web scraping",
        "image": "https://example.com/twitter-image.jpg",
        "imageAlt": "Article cover image"
      },

      // Article Metadata (7 fields)
      "article": {
        "publishedTime": "2025-01-15T10:30:00Z",
        "modifiedTime": "2025-01-16T14:20:00Z",
        "expirationTime": null,
        "author": "John Doe",
        "section": "Technology",
        "tags": ["web scraping", "metadata", "SEO"]
      },

      // Structured Data (14 types)
      "structuredData": {
        "jsonLd": [
          {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": "Example Article",
            "author": {
              "@type": "Person",
              "name": "John Doe"
            }
          }
        ],
        "microdata": [],
        "rdfa": [],
        "openGraph": {},
        "schemaTypes": ["Article", "Person", "Organization"]
      },

      // Content Quality (29 metrics)
      "contentQuality": {
        "wordCount": 1250,
        "sentenceCount": 45,
        "characterCount": 6842,
        "readabilityScore": 65.5,
        "fleschKincaidGrade": 8.2,
        "averageSentenceLength": 152.04,
        "averageWordsPerSentence": 27.78,
        "contentDensity": 0.87,
        "estimatedReadingTimeMinutes": 6,
        "isContentRich": true,
        "headingCount": 8,
        "paragraphCount": 22,
        "linkCount": 15,
        "imageCount": 4,
        "listCount": 3,
        "tableCount": 1,
        "quoteCount": 2,
        "codeBlockCount": 0,
        "contentType": "article",
        "readingLevel": "medium",
        "lexicalDiversity": 0.68,
        "headingDistribution": {
          "h1": 1,
          "h2": 4,
          "h3": 3,
          "h4": 0,
          "h5": 0,
          "h6": 0
        }
      },

      // Counts (4 fields)
      "wordCount": 1250,
      "characterCount": 6842,
      "linksCount": 15,
      "imagesCount": 4,

      // Request Information
      "url": "https://example.com/article",
      "statusCode": 200,
      "creditsUsed": 3
    }
  }
}
Industry-Leading Metadata Coverage: WhizoAI extracts 61 comprehensive fields including:
  • Engine information (4 fields)
  • Basic metadata (10 fields)
  • Open Graph protocol (9 fields)
  • Twitter Cards (7 fields)
  • Article metadata (7 fields)
  • Structured data (14 types)
  • Content quality metrics (29 metrics)
  • Counts and statistics (4 fields)
This comprehensive extraction surpasses competitors like Firecrawl, providing superior SEO analysis, content quality assessment, and social media optimization data.

Error Responses

error
object
Error details returned when success is false, containing a code, a message, and optional details.

Common Errors

| Status Code | Error Code | Description |
| --- | --- | --- |
| 400 | invalid_url | The provided URL is invalid or malformed |
| 400 | invalid_options | One or more options are invalid |
| 400 | invalid_actions | Browser automation actions are invalid |
| 401 | unauthorized | Invalid or missing API key |
| 402 | insufficient_credits | Not enough credits to complete the request |
| 429 | rate_limited | Rate limit exceeded |
| 500 | scraping_failed | Failed to scrape the webpage |
| 500 | timeout | Request timed out |
| 500 | extraction_failed | AI extraction failed |
{
  "success": false,
  "error": {
    "code": "invalid_url",
    "message": "The provided URL is not valid",
    "details": {
      "url": "invalid-url"
    }
  }
}

Credit Costs

| Feature | Cost |
| --- | --- |
| Basic scraping | 1 credit |
| JavaScript rendering | +1 credit |
| Screenshot (viewport) | +1 credit |
| Screenshot (fullpage) | +1 credit |
| PDF generation | +1 credit |
| Structured format | +1 credit |
| AI extraction | +2-3 credits |
| Stealth mode with auto-escalation | +5 credits |
| Premium proxy rotation | +2 credits |
Note: Costs are cumulative. For example, scraping with JavaScript + screenshot + PDF = 1 + 1 + 1 + 1 = 4 credits total.

Rate Limits

Rate limits vary by plan:
  • Free: 10 requests per hour, 100 per day
  • Starter: 50 requests per hour, 500 per day
  • Pro: 200 requests per hour, 2000 per day
  • Enterprise: Custom limits

Best Practices

  1. Use lightweight engine for static pages to save credits
  2. Enable javascript only when page uses dynamic content
  3. Set appropriate waitFor times for JavaScript-heavy pages
  4. Use actions for interactive pages requiring user input
  5. Cache results using maxAge for frequently accessed pages
  6. Use multi-URL scraping (the urls array) to process multiple URLs efficiently; the standalone /v1/batch endpoint is deprecated
  7. Implement retry logic for failed requests (see the sketch after this list)
  8. Monitor credit usage to avoid running out during critical operations
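A minimal retry sketch for practice 7, assuming fetch on Node 18+ and retrying only rate limits and transient server errors (YOUR_API_KEY is a placeholder):
// Retry /v1/scrape with exponential backoff on 429 and 5xx responses.
async function scrapeWithRetry(url: string, attempts = 3): Promise<unknown> {
  for (let i = 0; i < attempts; i++) {
    const res = await fetch('https://api.whizo.ai/v1/scrape', {
      method: 'POST',
      headers: { Authorization: 'Bearer YOUR_API_KEY', 'Content-Type': 'application/json' },
      body: JSON.stringify({ url })
    });
    if (res.ok) return res.json();
    if (res.status === 429 || res.status >= 500) {
      await new Promise((resolve) => setTimeout(resolve, 2 ** i * 1000)); // 1s, 2s, 4s
      continue;
    }
    throw new Error(`Scrape failed with status ${res.status}`); // other 4xx: do not retry
  }
  throw new Error('Scrape failed after retries');
}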