Extract API

The Extract API uses AI models to extract structured data from web pages according to a JSON schema you define. Typical uses include automated data collection, lead generation, and content analysis.

Authentication

Authorization

string

required

Bearer token using your API key: Bearer YOUR_API_KEY

Request Body

url

string

required

The URL to extract data from. Must be a valid HTTP/HTTPS URL.

schema

object

required

JSON schema defining the structure of data to extract

schema.type

string

required

Must be “object” for structured extraction

schema.properties

object

required

Definition of fields to extract with their data types

schema.required

array

Array of required field names

schema.description

string

Description of what the schema represents

agent

object

Whizo Agent configuration for AI-powered extraction

agent.model

string

default:"Whizo-Agent"

AI agent model to use. Currently supported: Whizo-Agent. Whizo Agent is WhizoAI’s proprietary AI agent that enhances scraping capabilities by controlling browser actions and navigating complex website structures for superior data extraction.

prompt

string

Custom prompt to guide the AI extraction (max 3000 chars). Provide specific instructions for what data to extract and how to interpret the content.

options

object

Configuration options for extraction behavior

options.maxPages

number

default:"50"

Maximum number of pages to process (1-1000)

options.includePatterns

array

URL patterns to include (supports wildcards)

options.excludePatterns

array

URL patterns to exclude (supports wildcards)

options.temperature

number

default:"0.1"

AI creativity level (0.0-1.0). Lower values (0.0-0.3) produce more consistent, factual extractions.

options.timeout

number

default:"30000"

Maximum processing time per page in milliseconds (5000-120000)

options.maxAge

number

default:"30"

Cache max age in days (1-365). Set to 0 to bypass cache.

options.processingOptions

object

Content processing configuration

options.processingOptions.removeNoise

boolean

default:"true"

Remove ads, navigation, and other non-content elements

options.processingOptions.optimizeForLLM

boolean

default:"true"

Optimize content format for AI processing

options.processingOptions.preserveStructure

boolean

default:"true"

Maintain original content structure and formatting
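The documented ranges above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the Extract API itself: `validateExtractRequest`, `LIMITS`, and `MAX_PROMPT_CHARS` are hypothetical names introduced for illustration, and the server may apply additional checks that are not listed in this reference.

```javascript
// Ranges copied from the parameter reference above.
const LIMITS = {
  maxPages: [1, 1000],
  temperature: [0, 1],
  timeout: [5000, 120000],
};
const MAX_PROMPT_CHARS = 3000;

// Hypothetical helper: returns a list of problems found in a request body.
// An empty list means the body satisfies the documented constraints.
function validateExtractRequest(body) {
  const problems = [];
  if (typeof body.url !== 'string' || !/^https?:\/\//i.test(body.url)) {
    problems.push('url must be an HTTP/HTTPS URL');
  }
  if (!body.schema || body.schema.type !== 'object') {
    problems.push('schema.type must be "object"');
  }
  if (!body.schema || typeof body.schema.properties !== 'object' || body.schema.properties === null) {
    problems.push('schema.properties is required');
  }
  if (typeof body.prompt === 'string' && body.prompt.length > MAX_PROMPT_CHARS) {
    problems.push(`prompt exceeds ${MAX_PROMPT_CHARS} characters`);
  }
  const options = body.options || {};
  for (const [name, [min, max]] of Object.entries(LIMITS)) {
    const value = options[name];
    if (value !== undefined && (typeof value !== 'number' || value < min || value > max)) {
      problems.push(`options.${name} must be a number between ${min} and ${max}`);
    }
  }
  return problems;
}
```

Calling `validateExtractRequest(body)` before `fetch` lets a client reject obviously invalid requests locally instead of spending a round trip on a 400 response.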

Response

success

boolean

Indicates if extraction was successful

jobId

string

Unique job identifier for tracking

data

object

Complete extraction results

success

boolean

Whether extraction completed successfully

extractedData

array

Array of extracted data objects matching your schema

sourceUrls

array

URLs that were successfully processed

totalPages

number

Number of pages processed

creditsUsed

number

Total credits consumed for extraction

processingTime

number

Total processing time in milliseconds

model

string

AI model used for extraction

errors

array

Any errors encountered during processing
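Because `success` appears both at the top level and inside `data`, client code may want to check both before reading `extractedData`. The sketch below shows one way to do that; `unwrapExtractResponse` is a hypothetical helper name, and the `error.message` field it reads is the one described under Error Responses.

```javascript
// Hypothetical helper: returns the extracted records from a parsed response
// body, or throws if either success flag is not true.
function unwrapExtractResponse(body) {
  if (!body || body.success !== true || !body.data || body.data.success !== true) {
    const reason = body && body.error ? body.error.message : 'extraction did not succeed';
    throw new Error(reason);
  }
  // Fall back to empty values for fields that a given response may omit.
  const { extractedData = [], errors = [], creditsUsed = 0, totalPages = 0 } = body.data;
  return { records: extractedData, errors, creditsUsed, totalPages };
}
```

A caller would typically pass the result of `await response.json()` and handle the thrown error alongside network failures.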

Examples

Basic Product Data Extraction

curl -X POST "https://api.whizo.ai/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product-1",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
        "description": {"type": "string"},
        "inStock": {"type": "boolean"},
        "category": {"type": "string"},
        "rating": {"type": "number"}
      },
      "required": ["name", "price"],
      "description": "E-commerce product information"
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "prompt": "Extract complete product details including pricing and availability"
  }'

{
  "success": true,
  "jobId": "extract_550e8400-e29b-41d4-a716-446655440000",
  "data": {
    "success": true,
    "extractedData": [
      {
        "name": "Wireless Bluetooth Headphones",
        "price": "$89.99",
        "description": "Premium noise-canceling wireless headphones with 30-hour battery life",
        "inStock": true,
        "category": "Electronics",
        "rating": 4.5
      }
    ],
    "sourceUrls": ["https://store.example.com/product-1"],
    "totalPages": 1,
    "creditsUsed": 3,
    "processingTime": 2800,
    "tokenUsage": {
      "base": 300,
      "output": 2500,
      "thinking": 0,
      "total": 2800,
      "formula": "2,800 tokens (300 base + 2,500 output + 0 thinking)"
    },
    "metadata": {
      "billing": {
        "method": "token-based",
        "formula": "300 + (output_length / 0.5) + thinking",
        "outputSizeBytes": 1250,
        "creditsPerToken": 1
      }
    }
  }
}

Lead Generation from Business Directory

curl -X POST "https://api.whizo.ai/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://directory.example.com/business-1",
    "schema": {
      "type": "object",
      "properties": {
        "companyName": {"type": "string"},
        "contactEmail": {"type": "string"},
        "phoneNumber": {"type": "string"},
        "address": {"type": "string"},
        "website": {"type": "string"},
        "industry": {"type": "string"},
        "employees": {"type": "number"},
        "description": {"type": "string"}
      },
      "required": ["companyName"],
      "description": "Business contact information for lead generation"
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "prompt": "Focus on extracting accurate contact information including email addresses and phone numbers. If multiple contacts are available, prioritize the main business contact.",
    "options": {
      "temperature": 0.0
    }
  }'

{
  "success": true,
  "jobId": "extract_750e8400-e29b-41d4-a716-446655440001",
  "data": {
    "success": true,
    "extractedData": [
      {
        "companyName": "Tech Solutions Inc.",
        "contactEmail": "[email protected]",
        "phoneNumber": "+1-555-0123",
        "address": "123 Innovation Drive, San Francisco, CA 94105",
        "website": "https://techsolutions.com",
        "industry": "Software Development",
        "employees": 150,
        "description": "Leading provider of cloud-based business solutions"
      }
    ],
    "sourceUrls": ["https://directory.example.com/business-1"],
    "totalPages": 1,
    "creditsUsed": 3,
    "processingTime": 3200
  }
}

News Article Analysis

curl -X POST "https://api.whizo.ai/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.example.com/article-1",
    "schema": {
      "type": "object",
      "properties": {
        "headline": {"type": "string"},
        "author": {"type": "string"},
        "publishDate": {"type": "string"},
        "summary": {"type": "string"},
        "keyPoints": {"type": "array", "items": {"type": "string"}},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "category": {"type": "string"},
        "readingTime": {"type": "number"}
      },
      "required": ["headline", "summary"],
      "description": "News article analysis and key information extraction"
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "prompt": "Extract key information from news articles. For keyPoints, identify 3-5 most important facts or developments mentioned. For sentiment, analyze the overall tone of the article."
  }'
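Since AI-extracted values can drift from the requested types, it may be worth checking each returned record against the schema that was sent. The following is a rough sketch only: `checkRecord` is a hypothetical helper that covers `required`, top-level `type`, and `enum`, and it is not a full JSON Schema validator (for example, it does not descend into nested objects or array items).

```javascript
// Hypothetical helper: lists mismatches between one extracted record and the
// top-level schema it was requested with.
function checkRecord(record, schema) {
  const issues = [];
  for (const field of schema.required || []) {
    if (record[field] === undefined || record[field] === null) {
      issues.push(`missing required field: ${field}`);
    }
  }
  for (const [field, spec] of Object.entries(schema.properties || {})) {
    const value = record[field];
    if (value === undefined || value === null) continue; // optional field not returned
    const actual = Array.isArray(value) ? 'array' : typeof value;
    if (spec.type && actual !== spec.type) {
      issues.push(`${field}: expected ${spec.type}, got ${actual}`);
    }
    if (spec.enum && !spec.enum.includes(value)) {
      issues.push(`${field}: "${value}" is not one of ${spec.enum.join(', ')}`);
    }
  }
  return issues;
}
```

With the news schema above, a record whose `sentiment` is outside `positive`, `negative`, `neutral` would be reported, which makes it easy to flag or re-run questionable extractions.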

Schema Generation

WhizoAI can generate a JSON schema for you from a sample URL. This saves time when you are unsure which fields a page exposes.

Generate Schema from URL

curl -X POST "https://api.whizo.ai/v1/extract/generate-schema" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product-1",
    "description": "E-commerce product page",
    "prompt": "Generate a schema to extract product details including name, price, description, availability, and ratings"
  }'

{
  "success": true,
  "data": {
    "schema": {
      "type": "object",
      "properties": {
        "productName": {
          "type": "string",
          "description": "Full name of the product"
        },
        "price": {
          "type": "string",
          "description": "Product price with currency symbol"
        },
        "description": {
          "type": "string",
          "description": "Product description or features"
        },
        "inStock": {
          "type": "boolean",
          "description": "Whether the product is currently available"
        },
        "rating": {
          "type": "number",
          "description": "Product rating out of 5 stars"
        },
        "reviewCount": {
          "type": "number",
          "description": "Number of customer reviews"
        },
        "imageUrl": {
          "type": "string",
          "description": "Main product image URL"
        }
      },
      "required": ["productName", "price"],
      "description": "E-commerce product page"
    },
    "recommendations": [
      "Consider adding 'category' field to track product classification",
      "Add 'brand' field if brand information is important for your use case",
      "Include 'sku' or product ID if you need to track inventory"
    ]
  }
}

Using Generated Schema

Once you have a generated schema, use it directly in your extraction requests:

// Step 1: Generate schema
const schemaResponse = await fetch('https://api.whizo.ai/v1/extract/generate-schema', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://store.example.com/product-1',
    description: 'Product listing page',
    prompt: 'Extract product information including price and availability'
  })
});

// The generated schema is nested under "data" in the response body
const schemaResult = await schemaResponse.json();

// Step 2: Use schema for extraction
const extractResponse = await fetch('https://api.whizo.ai/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://store.example.com/product-1',
    schema: schemaResult.data.schema,
    agent: {
      model: 'Whizo-Agent'
    }
  })
});

// Extracted records are nested under "data" in the response body
const result = await extractResponse.json();
console.log('Extracted data:', result.data.extractedData);

Batch URL Extraction

Extract data from multiple URLs in a single request for efficient parallel processing:

Multiple URLs with Same Schema

curl -X POST "https://api.whizo.ai/v1/extract/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://store.example.com/product-1",
      "https://store.example.com/product-2",
      "https://store.example.com/product-3"
    ],
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
        "inStock": {"type": "boolean"}
      },
      "required": ["name", "price"]
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "options": {
      "concurrency": 3,
      "timeout": 30000
    }
  }'

{
  "success": true,
  "jobId": "extract_batch_850e8400-e29b-41d4-a716-446655440000",
  "data": {
    "status": "queued",
    "totalUrls": 3,
    "estimatedCredits": 9,
    "estimatedTime": "2-3 minutes",
    "concurrency": 3
  }
}

Batch Extraction Options

urls

array

required

Array of URLs to extract data from (max 100 per request)

options.concurrency

number

default:"5"

Number of URLs to process simultaneously (1-10). Higher concurrency means faster completion but higher server load.

options.continueOnError

boolean

default:"true"

Whether to continue processing remaining URLs if some fail

options.includeFailures

boolean

default:"true"

Include failed URLs in the response with error details
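Given the limit of 100 URLs per request, a longer URL list has to be split across several batch requests. The sketch below shows one way to prepare those request bodies; `buildBatchRequests` is a hypothetical helper, and removing duplicate URLs first is a client-side choice rather than an API requirement.

```javascript
const MAX_URLS_PER_BATCH = 100; // documented per-request maximum

// Hypothetical helper: splits a URL list into request bodies for
// POST /v1/extract/batch, applying the documented option defaults
// unless the caller overrides them.
function buildBatchRequests(urls, schema, options = {}) {
  const unique = [...new Set(urls)];
  const requests = [];
  for (let i = 0; i < unique.length; i += MAX_URLS_PER_BATCH) {
    requests.push({
      urls: unique.slice(i, i + MAX_URLS_PER_BATCH),
      schema,
      options: { concurrency: 5, continueOnError: true, includeFailures: true, ...options },
    });
  }
  return requests;
}
```

Each returned object can be passed to `JSON.stringify` and sent as the body of a separate batch request.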

Monitoring Batch Progress

Use Server-Sent Events (SSE) to monitor real-time progress:

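// jobId is the value returned by the batch request.
// Note: the browser's built-in EventSource does not accept a `headers` option;
// this snippet assumes an EventSource implementation (for example a Node.js
// polyfill) that supports custom headers.
const eventSource = new EventSource(
  `https://api.whizo.ai/v1/extract/batch/${jobId}/stream`,
  {
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY'
    }
  }
);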

eventSource.addEventListener('progress', (event) => {
  const data = JSON.parse(event.data);
  console.log(`Progress: ${data.completed}/${data.total} URLs`);
  console.log(`Success rate: ${data.successRate}%`);
});

eventSource.addEventListener('url_completed', (event) => {
  const data = JSON.parse(event.data);
  console.log(`Completed: ${data.url}`);
});

eventSource.addEventListener('completed', (event) => {
  console.log('Batch extraction complete!');
  eventSource.close();
});
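In environments where the EventSource implementation cannot send an Authorization header, one alternative is to read the stream with `fetch` and parse the event text directly. The sketch below covers only the parsing step and rests on assumptions: `parseSseEvents` is a hypothetical helper, it expects complete events (no buffering of partial chunks), and it assumes every `data:` payload is JSON, as the handlers above do.

```javascript
// Hypothetical helper: parses a block of Server-Sent Events text into
// { event, data } objects. Events are separated by a blank line; each event
// may carry an "event:" name and one or more "data:" lines.
function parseSseEvents(text) {
  const events = [];
  for (const block of text.split(/\r?\n\r?\n/)) {
    let name = 'message'; // default event name when no "event:" line is present
    const dataLines = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) name = line.slice(6).trim();
      else if (line.startsWith('data:')) dataLines.push(line.slice(5).trim());
    }
    if (dataLines.length > 0) {
      events.push({ event: name, data: JSON.parse(dataLines.join('\n')) });
    }
  }
  return events;
}
```

A caller could dispatch on `event` (`progress`, `url_completed`, `completed`) in the same way as the listeners shown above.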

Retrieving Batch Results

Once the batch job is complete, retrieve all results:

curl -X GET "https://api.whizo.ai/v1/extract/batch/extract_batch_850e8400/results" \
  -H "Authorization: Bearer YOUR_API_KEY"

{
  "success": true,
  "data": {
    "jobId": "extract_batch_850e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "totalUrls": 3,
    "successfulUrls": 3,
    "failedUrls": 0,
    "creditsUsed": 9,
    "processingTime": 8500,
    "results": [
      {
        "url": "https://store.example.com/product-1",
        "success": true,
        "extractedData": {
          "name": "Wireless Headphones",
          "price": "$89.99",
          "inStock": true
        },
        "creditsUsed": 3
      },
      {
        "url": "https://store.example.com/product-2",
        "success": true,
        "extractedData": {
          "name": "Smart Watch",
          "price": "$299.99",
          "inStock": false
        },
        "creditsUsed": 3
      },
      {
        "url": "https://store.example.com/product-3",
        "success": true,
        "extractedData": {
          "name": "Bluetooth Speaker",
          "price": "$49.99",
          "inStock": true
        },
        "creditsUsed": 3
      }
    ]
  }
}
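A results payload like the one above is often reduced to a flat list of records plus a list of URLs to retry. The following is a minimal sketch; `summarizeBatch` and the `sourceUrl` field it adds are hypothetical, and the shape of failed entries (assumed here to carry `success: false` and a `url`) is not shown in this document.

```javascript
// Hypothetical helper: flattens a batch results response into records,
// failed URLs, and the total credits reported per result.
function summarizeBatch(responseBody) {
  const results = (responseBody.data && responseBody.data.results) || [];
  const succeeded = results.filter((r) => r.success);
  const failed = results.filter((r) => !r.success);
  return {
    // Keep the source URL next to each record so results stay traceable.
    records: succeeded.map((r) => ({ ...r.extractedData, sourceUrl: r.url })),
    failedUrls: failed.map((r) => r.url),
    creditsUsed: results.reduce((sum, r) => sum + (r.creditsUsed || 0), 0),
  };
}
```

The `failedUrls` list can be fed into a follow-up batch request if retrying is appropriate.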

Domain-Based Extraction

Extract data from all pages within a domain using the /domain endpoint:

curl -X POST "https://api.whizo.ai/v1/extract/domain" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "jobs.example.com",
    "schema": {
      "type": "object",
      "properties": {
        "jobTitle": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "salary": {"type": "string"},
        "description": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}},
        "remote": {"type": "boolean"}
      },
      "required": ["jobTitle", "company"],
      "description": "Job posting information"
    },
    "options": {
      "maxPages": 100,
      "includePatterns": ["/jobs/*"],
      "excludePatterns": ["/about", "/contact"]
    }
  }'
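To preview which pages a pattern set is likely to select, the include and exclude rules can be approximated locally. This sketch is an illustration under stated assumptions, not the service's actual matcher: it assumes `*` matches any run of characters, that patterns are compared against the URL path, and that exclude patterns take precedence. `globToRegExp` and `shouldProcess` are hypothetical names.

```javascript
// Convert a wildcard pattern to an anchored regular expression.
// Assumption: "*" matches any sequence of characters, including "/".
function globToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp('^' + escaped.replace(/\*/g, '.*') + '$');
}

// Hypothetical predicate: true if the URL path passes the include/exclude rules.
function shouldProcess(url, includePatterns = [], excludePatterns = []) {
  const path = new URL(url).pathname;
  const matches = (patterns) => patterns.some((p) => globToRegExp(p).test(path));
  if (matches(excludePatterns)) return false; // assumed: excludes win
  return includePatterns.length === 0 || matches(includePatterns);
}
```

Running a handful of known URLs through `shouldProcess` before starting a domain job is a cheap way to catch patterns that are too broad or too narrow.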

Schema Design Best Practices

Simple Product Schema

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "string"},
    "availability": {"type": "boolean"}
  },
  "required": ["name"],
  "description": "Basic product information"
}

Complex Contact Schema

{
  "type": "object",
  "properties": {
    "person": {
      "type": "object",
      "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "title": {"type": "string"},
        "email": {"type": "string"},
        "phone": {"type": "string"}
      }
    },
    "company": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "industry": {"type": "string"},
        "size": {"type": "string"}
      }
    },
    "socialMedia": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "platform": {"type": "string"},
          "url": {"type": "string"}
        }
      }
    }
  },
  "description": "Comprehensive contact information"
}

Credit Costs

Extract API uses token-based pricing that scales fairly with the amount of data extracted.

Pricing Formula

Total Tokens = 300 (base) + (output_size / 0.5) + thinking_cost
Total Credits = Total Tokens / 1,000, rounded up (as in the cost examples and sample responses)

Cost Examples

Extraction Type	Output Size	Tokens	Credits
Small (company info)	500 bytes	~1,300	2
Medium (product list)	5 KB	~10,300	11
Large (full catalog)	50 KB	~100,300	101

Cost Components

Base Cost: 300 tokens (covers API overhead and schema analysis)
Output Cost: 2 tokens per character of extracted data
Thinking Cost: AI processing cost converted to tokens (varies by complexity)
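The token formula can be turned into a quick estimator. This is a sketch under one notable assumption: the credit figures in the cost examples are consistent with one credit per 1,000 tokens, rounded up, so `TOKENS_PER_CREDIT` below is inferred from that table rather than taken from a documented constant. `estimateCost` is a hypothetical helper name.

```javascript
const BASE_TOKENS = 300;        // base cost from the pricing formula
const TOKENS_PER_BYTE = 2;      // output_size / 0.5
const TOKENS_PER_CREDIT = 1000; // assumption inferred from the cost examples

// Hypothetical helper: estimates tokens and credits for a given output size.
function estimateCost(outputSizeBytes, thinkingTokens = 0) {
  const tokens = BASE_TOKENS + outputSizeBytes * TOKENS_PER_BYTE + thinkingTokens;
  return { tokens, credits: Math.ceil(tokens / TOKENS_PER_CREDIT) };
}
```

For example, `estimateCost(5000)` yields 10,300 tokens and 11 credits, matching the "Medium" row of the cost examples.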

Why Token-Based?

✅ Fair pricing - Pay for what you get
✅ Predictable - Estimate costs from schema size
✅ Industry standard - Matches OpenAI, Anthropic, Firecrawl
✅ Encourages efficiency - Optimize schemas for cost savings

Token Breakdown in Response

Every Extract API response includes a detailed token breakdown:

{
  "success": true,
  "creditsUsed": 11,
  "tokenUsage": {
    "base": 300,
    "output": 10000,
    "thinking": 0,
    "total": 10300,
    "formula": "10,300 tokens (300 base + 10,000 output + 0 thinking)"
  },
  "metadata": {
    "billing": {
      "method": "token-based",
      "formula": "300 + (output_length / 0.5) + thinking",
      "outputSizeBytes": 5000,
      "creditsPerToken": 1
    }
  }
}

This transparency allows you to:

Understand your costs - See exactly what you’re paying for
Optimize your schemas - Reduce output size to save credits
Predict future costs - Estimate costs for similar extractions

Monitoring Progress

Check extraction status using the jobs endpoint:

curl -H "Authorization: Bearer YOUR_API_KEY" \
     https://api.whizo.ai/v1/extract/extract_550e8400-e29b-41d4-a716-446655440000

Error Responses

error

object

code

string

Error code identifier

message

string

Human-readable error message

details

object

Additional error context

Common Errors

Status Code	Error Code	Description
400	`invalid_schema`	JSON schema is malformed or invalid
400	`invalid_urls`	One or more URLs are invalid
400	`schema_too_complex`	Schema exceeds complexity limits
401	`unauthorized`	Invalid or missing API key
402	`insufficient_credits`	Not enough credits for extraction
429	`rate_limited`	Rate limit exceeded
500	`extraction_failed`	AI extraction process failed

{
  "success": false,
  "error": {
    "code": "invalid_schema",
    "message": "Schema validation failed",
    "details": {
      "issues": [
        "Missing required 'type' field in schema root",
        "Property 'price' has invalid type definition"
      ]
    }
  }
}
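Client code usually needs to decide whether a failed request is worth retrying. The sketch below reads the documented `error` object and applies one possible policy; `describeFailure` is a hypothetical helper, and treating `rate_limited` and `extraction_failed` as retryable is a client-side assumption, not something the API prescribes.

```javascript
// Assumed policy: only these error codes are worth retrying after a delay.
const RETRYABLE = new Set(['rate_limited', 'extraction_failed']);

// Hypothetical helper: normalizes an error response into a flat description.
function describeFailure(status, body) {
  const error = (body && body.error) || {};
  const code = error.code || 'unknown';
  const issues = (error.details && error.details.issues) || [];
  return {
    code,
    message: error.message || `HTTP ${status}`,
    issues,
    retryable: RETRYABLE.has(code),
  };
}
```

Errors such as `invalid_schema` or `insufficient_credits` come back with `retryable: false`, since resending the same request would fail the same way until the schema or account balance changes.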

Rate Limits

Extract API rate limits by plan:

Free: 5 extractions per hour, 20 per day
Starter: 20 extractions per hour, 100 per day
Pro: 100 extractions per hour, 500 per day
Enterprise: Custom limits
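To avoid running into `rate_limited` responses, a client can track its own request timestamps against the plan limits above. This is a minimal sketch: `ExtractionBudget` is a hypothetical class, and it assumes sliding one-hour and 24-hour windows, whereas the service may count against fixed windows instead.

```javascript
// Plan limits copied from the list above (Enterprise is custom, so omitted).
const PLAN_LIMITS = {
  free: { perHour: 5, perDay: 20 },
  starter: { perHour: 20, perDay: 100 },
  pro: { perHour: 100, perDay: 500 },
};
const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

// Hypothetical client-side limiter using sliding windows (an assumption).
class ExtractionBudget {
  constructor(plan, now = () => Date.now()) {
    this.limits = PLAN_LIMITS[plan];
    if (!this.limits) throw new Error(`unknown plan: ${plan}`);
    this.now = now; // injectable clock, useful for testing
    this.timestamps = [];
  }

  // Returns true and records the request if it fits both windows.
  tryAcquire() {
    const t = this.now();
    this.timestamps = this.timestamps.filter((ts) => t - ts < DAY_MS);
    const lastHour = this.timestamps.filter((ts) => t - ts < HOUR_MS).length;
    if (lastHour >= this.limits.perHour || this.timestamps.length >= this.limits.perDay) {
      return false;
    }
    this.timestamps.push(t);
    return true;
  }
}
```

A caller would check `budget.tryAcquire()` before each extraction and delay or queue the request when it returns `false`.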

Use Cases

E-commerce Price Monitoring

Monitor competitor prices across multiple product pages with consistent schema extraction for automated price tracking systems.

Lead Generation

Extract contact information from business directories, company websites, and professional profiles for sales prospecting.

Content Analysis

Analyze news articles, blog posts, and social media content for sentiment, key topics, and structured insights.

Real Estate Data

Extract property listings, prices, features, and contact information from real estate websites and portals.

Job Market Analysis

Gather job posting data including requirements, salaries, and company information for market research.

Related Endpoints

Scrape API - Basic content extraction
Crawl API - Multi-page website crawling
Search API - Search and extract from results
Jobs API - Monitor extraction progress
