POST https://api.whizo.ai/v1/extract
Extract API
curl --request POST \
  --url https://api.whizo.ai/v1/extract \
  --header 'Authorization: <authorization>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "url": "<string>",
  "schema": {
    "type": "<string>",
    "properties": {},
    "required": [
      "<string>"
    ],
    "description": "<string>"
  },
  "agent": {
    "model": "<string>"
  },
  "prompt": "<string>",
  "options": {
    "maxPages": 123,
    "includePatterns": [
      "<string>"
    ],
    "excludePatterns": [
      "<string>"
    ],
    "temperature": 123,
    "timeout": 123,
    "maxAge": 123,
    "processingOptions": {
      "removeNoise": true,
      "optimizeForLLM": true,
      "preserveStructure": true
    },
    "concurrency": 123,
    "continueOnError": true,
    "includeFailures": true
  },
  "urls": [
    "<string>"
  ]
}
'
The Extract API uses AI models to extract structured data from web pages according to a JSON schema that you define. Typical uses include automated data collection, lead generation, and content analysis.

Authentication

Authorization
string
required
Bearer token using your API key: Bearer YOUR_API_KEY

Request Body

url
string
required
The URL to extract data from. Must be a valid HTTP/HTTPS URL.
schema
object
required
JSON schema defining the structure of data to extract
agent
object
Whizo Agent configuration for AI-powered extraction
prompt
string
Custom prompt to guide the AI extraction (max 3000 chars). Provide specific instructions for what data to extract and how to interpret the content.
options
object
Configuration options for extraction behavior
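
The constraints listed above can be checked locally before a request is sent. The following is a minimal sketch: the helper name `buildExtractBody` is hypothetical, and it enforces only what this page states (an HTTP/HTTPS `url`, a required `schema` object, a `prompt` of at most 3000 characters). The server may apply further validation.

```javascript
// Hypothetical helper (not part of any official SDK): builds a /v1/extract
// request body and applies the constraints documented in this section.
function buildExtractBody({ url, schema, prompt, agent, options }) {
  let parsed;
  try {
    parsed = new URL(url);
  } catch {
    throw new Error(`url is not a valid URL: ${url}`);
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('url must use http or https');
  }
  if (schema === null || typeof schema !== 'object' || Array.isArray(schema)) {
    throw new Error('schema is required and must be an object');
  }
  if (prompt !== undefined && prompt.length > 3000) {
    throw new Error('prompt exceeds 3000 characters');
  }
  // Only include the optional fields that were actually provided.
  const body = { url, schema };
  if (agent !== undefined) body.agent = agent;
  if (prompt !== undefined) body.prompt = prompt;
  if (options !== undefined) body.options = options;
  return body;
}
```

The returned object can be passed through `JSON.stringify` and used as the request body of a POST to `/v1/extract`.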

Response

success
boolean
Indicates if extraction was successful
jobId
string
Unique job identifier for tracking
data
object
Complete extraction results
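
As a rough illustration of how these fields fit together, the sketch below unpacks a parsed response body. The helper name `unwrapExtractResponse` is hypothetical. It assumes the success shape shown in the examples on this page (`data.extractedData`, `data.creditsUsed`) and the error shape shown under Error Responses (`error.code`, `error.message`).

```javascript
// Hypothetical helper: returns the extracted records from a parsed
// /v1/extract response body, or throws an Error carrying the API error code.
function unwrapExtractResponse(body) {
  if (!body || body.success !== true) {
    const err = (body && body.error) || {};
    throw new Error(`${err.code || 'unknown_error'}: ${err.message || 'extraction failed'}`);
  }
  const data = body.data || {};
  return {
    jobId: body.jobId,
    // extractedData is an array in the single-URL examples on this page.
    records: Array.isArray(data.extractedData) ? data.extractedData : [],
    creditsUsed: data.creditsUsed,
  };
}
```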

Examples

Basic Product Data Extraction

curl -X POST "https://api.whizo.ai/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product-1",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
        "description": {"type": "string"},
        "inStock": {"type": "boolean"},
        "category": {"type": "string"},
        "rating": {"type": "number"}
      },
      "required": ["name", "price"],
      "description": "E-commerce product information"
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "prompt": "Extract complete product details including pricing and availability"
  }'
{
  "success": true,
  "jobId": "extract_550e8400-e29b-41d4-a716-446655440000",
  "data": {
    "success": true,
    "extractedData": [
      {
        "name": "Wireless Bluetooth Headphones",
        "price": "$89.99",
        "description": "Premium noise-canceling wireless headphones with 30-hour battery life",
        "inStock": true,
        "category": "Electronics",
        "rating": 4.5
      }
    ],
    "sourceUrls": ["https://store.example.com/product-1"],
    "totalPages": 1,
    "creditsUsed": 3,
    "processingTime": 2800,
    "tokenUsage": {
      "base": 300,
      "output": 2500,
      "thinking": 0,
      "total": 2800,
      "formula": "2,800 tokens (300 base + 2,500 output + 0 thinking)"
    },
    "metadata": {
      "billing": {
        "method": "token-based",
        "formula": "300 + (output_length / 0.5) + thinking",
        "outputSizeBytes": 1250,
        "creditsPerToken": 1
      }
    }
  }
}

Lead Generation from Business Directory

curl -X POST "https://api.whizo.ai/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://directory.example.com/business-1",
    "schema": {
      "type": "object",
      "properties": {
        "companyName": {"type": "string"},
        "contactEmail": {"type": "string"},
        "phoneNumber": {"type": "string"},
        "address": {"type": "string"},
        "website": {"type": "string"},
        "industry": {"type": "string"},
        "employees": {"type": "number"},
        "description": {"type": "string"}
      },
      "required": ["companyName"],
      "description": "Business contact information for lead generation"
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "prompt": "Focus on extracting accurate contact information including email addresses and phone numbers. If multiple contacts are available, prioritize the main business contact.",
    "options": {
      "temperature": 0.0
    }
  }'
{
  "success": true,
  "jobId": "extract_750e8400-e29b-41d4-a716-446655440001",
  "data": {
    "success": true,
    "extractedData": [
      {
        "companyName": "Tech Solutions Inc.",
        "contactEmail": "[email protected]",
        "phoneNumber": "+1-555-0123",
        "address": "123 Innovation Drive, San Francisco, CA 94105",
        "website": "https://techsolutions.com",
        "industry": "Software Development",
        "employees": 150,
        "description": "Leading provider of cloud-based business solutions"
      }
    ],
    "sourceUrls": ["https://directory.example.com/business-1"],
    "totalPages": 1,
    "creditsUsed": 3,
    "processingTime": 3200
  }
}

News Article Analysis

curl -X POST "https://api.whizo.ai/v1/extract" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.example.com/article-1",
    "schema": {
      "type": "object",
      "properties": {
        "headline": {"type": "string"},
        "author": {"type": "string"},
        "publishDate": {"type": "string"},
        "summary": {"type": "string"},
        "keyPoints": {"type": "array", "items": {"type": "string"}},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "category": {"type": "string"},
        "readingTime": {"type": "number"}
      },
      "required": ["headline", "summary"],
      "description": "News article analysis and key information extraction"
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "prompt": "Extract key information from news articles. For keyPoints, identify 3-5 most important facts or developments mentioned. For sentiment, analyze the overall tone of the article."
  }'

Schema Generation

WhizoAI can generate a JSON schema for you from a sample URL. This saves time and gives you a data structure matched to the content of the page.

Generate Schema from URL

curl -X POST "https://api.whizo.ai/v1/extract/generate-schema" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://store.example.com/product-1",
    "description": "E-commerce product page",
    "prompt": "Generate a schema to extract product details including name, price, description, availability, and ratings"
  }'
{
  "success": true,
  "data": {
    "schema": {
      "type": "object",
      "properties": {
        "productName": {
          "type": "string",
          "description": "Full name of the product"
        },
        "price": {
          "type": "string",
          "description": "Product price with currency symbol"
        },
        "description": {
          "type": "string",
          "description": "Product description or features"
        },
        "inStock": {
          "type": "boolean",
          "description": "Whether the product is currently available"
        },
        "rating": {
          "type": "number",
          "description": "Product rating out of 5 stars"
        },
        "reviewCount": {
          "type": "number",
          "description": "Number of customer reviews"
        },
        "imageUrl": {
          "type": "string",
          "description": "Main product image URL"
        }
      },
      "required": ["productName", "price"],
      "description": "E-commerce product page"
    },
    "recommendations": [
      "Consider adding 'category' field to track product classification",
      "Add 'brand' field if brand information is important for your use case",
      "Include 'sku' or product ID if you need to track inventory"
    ]
  }
}

Using Generated Schema

Once you have a generated schema, use it directly in your extraction requests:
// Step 1: Generate schema
const schemaResponse = await fetch('https://api.whizo.ai/v1/extract/generate-schema', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://store.example.com/product-1',
    description: 'Product listing page',
    prompt: 'Extract product information including price and availability'
  })
});

const schemaResult = await schemaResponse.json();

// Step 2: Use schema for extraction
// The generated schema is nested under data.schema in the response
const extractResponse = await fetch('https://api.whizo.ai/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer YOUR_API_KEY',
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({
    url: 'https://store.example.com/product-1',
    schema: schemaResult.data.schema,
    agent: {
      model: 'Whizo-Agent'
    }
  })
});

const result = await extractResponse.json();
// Extracted records are nested under data.extractedData in the response
console.log('Extracted data:', result.data.extractedData);

Batch URL Extraction

Extract data from multiple URLs in a single request for efficient parallel processing:

Multiple URLs with Same Schema

curl -X POST "https://api.whizo.ai/v1/extract/batch" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://store.example.com/product-1",
      "https://store.example.com/product-2",
      "https://store.example.com/product-3"
    ],
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
        "inStock": {"type": "boolean"}
      },
      "required": ["name", "price"]
    },
    "agent": {
      "model": "Whizo-Agent"
    },
    "options": {
      "concurrency": 3,
      "timeout": 30000
    }
  }'
{
  "success": true,
  "jobId": "extract_batch_850e8400-e29b-41d4-a716-446655440000",
  "data": {
    "status": "queued",
    "totalUrls": 3,
    "estimatedCredits": 9,
    "estimatedTime": "2-3 minutes",
    "concurrency": 3
  }
}

Batch Extraction Options

urls
array
required
Array of URLs to extract data from (max 100 per request)
options.concurrency
number
default: 5
Number of URLs to process simultaneously (1-10). Higher concurrency finishes sooner but puts more load on the server.
options.continueOnError
boolean
default: true
Whether to continue processing the remaining URLs if some fail
options.includeFailures
boolean
default: true
Whether to include failed URLs in the response, with error details
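
Because a single request accepts at most 100 URLs and concurrency must fall between 1 and 10, a longer URL list has to be split on the client side. The sketch below shows one way to do that. The helper name `planBatchRequests` is hypothetical, and each returned object is meant to be sent as a separate POST to `/v1/extract/batch`.

```javascript
const MAX_URLS_PER_REQUEST = 100; // documented limit for `urls`

// Hypothetical helper: splits a URL list into request bodies of at most 100
// URLs each and clamps concurrency into the documented 1-10 range.
function planBatchRequests(urls, schema, { concurrency = 5, ...rest } = {}) {
  const clamped = Math.min(10, Math.max(1, Math.trunc(concurrency)));
  const bodies = [];
  for (let i = 0; i < urls.length; i += MAX_URLS_PER_REQUEST) {
    bodies.push({
      urls: urls.slice(i, i + MAX_URLS_PER_REQUEST),
      schema,
      options: { ...rest, concurrency: clamped },
    });
  }
  return bodies;
}
```

Each batch gets its own `jobId`, so results from a split list have to be retrieved and merged per job.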

Monitoring Batch Progress

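Use Server-Sent Events (SSE) to monitor real-time progress:
// jobId is the value returned by the batch request.
// Note: the browser's built-in EventSource does not accept a headers option,
// so this snippet assumes an EventSource implementation that supports custom headers.
const eventSource = new EventSource(
  `https://api.whizo.ai/v1/extract/batch/${jobId}/stream`,
  {
    headers: {
      'Authorization': 'Bearer YOUR_API_KEY'
    }
  }
);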

eventSource.addEventListener('progress', (event) => {
  const data = JSON.parse(event.data);
  console.log(`Progress: ${data.completed}/${data.total} URLs`);
  console.log(`Success rate: ${data.successRate}%`);
});

eventSource.addEventListener('url_completed', (event) => {
  const data = JSON.parse(event.data);
  console.log(`Completed: ${data.url}`);
});

eventSource.addEventListener('completed', (event) => {
  console.log('Batch extraction complete!');
  eventSource.close();
});
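
The browser's built-in `EventSource` constructor accepts only a `withCredentials` option, so it cannot send an `Authorization` header. One alternative is to read the stream with `fetch` and parse the event-stream text by hand. The following is a simplified sketch: `createSseParser` and `streamBatchProgress` are hypothetical names, the parser handles only `event:` and `data:` fields, and it does not handle a `\r\n` pair split across two chunks.

```javascript
// Incremental parser for the text/event-stream format: feed it decoded text
// chunks and it calls onEvent({ event, data }) for each complete event.
function createSseParser(onEvent) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk.replace(/\r\n/g, '\n');
    let end;
    // Events are separated by a blank line.
    while ((end = buffer.indexOf('\n\n')) !== -1) {
      const block = buffer.slice(0, end);
      buffer = buffer.slice(end + 2);
      let event = 'message';
      const dataLines = [];
      for (const line of block.split('\n')) {
        if (line.startsWith('event:')) event = line.slice(6).trim();
        else if (line.startsWith('data:')) dataLines.push(line.slice(5).replace(/^ /, ''));
      }
      if (dataLines.length > 0) onEvent({ event, data: dataLines.join('\n') });
    }
  };
}

// Streams batch progress with fetch so the Authorization header can be sent.
async function streamBatchProgress(jobId, apiKey, onEvent) {
  const res = await fetch(`https://api.whizo.ai/v1/extract/batch/${jobId}/stream`, {
    headers: { Authorization: `Bearer ${apiKey}`, Accept: 'text/event-stream' },
  });
  if (!res.ok) throw new Error(`stream request failed with HTTP ${res.status}`);
  const feed = createSseParser(onEvent);
  const decoder = new TextDecoder();
  const reader = res.body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    feed(decoder.decode(value, { stream: true }));
  }
}
```

A caller would pass a callback that switches on `event` (`progress`, `url_completed`, `completed`) and parses `data` with `JSON.parse`.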

Retrieving Batch Results

Once the batch job is complete, retrieve all results:
curl -X GET "https://api.whizo.ai/v1/extract/batch/extract_batch_850e8400-e29b-41d4-a716-446655440000/results" \
  -H "Authorization: Bearer YOUR_API_KEY"
{
  "success": true,
  "data": {
    "jobId": "extract_batch_850e8400-e29b-41d4-a716-446655440000",
    "status": "completed",
    "totalUrls": 3,
    "successfulUrls": 3,
    "failedUrls": 0,
    "creditsUsed": 9,
    "processingTime": 8500,
    "results": [
      {
        "url": "https://store.example.com/product-1",
        "success": true,
        "extractedData": {
          "name": "Wireless Headphones",
          "price": "$89.99",
          "inStock": true
        },
        "creditsUsed": 3
      },
      {
        "url": "https://store.example.com/product-2",
        "success": true,
        "extractedData": {
          "name": "Smart Watch",
          "price": "$299.99",
          "inStock": false
        },
        "creditsUsed": 3
      },
      {
        "url": "https://store.example.com/product-3",
        "success": true,
        "extractedData": {
          "name": "Bluetooth Speaker",
          "price": "$49.99",
          "inStock": true
        },
        "creditsUsed": 3
      }
    ]
  }
}
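
A results payload like the one above can be condensed into a flat list of records plus a list of failed URLs. This is a minimal sketch with a hypothetical helper name, `summarizeBatchResults`. The shape of a failed entry is not shown on this page, so the code assumes it carries `success: false` and a `url`.

```javascript
// Hypothetical helper: flattens a parsed batch results body.
// ASSUMPTION: failed entries have success === false and a url field.
function summarizeBatchResults(body) {
  const results = (body && body.data && body.data.results) || [];
  const succeeded = results.filter((r) => r.success);
  const failed = results.filter((r) => !r.success);
  return {
    // sourceUrl is added so each record can be traced back to its page.
    records: succeeded.map((r) => ({ sourceUrl: r.url, ...r.extractedData })),
    failedUrls: failed.map((r) => r.url),
    creditsUsed: results.reduce((sum, r) => sum + (r.creditsUsed || 0), 0),
  };
}
```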

Domain-Based Extraction

Extract data from all pages within a domain using the /domain endpoint:
curl -X POST "https://api.whizo.ai/v1/extract/domain" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "domain": "jobs.example.com",
    "schema": {
      "type": "object",
      "properties": {
        "jobTitle": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": "string"},
        "salary": {"type": "string"},
        "description": {"type": "string"},
        "requirements": {"type": "array", "items": {"type": "string"}},
        "remote": {"type": "boolean"}
      },
      "required": ["jobTitle", "company"],
      "description": "Job posting information"
    },
    "options": {
      "maxPages": 100,
      "includePatterns": ["/jobs/*"],
      "excludePatterns": ["/about", "/contact"]
    }
  }'
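
To preview which paths a pair of `includePatterns` and `excludePatterns` would select, the patterns can be tested locally. The sketch below rests on assumptions this page does not confirm: that `*` matches any run of characters, that patterns apply to the URL path, and that exclusions win over inclusions. The function names are hypothetical.

```javascript
// Converts a pattern such as "/jobs/*" into a RegExp anchored to the full path.
// ASSUMPTION: "*" matches any run of characters; the API's actual matching
// rules are not documented on this page.
function patternToRegExp(pattern) {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&');
  return new RegExp(`^${escaped.replace(/\*/g, '.*')}$`);
}

// Returns true if the URL's path passes the include and exclude patterns.
// ASSUMPTION: excludes take priority, and an empty include list allows all.
function wouldCrawl(url, { includePatterns = [], excludePatterns = [] } = {}) {
  const path = new URL(url).pathname;
  const matches = (patterns) => patterns.some((p) => patternToRegExp(p).test(path));
  if (matches(excludePatterns)) return false;
  return includePatterns.length === 0 || matches(includePatterns);
}
```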

Schema Design Best Practices

Simple Product Schema

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "string"},
    "availability": {"type": "boolean"}
  },
  "required": ["name"],
  "description": "Basic product information"
}

Complex Contact Schema

{
  "type": "object",
  "properties": {
    "person": {
      "type": "object",
      "properties": {
        "firstName": {"type": "string"},
        "lastName": {"type": "string"},
        "title": {"type": "string"},
        "email": {"type": "string"},
        "phone": {"type": "string"}
      }
    },
    "company": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "industry": {"type": "string"},
        "size": {"type": "string"}
      }
    },
    "socialMedia": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "platform": {"type": "string"},
          "url": {"type": "string"}
        }
      }
    }
  },
  "description": "Comprehensive contact information"
}
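
When designing a schema, it can help to check a sample record against it before relying on the output. The following is a minimal sketch of such a check with a hypothetical function name, `checkAgainstSchema`. It covers only the keywords used on this page (`type`, `required`, `properties`, `items`, `enum`) and is not a full JSON Schema validator; for example, it does not handle the `integer` type.

```javascript
// Returns a list of problems found when comparing `value` with `schema`.
// Supports only "type", "required", "properties", "items" and "enum".
function checkAgainstSchema(value, schema, path = '$') {
  const problems = [];
  const actual = Array.isArray(value) ? 'array' : value === null ? 'null' : typeof value;
  if (schema.type && actual !== schema.type) {
    return [`${path}: expected ${schema.type}, got ${actual}`];
  }
  if (schema.enum && !schema.enum.includes(value)) {
    problems.push(`${path}: value not in enum`);
  }
  if (actual === 'object') {
    for (const key of schema.required || []) {
      if (!(key in value)) problems.push(`${path}.${key}: required field missing`);
    }
    // Recurse into every declared property that is present on the value.
    for (const [key, sub] of Object.entries(schema.properties || {})) {
      if (key in value) problems.push(...checkAgainstSchema(value[key], sub, `${path}.${key}`));
    }
  }
  if (actual === 'array' && schema.items) {
    value.forEach((item, i) => {
      problems.push(...checkAgainstSchema(item, schema.items, `${path}[${i}]`));
    });
  }
  return problems;
}
```

An empty array means the record is consistent with the parts of the schema this sketch understands.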

Credit Costs

Extract API uses token-based pricing that scales fairly with the amount of data extracted.

Pricing Formula

Total Tokens = 300 (base) + (output_size / 0.5) + thinking_cost
Total Credits = Total Tokens / 1,000, rounded up

Cost Examples

Extraction Type | Output Size | Tokens | Credits
Small (company info) | 500 bytes | ~1,300 | 2
Medium (product list) | 5 KB | ~10,300 | 11
Large (full catalog) | 50 KB | ~100,300 | 101

Cost Components

  • Base Cost: 300 tokens (covers API overhead and schema analysis)
  • Output Cost: 2 tokens per character of extracted data
  • Thinking Cost: AI processing cost converted to tokens (varies by complexity)
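
The pricing formula can be turned into a small estimator. The function name `estimateExtractCost` is hypothetical. The token calculation follows the formula on this page. The credit conversion is an assumption: the Cost Examples table and the sample responses on this page are all consistent with one credit per 1,000 tokens, rounded up.

```javascript
const BASE_TOKENS = 300;   // base cost from the pricing formula
const TOKENS_PER_BYTE = 2; // output_size / 0.5

// Estimates tokens and credits for an extraction of a given output size.
function estimateExtractCost(outputSizeBytes, thinkingTokens = 0) {
  const tokens = BASE_TOKENS + outputSizeBytes * TOKENS_PER_BYTE + thinkingTokens;
  // ASSUMPTION: one credit per 1,000 tokens, rounded up. This matches the
  // examples on this page (1,300 tokens -> 2 credits, 10,300 -> 11).
  const credits = Math.ceil(tokens / 1000);
  return { tokens, credits };
}

console.log(estimateExtractCost(500));   // { tokens: 1300, credits: 2 }
console.log(estimateExtractCost(5000));  // { tokens: 10300, credits: 11 }
```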

Why Token-Based?

  1. Fair pricing - Pay for what you get
  2. Predictable - Estimate costs from schema size
  3. Industry standard - Matches OpenAI, Anthropic, Firecrawl
  4. Encourages efficiency - Optimize schemas for cost savings

Token Breakdown in Response

Every Extract API response includes a detailed token breakdown:
{
  "success": true,
  "creditsUsed": 11,
  "tokenUsage": {
    "base": 300,
    "output": 10000,
    "thinking": 0,
    "total": 10300,
    "formula": "10,300 tokens (300 base + 10,000 output + 0 thinking)"
  },
  "metadata": {
    "billing": {
      "method": "token-based",
      "formula": "300 + (output_length / 0.5) + thinking",
      "outputSizeBytes": 5000,
      "creditsPerToken": 1
    }
  }
}
This transparency allows you to:
  • Understand your costs - See exactly what you’re paying for
  • Optimize your schemas - Reduce output size to save credits
  • Predict future costs - Estimate costs for similar extractions

Monitoring Progress

Check the status of an extraction by requesting its job ID:
curl -H "Authorization: Bearer YOUR_API_KEY" \
     https://api.whizo.ai/v1/extract/extract_550e8400-e29b-41d4-a716-446655440000
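
Status checks are usually wrapped in a polling loop. The sketch below is hedged: this page does not show the shape of the status response, so the code assumes a `data.status` field with the values `queued` and `completed` seen in the batch examples, plus a hypothetical `failed` value. The request is injected as a function so the loop can be exercised without network access.

```javascript
// Polls until the job reports a terminal status or the attempts run out.
// `fetchStatus` is any async function that returns the parsed status body.
// ASSUMPTION: the body exposes data.status; 'failed' is a guessed value.
async function waitForJob(fetchStatus, { intervalMs = 2000, maxAttempts = 30 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const body = await fetchStatus();
    const status = body && body.data && body.data.status;
    if (status === 'completed' || status === 'failed') return body;
    if (attempt < maxAttempts) {
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`job did not finish after ${maxAttempts} attempts`);
}

// Example wiring (not executed here):
// const body = await waitForJob(() =>
//   fetch(`https://api.whizo.ai/v1/extract/${jobId}`, {
//     headers: { Authorization: 'Bearer YOUR_API_KEY' },
//   }).then((res) => res.json())
// );
```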

Error Responses

error
object

Common Errors

Status Code | Error Code | Description
400 | invalid_schema | JSON schema is malformed or invalid
400 | invalid_urls | One or more URLs are invalid
400 | schema_too_complex | Schema exceeds complexity limits
401 | unauthorized | Invalid or missing API key
402 | insufficient_credits | Not enough credits for extraction
429 | rate_limited | Rate limit exceeded
500 | extraction_failed | AI extraction process failed
{
  "success": false,
  "error": {
    "code": "invalid_schema",
    "message": "Schema validation failed",
    "details": {
      "issues": [
        "Missing required 'type' field in schema root",
        "Property 'price' has invalid type definition"
      ]
    }
  }
}
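
A client typically retries only the errors that may succeed on a second attempt. The sketch below treats 429 and 500 as retryable and everything else in the table as permanent. That split is an assumption, as are the function names and the backoff values; this page does not specify a retry policy.

```javascript
// ASSUMPTION: only rate limiting (429) and extraction failures (500) are
// worth retrying; 400, 401 and 402 need a change to the request or account.
const RETRYABLE_STATUS = new Set([429, 500]);

function isRetryable(status) {
  return RETRYABLE_STATUS.has(status);
}

// Exponential backoff: 1s, 2s, 4s, ... capped at capMs.
function backoffDelayMs(attempt, baseMs = 1000, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Calls `send` (an async function returning a fetch Response) and retries
// retryable failures up to maxRetries times.
async function withRetry(send, maxRetries = 3) {
  for (let attempt = 0; ; attempt++) {
    const res = await send();
    if (res.ok || !isRetryable(res.status) || attempt >= maxRetries) return res;
    await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
  }
}
```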

Rate Limits

Extract API rate limits by plan:
  • Free: 5 extractions per hour, 20 per day
  • Starter: 20 extractions per hour, 100 per day
  • Pro: 100 extractions per hour, 500 per day
  • Enterprise: Custom limits
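
To avoid hitting `rate_limited` responses, a client can track its own request times against the plan limits listed above. This is a minimal sketch with a hypothetical name, `createRateTracker`. It assumes sliding one-hour and 24-hour windows, whereas the server may count in fixed windows, so treat it as a conservative local guard rather than a mirror of server behavior.

```javascript
// Plan limits copied from the list above (Enterprise limits are custom).
const PLAN_LIMITS = {
  free: { perHour: 5, perDay: 20 },
  starter: { perHour: 20, perDay: 100 },
  pro: { perHour: 100, perDay: 500 },
};

// Tracks request timestamps and reports whether another request fits.
// `now` is injectable so the tracker can be tested with a fake clock.
function createRateTracker(plan, now = Date.now) {
  const limits = PLAN_LIMITS[plan];
  if (!limits) throw new Error(`no fixed limits known for plan: ${plan}`);
  const HOUR = 3600 * 1000;
  const DAY = 24 * HOUR;
  let sent = [];
  return {
    tryAcquire() {
      const t = now();
      sent = sent.filter((s) => t - s < DAY); // drop entries older than a day
      const lastHour = sent.filter((s) => t - s < HOUR).length;
      if (lastHour >= limits.perHour || sent.length >= limits.perDay) return false;
      sent.push(t);
      return true;
    },
  };
}
```

Calling `tryAcquire()` before each extraction and delaying when it returns `false` keeps the client inside the assumed limits.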

Use Cases

  • Monitor competitor prices across multiple product pages with consistent schema extraction for automated price tracking systems.
  • Extract contact information from business directories, company websites, and professional profiles for sales prospecting.
  • Analyze news articles, blog posts, and social media content for sentiment, key topics, and structured insights.
  • Extract property listings, prices, features, and contact information from real estate websites and portals.
  • Gather job posting data including requirements, salaries, and company information for market research.