Overview

WhizoAI’s AI extraction feature uses large language models (LLMs) to intelligently extract structured data from web pages. Simply define your schema and let the AI handle the extraction logic.

Supported Models

GPT-3.5 Turbo

3 credits/page. Fast and cost-effective for simple extraction.

GPT-4

6 credits/page. Advanced reasoning for complex data structures.

Claude 3 Sonnet

5 credits/page. Excellent for long-form content and nuanced extraction.

Basic Usage

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Extract structured data using AI
result = client.extract(
    url="https://example.com/product",
    schema={
        "name": "Product name",
        "price": "Product price",
        "description": "Product description",
        "inStock": "Is the product in stock (boolean)"
    },
    options={
        "model": "gpt-4",
        "format": "json"
    }
)

print(result["extractedData"])

Advanced Schema Definitions

Complex Data Structures

Extract nested objects and arrays:
schema = {
    "product": {
        "name": "Product name",
        "price": {
            "amount": "Price amount (number)",
            "currency": "Currency code (e.g., USD)"
        },
        "specs": "List of product specifications (array of objects with name and value)",
        "reviews": "Array of customer reviews with rating and comment"
    },
    "seller": {
        "name": "Seller name",
        "rating": "Seller rating (number)",
        "location": "Seller location"
    }
}

result = client.extract(url=product_url, schema=schema)

Type Hints and Validation

Specify data types for better accuracy:
schema = {
    "title": "string - Article title",
    "publishDate": "datetime - Publication date in ISO 8601 format",
    "author": "string - Author name",
    "wordCount": "number - Approximate word count",
    "tags": "array of strings - Article tags/categories",
    "isPremium": "boolean - Is this premium content?"
}

Extraction Options

Model Selection

Choose the best model for your use case:
# Fast and cost-effective
result = client.extract(
    url=url,
    schema=schema,
    options={"model": "gpt-3.5-turbo"}  # 3 credits
)

# Best quality for complex data
result = client.extract(
    url=url,
    schema=schema,
    options={"model": "gpt-4"}  # 6 credits
)

# Balanced performance
result = client.extract(
    url=url,
    schema=schema,
    options={"model": "claude-3-sonnet"}  # 5 credits
)

Custom Instructions

Add context for better extraction:
result = client.extract(
    url=url,
    schema=schema,
    options={
        "model": "gpt-4",
        "instructions": "Focus on extracting product details from the main content area. Ignore sidebar ads and recommendations.",
        "language": "en"
    }
)

Output Formatting

# JSON output (default)
result = client.extract(url=url, schema=schema, options={"format": "json"})

# Structured markdown
result = client.extract(url=url, schema=schema, options={"format": "markdown"})

# CSV for tabular data
result = client.extract(url=url, schema=schema, options={"format": "csv"})

Handling Large Pages

For content-heavy pages, optimize extraction:
result = client.extract(
    url=url,
    schema=schema,
    options={
        "model": "gpt-4",
        "maxTokens": 4000,  # Limit context size
        "targetSelector": "#main-content",  # Focus on specific element
        "removeSelectors": [".ads", ".sidebar", "footer"]  # Exclude noise
    }
)

Batch AI Extraction

Extract from multiple pages efficiently:
urls = [
    "https://example.com/product1",
    "https://example.com/product2",
    "https://example.com/product3"
]

result = client.batch_extract(
    urls=urls,
    schema=schema,
    options={
        "model": "gpt-3.5-turbo",
        "concurrency": 3
    }
)

# Process results
for page in result['results']:
    print(f"URL: {page['url']}")
    print(f"Data: {page['extractedData']}")

Error Handling

Handle extraction failures gracefully:
from whizoai import WhizoAIError  # the SDK error class raised by extract()

try:
    result = client.extract(url=url, schema=schema)

    if result['success']:
        data = result['extractedData']
        confidence = result['metadata']['confidence']  # 0-1 score

        if confidence < 0.7:
            print("Low confidence extraction. Manual review recommended.")

except WhizoAIError as e:
    if e.code == 'EXTRACTION_FAILED':
        print("AI couldn't extract data. Try simplifying the schema.")
    elif e.code == 'INSUFFICIENT_CREDITS':
        print("Not enough credits for AI extraction.")

Validation and Confidence Scores

WhizoAI provides confidence scores for extractions:
result = client.extract(url=url, schema=schema)

print(f"Overall Confidence: {result['metadata']['confidence']}")
print(f"Field Confidences: {result['metadata']['fieldConfidence']}")

# Example output:
# Overall Confidence: 0.92
# Field Confidences: {
#   'name': 0.98,
#   'price': 0.95,
#   'inStock': 0.85
# }

Common Use Cases

  • Extract product names, prices, descriptions, specs, and reviews from online stores
  • Parse job titles, salaries, requirements, and company info from career pages
  • Extract property details, prices, locations, and features from listing sites
  • Extract headlines, authors, publish dates, and article content from news sites
  • Extract company names, addresses, phone numbers, and services from directories
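As a concrete illustration of the second use case, a job-listing schema might look like the sketch below. The field names and descriptions are hypothetical examples, not a fixed WhizoAI format; adapt them to your target pages.

```python
# Hypothetical schema for extracting job postings from a career page.
# Field names and descriptions are illustrative, not a fixed format.
job_schema = {
    "title": "string - Job title",
    "company": "string - Hiring company name",
    "salary": {
        "min": "number - Lower bound of the salary range",
        "max": "number - Upper bound of the salary range",
        "currency": "string - Currency code (e.g., USD)",
    },
    "requirements": "array of strings - Required skills and qualifications",
    "remote": "boolean - Is remote work offered?",
}

# Used exactly like the product example earlier:
# result = client.extract(url="https://example.com/jobs/123", schema=job_schema)
```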

Best Practices

Cost Optimization
  • Start with GPT-3.5 for simple schemas
  • Use GPT-4 only for complex or nested data
  • Test your schema on a few pages before batch processing
Schema Design Tips
  • Be specific in field descriptions
  • Include data type hints (string, number, boolean, array)
  • Provide examples in descriptions when helpful
  • Keep schemas focused—extract only what you need
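The first two cost tips above can be sketched as a small helper. This is an illustrative heuristic, not part of the SDK: it guesses schema complexity from nested objects or array-type hints and picks a model accordingly, using the model names from this guide.

```python
def pick_model(schema: dict) -> str:
    """Heuristic model choice: nested objects or array fields suggest GPT-4.

    Illustrative only; model names match the ones used in this guide.
    """
    for value in schema.values():
        if isinstance(value, dict):  # nested object -> complex schema
            return "gpt-4"
        if isinstance(value, str) and "array" in value.lower():
            return "gpt-4"           # array hint -> complex schema
    return "gpt-3.5-turbo"           # flat schema -> cheaper model

flat = {"title": "string - Article title", "author": "string - Author name"}
nested = {"product": {"name": "Product name"}, "tags": "array of strings - Tags"}

print(pick_model(flat))    # gpt-3.5-turbo
print(pick_model(nested))  # gpt-4
```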

Credit Costs

Model             Cost per Page   Best For
GPT-3.5 Turbo     3 credits       Simple, flat data structures
GPT-4             6 credits       Complex nested data, high accuracy needed
Claude 3 Sonnet   5 credits       Long-form content, nuanced extraction
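Using the per-page costs above, you can estimate a batch job's credit spend up front. A simple sketch; the rates are taken from the table, not fetched from an SDK call.

```python
# Credits per page for each model, as listed in the table above.
CREDITS_PER_PAGE = {
    "gpt-3.5-turbo": 3,
    "gpt-4": 6,
    "claude-3-sonnet": 5,
}

def estimate_credits(num_pages: int, model: str) -> int:
    """Estimate total credits for extracting num_pages with the given model."""
    return num_pages * CREDITS_PER_PAGE[model]

print(estimate_credits(100, "gpt-3.5-turbo"))  # 300
print(estimate_credits(100, "gpt-4"))          # 600
```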

Comparison: AI vs Traditional Scraping

Feature         Traditional Scraping             AI Extraction
Setup Time      High (write custom selectors)    Low (define schema)
Adaptability    Breaks when HTML changes         Adapts to layout changes
Complex Data    Manual nested parsing needed     Handles nesting automatically
Cost            1 credit/page                    3-6 credits/page
Speed           Faster                           Slower (LLM processing)
Best For        Static, predictable layouts      Dynamic, complex structures

LLM SDK Integrations

Integrate WhizoAI with LangChain, LlamaIndex, and more

Batch Processing

Process thousands of extractions efficiently

Extract API Reference

Full API documentation for extraction endpoints