Overview

WhizoAI’s AI extraction feature uses large language models (LLMs) to intelligently extract structured data from web pages. Simply define your schema and let the AI handle the extraction logic.

Supported Models

GPT-3.5 Turbo

3 credits/page. Fast and cost-effective for simple extraction.

GPT-4

6 credits/page. Advanced reasoning for complex data structures.

Claude 3 Sonnet

5 credits/page. Excellent for long-form content and nuanced extraction.

Basic Usage

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Extract structured data using AI
result = client.extract(
    url="https://example.com/product",
    schema={
        "name": "Product name",
        "price": "Product price",
        "description": "Product description",
        "inStock": "Is the product in stock (boolean)"
    },
    options={
        "model": "gpt-4",
        "format": "json"
    }
)

print(result["extractedData"])

Advanced Schema Definitions

Complex Data Structures

Extract nested objects and arrays:
schema = {
    "product": {
        "name": "Product name",
        "price": {
            "amount": "Price amount (number)",
            "currency": "Currency code (e.g., USD)"
        },
        "specs": "List of product specifications (array of objects with name and value)",
        "reviews": "Array of customer reviews with rating and comment"
    },
    "seller": {
        "name": "Seller name",
        "rating": "Seller rating (number)",
        "location": "Seller location"
    }
}

result = client.extract(url=product_url, schema=schema)

Type Hints and Validation

Specify data types for better accuracy:
schema = {
    "title": "string - Article title",
    "publishDate": "datetime - Publication date in ISO 8601 format",
    "author": "string - Author name",
    "wordCount": "number - Approximate word count",
    "tags": "array of strings - Article tags/categories",
    "isPremium": "boolean - Is this premium content?"
}

Extraction Options

Model Selection

Choose the best model for your use case:
# Fast and cost-effective
result = client.extract(
    url=url,
    schema=schema,
    options={"model": "gpt-3.5-turbo"}  # 3 credits
)

# Best quality for complex data
result = client.extract(
    url=url,
    schema=schema,
    options={"model": "gpt-4"}  # 6 credits
)

# Balanced performance
result = client.extract(
    url=url,
    schema=schema,
    options={"model": "claude-3-sonnet"}  # 5 credits
)

Custom Instructions

Add context for better extraction:
result = client.extract(
    url=url,
    schema=schema,
    options={
        "model": "gpt-4",
        "instructions": "Focus on extracting product details from the main content area. Ignore sidebar ads and recommendations.",
        "language": "en"
    }
)

Output Formatting

# JSON output (default)
result = client.extract(url=url, schema=schema, options={"format": "json"})

# Structured markdown
result = client.extract(url=url, schema=schema, options={"format": "markdown"})

# CSV for tabular data
result = client.extract(url=url, schema=schema, options={"format": "csv"})

Handling Large Pages

For content-heavy pages, optimize extraction:
result = client.extract(
    url=url,
    schema=schema,
    options={
        "model": "gpt-4",
        "maxTokens": 4000,  # Limit context size
        "targetSelector": "#main-content",  # Focus on specific element
        "removeSelectors": [".ads", ".sidebar", "footer"]  # Exclude noise
    }
)

Batch AI Extraction

Extract from multiple pages efficiently:
urls = [
    "https://example.com/product1",
    "https://example.com/product2",
    "https://example.com/product3"
]

result = client.batch_extract(
    urls=urls,
    schema=schema,
    options={
        "model": "gpt-3.5-turbo",
        "concurrency": 3
    }
)

# Process results
for page in result['results']:
    print(f"URL: {page['url']}")
    print(f"Data: {page['extractedData']}")

Error Handling

Handle extraction failures gracefully:
from whizoai import WhizoAIError  # the SDK error class raised by extract()

try:
    result = client.extract(url=url, schema=schema)

    if result['success']:
        data = result['extractedData']
        confidence = result['metadata']['confidence']  # 0-1 score

        if confidence < 0.7:
            print("Low confidence extraction. Manual review recommended.")

except WhizoAIError as e:
    if e.code == 'EXTRACTION_FAILED':
        print("AI couldn't extract data. Try simplifying the schema.")
    elif e.code == 'INSUFFICIENT_CREDITS':
        print("Not enough credits for AI extraction.")

Validation and Confidence Scores

WhizoAI provides confidence scores for extractions:
result = client.extract(url=url, schema=schema)

print(f"Overall Confidence: {result['metadata']['confidence']}")
print(f"Field Confidences: {result['metadata']['fieldConfidence']}")

# Example output:
# Overall Confidence: 0.92
# Field Confidences: {
#   'name': 0.98,
#   'price': 0.95,
#   'inStock': 0.85
# }

Common Use Cases

  • Extract product names, prices, descriptions, specs, and reviews from online stores
  • Parse job titles, salaries, requirements, and company info from career pages
  • Extract property details, prices, locations, and features from listing sites
  • Extract headlines, authors, publish dates, and article content from news sites
  • Extract company names, addresses, phone numbers, and services from directories
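As a concrete illustration of the second use case, a job-listing schema might look like the sketch below. The field names and descriptions are hypothetical examples, not a fixed WhizoAI format; adapt them to your target pages.

```python
# Hypothetical schema for extracting job postings from a career page.
# Field names and descriptions are illustrative, not a fixed format.
job_schema = {
    "title": "string - Job title",
    "company": "string - Hiring company name",
    "salary": {
        "min": "number - Lower bound of the salary range",
        "max": "number - Upper bound of the salary range",
        "currency": "string - Currency code (e.g., USD)",
    },
    "requirements": "array of strings - Required skills and qualifications",
    "remote": "boolean - Is remote work offered?",
}

# Used exactly like the product example earlier:
# result = client.extract(url="https://example.com/jobs/123", schema=job_schema)
```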

Best Practices

Cost Optimization
  • Start with GPT-3.5 for simple schemas
  • Use GPT-4 only for complex or nested data
  • Test your schema on a few pages before batch processing
Schema Design Tips
  • Be specific in field descriptions
  • Include data type hints (string, number, boolean, array)
  • Provide examples in descriptions when helpful
  • Keep schemas focused—extract only what you need
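The first two cost tips above can be sketched as a small helper. This is an illustrative heuristic, not part of the SDK: it guesses schema complexity from nested objects or array-type hints and picks a model accordingly, using the model names from this guide.

```python
def pick_model(schema: dict) -> str:
    """Heuristic model choice: nested objects or array fields suggest GPT-4.

    Illustrative only; model names match the ones used in this guide.
    """
    for value in schema.values():
        if isinstance(value, dict):  # nested object -> complex schema
            return "gpt-4"
        if isinstance(value, str) and "array" in value.lower():
            return "gpt-4"           # array hint -> complex schema
    return "gpt-3.5-turbo"           # flat schema -> cheaper model

flat = {"title": "string - Article title", "author": "string - Author name"}
nested = {"product": {"name": "Product name"}, "tags": "array of strings - Tags"}

print(pick_model(flat))    # gpt-3.5-turbo
print(pick_model(nested))  # gpt-4
```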

Credit Costs

Model             Cost per Page   Best For
GPT-3.5 Turbo     3 credits       Simple, flat data structures
GPT-4             6 credits       Complex nested data, high accuracy needed
Claude 3 Sonnet   5 credits       Long-form content, nuanced extraction
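Using the per-page costs above, you can estimate a batch job's credit spend up front. A simple sketch; the rates are taken from the table, not fetched from an SDK call.

```python
# Credits per page for each model, as listed in the table above.
CREDITS_PER_PAGE = {
    "gpt-3.5-turbo": 3,
    "gpt-4": 6,
    "claude-3-sonnet": 5,
}

def estimate_credits(num_pages: int, model: str) -> int:
    """Estimate total credits for extracting num_pages with the given model."""
    return num_pages * CREDITS_PER_PAGE[model]

print(estimate_credits(100, "gpt-3.5-turbo"))  # 300
print(estimate_credits(100, "gpt-4"))          # 600
```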

Comparison: AI vs Traditional Scraping

Feature         Traditional Scraping             AI Extraction
Setup Time      High (write custom selectors)    Low (define schema)
Adaptability    Breaks when HTML changes         Adapts to layout changes
Complex Data    Manual nested parsing needed     Handles nesting automatically
Cost            1 credit/page                    3-6 credits/page
Speed           Faster                           Slower (LLM processing)
Best For        Static, predictable layouts      Dynamic, complex structures

LLM SDK Integrations

Integrate WhizoAI with LangChain, LlamaIndex, and more

Batch Processing

Process thousands of extractions efficiently

Extract API Reference

Full API documentation for extraction endpoints