
Overview

WhizoAI’s lightweight engine provides blazing-fast scraping for static websites that don’t require JavaScript rendering. Perfect for content-heavy sites, APIs, and bulk data collection.

Lightweight vs Browser Engines

| Feature | Lightweight | Playwright/Puppeteer |
| --- | --- | --- |
| Speed | ⚡ 5-10x faster | Standard |
| Cost | 1 credit | 2 credits (with JS) |
| JavaScript | ❌ No | ✅ Yes |
| Best For | Static HTML, APIs | SPAs, dynamic content |
| Memory | Low | High |
| Concurrent Requests | Very high | Limited |

Basic Usage

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Fast scraping with lightweight engine
result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",  # Default for static sites
        "format": "markdown"
    }
)

print(f"Processing time: {result['metadata']['processingTime']}ms")

Performance Optimization

Concurrent Processing

Process multiple URLs simultaneously:
from concurrent.futures import ThreadPoolExecutor

urls = ["https://example.com/page" + str(i) for i in range(100)]

def scrape_url(url):
    return client.scrape(url, options={"engine": "lightweight"})

with ThreadPoolExecutor(max_workers=20) as executor:
    results = list(executor.map(scrape_url, urls))

print(f"Scraped {len(results)} pages in parallel")

HTTP/2 Support

Enable HTTP/2 for better performance:
result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "http2": True,  # Enable HTTP/2
        "keepAlive": True  # Reuse connections
    }
)

Compression

Reduce bandwidth with compression:
result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "compression": True,  # Enable gzip/brotli
        "headers": {
            "Accept-Encoding": "gzip, deflate, br"
        }
    }
)

Selective Content Loading

Block Unnecessary Resources

result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "blockResources": ["image", "font", "media"],  # Don't load images/fonts
        "blockDomains": [
            "analytics.google.com",
            "facebook.com",
            "doubleclick.net"
        ]
    }
)

Target Specific Elements

Only extract what you need:
result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "targetSelector": "#main-content",  # Only extract this element
        "removeSelectors": [".ads", ".sidebar", "footer"]  # Remove noise
    }
)

Caching Strategy

Enable Result Caching

result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "useCache": True,
        "cacheTtl": 3600  # Cache for 1 hour
    }
)

# Subsequent requests within 1 hour return cached result (no credits charged)
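
Whether a given response came from the cache is typically exposed in the result metadata. A sketch, assuming a boolean cached field (the exact key may differ; check the API reference):

# "cached" is an assumed metadata field name
if result["metadata"].get("cached"):
    print("Cache hit: no credits charged")
else:
    print(f"Fresh fetch took {result['metadata']['processingTime']}ms")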

Cache Control Headers

Respect the site’s cache headers:
result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "respectCacheHeaders": True,  # Follow Cache-Control headers
        "headers": {
            "Cache-Control": "max-age=3600"
        }
    }
)

Timeout Management

Optimize Timeouts

result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "timeout": 10000,  # 10 second total timeout
        "connectionTimeout": 5000,  # 5 second connection timeout
        "readTimeout": 5000  # 5 second read timeout
    }
)

Parallel Timeout Strategy

from concurrent.futures import TimeoutError, ThreadPoolExecutor

def scrape_with_timeout(url, timeout=15):
    with ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(client.scrape, url, options={"engine": "lightweight"})
        try:
            return future.result(timeout=timeout)
        except TimeoutError:
            print(f"Timeout scraping {url}")
            return None

results = [scrape_with_timeout(url) for url in urls]

Format Optimization

Choose Efficient Formats

# Markdown - Balanced (default)
result = client.scrape(url, options={"format": "markdown"})

# HTML - Full content, larger size
result = client.scrape(url, options={"format": "html"})

# Text - Smallest, fastest
result = client.scrape(url, options={"format": "text"})

# JSON - Structured, easy to parse
result = client.scrape(url, options={"format": "json"})
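
To see which format is actually smallest for your target pages, compare payload sizes directly. A sketch, assuming the scraped body is returned under result["content"] (check your SDK's result shape):

for fmt in ["text", "markdown", "html"]:
    result = client.scrape(url, options={"engine": "lightweight", "format": fmt})
    content = result.get("content", "")  # "content" key is an assumption
    print(f"{fmt}: {len(content):,} characters")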

Batch Processing Best Practices

Chunked Processing

Process large batches in chunks:
def chunk_list(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

all_page_ids = range(10000)  # IDs for 10,000 pages
chunk_size = 100

for chunk in chunk_list(all_page_ids, chunk_size):
    result = client.batch_scrape(
        urls=[f"https://example.com/page{i}" for i in chunk],
        options={
            "engine": "lightweight",
            "concurrency": 20
        }
    )
    print(f"Processed chunk: {len(chunk)} URLs")

Rate Limit Management

Respect rate limits:
import time

for url in urls:
    result = client.scrape(url, options={"engine": "lightweight"})

    # Respect rate limits (adjust based on your plan)
    time.sleep(0.1)  # 100ms delay between requests
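
A fixed delay works for steady traffic, but bursts can still trip the limiter. A sketch of retry with exponential backoff, catching a generic exception because the SDK's rate-limit error class isn't shown here:

import time

def scrape_with_backoff(url, max_retries=5):
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.scrape(url, options={"engine": "lightweight"})
        except Exception as exc:  # ideally catch the SDK's rate-limit error specifically
            if attempt == max_retries - 1:
                raise
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2  # back off: 1s, 2s, 4s, 8s, ...

results = [scrape_with_backoff(url) for url in urls]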

Performance Benchmarks

Speed Comparison

| Pages | Lightweight | Playwright | Time Saved |
| --- | --- | --- | --- |
| 10 | 5 seconds | 30 seconds | 83% |
| 100 | 45 seconds | 5 minutes | 85% |
| 1,000 | 7 minutes | 50 minutes | 86% |

Cost Comparison

| Feature | Lightweight | With JavaScript |
| --- | --- | --- |
| Base scraping | 1 credit | 2 credits |
| 1,000 pages | 1,000 credits | 2,000 credits |
| Savings | 50% | - |

When to Use Lightweight Engine

The lightweight engine is a good fit for:
  • News sites, blogs, documentation sites, and government portals
  • REST APIs, XML feeds, and JSON endpoints
  • Large-scale scraping where speed matters
  • Traditional server-rendered HTML (PHP, Ruby, Python backends)

When NOT to Use Lightweight

Use browser engines instead for the cases below; for borderline sites, see the fallback sketch after this list:
  • Single Page Applications (React, Vue, Angular)
  • JavaScript-heavy websites
  • Sites requiring user interaction
  • Dynamic content that loads via AJAX
  • Infinite scroll pages
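
If you're not sure which category a site falls into, a common pattern is to try the lightweight engine first and fall back to a browser engine when the response looks like an empty JavaScript shell. A sketch, assuming "playwright" is a valid engine value and that the scraped body is returned under result["content"]:

def scrape_with_fallback(url, min_length=500):
    result = client.scrape(url, options={"engine": "lightweight", "format": "text"})
    content = result.get("content", "")  # "content" key is an assumption
    if len(content) >= min_length:
        return result
    # Thin body suggests JavaScript rendering is required; retry with a browser engine
    return client.scrape(url, options={"engine": "playwright"})  # engine name is an assumption

result = scrape_with_fallback("https://example.com")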

Monitoring Performance

result = client.scrape(
    url="https://example.com",
    options={
        "engine": "lightweight",
        "includeMetrics": True  # Include detailed metrics
    }
)

metrics = result['metadata']['performance']
print(f"DNS Lookup: {metrics['dnsTime']}ms")
print(f"Connection: {metrics['connectionTime']}ms")
print(f"TLS Handshake: {metrics['tlsTime']}ms")
print(f"Download: {metrics['downloadTime']}ms")
print(f"Total: {metrics['totalTime']}ms")

Common Optimizations

  • Reduce payload: block images, fonts, and ads to load only essential content
  • Parallel processing: use concurrent requests to maximize throughput
  • Smart caching: cache frequently accessed pages to save credits
  • Efficient formats: choose text or markdown over HTML for smaller payloads
  • Batch processing: process thousands of pages efficiently

Next Steps

  • Browser Automation: when you need JavaScript rendering
  • API Reference: full scraping API documentation