
Overview

WhizoAI parses and extracts content from several document formats: PDFs, Word documents (DOCX, DOC), Excel files, and images (via OCR).

Supported Formats

PDF Documents

Extract text, tables, and images from PDF files

Word Documents

Parse DOCX and DOC files and extract formatted content

Images (OCR)

Extract text from PNG, JPG, TIFF using OCR

PDF Parsing

Basic PDF Extraction

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Parse PDF document
result = client.parse_document(
    url="https://example.com/document.pdf",
    options={
        "format": "markdown",
        "extractImages": False
    }
)

print(result['content'])

Extract Text with Layout Preservation

result = client.parse_document(
    url="https://example.com/document.pdf",
    options={
        "format": "markdown",
        "preserveLayout": True,  # Maintain original formatting
        "includePageNumbers": True
    }
)

# Output includes page breaks and numbers
for page in result['pages']:
    print(f"Page {page['number']}:")
    print(page['content'])

Table Extraction from PDFs

result = client.parse_document(
    url="https://example.com/report.pdf",
    options={
        "extractTables": True,
        "tableFormat": "markdown"  # or "csv", "json"
    }
)

# Access extracted tables
for table in result['tables']:
    print(f"Table {table['index']} on page {table['page']}:")
    print(table['data'])  # Structured table data

Image Extraction from PDFs

result = client.parse_document(
    url="https://example.com/document.pdf",
    options={
        "extractImages": True,
        "imageFormat": "png",
        "minImageSize": 100  # Skip small images (pixels)
    }
)

# Download extracted images
for image in result['images']:
    print(f"Image URL: {image['url']}")
    print(f"Page: {image['page']}, Size: {image['width']}x{image['height']}")

Word Document Parsing

DOCX Extraction

result = client.parse_document(
    url="https://example.com/document.docx",
    options={
        "format": "markdown",
        "preserveFormatting": True,  # Keep bold, italic, etc.
        "extractComments": True,  # Include document comments
        "extractFootnotes": True
    }
)

print(result['content'])

Extract Document Metadata

result = client.parse_document(
    url="https://example.com/document.docx",
    options={
        "includeMetadata": True
    }
)

metadata = result['metadata']
print(f"Author: {metadata['author']}")
print(f"Created: {metadata['createdDate']}")
print(f"Modified: {metadata['modifiedDate']}")
print(f"Word Count: {metadata['wordCount']}")
print(f"Page Count: {metadata['pageCount']}")

OCR (Optical Character Recognition)

Extract Text from Images

result = client.parse_document(
    url="https://example.com/scanned-document.jpg",
    options={
        "ocr": True,
        "ocrLanguage": "eng",  # English
        "format": "text"
    }
)

print(result['content'])

Multi-Language OCR

result = client.parse_document(
    url="https://example.com/multilingual.png",
    options={
        "ocr": True,
        "ocrLanguage": "eng+fra+spa",  # English, French, Spanish
        "format": "text"
    }
)

Supported OCR Languages

Language               Code       Language      Code
English                eng        Spanish       spa
French                 fra        German        deu
Italian                ita        Portuguese    por
Chinese (Simplified)   chi_sim    Japanese      jpn
Korean                 kor        Arabic        ara
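
The multi-language example joins codes with "+". A minimal sketch of building that string from language names is shown here. The helper name build_ocr_language and the lookup dictionary are hypothetical and not part of the WhizoAI SDK; the codes are copied from the table above.

```python
# Codes copied from the "Supported OCR Languages" table.
OCR_LANGUAGE_CODES = {
    "english": "eng",
    "spanish": "spa",
    "french": "fra",
    "german": "deu",
    "italian": "ita",
    "portuguese": "por",
    "chinese (simplified)": "chi_sim",
    "japanese": "jpn",
    "korean": "kor",
    "arabic": "ara",
}


def build_ocr_language(languages):
    """Join language names into a '+'-separated code string (hypothetical helper)."""
    codes = []
    for name in languages:
        key = name.strip().lower()
        if key not in OCR_LANGUAGE_CODES:
            raise ValueError(f"Unsupported OCR language: {name!r}")
        code = OCR_LANGUAGE_CODES[key]
        if code not in codes:  # drop duplicates, keep first-seen order
            codes.append(code)
    if not codes:
        raise ValueError("At least one language is required")
    return "+".join(codes)


print(build_ocr_language(["English", "French", "Spanish"]))  # eng+fra+spa
```

The returned string can be passed as the "ocrLanguage" option.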

Excel & Spreadsheet Parsing

Parse Excel Files

result = client.parse_document(
    url="https://example.com/data.xlsx",
    options={
        "format": "json",
        "includeAllSheets": True
    }
)

# Access sheet data
for sheet in result['sheets']:
    print(f"Sheet: {sheet['name']}")
    print(f"Rows: {sheet['data']}")

Convert to CSV

result = client.parse_document(
    url="https://example.com/data.xlsx",
    options={
        "format": "csv",
        "sheetName": "Sales Q1"  # Specific sheet
    }
)

# Get CSV output
csv_data = result['content']

Batch Document Processing

Process Multiple Documents

document_urls = [
    "https://example.com/doc1.pdf",
    "https://example.com/doc2.docx",
    "https://example.com/doc3.xlsx"
]

result = client.batch_parse_documents(
    urls=document_urls,
    options={
        "format": "markdown",
        "extractTables": True
    }
)

for doc in result['documents']:
    print(f"Processed: {doc['url']}")
    print(f"Pages: {doc['pageCount']}")
    print(f"Content length: {len(doc['content'])}")
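
A batch result can be summarized before further processing. The sketch below assumes the response has the shape used in the loop above (a 'documents' list whose entries carry 'pageCount' and 'content'); summarize_batch is a hypothetical helper, and the sample dictionary stands in for a real response.

```python
def summarize_batch(result):
    """Total up documents, pages, and characters in a batch result (hypothetical helper)."""
    documents = result.get("documents") or []
    return {
        "documents": len(documents),
        # Missing or null fields count as zero rather than raising.
        "pages": sum(doc.get("pageCount") or 0 for doc in documents),
        "characters": sum(len(doc.get("content") or "") for doc in documents),
    }


# Stand-in for a real batch response; the values are made up for illustration.
sample = {
    "documents": [
        {"url": "https://example.com/doc1.pdf", "pageCount": 3, "content": "# Title\n"},
        {"url": "https://example.com/doc2.docx", "pageCount": 2, "content": "Body"},
    ]
}

print(summarize_batch(sample))
```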

Advanced Features

PDF Page Range Selection

result = client.parse_document(
    url="https://example.com/large-document.pdf",
    options={
        "pages": "1-5,10,15-20",  # Specific pages only
        "format": "markdown"
    }
)
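
To check how many pages a "pages" string selects before sending a request, the string can be expanded locally. This is a minimal sketch that assumes pages are 1-based and ranges are inclusive; the service's exact interpretation is not stated here, and parse_page_spec is a hypothetical helper, not an SDK function.

```python
def parse_page_spec(spec):
    """Expand a spec such as '1-5,10,15-20' into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"Empty entry in page spec: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Assumption: pages are numbered from 1 and ranges run low to high.
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


selected = parse_page_spec("1-5,10,15-20")
print(len(selected))  # 12
```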

Password-Protected Documents

result = client.parse_document(
    url="https://example.com/secure.pdf",
    options={
        "password": "document_password",
        "format": "text"
    }
)

Form Field Extraction

Extract data from PDF forms:

result = client.parse_document(
    url="https://example.com/form.pdf",
    options={
        "extractFormFields": True
    }
)

# Access form data
for field in result['formFields']:
    print(f"{field['name']}: {field['value']}")

AI-Powered Document Analysis

Structured Data Extraction from Documents

result = client.extract_from_document(
    url="https://example.com/invoice.pdf",
    schema={
        "invoiceNumber": "Invoice number",
        "date": "Invoice date",
        "total": "Total amount (number)",
        "items": "List of line items with description and price"
    },
    options={
        "model": "gpt-4"
    }
)

print(result['extractedData'])
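
LLM-based extraction can leave fields empty, so it may help to compare the result against the schema before using it. The sketch below assumes 'extractedData' is a dictionary keyed by the schema's field names, which is not confirmed on this page; missing_fields is a hypothetical helper and the sample values are invented.

```python
def missing_fields(schema, extracted):
    """Return schema keys that are absent from, or empty in, the extracted data."""
    empty_values = (None, "", [], {})
    return [
        key for key in schema
        if key not in extracted or extracted[key] in empty_values
    ]


schema = {
    "invoiceNumber": "Invoice number",
    "date": "Invoice date",
    "total": "Total amount (number)",
    "items": "List of line items with description and price",
}

# Stand-in for result['extractedData']; values are made up for illustration.
extracted = {"invoiceNumber": "INV-001", "date": "", "total": 120.5}

print(missing_fields(schema, extracted))  # ['date', 'items']
```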

Document Classification

result = client.classify_document(
    url="https://example.com/document.pdf",
    categories=["invoice", "contract", "resume", "report"],
    options={
        "model": "gpt-3.5-turbo"
    }
)

print(f"Document type: {result['classification']}")
print(f"Confidence: {result['confidence']}")

Error Handling

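# WhizoAIError must be imported from the SDK; the package-level import
# below is an assumption.
from whizoai import WhizoAIError

try:
    result = client.parse_document(
        url="https://example.com/document.pdf",
        options={"format": "markdown"}
    )
    print(result['content'])

except WhizoAIError as e:
    if e.code == 'UNSUPPORTED_FORMAT':
        print("Document format not supported")
    elif e.code == 'PASSWORD_REQUIRED':
        print("Document is password protected")
    elif e.code == 'CORRUPTED_FILE':
        print("Document file is corrupted or invalid")
    else:
        # Unrecognized error code: let it propagate
        raise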

Credit Costs

Operation                      Cost
PDF Parsing (per page)         1 credit
DOCX Parsing (per page)        1 credit
OCR (per image)                2 credits
Table Extraction               +1 credit per table
AI Extraction from Document    +3-6 credits (LLM cost)
Image Extraction from PDF      Included
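
A rough estimate can be computed from the table above before a job is submitted. This is a sketch only: it assumes the charges simply add up and that the AI extraction surcharge applies once per document, neither of which is stated here. estimate_credits is a hypothetical helper, not an SDK function.

```python
def estimate_credits(pages=0, ocr_images=0, tables=0, ai_extraction=False):
    """Return a (low, high) credit estimate from the Credit Costs table."""
    if min(pages, ocr_images, tables) < 0:
        raise ValueError("Counts must not be negative")
    # 1 credit per page, 2 per OCR image, +1 per extracted table.
    base = pages * 1 + ocr_images * 2 + tables * 1
    if ai_extraction:
        # AI extraction adds 3 to 6 credits; assumed to apply once per document.
        return base + 3, base + 6
    return base, base


# A 12-page PDF with 3 tables and AI extraction enabled.
print(estimate_credits(pages=12, tables=3, ai_extraction=True))  # (18, 21)
```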

Performance Tips

Optimize Large Documents
  • Extract specific pages instead of the full document
  • Disable image extraction when it is not needed
  • Use lower quality settings for faster processing

File Size Limits
  • Maximum file size: 100 MB
  • Split larger files into smaller parts
  • Consider using pagination for very long documents
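
The first tip, extracting specific pages rather than the full document, can be sketched as a helper that splits a page count into strings suitable for the "pages" option. It assumes 1-based, inclusive ranges, and chunk_page_ranges is a hypothetical name. Note that this reduces work per request but does not get around the file size limit, since the same file is still fetched each time.

```python
def chunk_page_ranges(total_pages, chunk_size):
    """Split pages 1..total_pages into range strings of at most chunk_size pages."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be at least 1")
    ranges = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single-page chunk is written as "41" rather than "41-41".
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(chunk_page_ranges(45, 20))  # ['1-20', '21-40', '41-45']
```

Each string can then be sent in its own parse_document call with options={"pages": page_range, "format": "markdown"}.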

Common Use Cases

  • Extract invoice numbers, amounts, and line items from PDF invoices
  • Extract candidate information from resume PDF and DOCX files
  • Extract key terms, dates, and parties from legal contracts
  • Convert legacy documents to structured data formats
  • Extract amounts, dates, and merchant information from receipt images

Integration Examples

With AI Extraction

# First parse the document
parsed = client.parse_document(
    url="https://example.com/invoice.pdf",
    options={"format": "markdown"}
)

# Then extract structured data
extracted = client.extract(
    content=parsed['content'],  # Use parsed content
    schema={
        "invoiceNumber": "Invoice number",
        "amount": "Total amount",
        "items": "Line items"
    }
)

With Webhooks

result = client.parse_document(
    url="https://example.com/large-report.pdf",
    options={
        "webhook": "https://your-server.com/webhook",
        "format": "markdown"
    }
)

# Webhook receives result when processing completes
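
On the receiving side, a webhook endpoint could look like the following minimal sketch, built on Python's standard http.server module. The payload format is not documented on this page: the code assumes a JSON object sent by POST that may carry a 'content' key like the synchronous result. handle_payload, WebhookHandler, and run_server are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_payload(body):
    """Decode a webhook body (bytes) into a dictionary."""
    payload = json.loads(body.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("Expected a JSON object")
    # Assumption: the payload mirrors the parse result and may include 'content'.
    return {"content": payload.get("content"), "raw": payload}


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            result = handle_payload(self.rfile.read(length))
        except ValueError:
            # Covers invalid JSON, invalid UTF-8, and non-object payloads.
            self.send_response(400)
            self.end_headers()
            return
        print("Received webhook with keys:", sorted(result["raw"]))
        self.send_response(204)  # Acknowledge with no response body
        self.end_headers()


def run_server(port=8000):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling run_server() starts the listener. A production endpoint would also need HTTPS and some way to verify that requests really come from WhizoAI.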

AI Extraction

Extract structured data from parsed documents

Batch Processing

Process multiple documents efficiently

Webhooks

Get notifications when document parsing completes