Document Parsing - WhizoAI API Documentation

Overview

WhizoAI parses and extracts content from PDFs, Word documents (DOC and DOCX), Excel spreadsheets, and images (via OCR).

Supported Formats

PDF Documents

Extract text, tables, and images from PDF files

Word Documents

Parse DOCX and DOC files and extract formatted content

Images (OCR)

Extract text from PNG, JPG, and TIFF images using OCR

PDF Parsing

Basic PDF Extraction

from whizoai import WhizoAI

client = WhizoAI(api_key="whizo_YOUR-API-KEY")

# Parse PDF document
result = client.parse_document(
    url="https://example.com/document.pdf",
    options={
        "format": "markdown",
        "extractImages": False
    }
)

print(result['content'])

Extract Text with Layout Preservation

result = client.parse_document(
    url="https://example.com/document.pdf",
    options={
        "format": "markdown",
        "preserveLayout": True,  # Maintain original formatting
        "includePageNumbers": True
    }
)

# Output includes page breaks and numbers
for page in result['pages']:
    print(f"Page {page['number']}:")
    print(page['content'])

Table Extraction from PDFs

result = client.parse_document(
    url="https://example.com/report.pdf",
    options={
        "extractTables": True,
        "tableFormat": "markdown"  # or "csv", "json"
    }
)

# Access extracted tables
for table in result['tables']:
    print(f"Table {table['index']} on page {table['page']}:")
    print(table['data'])  # Structured table data

Image Extraction from PDFs

result = client.parse_document(
    url="https://example.com/document.pdf",
    options={
        "extractImages": True,
        "imageFormat": "png",
        "minImageSize": 100  # Skip small images (pixels)
    }
)

# List extracted images (each entry carries a download URL)
for image in result['images']:
    print(f"Image URL: {image['url']}")
    print(f"Page: {image['page']}, Size: {image['width']}x{image['height']}")

Word Document Parsing

DOCX Extraction

result = client.parse_document(
    url="https://example.com/document.docx",
    options={
        "format": "markdown",
        "preserveFormatting": True,  # Keep bold, italic, etc.
        "extractComments": True,  # Include document comments
        "extractFootnotes": True
    }
)

print(result['content'])

Extract Document Metadata

result = client.parse_document(
    url="https://example.com/document.docx",
    options={
        "includeMetadata": True
    }
)

metadata = result['metadata']
print(f"Author: {metadata['author']}")
print(f"Created: {metadata['createdDate']}")
print(f"Modified: {metadata['modifiedDate']}")
print(f"Word Count: {metadata['wordCount']}")
print(f"Page Count: {metadata['pageCount']}")

OCR (Optical Character Recognition)

Extract Text from Images

result = client.parse_document(
    url="https://example.com/scanned-document.jpg",
    options={
        "ocr": True,
        "ocrLanguage": "eng",  # English
        "format": "text"
    }
)

print(result['content'])

Multi-Language OCR

result = client.parse_document(
    url="https://example.com/multilingual.png",
    options={
        "ocr": True,
        "ocrLanguage": "eng+fra+spa",  # English, French, Spanish
        "format": "text"
    }
)

Supported OCR Languages

Language	Code	Language	Code
English	`eng`	Spanish	`spa`
French	`fra`	German	`deu`
Italian	`ita`	Portuguese	`por`
Chinese (Simplified)	`chi_sim`	Japanese	`jpn`
Korean	`kor`	Arabic	`ara`
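
The multi-language value is a `+`-joined list of the codes in the table above. As a minimal sketch, the string can be built and checked locally before the request is sent; `build_ocr_language` and `SUPPORTED_OCR_CODES` are hypothetical helpers for illustration, not part of the WhizoAI SDK, and the set only mirrors the codes listed on this page.

```python
# Codes copied from the "Supported OCR Languages" table on this page.
SUPPORTED_OCR_CODES = {
    "eng", "spa", "fra", "deu", "ita",
    "por", "chi_sim", "jpn", "kor", "ara",
}


def build_ocr_language(codes):
    """Join OCR language codes with '+', rejecting codes not listed above."""
    if not codes:
        raise ValueError("at least one language code is required")
    unknown = [code for code in codes if code not in SUPPORTED_OCR_CODES]
    if unknown:
        raise ValueError(f"unsupported OCR language codes: {unknown}")
    # dict.fromkeys drops duplicates while keeping the original order.
    return "+".join(dict.fromkeys(codes))


print(build_ocr_language(["eng", "fra", "spa"]))
```

The returned string can then be passed as the `ocrLanguage` option.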

Excel & Spreadsheet Parsing

Parse Excel Files

result = client.parse_document(
    url="https://example.com/data.xlsx",
    options={
        "format": "json",
        "includeAllSheets": True
    }
)

# Access sheet data
for sheet in result['sheets']:
    print(f"Sheet: {sheet['name']}")
    print(f"Rows: {sheet['data']}")

Convert to CSV

result = client.parse_document(
    url="https://example.com/data.xlsx",
    options={
        "format": "csv",
        "sheetName": "Sales Q1"  # Specific sheet
    }
)

# Get CSV output
csv_data = result['content']
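
Assuming `result['content']` is a CSV string with a header row (the response shape is not specified on this page), Python's standard `csv` module can turn it into row dictionaries. The sample text below is invented for illustration and stands in for `csv_data`.

```python
import csv
import io


def csv_to_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


# Invented sample standing in for csv_data = result['content'].
sample = "region,revenue\nNorth,1200\nSouth,950\n"
rows = csv_to_rows(sample)
for row in rows:
    print(row["region"], row["revenue"])
```

Values come back as strings, so numeric columns need an explicit conversion such as `int(row["revenue"])`.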

Batch Document Processing

Process Multiple Documents

document_urls = [
    "https://example.com/doc1.pdf",
    "https://example.com/doc2.docx",
    "https://example.com/doc3.xlsx"
]

result = client.batch_parse_documents(
    urls=document_urls,
    options={
        "format": "markdown",
        "extractTables": True
    }
)

for doc in result['documents']:
    print(f"Processed: {doc['url']}")
    print(f"Pages: {doc['pageCount']}")
    print(f"Content length: {len(doc['content'])}")
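
Before submitting a batch, it may be worth filtering out URLs whose file extension is not among the formats named on this page. The sketch below is a local pre-check only; `split_by_support` and `SUPPORTED_EXTENSIONS` are hypothetical names, and the service may detect formats differently (for example by content type rather than extension).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Extensions inferred from the formats mentioned on this page (an assumption).
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc", ".xlsx", ".png", ".jpg", ".tiff"}


def split_by_support(urls):
    """Return (supported, unsupported) URL lists based on file extension."""
    supported, unsupported = [], []
    for url in urls:
        # Only the URL path is inspected, so query strings are ignored.
        suffix = PurePosixPath(urlparse(url).path).suffix.lower()
        if suffix in SUPPORTED_EXTENSIONS:
            supported.append(url)
        else:
            unsupported.append(url)
    return supported, unsupported


ok, skipped = split_by_support([
    "https://example.com/doc1.pdf",
    "https://example.com/doc2.docx",
    "https://example.com/notes.txt",
])
print(f"{len(ok)} supported, {len(skipped)} skipped")
```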

Advanced Features

PDF Page Range Selection

result = client.parse_document(
    url="https://example.com/large-document.pdf",
    options={
        "pages": "1-5,10,15-20",  # Specific pages only
        "format": "markdown"
    }
)
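
To see which pages a `pages` value selects, the specification can be expanded locally. This sketch assumes the syntax shown above (comma-separated, 1-based page numbers and inclusive ranges); `expand_page_spec` is a hypothetical helper, not part of the WhizoAI SDK.

```python
def expand_page_spec(spec: str) -> list[int]:
    """Expand a spec such as "1-5,10,15-20" into a sorted list of page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            raise ValueError(f"empty entry in page spec: {spec!r}")
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


print(expand_page_spec("1-5,10,15-20"))
```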

Password-Protected Documents

result = client.parse_document(
    url="https://example.com/secure.pdf",
    options={
        "password": "document_password",
        "format": "text"
    }
)

Form Field Extraction

Extract data from PDF forms:

result = client.parse_document(
    url="https://example.com/form.pdf",
    options={
        "extractFormFields": True
    }
)

# Access form data
for field in result['formFields']:
    print(f"{field['name']}: {field['value']}")

AI-Powered Document Analysis

Structured Data Extraction from Documents

result = client.extract_from_document(
    url="https://example.com/invoice.pdf",
    schema={
        "invoiceNumber": "Invoice number",
        "date": "Invoice date",
        "total": "Total amount (number)",
        "items": "List of line items with description and price"
    },
    options={
        "model": "gpt-4"
    }
)

print(result['extractedData'])
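
LLM-based extraction can leave fields empty, so a quick completeness check against the schema is often useful. The sketch below assumes `result['extractedData']` is a dictionary keyed by the schema's field names; `missing_fields` is a hypothetical helper and the sample values are invented.

```python
def missing_fields(schema: dict, extracted: dict) -> list[str]:
    """Return schema keys that are absent or empty in the extracted data."""
    return [
        key for key in schema
        if extracted.get(key) in (None, "", [])
    ]


schema = {
    "invoiceNumber": "Invoice number",
    "date": "Invoice date",
    "total": "Total amount (number)",
    "items": "List of line items with description and price",
}
# Invented sample standing in for result['extractedData'].
extracted = {"invoiceNumber": "INV-001", "total": 0, "items": []}
print(missing_fields(schema, extracted))
```

A numeric `0` counts as present here; only `None`, empty strings, and empty lists are treated as missing.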

Document Classification

result = client.classify_document(
    url="https://example.com/document.pdf",
    categories=["invoice", "contract", "resume", "report"],
    options={
        "model": "gpt-3.5-turbo"
    }
)

print(f"Document type: {result['classification']}")
print(f"Confidence: {result['confidence']}")
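
A common follow-up is to accept the label only when the confidence is high enough. This sketch assumes `confidence` is a number between 0 and 1, which this page does not state; the 0.8 threshold and the `needs_review` fallback label are arbitrary illustrative choices.

```python
def route_document(result: dict, threshold: float = 0.8) -> str:
    """Return the predicted category, or "needs_review" below the threshold."""
    if result["confidence"] >= threshold:
        return result["classification"]
    return "needs_review"


# Invented sample results with the two fields shown above.
print(route_document({"classification": "invoice", "confidence": 0.93}))
print(route_document({"classification": "contract", "confidence": 0.41}))
```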

Error Handling

from whizoai import WhizoAIError  # Assumption: the exception is exported by the whizoai package

try:
    result = client.parse_document(
        url="https://example.com/document.pdf",
        options={"format": "markdown"}
    )
except WhizoAIError as e:
    if e.code == 'UNSUPPORTED_FORMAT':
        print("Document format not supported")
    elif e.code == 'PASSWORD_REQUIRED':
        print("Document is password protected")
    elif e.code == 'CORRUPTED_FILE':
        print("Document file is corrupted or invalid")
    else:
        raise  # Re-raise error codes not handled above
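
As the number of handled codes grows, a lookup table can be easier to maintain than a chain of comparisons. The sketch below uses only the three error codes shown above; `describe_error` is a hypothetical helper and the fallback message is an illustrative choice.

```python
# Messages for the error codes documented on this page.
ERROR_MESSAGES = {
    "UNSUPPORTED_FORMAT": "Document format not supported",
    "PASSWORD_REQUIRED": "Document is password protected",
    "CORRUPTED_FILE": "Document file is corrupted or invalid",
}


def describe_error(code: str) -> str:
    """Map an error code to a readable message, with a generic fallback."""
    return ERROR_MESSAGES.get(code, f"Unexpected error code: {code}")


print(describe_error("PASSWORD_REQUIRED"))
```

Inside an exception handler this would be called as `describe_error(e.code)`.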

Credit Costs

Operation	Cost
PDF Parsing (per page)	1 credit
DOCX Parsing (per page)	1 credit
OCR (per image)	2 credits
Table Extraction	+1 credit per table
AI Extraction from Document	+3-6 credits (LLM cost)
Image Extraction from PDF	Included
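
For budgeting, the table can be turned into a rough estimate. This sketch assumes the charges simply add up and that the AI extraction surcharge applies once per document; neither is confirmed here, and `estimate_credits` is a hypothetical helper.

```python
def estimate_credits(pages=0, tables=0, ocr_images=0, ai_extraction=False):
    """Return a (low, high) credit estimate based on the cost table."""
    # 1 credit per page, +1 per extracted table, 2 per OCR image.
    base = pages * 1 + tables * 1 + ocr_images * 2
    if ai_extraction:
        # AI extraction adds 3 to 6 credits depending on LLM cost.
        return base + 3, base + 6
    return base, base


low, high = estimate_credits(pages=10, tables=3, ai_extraction=True)
print(f"Estimated cost: {low}-{high} credits")
```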

Performance Tips

Optimize Large Documents

Extract specific pages instead of full document
Disable image extraction if not needed
Use lower quality settings for faster processing

File Size Limits

Maximum file size: 100MB
For larger files, split into smaller parts
Consider using pagination for very long documents
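
One way to paginate a long document is to request it in page-range chunks through the `pages` option. The sketch below only generates the range strings; it assumes the total page count is already known, and the chunk size of 50 is an arbitrary illustrative value. `page_chunks` is a hypothetical helper.

```python
def page_chunks(total_pages: int, chunk_size: int = 50) -> list[str]:
    """Split 1..total_pages into range strings such as "1-50", "51-100"."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A single-page chunk is written as "7" rather than "7-7".
        chunks.append(f"{start}-{end}" if end > start else str(start))
    return chunks


print(page_chunks(120, 50))
```

Each string can then be sent as the `pages` value of a separate `parse_document` call.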

Common Use Cases

Invoice Processing

Extract invoice numbers, amounts, and line items from PDF invoices

Resume Parsing

Extract candidate information from resume PDFs/DOCX files

Contract Analysis

Extract key terms, dates, and parties from legal contracts

Data Migration

Convert legacy documents to structured data formats

Receipt OCR

Extract amounts, dates, and merchant info from receipt images

Integration Examples

With AI Extraction

# First parse the document
parsed = client.parse_document(
    url="https://example.com/invoice.pdf",
    options={"format": "markdown"}
)

# Then extract structured data
extracted = client.extract(
    content=parsed['content'],  # Use parsed content
    schema={
        "invoiceNumber": "Invoice number",
        "amount": "Total amount",
        "items": "Line items"
    }
)

With Webhooks

result = client.parse_document(
    url="https://example.com/large-report.pdf",
    options={
        "webhook": "https://your-server.com/webhook",
        "format": "markdown"
    }
)

# Webhook receives result when processing completes
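
On the receiving side, the webhook URL needs an HTTP endpoint that accepts the POSTed result. The following is a minimal standard-library sketch; the payload format is not documented on this page, so it only assumes a JSON object body and logs its keys. `parse_webhook_body`, `WebhookHandler`, and `serve` are hypothetical names, and a production endpoint would also need authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_webhook_body(raw: bytes) -> dict:
    """Decode a webhook body, assuming it is a JSON object."""
    payload = json.loads(raw.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = parse_webhook_body(self.rfile.read(length))
        except ValueError:
            # Covers invalid UTF-8, invalid JSON, and non-object bodies.
            self.send_response(400)
            self.end_headers()
            return
        print("Webhook received with keys:", sorted(payload))
        self.send_response(204)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted (not called automatically)."""
    HTTPServer(("", port), WebhookHandler).serve_forever()
```

Calling `serve()` starts the listener; the port must match the one exposed at the webhook URL.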

Related Resources

AI Extraction

Extract structured data from parsed documents

Batch Processing

Process multiple documents efficiently

Webhooks

Get notifications when document parsing completes
