Unstructured Folder Loader
Use Unstructured.io to load data from a folder. Note: .png and .heic files are not supported until unstructured is updated.
The Unstructured Folder Loader uses Unstructured.io to load and process multiple documents from a folder. It provides advanced document parsing with configuration options for OCR, chunking, and metadata extraction.
The loader does not support .png and .heic files until unstructured is updated.
Batch processing of multiple documents
Multiple processing strategies
OCR support with 15+ languages
Flexible chunking strategies
Table structure inference
XML processing options
Page break handling
Coordinate extraction
Metadata customization
Default API URL: http://localhost:8000/general/v0/general
Can be configured via environment variable: UNSTRUCTURED_API_URL
Optional API key authentication
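The connection settings above can be sketched in a few lines of Python. This is an illustration only, not the loader's own code: the helper names `resolve_api_url` and `build_headers` are hypothetical, and the precedence (explicit value, then `UNSTRUCTURED_API_URL`, then the default) is an assumption. The `unstructured-api-key` header name follows the Unstructured API convention; verify it against the version you run.

```python
import os

DEFAULT_API_URL = "http://localhost:8000/general/v0/general"


def resolve_api_url(explicit_url=None, env=None):
    """Pick the API URL: explicit value, then UNSTRUCTURED_API_URL, then the default.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    return explicit_url or env.get("UNSTRUCTURED_API_URL") or DEFAULT_API_URL


def build_headers(api_key=None):
    """Return request headers, adding the API key only when one is supplied."""
    headers = {"accept": "application/json"}
    if api_key:
        # Header name assumed from the Unstructured API convention.
        headers["unstructured-api-key"] = api_key
    return headers
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.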
Folder Path: Path to the folder containing documents to process
Unstructured API URL: API endpoint (default: http://localhost:8000/general/v0/general)
Strategy: Processing strategy (default: auto)
hi_res: High resolution processing
fast: Quick processing
ocr_only: OCR-focused processing
auto: Automatic selection
Encoding: Document encoding (default: utf-8)
OCR Languages: Multiple language support including:
English (eng)
Spanish (spa)
Mandarin Chinese (cmn)
Hindi (hin)
Arabic (ara)
Portuguese (por)
Bengali (ben)
Russian (rus)
Japanese (jpn)
And more...
Skip Infer Table Types: File types to skip table extraction (default: ["pdf", "jpg", "png"])
Hi-Res Model Name: Model selection for hi_res strategy (default: detectron2_onnx)
chipper: Unstructured's in-house VDU model
detectron2_onnx: Facebook AI's fast object detection
yolox: Single-stage real-time detector
yolox_quantized: Optimized YOLOX version
Coordinates: Extract element coordinates (default: false)
Include Page Breaks: Include page break elements
XML Keep Tags: Preserve XML tags
Multi-Page Sections: Handle multi-page sections
Chunking Strategy: Text chunking method (default: by_title)
None: No chunking
by_title: Chunk by document titles
Combine Under N Chars: Minimum chunk size
New After N Chars: Soft maximum chunk size
Max Characters: Hard maximum chunk size (default: 500)
Source ID Key: Key for document source identification (default: source)
Additional Metadata: Custom metadata as JSON
Omit Metadata Keys: Keys to exclude from metadata
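To show how the parameters above relate to each other, here is a minimal Python sketch that validates a set of options before they are sent. The function `build_options` and the dictionary keys are hypothetical and may not match the exact field names the API expects. The ordering check (combine size, then soft maximum, then hard maximum) is an assumption inferred from the parameter descriptions, as is sending the hi-res model only when the strategy is `hi_res`.

```python
VALID_STRATEGIES = {"hi_res", "fast", "ocr_only", "auto"}
VALID_HI_RES_MODELS = {"chipper", "detectron2_onnx", "yolox", "yolox_quantized"}


def build_options(strategy="auto", encoding="utf-8", ocr_languages=(),
                  hi_res_model_name="detectron2_onnx", coordinates=False,
                  chunking_strategy="by_title", combine_under_n_chars=None,
                  new_after_n_chars=None, max_characters=500):
    """Validate loader options and return them as a plain dict (illustrative keys)."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy!r}")
    if hi_res_model_name not in VALID_HI_RES_MODELS:
        raise ValueError(f"unknown hi-res model: {hi_res_model_name!r}")
    if chunking_strategy not in (None, "by_title"):
        raise ValueError(f"unknown chunking strategy: {chunking_strategy!r}")

    options = {"strategy": strategy, "encoding": encoding, "coordinates": coordinates}
    if ocr_languages:
        options["ocr_languages"] = list(ocr_languages)
    if strategy == "hi_res":
        # The model choice only applies to the hi_res strategy.
        options["hi_res_model_name"] = hi_res_model_name

    if chunking_strategy is not None:
        # Assumed ordering: combine size <= soft maximum <= hard maximum.
        sizes = [combine_under_n_chars, new_after_n_chars, max_characters]
        given = [size for size in sizes if size is not None]
        if given != sorted(given):
            raise ValueError(
                "expected combine_under_n_chars <= new_after_n_chars <= max_characters"
            )
        options["chunking_strategy"] = chunking_strategy
        options["max_characters"] = max_characters
        if combine_under_n_chars is not None:
            options["combine_under_n_chars"] = combine_under_n_chars
        if new_after_n_chars is not None:
            options["new_after_n_chars"] = new_after_n_chars
    return options
```

For example, `build_options(strategy="hi_res", ocr_languages=("eng", "spa"))` keeps the default model and chunk limit, while `build_options(new_after_n_chars=600)` raises because the soft maximum exceeds the default hard maximum of 500.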
Documents: .doc, .docx, .odt, .ppt, .pptx, .pdf
Spreadsheets: .xls, .xlsx
Text: .txt, .text, .md, .rtf
Web: .html, .htm
Email: .eml, .msg
Images: .jpg, .jpeg (Note: .png and .heic currently unsupported)
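A pre-flight check against the extension list above can catch unsupported files before any API call is made. The Python sketch below is an illustration under stated assumptions: the names `is_supported` and `list_supported_files` are hypothetical, matching is case-insensitive, and the folder scan is non-recursive, which may differ from how the loader itself walks the folder.

```python
from pathlib import Path

# Extensions taken from the supported file type list; .png and .heic are excluded.
SUPPORTED_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".ppt", ".pptx", ".pdf",
    ".xls", ".xlsx",
    ".txt", ".text", ".md", ".rtf",
    ".html", ".htm",
    ".eml", ".msg",
    ".jpg", ".jpeg",
}


def is_supported(filename):
    """Return True when the file extension is in the supported set (case-insensitive)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def list_supported_files(folder):
    """List supported files directly inside `folder` (non-recursive by assumption)."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and is_supported(path.name)
    )
```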
Each processed document includes:
pageContent: Extracted text content
metadata:
source: Document source identifier
Additional metadata from processing
Custom metadata (if specified)
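The output shape above, together with the Source ID Key, Additional Metadata, and Omit Metadata Keys parameters, can be sketched as follows. This is a hypothetical Python helper, not the loader's implementation: the merge order (processing metadata, then the source key, then custom metadata, then omissions) is an assumption, and the example field names `page_number` and `filetype` are illustrative.

```python
import json


def build_document(text, source, source_id_key="source",
                   processing_metadata=None, additional_metadata=None,
                   omit_keys=()):
    """Assemble a document dict with `pageContent` and merged `metadata`."""
    metadata = dict(processing_metadata or {})
    metadata[source_id_key] = source

    # Additional metadata may arrive as a JSON string or as a dict.
    if isinstance(additional_metadata, str):
        additional_metadata = json.loads(additional_metadata)
    metadata.update(additional_metadata or {})

    # Drop any keys the caller asked to omit.
    for key in omit_keys:
        metadata.pop(key, None)
    return {"pageContent": text, "metadata": metadata}


doc = build_document(
    "Hello",
    "report.pdf",
    processing_metadata={"page_number": 1, "filetype": "application/pdf"},
    additional_metadata='{"team": "docs"}',
    omit_keys=["filetype"],
)
```

Here `doc["metadata"]` contains `page_number`, `source`, and `team`, with `filetype` removed.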
Choose appropriate strategy based on document quality and processing needs
Configure OCR languages based on document content
Adjust chunking parameters for optimal text segmentation
Use appropriate hi-res model for your use case
Consider memory usage when processing large folders
Monitor API usage and response times
Handle potential API errors in your workflow
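For the last two points, one common pattern is to wrap each API call in a retry loop that also records how long the call took. The Python sketch below is a generic illustration and does not describe behavior built into the loader: `call_with_retries` is a hypothetical helper, the exponential backoff schedule is an assumption, and catching every `Exception` is a simplification that real code should narrow to network and HTTP errors.

```python
import time


def call_with_retries(request_fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `request_fn`, retrying on failure with exponential backoff.

    Returns a (result, elapsed_seconds) tuple for the successful attempt.
    Re-raises the last error once `max_attempts` is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        start = time.monotonic()
        try:
            result = request_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait 1x, 2x, 4x, ... the base delay between attempts.
            sleep(base_delay * 2 ** (attempt - 1))
        else:
            return result, time.monotonic() - start
```

The `sleep` parameter is injectable so the backoff can be tested or replaced without waiting in real time.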
Process multiple documents in batch
Supports various file formats
Memory-efficient processing
Automatic metadata handling
Flexible output formats
Error handling for API responses
Configurable processing options
This section is a work in progress. We appreciate any help you can provide in completing it. Please check our contribution guide to get started.