
Unstructured Folder Loader

Use Unstructured.io to load data from a folder. Note: Currently doesn't support .png and .heic until unstructured is updated.


The Unstructured Folder Loader uses Unstructured.io to load and process multiple documents from a folder. It provides advanced document parsing with extensive configuration options for OCR, chunking, and metadata extraction.

Note: .png and .heic files are not currently supported, pending an update to Unstructured.

Features

  • Batch processing of multiple documents

  • Multiple processing strategies

  • OCR support with 15+ languages

  • Flexible chunking strategies

  • Table structure inference

  • XML processing options

  • Page break handling

  • Coordinate extraction

  • Metadata customization

Configuration

API Setup

  • Default API URL: http://localhost:8000/general/v0/general

  • Can be configured via environment variable: UNSTRUCTURED_API_URL

  • Optional API key authentication
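
As a rough illustration of the setup above, the following sketch resolves the endpoint and builds request headers. The precedence (explicit value, then UNSTRUCTURED_API_URL, then the default) and the `unstructured-api-key` header name are assumptions made for this example, and `resolve_api_settings` is a hypothetical helper, not part of Flowise.

```python
import os

DEFAULT_API_URL = "http://localhost:8000/general/v0/general"


def resolve_api_settings(api_url=None, api_key=None, env=None):
    """Return (url, headers) for a call to the Unstructured API.

    Assumed precedence: explicit api_url, then the UNSTRUCTURED_API_URL
    environment variable, then the documented default.
    """
    env = os.environ if env is None else env
    url = api_url or env.get("UNSTRUCTURED_API_URL") or DEFAULT_API_URL
    headers = {"accept": "application/json"}
    if api_key:
        # Assumption: the API key is sent in the "unstructured-api-key" header.
        headers["unstructured-api-key"] = api_key
    return url, headers
```

Passing `env` explicitly keeps the helper easy to test without touching the real process environment.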

Parameters

Required Parameters

  • Folder Path: Path to the folder containing documents to process

Optional Parameters

Basic Configuration

  • Unstructured API URL: API endpoint (default: http://localhost:8000/general/v0/general)

  • Strategy: Processing strategy (default: auto)

    • hi_res: High resolution processing

    • fast: Quick processing

    • ocr_only: OCR-focused processing

    • auto: Automatic selection

  • Encoding: Document encoding (default: utf-8)

OCR Options

  • OCR Languages: Multiple language support including:

    • English (eng)

    • Spanish (spa)

    • Mandarin Chinese (cmn)

    • Hindi (hin)

    • Arabic (ara)

    • Portuguese (por)

    • Bengali (ben)

    • Russian (rus)

    • Japanese (jpn)

    • And more...

Processing Options

  • Skip Infer Table Types: File types for which table extraction is skipped (default: ["pdf", "jpg", "png"])

  • Hi-Res Model Name: Model selection for hi_res strategy (default: detectron2_onnx)

    • chipper: Unstructured's in-house VDU model

    • detectron2_onnx: Facebook AI's fast object detection

    • yolox: Single-stage real-time detector

    • yolox_quantized: Optimized YOLOX version

  • Coordinates: Extract element coordinates (default: false)

  • Include Page Breaks: Include page break elements

  • XML Keep Tags: Preserve XML tags

  • Multi-Page Sections: Handle multi-page sections

Text Chunking Options

  • Chunking Strategy: Text chunking method (default: by_title)

    • None: No chunking

    • by_title: Start a new chunk at each title (section heading) in the document

  • Combine Under N Chars: Minimum chunk size

  • New After N Chars: Soft maximum chunk size

  • Max Characters: Hard maximum chunk size (default: 500)
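
To make the interaction between the three size limits more concrete, here is a deliberately simplified sketch of title-based chunking. It is not Unstructured's implementation: `chunk_by_title` is a hypothetical function, elements are modelled as plain `(type, text)` tuples, and the exact merging rules are assumptions chosen for illustration.

```python
def chunk_by_title(elements, max_characters=500, new_after_n_chars=None,
                   combine_under_n_chars=0):
    """Simplified by_title chunking over (element_type, text) tuples."""
    # The soft maximum can never exceed the hard maximum.
    soft_max = max_characters if new_after_n_chars is None else min(
        new_after_n_chars, max_characters)

    # Step 1: group elements into sections, opening one at each Title.
    sections = []
    for kind, text in elements:
        if kind == "Title" or not sections:
            sections.append([])
        if text:
            sections[-1].append(text)

    # Step 2: fill chunks within each section.
    chunks = []
    for section in sections:
        current = ""
        for text in section:
            # Hard maximum: oversized text is split into fixed-size pieces.
            for start in range(0, len(text), max_characters):
                piece = text[start:start + max_characters]
                candidate = f"{current}\n{piece}" if current else piece
                too_long = len(candidate) > max_characters
                past_soft_max = len(current) >= soft_max
                if current and (too_long or past_soft_max):
                    chunks.append(current)
                    current = piece
                else:
                    current = candidate
        if current:
            chunks.append(current)

    # Step 3: merge a small chunk with its successor when the result fits.
    merged = []
    for chunk in chunks:
        fits = merged and len(merged[-1]) + 1 + len(chunk) <= max_characters
        if fits and len(merged[-1]) < combine_under_n_chars:
            merged[-1] = f"{merged[-1]}\n{chunk}"
        else:
            merged.append(chunk)
    return merged
```

In this sketch, Max Characters is the only limit that is never exceeded; New After N Chars only stops a chunk from growing further, and Combine Under N Chars only merges neighbours that are still too small.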

Metadata Options

  • Source ID Key: Key for document source identification (default: source)

  • Additional Metadata: Custom metadata as JSON

  • Omit Metadata Keys: Keys to exclude from metadata
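
The metadata options above can be pictured as a small merge step applied to each document. The sketch below is an assumption about the order of operations (processing metadata, then the source key, then additional metadata, then omitted keys); `build_metadata` is a hypothetical helper written for this example.

```python
import json


def build_metadata(processing_metadata, source, source_id_key="source",
                   additional_metadata=None, omit_keys=()):
    """Combine metadata for one document (illustrative order of operations)."""
    metadata = dict(processing_metadata)
    # Record where the document came from under the configured key.
    metadata[source_id_key] = source
    if additional_metadata:
        # Additional Metadata is entered as JSON, so accept a JSON string
        # as well as an already-parsed dict.
        extra = (json.loads(additional_metadata)
                 if isinstance(additional_metadata, str)
                 else additional_metadata)
        metadata.update(extra)
    # Omitted keys are removed last, so they also apply to custom metadata.
    for key in omit_keys:
        metadata.pop(key, None)
    return metadata
```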

Supported File Types

  • Documents: .doc, .docx, .odt, .ppt, .pptx, .pdf

  • Spreadsheets: .xls, .xlsx

  • Text: .txt, .text, .md, .rtf

  • Web: .html, .htm

  • Email: .eml, .msg

  • Images: .jpg, .jpeg (Note: .png and .heic currently unsupported)
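
If you want to check a folder before pointing the loader at it, a pre-filter along these lines can help. The extension set is copied from the list above; `is_supported` and `list_supported_files` are hypothetical helpers, and scanning only the top level of the folder (no recursion) is an assumption of this sketch rather than documented loader behaviour.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {
    ".doc", ".docx", ".odt", ".ppt", ".pptx", ".pdf",  # documents
    ".xls", ".xlsx",                                   # spreadsheets
    ".txt", ".text", ".md", ".rtf",                    # text
    ".html", ".htm",                                   # web
    ".eml", ".msg",                                    # email
    ".jpg", ".jpeg",                                   # images
}


def is_supported(filename):
    """True if the file extension appears in the supported list."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def list_supported_files(folder_path):
    """Return supported files directly inside folder_path, sorted by name."""
    folder = Path(folder_path)
    return sorted(p for p in folder.iterdir()
                  if p.is_file() and is_supported(p.name))
```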

Output Structure

Document Format

Each processed document includes:

  • pageContent: Extracted text content

  • metadata:

    • source: Document source identifier

    • Additional metadata from processing

    • Custom metadata (if specified)
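
Because chunking can turn one file into several documents, downstream code often needs to regroup them by source. The following sketch assumes documents are plain dicts with the `pageContent` and `metadata` fields described above; `group_by_source` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_source(documents, source_id_key="source"):
    """Map each source identifier to the list of its text chunks."""
    grouped = defaultdict(list)
    for doc in documents:
        source = doc["metadata"].get(source_id_key, "unknown")
        grouped[source].append(doc["pageContent"])
    return dict(grouped)


# Example documents shaped like the loader output described above.
example_documents = [
    {"pageContent": "Section 1", "metadata": {"source": "/docs/a.pdf"}},
    {"pageContent": "Section 2", "metadata": {"source": "/docs/a.pdf"}},
    {"pageContent": "Hello", "metadata": {"source": "/docs/b.txt"}},
]
```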

Usage Examples

Basic Configuration

{
  "folderPath": "/path/to/documents",
  "strategy": "auto",
  "encoding": "utf-8"
}

Advanced Processing

{
  "folderPath": "/path/to/documents",
  "strategy": "hi_res",
  "hiResModelName": "detectron2_onnx",
  "ocrLanguages": ["eng", "spa", "fra"],
  "chunkingStrategy": "by_title",
  "maxCharacters": 500,
  "coordinates": true,
  "metadata": {
    "source": "company_docs",
    "department": "legal"
  }
}
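
For readers calling the Unstructured API directly, a configuration like the one above has to be translated into request form fields. The sketch below assumes the API's snake_case field names correspond to the camelCase keys used in these examples, and that `folderPath` and `metadata` are handled locally rather than sent to the API. Both points are assumptions, and `to_form_fields` is a hypothetical helper.

```python
import re

# Assumption: these keys are consumed locally and never sent to the API.
LOCAL_ONLY_KEYS = {"folderPath", "metadata"}


def camel_to_snake(name):
    """Convert e.g. 'hiResModelName' to 'hi_res_model_name'."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def to_form_fields(config):
    """Translate a camelCase config dict into string-valued form fields."""
    fields = {}
    for key, value in config.items():
        if key in LOCAL_ONLY_KEYS or value is None:
            continue
        if isinstance(value, bool):
            # Form fields carry booleans as lowercase strings.
            encoded = "true" if value else "false"
        elif isinstance(value, (list, tuple)):
            encoded = [str(item) for item in value]
        else:
            encoded = str(value)
        fields[camel_to_snake(key)] = encoded
    return fields
```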

Best Practices

  1. Choose appropriate strategy based on document quality and processing needs

  2. Configure OCR languages based on document content

  3. Adjust chunking parameters for optimal text segmentation

  4. Use appropriate hi-res model for your use case

  5. Consider memory usage when processing large folders

  6. Monitor API usage and response times

  7. Handle potential API errors in your workflow
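
For the last practice in this list, one common pattern is to retry transient failures with exponential backoff. This is a generic sketch rather than Flowise behaviour: `TransientAPIError` and `with_retries` are hypothetical names, and deciding which responses count as transient (for example HTTP 429 or 503) is left to the caller.

```python
import time


class TransientAPIError(Exception):
    """Raised by the caller's request function for retryable failures."""


def with_retries(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call(), retrying on TransientAPIError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except TransientAPIError:
            if attempt == attempts:
                # Out of attempts: surface the last error to the workflow.
                raise
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.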

Notes

  • Process multiple documents in batch

  • Supports various file formats

  • Memory-efficient processing

  • Automatic metadata handling

  • Flexible output formats

  • Error handling for API responses

  • Configurable processing options

This section is a work in progress. We appreciate any help you can provide in completing this section. Please check our Contribution Guide to get started.