Cheerio Web Scraper
Cheerio is a fast, flexible, and lean implementation of core jQuery designed specifically for the server. This module provides powerful web scraping capabilities using Cheerio to extract content from web pages.
This module provides a sophisticated web scraper that can:
Load content from single or multiple web pages
Crawl relative links from websites
Extract content using CSS selectors
Handle XML sitemaps
Process web content with text splitters
Inputs
URL: The webpage URL to scrape
Text Splitter (optional): A text splitter to process the extracted content
Get Relative Links Method (optional): Choose between:
Web Crawl: Crawl relative links from HTML URL
Scrape XML Sitemap: Scrape relative links from XML sitemap URL
Get Relative Links Limit (optional): Limit for number of relative links to process (default: 10, 0 for all links)
Selector (CSS) (optional): CSS selector to target specific content
Additional Metadata (optional): JSON object with additional metadata to add to documents
Omit Metadata Keys (optional): Comma-separated list of metadata keys to omit
Outputs
Document: Array of document objects containing metadata and pageContent
Text: Concatenated string from pageContent of documents
Features
CSS selector-based content extraction
Web crawling capabilities
XML sitemap processing
Configurable link limits
Error handling for invalid URLs and PDFs
Metadata customization
Debug logging support
Notes
PDF files are not supported and will be skipped
Invalid URLs will throw an error
Setting link limit to 0 will retrieve all available links (may take longer)
Debug mode provides detailed logging of the scraping process
Scrape One URL
Input desired URL to be scraped.
Crawl & Scrape Multiple URLs
Select
Web Crawl
orScrape XML Sitemap
in Get Relative Links Method.Input
0
in Get Relative Links Limit to retrieve all links available from the provided URL.
Manage Links (Optional)
Input desired URL to be crawled.
Click Fetch Links to retrieve links based on the inputs of the Get Relative Links Method and Get Relative Links Limit in Additional Parameters.
In Crawled Links section, remove unwanted links by clicking Red Trash Bin Icon.
Lastly, click Save.
Output
Loads URL content as Document
Resources
Last updated