Why Diffbot?

We're focused exclusively on getting you better web data. Hundreds of customers make hundreds of millions of API calls every month. Why?

The Web's Best Content Extractor

Diffbot works without rules or training. There's no better way to extract data from web pages. See how Diffbot stacks up against other content extraction methods:

Identify Pages Automatically

Use the Analyze API to automatically find and extract all products, articles, discussions, videos or images while crawling any site.

Analyze API »
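As a rough illustration of calling the Analyze API, the sketch below builds a request URL and groups the returned objects by page type. The endpoint path, the `token` and `url` parameter names, and the `objects`/`type` response fields are assumptions made for this example and should be checked against the Diffbot documentation; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the current Diffbot documentation.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(token: str, page_url: str) -> str:
    """Return a GET URL asking the (assumed) Analyze endpoint to process page_url."""
    query = urlencode({"token": token, "url": page_url})
    return f"{ANALYZE_ENDPOINT}?{query}"


def group_by_type(response: dict) -> dict:
    """Group extracted objects by their 'type' field (assumed response shape)."""
    grouped = {}
    for obj in response.get("objects", []):
        grouped.setdefault(obj.get("type", "unknown"), []).append(obj)
    return grouped


if __name__ == "__main__":
    print(build_analyze_url("YOUR_TOKEN", "https://example.com/shoes?id=1"))
    # Hand-written sample standing in for a real API response.
    sample = {"objects": [{"type": "product", "title": "Shoe"},
                          {"type": "article", "title": "Review"}]}
    print(sorted(group_by_type(sample)))
```

The URL builder is kept separate from any HTTP client, so the request can be sent with whichever library you already use.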

Detailed Product Data

The Product API automatically returns complete product info, including all pricing data, product IDs, brand and full specifications tables.

Product API »
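To give a sense of how the product data described above might be consumed, here is a small sketch that flattens one product object into a summary. The field names (`title`, `brand`, `offerPrice`, `sku`, `specs`) are assumptions chosen for illustration, and the sample object is hand-written rather than real API output.

```python
def summarize_product(product: dict) -> dict:
    """Pick a few fields from a product object (field names are assumed)."""
    specs = product.get("specs") or {}
    return {
        "title": product.get("title"),
        "brand": product.get("brand"),
        "price": product.get("offerPrice"),
        "sku": product.get("sku"),
        # Keep the specification table as sorted (name, value) pairs.
        "specs": sorted(specs.items()),
    }


if __name__ == "__main__":
    # Hypothetical product object, not real API output.
    sample = {
        "title": "Trail Runner",
        "brand": "Acme",
        "offerPrice": "$89.00",
        "sku": "TR-100",
        "specs": {"weight": "280 g", "color": "blue"},
    }
    print(summarize_product(sample))
```

Missing fields come back as `None` rather than raising, which is convenient when pages expose only part of the product information.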

Clean Text and HTML

Articles, discussion threads, product descriptions and image captions are returned in pure text and sanitized HTML.

Start testing today »

Structured Search

Search structured content from any crawl on the fly using our Search API, returning only the matching results.

See more on Crawlbot »


Plus...

  • All APIs execute JavaScript, so content is parsed the way a regular browser would render it.
  • Works on most non-English pages thanks to visual processing.
  • Date normalization: Datestamps are normalized to RFC 1123 (HTTP/1.1).
  • Multipage articles are automatically joined together in a single API response.
  • Entity extraction: automatic tagging identifies major topics and entities within article text.
  • Fix any issues in real time with the API Toolkit.
  • Bulk API allows the extraction of hundreds to hundreds of thousands of pages.
  • Access Crawlbot and Bulk job data in full JSON or CSV formats.
  • Optionally crawl using a diverse array of IP addresses.
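
Because datestamps are normalized to RFC 1123, they can be handled with standard-library tools. The sketch below uses Python's `email.utils` to format and parse such dates; the helper names are hypothetical, and the sample timestamp is the well-known example from the HTTP specification rather than real API output.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def to_rfc1123(dt: datetime) -> str:
    """Format a datetime as an RFC 1123 date in GMT.

    Naive datetimes are interpreted as local time by astimezone().
    """
    return format_datetime(dt.astimezone(timezone.utc), usegmt=True)


def from_rfc1123(stamp: str) -> datetime:
    """Parse an RFC 1123 date string into a timezone-aware datetime."""
    return parsedate_to_datetime(stamp)


if __name__ == "__main__":
    moment = datetime(1994, 11, 6, 8, 49, 37, tzinfo=timezone.utc)
    stamp = to_rfc1123(moment)
    print(stamp)  # Sun, 06 Nov 1994 08:49:37 GMT
    print(from_rfc1123(stamp).isoformat())
```

Parsing into an aware `datetime` makes it straightforward to sort articles by date or convert them to another time zone.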

Web Extractor Feature Comparison

Tools compared: Diffbot, Alchemy, Boilerpipe, Embedly, Readability, Kimono, Import.io, Mozenda
Article extraction API      
Product extraction API              
Discussion extraction API              
Video extraction API              
Image extraction API            
Page classifying API              
Custom rule creator for custom fields/APIs        
Executes JavaScript for full page rendering
Automatic tag/entity extraction          
Returns normalized HTML              
Returns clean plaintext          
Author identification          
Date extraction and normalization              
Works in any language              
Multipage article concatenation            
Multipage discussion concatenation              
Sentiment analysis            
Language detection          
Article comment extraction              
Video extraction from articles              
Product review extraction              
Product specification table extraction              
Link extraction (return all links on a page)            
Integrated web crawler          
Automatic page classifying while crawling              
Crawler API              
Repeating crawls              
Custom crawling controls/filters              
Crawling anonymity / proxying            
Bulk API (up to 1M URLs)              
Searchable crawl and bulk API data              
Fully-hosted SaaS        
Open source              
Relation extraction              
Face detection            
Taxonomies/categorization              

Ready to get started? Sign up for a free 14-day trial.