Automatic Extraction APIs

Automatic data extraction from articles, products, discussions and more. Diffbot uses advanced AI technology to retrieve clean, structured data without the need for manual rules or site-specific training.


Test Drive The Analyze API

The Analyze API automatically determines the "page type" of unknown URLs, and returns complete structured data for supported page types (articles, products, images, discussions or videos).

Features

"Looks" Like a Human
Analyze evaluates web pages the way a human being does: by looking at them in a regular web browser. Just as you can tell a shopping page from a news article, so can Diffbot.
One-Click Crawling
Use the Analyze API with Crawlbot and all you need is a seed URL. Then let Diffbot find all of the products (or articles, discussion threads, videos or images) across the entire site.
Routes to All Automatic APIs
When the Analyze API identifies a matching page type, it automatically extracts page data using the appropriate extraction API, with no extra API call necessary.
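
To make the routing idea concrete, here is a minimal sketch of how a client might build an Analyze request and group whatever comes back by page type. Nothing here is taken from this page: the base URL, the `token` and `url` query parameters, and the `objects`/`type` response keys are assumptions about the API's layout, and both function names are hypothetical. Check the official API reference before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed base URL for the Automatic APIs; not stated on this page.
API_BASE = "https://api.diffbot.com/v3"


def build_analyze_url(token: str, page_url: str) -> str:
    """Return an Analyze API request URL (parameter names are assumptions)."""
    query = urlencode({"token": token, "url": page_url})
    return f"{API_BASE}/analyze?{query}"


def group_objects_by_type(response: dict) -> dict:
    """Group extracted objects by their reported page type.

    Assumes the response carries a list under "objects" and that each
    object has a "type" key such as "article" or "product".
    """
    grouped: dict = {}
    for obj in response.get("objects", []):
        grouped.setdefault(obj.get("type", "unknown"), []).append(obj)
    return grouped
```

Because the page type is decided server-side, the client only needs one request per URL and can branch on the grouped result afterwards.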

Test Drive The Article API

The Article API automatically extracts clean text from news articles and blog posts—returning normalized HTML and plaintext, author and date information, related images/videos and more from any article on any site.

Features

Simply the Best
Since 2011 (when people first started tracking these sorts of things), Diffbot's AI-driven web extraction has overwhelmingly topped benchmarks against competing text-extraction methods.
Fully Automatic
Like all of Diffbot's Automatic APIs, the Article API needs no rules or training. Send it any text-heavy page and let Diffbot do the rest.
Speaks Any Language
Thanks to its basis in computer vision, the Article API extracts clean text in any language.
Native Text Analysis
Topics/tags are automatically generated for each analyzed article, and built-in sentiment analysis automatically scores an article's (and all of its comments') positivity/negativity.
Comments, Too
Diffbot's Discussion API technology is built into the Article API to automatically extract comments alongside the main article text.
Crawl 'Em All
Pair the Article API with Crawlbot to automatically identify and extract all articles across an entire site.

Test Drive The Discussion API

Diffbot's Discussion API structures the full content of forum threads, article comments, product reviews and more.

Features

Fully Automatic
Like all of Diffbot's Automatic APIs, the Discussion API needs no rules or training. Send it any page containing a discussion and let Diffbot do the rest.
Text Analysis
Topics/tags are automatically generated for each post, and built-in sentiment analysis automatically analyzes each individual post to rate its overall positivity/negativity.
Get All the Pages
Long forum thread spanning multiple pages? No problem. Use the &maxPages argument to automatically concatenate as many pages as you need.
Comments & Reviews
Discussion API technology is built into the Article and Product APIs to automatically extract comments and reviews from article and product pages.
Crawl 'Em All
Pair the Discussion API with Crawlbot to automatically identify and extract all discussion threads across an entire site.
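
As a rough illustration of the `maxPages` argument mentioned under "Get All the Pages", the sketch below builds a Discussion API request URL with an optional page limit. Only the `maxPages` name comes from this page; the endpoint path, the `token` and `url` parameters, and the helper name `build_discussion_url` are assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed endpoint for the Discussion API; not stated on this page.
DISCUSSION_ENDPOINT = "https://api.diffbot.com/v3/discussion"


def build_discussion_url(token: str, thread_url: str,
                         max_pages: Optional[int] = None) -> str:
    """Return a Discussion API request URL.

    max_pages is passed through as the maxPages argument; when it is
    None the argument is left off and the API's default applies.
    """
    params = {"token": token, "url": thread_url}
    if max_pages is not None:
        params["maxPages"] = max_pages
    return f"{DISCUSSION_ENDPOINT}?{urlencode(params)}"
```

For example, `build_discussion_url("TOKEN", "https://forum.example.com/thread/1", max_pages=5)` would request up to five pages of the thread in a single call, under the assumptions above.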

Test Drive The Image API

Diffbot's Image API extracts and analyzes individual images and image-heavy pages.

Features

Fully Automatic
Like all of Diffbot's Automatic APIs, the Image API needs no rules or training. Send it any image-heavy page and let Diffbot do the rest.
Intra-Image Analysis
The Image API automatically evaluates image content and generates tags based on its identified elements.
Image Mentions
Use the mentions field to see where else on the web an image (or its variants) has been seen.

Test Drive The Product API

The Product API extracts complete data from any shopping or e-commerce product page. Retrieve full pricing information, product IDs (SKU, UPC, MPN), images, product specifications, brand and more.

Features

Fully Automatic
Like all of our Automatic APIs, the Product API needs no rules or training. Send it any product page and let Diffbot do the rest.
Complete Pricing Data
Retrieve all price data available: original price, sale/offer price, shipping cost and discount amount. If a product comes in price ranges or with quantity-based discounts, the Product API can return that too.
Normalized Specs
The Product API automatically identifies and extracts any specifications tables (or table-like elements).
Review Extraction
Diffbot's Discussion API technology is built into the Product API to automatically extract reviews from most product pages.
Country-Specific Pricing
Subscribers to our Plus or Professional plans can optionally upgrade for access to region- or country-specific proxy IP addresses.
Crawl 'Em All
Pair the Product API with Crawlbot to automatically identify and extract all products across an entire shopping site.

Test Drive The Video API

Diffbot's Video API extracts detailed information from video-specific pages.

Features

Fully Automatic
Like all of Diffbot's Automatic APIs, the Video API works right out of the box, with no need for rules or training.
Get the Raw Bits
Where possible, Diffbot extracts the raw source content in addition to embeddable HTML.
Video Metadata
Author/uploader, duration, title, description, date uploaded, video views... Diffbot's visual page analysis returns everything we see on the page.