Automatic Web Page Extraction

» For Articles, Images and Products

» Diffbot uses computer vision, natural language processing and machine learning to automatically recognize and structure specific page-types.

Article API

The Article API is used to extract clean article text and other data from news articles, blog posts and other text-heavy pages. Retrieve the full-text, cleaned and normalized HTML, related images and videos, author, date, tags -- all automatically, from any article on any site.

Diffbot's V3 APIs were released in April 2014. For an overview of the differences between versions, please see this Support article.

Request

To use the Article API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v3/article

Provide the following arguments:

Argument | Description
token | Developer token
url | Web page URL of the article to process (URL encoded)

Optional arguments

Argument | Description
fields | Used to specify exact fields to be returned by the Article API. See the Fields section below.
timeout | Set a value in milliseconds to terminate the response. By default the Article API has a 30-second (30000) timeout.
paging | Pass paging=false to disable automatic concatenation of multiple-page articles. (By default, Diffbot will concatenate up to 20 pages of a single article.) More on automatic concatenation.
callback | Use for jsonp requests. Needed for cross-domain ajax.
links | Returns a top-level object (links) containing all hyperlinks found on the page.
meta | Returns a top-level object (meta) containing the full contents of page meta tags, including sub-arrays for OpenGraph tags, Twitter Card metadata, schema.org microdata, and -- if available -- oEmbed metadata.
querystring | Returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true.
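
As a rough illustration of the arguments above, the following Python sketch assembles a GET URL with the standard library. The helper name build_article_url and the token value YOUR_TOKEN are placeholders invented for this example; only the endpoint and argument names come from the documentation.

```python
from urllib.parse import urlencode

# Endpoint taken from the documentation above.
ARTICLE_ENDPOINT = "http://api.diffbot.com/v3/article"


def build_article_url(token, page_url, **options):
    """Return a GET URL for the Article API (hypothetical helper).

    `options` carries optional arguments such as timeout or paging.
    """
    params = {"token": token, "url": page_url}
    for name, value in options.items():
        # Python booleans become the lowercase form used in the docs (paging=false).
        params[name] = str(value).lower() if isinstance(value, bool) else value
    # urlencode percent-encodes the article URL, as the url argument requires.
    return ARTICLE_ENDPOINT + "?" + urlencode(params)


request_url = build_article_url(
    "YOUR_TOKEN",  # placeholder, not a real token
    "http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/",
    paging=False,
    timeout=10000,
)
print(request_url)
```

With a valid token, the resulting URL could then be fetched with urllib.request.urlopen and decoded with json.load; that step is left out here so the sketch runs without network access.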

The fields argument

Use the fields argument to return optional fields in the JSON response. The default fields will always be returned. For nested arrays, use parentheses to retrieve specific fields, or * to return all sub-fields.

For example, to return tags and the image height fields (in addition to the default fields), your &fields argument would be:

&fields=tags,images(height)
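
If the fields value is built programmatically, a small helper along these lines may be convenient. The function name build_fields is hypothetical, and the sketch assumes that several sub-fields are comma-separated inside the parentheses and that all sub-fields are requested as name(*); the text above does not spell out either detail.

```python
def build_fields(*fields, **nested):
    """Join field names into a value for the fields argument.

    Positional names are plain fields; keyword arguments map a nested
    array name to the sub-fields wanted inside it ("*" for all).
    """
    parts = list(fields)
    for array_name, sub_fields in nested.items():
        if isinstance(sub_fields, str):
            sub_fields = [sub_fields]
        # Assumption: sub-fields are comma-separated inside the parentheses.
        parts.append(f"{array_name}({','.join(sub_fields)})")
    return ",".join(parts)


fields_value = build_fields("tags", images=["height"])
print(fields_value)  # tags,images(height)
```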

Advanced Text Analysis Powered by Semantria

Our native integration with Semantria optionally allows extracted article content to be fully processed for categorization, entity and keyword extraction, and sentiment analysis. Contact us if you are interested in adding this functionality to your paid Diffbot plan.

Response

Diffbot's V3 APIs return information about all identified objects on a submitted page.

Each V3 response includes a request object (which returns request-specific metadata), and an objects array, which will include the extracted information for all objects on a submitted page. At the moment, only a single object will be returned for Article API requests.

Objects in the Article API's objects array will include the following fields:

Field | Description
type | Type of object (always article).
pageUrl | URL of submitted page / page from which the article is extracted.
resolvedPageUrl | Returned if the pageUrl redirects to another URL.
title | Title of the article.
text | Full text of the article.
html | Diffbot-normalized HTML of the extracted article. Please see the HTML Specification for a breakdown of elements and attributes returned.
date | Date of extracted article, normalized in most cases to RFC 1123 (HTTP/1.1).
author | Article author.
numPages | Number of pages automatically concatenated to form the text or html response. By default, Diffbot will automatically concatenate up to 20 pages of an article. More on automatic concatenation.
nextPages | Array of all page URLs concatenated in a multipage article. More on automatic concatenation.
images | Array of images, if present within the article body.
    url | Fully resolved link to image.
    title | Description or caption of the image.
    naturalHeight | Raw image height, in pixels.
    naturalWidth | Raw image width, in pixels.
    primary | Returns true if image is identified as primary based on visual analysis.
    diffbotUri | Internal ID used for indexing.
videos | Array of videos, if present within the article body.
    url | Fully resolved link to source video content.
    primary | Returns true if video is identified as primary based on visual analysis.
    diffbotUri | Internal ID used for indexing.
diffbotUri | Internal ID used for indexing.

Optional fields, available using fields= argument

Field | Description
tags | Array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia. Returned with &fields=tags.
    label | Name of the automatically-generated entity or tag. Tags are based on analysis of the text field and cross-referenced with DBpedia.
    type | Link to the entity type, if identified, at DBpedia.
    uri | Link to the entity at DBpedia.
humanLanguage | Returns the (spoken/human) language of the submitted page, using two-letter ISO 639-1 nomenclature.
images | Array of images, if present within the article body.
    height | Height of image as (re-)sized via browser/CSS.
    width | Width of image as (re-)sized via browser/CSS.
videos | Array of videos, if present within the article body.
    naturalHeight | Raw video height, in pixels.
    naturalWidth | Raw video width, in pixels.
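
Because the date field is normalized to RFC 1123 only "in most cases", client code may want to parse it defensively. The sketch below uses Python's standard email.utils parser; the function name parse_article_date is made up for this example, and returning None for unparseable values is one possible policy, not something the API prescribes.

```python
from email.utils import parsedate_to_datetime


def parse_article_date(value):
    """Parse an RFC 1123 date string; return None if it does not parse."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Dates are normalized only "in most cases", so other formats can appear.
        return None


# Date string copied from the style shown in the documentation's example response.
published = parse_article_date("Wed, 31 Jul 2013 08:00:00 GMT")
print(published.isoformat())
```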

Example Response

The following request --

http://api.diffbot.com/v3/article?token=...&paging=false&timeout=10000&url=http%3A%2F%2Fblog.diffbot.com%2Fdiffbots-new-product-api-teaches-robots-to-shop-online%2F&fields=tags,title,text,html,author,date,humanLanguage,videos

-- will result in this API response:

{
  "request": {
    "pageUrl": "http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/",
    "api": "article",
    "options": [
      "paging=false",
      "timeout=10000"
    ],
    "fields": "tags,title,text,html,author,date,humanLanguage,videos",
    "version": 3
  },
  "objects": [
    {
    "type": "article",
    "title": "Diffbot's New Product API Teaches Robots to Shop Online",
    "author": "John Davi",
    "date": "Wed, 31 Jul 2013 08:00:00 GMT",
    "videos": [
      {
        "primary": "true",
        "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"
      }
    ],
    "tags": [
      "e-commerce",
      "SaaS"
    ],
    "pageUrl": "http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/",
    "humanLanguage": "en",
    "text": "Diffbot's human wranglers are proud today to announce the release of our newest product: an API for... products!\nThe Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you'd expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.\nEven cooler: pair the Product API with Crawlbot, our intelligent site-spidering tool, and let Diffbot determine which pages are products, then automatically structure the entire catalog. Here's a quick demonstration of Crawlbot at work:\nWe've developed the Product API over the course of two years, building upon our core vision technology that's extracted structured data from billions of web pages, and training our machine learning systems using data from tens of thousands of unique shopping sites. We can't wait for you to try it out.\nWhat are you waiting for? Check out the Product API documentation and dive on in! If you need a token, check out our pricing and plans (including our Free plan).\nQuestions? Hit us up at support@diffbot.com.",
    "html": "<p>Diffbot’s human wranglers are proud today to announce the release of our newest product: an API for… products!</p><p>The <a href=\"http://www.diffbot.com/products/automatic/product\">Product API</a> can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you’d expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.</p><p>Even cooler: pair the Product API with <a href=\"http://www.diffbot.com/products/crawlbot\">Crawlbot</a>, our intelligent site-spidering tool, and let Diffbot determine which pages are products, then automatically structure the entire catalog. Here’s a quick demonstration of Crawlbot at work:</p><figure><iframe frameborder=\"0\" src=\"http://www.youtube.com/embed/lfcri5ungRo?feature=oembed\"></iframe></figure><p>We’ve developed the Product API over the course of two years, building upon our core vision technology that’s extracted structured data from billions of web pages, and training our machine learning systems using data from tens of thousands of unique shopping sites. We can’t wait for you to try it out.</p><p>What are you waiting for? Check out the <a href=\"http://www.diffbot.com/products/automatic/product\">Product API documentation</a> and dive on in! If you need a token, check out our <a href=\"http://www.diffbot.com/pricing\">pricing and plans</a> (including our Free plan).</p><p>Questions? Hit us up at <a href=\"mailto:support@diffbot.com\">support@diffbot.com</a>.</p>",
    "diffbotUri": "article|3|768070723"
    }
  ]
}
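
A response like the one above could be consumed roughly as follows. This is a minimal sketch: the helper names first_article and primary_video_url are hypothetical, and the sample data is a trimmed copy of the example response with the text and html fields left out.

```python
import json

# Trimmed copy of the example response above (text and html omitted).
sample_response = json.loads("""
{
  "request": {"api": "article", "version": 3},
  "objects": [
    {
      "type": "article",
      "title": "Diffbot's New Product API Teaches Robots to Shop Online",
      "author": "John Davi",
      "videos": [
        {"primary": "true",
         "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"}
      ],
      "humanLanguage": "en"
    }
  ]
}
""")


def first_article(response):
    """Return the first article object, or None if nothing was extracted."""
    for obj in response.get("objects", []):
        if obj.get("type") == "article":
            return obj
    return None


def primary_video_url(article):
    """Return the URL of the video flagged as primary, if any."""
    for video in article.get("videos", []):
        # The example shows "true" as a string; accept a JSON boolean as well.
        if video.get("primary") in (True, "true"):
            return video.get("url")
    return None


article = first_article(sample_response)
print(article["title"], "by", article["author"])
print(primary_video_url(article))
```

Using .get with defaults keeps the code tolerant of fields that are absent for a given page, such as videos on an article without embedded media.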

Authentication

You can supply Diffbot with basic authentication credentials or custom HTTP headers (see below) to access intranet pages or other sites that require a login.

Basic Authentication

To access pages that require a login/password (using basic access authentication), include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com.
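
One way to assemble such a url value in Python is sketched below. The helper name add_credentials is hypothetical. Note that urlencode also percent-encodes the ":" and "@" characters, which the example above leaves literal; the assumption here is that the fully encoded form is accepted, since it decodes to the same URL.

```python
from urllib.parse import quote, urlencode, urlsplit, urlunsplit


def add_credentials(page_url, username, password):
    """Return page_url with username:password@ placed before the host."""
    parts = urlsplit(page_url)
    # Encode the credentials so characters like "@" or ":" inside them stay unambiguous.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{parts.netloc}"))


protected_url = add_credentials("http://www.diffbot.com", "USERNAME", "PASSWORD")
url_argument = urlencode({"url": protected_url})
print(url_argument)
```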

Custom HTTP Headers

You can supply Diffbot APIs with custom values for the user-agent, referer, cookie, or accept-language values in the HTTP request. These will be used in place of the Diffbot default values.

To provide custom headers, pass in the following values in your own headers when calling the Diffbot API:

Header | Description
X-Forward-User-Agent | Will be used as Diffbot's User-Agent header when making your request.
X-Forward-Referer | Will be used as Diffbot's Referer header when making your request.
X-Forward-Cookie | Will be used as Diffbot's Cookie header when making your request.
X-Forward-Accept-Language | Will be used as Diffbot's Accept-Language header when making your request.
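
The header mapping above might be applied as in the following sketch, which builds (but does not send) a request with Python's urllib. The names FORWARD_HEADERS and with_forwarded_headers, the token YOUR_TOKEN, and the sample header values are all placeholders for illustration.

```python
from urllib.request import Request

# Maps an ordinary header name to the X-Forward-* name from the table above.
FORWARD_HEADERS = {
    "User-Agent": "X-Forward-User-Agent",
    "Referer": "X-Forward-Referer",
    "Cookie": "X-Forward-Cookie",
    "Accept-Language": "X-Forward-Accept-Language",
}


def with_forwarded_headers(api_url, headers):
    """Build a Request that asks Diffbot to use the given header values."""
    request = Request(api_url)
    for name, value in headers.items():
        request.add_header(FORWARD_HEADERS[name], value)
    return request


request = with_forwarded_headers(
    "http://api.diffbot.com/v3/article?token=YOUR_TOKEN&url=http%3A%2F%2Fblog.diffbot.com",
    {"Accept-Language": "de-DE", "Referer": "http://www.diffbot.com/"},
)
print(request.header_items())
```

urllib changes the capitalization of header names when storing them; HTTP header names are case-insensitive, so this should not affect how they are interpreted.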

Posting Content

If your content is not publicly available (e.g., behind a firewall), you can POST markup or plain text directly to the Article API endpoint for analysis:

http://api.diffbot.com/v3/article?token=...&url=...

Please note that if you submit HTML, the url argument is still required, and will be used to resolve any relative links contained in the markup.

Provide the content to analyze as your POST body, and specify the Content-Type header as text/html (for full markup) or text/plain (for text-only).

HTML Post Sample:

curl \
    -H "Content-Type: text/html" \
    -d '<html><body><p>Now is the time for all good robots to come to the aid of their-- oh never mind, run!</p></body></html>' \
    "http://api.diffbot.com/v3/article?token=...&url=http%3A%2F%2Fblog.diffbot.com"

Plaintext Post Sample:

curl -H "Content-Type: text/plain" -d 'Now is the time for all good robots to come to the aid of their-- oh never mind, run!' "http://api.diffbot.com/v3/article?token=...&fields=tags,text"
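
For callers working in Python rather than curl, an equivalent POST could be prepared roughly like this. The helper name build_post_request and the token YOUR_TOKEN are placeholders, and encoding the body as UTF-8 is an assumption of this sketch. The request is only constructed, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(token, body, content_type, **params):
    """Build a POST Request carrying markup or plain text for analysis."""
    query = urlencode({"token": token, **params})
    return Request(
        "http://api.diffbot.com/v3/article?" + query,
        data=body.encode("utf-8"),  # urllib needs bytes; UTF-8 is assumed here
        headers={"Content-Type": content_type},
        method="POST",
    )


html_request = build_post_request(
    "YOUR_TOKEN",  # placeholder, not a real token
    "<html><body><p>Now is the time for all good robots.</p></body></html>",
    "text/html",
    url="http://blog.diffbot.com",  # used to resolve relative links in the markup
)
print(html_request.get_method(), html_request.full_url)
```

Passing the result to urllib.request.urlopen would submit it; for plain text, the same helper can be called with "text/plain" as the content type.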