Automatic Computer Vision Extraction

For Articles, Images, Products and Front Pages

Diffbot uses computer vision, natural language processing and machine learning to automatically recognize and structure specific page types.

Article API

The Article API is used to extract clean article text from news article web pages.

Request

To use the Article API, perform an HTTP GET request on the following endpoint:

http://api.diffbot.com/v2/article

Provide the following parameters:

Parameter | Description
token | Developer token
url | Article URL to process (URL encoded). If you wish to POST content, please see Posting Content, below.

Optional parameters
fields | Used to control which fields are returned by the API. See the Response section below.
timeout | Set a value in milliseconds to terminate the response. By default the Article API has a five-second timeout.
paging | Pass paging=false to disable automatic concatenation of multiple-page articles. (By default, Diffbot will concatenate up to ten pages.)
callback | Use for JSONP requests. Needed for cross-domain AJAX.

Example Request

http://api.diffbot.com/v2/article?token=...&url=http%3A%2F%2Fblog.diffbot.com%2Fusing-customize-and-correct-to-make-instant-api-fixes%2F
Note the encoding of the url parameter.
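
As a minimal sketch of how such a request URL could be assembled, the Python snippet below uses only the standard library. The helper name build_article_url and the placeholder token YOUR_TOKEN are hypothetical and not part of the Diffbot API; the endpoint and parameter names are the ones listed above.

```python
from urllib.parse import urlencode

ARTICLE_ENDPOINT = "http://api.diffbot.com/v2/article"


def build_article_url(token, page_url, fields=None, timeout_ms=None, paging=True):
    """Build an Article API GET URL (hypothetical helper, not a Diffbot client)."""
    params = [("token", token), ("url", page_url)]
    if fields:
        params.append(("fields", fields))
    if timeout_ms is not None:
        # The timeout parameter is expressed in milliseconds.
        params.append(("timeout", str(timeout_ms)))
    if not paging:
        # paging=false disables automatic multi-page concatenation.
        params.append(("paging", "false"))
    # urlencode percent-encodes each value, including the article URL.
    return f"{ARTICLE_ENDPOINT}?{urlencode(params)}"


request_url = build_article_url(
    "YOUR_TOKEN",  # placeholder, replace with a real developer token
    "http://blog.diffbot.com/using-customize-and-correct-to-make-instant-api-fixes/",
)
print(request_url)
```

Letting urlencode handle the escaping avoids hand-encoding the colon and slashes in the url parameter.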

Advanced Text Analysis

Our native integration with Semantria optionally allows extracted article content to be fully processed for categorization, entity extraction, and sentiment analysis. Contact us if you are interested in adding this functionality to your paid Diffbot plan.

Response

The Article API returns information about the primary article content on the submitted page.

Use the fields query parameter to limit or expand which fields are returned in the JSON response. For nested arrays, use parentheses to retrieve specific fields, or * to return all sub-fields.

http://api.diffbot.com/v2/article...&fields=meta,querystring,images(*)
Field | Description
url | URL submitted. Returned by default.
resolved_url | Returned if the resolving URL is different from the submitted URL (e.g., link shortening services). Returned by default.
title | Title of extracted article. Returned by default.
text | Plain-text of the extracted article. Returned by default.
html | Diffbot-normalized HTML of the extracted article. Please see the HTML Specification section for a breakdown of elements and attributes returned. Returned by default.
numPages | Number of pages automatically concatenated to form the text or html response. By default, Diffbot will automatically concatenate multiple-page articles.
nextPages | List of all URLs that were concatenated in a multi-page article.
date | Article date, normalized in most cases to GMT. Returned by default.
author | Article author. Returned by default.
tags | Array of tags, automatically generated by Diffbot natural-language-processing. Returned with fields.
humanLanguage | Returns the (spoken/human) language of the submitted URL, using two-letter ISO 639-1 nomenclature. Returned with fields.
images | Array of images, if present within the article body. Returned by default.
  url | Direct (fully resolved) link to image.
  pixelHeight | Image height, in pixels. Returned with fields.
  pixelWidth | Image width, in pixels. Returned with fields.
  caption | Diffbot-determined best caption for the image, if detected.
  primary | Returns "true" if image is identified as primary based on visual analysis of the page.
videos | Array of videos, if present within the article body. Returned by default.
  url | Direct (fully resolved) link to the video content.
  pixelHeight | Video height, in pixels, if accessible.
  pixelWidth | Video width, in pixels, if accessible.
  primary | Returns "true" if the video is identified as primary based on visual analysis of the page.
icon | Page favicon. Returned by default.
meta | Returns the full contents of page meta tags, including sub-arrays for OpenGraph tags, Twitter Card metadata, schema.org microdata, and -- if available -- oEmbed metadata. Returned with fields.
querystring | Returns the key/value pairs of the URL querystring, if present. Items without a value will be returned as "true". Returned with fields.
links | Returns all links (anchor tag href values) found on the page. Returned with fields.
type | Type of page -- always article. Returned by default.
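
The fields value itself can be composed programmatically. The sketch below is one possible way to do it in Python; the helper name build_fields is hypothetical, and the use of commas to separate several sub-fields inside the parentheses is an assumption, since the text above only shows the images(*) form.

```python
def build_fields(top_level, nested=None):
    """Compose a value for the fields parameter (hypothetical helper).

    top_level: iterable of field names, e.g. ["meta", "querystring"]
    nested:    mapping of array field -> list of sub-fields, e.g. {"images": ["*"]}
    """
    parts = list(top_level)
    for name, sub_fields in (nested or {}).items():
        # Assumption: multiple sub-fields are comma-separated inside the parentheses.
        parts.append(f"{name}({','.join(sub_fields)})")
    return ",".join(parts)


fields_value = build_fields(["meta", "querystring"], {"images": ["*"]})
print(fields_value)
```

The resulting string still needs to be URL encoded when it is placed in the query string.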

Example Response

{
  "type": "article",
  "icon": "http://www.diffbot.com/favicon.ico",
  "title": "Diffbot's New Product API Teaches Robots to Shop Online",
  "author": "John Davi",
  "date": "Wed, 31 Jul 2013 08:00:00 GMT",
  "videos": [
    {
      "primary": "true",
      "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"
    }
  ],
  "tags": [
    "e-commerce",
    "SaaS"
  ],
  "url": "http://blog.diffbot.com/diffbots-new-product-api-teaches-robots-to-shop-online/",
  "humanLanguage": "en",
  "text": "Diffbot's human wranglers are proud today to announce the release of our newest product: an API for... products!\nThe Product API can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you'd expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.\nEven cooler: pair the Product API with Crawlbot, our intelligent site-spidering tool, and let Diffbot determine which pages are products, then automatically structure the entire catalog. Here's a quick demonstration of Crawlbot at work:\nWe've developed the Product API over the course of two years, building upon our core vision technology that's extracted structured data from billions of web pages, and training our machine learning systems using data from tens of thousands of unique shopping sites. We can't wait for you to try it out.\nWhat are you waiting for? Check out the Product API documentation and dive on in! If you need a token, check out our pricing and plans (including our Free plan).\nQuestions? Hit us up at support@diffbot.com ."
}
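
A response like this can be consumed with any JSON parser. The Python sketch below works on an abbreviated copy of the example (the text field is omitted for brevity). The helper name primary_item is hypothetical, and parsing date with email.utils assumes the value follows the RFC 1123 style shown in the example, which the API does not guarantee in every case.

```python
import json
from email.utils import parsedate_to_datetime

# Abbreviated copy of the example response.
sample_response = """
{
  "type": "article",
  "title": "Diffbot's New Product API Teaches Robots to Shop Online",
  "author": "John Davi",
  "date": "Wed, 31 Jul 2013 08:00:00 GMT",
  "videos": [
    {
      "primary": "true",
      "url": "http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"
    }
  ],
  "tags": ["e-commerce", "SaaS"],
  "humanLanguage": "en"
}
"""


def primary_item(items):
    """Return the first image or video flagged as primary, or None."""
    for item in items:
        # The primary flag is documented as the string "true", not a JSON boolean.
        if item.get("primary") == "true":
            return item
    return None


article = json.loads(sample_response)
video = primary_item(article.get("videos", []))
# Assumption: the date string is in the RFC 1123 style shown in the example.
published = parsedate_to_datetime(article["date"])

print(article["title"])
print(video["url"] if video else "no primary video")
print(published.date())
```

Fields that are only "Returned with fields" may be absent, so dict.get with a default is used for the arrays.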

Authentication and Custom Headers

You can supply Diffbot with custom headers, or basic authentication credentials, in order to access intranet pages or other sites that require a login.

Basic Authentication

To access pages that require a login/password (using basic access authentication), include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com.
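
One way to construct such a URL in Python is sketched below. The helper name with_basic_auth is hypothetical, the input URL is assumed not to contain credentials already, and percent-encoding special characters inside the username and password is an assumption about what the service accepts.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def with_basic_auth(page_url, username, password):
    """Embed basic-auth credentials in a URL (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Assumption: reserved characters in credentials are percent-encoded.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    netloc = f"{userinfo}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


authed_url = with_basic_auth("http://www.diffbot.com", "USERNAME", "PASSWORD")
print(authed_url)
# The result must still be URL encoded before it is used as the url parameter.
print(quote(authed_url, safe=""))
```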

Custom Headers

You can supply the Article API with custom values for the user-agent, referer, or cookie values in the HTTP request. These will be used in place of the Diffbot default values.

To provide custom headers, pass in the following values in your own headers when calling the Diffbot API:

HeaderDescription
X-Forward-User-AgentWill be used as Diffbot's User-Agent header when making your request.
X-Forward-RefererWill be used as Diffbot's Referer header when making your request.
X-Forward-CookieWill be used as Diffbot's Cookie header when making your request.
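
The following Python sketch shows how these headers might be attached to a request using the standard library. The helper name forward_headers, the sample header values, and the placeholder token are all hypothetical; the request is only constructed here, not sent.

```python
from urllib.request import Request


def forward_headers(user_agent=None, referer=None, cookie=None):
    """Collect the X-Forward-* headers that were actually supplied."""
    candidates = {
        "X-Forward-User-Agent": user_agent,
        "X-Forward-Referer": referer,
        "X-Forward-Cookie": cookie,
    }
    return {name: value for name, value in candidates.items() if value is not None}


headers = forward_headers(user_agent="ExampleBot/1.0", cookie="session=abc123")
request = Request(
    # YOUR_TOKEN is a placeholder for a real developer token.
    "http://api.diffbot.com/v2/article?token=YOUR_TOKEN&url=http%3A%2F%2Fwww.diffbot.com",
    headers=headers,
)
# To send it: urllib.request.urlopen(request)
```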

Posting Content

If your content is not publicly available (e.g., behind a firewall), you can POST markup directly to the Article API endpoint for analysis:

http://api.diffbot.com/v2/article?token=...&url=...

Please note that the url parameter is still required in the endpoint, and will be used to resolve any relative links contained in the markup.

Provide markup to analyze as your POST body, and specify the Content-Type header as text/html.

The following call submits a sample to the API:

curl \
    -H "Content-Type: text/html" \
    -d 'Now is the time for all good robots to come to the aid of their-- oh never mind, run!' \
    "http://api.diffbot.com/v2/article?token=...&url=http://www.diffbot.com/products/automatic/article"
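
For comparison, a roughly equivalent request can be built in Python with the standard library. This is a sketch under stated assumptions: the helper name build_post_request and the placeholder token are hypothetical, and encoding the markup as UTF-8 is an assumption. The request is constructed but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(token, page_url, markup):
    """Build a POST request carrying markup for analysis (hypothetical helper)."""
    # The url parameter is still required so relative links can be resolved.
    query = urlencode({"token": token, "url": page_url})
    return Request(
        f"http://api.diffbot.com/v2/article?{query}",
        data=markup.encode("utf-8"),  # assumption: UTF-8 encoded body
        headers={"Content-Type": "text/html"},
        method="POST",
    )


post_request = build_post_request(
    "YOUR_TOKEN",  # placeholder, replace with a real developer token
    "http://www.diffbot.com/products/automatic/article",
    "Now is the time for all good robots to come to the aid of their-- oh never mind, run!",
)
# To send it: urllib.request.urlopen(post_request)
```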