Our APIs use computer vision, machine learning and natural language processing to help developers extract and understand objects from any Web page.

An API for the Web

We've determined that the entire Web can be classified into approximately 18 structural page types. From this basic understanding of common page layouts, Diffbot then uses computer vision, natural language processing and other machine learning algorithms to identify and extract the important items from within these pages.



We use state-of-the-art computer vision and NLP algorithms, maintain the largest collection of tagged pages, and update our model several times per week.


Pass in a URL and we'll do the rest. Stop spending time building custom scrapers and -- even worse -- maintaining them.


Diffbot is built and run by Web veterans in a multi-tiered environment with redundancy, monitoring and scalability built in. Our scale lets us operate the service more cheaply than you could run it yourself.


We use open standards (schema.org) and allow for endless configurability via our customization tool.

Frontpage API

The Frontpage API takes in a multifaceted “homepage” and returns individual page elements.



To use the Frontpage API, perform an HTTP GET request on the following endpoint:


Provide the following arguments:

token    Developer token
url      Frontpage URL from which to extract items (URL encoded)

Optional parameters

timeout  Specify a value in milliseconds (e.g., &timeout=15000) to override the default API timeout of 5000 ms.
format   Format the response output in xml (default) or json.
all      Returns all content from the page, including navigation and similar links that the Diffbot visual processing engine considers less important / non-core.
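
The arguments above can be assembled into a request URL along the following lines. This is a minimal sketch, not official client code: the endpoint is not shown in this document, so FRONTPAGE_ENDPOINT is a hypothetical placeholder, and the function name build_frontpage_url, its keyword names, and the value sent for the all flag are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: substitute the real Frontpage API endpoint.
FRONTPAGE_ENDPOINT = "https://api.example.com/frontpage"


def build_frontpage_url(token, page_url, timeout_ms=None, fmt=None,
                        include_all=False, endpoint=FRONTPAGE_ENDPOINT):
    """Return a GET URL carrying the required and optional arguments."""
    # urlencode percent-encodes each value, which covers the
    # "URL encoded" requirement on the url argument.
    params = {"token": token, "url": page_url}
    if timeout_ms is not None:
        params["timeout"] = int(timeout_ms)  # milliseconds
    if fmt is not None:
        if fmt not in ("xml", "json"):
            raise ValueError("format must be 'xml' or 'json'")
        params["format"] = fmt
    if include_all:
        # Assumption: the flag is enabled by its presence; "1" is a guess.
        params["all"] = "1"
    return endpoint + "?" + urlencode(params)


request_url = build_frontpage_url(
    "YOUR_TOKEN", "http://www.example.com/", timeout_ms=15000, fmt="json"
)
print(request_url)
```

The resulting string can be fetched with any HTTP client; the sketch stops short of sending the request so that it runs without network access or a valid token.
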
Basic authentication
To access pages that require a login/password (using basic access authentication), include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com

Alternatively, you can POST the content to analyze directly to the same endpoint. Specify the Content-Type header as either text/plain or text/html.
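
The POST variant might be prepared as shown below. This is only a sketch: the endpoint is a hypothetical placeholder, and passing token and url as query-string arguments alongside the POST body is an assumption, since this document does not say how they are supplied in that case.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical placeholder: substitute the real Frontpage API endpoint.
FRONTPAGE_ENDPOINT = "https://api.example.com/frontpage"


def build_post_request(token, page_url, html, endpoint=FRONTPAGE_ENDPOINT):
    """Prepare (but do not send) a POST carrying the content to analyze."""
    # Assumption: token and url still travel in the query string.
    query = urlencode({"token": token, "url": page_url})
    return Request(
        endpoint + "?" + query,
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html"},  # or text/plain
        method="POST",
    )


req = build_post_request("YOUR_TOKEN", "http://www.example.com/",
                         "<html><body>Hello</body></html>")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```
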


DML (Diffbot Markup Language) is an XML format for encoding the structural information extracted from the page. A DML document consists of a single info section and a list of items.

Info field  Type    Description
id          long    DMLID of the URL
title       string  Extracted title of the page
sourceURL   url     The URL this was extracted from
icon        url     A link to a small icon/favicon representing the page
numItems    int     The number of items in this DML document

Some of the fields found in Items

Item field   Type                      Description
id           long                      Unique hashcode/id of item
title        string                    Title of item
description  string                    innerHTML content of item
xroot        xpath                     XPath of where item was found on the page
pubDate      timestamp                 Timestamp when item was detected on page
link         URL                       Extracted permalink (if applicable) of item
type         {IMAGE,LINK,STORY,CHUNK}  Extracted type of the item, whether the item represents an image, permalink, story (image+summary), or html chunk.
img          URL                       Extracted image from item
textSummary  string                    A plain-text summary of the item
sp           double in [0,1]           Spam score - the probability that the item is spam/ad
sr           double in [1,5]           Static rank - the quality score of the item on a 1 to 5 scale
fresh        double in [0,1]           Fresh score - the percentage of the item that has changed compared to the previous crawl
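
The score fields lend themselves to simple filtering once a response has been parsed. The sketch below assumes each item has already been decoded into a Python dict keyed by the field names in the table; the sample items, the thresholds, and the defaults used for missing scores are all made up for illustration and are not recommended values.

```python
def keep_item(item, max_spam=0.5, min_rank=3.0):
    """Keep items that are unlikely to be spam and rank reasonably well."""
    # Hypothetical defaults: a missing score is treated as "not spam"
    # and as the lowest static rank.
    sp = float(item.get("sp", 0.0))  # spam probability in [0,1]
    sr = float(item.get("sr", 1.0))  # static rank in [1,5]
    return sp <= max_spam and sr >= min_rank


# Made-up sample data shaped after the item fields above.
items = [
    {"title": "Launch announcement", "type": "STORY", "sp": 0.05, "sr": 4.2, "fresh": 1.0},
    {"title": "Sponsored link", "type": "LINK", "sp": 0.93, "sr": 1.5, "fresh": 0.0},
    {"title": "Footer chunk", "type": "CHUNK", "sp": 0.10, "sr": 1.0, "fresh": 0.0},
]

# Highest-ranked items first.
selected = sorted((i for i in items if keep_item(i)),
                  key=lambda i: i["sr"], reverse=True)
print([i["title"] for i in selected])
```

Raising max_spam or lowering min_rank admits more items; the fresh field could be added as a third condition to surface only content that changed since the previous crawl.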