Diffbot API Documentation
Follow API
The Follow API allows you to turn any webpage into a source of information that can be subscribed to.
Using the Follow API consists of two steps: setting up the subscription using an Add request and then reading the subscription.
Add Request
The first step in using the Follow API is to register with Diffbot the URL whose changes you wish to follow. The goal of this stage is to receive an identifier (referred to as a DMLID) that you can use to make subsequent requests.
Registering a URL with Diffbot is accomplished by sending a POST request to the following endpoint:
http://www.diffbot.com/api/add?token=...&url=...
Provide the following arguments:
| Argument | Description |
|---|---|
| token | Developer token |
| url | URL to register |
Example Response
<dml id="538">
<info>
<title>CNN.com - Breaking News, U.S., World, Weather & Video News</title>
<sourceURL>http://www.cnn.com</sourceURL>
<last_crawled>Tue, 8 Feb 2011 17:49:24 GMT</last_crawled>
</info>
</dml>
The key piece of information in the response is the "id" attribute, which you will use to retrieve your subscription.
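The Add request described above can be sketched in Python as follows. This is a minimal illustration, not an official client: the function names (`build_add_url`, `parse_dmlid`, `add_subscription`) are hypothetical, the empty POST body and UTF-8 response encoding are assumptions, and the DMLID is pulled out with a regular expression rather than an XML parser because the sample response contains a bare `&`, which a strict parser would reject.

```python
import re
import urllib.parse
import urllib.request

# Endpoint as given in this document.
ADD_ENDPOINT = "http://www.diffbot.com/api/add"

# Matches the id attribute on the root <dml> element of the response.
_DML_ID = re.compile(r'<dml\s+id="([^"]+)"')


def build_add_url(token: str, url: str) -> str:
    """Return the Add request URL with token and url as query arguments."""
    query = urllib.parse.urlencode({"token": token, "url": url})
    return f"{ADD_ENDPOINT}?{query}"


def parse_dmlid(response_text: str) -> str:
    """Extract the DMLID (the "id" attribute of <dml>) from a response."""
    match = _DML_ID.search(response_text)
    if match is None:
        raise ValueError("no <dml id=...> element found in response")
    return match.group(1)


def add_subscription(token: str, url: str) -> str:
    """Register a URL and return its DMLID (performs network I/O).

    Assumption: the POST body is empty and the arguments travel in the
    query string, as the endpoint format above suggests.
    """
    request = urllib.request.Request(
        build_add_url(token, url), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_dmlid(response.read().decode("utf-8"))
```

Calling `add_subscription("YOUR_TOKEN", "http://www.cnn.com")` would return the DMLID as a string, which can be stored and reused for later read requests.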
Read Request
Once you have registered your URL, you can use the id (DMLID) to retrieve the latest changes Diffbot has detected to that URL. The response is returned in DML format.
Note: a developer token is not required for a read request.
http://www.diffbot.com/api/dfs/dml/archive?id=...
Provide the following arguments:
| Argument | Description |
|---|---|
| id | ID of the URL (see Add Request to get one) |
The response to a read request consists of a series of DML documents.
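A read request might be issued along these lines. The sketch assumes a plain HTTP GET (the document does not name the method for reads) and a UTF-8 response; `build_read_url` and `read_subscription` are hypothetical helper names, not part of any Diffbot library.

```python
import urllib.parse
import urllib.request

# Endpoint as given in this document; no developer token is needed.
READ_ENDPOINT = "http://www.diffbot.com/api/dfs/dml/archive"


def build_read_url(dmlid) -> str:
    """Return the read request URL for a DMLID (int or str)."""
    query = urllib.parse.urlencode({"id": dmlid})
    return f"{READ_ENDPOINT}?{query}"


def read_subscription(dmlid) -> str:
    """Fetch the raw DML text for a subscription (performs network I/O).

    Assumption: the endpoint answers a GET request with UTF-8 text.
    """
    with urllib.request.urlopen(build_read_url(dmlid), timeout=30) as response:
        return response.read().decode("utf-8")
```

The returned text can then be handed to whatever DML parsing code the application uses.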
DML Response
DML (Diffbot Markup Language) is an XML format for encoding the structural information extracted from the page. A DML document consists of a single info section and a list of items.
Fields found in the Info section
| Info field | Type | Description |
|---|---|---|
| id | long | DMLID of the URL |
| title | string | Extracted title of the page |
| sourceURL | url | The URL this was extracted from (the URL given in the Add Request) |
| icon | url | A link to a small icon/favicon representing the page |
| numItems | int | The number of items in this DML document |
Some of the fields found in Items
| Item field | Type | Description |
|---|---|---|
| id | long | Unique hashcode/id of the item |
| title | string | Title of the item |
| description | string | innerHTML content of the item |
| xroot | xpath | XPath of where the item was found on the page |
| pubDate | timestamp | Timestamp when the item was detected on the page |
| link | URL | Extracted permalink (if applicable) of the item |
| type | {IMAGE,LINK,STORY,CHUNK} | Extracted type of the item: whether it represents an image, permalink, story (image + summary), or HTML chunk |
| img | URL | Extracted image from the item |
| textSummary | string | A plain-text summary of the item |
| sp | double in [0,1] | Spam score: the probability that the change is spam or an ad |
| sr | double in [1,5] | Static rank: the quality score of the item on a 1 to 5 scale |
| fresh | double in [0,1] | Fresh score: the fraction of the item that has changed compared to the previous crawl |
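To show how the item scores might be used, the sketch below parses a DML document and keeps only items with a low spam score and a high static rank. The XML layout is an assumption: this document lists the item field names but not how items are nested, so the `<item>` element and the `SAMPLE_DML` data are invented for illustration, as are the function names and threshold values.

```python
import xml.etree.ElementTree as ET

# Hypothetical DML document: the <item> element name and nesting are
# assumptions; only the field names come from the tables above.
SAMPLE_DML = """\
<dml id="538">
  <info>
    <title>Example page</title>
    <sourceURL>http://www.example.com</sourceURL>
    <numItems>2</numItems>
  </info>
  <item>
    <id>101</id>
    <title>Real story</title>
    <type>STORY</type>
    <sp>0.05</sp>
    <sr>4.0</sr>
    <fresh>1.0</fresh>
  </item>
  <item>
    <id>102</id>
    <title>Banner ad</title>
    <type>IMAGE</type>
    <sp>0.9</sp>
    <sr>1.5</sr>
    <fresh>0.2</fresh>
  </item>
</dml>
"""

SCORE_FIELDS = ("sp", "sr", "fresh")


def parse_items(dml_text: str) -> list:
    """Return one dict per <item>, with score fields converted to float."""
    root = ET.fromstring(dml_text)
    items = []
    for element in root.iter("item"):
        fields = {child.tag: (child.text or "").strip() for child in element}
        for name in SCORE_FIELDS:
            if name in fields:
                fields[name] = float(fields[name])
        items.append(fields)
    return items


def keep_good_items(items: list, max_spam: float = 0.5, min_rank: float = 3.0) -> list:
    """Keep items with sp <= max_spam and sr >= min_rank.

    Items that lack a score are kept: a missing sp counts as 0.0 and a
    missing sr counts as 5.0.
    """
    return [
        item
        for item in items
        if item.get("sp", 0.0) <= max_spam and item.get("sr", 5.0) >= min_rank
    ]
```

With the sample data, `keep_good_items(parse_items(SAMPLE_DML))` keeps the "Real story" item and drops the "Banner ad" item. Suitable thresholds depend on the pages being followed.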