Bulk API

The Bulk API allows you to submit large numbers of URLs for Diffbot API processing. It is built atop the Crawlbot API and shares much of the same underlying model. Both APIs return data in JSON format.

Creating a Bulk Job

To create a bulk job, make a POST request to https://api.diffbot.com/v3/bulk.

Set your Content-Type header to application/x-www-form-urlencoded (not multipart/form-data). POST body content should be in querystring format (key/value pairs), e.g.:

name=bulkTest&token=sampletoken&urls=http://www.diffbot.com http://blog.diffbot.com&apiUrl=https://api.diffbot.com/v3/analyze

Your POST body should contain the following fields:

token: Developer token.
name: Job name. This should be a unique identifier and will be used to modify your bulk job and retrieve its output.
urls: Space-delimited list of URLs to process. If you are on the Startup plan, jobs must contain at least 50 URLs.
apiUrl: The full Diffbot API URL to be used for each URL. For instance, to process each URL via the article API, supply https://api.diffbot.com/v3/article. You may also include API parameters, e.g. https://api.diffbot.com/v3/article?fields=meta,tags.

Optional Arguments

customHeaders: Set custom headers to be used for processing each URL. Send multiple customHeaders values in your POST body, with header keys/values delimited by a colon (and URL-encoded). See more information on using this functionality.
notifyEmail: Send a message to this email address when the bulk job completes.
notifyWebhook: Pass a URL to be notified when the bulk job completes. You will receive a POST with the full JSON response in the POST body.
repeat: Specify the number of days, as a floating-point value (e.g. repeat=7.0), after which to repeat this job. By default bulk jobs will not be repeated.
maxRounds: Specify the maximum number of repeats. Use maxRounds=-1 to repeat continually.
pageProcessPattern: Enter ||-separated strings to limit pages processed to those whose HTML contains any of the content strings. If a page does not contain at least one of the strings, it will be ignored.
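
As a rough illustration of the fields above, the following Python sketch assembles the form-encoded POST body using only the standard library. The function name build_bulk_job_request and the sample values are hypothetical, and the request is only built, not sent. Passing a list (for example for customHeaders) to emit repeated key/value pairs is an assumption about how multiple values should be sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BULK_ENDPOINT = "https://api.diffbot.com/v3/bulk"


def build_bulk_job_request(token, name, urls, api_url, **optional):
    """Build (without sending) the POST request that creates a bulk job."""
    fields = {
        "token": token,
        "name": name,
        "urls": " ".join(urls),  # the API expects a space-delimited list
        "apiUrl": api_url,
    }
    # Optional arguments, e.g. notifyEmail="me@example.com" or repeat=7.0.
    # A list value (as for customHeaders) becomes repeated key/value pairs.
    fields.update(optional)
    body = urlencode(fields, doseq=True).encode("ascii")
    return Request(
        BULK_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_bulk_job_request(
    token="sampletoken",
    name="bulkTest",
    urls=["http://www.diffbot.com", "http://blog.diffbot.com"],
    api_url="https://api.diffbot.com/v3/analyze",
    repeat=7.0,
)
# To submit the job, pass `request` to urllib.request.urlopen and
# JSON-decode the response body.
```

Building the request separately from sending it makes it easy to inspect the encoded body before any URLs are submitted.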


Upon adding a new bulk job, you will receive a success message in the JSON response, in addition to full job details:

  "response": "Successfully added urls for spidering."

Pausing, Deleting or Restarting Bulk Jobs

You can manage your bulk jobs by making GET requests to the same endpoint, https://api.diffbot.com/v3/bulk.

Provide the following data:

token: Developer token.
name: Job name as defined when the bulk job was created.
pause: Pass pause=1 to pause a bulk job. Pass pause=0 to resume a paused job.
restart: Pass restart=1 to restart a bulk job. This will erase all processed data and re-process all of the submitted URLs.
delete: Pass delete=1 to delete a job, and all associated data, completely.
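
The management calls above differ only in their query string, so a small helper can build them. This is a minimal sketch: build_manage_url and the action names ("pause", "resume", "restart", "delete") are hypothetical conveniences, not part of the API itself.

```python
from urllib.parse import urlencode

BULK_ENDPOINT = "https://api.diffbot.com/v3/bulk"

# Hypothetical action names mapped to the documented query parameters.
ACTIONS = {
    "pause": ("pause", 1),
    "resume": ("pause", 0),
    "restart": ("restart", 1),  # erases processed data and re-processes all URLs
    "delete": ("delete", 1),    # removes the job and all associated data
}


def build_manage_url(token, name, action):
    """Return the GET URL that applies `action` to the named bulk job."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    key, value = ACTIONS[action]
    return BULK_ENDPOINT + "?" + urlencode({"token": token, "name": name, key: value})


pause_url = build_manage_url("sampletoken", "bulkTest", "pause")
resume_url = build_manage_url("sampletoken", "bulkTest", "resume")
```

Because restart and delete discard data, rejecting unknown action names up front guards against a typo silently producing a request without the intended parameter.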

Retrieving Bulk Job Data

To download results, make a GET request to https://api.diffbot.com/v3/bulk/data. Provide the following arguments based on the data you need. By default, the complete extracted JSON data will be downloaded.

token: Diffbot token.
name: Name of the bulk job whose data you wish to download.
format: Request format=csv to download the extracted data in CSV format (default: json). Note that CSV files will only contain top-level fields.

For diagnostic data:

type: Request type=urls to retrieve the bulk URL Report (CSV).
num: Pass an integer value (e.g. num=100) to request a subset of URLs, most recently processed first.
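
The download arguments above can be combined into a URL as in the following sketch. The helper build_data_url and its keyword names (fmt, report_type) are hypothetical; only the query parameters themselves come from the list above.

```python
from urllib.parse import urlencode

DATA_ENDPOINT = "https://api.diffbot.com/v3/bulk/data"


def build_data_url(token, name, fmt="json", report_type=None, num=None):
    """Return the GET URL for downloading a bulk job's data or URL report."""
    params = {"token": token, "name": name}
    if fmt != "json":
        params["format"] = fmt        # json is the default, so only send csv
    if report_type is not None:
        params["type"] = report_type  # "urls" requests the bulk URL Report
    if num is not None:
        params["num"] = int(num)      # subset size, most recent first
    return DATA_ENDPOINT + "?" + urlencode(params)


json_url = build_data_url("sampletoken", "bulkJob")
csv_url = build_data_url("sampletoken", "bulkJob", fmt="csv")
report_url = build_data_url("sampletoken", "bulkJob", report_type="urls", num=100)
```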

Using the Search API to Retrieve Bulk Job Data

You can also use Diffbot's Search API to fine-tune the data retrieved from your Crawlbot or Bulk API jobs.

Search API documentation

Viewing Bulk Job Details

Your active bulk jobs (along with any Crawlbot crawls) will be returned in the jobs array of the response to a request made to https://api.diffbot.com/v3/bulk.

To retrieve a single job's details, provide the job's name in your request:

token: Developer token.
name: Name of bulk job (or crawl) to retrieve.

To view all bulk jobs (and crawls), simply omit the name parameter: https://api.diffbot.com/v3/bulk?token={{token}}


This returns a JSON response listing your token's Bulk API jobs (and crawls). Sample response for a single job:

{
  "jobs": [
    {
      "name": "bulkJob",
      "type": "bulk",
      "jobCreationTimeUTC": 1426872272,
      "jobCompletionTimeUTC": 1426872504,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 100,
      "urlsHarvested": 200,
      "pageCrawlAttempts": 100,
      "pageCrawlSuccesses": 100,
      "pageCrawlSuccessesThisRound": 100,
      "pageProcessAttempts": 100,
      "pageProcessSuccesses": 100,
      "pageProcessSuccessesThisRound": 100,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 0,
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822680,
      "currentTimeUTC": 1443822680,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "https://api.diffbot.com/v3/bulk/download/sampletoken-bulkJob_data.json",
      "downloadUrls": "https://api.diffbot.com/v3/bulk/download/sampletoken-bulkJob_urls.csv",
      "notifyEmail": "support@diffbot.com",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}
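
A response shaped like the sample above might be reduced to a short summary as follows. This is a sketch under assumptions: the helper summarize_job is hypothetical, the SAMPLE string is a trimmed copy of the sample response, and the two *TimeUTC fields are assumed to be Unix timestamps in seconds.

```python
import json

# Trimmed copy of the sample response, keeping only the fields used below.
SAMPLE = """
{"jobs": [{"name": "bulkJob", "type": "bulk",
  "jobCreationTimeUTC": 1426872272, "jobCompletionTimeUTC": 1426872504,
  "jobStatus": {"status": 9,
                "message": "Job has completed and no repeat is scheduled."},
  "pageProcessAttempts": 100, "pageProcessSuccesses": 100,
  "downloadJson": "https://api.diffbot.com/v3/bulk/download/sampletoken-bulkJob_data.json"}]}
"""


def summarize_job(job):
    """Pick out the status, timing, and download link for one job entry."""
    attempts = job.get("pageProcessAttempts", 0)
    successes = job.get("pageProcessSuccesses", 0)
    created = job.get("jobCreationTimeUTC")
    completed = job.get("jobCompletionTimeUTC")
    return {
        "name": job["name"],
        "status": job["jobStatus"]["status"],
        "message": job["jobStatus"]["message"],
        # Assumes both timestamps are Unix seconds; None if either is unset.
        "elapsed_seconds": completed - created if created and completed else None,
        "success_rate": successes / attempts if attempts else None,
        "download": job.get("downloadJson"),
    }


summaries = [summarize_job(job) for job in json.loads(SAMPLE)["jobs"]]
```

For the sample job this yields a status of 9, an elapsed time of 232 seconds, and a success rate of 1.0.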

Status Codes

The jobStatus object will return the following status codes and associated messages:

0: Job is initializing
6: Job paused
7: Job in progress
8: All crawling temporarily paused by root administrator for maintenance.
9: Job has completed and no repeat is scheduled
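
When polling a job, the status codes above can be kept in a lookup table. The sketch below covers only the codes listed in this section; the names JOB_STATUS, describe_status, and is_finished are hypothetical, and any other code is reported as unknown rather than guessed at.

```python
# Status codes and messages as listed above.
JOB_STATUS = {
    0: "Job is initializing",
    6: "Job paused",
    7: "Job in progress",
    8: "All crawling temporarily paused by root administrator for maintenance.",
    9: "Job has completed and no repeat is scheduled",
}


def describe_status(code):
    """Return the documented message for a status code, if listed here."""
    return JOB_STATUS.get(code, f"Unknown status code {code}")


def is_finished(job):
    """True when the job's jobStatus reports code 9 (completed, no repeat)."""
    return job["jobStatus"]["status"] == 9


job = {"jobStatus": {"status": 7, "message": "Job in progress"}}
still_running = not is_finished(job)
```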