Crawlbot API

The Crawlbot API allows you to programmatically manage Crawlbot crawls and retrieve output. Crawlbot API responses are in JSON format.

Creating or Updating a Crawl

To create a crawl, make a GET request to https://api.diffbot.com/v3/crawl.

Provide the following data:

token
  Developer token.

name
  Job name. This should be a unique identifier and can be used to modify your crawl or retrieve its output.

seeds
  Seed URL(s). Must be URL-encoded. Separate multiple URLs with whitespace to spider multiple sites within the same crawl. If a seed contains a non-www subdomain ("http://blog.diffbot.com" or "http://support.diffbot.com"), Crawlbot will restrict spidering to that subdomain.

apiUrl
  Full Diffbot API URL through which to process pages, e.g., &apiUrl=https://api.diffbot.com/v3/article to process matching links via the Article API. The Diffbot API URL can include querystring parameters to tailor the output. For example, &apiUrl=https://api.diffbot.com/v3/product?fields=querystring,meta will process matching links using the Product API and also return the querystring and meta fields.

To automatically identify and process content using our Page Classifier API (Smart Processing), pass apiUrl=https://api.diffbot.com/v3/analyze?mode=auto to return all page types. See the full Page Classifier documentation under the Automatic APIs section.

Be sure to URL encode your Diffbot API actions.
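As an illustration, here is a minimal Python sketch (using the requests library, which URL-encodes parameter values automatically) that creates a crawl processing article pages. The token, job name, and seed URL are placeholders, not values from this documentation:

import requests

# Placeholders: substitute your own developer token and identifiers.
TOKEN = "YOUR_DIFFBOT_TOKEN"

params = {
    "token": TOKEN,
    "name": "sampleCrawl",
    "seeds": "https://blog.diffbot.com",
    # Process matching pages with the Article API; requests URL-encodes
    # this value (including its nested querystring) for you.
    "apiUrl": "https://api.diffbot.com/v3/article?fields=meta,querystring",
}

resp = requests.get("https://api.diffbot.com/v3/crawl", params=params)
resp.raise_for_status()
print(resp.json().get("response"))  # e.g. "Successfully added urls for spidering."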

You can refine your crawl using the following optional controls. Read more on crawling versus processing.

urlCrawlPattern
  Specify ||-separated strings to limit pages crawled to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string (e.g., !product to exclude URLs containing the string "product") and the ^ and $ characters to limit matches to the beginning or end of the URL. A combined example appears after this table.

  Using a urlCrawlPattern allows Crawlbot to spider outside of the seed domain; it will follow all matching URLs regardless of domain.

urlCrawlRegEx
  Specify a regular expression to limit pages crawled to those URLs that contain a match for your expression. This overrides any urlCrawlPattern value.

urlProcessPattern
  Specify ||-separated strings to limit pages processed to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string (e.g., !/category to exclude URLs containing the string "/category") and the ^ and $ characters to limit matches to the beginning or end of the URL.

urlProcessRegEx
  Specify a regular expression to limit pages processed to those URLs that contain a match for your expression. This overrides any urlProcessPattern value.

pageProcessPattern
  Specify ||-separated strings to limit pages processed to those whose HTML contains any of the content strings.
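The pattern controls can be combined in a single crawl request. A hypothetical Python sketch; the pattern values are illustrative only:

import requests

# Crawl only URLs containing "/blog/", and skip processing of any URL
# containing "/category" (the leading ! negates the match).
params = {
    "token": "YOUR_DIFFBOT_TOKEN",
    "name": "patternCrawl",
    "seeds": "https://blog.diffbot.com",
    "apiUrl": "https://api.diffbot.com/v3/article",
    "urlCrawlPattern": "/blog/",        # join multiple strings with ||
    "urlProcessPattern": "!/category",
}

requests.get("https://api.diffbot.com/v3/crawl", params=params)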

Additional (optional) Parameters:

customHeaders
  Set multiple custom headers to be used while crawling and processing pages sent to Diffbot APIs. Each header should be sent in its own customHeaders argument, with a colon delimiting the header name and value, and should be URL-encoded. For example, &customHeaders=Accept-Language%3Aen-us. See more information on using this functionality.

obeyRobots
  Pass obeyRobots=0 to ignore a site's robots.txt instructions.

restrictDomain
  Pass restrictDomain=0 to allow limited crawling across subdomains/domains.

useProxies
  Set useProxies=1 to force the use of proxy IPs for the crawl. Proxy servers will be used for both crawling and processing of pages.

maxHops
  Specify the depth of your crawl. maxHops=0 limits processing to the seed URL(s) only; no other links will be processed. maxHops=1 processes all (otherwise matching) pages whose links appear on the seed URL(s); maxHops=2 processes pages whose links appear on those pages; and so on.

  By default (maxHops=-1) Crawlbot will crawl and process links at any depth.

maxToCrawl
  Maximum number of pages to spider. Default: 100,000.

maxToProcess
  Maximum number of pages to process through Diffbot APIs. Default: 100,000.

notifyEmail
  Send a message to this email address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.

notifyWebhook
  Pass a URL to be notified when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the job's JSON metadata in the POST body. Note that in webhook POSTs the parent jobs will not be sent; only the individual job object will be returned. A sample receiver is sketched after this table.

  We've integrated with Zapier to make webhooks even more powerful; read more on what you can do with Zapier and Diffbot.

crawlDelay
  Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number (e.g., crawlDelay=0.25).

repeat
  Specify the number of days as a floating-point value (e.g., repeat=7.0) to repeat this crawl. By default crawls will not be repeated.

onlyProcessIfNew
  By default, repeat crawls will only process new (previously unprocessed) pages. Set onlyProcessIfNew=0 to process all content on repeat crawls.

maxRounds
  Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.
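A sketch of a notifyWebhook receiver, assuming a small Flask app; the route path and port are arbitrary choices for illustration:

from flask import Flask, request

app = Flask(__name__)

# Crawlbot POSTs the job's JSON metadata here when the crawl completes
# or hits a limit, with the crawl name and status in the headers.
@app.route("/crawl-finished", methods=["POST"])
def crawl_finished():
    name = request.headers.get("X-Crawl-Name")
    status = request.headers.get("X-Crawl-Status")
    job = request.get_json(force=True)  # the individual job object
    print(f"Crawl {name}: status {status}, jobStatus {job.get('jobStatus')}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)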

Response

Upon adding a new crawl, you will receive a success message in the JSON response, in addition to full crawl details:

  "response": "Successfully added urls for spidering."

Pausing, Restarting or Deleting Crawls

You can manage your crawls by making GET requests to the same endpoint, https://api.diffbot.com/v3/crawl.

Provide the following data:

token
  Developer token.

name
  Job name as defined when the crawl was created.

Job-control arguments:

roundStart
  Pass roundStart=1 to force the start of a new crawl "round" (manually repeat the crawl). If onlyProcessIfNew is set to 1 (the default), only newly-created pages will be processed.

pause
  Pass pause=1 to pause a crawl. Pass pause=0 to resume a paused crawl.

restart
  Pass restart=1 to restart a crawl. Restarting removes all crawled data while maintaining crawl settings.

delete
  Pass delete=1 to completely delete a crawl and all of its associated data.
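For example, pausing and later resuming a crawl is just two GET requests (the token and job name are placeholders):

import requests

ENDPOINT = "https://api.diffbot.com/v3/crawl"
params = {"token": "YOUR_DIFFBOT_TOKEN", "name": "sampleCrawl"}

# Pause the crawl ...
requests.get(ENDPOINT, params={**params, "pause": 1})

# ... and resume it later.
requests.get(ENDPOINT, params={**params, "pause": 0})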

Retrieving Crawlbot API Data

To download results, make a GET request to https://api.diffbot.com/v3/crawl/data. Provide the following arguments based on the data you need. By default the complete extracted JSON data will be downloaded.

token
  Diffbot token.

name
  Name of the crawl whose data you wish to download.

format
  Request format=csv to download the extracted data in CSV format (default: json). Note that CSV files will only contain top-level fields.

For diagnostic data:

type
  Request type=urls to retrieve the crawl URL Report (CSV).

num
  Pass an integer value (e.g., num=100) to request a subset of URLs, most recently crawled first.
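A sketch of both download modes in Python; the crawl name and output filename are placeholders:

import requests

DATA_ENDPOINT = "https://api.diffbot.com/v3/crawl/data"
params = {"token": "YOUR_DIFFBOT_TOKEN", "name": "sampleCrawl"}

# Complete extracted objects as JSON (the default format).
objects = requests.get(DATA_ENDPOINT, params=params).json()

# Diagnostic URL report, limited to the 100 most recently crawled URLs.
report = requests.get(DATA_ENDPOINT, params={**params, "type": "urls", "num": 100})
with open("sampleCrawl_urls.csv", "wb") as f:
    f.write(report.content)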

Using the Search API to Retrieve Crawl Data

You can also use Diffbot's Search API to fine-tune the data retrieved from your Crawlbot or Bulk API jobs.

See the Search API documentation for details.

Viewing Crawl Details

Your active crawls (and any active Bulk API jobs) are returned in the jobs object of the response to a GET request to https://api.diffbot.com/v3/crawl.

To retrieve a single crawl's details, provide the crawl's name in your request:

token
  Developer token.

name
  Name of the crawl to retrieve.

To view all crawls, simply omit the name parameter.
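For instance (a Python sketch with placeholder token and crawl name):

import requests

ENDPOINT = "https://api.diffbot.com/v3/crawl"
TOKEN = "YOUR_DIFFBOT_TOKEN"

# All of this token's crawl and Bulk API jobs.
all_jobs = requests.get(ENDPOINT, params={"token": TOKEN}).json()["jobs"]

# Details for one named crawl.
job = requests.get(
    ENDPOINT, params={"token": TOKEN, "name": "sampleCrawl"}
).json()["jobs"][0]
print(job["jobStatus"])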

Response

This will return a JSON response listing your token's crawl (and Bulk API) jobs in the jobs object. If you specify a single job name, only that job's details will be returned.

Sample response from a single crawl:

{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 177,
      "urlsHarvested": 2152,
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageCrawlSuccessesThisRound": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210,
      "pageProcessSuccessesThisRound": 210,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 1,
      "maxToCrawl": 100000,
      "maxToProcess": 100000,
      "onlyProcessIfNew": 1,
      "seeds": "http://support.diffbot.com",
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822683,
      "currentTimeUTC": 1443822683,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json",
      "downloadUrls": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_urls.csv",
      "notifyEmail": "support@diffbot.com",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}

Status Codes

The jobStatus object will return the following status codes and associated messages:

Status  Message
0       Job is initializing
1       Job has reached maxRounds limit
2       Job has reached maxToCrawl limit
3       Job has reached maxToProcess limit
4       Next round to start in _____ seconds
5       No URLs were added to the crawl
6       Job paused
7       Job in progress
8       All crawling temporarily paused by root administrator for maintenance
9       Job has completed and no repeat is scheduled
10      Failed to crawl any seed (indicates a problem retrieving links from the seed URL(s))
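These codes lend themselves to simple polling. A sketch that waits for a crawl to finish; the 30-second interval and the terminal codes checked (9 for completion, 10 for seed failure) are choices made for illustration:

import time
import requests

ENDPOINT = "https://api.diffbot.com/v3/crawl"
params = {"token": "YOUR_DIFFBOT_TOKEN", "name": "sampleCrawl"}

while True:
    job_status = requests.get(ENDPOINT, params=params).json()["jobs"][0]["jobStatus"]
    print(job_status["status"], job_status["message"])
    if job_status["status"] in (9, 10):  # completed, or failed to crawl any seed
        break
    time.sleep(30)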