Managing Custom Rules Programmatically

You can manage your Custom API rules using our Custom Rule Management API.

Retrieving Existing Rules

To retrieve your existing rules, perform an HTTP GET request on the following endpoint:

https://api.diffbot.com/v3/custom

Provide the following arguments:

| Argument | Description |
|----------|-------------|
| token | Developer token |

This returns a JSON-formatted ruleset which corresponds to the UI elements of the Custom API Toolkit.

| Field | Description |
|-------|-------------|
| urlPattern | Regular expression used to match URLs to the appropriate rule. |
| api | Diffbot API against which the ruleset should be applied. The api value should include the /api/ string, e.g. /api/article. |
| rules | An array of rules applying to individual fields of the Diffbot API. The rules array can be empty (rules=[]), but the field must be included in API requests. |

Each entry in the rules array can contain:

| Field | Description |
|-------|-------------|
| name | Field to correct (e.g., title) or add (e.g., customCategory). |
| selector | CSS selector to find the appropriate content on the page. |
| value | Optional: a specific value to hard-code, in lieu of a selector. |
| filters | Optional: additional options to replace content, ignore selectors, or extract HTML attribute values. See below. |

Optional arguments:

| Field | Description |
|-------|-------------|
| testUrl | A sample URL used to preview your rule within the Custom API Toolkit. |
| prefilters | An array of selectors that should be completely dropped from the DOM. These selectors will be fully ignored by all Diffbot processing. |
| renderOptions | Querystring arguments to be passed to the Diffbot rendering engine, e.g. wait=5000. |
| xForwardHeaders | An object containing any custom headers to be passed along in all requests to URLs matching the urlPattern. Header values can either be a single string, or an array of strings (from which one will be selected at request-time). |

Custom headers can include:

| Header | Description |
|--------|-------------|
| User-Agent | User agent to use in place of Diffbot default. |
| Referer | Custom referer to use in place of Diffbot default. |
| Cookie | Custom cookie content to be sent with all requests. |
| Accept-Language | Custom accept-language header to be sent. |
| X-Evaluate | Custom Javascript to be executed at render-time. See Custom Javascript below. |

Rules for nested arrays (e.g., images or videos in the Article API, or products in the Product API) should be nested within the rules object.
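
As a rough illustration of the retrieval call and the required ruleset fields, the following JavaScript sketch builds the GET URL and checks a ruleset for urlPattern, api, and rules. The helper names (buildRulesUrl, missingRulesetFields, fetchRules) are hypothetical and not part of any Diffbot library, and fetchRules assumes a runtime with a global fetch (for example Node 18+ or a browser).

```javascript
// Endpoint taken from the documentation above.
const CUSTOM_ENDPOINT = 'https://api.diffbot.com/v3/custom';

// Hypothetical helper: builds the retrieval URL for a developer token.
function buildRulesUrl(token) {
  const url = new URL(CUSTOM_ENDPOINT);
  url.searchParams.set('token', token);
  return url.toString();
}

// Hypothetical helper: returns the names of required fields that are absent
// or have the wrong type. An empty rules array is accepted.
function missingRulesetFields(ruleset) {
  const missing = [];
  if (typeof ruleset.urlPattern !== 'string') missing.push('urlPattern');
  if (typeof ruleset.api !== 'string') missing.push('api');
  if (!Array.isArray(ruleset.rules)) missing.push('rules');
  return missing;
}

// Assumes a global fetch; defined here but not called.
async function fetchRules(token) {
  const response = await fetch(buildRulesUrl(token));
  if (!response.ok) {
    throw new Error('Request failed with status ' + response.status);
  }
  return response.json();
}
```

Keeping the URL construction separate from the network call makes it possible to inspect the request before sending it.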

Rule Filters (Optional)

A rule's filters element can contain multiple entries, which correspond to the filters within the API Toolkit. Each filter should contain:

| Field | Description |
|-------|-------------|
| type | Type of filter, either replace, exclude (ignore), or attribute. |
| args | Argument(s) to be used for the filter. If you are replacing content, your args should be a comma-separated list of the regular expression to search for and the expression to replace with. If you are excluding content, your args should enumerate the CSS selector(s) to ignore. If you are attempting to retrieve an HTML attribute, specify the attribute (e.g. src) in your args value. |
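
The three filter types could be assembled along the following lines. This is only a sketch: makeFilter is a hypothetical helper, the h1.headline and .byline selectors are made up for illustration, and passing the replace arguments as a two-element array (search expression, then replacement) is an assumption based on the array form of args used for the attribute filter in the sample ruleset.

```javascript
// Filter types named in the table above.
const FILTER_TYPES = ['replace', 'exclude', 'attribute'];

// Hypothetical helper: builds one entry for a rule's filters array.
function makeFilter(type, ...args) {
  if (!FILTER_TYPES.includes(type)) {
    throw new Error('Unknown filter type: ' + type);
  }
  if (args.length === 0) {
    throw new Error('A filter needs at least one argument');
  }
  return { type, args };
}

// Retrieve the src attribute of the matched element.
const imageUrlRule = {
  name: 'url',
  selector: 'img',
  filters: [makeFilter('attribute', 'src')],
};

// Hypothetical rule: strip a leading "Breaking: " prefix and ignore a byline.
const titleRule = {
  name: 'title',
  selector: 'h1.headline',
  filters: [
    makeFilter('replace', '^Breaking: ', ''), // assumed [search, replacement] order
    makeFilter('exclude', '.byline'),
  ],
};
```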

Custom Javascript

Using the X-Evaluate custom header, you can inject your own Javascript code into web pages. Custom Javascript will be executed once the DOM has loaded.

Your custom Javascript should be provided as a text string and contained in its own function. You must also include the special functions start() and end() to indicate the beginning and end of your custom script. Once end() fires, the updated document will be processed by your chosen extraction API.

It's recommended that your end() function be offset using setTimeout (see JavaScript Timing Events) in order to accommodate your primary function processing. Additionally, if your custom Javascript requires access to Ajax-delivered content, it may be necessary to offset your entire function using setTimeout in order to delay the start of your processing.

The following sample X-Evaluate header waits one-half second after the DOM has loaded, clicks the a.loadMore element (if present), then waits 800 milliseconds before calling end():

function() {
    start();
    setTimeout(function() {
        var loadMoreNode = document.querySelector('a.loadMore');
        if (loadMoreNode != null) {
            loadMoreNode.click();
            setTimeout(function() {
                end();
            }, 800);
        } else {
            end();
        }
    }, 500);
}

Delivered as a rule:

"xForwardHeaders": {
   "X-Evaluate": "function() {start();setTimeout(function(){var loadMoreNode=document.querySelector('a.loadMore');if (loadMoreNode != null) {loadMoreNode.click();setTimeout(function(){end();}, 800);} else {end();}},500);}"
}
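
One way to avoid hand-minifying the script is to write it as an ordinary function and serialize its source. The sketch below assumes that collapsing whitespace is safe for the script in question (it would alter string literals containing runs of spaces and would break // comments); toXEvaluateHeader is a hypothetical helper name.

```javascript
// Written as a normal function for readability. start() and end() are
// provided by the rendering environment, so this is never called locally.
const loadMoreScript = function() {
  start();
  setTimeout(function() {
    var loadMoreNode = document.querySelector('a.loadMore');
    if (loadMoreNode != null) {
      loadMoreNode.click();
      setTimeout(function() { end(); }, 800);
    } else {
      end();
    }
  }, 500);
};

// Hypothetical helper: turns a function into a single-line X-Evaluate value.
function toXEvaluateHeader(fn) {
  const source = fn.toString().replace(/\s+/g, ' ').trim();
  return { 'X-Evaluate': source };
}

const xEvaluate = toXEvaluateHeader(loadMoreScript);

// JSON.stringify takes care of escaping quotes inside the script.
const fragment = JSON.stringify({ xForwardHeaders: xEvaluate }, null, 2);
console.log(fragment);
```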

Sample Ruleset

The following ruleset JSON gives an example of many of the fields and functionality described above.

{
    "urlPattern": "(http(s)?://)?(.*\\.)?support.diffbot.com.*",
    "testUrl": "http://support.diffbot.com/crawlbot/using-zapier-with-crawlbot-or-bulk-api-jobs/",
    "api": "/api/article",
    "prefilters": ["#footer",".advertisement-block"],
    "renderOptions": "wait=10000",
    "xForwardHeaders": {
      "Cookie": [
        "cookie value 1",
        "cookie value 2"
      ],
      "Referer": "http://www.diffbot.com",
      "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36"
    },
    "rules": [
    {
      "selector": ".entry-content p",
      "name": "text"
    },
    {
      "selector": ".entry-content img",
      "name": "images",
      "rules": [
        {
          "name": "primary",
          "value": "true"
        },
        {
          "selector": "img",
          "name": "url",
          "filters": [
            {
              "args": [
                "src"
              ],
              "type": "attribute"
            }
          ]
        }
      ]
    }
  ]
}

The above:

  • Specifies a URL pattern regular expression that matches *.support.diffbot.com.
  • Includes a testUrl for preview in the Custom API UI.
  • Specifies the API (/api/article).
  • Then, within the rules array:
    • A simple selector to override the text (and html) field.
    • A selector for the images parent container, with its own sub-array of rules for individual images.
    • A filter on the specific image url field to retrieve the element's src attribute.
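
Before uploading a ruleset, it can be useful to check locally that its urlPattern matches its testUrl. The sketch below uses JavaScript's RegExp, which is an assumption: the regular expression flavor and anchoring behavior used by the service are not stated here, so treat this only as a sanity check. patternMatches is a hypothetical helper.

```javascript
// urlPattern and testUrl copied from the sample ruleset above.
const ruleset = {
  urlPattern: '(http(s)?://)?(.*\\.)?support.diffbot.com.*',
  testUrl: 'http://support.diffbot.com/crawlbot/using-zapier-with-crawlbot-or-bulk-api-jobs/',
};

// Hypothetical helper: tests a URL against a pattern (unanchored search).
function patternMatches(urlPattern, url) {
  return new RegExp(urlPattern).test(url);
}

const matchesTestUrl = patternMatches(ruleset.urlPattern, ruleset.testUrl);
const matchesOther = patternMatches(ruleset.urlPattern, 'http://www.example.com/');

// The dots in "support.diffbot.com" are unescaped, so each matches any character.
const matchesLookalike = patternMatches(ruleset.urlPattern, 'http://supportxdiffbot.com/');
```

Escaping the dots (support\\.diffbot\\.com in the JSON string) would restrict the pattern to the literal hostname.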

Creating or Updating Rules

An individual rule is determined by a unique urlPattern and api combination. Create or update rules by POSTing to the following endpoint:

https://api.diffbot.com/v3/custom

Append the following querystring arguments to your POST URL:

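| Argument | Description |
|----------|-------------|
| token | Developer token |

Optional arguments:

| Argument | Description |
|----------|-------------|
| replace | Hash code of a single ruleset to update. This allows you to update an existing rule's API or urlPattern without adding a new set. |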

Your POST body should contain either a single JSON object or a JSON array of JSON objects, each corresponding to the fields above.

To update or add a single ruleset, send a single JSON object. This adds the new ruleset to your token's rules (or updates the ruleset with the same urlPattern and api pair). To update a ruleset while changing either its api or urlPattern, also pass that ruleset's hash code, as returned from the ruleset's last update, in the replace argument.

To update/overwrite all rules for your token, send a JSON array of objects. This will replace all rulesets for your token.
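
The create and update variants described above differ only in the body shape and the optional replace argument. A minimal sketch of building such a request follows; buildUpdateRequest is a hypothetical helper, and the Content-Type header is an assumption rather than a documented requirement.

```javascript
// Hypothetical helper: returns the URL and fetch-style options for a POST.
// rulesets: a single ruleset object (add or update one) or an array (replace all).
// replaceHash: optional hash code of the single ruleset being updated.
function buildUpdateRequest(token, rulesets, replaceHash) {
  const url = new URL('https://api.diffbot.com/v3/custom');
  url.searchParams.set('token', token);
  if (replaceHash) {
    url.searchParams.set('replace', replaceHash);
  }
  return {
    url: url.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' }, // assumed
      body: JSON.stringify(rulesets),
    },
  };
}

const singleRuleset = { urlPattern: 'example\\.com/blog/.*', api: '/api/article', rules: [] };

// Add or update one ruleset.
const addOne = buildUpdateRequest('YOUR_TOKEN', singleRuleset);

// Change the api or urlPattern of an existing ruleset by its hash code.
const changeOne = buildUpdateRequest('YOUR_TOKEN', singleRuleset, '507a31ce');

// Overwrite every ruleset for the token.
const replaceAll = buildUpdateRequest('YOUR_TOKEN', [singleRuleset]);
```

With a runtime that provides fetch, a request built this way could be sent as fetch(addOne.url, addOne.options).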

Response

Updating or creating rules will return a JSON response containing an array of hashes. These hashes represent each of your updated or created rules, and can be used to update individual rules.

{
  "hashes":
    [
      "507a31ce",
      "ax7n3sa1",
      "z992ns6c"
    ]
}
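
A client might pull the hash codes out of the response as sketched below. extractHashes is a hypothetical helper, and the sample object simply mirrors the response shown above.

```javascript
// Hypothetical helper: returns the hashes array or throws if it is absent.
function extractHashes(responseBody) {
  const hashes = responseBody && responseBody.hashes;
  if (!Array.isArray(hashes)) {
    throw new Error('Response does not contain a "hashes" array');
  }
  return hashes;
}

const sampleResponse = { hashes: ['507a31ce', 'ax7n3sa1', 'z992ns6c'] };
const hashes = extractHashes(sampleResponse);
```

Any of these values can later be supplied as the replace argument when updating the corresponding ruleset.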