Custom API Selectors and Filters

The API Toolkit uses advanced CSS selector logic to override the output of default Diffbot fields (in an Automatic API) or to create entirely new fields. When editing your rules, you can use the following selectors and logic to populate your output.

Basic Selectors

Pattern | Matches | Example
* | any element | *
tagname | elements with the given tag name | div, p
namespace|type | elements of type "type" in the namespace "namespace" | fb|name finds <fb:name> elements
#id | elements with an ID attribute of "id" | div#container, #header
.class | elements with a class name of "class" | div.left, .post-body
element[attr] or [attr] | elements with an attribute named "attr" (with any value) | a[href], [title]
element[attr=val] or [attr=val] | elements with an attribute named "attr" and value equal to "val" | img[width=500], a[rel=nofollow]
[^attrPrefix] | elements with an attribute name starting with "attrPrefix"; use to find elements with HTML5 datasets | [^data-], div[^data-]
[attr^=valPrefix] | elements with an attribute named "attr" and value starting with "valPrefix" | a[href^=http:]
[attr$=valSuffix] | elements with an attribute named "attr" and value ending with "valSuffix" | img[src$=.png]
[attr*=valContaining] | elements with an attribute named "attr" and value containing "valContaining" | a[href*=/search/]
[attr~=regex] | elements with an attribute named "attr" and value matching the regular expression | img[src~=(?i)\\.(png|jpe?g)]

The above may be combined in any order, for example div.header[title].
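
As a rough illustration of how the attribute operators above compare values, the following Python sketch applies one operator to a single element's attributes. The function name attr_matches and the dictionary representation of an element are hypothetical and not part of Diffbot. Case-sensitive comparison is an assumption made here, and Python's re module only stands in for the Java regular expression engine.

```python
import re

def attr_matches(attrs: dict, name: str, op: str, value: str) -> bool:
    """Approximate one attribute test from the table for a single element.

    attrs is a plain dict of attribute names to values (a hypothetical
    stand-in for a parsed element).
    """
    actual = attrs.get(name)
    if actual is None:
        return False                      # the attribute is not present at all
    if op == "=":
        return actual == value            # [attr=val]
    if op == "^=":
        return actual.startswith(value)   # [attr^=valPrefix]
    if op == "$=":
        return actual.endswith(value)     # [attr$=valSuffix]
    if op == "*=":
        return value in actual            # [attr*=valContaining]
    if op == "~=":
        # [attr~=regex]: assumed to succeed when the regex is found anywhere in the value
        return re.search(value, actual) is not None
    raise ValueError(f"unsupported operator: {op}")

link = {"href": "http://example.com/search/q", "rel": "nofollow"}
image = {"src": "/img/Logo.PNG", "width": "500"}

is_http = attr_matches(link, "href", "^=", "http:")
is_search = attr_matches(link, "href", "*=", "/search/")
# A single backslash is used because this is a raw Python string.
is_picture = attr_matches(image, "src", "~=", r"(?i)\.(png|jpe?g)")
is_png_suffix = attr_matches(image, "src", "$=", ".png")
```

In this sketch is_png_suffix is False because the suffix test is case-sensitive, while the regular expression test succeeds thanks to the (?i) flag.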

Combinators

The following can be used to specify certain elements based on their relation to other elements on the page (parents, children, siblings, etc.).

Pattern | Matches | Example
E F | an F element descended from an E element | div a, .logo h1
E > F | an F direct child of E | ol > li
E + F | an F element immediately preceded by sibling E | li + li, div.head + div
E ~ F | an F element preceded by sibling E | h1 ~ p
E, F, G | all matching elements E, F, or G | a[href], div, h3
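
The difference between the two sibling combinators is easy to miss, so here is a minimal Python sketch using the standard xml.etree.ElementTree module. The helper names adjacent_siblings and general_siblings are hypothetical, the sample markup is made up, and the helpers only look at the children of one parent element rather than a whole page.

```python
import xml.etree.ElementTree as ET

# Made-up sample: a div whose children are h1, p, span, p (in that order).
doc = ET.fromstring("<div><h1>Title</h1><p>one</p><span>x</span><p>two</p></div>")

def adjacent_siblings(parent, e_tag, f_tag):
    """E + F: F children whose immediately preceding sibling is an E."""
    kids = list(parent)
    return [f for prev, f in zip(kids, kids[1:])
            if prev.tag == e_tag and f.tag == f_tag]

def general_siblings(parent, e_tag, f_tag):
    """E ~ F: F children that come after at least one E sibling."""
    result, seen_e = [], False
    for child in parent:
        if seen_e and child.tag == f_tag:
            result.append(child)
        if child.tag == e_tag:
            seen_e = True
    return result

plus_texts = [p.text for p in adjacent_siblings(doc, "h1", "p")]   # h1 + p
tilde_texts = [p.text for p in general_siblings(doc, "h1", "p")]   # h1 ~ p
child_texts = [p.text for p in doc.findall("./p")]                 # div > p
```

With this sample, h1 + p yields only the first paragraph, while h1 ~ p and div > p both yield the two paragraphs.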

Pseudo Selectors

The following advanced selectors are also available.

Pattern | Matches | Example
:first-child | elements that are the first child of some other element | div > p:first-child finds each p that is the first child of a div
:last-child | elements that are the last child of some other element | ul > li:last-child finds the last list item in each unordered list
:only-child | elements that are the only child of a parent element | p:only-child finds paragraphs without sibling elements
:first-of-type | elements that are the first sibling of their type in the list of children of their parent element | div > p:first-of-type finds the first p element of each div
:last-of-type | elements that are the last sibling of their type in the list of children of their parent element | div > span:last-of-type finds the last span element within div elements
:only-of-type | an element that has a parent element and whose parent element has no other element children with the same expanded element name | p:only-of-type finds paragraphs without sibling p elements
:empty | elements that have no children at all | p:empty finds paragraphs without children
:nth-child(an+b) | elements that have an+b-1 siblings before them in the document tree, for any positive integer or zero value of n, and have a parent element; can also take "odd" and "even" as arguments | tr:nth-child(2n+1) finds every odd row of a table
:nth-last-child(an+b) | elements that have an+b-1 siblings after them in the document tree | tr:nth-last-child(-n+2) finds the last two rows of a table
:nth-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and have a parent element | img:nth-of-type(2n+1)
:nth-last-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and have a parent element | img:nth-last-of-type(2n+1)
:lt(n) | elements whose sibling index is less than n (the index starts at 0) | td:lt(3) finds the first three cells of each row
:gt(n) | elements whose sibling index is greater than n | td:gt(1) finds cells after skipping the first two
:eq(n) | elements whose sibling index is equal to n | td:eq(0) finds the first cell of each row
:has(selector) | elements that contain at least one element matching the selector | div:has(p) finds divs that contain p elements
:not(selector) | elements that do not match the selector | div:not(.logo) finds all divs that do not have the "logo" class; div:not(:has(div)) finds divs that do not contain divs
:contains(text) | elements that contain the specified text. The search is case-insensitive. The text may appear in the found element or any of its descendants. | p:contains(jsoup) finds p elements containing the text "jsoup"
:matches(regex) | elements whose text matches the specified regular expression. The text may appear in the found element or any of its descendants. | td:matches(\\d+) finds table cells containing digits; div:matches((?i)login) finds divs containing the text "login", case-insensitively
:containsOwn(text) | elements that directly contain the specified text. The search is case-insensitive. The text must appear in the found element, not any of its descendants. | p:containsOwn(jsoup) finds p elements with own text "jsoup"
:matchesOwn(regex) | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. | td:matchesOwn(\\d+) finds table cells directly containing digits; div:matchesOwn((?i)login) finds divs directly containing the text "login", case-insensitively

The above may be combined in any order and with other selectors, for example .light:contains(name):eq(0).
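
The positional selectors mix two counting schemes: the an+b formulas count positions from 1, while :lt, :gt and :eq use a sibling index that starts at 0 (as the td:eq(0) example implies). The Python sketch below works through both. The function names are hypothetical, and each list simply stands for the element children of one parent, all of the same type.

```python
def matches_nth(a: int, b: int, position: int) -> bool:
    """True if the 1-based position equals a*n + b for some integer n >= 0."""
    if a == 0:
        return position == b
    n, remainder = divmod(position - b, a)
    return remainder == 0 and n >= 0

def nth_child(items, a, b):
    """Items whose 1-based position satisfies an+b, as in :nth-child(an+b)."""
    return [x for pos, x in enumerate(items, start=1) if matches_nth(a, b, pos)]

def nth_last_child(items, a, b):
    """Same test with positions counted from the end, as in :nth-last-child(an+b)."""
    total = len(items)
    return [x for pos, x in enumerate(items, start=1)
            if matches_nth(a, b, total - pos + 1)]

rows = ["row1", "row2", "row3", "row4", "row5"]
cells = ["c0", "c1", "c2", "c3"]

odd_rows = nth_child(rows, 2, 1)        # tr:nth-child(2n+1)
last_two = nth_last_child(rows, -1, 2)  # tr:nth-last-child(-n+2)
first_three = cells[:3]                 # td:lt(3): sibling index 0, 1, 2
after_two = cells[2:]                   # td:gt(1): sibling index 2 and up
first_cell = cells[0]                   # td:eq(0)
```

For the five rows above, 2n+1 selects rows 1, 3 and 5, and -n+2 counted from the end selects rows 4 and 5.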

Output Filters

When creating a new rule, the following filters can be applied to the default selector output to further refine returned data.

Filter | Description
Attribute | Retrieves the specified attribute value of an element. For example, to extract the link (http://blog.diffbot.com) from the anchor tag <a href="http://blog.diffbot.com" class="outbound">, enter href as your attribute filter. You can use only one attribute filter per rule.
Ignore | Ignores the specified selectors (and all of their descendants) if they are found within the primary CSS selector. You may use any of the selector formats specified in this help screen.
Replace | Lets you specify match and replace regular expressions to alter the output returned by the Diffbot API. To remove matching content, leave the "replace with" field blank. Backreferences are also supported. For example, to prepend text, use the match expression (^.*$) and the replacement "prefix: $1", where $1 stands for the matched text.

Diffbot uses a Java implementation for its regular expression parsing. Regular-Expressions.info offers an excellent overview of language-specific distinctions. For more details on Diffbot's regular expression implementation, please see this Support article.
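
To try a match/replace pair before entering it as a Replace filter, a small Python sketch such as the following can help. It is only an approximation: Python's re module stands in for the Java engine, and the hypothetical helper java_style_replace translates Java-style $1 backreferences into Python's \g<1> form. The sample strings are made up, and other Java-specific replacement syntax is not handled.

```python
import re

def java_style_replace(pattern: str, replacement: str, text: str) -> str:
    """Apply a match/replace pair written with Java-style $n backreferences."""
    # Translate $1, $2, ... into Python's \g<1>, \g<2>, ... notation.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text)

# Prepend text: match the whole value and reuse it as $1.
prefixed = java_style_replace(r"(^.*$)", "prefix: $1", "Diffbot")

# Remove matching content: an empty replacement deletes the match.
removed = java_style_replace(r"\s*\(.*?\)", "", "Widget (refurbished)")
```

Here prefixed becomes "prefix: Diffbot" and removed becomes "Widget". Patterns that rely on Java-only regex features may behave differently in this sketch.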