Crawlbot API: Added the useCanonical argument to allow disabling of canonical URL deduplication on specific crawls. See more. 2017-11-10
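As a rough illustration of how this argument might be passed when creating a crawl, the sketch below assembles request parameters. Only the useCanonical name comes from the entry above; the endpoint URL, the other parameter names (token, name, seeds, apiUrl), the helper name, and the value 0 for disabling deduplication are assumptions made for this example.

```python
from urllib.parse import urlencode

# Assumed endpoint, shown only to make the example concrete.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_params(token, name, seeds, api_url, use_canonical=True):
    """Return the form parameters for a hypothetical crawl-creation request."""
    params = {
        "token": token,
        "name": name,
        "seeds": " ".join(seeds),  # assumption: seeds are space-separated
        "apiUrl": api_url,
    }
    if not use_canonical:
        # Assumption: 0 disables canonical URL deduplication for this crawl.
        params["useCanonical"] = 0
    return params


params = build_crawl_params(
    "YOUR_TOKEN",
    "demo-crawl",
    ["https://example.com"],
    "https://api.diffbot.com/v3/analyze",
    use_canonical=False,
)
print(CRAWL_ENDPOINT + "?" + urlencode(params))
```

Leaving the argument out entirely keeps whatever the default behavior is, which is why the helper only adds it when deduplication should be turned off.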
Significant improvements to Video API site support. 2017-10-30
Custom API fields using the attribute filter will now return all matching selector values, not just the first attribute match. 2017-10-25
Crawlbot and Bulk Service data retrieval no longer requires access to port :18100. Data downloads are also now HTTPS-only.
Fixed a rare issue where custom rules could be accidentally deleted.
Significant performance improvements in the Search API.
Improved crawling performance and site coverage in the Global Index.
Improved ability to identify, analyze and return background images in all extraction APIs.
Fixed an issue in the Video API where the url value would retain HTML escaping if present within the original page source.
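To make the symptom concrete, here is a minimal sketch of what retained HTML escaping looks like in a URL and how it can be undone on the client side. The helper name and sample URL are hypothetical, and this is not a description of how the Video API itself performs the fix.

```python
import html


def unescape_video_url(raw_url: str) -> str:
    """Decode HTML entities (such as &amp;) left over in a URL string."""
    return html.unescape(raw_url)


# A URL copied verbatim from page source often carries &amp; instead of &.
print(unescape_video_url("https://example.com/watch?v=abc&amp;t=10"))
# https://example.com/watch?v=abc&t=10
```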
Fixed a rare crawling issue that occasionally resulted in “Bad IP” status messages for individual pages.
Fixed an issue where empty <video> elements could be returned in the Article API.
Fixed an issue in the Global Index in which complicated Boolean (OR) queries would return no results.
Improved date normalization to include Hijri and Jalali dates.
Fixed support for Unicode characters in API Toolkit rules.
Many improvements to brand detection in the Product API.
Resolved an issue where humanLanguage could be mis-identified on some Spanish-language pages.
Crawlbot: resolved an issue where IP-address-only webhooks would not receive notifications.
Crawlbot: improved link spidering/harvesting resilience to markup errors and other invalid HTML source.
Fixed an issue where custom APIs would not display in Crawlbot and Bulk Processing dashboard.
Improved link-detection when returning page links via our
Improved support for and handling of the
sizes) image attributes in all APIs.
Added detection of Afrikaans (
af) in the
Improved duplicate detection in the Diffbot Global Index. 2017-04-21
category field has been added to the Product API. See documentation.
All extraction APIs now support the sending of completely custom headers using X-Forward- terminology. Previously only four defined headers were supported. See Article API example. 2017-04-10
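The prefixing convention can be sketched as a small helper that turns the headers you want delivered to the target site into their X-Forward- form. The function name and the sample header names and values are hypothetical; the entry above confirms only the X-Forward- prefix itself.

```python
def forward_headers(custom_headers):
    """Prefix each header name with X-Forward- so it is passed along to the target site."""
    return {f"X-Forward-{name}": value for name, value in custom_headers.items()}


headers = forward_headers({
    "User-Agent": "ExampleBot/1.0",
    "My-Custom-Header": "abc123",  # hypothetical custom header
})
for name, value in headers.items():
    print(f"{name}: {value}")
```

The resulting dictionary would be sent as ordinary request headers alongside the API call.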
In the Article and Discussion APIs’ tags element, DBPedia uri values are now properly URL-encoded.
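For readers unsure what URL-encoding changes in practice, the sketch below percent-encodes a URI containing a non-ASCII character. The sample URI and helper name are made up for illustration, and the exact set of characters the APIs escape is not stated in the entry above.

```python
from urllib.parse import quote


def encode_uri(uri: str) -> str:
    """Percent-encode a URI, leaving the scheme separator and path slashes intact."""
    return quote(uri, safe=":/")


print(encode_uri("http://dbpedia.org/page/Beyoncé"))
# http://dbpedia.org/page/Beyonc%C3%A9
```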
Fixed an issue when sorting by date in the Search API.
Various improvements and fixes to the Global Index 2017-01-12
The Account API now tracks Global Index search calls/requests.
Improved SKU detection and extraction in the Product API.
Article API: Added support for the start attribute (ol elements) and data- attributes in normalized HTML.
In the Article API, identified image captions will no longer be returned in the text field content.
Various improvements to replacement rule regular expressions in Custom APIs.
PDF processing improvements.
Product API: overriding the mpn or related fields using custom rules will now affect the productId field as well.
Crawls using the Analyze API will now correctly index video pages.
Improved the reliability of the fields=links argument in all Automatic APIs.
Updates to our rendering engine to properly support more Unicode scripts.
Updates to our status page for improved coverage and reliability.
Crawlbot crawls can now have repeat settings adjusted or added after a crawl completes.
Fixed a Crawlbot issue wherein users could completely erase the
POSTing to our APIs is speedier, particularly when content includes slow-loading third-party assets.
Crawlbot now has limited support for crawling/processing content across multiple domains.
Improvements to background image detection and extraction across all Automatic APIs. This resolves many issues with proper extraction from sites that use background CSS properties for image delivery.
Improvements to specification extraction in the Product API.
Improvements to HTML <figure> parsing in the Article API.
diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
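One way to use this for change detection is to store the diffbotUri from an earlier extraction and compare it with the value from a later one. The sketch below assumes each extraction result is available as a dictionary containing a diffbotUri key; the function name and sample values are hypothetical.

```python
def has_changed(previous: dict, current: dict) -> bool:
    """Report whether two extractions of the same page differ, judged by diffbotUri."""
    return previous.get("diffbotUri") != current.get("diffbotUri")


# Hypothetical stored and freshly extracted objects.
stored = {"pageUrl": "https://example.com/item", "diffbotUri": "custom|1|111"}
fresh = {"pageUrl": "https://example.com/item", "diffbotUri": "custom|1|222"}

if has_changed(stored, fresh):
    print("Content changed since the last extraction")
```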
Various improvements to caption detection and parsing in the Article API.
Crawlbot now adheres to the “Diffbot” user-agent in robots.txt directives, so that our crawling can be whitelisted when crawling partner or other sites. 2016-10-04
Increased the size limits for content POSTed to Diffbot APIs.
Bulk Service jobs now require a minimum of 50 URLs for Startup plan customers.
Bulk Service and Crawlbot jobs now automatically retry failing URLs.
Numerous improvements to normalizedSpecs in the Product API.
Diffbot Automatic APIs now process PDFs. PDF URLs will be converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a “success” message.
Improved handling of UTF-8 encoded characters within Crawlbot.
Analyze API example (works with all Automatic and Custom APIs). 2016-06-17
Added support for custom headers to the Crawlbot and Bulk Job interfaces. Read more.
Added beta field inferredCategory_beta to the Product API, which provides an automatically determined category for any extracted product.
The Bulk Service will no longer normalize URLs before processing them; all pages will be sent to the specified Diffbot API as-is.
Various improvements to performance and quality of Diffbot’s rendering engine.
Corrected an issue in Custom APIs where a replacement rule would errantly trim blank spaces. 2016-05-24
Crawlbot now supports custom headers while crawling; Bulk Processing jobs now support custom headers for all URLs. Read more.
Fixed an issue in Crawlbot where internal JSON objects were sometimes being returned in JSON data downloads.
Various improvements to date-parsing and normalization in the Article API.
Improvements to “Replacement” and “Ignore” filters in Custom APIs and manual rules.
Added a column to the Crawlbot and Bulk API URL Report indicating if a proxy IP was used.
Added the argument useProxies to Crawlbot, which allows for proxy IPs to be used for specific crawls.
Added proxy usage tracking to the Account API. 2016-04-28
Added available colors (if found on the page) to the normalizedSpecs object in the Product API.
Updated format of normalizedSpecs to return multiple values, if available, for a single key.
Fixed an issue where image URLs with spaces in the filename would be incorrectly returned.
Improved proxy support in the extraction APIs to help diversify origin IPs. Read more. 2016-04-15
tagConfidence argument in the Article API, allowing for the return of tags with lower relevance scores if desired.
Improved Crawlbot handling of DNS and other connection issues; increased range of TLDs supported by Crawlbot.
Fixed an issue where duplicate tags were being returned when sentiment analysis was being performed.
seeds behavior, so that if a non-www subdomain is specified as the only seed URL, crawling will restrict itself to that subdomain.
Significant updates to the beta normalizedSpecs field in the Product API. See more details.
Added the field parentUrlDocId to Crawlbot and Bulk Processing JSON objects. This field can be used to match objects to URLs in the Crawlbot or Bulk Processing URL Report. 2016-03-10
originalType field to extracted objects when utilizing the
Fixed an issue with our Semantria integration that could lead to errant timeout responses. 2016-02-25
html field from unsupported video players.
Discussion API: Improved extraction from single-post (no reply) conversations.
Improvements to video extraction within the Video API. 2016-01-29
Added beta fields multiplePrices to the Product API.
Improved availability detection and extraction in the Product API.
Improved offerPrice detection in the Product API to reduce the chance of returning an incorrect value from unavailable products or items without a visible price. 2016-01-26
Significant speed improvements to the Global Index.
Released an official endpoint for Custom API management. Please see the documentation for information on programmatic management of custom rules and APIs.
Improved video extraction in the Article API to include new providers and HTML5
max:date queries in the Search and Global Index APIs are now inclusive of the date specified. 2016-01-14
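The difference between an inclusive and an exclusive upper bound can be shown with a small local filter. This only illustrates the semantics described in the entry above and makes no claim about the search backend; the function name and sample dates are hypothetical.

```python
from datetime import date


def within_max_date(item_date: date, max_date: date) -> bool:
    """Inclusive upper bound: an item dated exactly max_date still matches."""
    return item_date <= max_date


dates = [date(2016, 1, 13), date(2016, 1, 14), date(2016, 1, 15)]
matches = [d for d in dates if within_max_date(d, date(2016, 1, 14))]
print(matches)  # both the 13th and the 14th match; the 15th does not
```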
specification extraction in the Product API.
Fixed an issue where the estimatedDate field (Article API) would sometimes not be correctly computed. 2016-01-07
Fixed an issue where the <base> element could be incorrectly used to calculate relative paths.
Added initial functionality to categorize articles in the Article API based on article text content. If you would like to test this beta feature, contact us.
Improved handling of media sources without a specified protocol (e.g. src="//www.youtube.com...). Media element URLs will now match the protocol of the analyzed page.
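Standard URL resolution behaves the same way, so the described result can be reproduced with Python's urljoin: a protocol-relative source inherits the scheme of the page it appears on. The page URLs and embed path below are made up for illustration, and this is a sketch of the behavior rather than of Diffbot's implementation.

```python
from urllib.parse import urljoin


def resolve_media_url(page_url: str, src: str) -> str:
    """Resolve a media src against the page URL; '//host/path' takes the page's scheme."""
    return urljoin(page_url, src)


print(resolve_media_url("https://example.com/articles/1", "//www.youtube.com/embed/VIDEO_ID"))
# https://www.youtube.com/embed/VIDEO_ID
print(resolve_media_url("http://example.com/articles/1", "//www.youtube.com/embed/VIDEO_ID"))
# http://www.youtube.com/embed/VIDEO_ID
```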
Crawlbot and Bulk jobs pending delete (per your Diffbot plan) are now identified in the Crawlbot and Bulk interfaces.
The API Toolkit now uses Diffbot’s custom rendering engine for live web page previews. This should reduce inaccuracies when creating custom rules. 2015-12-18
Fixed an issue where plain-text POSTed to the Article API would not perform text analysis (tags, sentiment, language-detection).
Improved Crawlbot behavior on Ajax-heavy sites so that pages with the exact same HTML source are no longer deduplicated.
Fixed an issue within the Crawlbot and Bulk interfaces where the “Last 500” URL Report was incorrectly returning the first 500.
Improved author detection within the Article API.
Analyze API now supports POSTed content. 2015-11-27
Account API now returns a list of child or sub-tokens. 2015-11-19
Fixed an issue in the Analyze API where products with an API-Toolkit-overridden price field would not reflect changes in the “details” field (
Fixed an Article API issue for certain top-level domains where articles dated in the near future (e.g., tomorrow) would incorrectly be returned with a date from the prior year.
Crawlbot will now successfully spider URLs that contain (invalid) UTF-8 characters.
Global Index API: search-by-tag can optionally be performed using a tag-match shorthand.
We now offer an Account API for tracking token API usage and billing history.
Global Index API: negative search queries (diffbot AND -"machine learning") are now functioning as documented;
Fixed an issue where Crawlbot and Bulk API data downloads did not include a filename.
The breadcrumb element is now a default field in the Article API.
APIs no longer ignore “format characters”—invisible characters that may have an effect on neighboring characters. For example,
Crawlbot and Bulk Service URL Reports now offer an option to download the last 500 URLs crawled.
Global Index API: Faceted date queries will no longer return a min value of 0.
Analyze API now supports a “fallback” API via the argument fallback. By passing an API value, any otherwise unsupported pages will be forcibly processed via that API. E.g., passing fallback=article will result in any “other” pages being processed by the Article API.
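A request using this argument might be assembled as in the sketch below. The fallback argument and its article value come from the entry above; the endpoint URL, the token and url parameter names, and the helper name are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint, shown only to make the example concrete.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(token, page_url, fallback=None):
    """Build an Analyze API request URL, optionally naming a fallback API."""
    params = {"token": token, "url": page_url}
    if fallback:
        # Pages classified as "other" would be processed by this API instead.
        params["fallback"] = fallback
    return ANALYZE_ENDPOINT + "?" + urlencode(params)


print(build_analyze_url("YOUR_TOKEN", "https://example.com/about", fallback="article"))
```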
Hey, as of today we’re publishing a changelog. It’s visible… here.
Additional token support within a single account has been added. Additional tokens are available on a case-by-case basis to paying customers. Please contact firstname.lastname@example.org if you would like to discuss additional tokens.
API Toolkit now allows direct update of URL pattern / regular expression without having to create a new ruleset.
API Toolkit rule output automatically trims fields to remove leading or trailing blank spaces.
The diffbotUri field is now computed based on rule-based output, if a custom rule is used to override default output.
resolvedPageUrl is correctly returned in Custom APIs (if a submitted page is redirected).
Each tag in the
tags element now returns a list of all matching
Email invoices now return both dollars and cents. 2015-08-23
Performance improvements to Article API to prevent intermittent extra-long responses.
Semantria output updated to include additional fields.
Fix for timeout parameter when sending data to Semantria.
Invoices are now visible and printable within the Developer Dashboard (under “Account > Billing History”).
Tokens are now case-insensitive across all Diffbot APIs.
Article API now returns the
publisherRegion of an article, if it can be determined or if already known.
If price data is overridden with a custom rule, the corresponding “details” field (offerPriceDetails) will be computed from the overridden value.
Spotify embeds are properly maintained/returned with Article API
Multipage articles, when concatenated, will no longer return duplicate images (that appear on multiple pages).
Article and discussion tagging now supports Spanish and Chinese language tags, in addition to English and French.