We added support for a quarterly revenues attribute sourced from SEC filings for US corporate entities: type:Organization quarterlyRevenues.quarter:"Q1-2020"
You can now facet on NAICS code names: type:Organization nbEmployeesMax>100000 facet:naicsClassification.name
We added over 15k open-source academic journals to the list of Diffbot Knowledge Graph article sources.
You can now search the Knowledge Graph and enhance firmographic data profiles from the Diffbot Excel Add-In.
We added Diffbot Knowledge Graph ontology reference documentation.
We expanded the Diffbot Query Language (DQL) docs.
January 2020
Improved handling of tables and lists in Article data to better support entity tagging and linking.
Optimized entity tagging in the Article title: it now occurs when the same entity is mentioned in both the title and the text of the Article.
Improved location data extraction in the Analyze API for events.
Added a Diffbot Excel Plug-in that lets clients test-drive Diffbot’s data enrichment API. The beta service currently supports organization firmographic profile data enrichment.
December 2019
Launched a new Renderer architecture in support of Crawlbot and Diffbot API services.
Added descriptions to 90+ million Organization entities in the KG.
Added 300+ local US news sources to Article data.
Improved date/time handling.
Improved linking of Board Members to Organizations.
Added revisit/update frequency signals based on whether or not a profile was accessed in the last 30 days.
Added longitude and/or latitude data to 53M Organizations.
Added the sicClassification attribute to Organizations.
Created a more robust employment category taxonomy and ML model in support of employment data.
Improved coverage of the parentCompany attribute for subsidiary organization entities.
Normalized stock exchange labels to improve filtering and discoverability.
Deployed bug fixes to developer Dashboards.
Added support for the inclusion of RocketReach email contact data (in addition to LeadIQ).
Added support for extraction of the headquarters address from the headquarters building entity.
Began improvements to Record Linking for Organizations, with emphasis on improving subsidiary data accuracy.
Improved coverage of Org and Person data records, with a focus on the ‘educated at’, ‘member of’, ‘owner of’, and ‘position held’ data fields.
Improved Role Classifications by separating CEO and Director.
Enhanced and extended the Visual Query Builder Tool in the Developer Dashboard.
Size is now supported in facet queries for articles.
Enabled access to crawls and bulk jobs created on child tokens from the app.diffbot.com UI when logged in under the parent token.
Enabled cloning of a crawl from its crawl job page in the app.diffbot.com UI.
Made significant improvements to the performance of the app.diffbot.com UI.
Added location inference to the Natural Language API.
Improved how the importance score is generated for spam profiles.
Improved deduplication of Organization founders.
Some fields can no longer link to the same DiffbotURI: for example, parent and subsidiary entities cannot share a unique identifier, so Google and Alphabet must have distinct IDs.
Removed bad descriptions from the allDescriptions field.
Improved age calculation/inference logic.
Added support for multiple headquarters locations for Organizations.
Added support for multiple stock exchange/symbol pairs.
Improved extraction of city from neighborhoods.
Added support for displaying English tags for non-English taggers.
Trained a Dutch entity linker.
Improved RawDataSentinels supporting Organization data ingest, including subsidiary data.
Improved sub-record linking between Organizations and Founders.
Now force extraction of the headquarters address from the HQ building entity.
Now ensure countries are always classified as administrative areas.
Populated missing addresses in location for 81 million organizations.
Improved the error message returned for mismatched quotes in DQL queries.
Ensured users can stop or pause a crawl between crawl rounds from the Dashboard.
Forced persistence of the assignment of a custom API to a crawl job.
Set the article title in the <title> field.
Now rank person images for Person profiles.
In the DKG, faceting on the parent key for enums now expands to <enum>.normalizedValue.
Now cache Person and Organization images, including logos.
Committed to delivering 100% accuracy of ‘Fortune 1000’ Company entity profile core facts (name, headquarters location, website, CEO, founders, logo, isPublic, parent organization, year founded, stock ticker symbol and exchanges, Twitter handle, and size attributes, namely employee count and annual top-line revenues) in the Diffbot KnowledgeGraph (DKG).
Enhanced isPublic field population in the DKG.
Enhanced stock ticker symbol extraction in the DKG.
Fixed rules for assigning min and max employees to an Organization in the DKG.
Enriched 3 million organizations with no revenue data in the DKG.
Improved selection of location for Organization.location in the DKG.
Improved evaluation of postal codes when an address has no street address in the DKG.
Enhanced age calculation/inference in the DKG.
Improved candidate selection for email address and phone number in the DKG.
Added support for > and < for date/time fields in DQL.
Querying on a DiffbotURI is now strict by default in DQL.
Added support for type:Post (discussions) to DQL.
Added contextually embedded links to docs from the Crawlbot UI.
We addressed missing revenues for over 80 million company entities in the Diffbot KnowledgeGraph (DKG).
Improved DKG entity postal code assignments.
Improved DKG entity stock exchange assignments.
We removed cookie disclaimer text from DKG entity descriptions.
We improved Organization entity classification in the DKG.
We added the ability to facet on Organization name tokens in DQL.
We expanded currency support in the Diffbot extraction APIs to include all currencies in Europe, in addition to the European Union's euro.
We improved DQL error messages.
We lifted the limit on facet pagination.
Organization size attributes are now supported in facets.
We normalized Organization entity importance in the DKG to score between 1 and 100.
Extended the Diffbot KnowledgeGraph's coverage of entities located or residing in Asia.
Added support for the strict operator to DQL.
Improved Organization data quality (i.e. sub-record linking of CEOs and Founders) in the Diffbot KnowledgeGraph (DKG).
Added a dedicated process to parse subsidiary entities in the DKG.
Added support for multiple Person/Organization descriptions in the DKG.
Fixed date/timestamp conversion bugs in DQL.
Optimized revenue.value and revenue.currency extractions for Organization profile data in the DKG.
Added support for pagination of facets in DQL.
Added support for querying by tags for type:Image in DQL.
Added facet count to the Diffbot KnowledgeGraph Search API response.
Added DQL support for type:Product has:breadcrumb.name.
Added support for computation of total investment when individual investments have different currencies (Organization Profile).
Added support for the svg image file type for Entity images.
Added indexing of Entity description fields.
Improved tokenization for Chinese/Japanese tagging.
Added hit count for facets.
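DQL queries such as type:Product has:breadcrumb.name contain colons and spaces, so they need URL-encoding before being sent as a query-string parameter. The following is a minimal sketch of that step only; the endpoint URL and the parameter names (type, token, query) are assumptions for illustration and should be checked against the current DQL documentation. No request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names (not confirmed by this changelog).
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_dql_url(query, token):
    """Return a GET URL that carries a URL-encoded DQL query string."""
    params = {"type": "query", "token": token, "query": query}
    return DQL_ENDPOINT + "?" + urlencode(params)


# "YOUR_TOKEN" is a placeholder, not a real credential.
url = build_dql_url("type:Product has:breadcrumb.name", "YOUR_TOKEN")
print(url)
```

The same helper can carry facet queries, for example type:Organization nbEmployeesMax>100000 facet:naicsClassification.name.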
Improved date/time extraction and timezone support in Diffbot extraction APIs.
Added support for the ‘has:’ operator to DQL for Articles and Products.
Diffbot Knowledge Graph including a new developer Dashboard, embedded ontology documentation, and an OpenAPI spec.
2018-02-27
URL Report downloads are now sorted in newest-first order.
Crawlbot now indexes the seed URL of each extracted object in the
Crawlbot API: Added the useCanonical argument to allow disabling of canonical URL deduplication on specific crawls.
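As a rough illustration of how the useCanonical argument might be passed when creating a crawl, the sketch below only assembles the form fields; it does not contact any service. The endpoint, the other field names (token, name, seeds, apiUrl), and the convention that a value of 0 disables canonical deduplication are assumptions, not details stated in this changelog.

```python
from urllib.parse import urlencode

# Assumed endpoint; shown for context only, no request is made here.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def crawl_job_params(token, name, seeds, api_url, use_canonical=True):
    """Build form fields for a crawl job (field names are illustrative)."""
    params = {
        "token": token,
        "name": name,
        "seeds": " ".join(seeds),  # assumed: space-separated seed URLs
        "apiUrl": api_url,
    }
    if not use_canonical:
        # Assumed convention: 0 turns canonical URL deduplication off.
        params["useCanonical"] = 0
    return params


params = crawl_job_params(
    "YOUR_TOKEN",
    "example-crawl",
    ["https://example.com"],
    "https://api.diffbot.com/v3/analyze",
    use_canonical=False,
)
body = urlencode(params)  # form-encoded body for a POST request
print(body)
```

Leaving use_canonical at its default omits the field entirely, so the service's default behavior applies.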
See more.
2017-11-10
Significant improvements to Video API site support.
2017-10-30
Custom API fields using the attribute filter will now return all matching selector values, not just the first attribute match.
2017-10-25
Crawlbot and Bulk Service data retrieval no longer requires access to port :18100. Data downloads are also now HTTPS-only.
Fixed a rare issue where custom rules could be accidentally deleted. Significant performance improvements in the Search API. Improved crawling performance and site coverage in the Global Index. Improved ability to identify, analyze and return background images in all extraction APIs.
Fixed an issue in the Video API where the url value would retain HTML escaping if present within the original page source.
Fixed a rare crawling issue that occasionally resulted in “Bad IP” status messages for individual pages.
Fixed an issue where empty <video> elements could be returned in the Article API.
Fixed an issue in the Global Index in which complicated Boolean (OR) queries would return no results.
Improved date normalization to include Hijri and Jalali dates.
Fixed support for Unicode characters in API Toolkit rules.
Many improvements to brand detection in the Product API.
Resolved an issue where humanLanguage could be mis-identified on some Spanish-language pages.
Crawlbot: resolved an issue where IP-address-only webhooks would not receive notifications. Crawlbot: improved link spidering/harvesting resilience to markup errors and other invalid HTML source. Fixed an issue where custom APIs would not display in Crawlbot and Bulk Processing dashboard.
Improved link-detection when returning page links via our
Improved support for and handling of the
sizes) image attributes in all APIs.
Added detection of Afrikaans (
af) in the
Improved duplicate detection in the Diffbot Global Index.
2017-04-21
A category field has been added to the Product API. See documentation.
All extraction APIs now support sending completely custom headers using X-Forward- terminology. Previously only four defined headers were supported.
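One way to read the X-Forward- convention is that any header meant for the target site is sent to the extraction API with the X-Forward- prefix added to its name. The sketch below builds such a header dictionary under that assumption; the helper name and the sample header names are hypothetical, and the exact forwarding behavior should be confirmed in the API documentation.

```python
FORWARD_PREFIX = "X-Forward-"


def forward_headers(custom):
    """Prefix each header name so it can be passed through as a custom header.

    Assumption: the API strips the prefix and relays the header to the
    target site. `custom` maps plain header names to values.
    """
    return {FORWARD_PREFIX + name: value for name, value in custom.items()}


# Hypothetical headers chosen only for illustration.
headers = forward_headers({
    "Accept-Language": "de-DE",
    "X-Example-Flag": "1",
})
for name, value in headers.items():
    print(name + ": " + value)
```

The resulting dictionary can be handed to whichever HTTP client is used to call the extraction API.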
See Article API example.
2017-04-10
In the Article and Discussion APIs’ tags element, DBPedia uri values are now properly URL-encoded.
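For readers unsure what a URL-encoded uri value looks like, here is a small sketch using Python's standard library. It illustrates percent-encoding in general, not Diffbot's implementation, and the sample resource name is made up for the example.

```python
from urllib.parse import quote


def encode_uri(uri):
    """Percent-encode a URI while leaving ':' and '/' untouched.

    Note: a URI that is already encoded would have its '%' signs encoded
    again, so apply this only to raw values.
    """
    return quote(uri, safe=":/")


encoded = encode_uri("http://dbpedia.org/resource/Beyoncé")
print(encoded)  # http://dbpedia.org/resource/Beyonc%C3%A9
```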
Fixed an issue when sorting by date in the Search API.
Various improvements and fixes to the Global Index.
2017-01-12
The Account API now tracks Global Index search calls/requests.
Improved SKU detection and extraction in the Product API.
Article API: Added support for the start attribute (ol elements) and data- attributes in normalized HTML.
In the Article API, identified image captions will no longer be returned in the text field content.
Various improvements to replacement rule regular expressions in Custom APIs. PDF processing improvements.
Product API: overriding the mpn or related fields using custom rules will now affect the productId field as well.
Crawls using the Analyze API will now correctly index video pages.
Improved the reliability of the fields=links argument in all Automatic APIs.
Updates to our rendering engine to properly support more Unicode scripts.
Updates to our status page for improved coverage and reliability. Crawlbot crawls can now have repeat settings adjusted or added after a crawl completes. Fixed a Crawlbot issue wherein users could completely erase the
POSTing to our APIs is speedier, particularly when content includes slow-loading third-party assets. Crawlbot now has limited support for crawling/processing content across multiple domains.
Improvements to background image detection and extraction across all Automatic APIs. This resolves many issues with proper extraction from sites that use background CSS properties for image delivery.
Improvements to specification extraction in the Product API.
Improvements to HTML <figure> parsing in the Article API.
The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
Various improvements to caption detection and parsing in the Article API.
Crawlbot now adheres to the “Diffbot” user-agent in robots.txt directives, so that our crawling can be whitelisted when crawling partner or other sites.
2016-10-04
Increased the size limits for content POSTed to Diffbot APIs. Bulk Service jobs now require a minimum of 50 URLs for Startup plan customers. Bulk Service and Crawlbot jobs now automatically retry failing URLs.
Numerous improvements to normalizedSpecs in the Product API.
Diffbot Automatic APIs now process PDFs. PDF URLs will be converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a “success” message.
Improved handling of UTF-8 encoded characters within Crawlbot.
Analyze API example (works with all Automatic and Custom APIs).
2016-06-17
Added support for custom headers to the Crawlbot and Bulk Job interfaces. Read more.
Added beta field inferredCategory_beta to the Product API, which provides an automatically-determined category for any extracted product.
The Bulk Service will no longer normalize URLs before processing them: all pages will be sent to the specified Diffbot API as-is.
Various improvements to performance and quality of Diffbot’s rendering engine.
Corrected an issue in Custom APIs where a replacement rule would errantly trim blank spaces.
2016-05-24
Crawlbot now supports custom headers while crawling; Bulk Processing jobs now support custom headers for all URLs. Read more. Fixed an issue in Crawlbot where internal JSON objects were sometimes being returned in JSON data downloads. Various improvements to date-parsing and normalization in the Article API. Improvements to “Replacement” and “Ignore” filters in Custom APIs and manual rules.
Added a column to the Crawlbot and Bulk API URL Report indicating if a proxy IP was used.
Added the argument useProxies to Crawlbot, which allows for proxy IPs to be used for specific crawls.
Added proxy usage tracking to the Account API.
2016-04-28
Added available colors (if found on the page) to the normalizedSpecs object in the Product API.
Updated format of normalizedSpecs to return multiple values, if available, for a single key.
Fixed an issue where image URLs with spaces in the filename would be incorrectly returned. Improved proxy support in the extraction APIs to help diversify origin IPs.
Read more.
2016-04-15
tagConfidence argument in the Article API, allowing for the return of tags with lower relevance scores if desired.
Improved Crawlbot handling of DNS and other connection issues; increased range of TLDs supported by Crawlbot.
Fixed an issue where duplicate tags were being returned when sentiment analysis was being performed.
seeds behavior, so that if a non-www subdomain is specified as the only seed URL, crawling will restrict itself to that subdomain.
Significant updates to the beta normalizedSpecs field in the Product API. See more details.
Added the field parentUrlDocId to Crawlbot and Bulk Processing JSON objects. This field can be used to match objects to URLs in the Crawlbot or Bulk Processing URL Report.
2016-03-10
originalType field to extracted objects when utilizing the
Fixed an issue with our Semantria integration that could lead to errant timeout responses.
2016-02-25
html field from unsupported video players.
Discussion API: Improved extraction from single-post (no reply) conversations.
Improvements to video extraction within the Video API.
2016-01-29
Added beta fields multiplePrices to the Product API.
Improved availability detection and extraction in the Product API.
Improved offerPrice detection in the Product API to reduce the chance of returning an incorrect value from unavailable products or items without a visible price.
2016-01-26
Significant speed improvements to the Global Index.
Released an official endpoint for Custom API management. Please see the documentation for information on programmatic management of custom rules and APIs. Improved video extraction in the Article API to include new providers and HTML5
max:date queries in the Search and Global Index APIs are now inclusive of the date specified.
2016-01-14
specification extraction in the Product API.
Fixed an issue where the estimatedDate field (Article API) would sometimes not be correctly computed.
2016-01-07
Fixed an issue where the <base> element could be incorrectly used to calculate relative paths.
Added initial functionality to categorize articles in the Article API based on article text content. If you would like to test this beta feature, contact us.
Improved handling of media sources without a specified protocol (e.g. src="//www.youtube.com..."). Media element URLs will now match the protocol of the analyzed page.
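The behavior described for protocol-less media sources matches standard relative-reference resolution, where a URL beginning with // inherits the scheme of the page it appears on. The sketch below shows that rule with Python's standard library; it illustrates the general technique and is not Diffbot's implementation. The function name and sample URLs are invented for the example.

```python
from urllib.parse import urljoin


def resolve_media_url(page_url, src):
    """Resolve a media src against the page URL.

    A protocol-relative src (starting with //) takes the page's scheme.
    """
    return urljoin(page_url, src)


https_result = resolve_media_url(
    "https://example.com/post", "//www.youtube.com/embed/abc123"
)
http_result = resolve_media_url(
    "http://example.com/post", "//www.youtube.com/embed/abc123"
)
print(https_result)  # https://www.youtube.com/embed/abc123
print(http_result)   # http://www.youtube.com/embed/abc123
```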
Crawlbot and Bulk jobs pending deletion (per your Diffbot plan) are now identified in the Crawlbot and Bulk interfaces.
The API Toolkit now uses Diffbot’s custom rendering engine for live web page previews. This should reduce inaccuracies when creating custom rules.
2015-12-18
Fixed an issue where plain-text POSTed to the Article API would not perform text analysis (tags, sentiment, language-detection).
Improved Crawlbot behavior on Ajax-heavy sites so that pages with the exact same HTML source are no longer deduplicated.
Fixed an issue within the Crawlbot and Bulk interfaces where the “Last 500” URL Report was incorrectly returning the first 500.
Improved author detection within the Article API.
Analyze API now supports POSTed content.
2015-11-27
Account API now returns a list of child or sub-tokens.
2015-11-19
Fixed an issue in the Analyze API where products with an API-Toolkit-overridden price field would not reflect changes in the “details” field (
Fixed an Article API issue for certain top-level domains where articles dated in the near future (e.g., tomorrow) would incorrectly be returned with a date from the prior year.
Crawlbot will now successfully spider URLs that contain (invalid) UTF-8 characters. Global Index API: search-by-tag can optionally be performed using a tag-match shorthand.
We now offer an Account API for tracking token API usage and billing history. Global Index API: negative search queries (
diffbot AND -"machine learning") are now functioning as documented;
Fixed an issue where Crawlbot and Bulk API data downloads did not include a filename.
The breadcrumb element is now a default field in the Article API.
APIs no longer ignore “format characters”—invisible characters that may have an effect on neighboring characters. For example,
Crawlbot and Bulk Service URL Reports now offer an option to download the last 500 URLs crawled.
Global Index API: Faceted date queries will no longer return a min value of 0.
Analyze API now supports a “fallback” API via the argument fallback. By passing an API value, any otherwise unsupported pages will be forcibly processed via that API. E.g., passing fallback=article will result in any “other” pages being processed by the Article API.
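To make the fallback argument concrete, the sketch below assembles an Analyze API request URL with fallback=article. Only the fallback parameter and its article value come from the entry above; the endpoint URL and the token and url parameter names are assumptions, and no request is sent.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Analyze API documentation.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def analyze_url(token, page_url, fallback=None):
    """Build an Analyze API URL, optionally naming a fallback API."""
    params = {"token": token, "url": page_url}  # assumed parameter names
    if fallback:
        # Pages that would otherwise be typed "other" go to this API.
        params["fallback"] = fallback
    return ANALYZE_ENDPOINT + "?" + urlencode(params)


request_url = analyze_url(
    "YOUR_TOKEN", "https://example.com/some-page", fallback="article"
)
print(request_url)
```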
Hey, as of today we’re publishing a changelog. It’s visible… here.
Additional token support within a single account has been added. Additional tokens are available on a case-by-case basis to paying customers. Please contact email@example.com if you would like to discuss additional tokens.
API Toolkit now allows direct update of a URL pattern / regular expression without having to create a new ruleset.
API Toolkit rule output automatically trims fields to remove leading or trailing blank spaces.
The diffbotUri field is now computed based on rule-based output, if a custom rule is used to override default output.
resolvedPageUrl is correctly returned in Custom APIs (if a submitted page is redirected).
Each tag in the
tags element now returns a list of all matching
Email invoices now return both dollars and cents.
2015-08-23
Performance improvements to Article API to prevent intermittent extra-long responses.
Semantria output updated to include additional fields.
Fix for timeout parameter when sending data to Semantria.
Invoices are now visible and printable within the Developer Dashboard (under “Account > Billing History”).
Tokens are now case-insensitive across all Diffbot APIs.
Article API now returns the publisherRegion of an article, if it can be determined or if already known.
If price data is overridden with a custom rule, the corresponding “details” field (offerPriceDetails) will be computed from the overridden value.
Spotify embeds are properly maintained/returned with Article API.
Multipage articles, when concatenated, will no longer return duplicate images (that appear on multiple pages).
Article and discussion tagging now supports Spanish and Chinese language tags, in addition to English and French.