Diffbot Changelog - Diffbot

Changelog

February 2020
We added support for a quarterly revenues attribute sourced from SEC filings for US corporate entities: type:Organization quarterlyRevenues.quarter:"Q1-2020"
You can now facet on NAICS code names: type:Organization nbEmployeesMax>100000 facet:naicsClassification.name
We added over 15k open-source academic journals to the list of Diffbot Knowledge Graph article sources.
You can now search the Knowledge Graph and enhance firmographic data profiles from the Diffbot Excel Add-In.
We added Diffbot Knowledge Graph ontology reference documentation.
We expanded Diffbot Query Language (DQL) docs.
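The two DQL queries quoted in the February 2020 notes above can be passed to the Knowledge Graph as a URL-encoded parameter. The following is a minimal sketch, not an official client: the endpoint `https://kg.diffbot.com/kg/v3/dql` and the parameter names `token`, `type`, `query`, and `size` are assumptions that do not appear in this changelog, so check them against the current Knowledge Graph documentation. The helper `build_dql_url` is hypothetical and only builds the URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names; not stated in this changelog.
KG_DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"

# Queries quoted verbatim in the February 2020 notes.
QUARTERLY_REVENUE_QUERY = 'type:Organization quarterlyRevenues.quarter:"Q1-2020"'
NAICS_FACET_QUERY = (
    "type:Organization nbEmployeesMax>100000 facet:naicsClassification.name"
)


def build_dql_url(token: str, query: str, size: int = 10) -> str:
    """Return a GET URL that carries a DQL query as an encoded parameter."""
    params = {"token": token, "type": "query", "query": query, "size": size}
    return f"{KG_DQL_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    for dql in (QUARTERLY_REVENUE_QUERY, NAICS_FACET_QUERY):
        print(build_dql_url("YOUR_TOKEN", dql))
```

Encoding the query with `urlencode` matters here because DQL uses characters such as `:`, `>`, `"`, and spaces that are not safe in a raw query string.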
January 2020
Improved handling of tables and lists in Article data to better support Entity tagging and linking.
Optimized entity tagging in the Article Title. It now occurs when the same entity is mentioned in the title and text of the Article.
Improved location data extraction in the Analyze API for events.
Added a Diffbot Excel Plug-in to enable our clients to test-drive Diffbot’s data enrichment API. The beta service currently supports organization firmographic profile data enrichment.
December 2019
Launched a new Renderer architecture in support of Crawlbot and Diffbot API services.
Added descriptions to 90+ million Organization entities in the KG.
Added 300+ local US news sources to Article data.
November 2019
Improved date/time handling
Improved linking of Board Members to Organizations
Added revisit/update frequency signals based on whether or not a profile was accessed in the last 30 days.
October 2019
Added Longitude and/or Latitude data to 53M Organizations
Added sicClassification attribute to Organizations
Created a more robust employment category taxonomy and ML model in support of employment data
Improved coverage of parentCompany attribute for subsidiary organization entities
Normalized stock exchange labels to improve filtering and discoverability.
Deployed bug fixes to developer Dashboards
September 2019
Added support for the inclusion of RocketReach email contact data (in addition to LeadIQ).
Added support for extraction of the headquarters address from the headquarters building entity.
Began improvements to Record Linking for Organizations with emphasis on improving subsidiary data accuracy.
Improved coverage of Organization and Person data records with a focus on: ‘educated at’, ‘member of’, ‘owner of’, and ‘position held’ data fields.
Improved Role Classifications: separating CEO and Director.
Enhanced and extended the Visual Query Builder Tool in the Developer Dashboard.
August 2019
Size is now supported in facet queries for articles
Enabled access to crawls and bulk jobs created on child tokens from the app.diffbot.com UI when logged in under the parent token.
Enabled the cloning of a crawl from a crawl job page from the app.diffbot.com UI.
Made significant improvements to the performance of the app.diffbot.com UI.
Added location inference to the Natural Language API.
Improved how importance score is generated for spam profiles.
Improved deduplication on Organization Founders.
Now avoid linking certain fields to the same DiffbotURI. For example, parent and subsidiary entities cannot link to the same unique identifier: Google and Alphabet must have unique IDs.
Removed bad descriptions from the allDescriptions field.
Improved age calculation/inference logic.
July 2019
Added support for multiple Headquarters locations for Organizations.
Added support for multiple stock exchange symbol/pairs.
Improved extraction of city from neighborhoods.
Added support for display of English tags for non-English taggers.
Trained a Dutch entity linker.
Improved RawDataSentinels supporting Organization data ingest including subsidiary data
Improved sub-record linking between Organizations and Founders.
Now force extraction of the headquarters address from the HQ building entity.
Now ensure countries are always classified as administrative areas.
Populated missing address in location for 81 million organizations.
Improved the error message returned for mismatched quotes in DQL queries.
Ensured users have the ability to stop or pause a crawl between crawl rounds from the Dashboard.
Forced the persistence of the assignment of a customAPI to a crawl job.
Set the article title in the <title> field.
Now rank person images for Person profiles.
In DKG: faceting on the parent key for enums now expands to <enum>.normalizedValue
Now cache Person and Organization images, including logos.
June 2019
Committed to delivering 100% accuracy of ‘Fortune 1000’ Company entity profile core facts (name, headquarters location, website, CEO, founders, logo, isPublic, parent organization, year founded, stock ticker symbol and exchanges, twitter handle, size attributes – employee count & annual top-line revenues) in the Diffbot KnowledgeGraph (DKG).
Enhanced isPublic field population in the DKG.
Enhanced stock ticker symbol extraction in the DKG.
Fixed rules for assigning min and max employees to an Organization in the DKG.
Enriched 3 million organizations with no revenue data in the DKG.
Improved selection of location for Organization.location in the DKG.
Improved evaluation of postal codes when an address has no street address in the DKG.
Enhanced age calculation/inference in the DKG.
Improved Candidate selection for email address and phone number in the DKG.
Added support for > and < for date/time fields in DQL.
Querying on a DiffbotURI is now strict by default in DQL.
Added support for type:Post (discussions) to DQL.
Added contextually embedded links to docs from the Crawlbot UI.
May 2019
We addressed missing revenues for over 80 million company entities in the Diffbot KnowledgeGraph (DKG).
Improved DKG entity postal code assignments.
Improved DKG entity Stock Exchange assignments
We removed cookie disclaimer text from DKG entity descriptions.
We improved Organization entity classification in the DKG.
We added the ability to facet on Organization name tokens in DQL.
We expanded currency support in the Diffbot extraction APIs to include ALL currencies in Europe in addition to the European Union (Euro currency standard).
We improved DQL error messages.
We lifted the limit on facet pagination.
Organization size attributes are now supported in facets.
We normalized Organization entity importance in the DKG to score between 1 and 100.
February 2019
Extended Diffbot KnowledgeGraph coverage of Entities located or residing in Asia.
Added support for the strict operator to DQL.
April 2019
Improved Organization Data Quality (i.e. sub-record linking of CEOs and Founders) in the Diffbot KnowledgeGraph (DKG).
Added dedicated process to parse subsidiary entities in the DKG.
Added support for multiple Person/Organization descriptions in the DKG.
Fixed date/timestamp conversion bugs in DQL.
Optimized revenue.value and revenue.currency extractions for Organization profile data in the DKG.
Added support for pagination of facets in DQL.
Added support for querying by tags for type:Image in DQL.
Added facet count to the Diffbot KnowledgeGraph Search API response.
October 2018
Added DQL support for type:Product has:breadcrumb.name
Added support for computation of total investment when individual investments have different currencies (Organization Profile).
Added support for svg image file type for Entity images.
Added indexing of Entity description fields.
Improved tokenization for Chinese/Japanese tagging.
Added hit count for facets.
December 2018
Improved date/time extraction and timezone support in Diffbot extraction APIs.
Added support for the ‘has:’ operator to DQL for Articles and Products.
August 2018
Launched the Diffbot Knowledge Graph including a new developer Dashboard, embedded ontology documentation, and an OpenAPI spec.
2018-02-27
URL Report downloads are now sorted in newest-first order
Crawlbot now indexes the seed URL of each extracted object in the fromSeedUrl field.
2018-01-05
Crawlbot API: Added the useCanonical argument to allow disabling of canonical URL deduplication on specific crawls. See more.
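As an illustration of the useCanonical argument, the sketch below builds the form parameters for a crawl job with canonical URL deduplication disabled. Only the argument name comes from this changelog. The endpoint `https://api.diffbot.com/v3/crawl`, the other parameter names (`token`, `name`, `seeds`, `apiUrl`), the space-separated seed format, and the convention that `useCanonical=0` disables deduplication are assumptions to verify against the Crawlbot API documentation. The helper `crawl_job_params` is hypothetical and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint; the changelog names only the useCanonical argument.
CRAWLBOT_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def crawl_job_params(token, name, seeds, api_url, use_canonical=True):
    """Build form parameters for creating a crawl job.

    Assumption: passing useCanonical=0 turns off canonical URL
    deduplication, and omitting it keeps the default behavior.
    """
    params = {
        "token": token,
        "name": name,
        "seeds": " ".join(seeds),  # assumed: space-separated seed URLs
        "apiUrl": api_url,
    }
    if not use_canonical:
        params["useCanonical"] = 0
    return params


if __name__ == "__main__":
    params = crawl_job_params(
        "YOUR_TOKEN",
        "example-crawl",
        ["https://example.com"],
        "https://api.diffbot.com/v3/analyze",
        use_canonical=False,
    )
    print(CRAWLBOT_ENDPOINT)
    print(urlencode(params))
```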
2017-11-10
Significant improvements to Video API site support.
2017-10-30
Custom API fields using the attribute filter will now return all matching selector values, not just the first attribute match.
2017-10-25
Crawlbot and Bulk Service data retrieval no longer requires access to port :18100. Data downloads are also now HTTPS-only.
2017-10-16
Fixed a rare issue where custom rules could be accidentally deleted.
Significant performance improvements in the Search API.
Improved crawling performance and site coverage in the Global Index.
Improved ability to identify, analyze and return background images in all extraction APIs.
2017-08-31
Fixed an issue in the Video API where the url value would retain HTML escaping if present within the original page source.
Fixed a rare crawling issue that occasionally resulted in “Bad IP” status messages for individual pages.
Fixed an issue where empty <video> elements could be returned in the Article API.
2017-08-15
Fixed an issue in the Global Index in which complicated Boolean (OR) queries would return no results.
2017-08-08
Improved date normalization to include Hijri and Jalali dates
Fixed support for unicode characters in API Toolkit rules
2017-05-22
Many improvements to brand detection in the Product API.
Resolved an issue where humanLanguage could be mis-identified on some Spanish-language pages.
2017-05-15
Crawlbot: resolved an issue where IP-address-only webhooks would not receive notifications.
Crawlbot: improved link spidering/harvesting resilience to markup errors and other invalid HTML source.
Fixed an issue where custom APIs would not display in Crawlbot and Bulk Processing dashboard.
2017-05-11
Improved link-detection when returning page links via our &fields=links argument.
Improved support for and handling of the srcset (and sizes) image attributes in all APIs.
Added detection of Afrikaans (af) in the humanLanguage field.
Improved duplicate detection in the Diffbot Global Index.
2017-04-21
The beta category field has been added to the Product API. See documentation.
All extraction APIs now support the sending of completely custom headers using X-Forward- terminology. Previously only four defined headers were supported. See Article API example.
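A minimal sketch of the X-Forward- convention mentioned above, assuming that the prefix is simply prepended to the name of each header you want delivered to the target site (for example, `User-Agent` becomes `X-Forward-User-Agent`). The helper name `forward_headers` is hypothetical, and the exact behavior should be confirmed in the Article API documentation.

```python
def forward_headers(custom: dict) -> dict:
    """Prefix each header name with 'X-Forward-'.

    Assumption: the extraction API strips the prefix and sends the
    remaining header name and value to the page being processed.
    """
    return {f"X-Forward-{name}": value for name, value in custom.items()}


if __name__ == "__main__":
    headers = forward_headers(
        {"User-Agent": "ExampleBot/1.0", "Accept-Language": "de"}
    )
    for name, value in headers.items():
        print(f"{name}: {value}")
```

The resulting dictionary can be passed as the request headers of whichever HTTP client you use to call the extraction API.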
2017-04-10
In the Article and Discussion APIs’ tags element, DBPedia uri values are now properly URL-encoded.
Fixed an issue when sorting by date in the Search API.
Various improvements and fixes to the Global Index
2017-01-12
The Account API now tracks Global Index search calls/requests.
Improved SKU detection and extraction in the Product API.
Article API: Added support for the start attribute (ol elements) and data- attributes in normalized HTML.
In the Article API, identified image captions will no longer be returned in the text field content.
Various improvements to replacement rule regular expressions in Custom APIs.
PDF processing improvements.
2016-12-09
Product API: overriding the sku, mpn or related fields using custom rules will now affect the productId field as well.
Crawls using the Analyze API will now correctly index video pages.
Improved the reliability of the fields=links argument in all Automatic APIs.
2016-12-01
Updates to our rendering engine to properly support more Unicode scripts
2016-11-30
Updates to our status page for improved coverage and reliability.
Crawlbot crawls can now have repeat settings adjusted or added after a crawl completes.
Fixed a Crawlbot issue wherein users could completely erase the seeds field.
2016-11-13
POSTing to our APIs is speedier, particularly when content includes slow-loading third-party assets.
Crawlbot now has limited support for crawling/processing content across multiple domains.
2016-11-06
Improvements to background image detection and extraction across all Automatic APIs. This resolves many issues with proper extraction from sites that use background CSS properties for image delivery.
Improvements to specification extraction in the Product API.
Improvements to HTML <figure> parsing in the Article API.
2016-10-24
The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
Various improvements to caption detection and parsing in the Article API.
Crawlbot now adheres to the “Diffbot” user-agent in robots.txt directives, so that our crawling can be whitelisted when crawling partner or other sites.
2016-10-04
Increased the size limits for content POSTed to Diffbot APIs.
Bulk Service jobs now require a minimum of 50 URLs for Startup plan customers.
Bulk Service and Crawlbot jobs now automatically retry failing URLs.
2016-09-12
Numerous improvements to normalizedSpecs in the Product API.
Diffbot Automatic APIs now process PDFs. PDF URLs will be converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a “success” message.
Improved handling of UTF-8 encoded characters within Crawlbot.
2016-06-24
Fixed an issue where large Crawlbot and Bulk job downloads would prematurely terminate.
Added beta support for executing custom Javascript before processing a page via an extraction API. See Analyze API example (works with all Automatic and Custom APIs).
2016-06-17
Added support for custom headers to the Crawlbot and Bulk Job interfaces. Read more.
Added beta field inferredCategory_beta to the Product API, which provides an automatically-determined category for any extracted product.
The Bulk Service will no longer normalize URLs before processing them—all pages will be sent to the specified Diffbot API as-is.
Various improvements to performance and quality of Diffbot’s rendering engine.
Corrected an issue in Custom APIs where a replacement rule would errantly trim blank spaces.
2016-05-24
Crawlbot now supports custom headers while crawling; Bulk Processing jobs now support custom headers for all URLs. Read more.
Fixed an issue in Crawlbot where internal JSON objects were sometimes being returned in JSON data downloads.
Various improvements to date-parsing and normalization in the Article API.
Improvements to “Replacement” and “Ignore” filters in Custom APIs and manual rules.
2016-05-12
Added column to the Crawlbot and Bulk API URL Report indicating if a proxy IP was used.
Added the argument useProxies to Crawlbot, which allows for proxy IPs to be used for specific crawls
2016-05-05
Added proxy usage tracking to the Account API.
2016-04-28
Added available colors (if found on the page) to the normalizedSpecs object in the Product API. Updated format of normalizedSpecs to return multiple values, if available, for a single key.
Fixed an issue where image URLs with spaces in the filename would be incorrectly returned.
Improved proxy support in the extraction APIs to help diversify origin IPs. Read more.
2016-04-15
Released the tagConfidence argument in the Article API, allowing for the return of tags with lower relevance scores if desired.
Improved Crawlbot handling of DNS and other connection issues; increased range of TLDs supported by Crawlbot.
Fixed an issue where duplicate tags were being returned when sentiment analysis was being performed.
2016-03-25
Updated Crawlbot seeds behavior, so that if a non-www subdomain is specified as the only seed URL, crawling will restrict itself to that subdomain.
Significant updates to the beta normalizedSpecs field in the Product API. See more details.
Added the field parentUrlDocId to Crawlbot and Bulk Processing JSON objects. This field can be used to match objects to URLs in the Crawlbot or Bulk Processing URL Report.
2016-03-10
Added the originalType field to extracted objects when utilizing the Analyze API’s fallback argument.
Fixed an issue with our Semantria integration that could lead to errant timeout responses.
2016-02-25
Fixed an issue in the Article API to prevent in-line Javascript and CSS from being returned in the html field from unsupported video players.
Discussion API: Improved extraction from single-post (no reply) conversations.
Improvements to video extraction within the Video API.
2016-01-29
Added beta fields quantityPrices, priceRange and multiplePrices to the Product API.
Improved availability detection and extraction in the Product API.
Improved offerPrice detection in the Product API to reduce the chance of returning an incorrect value from unavailable products or items without a visible price.
2016-01-26
Significant speed improvements to the Global Index.
2016-01-21
Released an official endpoint for Custom API management. Please see the documentation for information on programmatic management of custom rules and APIs.
Improved video extraction in the Article API to include new providers and HTML5 <video> elements.
max:date queries in the Search and Global Index APIs are now inclusive of the date specified.
2016-01-14
Improved specification extraction in the Product API.
Fixed an issue where the estimatedDate field (Article API) would sometimes not be correctly computed.
2016-01-07
Fixed an issue where the <base> element could be incorrectly used to calculate relative paths.
Added initial functionality to categorize articles in the Article API based on article text content. If you would like to test this beta feature, contact us.
Improved handling of media sources without a specified protocol (e.g. src="//www.youtube.com...). Media element URLs will now match the protocol of the analyzed page.
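The protocol-matching behavior described above for sources such as src="//www.youtube.com..." corresponds to standard relative-reference resolution. The sketch below shows that resolution with Python's `urllib.parse.urljoin`; it illustrates the general rule rather than Diffbot's own implementation, and the helper name `resolve_media_url` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_media_url(page_url: str, src: str) -> str:
    """Resolve a media src against the page URL.

    A protocol-relative src (one starting with '//') inherits the
    scheme of the page it was found on.
    """
    return urljoin(page_url, src)


if __name__ == "__main__":
    src = "//www.youtube.com/embed/abc"
    print(resolve_media_url("https://example.com/post.html", src))
    print(resolve_media_url("http://example.com/post.html", src))
```

With an `https` page the result begins with `https://www.youtube.com/`, and with an `http` page it begins with `http://www.youtube.com/`.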
2015-12-21
Crawlbot and Bulk jobs pending delete (per your Diffbot plan) are now identified in the Crawlbot and Bulk interfaces.
The API Toolkit now uses Diffbot’s custom rendering engine for live web page previews. This should reduce inaccuracies when creating custom rules.
2015-12-18
Fixed an issue where plain-text POSTed to the Article API would not perform text analysis (tags, sentiment, language-detection).
Improved Crawlbot behavior on Ajax-heavy sites so that pages with the exact same HTML source are no longer deduplicated.
Fixed an issue within the Crawlbot and Bulk interfaces where the “Last 500” URL Report was incorrectly returning the first 500.
Improved author detection within the Article API.
2015-12-07
The Analyze API now supports POSTed content.
2015-11-27
The Account API now returns a list of child or sub-tokens.
2015-11-19
Fixed an issue in the Analyze API where products with an API-Toolkit-overridden price field would not reflect changes in the “details” field (offerPriceDetails, regularPriceDetails, etc.).
Fixed an Article API issue for certain top-level domains where articles dated in the near future (e.g., tomorrow) would incorrectly be returned with a date from the prior year.
2015-11-11
Crawlbot will now successfully spider URLs that contain (invalid) UTF-8 characters.
Global Index API: search-by-tag can optionally be performed using a tag-match shorthand.
2015-10-22
We now offer an Account API for tracking token API usage and billing history.
Global Index API: negative search queries (diffbot AND -"machine learning") are now functioning as documented.
2015-10-16
Fixed an issue where Crawlbot and Bulk API data downloads did not include a filename.
The breadcrumb element is now a default field in the Article API.
2015-10-08
APIs no longer ignore “format characters”—invisible characters that may have an effect on neighboring characters. For example, &zwnj;.
Crawlbot and Bulk Service URL Reports now offer an option to download the last 500 URLs crawled.
Global Index API: Faceted date queries will no longer return a min value of 0.
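To make the "format characters" in the 2015-10-08 notes concrete: in Unicode, invisible characters such as the zero-width non-joiner (the `&zwnj;` entity, U+200C) belong to the general category `Cf`. The sketch below lists such characters in a string using Python's standard `unicodedata` module. It is an illustration of the category only, not of how the APIs process text, and the helper name `format_characters` is hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"  # zero-width non-joiner, written as &zwnj; in HTML


def format_characters(text: str) -> list:
    """Return (index, name) pairs for every Unicode format character (Cf)."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Cf"
    ]


if __name__ == "__main__":
    # Persian word in which ZWNJ keeps two letters from joining.
    sample = "\u0645\u06cc" + ZWNJ + "\u062e\u0648\u0627\u0647\u0645"
    print(format_characters(sample))
```

Stripping these characters would change how the neighboring letters render, which is why preserving them matters for scripts such as Persian.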
2015-10-02
Analyze API now supports a “fallback” API via the argument fallback. By passing an API value, any otherwise unsupported pages will be forcibly processed via that API. E.g., passing fallback=article will result in any “other” pages being processed by the Article API.
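The fallback argument described above can be sketched as one extra query parameter on an Analyze API request. The argument name and the `article` value come from this entry. The endpoint `https://api.diffbot.com/v3/analyze` and the `token` and `url` parameter names are assumptions to check against the Analyze API documentation. The helper `analyze_url` is hypothetical and only builds the URL.

```python
from urllib.parse import urlencode

# Assumed endpoint; this changelog entry names only the fallback argument.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def analyze_url(token: str, page_url: str, fallback: str = None) -> str:
    """Build an Analyze API request URL, optionally naming a fallback API."""
    params = {"token": token, "url": page_url}
    if fallback:
        params["fallback"] = fallback
    return f"{ANALYZE_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(analyze_url("YOUR_TOKEN", "https://example.com/page", fallback="article"))
```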
2015-10-01
Hey, as of today we’re publishing a changelog. It’s visible… here.
2015-09-23
Additional token support within a single account has been added. Additional tokens are available on a case-by-case basis to paying customers. Please contact support@diffbot.com if you would like to discuss additional tokens.
API Toolkit now allows direct update of URL pattern / regular expression without having to create a new ruleset.
API Toolkit rule output automatically trims fields to remove leading or trailing blank spaces.
The diffbotUri field is now computed based on rule-based output, if a custom rule is used to override default output.
The resolvedPageUrl is correctly returned in Custom APIs (if a submitted page is redirected).
2015-09-16
Each tag in the tags element now returns a list of all matching rdfTypes.
Email invoices now return both dollars and cents.
2015-08-23
Performance improvements to Article API to prevent intermittent extra-long responses.
2015-08-13
Semantria output updated to include additional fields.
Fix for timeout parameter when sending data to Semantria.
2015-08-10
Invoices are now visible and printable within the Developer Dashboard (under “Account > Billing History”).
2015-08-06
Tokens are now case-insensitive across all Diffbot APIs.
2015-07-16
Article API now returns the siteName, publisherCountry and publisherRegion of an article, if it can be determined or if already known.
If price data is overridden with a custom rule, the corresponding “details” field (offerPriceDetails) will be computed from the overridden value.
2015-06-28
Spotify embeds are properly maintained/returned with Article API html.
Multipage articles, when concatenated, will no longer return duplicate images (that appear on multiple pages).
2015-06-11
Article and discussion tagging now supports Spanish and Chinese language tags, in addition to English and French.