Changelog

  • February 2020
  • We added support for a quarterly revenues attribute sourced from SEC filings for US corporate entities: type:Organization quarterlyRevenues.quarter:"Q1-2020"
  • You can now facet on NAICS code names: type:Organization nbEmployeesMax>100000 facet:naicsClassification.name
  • We added over 15k open-source academic journals to the list of Diffbot Knowledge Graph article sources.
  • You can now search the Knowledge Graph and enhance firmographic data profiles from the Diffbot Excel Add-In.
  • We added Diffbot Knowledge Graph ontology reference documentation.
  • We expanded Diffbot Query Language (DQL) docs.
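The two DQL queries quoted in the February 2020 entries contain quotes, colons and a ">" sign, so they need URL-encoding before being sent over HTTP. The sketch below only builds the request URL; the endpoint address and the parameter names (type, token, query, size) are assumptions for illustration and should be checked against the current Knowledge Graph documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint, not taken from this changelog.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"

# The two queries quoted in the February 2020 entries.
revenue_query = 'type:Organization quarterlyRevenues.quarter:"Q1-2020"'
facet_query = "type:Organization nbEmployeesMax>100000 facet:naicsClassification.name"


def build_dql_url(query, token, size=10):
    """Return a GET URL with the DQL query safely URL-encoded.

    The parameter names used here are assumptions for this sketch.
    """
    params = {"type": "query", "token": token, "query": query, "size": size}
    return DQL_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_dql_url(revenue_query, token="YOUR_TOKEN"))
    print(build_dql_url(facet_query, token="YOUR_TOKEN"))
```

Building the URL through urlencode rather than string concatenation keeps the straight double quotes around Q1-2020 intact as %22.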
  • January 2020
  • Improved handling of tables and lists in Article data to better support Entity tagging and linking.
  • Optimized entity tagging in the Article Title. It now occurs when the same entity is mentioned in the title and text of the Article.
  • Improved location data extraction in the Analyze API for events.
  • Added a Diffbot Excel Plug-in to enable our clients to test-drive Diffbot’s data enrichment API. The beta service currently supports organization firmographic profile data enrichment.
  • December 2019
  • Launched a new Renderer architecture in support of Crawlbot and Diffbot API services.
  • Added descriptions to 90+ million Organization entities in the KG.
  • Added 300+ local US news sources to Article data.
  • November 2019
  • Improved date/time handling.
  • Improved linking of Board Members to Organizations.
  • Added revisit/update frequency signals based on whether or not a profile was accessed in the last 30 days.
  • October 2019
  • Added longitude and/or latitude data to 53M Organizations.
  • Added sicClassification attribute to Organizations.
  • Created a more robust employment category taxonomy and ML model in support of employment data.
  • Improved coverage of parentCompany attribute for subsidiary organization entities.
  • Normalized stock exchange labels to improve filtering and discoverability.
  • Deployed bug fixes to developer Dashboards.
  • September 2019
  • Added support for the inclusion of RocketReach email contact data (in addition to LeadIQ).
  • Added support for extraction of the headquarters address from the headquarters building entity.
  • Began improvements to Record Linking for Organizations with emphasis on improving subsidiary data accuracy.
  • Improved coverage of Org and Person data records with a focus on: ‘educated at’, ‘member of’, ‘owner of’, and ‘position held’ data fields.
  • Improved Role Classifications: separating CEO and Director.
  • Enhanced and extended the Visual Query Builder Tool in the Developer Dashboard.
  • August 2019
  • Size is now supported in facet queries for articles.
  • Enabled access to crawls and bulk jobs created on child tokens from the app.diffbot.com UI when logged in under the parent token.
  • Enabled the cloning of a crawl from a crawl job page from the app.diffbot.com UI.
  • Made significant improvements to the performance of the app.diffbot.com UI.
  • Added location inference to the Natural Language API.
  • Improved how importance score is generated for spam profiles.
  • Improved deduplication on Organization Founders.
  • Now avoid linking parent and subsidiary entities to the same DiffbotURI in certain fields; for example, Google and Alphabet must have unique IDs.
  • Removed bad descriptions from the allDescriptions field.
  • Improved age calculation/inference logic.
  • July 2019
  • Added support for multiple Headquarters locations for Organizations.
  • Added support for multiple stock exchange symbol/pairs.
  • Improved extraction of city from neighborhoods.
  • Added support for display of English tags for Non-English taggers.
  • Trained a Dutch entity linker.
  • Improved RawDataSentinels supporting Organization data ingest, including subsidiary data.
  • Improved sub-record linking between Organizations and Founders.
  • Now force extraction of the headquarters address from the HQ building entity.
  • Now ensure countries are always classified as administrative areas.
  • Populated missing address in location for 81 million organizations.
  • Improved the error message returned for mismatched quotes in DQL queries.
  • Ensured users have the ability to stop or pause a crawl between crawl rounds from the Dashboard.
  • Forced the persistence of the assignment of a customAPI to a crawl job.
  • Set the article title in the <title> field.
  • Now rank person images for Person profiles.
  • In DKG: faceting on the parent key for enums now expands to <enum>.normalizedValue.
  • Now cache Person and Organization images, including logos.
  • June 2019
  • Committed to delivering 100% accuracy of ‘Fortune 1000’ Company entity profile core facts (name, headquarters location, website, CEO, founders, logo, isPublic, parent organization, year founded, stock ticker symbol and exchanges, twitter handle, size attributes – employee count & annual top-line revenues) in the Diffbot KnowledgeGraph (DKG).
  • Enhanced isPublic field population in the DKG.
  • Enhanced stock ticker symbol extraction in the DKG.
  • Fixed rules for assigning min and max employees to an Organization in the DKG.
  • Enriched 3 million organizations with no revenue data in the DKG.
  • Improved selection of location for Organization.location in the DKG.
  • Improved evaluation of postal codes when an address has no street address in the DKG.
  • Enhanced age calculation/inference in the DKG.
  • Improved Candidate selection for email address and phone number in the DKG.
  • Added support for > and < for date/time fields in DQL.
  • Querying on a DiffbotURI is now strict by default in DQL.
  • Added support for type:Post (discussions) to DQL.
  • Added contextually embedded links to docs from the Crawlbot UI.
  • May 2019
  • We addressed missing revenues for over 80 million company entities in the Diffbot KnowledgeGraph (DKG).
  • Improved DKG entity postal code assignments.
  • Improved DKG entity Stock Exchange assignments.
  • We removed cookie disclaimer text from DKG entity descriptions.
  • We improved Organization entity classification in the DKG.
  • We added the ability to facet on Organization name tokens in DQL.
  • We expanded currency support in the Diffbot extraction APIs to include ALL currencies in Europe in addition to the European Union (Euro currency standard).
  • We improved DQL error messages.
  • We lifted the limit on facet pagination.
  • Organization size attributes are now supported in facets.
  • We normalized Organization entity importance in the DKG to score between 1 and 100.
  • February 2019
  • Extended coverage of Entities located or residing in Asia to the Diffbot KnowledgeGraph.
  • Added support for the strict operator to DQL.
  • April 2019
  • Improved Organization Data Quality (i.e. sub-record linking of CEOs and Founders) in the Diffbot KnowledgeGraph (DKG).
  • Added dedicated process to parse subsidiary entities in the DKG.
  • Added support for multiple Person/Organization descriptions in the DKG.
  • Fixed date/timestamp conversion bugs in DQL.
  • Optimized revenue.value and revenue.currency extractions for Organization profile data in the DKG.
  • Added support for pagination of facets in DQL.
  • Added support for querying by tags for type:Image in DQL.
  • Added facet count to the Diffbot KnowledgeGraph Search API response.
  • October 2018
  • Added DQL support for type:Product has:breadcrumb.name
  • Added support for computation of total investment when individual investments have different currencies (Organization Profile).
  • Added support for svg image file type for Entity images.
  • Added indexing of Entity description fields.
  • Improved tokenization for Chinese/Japanese tagging.
  • Added hit count for facets.
  • December 2018
  • Improved date/time extraction, timezone support in Diffbot extraction APIs.
  • Added support for the ‘has:’ operator to DQL for Articles and Products.
  • August 2018
  • Launched the Diffbot Knowledge Graph including a new developer Dashboard, embedded ontology documentation, and an OpenAPI spec.
  • 2018-02-27
  • URL Report downloads are now sorted in newest-first order.
  • Crawlbot now indexes the seed URL of each extracted object in the fromSeedUrl field.
  • 2018-01-05
  • Crawlbot API: Added the useCanonical argument to allow disabling of canonical URL deduplication on specific crawls. See more.
  • 2017-11-10
  • Significant improvements to Video API site support.
  • 2017-10-30
  • Custom API fields using the attribute filter will now return all matching selector values, not just the first attribute match.
  • 2017-10-25
  • Crawlbot and Bulk Service data retrieval no longer requires access to port :18100. Data downloads are also now HTTPS-only.
  • 2017-10-16
  • Fixed a rare issue where custom rules could be accidentally deleted.
  • Significant performance improvements in the Search API.
  • Improved crawling performance and site coverage in the Global Index.
  • Improved ability to identify, analyze and return background images in all extraction APIs.
  • 2017-08-31
  • Fixed an issue in the Video API where the url value would retain HTML escaping if present within the original page source.
  • Fixed a rare crawling issue that occasionally resulted in “Bad IP” status messages for individual pages.
  • Fixed an issue where empty <video> elements could be returned in the Article API.
  • 2017-08-15
  • Fixed an issue in the Global Index in which complicated Boolean (OR) queries would return no results.
  • 2017-08-08
  • Improved date normalization to include Hijri and Jalali dates.
  • Fixed support for Unicode characters in API Toolkit rules.
  • 2017-05-22
  • Many improvements to brand detection in the Product API.
  • Resolved an issue where humanLanguage could be mis-identified on some Spanish-language pages.
  • 2017-05-15
  • Crawlbot: resolved an issue where IP-address-only webhooks would not receive notifications.
  • Crawlbot: improved link spidering/harvesting resilience to markup errors and other invalid HTML source.
  • Fixed an issue where custom APIs would not display in Crawlbot and Bulk Processing dashboard.
  • 2017-05-11
  • Improved link-detection when returning page links via our &fields=links argument.
  • Improved support for and handling of the srcset (and sizes) image attributes in all APIs.
  • Added detection of Afrikaans (af) in the humanLanguage field.
  • Improved duplicate detection in the Diffbot Global Index.
  • 2017-04-21
  • The beta category field has been added to the Product API. See documentation.
  • All extraction APIs now support the sending of completely custom headers using X-Forward- terminology. Previously only four defined headers were supported. See Article API example.
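As a rough illustration of the X-Forward- convention mentioned in the 2017-04-21 entry, the helper below prefixes arbitrary header names so they can be attached to an extraction request. The function name is hypothetical, and the exact prefix handling on the server side is an assumption; only the "X-Forward-" prefix itself comes from the entry above.

```python
FORWARD_PREFIX = "X-Forward-"


def to_forward_headers(custom_headers):
    """Return a copy of custom_headers with each name prefixed by X-Forward-.

    Names that already carry the prefix are left unchanged, so the helper
    can be applied more than once without doubling the prefix.
    """
    return {
        (name if name.startswith(FORWARD_PREFIX) else FORWARD_PREFIX + name): value
        for name, value in custom_headers.items()
    }


if __name__ == "__main__":
    # Hypothetical headers to relay to the target site.
    print(to_forward_headers({"Accept-Language": "de", "Cookie": "session=abc"}))
```

The resulting dictionary can be passed as the headers of whatever HTTP client is in use.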
  • 2017-04-10
  • In the Article and Discussion APIs’ tags element, DBPedia uri values are now properly URL-encoded.
  • Fixed an issue when sorting by date in the Search API.
  • Various improvements and fixes to the Global Index.
  • 2017-01-12
  • The Account API now tracks Global Index search calls/requests.
  • Improved SKU detection and extraction in the Product API.
  • Article API: Added support for the start attribute (ol elements) and data- attributes in normalized HTML.
  • In the Article API, identified image captions will no longer be returned in the text field content.
  • Various improvements to replacement rule regular expressions in Custom APIs.
  • PDF processing improvements.
  • 2016-12-09
  • Product API: overriding the sku, mpn or related fields using custom rules will now affect the productId field as well.
  • Crawls using the Analyze API will now correctly index video pages.
  • Improved the reliability of the fields=links argument in all Automatic APIs.
  • 2016-12-01
  • Updates to our rendering engine to properly support more Unicode scripts.
  • 2016-11-30
  • Updates to our status page for improved coverage and reliability.
  • Crawlbot crawls can now have repeat settings adjusted or added after a crawl completes.
  • Fixed a Crawlbot issue wherein users could completely erase the seeds field.
  • 2016-11-13
  • POSTing to our APIs is speedier, particularly when content includes slow-loading third-party assets.
  • Crawlbot now has limited support for crawling/processing content across multiple domains.
  • 2016-11-06
  • Improvements to background image detection and extraction across all Automatic APIs. This resolves many issues with proper extraction from sites that use background CSS properties for image delivery.
  • Improvements to specification extraction in the Product API.
  • Improvements to HTML <figure> parsing in the Article API.
  • 2016-10-24
  • The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
  • Various improvements to caption detection and parsing in the Article API.
  • Crawlbot now adheres to the “Diffbot” user-agent in robots.txt directives, so that our crawling can be whitelisted when crawling partner or other sites.
  • 2016-10-04
  • Increased the size limits for content POSTed to Diffbot APIs.
  • Bulk Service jobs now require a minimum of 50 URLs for Startup plan customers.
  • Bulk Service and Crawlbot jobs now automatically retry failing URLs.
  • 2016-09-12
  • Numerous improvements to normalizedSpecs in the Product API.
  • Diffbot Automatic APIs now process PDFs. PDF URLs will be converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
  • Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
  • Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a “success” message.
  • Improved handling of UTF-8 encoded characters within Crawlbot.
  • 2016-06-24
  • Fixed an issue where large Crawlbot and Bulk job downloads would prematurely terminate.
  • Added beta support for executing custom Javascript before processing a page via an extraction API. See Analyze API example (works with all Automatic and Custom APIs).
  • 2016-06-17
  • Added support for custom headers to the Crawlbot and Bulk Job interfaces. Read more.
  • Added beta field inferredCategory_beta to the Product API, which provides an automatically-determined category for any extracted product.
  • The Bulk Service will no longer normalize URLs before processing them—all pages will be sent to the specified Diffbot API as-is.
  • Various improvements to performance and quality of Diffbot’s rendering engine.
  • Corrected an issue in Custom APIs where a replacement rule would errantly trim blank spaces.
  • 2016-05-24
  • Crawlbot now supports custom headers while crawling; Bulk Processing jobs now support custom headers for all URLs. Read more.
  • Fixed an issue in Crawlbot where internal JSON objects were sometimes being returned in JSON data downloads.
  • Various improvements to date-parsing and normalization in the Article API.
  • Improvements to “Replacement” and “Ignore” filters in Custom APIs and manual rules.
  • 2016-05-12
  • Added column to the Crawlbot and Bulk API URL Report indicating if a proxy IP was used.
  • Added the argument useProxies to Crawlbot, which allows for proxy IPs to be used for specific crawls.
  • 2016-05-05
  • Added proxy usage tracking to the Account API.
  • 2016-04-28
  • Added available colors (if found on the page) to the normalizedSpecs object in the Product API. Updated format of normalizedSpecs to return multiple values, if available, for a single key.
  • Fixed an issue where image URLs with spaces in the filename would be incorrectly returned.
  • Improved proxy support in the extraction APIs to help diversify origin IPs. Read more.
  • 2016-04-15
  • Released the tagConfidence argument in the Article API, allowing for the return of tags with lower relevance scores if desired.
  • Improved Crawlbot handling of DNS and other connection issues; increased range of TLDs supported by Crawlbot.
  • Fixed an issue where duplicate tags were being returned when sentiment analysis was being performed.
  • 2016-03-25
  • Updated Crawlbot seeds behavior, so that if a non-www subdomain is specified as the only seed URL, crawling will restrict itself to that subdomain.
  • Significant updates to the beta normalizedSpecs field in the Product API. See more details.
  • Added the field parentUrlDocId to Crawlbot and Bulk Processing JSON objects. This field can be used to match objects to URLs in the Crawlbot or Bulk Processing URL Report.
  • 2016-03-10
  • Added the originalType field to extracted objects when utilizing the Analyze API’s fallback argument.
  • Fixed an issue with our Semantria integration that could lead to errant timeout responses.
  • 2016-02-25
  • Fixed an issue in the Article API to prevent in-line Javascript and CSS from being returned in the html field from unsupported video players.
  • Discussion API: Improved extraction from single-post (no reply) conversations.
  • Improvements to video extraction within the Video API.
  • 2016-01-29
  • Added beta fields quantityPrices, priceRange and multiplePrices to the Product API.
  • Improved availability detection and extraction in the Product API.
  • Improved offerPrice detection in the Product API to reduce the chance of returning an incorrect value from unavailable products or items without a visible price.
  • 2016-01-26
  • Significant speed improvements to the Global Index.
  • 2016-01-21
  • Released an official endpoint for Custom API management. Please see the documentation for information on programmatic management of custom rules and APIs.
  • Improved video extraction in the Article API to include new providers and HTML5 <video> elements.
  • max:date queries in the Search and Global Index APIs are now inclusive of the date specified.
  • 2016-01-14
  • Improved specification extraction in the Product API.
  • Fixed an issue where the estimatedDate field (Article API) would sometimes not be correctly computed.
  • 2016-01-07
  • Fixed an issue where the <base> element could be incorrectly used to calculate relative paths.
  • Added initial functionality to categorize articles in the Article API based on article text content. If you would like to test this beta feature, contact us.
  • Improved handling of media sources without a specified protocol (e.g. src="//www.youtube.com...). Media element URLs will now match the protocol of the analyzed page.
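The protocol-matching behavior described for media sources in the 2016-01-07 entry corresponds to standard relative-reference resolution. The sketch below reproduces it with Python's standard library; it is not Diffbot's implementation, and the function name and example URLs are made up for illustration.

```python
from urllib.parse import urljoin


def resolve_media_url(page_url, src):
    """Resolve a media src against the page URL.

    A protocol-relative src such as //host/path inherits the scheme of
    the analyzed page; an absolute src is returned unchanged.
    """
    return urljoin(page_url, src)


if __name__ == "__main__":
    print(resolve_media_url("https://example.com/post", "//www.youtube.com/embed/abc"))
    print(resolve_media_url("http://example.com/post", "//www.youtube.com/embed/abc"))
```

The same src yields an https URL for a page fetched over https and an http URL for a page fetched over http.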
  • 2015-12-21
  • Crawlbot and Bulk jobs pending delete (per your Diffbot plan) are now identified in the Crawlbot and Bulk interfaces.
  • The API Toolkit now uses Diffbot’s custom rendering engine for live web page previews. This should reduce inaccuracies when creating custom rules.
  • 2015-12-18
  • Fixed an issue where plain-text POSTed to the Article API would not perform text analysis (tags, sentiment, language-detection).
  • Improved Crawlbot behavior on Ajax-heavy sites so that pages with the exact same HTML source are no longer deduplicated.
  • Fixed an issue within the Crawlbot and Bulk interfaces where the “Last 500” URL Report was incorrectly returning the first 500.
  • Improved author detection within the Article API.
  • 2015-12-07
  • The Analyze API now supports POSTed content.
  • 2015-11-27
  • The Account API now returns a list of child or sub-tokens.
  • 2015-11-19
  • Fixed an issue in the Analyze API where products with an API-Toolkit-overridden price field would not reflect changes in the “details” field (offerPriceDetails, regularPriceDetails, etc.).
  • Fixed an Article API issue for certain top-level domains where articles dated in the near future (e.g., tomorrow) would incorrectly be returned with a date from the prior year.
  • 2015-11-11
  • Crawlbot will now successfully spider URLs that contain (invalid) UTF-8 characters.
  • Global Index API: search-by-tag can optionally be performed using a tag-match shorthand.
  • 2015-10-22
  • We now offer an Account API for tracking token API usage and billing history.
  • Global Index API: negative search queries (diffbot AND -"machine learning") are now functioning as documented.
  • 2015-10-16
  • Fixed an issue where Crawlbot and Bulk API data downloads did not include a filename.
  • The breadcrumb element is now a default field in the Article API.
  • 2015-10-08
  • APIs no longer ignore “format characters”—invisible characters that may have an effect on neighboring characters. For example, &zwnj;.
  • Crawlbot and Bulk Service URL Reports now offer an option to download the last 500 URLs crawled.
  • Global Index API: Faceted date queries will no longer return a min value of 0.
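For readers unfamiliar with the "format characters" mentioned in the 2015-10-08 entry: in Unicode these are the code points in general category Cf, which includes the zero-width non-joiner (&zwnj;, U+200C). The following sketch, which is independent of any Diffbot API, lists such characters in a string so they can be inspected rather than silently dropped.

```python
import unicodedata


def format_characters(text):
    """Return (index, character name) pairs for every Unicode format
    character (general category Cf) found in text."""
    return [
        (index, unicodedata.name(char, "UNKNOWN"))
        for index, char in enumerate(text)
        if unicodedata.category(char) == "Cf"
    ]


if __name__ == "__main__":
    # Two Arabic-script letters, a zero-width non-joiner, then another letter.
    sample = "\u0645\u06cc\u200c\u062e"
    print(format_characters(sample))
```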
  • 2015-10-02
  • Analyze API now supports a “fallback” API via the argument fallback. By passing an API value, any otherwise unsupported pages will be forcibly processed via that API. E.g., passing fallback=article will result in any “other” pages being processed by the Article API.
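A minimal sketch of how the fallback argument might be attached to an Analyze API request. Only the argument name fallback and the example value article come from the entry above; the endpoint address and the token and url parameter names are assumptions for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint, not taken from this changelog.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_analyze_url(page_url, token, fallback=None):
    """Build an Analyze API request URL, optionally naming a fallback API.

    The token and url parameter names are assumptions for this sketch.
    """
    params = {"token": token, "url": page_url}
    if fallback:
        # e.g. "article": process otherwise unsupported pages as articles
        params["fallback"] = fallback
    return ANALYZE_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_analyze_url("https://example.com/about", "YOUR_TOKEN", fallback="article"))
```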
  • 2015-10-01
  • Hey, as of today we’re publishing a changelog. It’s visible… here.
  • 2015-09-23
  • Additional token support within a single account has been added. Additional tokens are available on a case-by-case basis to paying customers. Please contact support@diffbot.com if you would like to discuss additional tokens.
  • API Toolkit now allows direct update of URL pattern / regular expression without having to create a new ruleset.
  • API Toolkit rule output automatically trims fields to remove leading or trailing blank spaces.
  • The diffbotUri field is now computed based on rule-based output, if a custom rule is used to override default output.
  • The resolvedPageUrl is correctly returned in Custom APIs (if a submitted page is redirected).
  • 2015-09-16
  • Each tag in the tags element now returns a list of all matching rdfTypes.
  • Email invoices now return both dollars and cents.
  • 2015-08-23
  • Performance improvements to Article API to prevent intermittent extra-long responses.
  • 2015-08-13
  • Semantria output updated to include additional fields.
  • Fix for timeout parameter when sending data to Semantria.
  • 2015-08-10
  • Invoices are now visible and printable within the Developer Dashboard (under “Account > Billing History”).
  • 2015-08-06
  • Tokens are now case-insensitive across all Diffbot APIs.
  • 2015-07-16
  • Article API now returns the siteName, publisherCountry and publisherRegion of an article, if it can be determined or if already known.
  • If price data is overridden with a custom rule, the corresponding “details” field (offerPriceDetails) will be computed from the overridden value.
  • 2015-06-28
  • Spotify embeds are properly maintained/returned with Article API html.
  • Multipage articles, when concatenated, will no longer return duplicate images (that appear on multiple pages).
  • 2015-06-11
  • Article and discussion tagging now supports Spanish and Chinese language tags, in addition to English and French.