For Immediate Release: Aug 16, 2012

Diffbot’s New API Is a Decoder Ring for the Web

"Page Classifier" developer tool from Diffbot uses visual learning robot to instantly identify page content that lies behind any URL

SAN FRANCISCO-- Hyperlinks are core to the Web, yet with the rise of mobile browsing, social sharing sites and link shortening services, they are becoming increasingly difficult for both humans and machines to decode. Without context, no one knows if a link shared on Twitter is a picture, an article, a product, or a Trojan horse.

This problem is even more cumbersome for developers building applications around Web content, who deal with thousands or even millions of links each day. Today Diffbot, creator of visual learning robot technology that lets developers instantly analyze, extract, and enhance Web content, opened a new beta API, dubbed “Page Classifier,” to address it.

Building atop the more than 100 million URLs analyzed monthly by its developers, Diffbot has trained its visual robot to categorize the entire web into 20 different page types. The new Page Classifier tool lets developer applications identify and classify the type of page and language behind any URL, understanding instantly whether a given link leads to a product, map, social networking profile or any of the other identified page types. Page Classifier leverages Diffbot’s proprietary computer vision and natural-language processing technologies to correctly recognize more than 90% of the Web.

To showcase the new API’s capabilities, Diffbot released an infographic built from Page Classifier data titled “A Day in the Life of Twitter.” Using the new Page Classifier API, Diffbot analyzed three quarters of a million links shared on Twitter on July 11 and 12, 2012, and categorized the underlying pages shared. Some of the company’s key findings are below:

  • Photos and images make up more than 35% of all shared links, with articles/blog posts at 16%, videos at 9% and products at 8%.
  • English-language sites account for 68% of all shared links. (Japanese-language pages are second with 7%.)
  • “Status messages” (links to other Twitter messages, expanded Tweet services such as twitlonger.com and tmi.me, and links to Facebook status posts) make up 7% of all shared links.
  • YouTube represents 60% of all shared video links, with Livestream’s Twitcam second with 9%. Japan’s Nico Nico Douga (www.nicovideo.jp) makes up 6% of all shared videos.
  • Instagram represents 15% of all shared photos, second only to Twitter’s native photo upload, which captures 40% of all shared photos.
  • Live personal video streaming – from Livestream’s Twitcam, UStream, and Japan’s Twitcasting.tv – is more than 10% of all shared video.
  • Nearly 6% of shared links result in error pages, due to user typos or content that is subsequently removed.
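
The shares above are simple proportions over the classified links. A minimal Python sketch of that tally, assuming each link has already been assigned a page-type label; the function name, the label strings and the sample counts are made up for illustration and are not Diffbot's data or API output:

```python
from collections import Counter


def share_by_type(labels):
    """Return each page type's share of the links as a percentage (one decimal)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}  # no links, so no shares to report
    # most_common() orders the result from the largest share to the smallest
    return {
        page_type: round(100 * n / total, 1)
        for page_type, n in counts.most_common()
    }


# Hypothetical labels for 20 links (illustrative only, not real measurements)
sample_labels = (
    ["image"] * 7
    + ["article"] * 3
    + ["video"] * 2
    + ["product"] * 2
    + ["error"] * 1
    + ["other"] * 5
)

shares = share_by_type(sample_labels)
for page_type, pct in shares.items():
    print(f"{page_type}: {pct}%")
```

The same tally could be run per language or per video host to produce the other breakdowns in the list.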

“Just as you and I can look at a given page and say, ‘hey, that’s a recipe, or a map, or a product,’ Page Classifier does the same thing automatically for any URL,” said Mike Tung, President of Diffbot. “Most of the Web fits neatly into our 20 identified page types, and now Page Classifier allows developer applications to automatically organize any webpage into one of these types.”

“Our eventual goal is to identify and extract the important objects from all pages across the Web, things like the reviews about a product, or the author and content of an article,” continued Mr. Tung. “To reach that goal, we first need to teach Diffbot to recognize what kind of page it’s looking at. Page Classifier is the foundation that will let us identify the important pieces of the Web no matter what kind of page we’re dealing with.”

Diffbot’s Page Classifier technology is already being used by launch partner Springpad, a social bookmarking application that automatically organizes links users have stored from around the Web and enhances that content with useful associated links and offers to save users time and money.

"Springpad's goal is to go beyond clipping a link to help our users save and access what's most useful to them, be it a product, recipe, movie, book or restaurant," said Jeff Chow, CEO of Springpad. "Diffbot's Page Classifier is a plug-and-play solution that has greatly helped us improve our categorization and user experience."

About Diffbot:

Diffbot looks at Web content with a human set of eyes. It is a robot that examines the Web using artificial intelligence, computer vision, machine learning and natural language processing, and provides software developers with tools to find, extract and understand objects from any Web page for use in their applications. Thousands of developers use Diffbot APIs to create consumer-friendly applications that use visual interpretation of the Web to re-imagine search, mobile web and hundreds of other consumer applications. Diffbot is based in Palo Alto, CA.

To learn more, visit www.diffbot.com.