An API for the Web
We've determined that the entire Web can be classified into approximately 18 structural page types. Building on this understanding of common page layouts, Diffbot uses computer vision, natural language processing, and other machine learning algorithms to identify and extract the important items within these pages.
Benefits
Accurate
We use state-of-the-art computer vision and NLP algorithms, maintain the largest collection of tagged pages, and update our model several times per week.
Easy
Pass in a URL and we'll do the rest. Stop spending time building custom scrapers and, even worse, maintaining them.
Stable
Diffbot is built and run by Web veterans in a multi-tiered environment with redundancy, monitoring, and scalability built in. Our scale lets us operate the service more cheaply than you could run it yourself.
Open
We use open standards (schema.org) and allow for endless configurability via our customization tool.
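To make the "pass in a URL" workflow concrete, here is a minimal Python sketch of what a call might look like. The endpoint path, the `token` and `url` parameter names, and the helper names `build_request_url` and `extract` are assumptions for illustration, not confirmed by this page; check the current API documentation before relying on them.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the endpoint path and parameter names are illustrative only.
API_ENDPOINT = "https://api.diffbot.com/v3/analyze"


def build_request_url(page_url, token, endpoint=API_ENDPOINT):
    """Build the request URL, percent-encoding the target page URL."""
    query = urllib.parse.urlencode({"token": token, "url": page_url})
    return f"{endpoint}?{query}"


def extract(page_url, token, timeout=30):
    """Send the request and return the parsed JSON response as a dict."""
    request_url = build_request_url(page_url, token)
    with urllib.request.urlopen(request_url, timeout=timeout) as resp:
        return json.load(resp)


# Example usage (requires a valid token and network access):
#   result = extract("https://example.com/some-article", "YOUR_TOKEN")
#   print(result)
```

The only input specific to a page is its URL; there are no per-site selectors or parsing rules to write or maintain, which is the point of the Easy benefit above.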