Diffbot's Bulk Processor lets you send a large number of URLs through a Diffbot extraction API for fast, asynchronous processing.
The Bulk Processor sends all submitted page URLs to a Diffbot API (either automatic or custom). All structured page results are then compiled into a single "collection," which can be downloaded in full or searched using the Search API.
Creating a Bulk Job
Each bulk job requires the following:
- A name (e.g., "NewProducts").
- Multiple URLs to process, one per line. If you are on the Startup plan, jobs require at least 50 URLs.
- A Diffbot API to be used for processing pages.
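The three requirements above can be checked before a job is submitted. The following is a minimal sketch, not Diffbot's own validation: the `BulkJob` class, the `build_bulk_job` helper, and the `startup_plan` flag are hypothetical names introduced for illustration, and "multiple URLs" is assumed to mean at least two.

```python
from dataclasses import dataclass
from typing import List

# Minimum taken from the requirement above for Startup-plan jobs.
STARTUP_MIN_URLS = 50


@dataclass
class BulkJob:
    """Hypothetical container for the three required inputs."""
    name: str
    urls: List[str]
    api_url: str  # the Diffbot API that will process each page


def build_bulk_job(name, url_text, api_url, startup_plan=False):
    """Validate the required inputs and return a BulkJob."""
    if not name.strip():
        raise ValueError("a bulk job needs a name")
    if not api_url.strip():
        raise ValueError("a bulk job needs a Diffbot API to process pages")
    # One URL per line; skip blank lines and trim surrounding whitespace.
    urls = [line.strip() for line in url_text.splitlines() if line.strip()]
    if len(urls) < 2:  # assumption: "multiple" means two or more
        raise ValueError("a bulk job needs multiple URLs")
    if startup_plan and len(urls) < STARTUP_MIN_URLS:
        raise ValueError(
            f"Startup plan jobs need at least {STARTUP_MIN_URLS} URLs"
        )
    return BulkJob(name=name.strip(), urls=urls, api_url=api_url.strip())
```

For example, `build_bulk_job("NewProducts", open("urls.txt").read(), api_url)` would read a one-URL-per-line file and raise a `ValueError` if any requirement is not met.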
Passing Diffbot API Querystring Arguments
The Bulk Processor hands off URLs to specific Diffbot APIs for processing. Each of these APIs has optional querystring arguments that can be used to modify the information returned -- most commonly the fields argument, for adding optional fields to the Diffbot response.
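As an illustration of attaching querystring arguments to an API URL, here is a small sketch using Python's standard library. The helper name `with_query_args`, the example endpoint, and the field names `meta` and `links` are assumptions chosen for illustration; check the documentation of the specific Diffbot API for the arguments it accepts.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_query_args(api_url, **args):
    """Return api_url with the given querystring arguments merged in."""
    parts = urlsplit(api_url)
    query = dict(parse_qsl(parts.query))
    for key, value in args.items():
        # Lists and tuples become comma-separated values.
        if isinstance(value, (list, tuple)):
            query[key] = ",".join(value)
        else:
            query[key] = str(value)
    # safe="," keeps commas readable instead of encoding them as %2C.
    return urlunsplit(parts._replace(query=urlencode(query, safe=",")))


# Illustrative endpoint and field names (assumptions, not confirmed here).
api_url = with_query_args(
    "https://api.diffbot.com/v3/article", fields=["meta", "links"]
)
print(api_url)
```

The resulting URL is what would be supplied as the job's Diffbot API, so every page in the job is processed with the same arguments.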
You can choose to be notified at the conclusion of each bulk job, either by webhook or email.
If you choose "webhook," you will need to supply a URL that is capable of receiving a POST request. As an alternative to building your own endpoint, you can use the Diffbot app on Zapier to receive webhook notifications.
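If you do build your own endpoint, it only needs to accept a POST request. The sketch below uses Python's built-in `http.server` and is meant for local experimentation rather than production use. The structure of the notification body is not described here, so the code assumes it may be JSON and otherwise keeps the raw text; `parse_notification`, `BulkJobWebhook`, and `make_server` are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

received = []  # notifications seen so far (in memory, for illustration only)


def parse_notification(body):
    """Decode a POST body. Assumes JSON; falls back to the raw text."""
    if not body:
        return {}
    try:
        return json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return {"raw": body.decode("utf-8", errors="replace")}


class BulkJobWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        received.append(parse_notification(self.rfile.read(length)))
        # Acknowledge receipt with an empty 200 response.
        self.send_response(200)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging in this example


def make_server(port=8080):
    """Create a server bound to localhost on the given port."""
    return HTTPServer(("127.0.0.1", port), BulkJobWebhook)
```

Calling `make_server().serve_forever()` starts the listener. To receive real notifications, the endpoint would have to be reachable from the public internet at the URL you supply for the job.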
Accessing Bulk Job Data
You can access processed data at any time during your bulk job or after it completes. There are two download options within the interface:
- Full JSON Output: A single file, in valid JSON, containing all of the processed objects from your job.
- CSV Output: A single comma-separated-values file of the top-level objects. Nested elements (article images, tags, etc.) will not be returned in the CSV.
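To make the difference between the two formats concrete, the following sketch derives a top-level-only CSV from JSON data. It is an approximation of the idea, not Diffbot's export logic: it assumes the JSON file is an array of objects, and the sample field names (`title`, `pageUrl`, `images`, `tags`) are illustrative.

```python
import csv
import io
import json


def top_level_csv(json_text):
    """Convert a JSON array of objects to CSV, dropping nested values."""
    objects = json.loads(json_text)
    # Keep only scalar fields; lists and dicts are the "nested elements".
    rows = [
        {k: v for k, v in obj.items() if not isinstance(v, (list, dict))}
        for obj in objects
    ]
    # Column order follows first appearance across all rows.
    columns = list(dict.fromkeys(k for row in rows for k in row))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # fields missing from a row are left empty
    return out.getvalue()


sample = json.dumps([
    {"title": "A", "pageUrl": "https://example.com/a",
     "images": [{"url": "https://example.com/a.png"}], "tags": []},
    {"title": "B", "pageUrl": "https://example.com/b"},
])
print(top_level_csv(sample))
```

In this sample the `images` and `tags` values do not appear in the CSV, which is why the JSON output is the better choice when nested data matters.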
If you only want to access a subset of your data, the Search API offers much more flexibility: it lets you query the collection and retrieve only the matching items.