Crawlbot API

The Crawlbot API allows you to programmatically manage Crawlbot crawls and retrieve output. Crawlbot API responses are in JSON format.

Creating or Updating a Crawl

To create a crawl, make a GET request to https://api.diffbot.com/v3/crawl.

Provide the following data:

token
  Developer token.

name
  Job name. This should be a unique identifier and can be used to modify your crawl or retrieve its output.

seeds
  Seed URL(s). Must be URL-encoded. Separate multiple URLs with whitespace to spider multiple sites within the same crawl. If a seed contains a non-www subdomain ("http://blog.diffbot.com" or "http://support.diffbot.com"), Crawlbot will restrict spidering to that subdomain.

apiUrl
  Full Diffbot API URL through which to process pages, e.g., &apiUrl=https://api.diffbot.com/v3/article to process matching links via the Article API. The Diffbot API URL can include querystring parameters to tailor the output. For example, &apiUrl=https://api.diffbot.com/v3/product?fields=querystring,meta will process matching links using the Product API and also return the querystring and meta fields.

To automatically identify and process content using our Page Classifier API (Smart Processing), pass apiUrl=https://api.diffbot.com/v3/analyze?mode=auto to return all page types. See the full Page Classifier documentation under the Automatic APIs section.

Be sure to URL encode your Diffbot API actions.
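As an illustration, here is a minimal Python sketch (using the requests library, which URL-encodes parameter values automatically) that creates a crawl processing article pages. The token, job name, and seed URL are placeholders, not values from this documentation:

import requests

# Placeholders: substitute your own developer token and identifiers.
TOKEN = "YOUR_DIFFBOT_TOKEN"

params = {
    "token": TOKEN,
    "name": "sampleCrawl",
    "seeds": "https://blog.diffbot.com",
    # Process matching pages with the Article API; requests URL-encodes
    # this value (including its nested querystring) for you.
    "apiUrl": "https://api.diffbot.com/v3/article?fields=meta,querystring",
}

resp = requests.get("https://api.diffbot.com/v3/crawl", params=params)
resp.raise_for_status()
print(resp.json().get("response"))  # e.g. "Successfully added urls for spidering."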

You can refine your crawl using the following optional controls. Read more on crawling versus processing.

urlCrawlPattern
  Specify ||-separated strings to limit pages crawled to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string (e.g., !product to exclude URLs containing the string "product") and the ^ and $ characters to limit matches to the beginning or end of the URL. A combined example appears after this table.

  Using a urlCrawlPattern allows Crawlbot to spider outside of the seed domain; it will follow all matching URLs regardless of domain.

urlCrawlRegEx
  Specify a regular expression to limit pages crawled to those URLs that contain a match for your expression. This overrides any urlCrawlPattern value.

urlProcessPattern
  Specify ||-separated strings to limit pages processed to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string (e.g., !/category to exclude URLs containing the string "/category") and the ^ and $ characters to limit matches to the beginning or end of the URL.

urlProcessRegEx
  Specify a regular expression to limit pages processed to those URLs that contain a match for your expression. This overrides any urlProcessPattern value.

pageProcessPattern
  Specify ||-separated strings to limit pages processed to those whose HTML contains any of the content strings.
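The pattern controls can be combined in a single crawl request. A hypothetical Python sketch; the pattern values are illustrative only:

import requests

# Crawl only URLs containing "/blog/", and skip processing of any URL
# containing "/category" (the leading ! negates the match).
params = {
    "token": "YOUR_DIFFBOT_TOKEN",
    "name": "patternCrawl",
    "seeds": "https://blog.diffbot.com",
    "apiUrl": "https://api.diffbot.com/v3/article",
    "urlCrawlPattern": "/blog/",        # join multiple strings with ||
    "urlProcessPattern": "!/category",
}

requests.get("https://api.diffbot.com/v3/crawl", params=params)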

Additional (optional) Parameters:

customHeaders
  Set multiple custom headers to be used while crawling and processing pages sent to Diffbot APIs. Each header should be sent in its own customHeaders argument, with a colon delimiting the header name and value, and should be URL-encoded. For example, &customHeaders=Accept-Language%3Aen-us. See more information on using this functionality.

obeyRobots
  Pass obeyRobots=0 to ignore a site's robots.txt instructions.

restrictDomain
  Pass restrictDomain=0 to allow limited crawling across subdomains/domains.

useProxies
  Set useProxies=1 to force the use of proxy IPs for the crawl. Proxy servers will be used for both crawling and processing of pages.

maxHops
  Specify the depth of your crawl. maxHops=0 limits processing to the seed URL(s) only; no other links will be processed. maxHops=1 processes all (otherwise matching) pages whose links appear on the seed URL(s); maxHops=2 processes pages whose links appear on those pages; and so on.

  By default (maxHops=-1) Crawlbot will crawl and process links at any depth.

maxToCrawl
  Maximum number of pages to spider. Default: 100,000.

maxToProcess
  Maximum number of pages to process through Diffbot APIs. Default: 100,000.

notifyEmail
  Send a message to this email address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.

notifyWebhook
  Pass a URL to be notified when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the job's JSON metadata in the POST body. Note that in webhook POSTs the parent jobs will not be sent; only the individual job object will be returned. A sample receiver is sketched after this table.

  We've integrated with Zapier to make webhooks even more powerful; read more on what you can do with Zapier and Diffbot.

crawlDelay
  Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number (e.g., crawlDelay=0.25).

repeat
  Specify the number of days as a floating-point value (e.g., repeat=7.0) to repeat this crawl. By default crawls will not be repeated.

onlyProcessIfNew
  By default, repeat crawls will only process new (previously unprocessed) pages. Set onlyProcessIfNew=0 to process all content on repeat crawls.

maxRounds
  Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.
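A sketch of a notifyWebhook receiver, assuming a small Flask app; the route path and port are arbitrary choices for illustration:

from flask import Flask, request

app = Flask(__name__)

# Crawlbot POSTs the job's JSON metadata here when the crawl completes
# or hits a limit, with the crawl name and status in the headers.
@app.route("/crawl-finished", methods=["POST"])
def crawl_finished():
    name = request.headers.get("X-Crawl-Name")
    status = request.headers.get("X-Crawl-Status")
    job = request.get_json(force=True)  # the individual job object
    print(f"Crawl {name}: status {status}, jobStatus {job.get('jobStatus')}")
    return "", 204

if __name__ == "__main__":
    app.run(port=8000)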

Response

Upon adding a new crawl, you will receive a success message in the JSON response, in addition to full crawl details:

  "response": "Successfully added urls for spidering."

Pausing, Restarting or Deleting Crawls

You can manage your crawls by making GET requests to the same endpoint, https://api.diffbot.com/v3/crawl.

Provide the following data:

token
  Developer token.

name
  Job name as defined when the crawl was created.

Job-control arguments:

roundStart
  Pass roundStart=1 to force the start of a new crawl "round" (manually repeat the crawl). If onlyProcessIfNew is set to 1 (the default), only newly-created pages will be processed.

pause
  Pass pause=1 to pause a crawl. Pass pause=0 to resume a paused crawl.

restart
  Pass restart=1 to restart a crawl. Restarting removes all crawled data while maintaining crawl settings.

delete
  Pass delete=1 to completely delete a crawl and all of its associated data.
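For example, pausing and later resuming a crawl is just two GET requests (the token and job name are placeholders):

import requests

ENDPOINT = "https://api.diffbot.com/v3/crawl"
params = {"token": "YOUR_DIFFBOT_TOKEN", "name": "sampleCrawl"}

# Pause the crawl ...
requests.get(ENDPOINT, params={**params, "pause": 1})

# ... and resume it later.
requests.get(ENDPOINT, params={**params, "pause": 0})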

Retrieving Crawlbot API Data

To download results, make a GET request to https://api.diffbot.com/v3/crawl/data. Provide the following arguments based on the data you need. By default the complete extracted JSON data will be downloaded.

token
  Diffbot token.

name
  Name of the crawl whose data you wish to download.

format
  Request format=csv to download the extracted data in CSV format (default: json). Note that CSV files will only contain top-level fields.

For diagnostic data:

type
  Request type=urls to retrieve the crawl URL Report (CSV).

num
  Pass an integer value (e.g., num=100) to request a subset of URLs, most recently crawled first.
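A sketch of both download modes in Python; the crawl name and output filename are placeholders:

import requests

DATA_ENDPOINT = "https://api.diffbot.com/v3/crawl/data"
params = {"token": "YOUR_DIFFBOT_TOKEN", "name": "sampleCrawl"}

# Complete extracted objects as JSON (the default format).
objects = requests.get(DATA_ENDPOINT, params=params).json()

# Diagnostic URL report, limited to the 100 most recently crawled URLs.
report = requests.get(DATA_ENDPOINT, params={**params, "type": "urls", "num": 100})
with open("sampleCrawl_urls.csv", "wb") as f:
    f.write(report.content)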

Using the Search API to Retrieve Crawl Data

You can also use Diffbot's Search API to fine-tune the data retrieved from your Crawlbot or Bulk API jobs.

See the Search API documentation for details.

Viewing Crawl Details

Your active crawls (and any active Bulk API jobs) are returned in the jobs object of the response to a GET request to https://api.diffbot.com/v3/crawl.

To retrieve a single crawl's details, provide the crawl's name in your request:

token
  Developer token.

name
  Name of the crawl to retrieve.

To view all crawls, simply omit the name parameter.
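For instance (a Python sketch with placeholder token and crawl name):

import requests

ENDPOINT = "https://api.diffbot.com/v3/crawl"
TOKEN = "YOUR_DIFFBOT_TOKEN"

# All of this token's crawl and Bulk API jobs.
all_jobs = requests.get(ENDPOINT, params={"token": TOKEN}).json()["jobs"]

# Details for one named crawl.
job = requests.get(
    ENDPOINT, params={"token": TOKEN, "name": "sampleCrawl"}
).json()["jobs"][0]
print(job["jobStatus"])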

Response

This will return a JSON response listing your token's crawl (and Bulk API) jobs in the jobs object. If you specify a single job name, only that job's details will be returned.

Sample response from a single crawl:

{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 177,
      "urlsHarvested": 2152,
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageCrawlSuccessesThisRound": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210,
      "pageProcessSuccessesThisRound": 210,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 1,
      "maxToCrawl": 100000,
      "maxToProcess": 100000,
      "onlyProcessIfNew": 1,
      "seeds": "http://support.diffbot.com",
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822683,
      "currentTimeUTC": 1443822683,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json",
      "downloadUrls": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_urls.csv",
      "notifyEmail": "support@diffbot.com",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}

Status Codes

The jobStatus object will return the following status codes and associated messages:

Status  Message
0       Job is initializing
1       Job has reached maxRounds limit
2       Job has reached maxToCrawl limit
3       Job has reached maxToProcess limit
4       Next round to start in _____ seconds
5       No URLs were added to the crawl
6       Job paused
7       Job in progress
8       All crawling temporarily paused by root administrator for maintenance
9       Job has completed and no repeat is scheduled
10      Failed to crawl any seed (indicates a problem retrieving links from the seed URL(s))
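These codes lend themselves to simple polling. A sketch that waits for a crawl to finish; the 30-second interval and the terminal codes checked (9 for completion, 10 for seed failure) are choices made for illustration:

import time
import requests

ENDPOINT = "https://api.diffbot.com/v3/crawl"
params = {"token": "YOUR_DIFFBOT_TOKEN", "name": "sampleCrawl"}

while True:
    job_status = requests.get(ENDPOINT, params=params).json()["jobs"][0]["jobStatus"]
    print(job_status["status"], job_status["message"])
    if job_status["status"] in (9, 10):  # completed, or failed to crawl any seed
        break
    time.sleep(30)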