Global Index API (BETA)

Diffbot's Global Index API allows you to search a continually-updated database of news articles and other objects (discussion posts, comments, products, images and videos).

While all of the above types are represented in the Global Index, currently only news sources are preemptively and consistently crawled. More than 20,000 sources are regularly and continuously monitored for new content.

The Global Index API uses the same syntax as our Search API.

Request

To use the Global Index API, perform a GET request on the following endpoint:

https://api.diffbot.com/v3/search

Provide the following arguments:

Argument | Description
token | Developer token
col | To search the global index you must specify "GLOBAL-INDEX" as your search collection, col=GLOBAL-INDEX.
query | Search query. Must be URL-encoded. Please see the full list of query operators below.

Optional arguments

Argument | Description
num | Number of results to return. Default is 20. Maximum number of results is 1000.
start | Ordinal position of first result to return. (First position is 0.) Default is 0.
facets | Facet query/ies to return aggregated metadata and counts for search terms. See Facet Queries, below.
numFacets | Maximum number of individual facet elements to return (default = 50).
cluster | Articles only. Utilize Diffbot clustering algorithms to identify related or similar objects. Pass cluster=all to return each article with a clusterId identifying thematically-similar articles. Pass cluster=best to remove thematically-similar articles from the result set, leaving a single best representative from each cluster. Pass cluster=dedupe to remove exact and near-duplicate articles from the result set.
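
As a rough illustration of how the arguments above fit together, the following Python sketch assembles a request URL. The helper name build_search_url and the placeholder token are assumptions made for this example; only the endpoint and argument names come from this page. The sketch builds the URL and does not send the request.

```python
from urllib.parse import quote, urlencode

SEARCH_ENDPOINT = "https://api.diffbot.com/v3/search"


def build_search_url(token, query, num=None, start=None, facets=None, cluster=None):
    """Assemble a Global Index search URL (hypothetical helper, not an official client)."""
    params = {"token": token, "col": "GLOBAL-INDEX", "query": query}
    optional = {"num": num, "start": start, "facets": facets, "cluster": cluster}
    # Only send optional arguments that were actually provided.
    params.update({k: v for k, v in optional.items() if v is not None})
    # quote_via=quote encodes spaces as %20 rather than "+".
    return SEARCH_ENDPOINT + "?" + urlencode(params, quote_via=quote)


url = build_search_url("YOUR_TOKEN", 'type:article author:"Bill Simmons"', num=100)
print(url)
```

The resulting string can be passed to any HTTP client that performs a GET request.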

Query Operators

The query argument contains the search you wish to perform on your collection(s). Multiple operators and values can be combined.

query= | Returns...
computer vision | All objects containing "computer" and "vision" anywhere in all Diffbot-extracted fields.
"web page analysis" | All objects containing the phrase "web page analysis" anywhere in all Diffbot-extracted fields.
author:"John Henry" | All objects containing "John Henry" in the author field. All Diffbot fields within a collection can be queried against. Other examples: tags:"Barack Obama", offerPrice:10.00
images.caption:flower | All objects containing "flower" in the caption field within a nested images array. Other example: images.url:amazon.com
type:article | All objects identified as articles / processed by the Article API. Other examples: type:product, type:image
football -49ers | All objects containing "football" but not containing "49ers," in all Diffbot-extracted fields. Other examples: title:pantene -title:conditioner, text:diffbot -author:"farhad manjoo"
john OR paul | Objects containing either "john" or "paul" in Diffbot-extracted fields. OR must be capitalized. Other examples: title:ukraine OR title:putin, "bill clinton" OR "george bush", title:"puppy chow" OR text:"dog food"
george AND ringo | Objects containing both "george" and "ringo" in Diffbot-extracted fields. AND must be capitalized. Other examples: title:lakers AND text:basketball, "red sox" AND author:"bill simmons"
(obama OR clinton) AND president | Objects containing either "obama" or "clinton," and also "president." Parentheses can be used to nest Boolean queries. Other examples: (title:diffbot AND title:robots) OR title:startup
site:diffbot.com | All objects extracted from diffbot.com. The site operator queries against values in the pageUrl and resolvedPageUrl fields.
sortby:timestamp | Objects sorted (descending) by the specified Diffbot field. Must be a numeric (e.g. "offerPrice") or date field. For date formatting, see Date Queries below.
revsortby:date | Objects sorted (ascending) by the specified Diffbot field. Must be a numeric or date field. For date formatting, see Date Queries below.
min:timestamp:1426784899 | All objects indexed (processed) after March 19, 2015 (in Unix epoch time). The min or max operators work only with numeric or date fields. For date formatting options, see Date Queries below.
max:offerPrice:1000 | All objects with an offerPrice value equal to or less than "1000." Must be a numeric or date field. For date formatting options, see Date Queries below.
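
Because operators are combined simply by placing them side by side, a query string can be composed programmatically. The sketch below is one possible way to do that in Python; the helper names clause, any_of and build_query are hypothetical and exist only for this illustration. It assumes that quoting is needed only when a value contains whitespace, as in the author:"John Henry" example.

```python
def clause(field, value):
    """Return field:value, quoting the value when it contains whitespace."""
    text = str(value)
    if any(ch.isspace() for ch in text):
        text = '"' + text + '"'
    return f"{field}:{text}"


def any_of(*terms):
    """Group terms with a capitalized OR inside parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query(*parts):
    """Join clauses with single spaces, skipping empty parts."""
    return " ".join(p for p in parts if p)


query = build_query(
    clause("type", "article"),
    clause("author", "Bill Simmons"),
    clause("site", "grantland.com"),
    "revsortby:date",
)
print(query)
print(any_of("obama", "clinton") + " AND president")
```

The composed string is the raw query; it still needs to be URL-encoded before it is placed in a request.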

Query URL Encoding

Be sure to URL-encode your query. The following examples show some sample queries properly encoded.

Query | URL-Encoded Query
computer vision | computer%20vision
obama type:article sortby:date | obama%20type%3Aarticle%20sortby%3Adate
text:cats author:"Jet Li" min:date:1399669321 | text%3Acats%20author%3A%22Jet%20Li%22%20min%3Adate%3A1399669321
min:date:"Thu, 22 May 2014 00:00:00 GMT" | min%3Adate%3A%22Thu%2C%2022%20May%202014%2000%3A00%3A00%20GMT%22

The space character can be represented by "%20" or the "+"-sign.
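
Most languages ship a percent-encoding routine, so queries rarely need to be encoded by hand. As a minimal sketch, Python's standard urllib.parse.quote reproduces the encoded forms of the sample queries on this page when called with safe="" so that ":" and "," are also escaped. The wrapper name encode_query is an assumption for this example.

```python
from urllib.parse import quote, unquote_plus


def encode_query(query):
    """Percent-encode a raw query string, using %20 for spaces."""
    return quote(query, safe="")


samples = [
    "computer vision",
    "obama type:article sortby:date",
    'text:cats author:"Jet Li" min:date:1399669321',
    'min:date:"Thu, 22 May 2014 00:00:00 GMT"',
]
for raw in samples:
    print(encode_query(raw))

# "+" and "%20" decode to the same space character in a query string.
print(unquote_plus("computer+vision") == unquote_plus("computer%20vision"))
```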

Date Queries

You can query against Diffbot API date fields (the extracted article or discussion post date, for instance), or against the timestamp field, which represents the time of an object's indexing into the collection.

The timestamp field can be used to return the latest content from a crawl or bulk job. For instance, min:timestamp:2015-03-01 limits results to objects indexed since that date, such as only the objects found in the last crawl round or since the last search was run.

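When querying dates, provide your date values in a supported format. The examples on this page use Unix epoch seconds (1399669321), a plain date (2015-03-01), and a full date string ("Thu, 22 May 2014 00:00:00 GMT").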

Be sure to wrap any space-containing date strings in quotation marks.
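
Date clauses can be generated from date objects rather than typed by hand. The Python sketch below produces the two forms used in this page's examples: a Unix epoch value and a quoted GMT date string. The helper names epoch_clause and string_clause are hypothetical, and the sketch assumes timezone-aware datetimes so the conversion to UTC is unambiguous.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def epoch_clause(op, field, moment):
    """Build e.g. min:timestamp:1426784899 from an aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    return f"{op}:{field}:{int(moment.timestamp())}"


def string_clause(op, field, moment):
    """Build a clause with a quoted date string such as "Thu, 22 May 2014 00:00:00 GMT"."""
    if moment.tzinfo is None:
        raise ValueError("pass a timezone-aware datetime")
    stamp = format_datetime(moment.astimezone(timezone.utc), usegmt=True)
    # The date string contains spaces, so it is wrapped in quotation marks.
    return f'{op}:{field}:"{stamp}"'


indexed_after = datetime(2015, 3, 19, 17, 8, 19, tzinfo=timezone.utc)
print(epoch_clause("min", "timestamp", indexed_after))
print(string_clause("min", "date", datetime(2014, 5, 22, tzinfo=timezone.utc)))
```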

Tags Queries

Many Diffbot APIs (Article, Discussion, Image) automatically generate tags from text or image content. These Diffbot-generated tags are included in Global Index objects, and the index can be searched by tag.

Note! The format of the tags element in the Global Index has changed. See below for complete details on the new response format.

To search by an exact tag, provide its full URI and a query upon one of its subfields:

min:tags.dbpedia_org_resource_Diffbot.count:2 | Returns items where the "Diffbot" tag appears at least two times.
min:tags.dbpedia_org_resource_Diffbot.score:0.5 | Returns items where the "Diffbot" tag has a score of at least 0.5.

You can also use built-in tag-matching functionality to search by a tag's common name (label), rather than a full URI. These queries should take the form tags."term" (quotation marks are required). Note that when using the tag-match, your provided term will be mapped to the most likely URI match, so results may vary.

tags."apple" | Returns all items including the tag tags.dbpedia_org_resource_Apple_Inc.label:apple.
min:tags."apple".count:5 | Returns all items where the tags.dbpedia_org_resource_Apple_Inc.label:apple entity appears at least five times.
max:tags."apple".score:0.25 | Returns all items where the tags.dbpedia_org_resource_Apple_Inc.label:apple entity receives a score of 0.25 or lower.

Response

The Global Index API returns all matching objects found in the index in a JSON format.

Each response includes a request object (metadata specific to the search request), and an objects array, which will include the extracted information for all matching objects.

{
  "request": {
    "num": 20,
    "col": "GLOBAL-INDEX",
    "start": 0,
    "token": "...",
    "query": "diffy"
  },
  "objects": [
    {
      "tags": {
        "dbpedia_org_resource_Google": {
          "count": 5,
          "prevalence": 0.35,
          "score": 0.81,
          "label": "Google",
          "uri": "http://dbpedia.org/resource/Google"
        },
        "dbpedia_org_resource_Dublin_Core": {
          "count": 1,
          "prevalence": 0,
          "score": 0.37,
          "label": "Dublin Core",
          "uri": "http://dbpedia.org/resource/Dublin_Core"
        },
        "dbpedia_org_resource_Knowledge_Graph": {
          "count": 2,
          "prevalence": 0.14,
          "score": 0.71,
          "label": "Knowledge Graph",
          "uri": "http://dbpedia.org/resource/Knowledge_Graph"
        },
        "dbpedia_org_resource_Diffbot": {
          "count": 8,
          "prevalence": 0.56,
          "score": 0.9,
          "label": "Diffbot",
          "uri": "http://dbpedia.org/resource/Diffbot"
        },
        "_keys": [
          "dbpedia_org_resource_Diffbot",
          "dbpedia_org_resource_Google",
          "dbpedia_org_resource_Knowledge_Graph",
          "dbpedia_org_resource_Dublin_Core"
        ]
      },
      "icon": "http://i.forbesimg.com/media/assets/appicons/forbes-app-icon_144x144.png",
      "text": "What will the web be when it grows up? Over the past 25 years, the web has become the world’s greatest distribution story, and it is now the largest repository of human knowledge ever assembled. Ease of use and availability created massive scale. Humans have been motivated to make the web for other humans. Increasingly, however, it is machines that are trying to read and make sense of this imperfectly structured database. Diffbot founder Mike Tung thinks the machines need help.\nUntil recently, only very large companies have been able to work with data at web-scale. Google, most notably, has been indexing the web since 1998 to answer people’s search queries. It has built a structured database called the Knowledge Graph from the web. It now contains 570 million objects and 18 billion facts (objects, in this context, refers to kinds of things and facts are instances of those things). According to a statement from Diffbot, this morning, it has surpassed Google’s accomplishment in its first couple of months of active crawling of the entire web. The Diffbot “Global Index” now contains over 600 million objects and 19 Billion facts, and its software is “constantly adding new objects and facts without any human oversight.”\nBeating Google at their own game is a not-so-humble brag. But why does this geek smackdown matter? One way to think about it is that Google’s Knowledge Graph is optimized to find the best answer to a search query. But there are other things we might want to know. We might want to know all of the possible answers to a question. We might want to classify those answers into categories and look at the distribution of those answers at web-scale. Applications need direct access to structured data to do these kinds of things. 
This simple idea is what Diffbot is aiming to deliver.\nFor years, we have heard about the “semantic web.” Many schemes have been hatched, and alliances forged to get the makers of web pages to standardize the structure of information on the web. Despite RDF, microformats, the Dublin Core and accessibility standards, the web is a mess behind the scenes. It is really hard to get humans to behave! And yet, beyond the barriers of language and truly bad design, we rarely come upon a web page that makes no sense. Tung has used this simple observation in a very powerful way. Because machines read code, it is a natural instinct to make that code more machine-readable. But humans, remember, make web pages for other humans, so the ease of robots is not their highest priority.\nWhat if you could structure the content of a web page based on how humans actually read those pages, Tung asked back in 2008 when he founded Diffbot? As far as I am aware, Diffbot is the only web-crawling technology that uses machine vision to structure data. In simple terms, this means that the software “looks” at an image of a page and compares that image to images of other pages to categorize it as a certain type of page. Within each type of page, the software identifies features: the header, menu, headline, byline, story text, images, products, etc.\nThe Diffbot product is a set of APIs that allow you to retrieve specific types of structured data from the web. You can try all of them out on their site. The APIs analyze the page type, extract all of the elements of an article, product, or discussion thread and return data about images and videos. Any of these APIs work with their Crawlbot, which can create a “structured index of practically any site’s data.” As well, Diffbot’s custom API toolkit allows users to override, correct or add fields to any of the automatic APIs.\nTung and his team had been working on these algorithms for many years. 
Two years ago they brought Matt Wells, the founder of the Gigablast search engine, into the company to really build out their search infrastructure. Wells “is probably the only person in the world that has single-handedly written a commercial full-web search engine,” Tung told Gigaom at the time, “even Larry and Sergey were a pair,” referring to the founders of Google.\nContinued from page 1\nOne of the signs that Tung is on to something big is the number and variety of customers Diffbot already has for its services. Microsoft’s search engine Bing uses the product API to add rich results to product searches. And the privacy-minded search engine Duck Duck Go uses the article API. Other big names include Cisco, Adobe, eBay, and Instapaper.\nOver the years, human feedback has improved the accuracy of this machine learning approach. Tung now reports that the Diffbot software has higher accuracy than humans at correctly identifying the parts of web pages and is, of course, orders of magnitude faster. Once you have accuracy and speed, you can scale. Scale itself makes systems smarter. James Cham, a partner at VC firm Bloomberg Beta, which just joined Diffbot’s angel round, reminded me of this in a recent conversation. He pointed me to a 2009 paper by Peter Norvig and colleagues at Google called “The Unreasonable Effectiveness of Data.” The point of the paper, which many AI researchers have confirmed to me, is that scale of data trumps the quality of algorithms.\n“We really believe that building the structured data is the key to smarter systems. We don’t know what the algorithm will be, but we know the structured data will be necessary,” says Tung. 
He believes as well that our existing AI algorithms may deliver significantly more intelligent results once they are working with databases with objects in the trillions as opposed to those in the millions we have now.\nCham sees Diffbot as an “enabler technology.” Rather than encourage people to build web pages semantically, “Mike has bootstrapped the semantic web on top of the existing web.” Because compute resources are now so cheap and accessible, Cham continues, “things that were unimaginable five years ago, are totally doable now—if you are as smart as Mike and his team.” What sold Bloomberg Beta on the investment? “He has a clear idea,” Cham says. “He’s deep and determined. For an investor that’s always exciting.”\nTalking to Tung brings to mind the classic 1970 commercial from clothing retailer Barney’s. A bunch of kids is sitting on a stoop in New York talking about what they are going to do when they grow up. A young Humphrey Bogart boasts about his Hollywood future, Louis Armstrong (actually raised in New Orleans, but this is advertisingland!) knows he will be a musician, Fiorello LaGuardia dreams of becoming Mayor, and Casey Stengel imagines the World Series. When they ask an unassuming boy with glasses, “And what are you going to do, Barney?” he delivers the greatest of slogans, “I don’t know, but you’re all gonna need clothes.” I imagine asking the same of the young Mike Tung and hearing him reply, “I don’t know, but you’re all gonna need structured data!”",
      "lastCrawlTimeUTC": 1441396290,
      "pageUrl": "http://www.forbes.com/sites/anthonykosner/2015/06/04/diffbot-bests-googles-knowledge-graph-to-feed-the-need-for-structured-data/",
      "publisherRegion": "North America",
      "videos": [
        {
          "diffbotUri": "video|3|956399024",
          "primary": true,
          "url": "http://www.youtube.com/embed/KMIgu9-zd8M"
        }
      ],
      "publisherCountry": "United States",
      "humanLanguage": "en",
      "type": "article",
      "date": "Thu, 04 Jun 2015 00:00:00 GMT",
      "estimatedDate": "Thu, 04 Jun 2015 00:00:00 GMT",
      "timestamp": "Fri, 04 Sep 2015 19:51:30 GMT",
      "author": "Anthony Wing Kosner",
      "title": "Diffbot Bests Google's Knowledge Graph To Feed The Need For Structured Data",
      "gburl": "http://www.forbes.com/sites/anthonykosner/2015/06/04/diffbot-bests-googles-knowledge-graph-to-feed-the-need-for-structured-data/-diffbotxyz3822171376",
      "diffbotUri": "article|4|-2094257307",
      "nextPages": [
        "http://www.forbes.com/sites/anthonykosner/2015/06/04/diffbot-bests-googles-knowledge-graph-to-feed-the-need-for-structured-data/2/"
      ],
      "authorUrl": "http://www.forbes.com/sites/anthonykosner/",
      "images": [
        {
          "title": "Mike Tung, CEO and founder of Diffbot with Diffy.",
          "height": 452,
          "diffbotUri": "image|4|-354056683",
          "naturalHeight": 774,
          "width": 701,
          "primary": true,
          "naturalWidth": 1200,
          "url": "http://blogs-images.forbes.com/anthonykosner/files/2015/06/mike-tung-diffbot-diffy.jpg"
        },
        {
          "naturalHeight": 1080,
          "diffbotUri": "image|4|1750734318",
          "naturalWidth": 1920,
          "url": "http://blogs-images.forbes.com/jaymcgregor/files/2015/09/Xperia-z5-display.jpg"
        }
      ],
      "html": "<p>What will the web be when it grows up? Over the past 25 years, the web has become the world&rsquo;s greatest distribution story, and it is now the largest repository of human knowledge ever assembled. Ease of use and availability created massive scale. Humans have been motivated to make the web for other humans. Increasingly, however, it is machines that are trying to read and make sense of this imperfectly structured database. <a href=\"http://www.diffbot.com/\">Diffbot<\/a> founder Mike Tung thinks the machines need help.<\/p>\n<p>Until recently, only very large companies have been able to work with data at web-scale. Google, most notably, has been indexing the web since 1998 to answer people&rsquo;s search queries. It has built a structured database called the Knowledge Graph from the web. It now contains 570 million objects and 18 billion facts (objects, in this context, refers to kinds of things and facts are instances of those things). According to a statement from Diffbot, this morning, it has surpassed Google&rsquo;s accomplishment in its first couple of months of active crawling of the entire web. The Diffbot &ldquo;Global Index&rdquo; now contains over 600 million objects and 19 Billion facts, and its software is &ldquo;constantly adding new objects and facts without any human oversight.&rdquo;<\/p>\n<p>Beating Google at their own game is a not-so-humble brag. But why does this geek smackdown matter? One way to think about it is that Google&rsquo;s Knowledge Graph is optimized to find the best answer to a search query. But there are other things we might want to know. We might want to know all of the possible answers to a question. We might want to classify those answers into categories and look at the distribution of those answers at web-scale. Applications need direct access to structured data to do these kinds of things. 
This simple idea is what Diffbot is aiming to deliver.<\/p>\n<p>For years, we have heard about the &ldquo;semantic web.&rdquo; Many schemes have been hatched, and alliances forged to get the makers of web pages to standardize the structure of information on the web. Despite RDF, microformats, the Dublin Core and accessibility standards, the web is a mess behind the scenes. It is really hard to get humans to behave! And yet, beyond the barriers of language and truly bad design, we rarely come upon a web page that makes no sense. Tung has used this simple observation in a very powerful way. Because machines read code, it is a natural instinct to make that code more machine-readable. But humans, remember, make web pages for other humans, so the ease of robots is not their highest priority.<\/p>\n<figure><img alt=\"Mike Tung, CEO and founder of Diffbot with Diffy.\" src=\"http://blogs-images.forbes.com/anthonykosner/files/2015/06/mike-tung-diffbot-diffy.jpg\"><\/img><figcaption>Mike Tung, CEO and founder of Diffbot with Diffy.<\/figcaption><\/figure>\n<p>What if you could structure the content of a web page based on how humans actually read those pages, Tung asked back in 2008 when he founded Diffbot? As far as I am aware, Diffbot is the only web-crawling technology that uses machine vision to structure data. In simple terms, this means that the software &ldquo;looks&rdquo; at an image of a page and compares that image to images of other pages to categorize it as a certain type of page. Within each type of page, the software identifies features: the header, menu, headline, byline, story text, images, products, etc.<\/p>\n<p>The Diffbot product is a set of APIs that allow you to retrieve specific types of structured data from the web. You can try all of them out on their site. The APIs analyze the page type, extract all of the elements of an article, product, or discussion thread and return data about images and videos. 
Any of these APIs work with their Crawlbot, which can create a &ldquo;structured index of practically any site&rsquo;s data.&rdquo; As well, Diffbot&rsquo;s custom API toolkit allows users to override, correct or add fields to any of the automatic APIs.<\/p>\n<p>Tung and his team had been working on these algorithms for many years. Two years ago they brought <a href=\"https://gigaom.com/2013/09/10/diffbot-brings-big-time-search-poobah-aboard-to-help-it-scale/\">Matt Wells<\/a>, the founder of the Gigablast search engine, into the company to really build out their search infrastructure. Wells &ldquo;is probably the only person in the world that has single-handedly written a commercial full-web search engine,&rdquo; Tung told Gigaom at the time, &ldquo;even Larry and Sergey were a pair,&rdquo; referring to the founders of Google.<\/p><br><p>Continued from page 1<\/p>\n<p>One of the signs that Tung is on to something big is the number and variety of customers Diffbot already has for its services. Microsoft&rsquo;s search engine Bing uses the product API to add rich results to product searches. And the privacy-minded search engine Duck Duck Go uses the article API. Other big names include Cisco, Adobe, eBay, and Instapaper.<\/p>\n<p>Over the years, human feedback has improved the accuracy of this machine learning approach. Tung now reports that the Diffbot software has higher accuracy than humans at correctly identifying the parts of web pages and is, of course, orders of magnitude faster. Once you have accuracy and speed, you can scale. Scale itself makes systems smarter. James Cham, a partner at VC firm <a href=\"http://www.bloombergbeta.com/\">Bloomberg Beta<\/a>, which just joined Diffbot&rsquo;s angel round, reminded me of this in a recent conversation. 
He pointed me to a 2009 paper by Peter Norvig and colleagues at Google called &ldquo;<a href=\"http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/35179.pdf\">The Unreasonable Effectiveness of Data<\/a>.&rdquo; The point of the paper, which many AI researchers have confirmed to me, is that scale of data trumps the quality of algorithms.<\/p>\n<p>&ldquo;We really believe that building the structured data is the key to smarter systems. We don&rsquo;t know what the algorithm will be, but we know the structured data will be necessary,&rdquo; says Tung. He believes as well that our existing AI algorithms may deliver significantly more intelligent results once they are working with databases with objects in the trillions as opposed to those in the millions we have now.<\/p>\n<p>Cham sees Diffbot as an &ldquo;enabler technology.&rdquo; Rather than encourage people to build web pages semantically, &ldquo;Mike has bootstrapped the semantic web on top of the existing web.&rdquo; Because compute resources are now so cheap and accessible, Cham continues, &ldquo;things that were unimaginable five years ago, are totally doable now&mdash;if you are as smart as Mike and his team.&rdquo; What sold Bloomberg Beta on the investment? &ldquo;He has a clear idea,&rdquo; Cham says. &ldquo;He&rsquo;s deep and determined. For an investor that&rsquo;s always exciting.&rdquo;<\/p>\n<p>Talking to Tung brings to mind the classic 1970 commercial from clothing retailer Barney&rsquo;s. A bunch of kids is sitting on a stoop in New York talking about what they are going to do when they grow up. A young Humphrey Bogart boasts about his Hollywood future, Louis Armstrong (actually raised in New Orleans, but this is advertisingland!) knows he will be a musician, Fiorello LaGuardia dreams of becoming Mayor, and Casey Stengel imagines the World Series. 
When they ask an unassuming boy with glasses, &ldquo;And what are you going to do, Barney?&rdquo; he delivers the greatest of slogans, &ldquo;I don&rsquo;t know, but you&rsquo;re all gonna need clothes.&rdquo; I imagine asking the same of the young Mike Tung and hearing him reply, &ldquo;I don&rsquo;t know, but you&rsquo;re all gonna need <em>structured data<\/em>!&rdquo;<\/p>",
      "docId": 200170031879,
      "numPages": 2,
      "siteName": "Forbes"
    }
  ]
}
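
To show how such a response might be consumed, here is a short Python sketch that walks the objects array and pulls out a few fields. The sample dictionary is a heavily trimmed copy of the response above, and the helper name summarize is an assumption for this example. Fields are read with .get() because not every object type carries every field.

```python
from email.utils import parsedate_to_datetime

# Trimmed copy of the sample response above (most fields omitted).
sample_response = {
    "request": {"num": 20, "col": "GLOBAL-INDEX", "start": 0, "query": "diffy"},
    "objects": [
        {
            "type": "article",
            "title": "Diffbot Bests Google's Knowledge Graph To Feed The Need For Structured Data",
            "author": "Anthony Wing Kosner",
            "date": "Thu, 04 Jun 2015 00:00:00 GMT",
            "siteName": "Forbes",
        }
    ],
}


def summarize(response):
    """Return (type, title, author, parsed date) for each returned object."""
    rows = []
    for obj in response.get("objects", []):
        raw_date = obj.get("date")
        parsed = parsedate_to_datetime(raw_date) if raw_date else None
        rows.append((obj.get("type"), obj.get("title"), obj.get("author"), parsed))
    return rows


for kind, title, author, published in summarize(sample_response):
    print(kind, "|", title, "|", author, "|", published.date() if published else None)
```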

Changes to Tags in Article and Discussion Responses

The Global Index introduces a new format for the tags element. Unlike the V3 APIs, tags are now returned as an object (dictionary) rather than an array. Furthermore, each tag is keyed at the top level by an identifier derived from its unique uri, as opposed to its label or common name.

Here is an example tags response:

"tags": {
    "dbpedia_org_resource_New_Jersey": {
        "id": 118590,
        "count": 1,
        "prevalence": 0.2727272727272727,
        "label": "New Jersey",
        "type": "place",
        "uri": "http://dbpedia.org/resource/New_Jersey"
    },
    "dbpedia_org_resource_Crush_beverage": {
        "id": 3170288,
        "count": 1,
        "prevalence": 0.2727272727272727,
        "label": "Crush (beverage)",
        "type": "Http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#FunctionalSubstance",
        "uri": "http://dbpedia.org/resource/Crush_(beverage)"
    },
    "dbpedia_org_resource_Taste_bud": {
        "id": 92646,
        "count": 1,
        "prevalence": 0.2727272727272727,
        "label": "Taste bud",
        "type": "Http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#BiologicalObject",
        "uri": "http://dbpedia.org/resource/Taste_bud"
    },
    "dbpedia_org_resource_Ralph_Goings": {
        "id": 2102273,
        "count": 1,
        "prevalence": 0.2727272727272727,
        "label": "Ralph Goings",
        "type": "person",
        "uri": "http://dbpedia.org/resource/Ralph_Goings"
    },
    "dbpedia_org_resource_Ketchup": {
        "id": 673008,
        "count": 2,
        "prevalence": 0.5454545454545454,
        "label": "Ketchup",
        "type": "Http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#FunctionalSubstance",
        "uri": "http://dbpedia.org/resource/Ketchup"
    },
    "_keys": [
        "dbpedia_org_resource_Ketchup",
        "dbpedia_org_resource_New_Jersey",
        "dbpedia_org_resource_Taste_bud",
        "dbpedia_org_resource_Ralph_Goings",
        "dbpedia_org_resource_Crush_beverage"
    ]
}

The tags object will be accompanied by a _keys element, which will enumerate the URIs contained within the tag-map.
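
One way to read the new tags object is to iterate over _keys and look up each entry, as in the Python sketch below. The sample is a reduced copy of the tags response above, and the helper name ranked_tags is hypothetical. The fallback for a missing _keys element is a defensive assumption, not documented behavior.

```python
# Reduced copy of the example tags response above.
sample_tags = {
    "dbpedia_org_resource_New_Jersey": {
        "count": 1,
        "prevalence": 0.2727272727272727,
        "label": "New Jersey",
        "uri": "http://dbpedia.org/resource/New_Jersey",
    },
    "dbpedia_org_resource_Ketchup": {
        "count": 2,
        "prevalence": 0.5454545454545454,
        "label": "Ketchup",
        "uri": "http://dbpedia.org/resource/Ketchup",
    },
    "_keys": ["dbpedia_org_resource_Ketchup", "dbpedia_org_resource_New_Jersey"],
}


def ranked_tags(tags, field="count"):
    """Return tag entries sorted by the given numeric subfield, highest first."""
    # Prefer the _keys listing; otherwise fall back to every key except _keys.
    keys = tags.get("_keys") or [k for k in tags if k != "_keys"]
    entries = [tags[k] for k in keys if k in tags]
    return sorted(entries, key=lambda tag: tag.get(field, 0), reverse=True)


for tag in ranked_tags(sample_tags):
    print(tag["label"], tag["count"], tag["uri"])
```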

We are still working on the most effective way to query the tags element. The transition to the new format is also still in progress, so for a short time you may find results with the previous formatting. Stay tuned!


Example Searches

The 40 most recently written articles matching "Diffbot":

  • query: diffbot type:article sortby:date
  • col: GLOBAL-INDEX
  • num: 40
https://api.diffbot.com/v3/search?token=...&num=40&col=GLOBAL-INDEX&query=diffbot%20type%3Aarticle%20sortby%3Adate

100 articles written by Bill Simmons at Grantland.com, ordered oldest first:

  • query: type:article author:"Bill Simmons" site:grantland.com revsortby:date
  • col: GLOBAL-INDEX
  • num: 100
https://api.diffbot.com/v3/search?token=...&num=100&col=GLOBAL-INDEX&query=type%3Aarticle%20author%3A%22Bill%20Simmons%22%20site%3Agrantland.com%20revsortby%3Adate

20 articles mentioning "ukraine" or "russia" in the main text, ordered by most recently indexed/crawled:

  • query: text:ukraine OR text:russia sortby:timestamp
https://api.diffbot.com/v3/search?token=...&col=GLOBAL-INDEX&query=text%3Aukraine%20OR%20text%3Arussia%20sortby%3Atimestamp

Facet Queries

Facet queries return aggregated data about the matching result set, for instance the most popular tags among article results, or the number of articles falling into specific date ranges. Facet queries operate on the values returned by the primary search query.

To perform facet queries on Global Index results, include the facets argument in your API call:

https://api.diffbot.com/v3/search?token=...&col=GLOBAL-INDEX&query=...&facets=...

You can retrieve faceted data for most of Diffbot's structured data fields. Use a period to specify a nested field (e.g., post.author to specify the author of discussion posts / comments).

Sample Facet Queries

Specify the field name or a series of numerical ranges to perform a facet query.

facets= | Returns...
author | Data on the most common author values.
site | Data on the most common websites represented in the result set.
post.author | Data on the most common comment or post authors in the result set.
date(1430438400-1431388800,1431388800-1431475200,1431475200-1431648000) | Data from each specified date range. Facet query dates must be provided as Unix timestamps.
author,site | Facets for both author and site.
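
Since date facets take Unix timestamps, the range list is easiest to generate from calendar dates. The Python sketch below builds consecutive one-day ranges in the date(start-end,...) form shown in the table. The helper name daily_date_facet is an assumption for this example, and days are taken to start at midnight UTC.

```python
from datetime import date, datetime, time, timedelta, timezone


def daily_date_facet(first_day, days):
    """Return a facets value covering `days` consecutive UTC days from first_day."""
    start = datetime.combine(first_day, time.min, tzinfo=timezone.utc)
    ranges = []
    for offset in range(days):
        low = int((start + timedelta(days=offset)).timestamp())
        high = int((start + timedelta(days=offset + 1)).timestamp())
        ranges.append(f"{low}-{high}")
    return "date(" + ",".join(ranges) + ")"


# First three days of May 2015.
print(daily_date_facet(date(2015, 5, 1), 3))
```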

Facet Query Response

Successful facet queries will return a separate facets object at the top level of the JSON response. Each facet will contain the following fields:

Field | Description
field | Name of the field being faceted.
value | Value of the specific facet. For facet query ranges (e.g., date) the result will be a specified range.
count | Number of documents in the overall search query matching this facet value or range.
totalDocsWithField | Total number of documents in the entire Global Index (not the specific query) that contain the faceted field.
totalDocsWithFieldAndValue | Total number of documents in the entire Global Index (not the specific query) that contain the faceted field and value.
query | A formatted query that can be used to return only the objects contained within the specific facet parameters.
average | For numerical facet ranges, the average of the faceted results for this range.
min | For numerical facet ranges, the lowest value within the faceted results for this range.
max | For numerical facet ranges, the highest value within the faceted results for this range.

Sample Facet Responses

Example facets response from a facets=author query:

{
  "facets": [
    {
      "count": 95,
      "field": "author",
      "query": "author:\"John Davi\"",
      "totalDocsWithField": 358764479,
      "totalDocsWithFieldAndValue": 196,
      "value": "John Davi"
    },
    {
      "count": 22,
      "field": "author",
      "query": "author:\"Bruno Skvorc\"",
      "totalDocsWithField": 358764479,
      "totalDocsWithFieldAndValue": 108,
      "value": "Bruno Skvorc"
    },
    {
      "count": 16,
      "field": "author",
      "query": "author:\"Kathy Lui\"",
      "totalDocsWithField": 358764479,
      "totalDocsWithFieldAndValue": 32,
      "value": "Kathy Lui"
    }
  ]
}

Example facets response from a query faceting the first three days in May 2015, facets=date(1430438400-1430524800,1430524800-1430611200,1430611200-1430697600):

{
  "facets": [
    {
      "average": 1430673664,
      "count": 1205957,
      "field": "date",
      "max": 1430697599,
      "min": 1430611200,
      "query": "min:date:1430611200 max:date:1430697600",
      "totalDocsWithField": 417485045,
      "totalDocsWithFieldAndValue": 147761,
      "value": "[1430611200-1430697600)"
    },
    {
      "average": 1430468608,
      "count": 523851,
      "field": "date",
      "max": 1430524799,
      "min": 1430438400,
      "query": "min:date:1430438400 max:date:1430524800",
      "totalDocsWithField": 417485045,
      "totalDocsWithFieldAndValue": 196406,
      "value": "[1430438400-1430524800)"
    },
    {
      "average": 1430553344,
      "count": 397512,
      "field": "date",
      "max": 1430611199,
      "min": 1430524800,
      "query": "min:date:1430524800 max:date:1430611200",
      "totalDocsWithField": 417485045,
      "totalDocsWithFieldAndValue": 157620,
      "value": "[1430524800-1430611200)"
    }

  ]
}
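
A facets response like the ones above can be reduced to a simple ranking in a few lines. The following Python sketch uses a trimmed copy of the facets=author sample and a hypothetical helper, facet_counts, which returns each facet's value, count, and drill-down query sorted by count.

```python
# Trimmed copy of the facets=author sample response above.
sample_facets = {
    "facets": [
        {"count": 22, "field": "author", "query": 'author:"Bruno Skvorc"', "value": "Bruno Skvorc"},
        {"count": 95, "field": "author", "query": 'author:"John Davi"', "value": "John Davi"},
        {"count": 16, "field": "author", "query": 'author:"Kathy Lui"', "value": "Kathy Lui"},
    ]
}


def facet_counts(response, field):
    """Return (value, count, query) tuples for one faceted field, largest count first."""
    rows = [
        (facet.get("value"), facet.get("count", 0), facet.get("query"))
        for facet in response.get("facets", [])
        if facet.get("field") == field
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


for value, count, drill_down in facet_counts(sample_facets, "author"):
    print(f"{value}: {count} (refine with {drill_down})")
```

The query value of each facet can be sent back as the query argument of a follow-up request to retrieve only the objects in that facet.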