Comparing Text-Extraction Methods

In 2011, artificial intelligence student Tomaz Kovacik performed the first broad evaluation of web page text-extraction engines, comparing the state-of-the-art methods for extracting clean text from article/blog-post web pages.1 This comparison included Diffbot’s Article API and a number of open-source and SaaS methods, including Goose, Boilerpipe, AlchemyAPI, Readability and more.

The results of that comparison were clear: Diffbot dominated with an overall F1 score of 0.94.

Since 2011, Diffbot has been the only effort focused on improving content extraction from web pages. We’ve built a new rendering engine, added thousands of features and improvements to our computer-vision-based extraction technologies, and trained our machine-learning systems on hundreds of thousands of new web pages.

Recently we resurrected Tomaz’s evaluation code2 and began running his evaluation on a regular basis to compare our performance with the remaining options out there.

Our most recent evaluation run was in December 2014. Here are the results:

Service/Software    Precision    Recall    F1-Score
Diffbot               0.968       0.978      0.971
Boilerpipe            0.893       0.924      0.893
Readability           0.819       0.911      0.854
AlchemyAPI            0.876       0.892      0.850
Embedly               0.786       0.880      0.822
Goose                 0.498       0.815      0.608

[Charts: F1-Score, Precision, and Recall by service]

Precision indicates the fraction of retrieved text that was correct. This doesn’t indicate whether all of the correct text was returned—just the correctness of whatever is returned. For example, if a method returns only one sentence of a ten-paragraph article, it will have a precision of 1.0 despite not returning anywhere near all of the content desired.

Recall indicates the fraction of all correct text that was returned. In the example above, a method that returns only one sentence of a ten-paragraph article will have an extremely low recall. Conversely, a method that returns all of the content on a page (not only the full article text, but also the header, footer, advertising content, etc.) can achieve a recall of 1.0, despite not isolating the desired article text at all.

The F1-score can be interpreted as a weighted average of the precision and recall, taking both into account. F1’s best score is 1.0 and worst is 0.0.
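The interplay of these three metrics can be sketched as a token-level comparison between extracted text and hand-labeled “gold” text. The helper below is a minimal illustration of how such scores are computed; the tokenizer and sample strings are invented for this example and are not part of the actual evaluation framework:

```python
def tokens(text):
    """Naive whitespace tokenizer (illustrative only)."""
    return text.lower().split()

def precision_recall_f1(extracted, gold):
    ext, ref = tokens(extracted), tokens(gold)
    # Multiset intersection: count extracted tokens that match gold tokens.
    ref_counts = {}
    for t in ref:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    overlap = 0
    for t in ext:
        if ref_counts.get(t, 0) > 0:
            overlap += 1
            ref_counts[t] -= 1
    precision = overlap / len(ext) if ext else 0.0   # correct fraction of output
    recall = overlap / len(ref) if ref else 0.0      # fraction of gold recovered
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return precision, recall, f1

gold = "the quick brown fox jumps over the lazy dog"
# Returning only a fragment yields perfect precision but low recall.
p, r, f = precision_recall_f1("the quick brown fox", gold)
```

Note how a fragment of the gold text scores precision 1.0 while recall suffers, which is exactly why F1 is the fairer single number for ranking extractors.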


Other Notes:

  • This is a comparison of the Diffbot Article API against similar text-extraction software. (There are no other known services or software that extract products, discussion threads, etc., as Diffbot's other APIs do.) Moreover, this comparison covers only the accuracy of article title and text extraction—it does not cover Diffbot's extraction of article dates, authors, images, captions, video, or other features of our APIs.
  • Test pages are from a random assortment of archived Google News-indexed articles, in multiple languages, manually human-labeled to identify the exact text that should be extracted in a “perfect” case.

Footnotes:

  1. Evaluating Web Content Extraction Algorithms
  2. Text extraction evaluation framework on GitHub