How we helped Avast analyze the Web's Privacy Policies
When the world's largest consumer cybersecurity company—Avast—was looking to develop a universal privacy score for every site on the web, they turned to Diffbot, the experts in automated web-scale extraction, to help them ship the project in record time.
- Galina Alperovich, Staff Scientist, Avast
Founded in 1988, Avast makes the popular suite of consumer security applications Avast Anti-virus, Browser, and VPN, used by over 435 million monthly active users, and has revenues of over $900 million. Avast recently merged with NortonLifeLock (formerly Symantec) to establish the world’s largest consumer cybersecurity company, which together employs thousands of people across the US and Europe working on AI, machine learning, and security.
Avast started out protecting consumers’ desktop computers from viruses and malware, but as people shifted more of their lives online, the company broadened its product suite to protect consumers on any device and against a wide range of internet threats: web security, data breaches, online tracking, and privacy.
Avast knew that its customers wanted tools to help them feel safe browsing the web, but its scoring models, which rate the trustworthiness of websites and applications, lacked a key piece of information: each site’s privacy practices.
Data about which sites have suffered data breaches or been reported for phishing are very clear signals, but these datasets are relatively small, a drop in the bucket compared to the vast number of websites a user might want assessed.
Developing a pipeline to collect this information in-house would be an expensive undertaking in human resources, in machine infrastructure to host a web-scale crawl, and most importantly in time-to-market.
This is when Chandler Givens, Product Director of Consumer Privacy at Avast, connected with our founder and CEO Mike Tung, recognizing our expertise in automated web extraction and developing the pipelines that construct the world’s largest Knowledge Graph. After a quick meeting with Chandler and the Avast data science team, Diffbot proposed leveraging two key components of Diffbot’s knowledge graph pipeline in order to solve their problem:
Automatic page classification and text extraction. Diffbot has a world-class system for webpage classification and automatic webpage extraction. Diffbot’s extraction API visually classifies pages into one of the 20 common types of pages on the web at greater than 97% accuracy without rules, and has been battle-tested on nearly every page on the web after years of production crawling and serving customers like Bing, StumbleUpon, and DuckDuckGo. Additionally, Diffbot has the most accurate webpage-to-text extraction algorithms, relying on its visual techniques, a customized fork of the Chromium rendering engine, and years of extracting clean article text for companies like Instapaper, Snapchat, and AOL. By fine-tuning these models instead of developing new ones from scratch, we were able to build a production-grade extractor for privacy policies with a small amount of training data in a fraction of the time.
Firmographic enrichment from the web domain. To model the privacy stance of an organization, you first need to understand which organization corresponds to the website you are currently visiting. While this might sound trivial to a human browsing the website, implementing it in an automated way is not trivial at all. To implement this mapping we used Diffbot’s Enhance API to pull in the Organization record from the over 300M Organizations available in the Diffbot Knowledge Graph, including all of the rich firmographic detail available, such as the firm’s number of employees, location, and industry, which provided additional signals for the trustworthiness of the entity.
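To give a feel for why mapping a visited page to an organization takes more than reading the hostname, here is a minimal Python sketch. It is not Diffbot's implementation: the in-memory ORG_INDEX, its records and field names, and the helper functions are all hypothetical stand-ins for a real organization lookup such as the Enhance API. The sketch tries the full hostname first and then each parent domain, so that pages served from subdomains still resolve to the owning organization.

```python
from urllib.parse import urlsplit

# Hypothetical stand-in for an organization lookup service. The records and
# field names are illustrative only, not Diffbot's actual schema.
ORG_INDEX = {
    "example.com": {"name": "Example Corp", "employees": 250, "industry": "Software"},
    "example.org": {"name": "Example Foundation", "employees": 12, "industry": "Nonprofit"},
}


def candidate_domains(url):
    """Return the hostname and each parent domain, most specific first."""
    # Accept bare hosts such as "example.com" as well as full URLs.
    host = urlsplit(url if "//" in url else "//" + url).hostname or ""
    labels = host.rstrip(".").split(".")
    # Stop before the bare top-level label ("com", "org", ...).
    return [".".join(labels[i:]) for i in range(len(labels) - 1)]


def resolve_organization(url, index=ORG_INDEX):
    """Map a visited URL to (matched_domain, organization_record), or None."""
    for domain in candidate_domains(url):
        record = index.get(domain)
        if record is not None:
            return domain, record
    return None


if __name__ == "__main__":
    print(resolve_organization("https://shop.example.com/privacy-policy"))
    print(resolve_organization("https://unknown.test/"))
```

Walking up the parent domains is a deliberate simplification. A production system would also need to handle shared hosting platforms, redirects, and brands owned by a parent company, which is where a curated organization database earns its keep.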
(10-fold cross-validation error, N=1108)
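For readers unfamiliar with the metric, the following Python sketch shows how a 10-fold cross-validation error can be computed in general. It makes no claim about the models or data used in this project: the train_majority baseline and all function names are hypothetical placeholders. Each example is held out exactly once, a model is trained on the other nine folds, and the error is the fraction of held-out predictions that are wrong.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validation_error(examples, labels, train_fn, k=10, seed=0):
    """Fraction of examples misclassified when each fold is held out once."""
    errors = 0
    for held_out in k_fold_indices(len(examples), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(examples) if i not in held]
        train_y = [y for i, y in enumerate(labels) if i not in held]
        predict = train_fn(train_x, train_y)  # fit on the other k-1 folds
        errors += sum(predict(examples[i]) != labels[i] for i in held_out)
    return errors / len(examples)


def train_majority(train_x, train_y):
    """Placeholder model: always predicts the most common training label."""
    majority = max(set(train_y), key=train_y.count)
    return lambda x: majority


if __name__ == "__main__":
    # With N=1108 examples, the ten folds hold 110 or 111 examples each.
    print(sorted(len(fold) for fold in k_fold_indices(1108)))
```

Any real classifier can be swapped in for train_majority as long as it returns a prediction function with the same shape.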
Since we were able to ship a production-quality pipeline so quickly and had so much fun working with Avast’s data science team in the process, we decided to collaborate with Avast on a series of blog posts to educate consumers about privacy and on a dataset for academic researchers studying privacy. You can read more about the collaboration on the official Avast blog here: