Exploring Web Mining

Open Data Sources

Although the data sources discussed in this section apply equally to the broader study of data mining, they are presented here, in the context of web mining, to raise awareness of their existence and to help others avoid reinventing the wheel: not all web content needs to be crawled or scraped. In many cases, the data needed for a given mining project is already available for download in one of the standard formats discussed previously (e.g., TSV, CSV, XML, or JSON) or is accessible via the host’s API.
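To illustrate how little work a downloadable dataset can require compared with crawling, the following is a minimal sketch in Python that turns the text of an already-downloaded file into a list of records. The function name parse_records and the format labels "csv", "tsv", and "json" are assumptions made for this example, and the sample values are placeholders rather than real data. XML is left out for brevity.

```python
import csv
import io
import json


def parse_records(text, fmt):
    """Parse the text of a downloaded dataset into a list of dicts.

    `fmt` is a hypothetical label chosen for this sketch:
    "csv", "tsv", or "json".
    """
    if fmt in ("csv", "tsv"):
        # TSV and CSV differ only in the field delimiter.
        delimiter = "\t" if fmt == "tsv" else ","
        reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
        # The first line is treated as the header row.
        return [dict(row) for row in reader]
    if fmt == "json":
        data = json.loads(text)
        # Accept either a list of objects or a single object.
        return data if isinstance(data, list) else [data]
    raise ValueError(f"unsupported format: {fmt}")


if __name__ == "__main__":
    # Placeholder content standing in for a downloaded file.
    sample_tsv = "name\tvalue\na\t1\nb\t2\n"
    sample_json = '[{"name": "a", "value": 1}]'
    print(parse_records(sample_tsv, "tsv"))
    print(parse_records(sample_json, "json"))
```

Note that the CSV and TSV branches return every field as a string, whereas the JSON branch preserves numeric types, so any downstream mining step would need to convert values as appropriate.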

Among analysts, the best-known publicly available open data sources are those offered by Google, through the Google Public Data Explorer, and by Amazon Web Services (AWS). While both Google and Amazon host dozens of datasets, the most popular, according to AWS (2014), include NASA NEX, a collection of Earth science datasets; the Common Crawl, a corpus of web crawl data gathered since 2007 and comprising over 5 billion web pages; the 1000 Genomes Project, a detailed map of human genetic variation; Google Books Ngrams; U.S. Census data; and the Freebase data dump, a database covering millions of topics.

The U.S. government also provides many datasets. Examples include those from the U.S. Census Bureau, Data.gov, Project Open Data, the U.S. Small Business Administration (SBA), the Food and Drug Administration (FDA), and USA.gov. The last of these is probably a good starting point for discovering additional datasets provided by agencies not mentioned here. Institutionally provided research repositories include, among others, Databib, Gephi, DSpace, DBLP, the MIT Libraries, the Stanford Large Network Dataset Collection, and the University of California, Irvine Network Data Repository.