Exploring Web Mining

Introduction

Web mining, a term coined by Etzioni in 1996 (as cited in Kleftodimos & Evangelidis, 2013), is the discovery and extraction of information contained in web documents and services. Matsudaira (2014) explained that web mining involves the following basic steps (a minimal code sketch of the loop follows the list):

  1. Select a URL to crawl.
  2. Fetch and parse the page.
    • Look up DNS.
    • Fetch the robots.txt file (Robots Exclusion Protocol, REP).
    • Fetch URL (if allowed by REP).
    • Parse relevant content.
    • Save the relevant content.
  3. Extract URLs from the page.
  4. Add URLs to the queue.
  5. Repeat.
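
The following is a minimal sketch of that loop, assuming Python and only its standard library. The names crawl and LinkExtractor, the max_pages budget, and the wildcard user agent passed to can_fetch are illustrative assumptions rather than part of Matsudaira's (2014) description, and the DNS lookup sub-step is handled implicitly by urlopen rather than performed explicitly.

```python
from collections import deque
from html.parser import HTMLParser
from urllib import parse, request, robotparser


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags while a fetched page is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Runs the select / fetch / extract / enqueue / repeat loop described above."""
    queue = deque([seed_url])   # step 4's destination: URLs waiting to be crawled
    seen = {seed_url}
    saved = {}                  # relevant content kept per URL (step 2)

    while queue and len(saved) < max_pages:
        url = queue.popleft()   # step 1: select a URL to crawl
        parts = parse.urlparse(url)

        # Step 2: consult the Robots Exclusion Protocol file before fetching.
        robots = robotparser.RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
        try:
            robots.read()
        except OSError:
            continue
        if not robots.can_fetch("*", url):
            continue            # skip URLs the site disallows

        # Step 2 (continued): fetch the page (DNS lookup happens inside urlopen),
        # then save the content; the whole document is kept here for simplicity.
        try:
            with request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        saved[url] = html

        # Steps 3 and 4: extract URLs from the page and add new ones to the queue.
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = parse.urljoin(url, link)
            if absolute.startswith(("http://", "https://")) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
        # Step 5: repeat until the queue empties or the page budget is reached.

    return saved
```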

Algorithms designed to perform the above steps automatically are commonly called web crawlers, while the sub-steps in step 2 are performed by a component called a web scraper. The two terms are closely related and often used synonymously by researchers; however, a distinction is made here to better understand the functional nature of each. The basic function of a web crawler is to browse the web in much the same way a person might who is not looking for anything specific. A crawler's browsing, however, is done automatically and intentionally within and across multiple domains, whereas a person may be unaware of the ownership of the domain or of the content being consumed.

The most common web crawlers are those used by search engines, which index the resources they find by keyword. The resulting searchable index allows people to avoid arbitrary browsing and to find resources relevant to their topics of interest more quickly. By contrast, a web scraper can be designed to navigate a single URL, as a person might, and extract specific information of interest from that domain. Manually copying and pasting content from a web page is the simplest form of web scraping. While crawlers employ scraping techniques to extract the data to be indexed along with other desired metadata, scrapers may or may not employ automated crawlers, and the depth and breadth of either approach's results vary widely depending on the needs of the person or organization doing the mining and the sophistication of the tools used.
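
As a contrast to the crawler sketch above, the following sketch assumes a scraper that targets a single page and extracts only one specific piece of content, the document title. The names TitleScraper and scrape_title and the example URL are hypothetical; a real scraper would typically target richer, site-specific elements.

```python
from html.parser import HTMLParser
from urllib import request


class TitleScraper(HTMLParser):
    """Keeps only the text found inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def scrape_title(url):
    """Fetches one page and returns only the specific content of interest."""
    with request.urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    scraper = TitleScraper()
    scraper.feed(html)
    return scraper.title.strip()


# Hypothetical usage: print(scrape_title("https://example.com/"))
```

Only the standard library is used here so the example remains self-contained; in practice, scrapers often rely on dedicated parsing libraries such as Beautiful Soup or lxml.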

It should be noted that despite the overlap in the methodologies used by crawlers and scrapers, current researchers generally focus on one or the other. Thus, for clarity, this paper first presents a literature review of web crawler components and crawl strategies, followed by a literature review of web scraper components and strategies. A review of current research in web mining approaches is presented next, followed by a sampling of publicly available data sources and a discussion of potential security, ethical, and legal issues common to the use of web mining tools.