Exploring Web Mining

Web Crawlers

A web crawler, as explained by Boldi, Marino, Santini, and Vigna (2014), is an application that systematically downloads a large number of web pages, starting from a seed list and following the hyperlinks found in those pages. Arif et al. (2013) explained that crawlers are also known as “spiders” or “bots” (short for robots) and that they can collect information even from areas of web pages that are hidden from humans. Specifically, the crawler parses a web page on its seed list, saves the hyperlinks found on the page in a database, and continues parsing links recursively. The first crawler, developed by Brian Pinkerton, indexed over 4,000 pages from different websites by April 1994, over 1 million by June 1995, and about 2.5 billion by 2008. As the web evolved, so did crawlers, giving rise to focused crawlers, which selectively seek out pages and hyperlinks relevant to a pre-defined set of topics, and intelligent crawlers, which learn the linking structure with no prior assumptions about that structure. Crawlers further evolved to rely less on HTML structure and more on semantics. Such crawlers operate on Resource Description Framework (RDF) or Web Ontology Language (OWL) metadata, with pages retrieved based on their RDF/OWL relationships (Di Pietro, Aliprandi, De Luca, Raffaelli, & Soru, 2014).

To overcome early crawlers’ instability and lack of scalability in the face of the web’s rapid growth, crawlers were modified to separate the indexing process from the crawling process, improving stability, and later evolved into distributed, agile systems to improve scalability and performance. A “good” crawler also needs to be flexible, low cost, robust, manageable, reconfigurable, and respectful of server resources, and it must behave in accordance with pre-established policies, namely those of selection, revisit, politeness, and parallelization. The selection policy establishes which pages the crawler should download, while the revisit policy states when the crawler should check a page for changes. The politeness policy ensures the crawler avoids overloading the websites it visits, and the parallelization policy governs how the crawler coordinates its distributed components (Arif et al., 2013).
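
The first three policies can be illustrated with a minimal sketch (in Python; the constants, helper names, and fixed crawl delay are illustrative assumptions rather than any particular crawler’s implementation):

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY = 2.0                    # politeness: seconds between requests to the same host
ALLOWED_SCHEMES = {"http", "https"}  # selection: only fetch ordinary web pages
REVISIT_INTERVAL = 24 * 3600         # revisit: re-check a page at most once a day

last_hit = {}       # host -> timestamp of the last request (politeness bookkeeping)
last_crawled = {}   # url  -> timestamp of the last successful fetch (revisit bookkeeping)

def should_select(url):
    """Selection policy: decide whether a URL is worth downloading at all."""
    return urlparse(url).scheme in ALLOWED_SCHEMES

def should_revisit(url, now):
    """Revisit policy: only re-fetch a page after the revisit interval has elapsed."""
    return now - last_crawled.get(url, 0) >= REVISIT_INTERVAL

def wait_politely(url, now):
    """Politeness policy: throttle requests so a single host is not overloaded."""
    host = urlparse(url).netloc
    elapsed = now - last_hit.get(host, 0)
    if elapsed < CRAWL_DELAY:
        time.sleep(CRAWL_DELAY - elapsed)
    last_hit[host] = time.time()
```

Parallelization would then amount to partitioning the URL space (for example, by host) across multiple such workers so that their politeness bookkeeping does not conflict.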

Web Crawler Components

The components needed by a scalable web crawler, according to Seeger (2010), include a queuing system, a data warehouse, a searching mechanism, and the application logic that executes the crawler and controls the queuing, parsing, and inserting and updating of records. Seeger used the open-source Redis data structure server as the queuing component, which also served as a write-through cache, and the open-source MySQL database, opting for the InnoDB storage engine to allow concurrent access. The open-source Solr enterprise search platform was used to address the limitations of MySQL indexing and the resource intensity of handling queries directly in the database. In the final implementation, MySQL served as persistent back-end storage while Solr, which also stored all indexed data, answered queries from front-end users. Inserts into the database were done asynchronously in batches while updates were done incrementally, resulting in a fast, efficient, and highly fault-tolerant crawler able to handle 10 domains per second, with each page containing an estimated average of 7.5 external links. The application logic was written in the open-source Ruby programming language, primarily because of the author’s familiarity with it and the availability of extensive libraries.
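
The queuing and batching aspects of such an architecture might be sketched as follows (a simplified Python sketch using the redis-py client rather than Seeger’s Ruby implementation; the key names, batch size, and omitted persistence layer are assumptions):

```python
import redis

r = redis.Redis()   # assumes a local Redis instance
BATCH_SIZE = 1000
_pending = []       # records waiting to be written in a single batch

def enqueue(url):
    """Add a URL to the frontier only if it has not been seen before
    (the Redis set doubles as a cache of known URLs)."""
    if r.sadd("seen_urls", url):   # SADD returns 1 only for new members
        r.rpush("frontier", url)

def dequeue():
    """Pop the next URL to crawl, or None when the frontier is empty."""
    url = r.lpop("frontier")
    return url.decode() if url else None

def store(record):
    """Buffer parsed records and flush them in batches, mirroring the
    batched, asynchronous inserts Seeger describes (persistence omitted)."""
    _pending.append(record)
    if len(_pending) >= BATCH_SIZE:
        # bulk-insert the batch into MySQL and index it in Solr here
        _pending.clear()
```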

Web Crawling Strategies

While Seeger (2010) described the building blocks of a web crawler, Dincturk, Jourdan, Bochmann, and Onut (2014) explained common crawling strategies. Web crawlers commonly use graph data structures consisting of vertices and edges, where an edge represents a connection between a pair of vertices. The crawler traverses the edges from vertex to vertex in a pre-established, systematic fashion. The two most commonly used graph traversal methods are depth-first search (DFS) and breadth-first search (BFS). According to Shaffer (2011), DFS follows one edge to its deepest vertex and then recursively backtracks to find the next unvisited vertex. BFS visits all vertices connected to the starting vertex before traversing edges that are deeper in the hierarchy. The implementation of BFS is similar to that of DFS except that it uses a queue rather than a recursion stack.
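
A minimal BFS crawl loop, assuming a hypothetical fetch_links function that downloads a page and returns its hyperlinks, might look like this:

```python
from collections import deque

def bfs_crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl: visit all URLs found at the current depth
    before following links that lie deeper in the hierarchy."""
    seen = set(seed_urls)
    queue = deque(seed_urls)           # FIFO queue drives the breadth-first order
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        for link in fetch_links(url):  # download the page, return its hyperlinks
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

Replacing popleft() with pop() turns the queue into a stack and yields the iterative equivalent of DFS.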

Dincturk et al. (2014) compared several strategies for crawling rich Internet applications (RIAs); in addition to DFS and BFS, discussed above, these included the greedy, hypercube, menu, and probability strategies. The greedy algorithm attempts to make the locally optimal choice at each stage of the traversal, with the goal that local optimizations will approximate a global optimization in a reasonable time. The menu and probability strategies are meta-models. The former crawls a page for links much like DFS and BFS but restricts the crawl to menu items. The crawler tracks the application state along with executed and unexecuted events, event priority, and the resulting state. Essentially, the model determines whether a state has been visited, whether all events have been executed from a particular state, and which events have not yet been executed. The probability meta-model is a refinement of the menu model: it calculates, for each event, the probability of finding a new state using a Bayesian formula that incorporates the history of event execution and then decides which event to explore next. The results indicated that the menu and probability strategies were more efficient than the others and that the assumptions imposed by the hypercube model were too strict for most RIAs (e.g., that application behavior is deterministic, that a reset always returns the application to the same initial state, and that the crawler is provided with a set of valid user input values). Dincturk et al. (2014) concluded that while the hypercube model performed better than DFS and BFS, its assumptions were violated “too often” (e.g., resulting in unexpected splits and merges).
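
The core idea of the probability strategy can be sketched with a simple smoothed estimate of each event’s chance of revealing a new state (the Beta-style prior below is an illustrative stand-in, not the formula used by Dincturk et al.):

```python
def new_state_probability(event_stats, alpha=1.0, beta=1.0):
    """Estimate P(event leads to a new state) from its execution history
    using a smoothed ratio -- an illustrative Bayesian-style estimate."""
    new, total = event_stats   # (executions that found a new state, total executions)
    return (new + alpha) / (total + alpha + beta)

def choose_next_event(history):
    """Pick the event with the highest estimated probability of
    discovering a new application state."""
    return max(history, key=lambda event: new_state_probability(history[event]))

# Example: 'menu_click' found 3 new states in 4 runs, 'footer_click' none in 5.
history = {"menu_click": (3, 4), "footer_click": (0, 5), "untried": (0, 0)}
print(choose_next_event(history))   # -> 'menu_click'
```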

In a similar vein, van Deursen, Mesbah, and Nederlof (2015) examined Crawljax, an open-source crawler designed for AJAX-driven web applications. The tool is JavaScript-enabled, so the crawl produces a model of user interaction: a click on an element changes the state of the application, and attempting to trigger all possible states builds a complete user interaction scenario. A key challenge in crawling AJAX applications is the potential for state explosion, given that the state of the application and the content displayed may be temporally based or personalized to each user. Another challenge is determining whether navigation can be derived from the URL or whether it is defined programmatically in the application. Events may trigger a state change in multiple ways, not just through a clickable link (e.g., touchscreen gestures and keyboard combinations), making it difficult for the crawler to invoke all possible events. Finally, though JavaScript-enabled crawlers can find more extractable content than their traditional counterparts, they face the same barriers when it comes to the “deep web,” the parts of a site that are not reachable through links. To address these issues, the authors proposed a crawling approach involving the following steps:

  • Identify clickable document object model (DOM) elements by first listing candidate elements and then trying each dynamically.
  • Compare states and determine whether any change is relevant, either by using string-based Levenshtein distance to detect changes above a given threshold or by using a series of comparators that eliminate changes to irrelevant detail (e.g. color, counters, and timers); a sketch of this comparison follows the list.
  • Recurse, using a diff algorithm to determine which clickable elements have already been processed on the new page and to guide the crawler toward newly added elements.
  • Provide random or custom input values for specific states or fields depending on the needs.
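
The threshold-based state comparison mentioned in the second step might be sketched as follows (the 5% threshold and the idea of comparing serialized DOM strings are assumptions for illustration; Crawljax itself also supports pluggable comparators that first strip irrelevant detail):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def is_new_state(old_dom, new_dom, threshold=0.05):
    """Treat the new DOM as a distinct state only if the relative edit
    distance between the two serialized DOMs exceeds the threshold."""
    if not old_dom and not new_dom:
        return False
    distance = levenshtein(old_dom, new_dom)
    return distance / max(len(old_dom), len(new_dom)) > threshold
```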

The work by van Deursen et al. (2015) is in its early stages; to date, the authors have collected crawl results from over 4,000 randomly selected online web applications. Initial findings indicate that 90% of the crawled sites perform JavaScript-enabled DOM manipulations after the initial load, that half contain non-unique id fields in DOM elements, and that half load their style sheets too late, requiring page re-rendering. The authors’ suggestions for future research include investigating the potential for guided crawling (a supervised approach) along with example-based and model-based crawling.

While developing optimal traversal strategies is a common theme in the literature, other researchers have focused on the quality of result sets. Goyal and Kalra (2014), for example, compared the results of three crawlers using different classification schemes to determine topical relevancy: naïve Bayesian, decision tree induction, and neural network classifiers. Four topics were chosen, and five seed URLs for each topic were placed into the frontier queue (i.e., the queue of links to be crawled). Each link was then scored for relevancy based on four attributes, namely parent page, URL word, anchor text, and surrounding text. The seed links, along with their scores for each attribute, became the basis for training. Any score above the established threshold for a given attribute was predicted as relevant. The results of the study indicated that the neural network classifier achieved more than twice the precision of the naïve Bayesian classifier for three of the four topics and significantly higher precision for the fourth; the decision tree induction classifier also performed significantly better than the Bayesian classifier and outperformed the neural network classifier for one topic (computer books). Although the results are promising, the authors cautioned that more experiments must be performed to account for the quality of the seeds, the number of pages crawled, and the completeness and appropriateness of the attributes used to score relevancy.
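
The attribute-and-threshold scheme can be illustrated with a small sketch (the thresholds, the term-overlap measure, and the sample data below are hypothetical; Goyal and Kalra derive their scores and thresholds from training data):

```python
# Hypothetical per-attribute thresholds.
THRESHOLDS = {"parent_page": 0.4, "url_word": 0.3,
              "anchor_text": 0.5, "surrounding_text": 0.4}

def attribute_score(text, topic_terms):
    """Fraction of topic terms appearing in the text fragment (illustrative measure)."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def is_relevant(link, topic_terms):
    """Predict relevance if any attribute's score exceeds its threshold."""
    return any(attribute_score(link[attr], topic_terms) > limit
               for attr, limit in THRESHOLDS.items())

link = {"parent_page": "focused web crawler tutorial",
        "url_word": "crawler guide",
        "anchor_text": "focused crawler basics",
        "surrounding_text": "how a focused web crawler scores candidate links"}
print(is_relevant(link, ["crawler", "focused", "web"]))   # -> True
```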

More and Bharambe (2014) performed a similar comparison, motivated by the inability of generic crawlers to produce relevant results for complex queries, such as those issued through meta-crawlers. The authors proposed applying a genetic algorithm (GA) to the meta-search and refining the results of that search locally using a memetic algorithm (MA). The GA consisted of a soft-intelligent agent that can work offline to filter and rank the results according to the needs of the user. Using a fuzzy classifier that adapts to user preferences, the filter selects the most discriminating terms. The MA is similar, except that memes can learn, as can the offspring of memes as they evolve. Both algorithms rely on the same four parameters, namely population size, number of generations, crossover rate, and mutation rate. Five complex queries were chosen for comparison. The results indicated that the GA achieved significantly higher precision than the traditional crawler, reaching its optimum at 10 iterations. The MA achieved performance comparable to the GA after only five iterations, suggesting efficiency gains from the latter.
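
A generic GA skeleton driven by those four parameters might look as follows (the truncation-selection scheme, list-based individuals, and caller-supplied operators are assumptions; this is not More and Bharambe’s implementation):

```python
import random

def genetic_search(fitness, random_individual, crossover, mutate,
                   population_size=20, generations=10,
                   crossover_rate=0.8, mutation_rate=0.1):
    """Generic GA loop parameterized by population size, number of
    generations, crossover rate, and mutation rate; fitness, the
    individual representation (assumed list-like), and the operators
    are supplied by the caller."""
    population = [random_individual() for _ in range(population_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]    # simple truncation selection
        children = []
        while len(children) < population_size:
            a, b = random.sample(parents, 2)
            child = crossover(a, b) if random.random() < crossover_rate else list(a)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = children
    return max(population, key=fitness)
```

A memetic variant would add a local refinement step for each child inside the loop, which is where More and Bharambe’s MA obtains its faster convergence.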