Exploring Web Mining

Web Mining Applications

There is a wide range of industry- and service-specific research in the literature on web mining applications, including those used in e-commerce, news websites, discussion forums, weblogs, and e-learning. Ghuli and Shettar (2014), for example, proposed an e-commerce shopping agent designed to allow consumers to shop for the best deal among alternative vendors. Since such an application is resource intensive, the authors recommended using Hadoop MapReduce, a software framework that allows for distributed processing of large datasets across clusters of computers. MapReduce splits input datasets into independent chunks that can be processed in parallel across the cluster. Results indicated that while the distributed model was faster, inconsistent placement of the DOM elements containing items of interest and their metadata proved problematic for the crawler.
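The map/reduce pattern described above can be illustrated with a minimal sketch: price listings are split into independent chunks, each chunk is mapped in parallel, and the mapped output is reduced to the cheapest offer per product. The sample vendors and products, and the use of Python's process pool in place of Hadoop, are illustrative assumptions rather than the authors' implementation.

```python
from concurrent.futures import ProcessPoolExecutor
from collections import defaultdict

def map_chunk(chunk):
    # Emit (product, (vendor, price)) pairs for one independent chunk.
    return [(product, (vendor, price)) for vendor, product, price in chunk]

def reduce_offers(mapped_chunks):
    # Group pairs by product and keep the cheapest offer.
    offers = defaultdict(list)
    for pairs in mapped_chunks:
        for product, offer in pairs:
            offers[product].append(offer)
    return {p: min(o, key=lambda v: v[1]) for p, o in offers.items()}

if __name__ == "__main__":
    listings = [  # hypothetical crawled price data
        ("VendorA", "laptop", 899.0), ("VendorB", "laptop", 849.0),
        ("VendorA", "phone", 499.0), ("VendorC", "phone", 479.0),
    ]
    chunks = [listings[:2], listings[2:]]       # split input into independent chunks
    with ProcessPoolExecutor() as pool:         # process chunks in parallel
        mapped = list(pool.map(map_chunk, chunks))
    print(reduce_offers(mapped))                # best deal per product
```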

Large news websites pose similar crawling problems due to the various ways content is made accessible and rendered in the DOM. Access to premium content, of course, is restricted behind a paywall; moreover, when templates or website structure change, crawlers quickly become outdated and are unable to index the metadata of new content (e.g., title, author, and date) without significant manual modification. Varlamis, Tsirakis, Poulopoulos, and Tsantilas (2014) proposed an automatic wrapper generation process for news and blog websites, with each site having a wrapper for its home page, its category pages, and its article pages. Based on the machine-learning paradigm, wrappers are self-correcting: they learn to distinguish structural elements of interest from irrelevant elements and automatically request a rebuild when validation fails repeatedly. Results of the authors’ study, involving 95 Greek news sites and 2,117 distinct category pages, revealed an error rate of greater than 40%. Errors were mostly related to misclassification, namely non-category pages classified as category pages (e.g., “about us” and terms-of-use pages), article media classified as advertisement media, and articles containing only non-textual media classified as irrelevant content. The authors suggested blacklisting or auto-characterizing certain types of pages and elements to reduce the error rate.
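The self-correcting behavior described above can be sketched minimally as a wrapper that extracts article metadata, validates the result, and requests a rebuild after repeated failures. The CSS selectors, failure threshold, and rebuild hook are illustrative assumptions, not the authors' system; the `page` object is assumed to behave like a parsed BeautifulSoup document.

```python
from dataclasses import dataclass, field

@dataclass
class ArticleWrapper:
    selectors: dict            # hypothetical, e.g. {"title": "h1.headline", "date": "time.published"}
    max_failures: int = 3      # consecutive failures before a rebuild is requested (assumed threshold)
    failures: int = field(default=0, init=False)

    def extract(self, page) -> dict:
        # `page` is assumed to expose .select_one(css), as BeautifulSoup does.
        found = {name: page.select_one(css) for name, css in self.selectors.items()}
        record = {k: v.get_text(strip=True) for k, v in found.items() if v is not None}
        if self.validate(record):
            self.failures = 0
            return record
        self.failures += 1
        if self.failures >= self.max_failures:
            self.request_rebuild()
        return {}

    def validate(self, record) -> bool:
        # All structural elements of interest must be present and non-empty.
        return all(record.get(name) for name in self.selectors)

    def request_rebuild(self):
        # Placeholder: in the described approach, repeated validation failures
        # trigger automatic re-learning of the wrapper for this site.
        print("validation failed repeatedly; wrapper rebuild requested")
```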

In online education, Hijazi and Itmazi (2013) proposed a context-aware crawler that periodically crawls selected open educational resources. Results are integrated into the learning management system (LMS), and resources are presented only to students who have expressed an interest in the subject. Resources are further tailored to interested students’ delivery preferences (e.g., device, operating system, and connection type) and location. Results of a survey revealed that 83% of students accepted this type of context-aware integration of external supplemental learning material into the LMS. Those who rejected the enhancement did so primarily because they did not own or have access to the technology needed to connect remotely or were concerned about revealing private information (i.e., location).
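The interest- and preference-based filtering described above might be sketched as follows. The field names, sample student profile, and resource records are illustrative assumptions, not the authors' LMS integration.

```python
def resources_for_student(student, resources):
    matches = []
    for r in resources:
        if r["subject"] not in student["interests"]:
            continue                                    # interest-based filter
        if r["format"] not in student["preferences"]["formats"]:
            continue                                    # delivery-preference filter
        matches.append(r)
    return matches

# Hypothetical student profile and crawled open educational resources.
student = {
    "interests": {"databases"},
    "preferences": {"formats": {"pdf", "html"}},        # e.g. suited to the student's device
}
resources = [
    {"subject": "databases", "format": "pdf", "url": "https://example.org/sql-notes"},
    {"subject": "databases", "format": "video", "url": "https://example.org/sql-lecture"},
    {"subject": "networks", "format": "pdf", "url": "https://example.org/tcp-primer"},
]
print(resources_for_student(student, resources))        # only the matching PDF resource
```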