Exploring Web Mining

Ethical and Security Issues

The literature provides numerous examples of legitimate uses of web mining techniques that are, of course, essential to search engines and that help scientists, marketers, executives, researchers, entrepreneurs, educators, and consumers find the data they need to make informed decisions. However, the practice of crawling and scraping websites also raises concerns over privacy, copyright, and other property rights. Invasion of a person’s privacy is not generally an issue with polite crawlers because the data they collect contains no personally identifying information. As for copyright, users of data extracted by polite crawlers should cite the source and/or seek permission to use it if there is any doubt. Polite crawlers are also respectful of the target site’s resources and comply with its acceptable use policy (AUP) and the Robots Exclusion Protocol (REP) rules contained in its robots.txt file (Sun, Councill, & Giles, 2010).
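To illustrate REP compliance, the following minimal Python sketch shows how a polite crawler might consult a site's robots.txt before fetching a page, using the standard urllib.robotparser module; the site URL, user-agent string, and path are placeholders rather than details drawn from the studies cited here.

```python
import time
import urllib.robotparser
import urllib.request

# Hypothetical crawler identity and target site (placeholders for illustration).
USER_AGENT = "ExampleResearchBot/1.0"
SITE = "https://www.example.com"

# Fetch and parse the site's robots.txt (its REP rules).
robots = urllib.robotparser.RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()

def polite_fetch(path):
    """Fetch a page only if robots.txt allows it, honoring any requested crawl delay."""
    url = SITE + path
    if not robots.can_fetch(USER_AGENT, url):
        return None                      # disallowed by the site's REP rules
    delay = robots.crawl_delay(USER_AGENT)
    if delay:
        time.sleep(delay)                # respect the site's requested pacing
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read()

page = polite_fetch("/public/index.html")
```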

Unfortunately, not all crawlers are polite; some are used maliciously by hackers, spammers, and identity thieves, and these pose security threats. Privacy violations can occur, for example, if a crawler breaches restricted areas of a website that contain personally identifying data. Beyond being unethical, such a breach may also violate the privacy rights of the individuals whose data was compromised, and data protected by the Health Insurance Portability and Accountability Act (HIPAA) or the Family Educational Rights and Privacy Act (FERPA) would be considered illegally obtained. Likewise, a crawler that overwhelms a server, resulting in a denial of service (DoS), or one that does not comply with the AUP or REP is behaving unethically and may violate the rights of the site owners.

Fortunately, there are countermeasures site owners can implement to detect and deal with malicious crawlers. According to Bai, Xiong, Zhao, and He (2014), there are currently three primary methods to detect a web crawler: 1) identifying user agents in log files or checking whether the client requested the robots.txt file; 2) recording keystrokes and mouse clicks (robots produce neither); and 3) tracking navigational patterns. The first two are limited in that the crawler must already be active on the site for detection to be possible (and may already be causing damage), while the third requires the gradual development of a self-learning model using artificial intelligence (AI) techniques, a complex and costly approach. The authors explained that while AI approaches showed a high degree of precision, recent studies were limited to a single enterprise website with relatively fast connections and a consistent structure.
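As a simplified sketch of the first method, assuming an Apache/Nginx-style combined access log and a hypothetical list of known bot signatures, the following Python code flags clients that either announce a crawler user agent or request robots.txt; the log file name and signature list are illustrative assumptions, not part of the cited studies.

```python
import re
from collections import defaultdict

# Hypothetical signatures of well-known crawler user agents (illustrative only).
KNOWN_BOT_SIGNATURES = ("googlebot", "bingbot", "yandexbot", "baiduspider")

# Regex for a combined-format access log line: client IP, request line, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d+ \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def scan_log(lines):
    """Flag clients that announce a known bot user agent or request robots.txt."""
    evidence = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if not match:
            continue
        ip, path, agent = match["ip"], match["path"], match["agent"].lower()
        if any(sig in agent for sig in KNOWN_BOT_SIGNATURES):
            evidence[ip].add("known bot user agent")
        if path == "/robots.txt":
            evidence[ip].add("requested robots.txt")
    return evidence

with open("access.log") as log_file:       # hypothetical log file name
    for ip, reasons in scan_log(log_file).items():
        print(ip, "->", ", ".join(sorted(reasons)))
```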

The study described by Algiriyage, Jayasena, Dias, Perera, and Dayananda (2013) is an example of the first method described by Bai et al. (2014), in that the authors analyzed web logs to identify crawlers and classified each as a known, suspicious, or other crawler. Known crawlers passed user-agent verification, suspicious crawlers violated the REP, and other crawlers exhibited crawler behaviors but were not explicitly identified as known. Suspicious crawlers were further analyzed to determine whether they had crawled hidden links (something known crawlers normally do not do) or had been lured into honeypots, and they were also checked against IP blacklists. The study involved 105,981 log lines and seven crawling scenarios. Results indicated that over 53%, 34%, and 12% of the crawling was done by known, suspicious, and other crawlers, respectively. After finding false positives in the identifier module of their design, the authors concluded that future research needs to focus on improving accuracy. They speculated that erroneous user-agent verification may occur with crawlers that use “fake” browser-based user agents.
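The paper does not include an implementation; purely as an illustration of this kind of rule-based classification, the sketch below assigns the three labels from a hypothetical per-client profile, with the blacklist entry, profile fields, and thresholds all assumed for the example rather than taken from the study.

```python
from dataclasses import dataclass

# Hypothetical per-client summary built from the access log (illustrative only).
@dataclass
class ClientProfile:
    user_agent: str
    agent_verified: bool = False       # user agent passed verification as a known bot
    violated_rep: bool = False         # requested paths disallowed by robots.txt
    crawled_hidden_link: bool = False  # followed a link invisible to human visitors
    bot_like_behavior: bool = False    # e.g., very high request rate, no referrers

IP_BLACKLIST = {"203.0.113.7"}         # placeholder blacklist entry

def classify(ip: str, profile: ClientProfile) -> str:
    """Assign one of the three crawler labels used in this kind of log analysis."""
    if profile.agent_verified:
        return "known crawler"
    if profile.violated_rep or profile.crawled_hidden_link or ip in IP_BLACKLIST:
        return "suspicious crawler"
    if profile.bot_like_behavior:
        return "other crawler"
    return "not flagged as a crawler"

print(classify("203.0.113.7", ClientProfile(user_agent="SomeBot/2.0")))
```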

According to Aghamohammadi and Eydgahi (2013), timely detection and prevention are the key criteria for safeguarding against malicious crawlers. The authors proposed a five-factor identification process to predefine acceptable crawlers; the factors are a passkey, time, IP lookup, user agent, and number of visits, and valid values must be provided before a crawler can do any scraping. Results indicated that the authors were able to perform selective exclusion while keeping page visibility high for acceptable crawlers.
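The authors do not publish code for this process; the sketch below, with entirely hypothetical registration data, time windows, and limits, only illustrates how a server-side gate over those five factors might be assembled.

```python
import ipaddress
from datetime import datetime, timezone

# Hypothetical registration record for an acceptable crawler (placeholders only).
REGISTERED_CRAWLERS = {
    "ExampleResearchBot/1.0": {
        "passkey": "s3cr3t-token",
        "allowed_hours": range(1, 5),                        # off-peak window (UTC)
        "allowed_network": ipaddress.ip_network("192.0.2.0/24"),
        "max_daily_visits": 500,
    }
}

def admit_crawler(user_agent, passkey, client_ip, visits_today, now=None):
    """Check the five factors (user agent, passkey, time, IP lookup, visit count)
    before allowing a registered crawler to scrape."""
    now = now or datetime.now(timezone.utc)
    profile = REGISTERED_CRAWLERS.get(user_agent)
    if profile is None:                                       # factor: user agent
        return False
    if passkey != profile["passkey"]:                         # factor: passkey
        return False
    if now.hour not in profile["allowed_hours"]:              # factor: time
        return False
    if ipaddress.ip_address(client_ip) not in profile["allowed_network"]:  # factor: IP lookup
        return False
    if visits_today >= profile["max_daily_visits"]:           # factor: number of visits
        return False
    return True

print(admit_crawler("ExampleResearchBot/1.0", "s3cr3t-token", "192.0.2.10", 12))
```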