Developing a Web Scraper


This paper describes the design and development of a web content extraction application (web scraper) built using PHP and its W3C Document Object Model (DOM) and DOM XPath extensions. The DOMDocument class not only enables developers to create dynamic web pages programmatically but also allows them to extract content from existing pages. The XPath extension makes it possible to evaluate expressions against a DOM document and select the nodes that match the patterns defined by the developer's extraction rules. Only the content considered relevant is thus extracted for analysis.
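To make the extraction step concrete, the following is a minimal sketch of the DOM and XPath approach described above, not the application's actual code. The sample markup, the function name extractLinks, and the rule //div[@class="course"]/a are all hypothetical and chosen only for illustration; a real scraper would fetch its pages over HTTP rather than use an inline string.

```php
<?php
// Hypothetical sample page standing in for a fetched document.
$html = <<<'PAGE'
<html><body>
  <h1>Course Catalog</h1>
  <div class="course"><a href="/courses/php">Intro to PHP</a></div>
  <div class="course"><a href="/courses/xpath">XPath Basics</a></div>
  <div class="ad"><a href="/promo">Sponsored link</a></div>
</body></html>
PAGE;

/**
 * Hypothetical helper: returns the text and href of every element
 * selected by the XPath extraction rule.
 */
function extractLinks(string $html, string $rule): array
{
    $doc = new DOMDocument();

    // Real-world markup is often malformed; collect libxml errors
    // internally instead of emitting a warning for each one.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($rule);
    if ($nodes === false) {
        return []; // the rule was not a valid XPath expression
    }

    $results = [];
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement) {
            $results[] = [
                'text' => trim($node->textContent),
                'href' => $node->getAttribute('href'),
            ];
        }
    }
    return $results;
}

// Extraction rule (an assumption for this sketch): links inside
// elements whose class attribute is exactly "course".
$courses = extractLinks($html, '//div[@class="course"]/a');

foreach ($courses as $course) {
    echo $course['text'], ' -> ', $course['href'], PHP_EOL;
}
```

In this sketch the link inside the "ad" element is never returned, because it does not match the rule. This is the sense in which only content considered relevant is extracted. Changing what is collected means changing the XPath expression, not the surrounding code.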

The motivation for this project was, and still is, to develop a means of making better-informed decisions about resource allocation for an e-learning publisher startup that is in the research and development stage. Although the application described here is a working model, it remains mainly a proof of concept: much work is still needed to develop it into a robust, production-quality application. More research is also needed to ensure that the application consistently behaves ethically and complies with the law.