The internet is the largest single source of information in the world. Because its content is interlinked, it supports interactive access: users reach the information they want by following URLs and hyperlinks from page to page. There are billions of web pages on the World Wide Web, so getting specific information from the internet is not an easy process, and this calls for web data mining.
Because the data on the web grows daily, extracting meaningful patterns from it is becoming a complicated task. The difficulty comes from the unstructured and semi-structured nature of the content, which is hard to organize. Unlike print documents, internet data changes every day, and this makes managing it in a database complex. This is why web data mining is an important approach for extracting useful information from online data.
Web data mining uses data mining tools and expertise to discover and extract information from the web. The web data mining process can be divided into four subtasks:
- Resource Finding. This is the first step and involves retrieving data from both offline and online sources. Internet sources can include newsletters, HTML documents, and website content.
- Pre-processing. This step is also known as information selection. Once the data has been extracted from the internet, it must be transformed into a usable form. The process involves removing stop words and then representing the data in a logical order.
- Generalization. This is the process of identifying general patterns and trends, both within an individual website and across multiple websites. It is usually done by data experts or data mining companies.
- Analysis. This is the process of validating the information obtained from the processed data, identifying patterns, and then interpreting those patterns.
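The four subtasks above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration under stated assumptions, not a production pipeline: resource finding is replaced by two hard-coded HTML strings, the stop-word list is a tiny hypothetical one, and "patterns" are reduced to terms that appear in several documents. The names `TextExtractor`, `preprocess`, `generalize`, and `analyze` are invented for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "for", "on"}


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML document, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def preprocess(html):
    """Pre-processing: strip markup, lowercase, tokenize, drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def generalize(documents):
    """Generalization: count in how many documents each term appears."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    return doc_freq


def analyze(doc_freq, min_docs):
    """Analysis: keep only terms seen in at least min_docs documents."""
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)


# Resource finding is stubbed out: these strings stand in for fetched pages.
pages = [
    "<html><body><h1>Mining the Web</h1><p>Data mining for web pages.</p></body></html>",
    "<html><body><p>Web data is semi-structured.</p><script>var x = 1;</script></body></html>",
]

docs = [preprocess(p) for p in pages]
patterns = analyze(generalize(docs), min_docs=2)
print(patterns)  # ['data', 'web']
```

Counting document frequency rather than raw term frequency is a design choice made here so that a word repeated many times on one page does not look like a cross-site trend; a real system would likely use a richer representation.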
In web data mining, it is important to understand that three main factors influence the perception and evaluation process.