Web Data Mining-All About It
Written by Web Mining Team   
Wednesday, 08 February 2012 01:42

 

The internet is the largest single source of information in the world. Because it is interlinked, it supports interactive access: users reach the information they want by following URLs and hyperlinks across the web. The World Wide Web holds millions of web pages, and finding the right information among them is not an easy process, which is why web data mining is needed.

Since the data on the web is growing daily, extracting the crucial patterns is becoming a complicated task. This is due to the difficulty of organizing unstructured and semi-structured content. Unlike print documents, internet data evolves every day, which makes database management complex. This is why web data mining is an important process for mining data online.

Web data mining uses data mining tools and experts to discover and extract information from the web. The web data mining process can be divided into four subtasks:

  • Resource Finding. This is the first step and involves retrieving data from both offline and online sources. Internet sources can include newsletters, HTML documents and website content.
  • Pre-processing. This step is also known as information selection. Once data has been extracted from the internet, it needs to be transformed into a usable form. The process involves removing stop words and then representing the data in a logical order.
  • Generalization. This is the process of identifying the general patterns and trends within the target website and across multiple websites. It is usually done by data experts or data mining companies.
  • Analysis. This is the process of validating the information derived from the data, identifying patterns and then interpreting those patterns.
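
The pre-processing subtask can be illustrated with a minimal Python sketch. It assumes the page text has already been retrieved, and it uses a deliberately tiny stop-word list; the names `STOP_WORDS`, `preprocess` and `term_counts` are hypothetical and chosen only for this example.

```python
import re
from collections import Counter

# Assumed, very small stop-word list; real projects use a much larger one.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "on", "that"}

def preprocess(text):
    """Lowercase the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def term_counts(documents):
    """Count how often each remaining term appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(preprocess(doc))
    return counts

# Made-up sample documents standing in for extracted web content.
sample_docs = [
    "The internet is the largest source of information.",
    "Mining the web is a way to find information on the internet.",
]
counts = term_counts(sample_docs)
print(counts.most_common(2))
```

Counting the terms that remain is one simple way to represent the cleaned data in an ordered form before patterns are looked for in the later subtasks.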

In web data mining it is important to understand that three main factors influence the perception and evaluation process.

  • Web page content. Content is key: every person goes online to search for information to solve a given problem. A website therefore needs to be rich in content and informative, and web scraping should be undertaken on websites regarded as having important information.
  • Web page design. How a web page is arranged or designed is another factor that determines the data mining process to be undertaken. How it is linked to other pages is another important consideration.
  • Website design and structure. The number of pages and the kind of publishing platform in use must be taken into account.

Web page content covers the data and information currently available on a website, while web page design and website design and structure are two important factors that determine the website's usability and accessibility.


Web scraping is carried out in pursuit of useful and relevant information from web pages and databases; it can be broadly divided into the following three methods:

Web Content Mining. This is the process of discovering useful and relevant information on the internet. The internet holds extensive resources in the form of text, audio clips, video streams, images and many other formats. This process is related to multimedia data mining, which involves the extraction of various types of data. Web content is mostly unstructured and is commonly found as free text or semi-structured HTML documents. The main purpose of web data mining in this case is to improve the information that is found on the web and thereby provide efficient results.
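
As a rough illustration of content mining on a semi-structured HTML document, the sketch below pulls the visible text out of a page using Python's standard `html.parser` module. The class name `TextExtractor`, the helper `extract_text` and the sample markup are assumptions made for this example, not part of any particular tool.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Made-up sample page standing in for a downloaded document.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Web Mining</h1><p>Useful <b>content</b> here.</p></body></html>"
)
page_text = extract_text(sample_html)
print(page_text)
```

The extracted text can then be passed to whatever pre-processing and analysis steps the project uses.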

Web Structure Mining. This is the process of discovering data based on the underlying link structure of the internet. It relies greatly on the topology of hyperlinks, which may or may not have descriptions. It is an important part of web scraping that enables the categorization of web pages and the identification of relationships, patterns and trends within a website.
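
One simple way to work with link topology is to count how many other pages link to each page. The sketch below assumes the links have already been collected into a dictionary; the page names and the `inbound_counts` helper are hypothetical, and inbound link counting is only one basic measure, not a full ranking method.

```python
# Hypothetical link data: each page maps to the pages it links to.
links = {
    "home": ["about", "products"],
    "about": ["home"],
    "products": ["home", "about"],
    "orphan": ["home"],
}

def inbound_counts(link_map):
    """Count distinct pages linking to each page, ignoring self-links."""
    counts = {page: 0 for page in link_map}
    for source, targets in link_map.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts

inbound = inbound_counts(links)
# Sort by number of inbound links (highest first), then by page name.
ranked = sorted(inbound.items(), key=lambda item: (-item[1], item[0]))
print(ranked)
```

Pages with many inbound links tend to be central to a site, while pages with none may be hard for users to reach.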

Web Usage Mining. This is the process of analyzing data generated by browsing history or behavior. Content mining and web structure mining are based on primary data, while web usage mining depends on secondary data.
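
Usage data can be explored with very little code. The sketch below assumes a click log already reduced to (visitor, path) pairs in time order; the log entries and the helpers `page_views` and `transitions` are made up for illustration, and real server logs would need parsing first.

```python
from collections import Counter

# Hypothetical click log: (visitor id, requested path) in time order.
clicks = [
    ("u1", "/home"), ("u1", "/products"), ("u2", "/home"),
    ("u1", "/checkout"), ("u2", "/products"), ("u3", "/home"),
]

def page_views(log):
    """Count how many times each path was requested."""
    return Counter(path for _, path in log)

def transitions(log):
    """Count page-to-page moves made by each visitor."""
    last_page = {}
    moves = Counter()
    for visitor, path in log:
        if visitor in last_page:
            moves[(last_page[visitor], path)] += 1
        last_page[visitor] = path
    return moves

views = page_views(clicks)
moves = transitions(clicks)
print(views.most_common(1))
print(moves.most_common(1))
```

Page view totals show which pages attract the most traffic, and the transition counts hint at the paths visitors commonly follow through a site.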

 
