Search Engine using Web Mining approach – Department of Information Technology

Abstract: Hypertext mining is about finding significant statistical patterns relating hypertext documents, topics, hyperlinks and queries, and using these patterns to connect users to the information they seek. It also aims to overcome the limitations of basic web access, which relies on clicking links and typing keyword queries. The web has become a vast storehouse of knowledge built in a decentralized yet collaborative manner. It is a living, growing, populist and participatory medium of expression with no central editorship. Detecting and exploiting statistical dependencies between terms, web pages and hyperlinks will be the central theme of this project. Such dependencies are also called ‘patterns’, and the act of searching for such patterns is called ‘machine learning’. The data to be mined is very rich, comprising text, hypertext markup, hyperlinks, sites and topic directories. This distinguishes web mining as a new and exciting field, although it borrows liberally from traditional data analysis. Our goal is to provide both the technical background and the tools and tricks of the trade of web content mining. The general style is a mix of scientific and statistical programming with system engineering and optimization. First, the project covers engineering issues such as crawling, indexing and keyword search; this part concentrates on efficiently representing, manipulating and analyzing hypertext documents with automatic computer programs. Second, it focuses on machine learning for hypertext: the art of creating programs that seek out statistical relations between attributes extracted from web documents. Such relations can be used to discover topic-based clusters in a collection of web pages, assign a web page to a predefined topic, or match a user’s interests to web sites.
The project does not treat web application services, dynamic site management, or the associated networking and data-processing technology; instead it offers an in-depth probabilistic analysis of patterns over the web. It makes occasional reference to what has been called “ankle-deep semantics”: techniques that leverage semantic databases in shallow, efficient ways to improve keyword search.
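
As a rough illustration of the indexing and keyword-search step mentioned in the abstract, the sketch below builds a small inverted index and answers boolean AND queries against it. It is not the project's actual implementation: the function names (tokenize, build_index, search) and the three-page toy corpus are assumptions made for this example, and a real system would add crawling, ranking and persistent storage.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    # index.get avoids creating empty entries for unseen terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical toy corpus standing in for crawled pages.
pages = {
    "p1": "Web mining finds patterns in hypertext",
    "p2": "Crawling and indexing hypertext documents",
    "p3": "Keyword search over an inverted index",
}

index = build_index(pages)
print(sorted(search(index, "hypertext")))           # ['p1', 'p2']
print(sorted(search(index, "hypertext indexing")))  # ['p2']
```

The index maps terms to sets of page ids, so a multi-word query reduces to a set intersection. The statistical methods described in the abstract, such as clustering and topic classification, would operate on term counts collected in a similar pass over the documents.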