New Technologies for the Web

Focused crawlers give accurate results by specializing in one or a few topics, while Memex-type browsers retrieve information on the basis of your past surfing experience.


Wednesday, November 29, 2000

This article concludes the series on Web-information management. The first two articles covered the technologies that powered the first-generation search engines, and how the second-generation search engines exploit social-network analysis to mine relevant information effectively. This article discusses focused crawling, which promises to help in our information-foraging endeavors. It also looks at another technology, Memex, that lets you use your past surfing experience to search for relevant information on the Web.

How focused crawling works

Focused crawling concentrates on the quality of information and the ease of navigation rather than on the sheer quantity of content on the Web. A focused crawler seeks, acquires, indexes, and maintains pages on a specific set of topics that represent a relatively narrow segment of the Web. A distributed team of focused crawlers, each specializing in one or a few topics, can thus manage the entire content of the Web between them.

Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary (the links leading out of the pages it has already fetched) to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. Focused crawlers selectively seek out pages that are relevant to a pre-defined set of topics. The pages they collect form a personalized web within the World Wide Web. Topics are specified at the console of the focus system using example documents and pages instead of keywords.
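The crawl strategy described above can be sketched as a best-first search over the crawl boundary. This is only an illustration under stated assumptions, not the actual focus system: the Web is replaced by a small in-memory dictionary, the topic is built from a couple of example documents, relevance is approximated by cosine similarity of word counts, and the names focused_crawl, TOY_WEB, and threshold are hypothetical.

```python
import heapq
import re
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercase word counts; a crude stand-in for a real document model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def focused_crawl(web, seeds, examples, threshold=0.2, max_pages=10):
    """Best-first crawl. `web` maps url -> (text, outlinks) and is an
    assumed in-memory substitute for fetching real pages."""
    topic = bag_of_words(" ".join(examples))
    # heapq is a min-heap, so priorities are negated to pop the best link first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    collected = []
    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, links = web[url]
        score = cosine(bag_of_words(text), topic)
        if score < threshold:
            continue  # irrelevant page: do not follow its links
        collected.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return collected


TOY_WEB = {
    "bikes/home": ("mountain biking trails and bike gear", ["bikes/trails", "news/stocks"]),
    "bikes/trails": ("best mountain bike trails for biking", ["bikes/gear"]),
    "bikes/gear": ("mountain bike gear reviews", []),
    "news/stocks": ("stock market prices today", ["news/bonds"]),
    "news/bonds": ("bond market yields", []),
}

pages = focused_crawl(TOY_WEB, ["bikes/home"],
                      ["mountain biking trails", "mountain bike gear"])
for url, score in pages:
    print(f"{url}: {score:.2f}")
```

In this toy run the crawler fetches the stock-market page once, finds it irrelevant, and never follows its link to the bond page. That pruning is the source of the savings a focused crawler is meant to deliver.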

Working this way saves significant hardware and network resources, and yet achieves respectable coverage at a rapid rate, simply because there is relatively little to do. Each focused crawler is also far more nimble in detecting changes to pages within its focus than a crawler that crawls the entire Web.

The crawler is built upon two hypertext mining programs: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, and a distiller that identifies hypertext nodes that are great access points to many relevant pages within a few links.
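The article does not say how the distiller is computed. One plausible reading, assumed here purely for illustration, is a hub-and-authority style iteration in which each link is weighted by the classifier's relevance score for the page it points to. The function name distill and the sample graph are hypothetical.

```python
def distill(links, relevance, iterations=20):
    """Rank pages by how well they act as access points to relevant pages.

    links:     dict mapping a page to the list of pages it links to
    relevance: dict mapping a page to a classifier score in [0, 1]
    Assumption: a relevance-weighted hub/authority iteration, which is
    not necessarily the method used by the actual system.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page gains authority from the hubs that cite it, scaled by its relevance.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src] * relevance.get(t, 0.0)
        # A page is a good hub if it links to pages with high authority.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        total = sum(hub.values()) or 1.0
        hub = {p: h / total for p, h in hub.items()}
    return sorted(pages, key=lambda p: hub[p], reverse=True)


sample_links = {"portal": ["a", "b", "c"], "blog": ["a", "x"]}
sample_relevance = {"a": 0.9, "b": 0.9, "c": 0.9, "x": 0.1}
print(distill(sample_links, sample_relevance)[:2])
```

Here the page named portal ranks first because all three of its links lead to relevant pages, while blog spends one of its two links on an irrelevant page.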

What focused crawlers can do

Here is what we found when we used focused crawling for many varied topics at different levels of specificity.

  • Focused crawling acquires relevant pages steadily, while a standard crawler (like those used in first-generation search engines) quickly loses its way, even though both start from the same root set.

  • Focused crawling is robust against large perturbations in the starting set of URLs. It discovers largely overlapping sets of resources in spite of these perturbations.

  • It can discover valuable resources that are dozens of links away from the start set, and at the same time carefully prune the millions of pages that may lie within this same radius. The result is a very effective solution for building high-quality collections of Web documents on specific topics, using modest desktop hardware.

  • Focused crawlers impose enough topical structure on the Web that, apart from naïve topical search, powerful semi-structured querying, analysis, and discovery also become possible.

  • Getting isolated pages, rather than comprehensive sites, is a common problem with Web search. With focused crawlers, you can order sites according to the density of relevant pages found there. For example, you can find the top five sites specializing in mountain biking.

  • A focused crawler also detects cases of competition. For instance, it takes into account that the homepage of an auto manufacturer such as Honda is unlikely to contain a link to the homepage of a competitor such as Toyota.

  • Focused crawlers also identify regions of the Web that grow or change dramatically, as opposed to those that are relatively stable.
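Ordering sites by the density of relevant pages, as in the mountain-biking example in the list above, might look like the following sketch. It assumes the crawler has already produced (URL, relevance score) pairs. The function name rank_sites, the 0.5 threshold, and the example.com-style hosts are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rank_sites(scored_pages, threshold=0.5, top=5):
    """Order hosts by the fraction of their crawled pages judged relevant."""
    total = defaultdict(int)
    relevant = defaultdict(int)
    for url, score in scored_pages:
        host = urlparse(url).netloc
        total[host] += 1
        if score >= threshold:
            relevant[host] += 1
    density = {host: relevant[host] / total[host] for host in total}
    # Ties on density are broken in favor of the site with more relevant pages.
    ordered = sorted(density, key=lambda h: (density[h], relevant[h]), reverse=True)
    return [(host, density[host]) for host in ordered[:top]]


crawled = [
    ("http://trails.example/a", 0.9),
    ("http://trails.example/b", 0.8),
    ("http://news.example/x", 0.9),
    ("http://news.example/y", 0.1),
    ("http://news.example/z", 0.2),
]
for host, share in rank_sites(crawled):
    print(f"{host}: {share:.2f}")
```

Grouping by host is the simplest definition of a site. A real system would probably need something more careful, since one organization can span several hosts.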

The ability of focused crawlers to concentrate on a topical sub-graph of the Web and to browse communities within that sub-graph will lead to significantly improved Web resource discovery. By contrast, the one-size-fits-all philosophy of other search engines, like AltaVista and Inktomi, means that they try to cater to every possible query that might be made on the Web. Although such services are invaluable for their broad coverage, the results they draw from such diverse content are often of little relevance or quality.

Memex

