Federated Search Finds Content that Google Can’t Reach, Part I of III

Federated search facilitates research by helping users find high-quality documents in the more specialized or remote corners of the Internet. Federated search applications excel at finding scientific, technical, and legal documents, whether those documents live in free public sites or in subscription sites. This makes federated search a vital technology for students and professional researchers, which is why many libraries and corporate research departments provide federated search applications to their students and staff.

To really understand what federated search is and how it works, we first need some background.

Crawling the Web: How Typical Web Search Engines Work

There are two basic approaches to finding content on the Web. The approach that Google and all major search engines use is to “crawl” the Web. Over many years, Google has amassed a list of billions of Web sites. In the early days, Google likely learned about many Web sites when owners registered their sites with it. Today, Google finds new Web sites through links from sites it already knows about. Google periodically visits the sites (and their pages) on its list and identifies the links on each page. It then follows each link it finds to arrive at other pages, where it starts the process over to find more links. In doing this, Google discovers sites it didn’t know about during previous visits. This process of going from one page to another and then to another is referred to as “crawling,” just as a spider crawls from one thread to another in its web. In fact, Web “spiders” are commonly referred to as “Web crawlers.” When you create a new site, just create a link to it from another site, or get someone to do that for you, and Google’s crawler will discover it.
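To make that crawl loop concrete, here is a minimal sketch in Python. It is not Google’s implementation: the seed URL is hypothetical, and a production crawler would also honor robots.txt, throttle its requests, and store what it fetches, but the discover-fetch-follow cycle is the same one described above.

```python
# A minimal crawl loop: fetch a page, harvest its links, follow them.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests  # widely used third-party HTTP library


class LinkExtractor(HTMLParser):
    """Collects the href target of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=100):
    """Breadth-first crawl starting from a single known page."""
    seen, queue = {seed_url}, deque([seed_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip pages that can't be fetched
        extractor = LinkExtractor()
        extractor.feed(page.text)
        for link in extractor.links:
            absolute = urljoin(url, link)  # resolve relative links
            if absolute not in seen:       # a newly discovered page
                seen.add(absolute)
                queue.append(absolute)
    return seen


# Hypothetical usage: crawl("https://example.com", max_pages=25)
```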

The trouble with crawling is that it doesn’t find everything. One might believe that, with enough crawling, one could find every Web page. In fact, only a small percentage of the Web’s content is accessible to Google. The term “deep Web” refers to the vast portion of the Web that is beyond the reach of the typical “surface Web” crawlers. Surface Web search engines like Google can’t easily fathom the deep Web because most deep Web content has no links pointing to it. How can that be? Consider this example: let’s say that you are researching the effects of some chemical or hazardous substance on humans. You would be well advised to search the National Library of Medicine’s Toxicology Data Network.

Most of the information you would find there, you would not find via Google. Why? Because, to find the research articles, you would have typed one or more words into a search box and clicked the “search” button. Few, if any, of the articles you found would have links to them from any Web site. Google can’t find those articles because it isn’t designed to fill out search forms and click “submit” the way humans do. In particular, Google wouldn’t know what search words to put into the form. And even if Google did know what to enter into search forms and how to submit them, it wouldn’t be able to retrieve all of the documents from the source, leaving it with incomplete content from deep Web sources.

What Makes Federated Search Different? It’s About the Search Forms

While Google, in most cases, doesn’t fill out search forms, this is exactly what federated search applications (also known as federated search engines) do. Why doesn’t Google fill out forms? It turns out that filling out forms is a difficult problem. Federated search engine builders have to customize their search software for each Web form they encounter. While Google has one general approach to crawling links from any Web site, federated search engines are programmed with intimate knowledge of each search form. The specialized software must know not only how to fill out the form and simulate pressing the “search” button, but also how to read the results that the Toxicology Data Network (as in the example above), or any other source, returns. Both are difficult to do well.
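Below is a sketch of what one such per-source “connector” might look like. Everything specific here is an assumption for illustration: the https://example.org/search endpoint, the “terms” form field, and the <a class="result"> markup are invented, since a real connector is written against the actual form and result pages of a specific source such as the Toxicology Data Network.

```python
# A sketch of one per-source connector: fill out that source's search
# form, press "search", and read that source's result format.
from html.parser import HTMLParser

import requests


class ResultParser(HTMLParser):
    """Pulls result titles out of this hypothetical source's markup."""

    def __init__(self):
        super().__init__()
        self.in_result = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "result") in attrs:
            self.in_result = True

    def handle_data(self, data):
        if self.in_result and data.strip():
            self.titles.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_result = False


def search_source(query):
    """Simulate a human typing a query and pressing 'search'."""
    response = requests.post(
        "https://example.org/search",               # assumed form action
        data={"terms": query, "submit": "search"},  # assumed form fields
        timeout=10,
    )
    parser = ResultParser()
    parser.feed(response.text)
    return parser.titles  # one title per hit from this source
```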

The Benefits of Federated Search

The essential benefits of federated search to its users are efficiency, quality of search results, and currency of content.

Efficiency and Time Savings

Using a federated search engine can be a huge time-saver for researchers. Instead of searching many sources one at a time, the researcher lets the federated search engine perform the many searches on his or her behalf. While federated search engines specialize in finding content that requires form submission to retrieve, form filling isn’t their only defining feature. A federated search engine also aggregates content from different sources: it uses just one search form to cover numerous sources and combines the results into a single results page, as the sketch below illustrates.
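Here is a minimal sketch of that federation step: the user’s single query fans out to several connectors, and the hits come back merged on one list. The two toy connectors are hypothetical stand-ins; in practice each would be a source-specific function like search_source above.

```python
# The federation step: one query, many sources, one merged result list.
from concurrent.futures import ThreadPoolExecutor


def federated_search(query, connectors):
    """Run every connector on the same query, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(connectors)) as pool:
        result_lists = list(pool.map(lambda search: search(query), connectors))
    merged = []
    for source_results in result_lists:
        merged.extend(source_results)  # naive merge; real engines also
    return merged                      # de-duplicate and rank the hits


# Two toy connectors standing in for real deep Web sources:
sources = [
    lambda q: ["ToxNet article about " + q],  # hypothetical source A
    lambda q: ["Journal paper about " + q],   # hypothetical source B
]
print(federated_search("benzene exposure", sources))
```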

Quality of Results

Federated search engines show their value best in environments where the quality of results matters, such as libraries, corporate research departments, and the federal government, whose constituents benefit greatly from such applications. A major difference between a federated search engine and a standard search engine like Google is that the client who contracts for the federated search service selects the sources to search, and in almost every case those sources are authoritative. Google, on the other hand, has very minimal criteria for source selection: if a Web page doesn’t look like outright junk (i.e., spam), Google will present it among the search results. Thus, the federated search engine acts as a helpful librarian does, directing users to sources of excellent quality.

Most Current Content

In addition to filling out forms and combining documents from multiple sources, federated search engines offer another important benefit: they search for content in real time. Real-time results are crucial for researchers who need up-to-the-minute content or content that changes frequently. As soon as the content owner updates a source, the new information is available to the searcher on the very next query.

By contrast, with a standard search engine like Google, the results are only as current as the last time Google crawled the sites whose content matches your search words. The content you find via Google might be days or weeks old, which can be fine depending on your situation but is problematic if you need the most current information.
