Primary Software Architecture

The following sections describe the software and database components common to modern search indexes and information retrieval systems.


Indexing Robots

The indexer is usually a team of distributed web robots (spiders) that collect web documents. Spiders also collect information about how documents relate to one another. They traverse the web by following hyperlinks from page to page.
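
As a rough illustration of link-following traversal, the sketch below walks a tiny in-memory "web" breadth-first and records which pages link to which. The TOY_WEB data, the crawl function, and the example URLs are all hypothetical; a real spider would fetch pages over the network, parse their HTML for links, and run distributed across many machines.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["a.example/about"],
}


def crawl(seed, web):
    """Follow hyperlinks breadth-first, starting from a seed URL.

    Returns the order in which pages were visited and, for each page,
    the set of pages that link to it (a simple inter-relatedness record).
    """
    visited = []
    inbound = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for target in web.get(url, []):
            # Record the link even if the target was already crawled.
            inbound.setdefault(target, set()).add(url)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, inbound


visited, inbound = crawl("a.example/", TOY_WEB)
```

The inbound-link map is one simple form of the document inter-relatedness data mentioned above.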


Document Repository

The document repository stores indexed web pages in a compressed format. Depending on the engine, this is either a complete document index (Google, for example, downloads and stores complete web pages) or a partial document index, which holds condensed representations of web pages built from the elements the search engine uses to determine a page's topic.
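
A minimal sketch of compressed page storage, assuming zlib compression and an in-memory dictionary keyed by document ID. The DocumentRepository class and its method names are illustrative only; the source does not say which compression scheme or storage layout any particular engine uses.

```python
import zlib


class DocumentRepository:
    """Toy repository holding complete pages as zlib-compressed bytes."""

    def __init__(self):
        self._store = {}  # doc_id -> compressed page bytes

    def put(self, doc_id, html):
        # Compress the full page text before storing it.
        self._store[doc_id] = zlib.compress(html.encode("utf-8"))

    def get(self, doc_id):
        # Decompress on read to recover the original page text.
        return zlib.decompress(self._store[doc_id]).decode("utf-8")

    def stored_size(self, doc_id):
        return len(self._store[doc_id])


repo = DocumentRepository()
page = "<html><body>" + "See Spot run. " * 200 + "</body></html>"
repo.put(1, page)
```

Reading the page back with repo.get(1) returns the original text, while the stored form takes far fewer bytes for repetitive markup like this.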


Document Indexer

The document indexer is the sorting and organization hub of the search index. It stores information about the web pages in the document repository, which are usually sorted according to an identification scheme specific to each search engine. The document indexer can double as a table of contents for the document repository, storing information such as each document's URL, unique ID, and size.
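
The table-of-contents role could look something like the sketch below, which assigns sequential integer IDs and keeps one record per URL. The DocumentRecord and DocumentIndex names, the fields, and the sequential ID scheme are assumptions for illustration; each engine uses its own identification scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRecord:
    doc_id: int
    url: str
    size_bytes: int


class DocumentIndex:
    """Toy table of contents for a document repository."""

    def __init__(self):
        self._by_id = {}      # doc_id -> DocumentRecord
        self._id_by_url = {}  # url -> doc_id

    def add(self, url, size_bytes):
        # Reuse the existing record if this URL is already indexed.
        if url in self._id_by_url:
            return self._by_id[self._id_by_url[url]]
        doc_id = len(self._by_id) + 1  # assumed scheme: sequential IDs
        record = DocumentRecord(doc_id, url, size_bytes)
        self._by_id[doc_id] = record
        self._id_by_url[url] = doc_id
        return record

    def lookup(self, doc_id):
        return self._by_id[doc_id]

    def sorted_records(self):
        # Records ordered by ID, mirroring a repository sorted by ID.
        return [self._by_id[i] for i in sorted(self._by_id)]


index = DocumentIndex()
first = index.add("http://example.com/", 2048)
second = index.add("http://example.com/about", 1024)
duplicate = index.add("http://example.com/", 2048)
```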


Inverted Index

The inverted index is a specialized document index that stores and organizes web pages by the words they contain. During indexing, each web page is parsed into word counts.


For example, the following statement:

"See Spot run. Spot likes to run with Jack."

would be represented in the inverted index as:

See=1, Spot=2, Run=2, Likes=1, To=1, With=1, Jack=1

The inverted index can also record additional information about each word, such as emphasis, capitalization, and placement within the page.
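
The word-count example above can be reproduced with a short sketch. It assumes a simple tokenizer that lowercases the text and splits on anything other than letters, digits, and apostrophes; the count_words and build_inverted_index names and the second sample document are hypothetical, and real engines use far more elaborate parsing.

```python
import re
from collections import Counter, defaultdict


def count_words(text):
    """Parse text into per-word counts (case-insensitive)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def build_inverted_index(docs):
    """Map each word to the documents containing it, with counts.

    docs is a dict of doc_id -> page text.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for word, count in count_words(text).items():
            index[word][doc_id] = count
    return index


counts = count_words("See Spot run. Spot likes to run with Jack.")

inverted = build_inverted_index({
    1: "See Spot run. Spot likes to run with Jack.",
    2: "Jack likes Spot.",  # hypothetical second document
})
```

Here counts holds the same tallies as the example (spot and run at 2, every other word at 1), and inverted["spot"] maps document 1 to a count of 2 and document 2 to a count of 1. Lowercasing discards capitalization, so an index that tracks capitalization or placement would need to store that alongside each count.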

