The Indexing Process Briefly
The URL server sends a URL to the spidering queue. A web robot then downloads the associated web page (or specific information from the web page) and sends it to the document repository.
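The hand-off from queue to robot to repository can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `crawl`, the `fetch` callable, and the in-memory `FAKE_WEB` dictionary are hypothetical stand-ins, and a real robot would download pages over the network and respect politeness rules.

```python
from collections import deque
from typing import Callable, List, Tuple


def crawl(seed_urls: List[str], fetch: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Drain a spidering queue and return (url, page) pairs bound for the repository."""
    queue = deque(seed_urls)   # the spidering queue filled by the URL server
    seen = set()               # assumption: each URL is downloaded at most once
    delivered = []
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        page = fetch(url)      # the web robot downloads the page
        delivered.append((url, page))
    return delivered


# Hypothetical in-memory stand-in for the web, so the sketch runs offline.
FAKE_WEB = {
    "http://example.com/a": "<html>A</html>",
    "http://example.com/b": "<html>B</html>",
}

pages = crawl(
    ["http://example.com/a", "http://example.com/b", "http://example.com/a"],
    FAKE_WEB.__getitem__,
)
```

Passing `fetch` in as a parameter keeps the queue logic separate from the download mechanism, so the same loop could be driven by a real HTTP client.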
The document repository compresses the web page for processing and storage and assigns it a unique identifier. The web page is then sent to the document indexer.
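A minimal sketch of such a repository follows. It assumes zlib compression and sequential integer identifiers; neither choice comes from the text above, and the class name `DocumentRepository` and its methods are hypothetical.

```python
import zlib
from typing import Dict


class DocumentRepository:
    """Stores compressed pages keyed by a unique document identifier."""

    def __init__(self) -> None:
        self._store: Dict[int, bytes] = {}
        self._next_id = 0

    def add(self, html: str) -> int:
        """Compress a page, store it, and return its newly assigned identifier."""
        doc_id = self._next_id          # assumption: identifiers are sequential integers
        self._next_id += 1
        self._store[doc_id] = zlib.compress(html.encode("utf-8"))
        return doc_id

    def get(self, doc_id: int) -> str:
        """Decompress and return the page stored under doc_id."""
        return zlib.decompress(self._store[doc_id]).decode("utf-8")


repo = DocumentRepository()
first_id = repo.add("<html><body>hello world</body></html>")
second_id = repo.add("<html><body>another page</body></html>")
```

The identifier returned by `add` is what would be passed on to the document indexer, which can later call `get` to read the page back.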
The document indexer sorts the web page's unique identifier into the existing list of documents and records location information so that the document can be accessed from the repository. All link information, including link text, is parsed from the web page and is usually stored in a separate index so that a citation ranking can be assigned to the web page.
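Link parsing of this kind can be illustrated with Python's standard `html.parser` module. The sketch below only collects (target, link text) pairs; the class name `LinkExtractor` and the helper `extract_links` are hypothetical, and how a given engine stores these pairs or derives a citation ranking from them is not specified here.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkExtractor(HTMLParser):
    """Collects (href, link text) pairs from the anchor tags of a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None   # target of the anchor currently open
        self._text: List[str] = []         # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> List[Tuple[str, str]]:
    """Return every (href, link text) pair found in the given HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(
    '<p>See <a href="http://example.com/x">the <b>X</b> page</a> and <a href="/y">Y</a>.</p>'
)
# links == [("http://example.com/x", "the X page"), ("/y", "Y")]
```

Keeping the link text alongside the target matters because the text describes the page being pointed at, which is why it is stored with the link information rather than discarded.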
The document indexer then parses the web page into word occurrences and packages this information for storage in the inverted index. Generally, the document index also stores information about the web page, such as size, term positions, and emphasis. The web page is assigned scores or weights based upon all of the above factors, according to the methodology of the particular search engine. Using the parsed and ranked term information, the document is organized into word identifiers, which are used to create the inverted index.
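To make the inverted index concrete, here is a small sketch that maps each term to the documents containing it, along with the term's positions in each document. The tokenizer, the function names, and the sample documents are assumptions for illustration; weighting by emphasis or other engine-specific factors is left out.

```python
import re
from collections import defaultdict
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Dict[int, List[int]]]:
    """Map each term to {document id: [positions of the term in that document]}."""
    index: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, term in enumerate(tokenize(docs[doc_id])):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical documents, already stripped of markup.
docs = {
    0: "web robots download web pages",
    1: "the index stores pages",
}
index = build_inverted_index(docs)
# index["web"]   == {0: [0, 3]}
# index["pages"] == {0: [4], 1: [3]}
```

The length of each position list gives the term's frequency in that document, which is one simple input a search engine might feed into its scoring.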






