How indexer walks through hypertext links
=========================================

When indexer tries to insert a new URL into the database, or is about
to index an existing one, it first checks whether this URL has a
corresponding "Server" or "Realm" command in indexer.conf. URLs without
a corresponding "Server" or "Realm" command are not indexed. By default,
URLs which are already in the database but have no Server/Realm command
are deleted from the database. This may happen, for example, after
removing some Server/Realm commands from indexer.conf.


"Server" command
----------------

This is the main command of the indexer.conf file. It is used to add
servers or their parts to be indexed. The format of the Server command
is:

  Server [subsection] <URL> [alias]

The "Server" command has a required "URL" parameter and two optional
parameters, "subsection" and "alias". Usage of the optional alias
parameter is covered in alias.txt.

For example, the command "Server http://localhost/" allows indexer to
index the whole http://localhost/ server. You can also specify a path
to index only a part of a server: "Server http://localhost/subsection/".
In both cases the command also tells indexer to insert the given URL
into the database at startup.

Note that you can suppress the insertion of the URL given in a Server
command by using the -q indexer command line argument. This is useful
when you have hundreds or thousands of Server commands and their URLs
are already in the database: indexer then starts up faster.


Checking that URL matches "Server" command
------------------------------------------

There are several ways in which indexer checks that a URL corresponds
to some Server command. Use the optional subsection parameter to specify
the checking behaviour for a server. The values of subsection are the
same as the arguments of the "Follow" command. The subsection value must
be one of page, path, site or world, and is "path" by default.
If subsection is not specified, the current "Follow" value is used.
So a single "Server site http://localhost/" command and the combination
of "Follow site" and "Server http://localhost/" have the same effect.

1) "path" subsection

When indexer looks for a "Server" command corresponding to a URL, it
checks that the discovered URL starts with the URL given in the Server
command argument, without its trailing file name. For example, if
"Server path http://localhost/path/to/index.html" is given, all URLs
which begin with "http://localhost/path/to/" correspond to this Server
command.

The commands

  Server path http://localhost/path/to/index.html
  Server path http://localhost/path/to/index
  Server path http://localhost/path/to/index.cgi?q=bla
  Server path http://localhost/path/to/index?q=bla

have the same effect, except that they insert different URLs into the
database.

2) "site" subsection

indexer checks that the discovered URL has the same hostname as the URL
given in the Server command. For example,
"Server site http://localhost/path/to/a.html" allows indexer to index
the whole http://localhost/ server.

3) "world" subsection

If the world subsection is specified in a Server command, any URL is
considered to match this Server command. See the explanation below.

4) "page" subsection

This subsection matches only the single URL given in the Server
argument.

5) subsection in news:// schema

Subsection is always considered to be "site" for the news:// URL schema.
This is because the news:// schema has no nested paths like ftp:// or
http://. Use "Server news://news.server.com/" to index the whole news
server or, for example, "Server news://news.server.com/udm" to index all
messages from the "udm" hierarchy.


Realm command
-------------

The Realm command is a more powerful way to describe the web area to be
indexed. The format of the Realm command is:

  Realm [String|Regex] [Match|NoMatch] <URLMask> [alias]

It works almost like the "Server" command but takes a regular expression
or a string with wildcards as its argument.
There are two comparison types in the Realm command. String wildcards
are the default comparison type. You can use the ? and * signs in the
URLMask parameter; they mean "one character" and "any number of
characters" respectively. For example, if you want to index all HTTP
sites in the .ru domain, use this command:

  Realm http://*.ru/*

The regex comparison type takes a regular expression as its argument.
Activate the regex comparison type using the "Regex" keyword. For
example, you can describe everything in the .ru domain using the regex
comparison type:

  Realm Regex ^http://.*\.ru/

The second optional argument is the match type. The possible values are
"Match" and "NoMatch", with "Match" as the default. "Realm NoMatch" has
the reverse effect: a URL that does not match the given URLMask
corresponds to this Realm command. For example, use this command to
index everything outside the .com domain:

  Realm NoMatch http://*.com/*

The optional "alias" argument allows very complicated URL rewrites, more
powerful than the other aliasing mechanisms. See alias.txt for an
explanation of "alias" argument usage. Alias works only with the "Regex"
comparison type and has no effect with the "String" type.


Realm and Follow commands
-------------------------

Since subsection only selects which part of the argument given in a
Server command is compared with a URL, the Realm command has no similar
optional subsection parameter. It would be useless with string
wildcards and regular expressions. Because of this, the "Follow" command
does not affect the "Realm" command. Imagine that you have:

  Follow path
  Realm http://localhost/*
  URL http://localhost/somepath/

If you add a URL such as http://localhost/somepath/ to the database,
either using the "URL" indexer.conf command given above or using
"indexer -i -u http://localhost/somepath/", indexer WILL follow any URL
outside the "/somepath/" directory of localhost if there is a link to it
from "/somepath/". "Follow path" has no effect if the Realm command is
used.
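
The Realm comparison rules described above can be illustrated with a
short Python sketch. This is not indexer's actual code. The function
names (wildcard_to_regex, realm_matches) are hypothetical, and two
behaviours are assumptions rather than facts from this document: a
string mask is taken to cover the whole URL, and comparison is taken to
be case sensitive.

```python
import re


def wildcard_to_regex(mask: str) -> str:
    """Translate a Realm string mask into a regular expression.

    '?' stands for one character and '*' for any number of characters,
    as described for the String comparison type.
    """
    parts = []
    for ch in mask:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def realm_matches(url: str, mask: str,
                  cmp_type: str = "String",
                  match_type: str = "Match") -> bool:
    """Return True if `url` corresponds to a Realm command (sketch only)."""
    if cmp_type == "Regex":
        # A regular expression may match anywhere unless it is anchored.
        hit = re.search(mask, url) is not None
    else:
        # Assumption: a string mask has to cover the whole URL.
        pattern = wildcard_to_regex(mask)
        hit = re.fullmatch(pattern, url, flags=re.DOTALL) is not None
    # "NoMatch" reverses the result of the comparison.
    return hit if match_type == "Match" else not hit
```

With this sketch, realm_matches("http://localhost/other/page.html",
"http://localhost/*") is true no matter which "Follow" value is in
effect, which mirrors the point made above: the mask alone decides
whether a URL belongs to a Realm.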

Using different parameters for a server and its subsections
-----------------------------------------------------------

indexer looks through "Server" and "Realm" commands in order of their
appearance. Thus, if you want to give different parameters to, for
example, a whole server and one of its subsections, you should put the
subsection line before the line for the whole server. Imagine that you
have a server subdirectory which contains news articles. Surely those
articles are to be reindexed more often than the rest of the server.
The following combination may be useful in such cases:

  # Add subsection
  Period 200000
  Server http://servername/news/

  # Add server
  Period 600000
  Server http://servername/

These commands give the /news/ subdirectory a different reindexing
period from that of the server as a whole. indexer will choose the first
"Server" record for http://servername/news/page1.html because it
matches and was given first.


Default indexer's behaviour
---------------------------

The default behaviour of indexer is to follow links which have a
corresponding Server/Realm command in the indexer.conf file. It also
jumps between servers if both of them are present in indexer.conf,
either directly in a Server command or indirectly in a Realm command.
For example, there are two Server commands:

  Server http://www/
  Server http://web/

When indexing http://www/page1.html, indexer WILL follow the link
http://web/page2.html if it is found. Note that these pages are on
different servers, but BOTH of them have a corresponding Server record.

If one of the Server commands is deleted, indexer will remove all
expired URLs of that server during the next reindexing.


Using "Follow world"
--------------------

The first way to change the default behaviour described above is to use
the "Follow world" indexer.conf command. indexer will walk through ANY
URLs it finds and will jump between different servers.
Theoretically, it will index the whole Internet in this case if there
are no hardware limits :-)

When the "Follow world" command is specified, indexer just adds one
server record with an empty start URL to memory while loading
indexer.conf. This empty server is used only when no other Server
record with a non-empty start URL matches.


Using "DeleteNoServer no"
-------------------------

The second way to change the default behaviour is to use the
"DeleteNoServer no" command. This command means that URLs which are
already in the database will not be deleted even if they have no
corresponding Server/Realm command. "DeleteNoServer no" is implemented
by adding one empty server, just like "Follow world". The difference
between the two commands is that with "DeleteNoServer no" indexer
follows links ONLY INSIDE servers and does not jump between different
servers. This allows indexing only those servers which are already in
the database, without following links to other servers.

Example of a command sequence:

  DeleteNoServer no
  Server http://www/
  Server http://web/

While indexing http://www/page1.html, indexer WILL follow the link
http://www/page2.html but DOES NOT follow the link
http://web/page2.html, because http://www/page1.html and
http://web/page2.html are on different servers.

Note that if you delete a URL from the list in url.txt while using the
"DeleteNoServer no" scheme, indexer WILL NOT delete the URLs of the same
server. Imagine that you have removed http://www/ from url.txt. To
remove all URLs of this server from the database you will have to run
"indexer -C -u http://www/%".


Realm *
-------

You may note that "Realm *" is something like "DeleteNoServer no".
Actually, it has almost the same effect as "DeleteNoServer no". The only
difference is that this command does allow indexer to jump between
servers.


Using "indexer -f"
------------------

The third scheme is very useful when running "indexer -i -f url.txt".
You may maintain the list of required servers in url.txt.
When a new URL is added to url.txt, indexer will index the server of
this URL during the next startup. If you are using "DeleteNoServer no",
it does not matter whether you have passed the root URL of the server
(http://www/) or one of its internal pages
(http://www/path/to/some/page.html). indexer will index the whole
server http://www/.
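
As a recap, the "Server" subsection checks and the first-match ordering
described earlier in this document can be sketched in Python. This is
only an illustration under stated assumptions, not indexer's
implementation: the names server_matches and find_server and the
dictionary layout of a server record are hypothetical, "site" is taken
to compare hostnames only, and the special handling of news:// URLs is
left out.

```python
from urllib.parse import urlsplit


def server_matches(url: str, server_url: str, subsection: str = "path") -> bool:
    """Check a URL against one Server command (sketch only)."""
    if subsection == "world":
        # Any URL is considered to match.
        return True
    if subsection == "page":
        # Only the single URL given in the Server argument matches.
        return url == server_url
    u, s = urlsplit(url), urlsplit(server_url)
    if subsection == "site":
        # Assumption: only the hostname is compared.
        return u.hostname == s.hostname
    if subsection == "path":
        # Drop the query string and the trailing file name, then
        # compare the remaining prefix with the start of the URL.
        directory = s.path[: s.path.rfind("/") + 1] or "/"
        prefix = f"{s.scheme}://{s.netloc}{directory}"
        return url.startswith(prefix)
    raise ValueError(f"unknown subsection: {subsection}")


def find_server(url: str, servers: list):
    """Return the first server record that matches, in config order."""
    for server in servers:
        if server_matches(url, server["url"], server.get("subsection", "path")):
            return server
    return None  # no Server command: by default the URL is not indexed


# Hypothetical records mirroring the Period example: subsection first.
SERVERS = [
    {"url": "http://servername/news/", "period": 200000},
    {"url": "http://servername/", "period": 600000},
]
```

With these records, find_server("http://servername/news/page1.html",
SERVERS) returns the /news/ record because it is listed first, while a
page elsewhere on the server falls through to the second record.
Reversing the list would make the whole-server record win for every
URL, which is why the subsection line has to come first.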