Zitku bots

Web crawler bots

Zitku employs several bots to keep its search operations running. The bots run in parallel and communicate with each other through a common database.

Crawler Bot
Crawler bot is an automated script that reads a database of “to-be-crawled” URLs, retrieves each URL and saves the page to a filesystem. The page is stored as is, without any compression. The bot obeys all robots.txt rules and indexes web pages at a slow pace across multiple servers, so that no single ISP or datacenter is overwhelmed by the crawler. The crawler is capable of reading

  • ROR files
  • Google Sitemaps and
  • Yahoo URL lists.
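As an illustration of the robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler's real implementation is not shown in this post; the user agent name "ZitkuBot", the sample rules and the helper build_rules are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

USER_AGENT = "ZitkuBot"  # hypothetical user agent name


def build_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_rules(ROBOTS_TXT)

# A polite crawler checks every URL before fetching it.
allowed = rules.can_fetch(USER_AGENT, "http://example.com/index.htm")
blocked = rules.can_fetch(USER_AGENT, "http://example.com/private/page.htm")

# The Crawl-delay value (in seconds) can be used to pace requests.
delay = rules.crawl_delay(USER_AGENT)
```

A crawler that respects the delay would wait that many seconds between two requests to the same host.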

Data storage
The filesystem is hierarchical and arranged alphabetically.
For instance, www.catabatic.co.in/index.htm is stored in
\netcopy\www\c\a\t\a\b\a\t\i\c\dot\co\dot\in\index.htm
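The mapping can be sketched in Python. The rule used here is inferred from the single example above (the first host label is kept whole, the second label becomes one directory per letter, and every later dot is written as a "dot" directory), so treat it as an assumption. The function name storage_path is also made up for this sketch, and the real layout may differ for other URLs.

```python
def storage_path(url, root="netcopy"):
    """Map a URL to a backslash-separated path, following the example above."""
    # Drop a scheme such as "http://" if one is present.
    address = url.split("://", 1)[-1]
    host, _, page = address.partition("/")
    labels = host.split(".")

    parts = [root, labels[0]]      # first label ("www") stays whole
    if len(labels) > 1:
        parts.extend(labels[1])    # second label: one directory per letter
    for label in labels[2:]:
        parts.extend(["dot", label])  # later dots become a "dot" directory
    if page:
        parts.extend(page.split("/"))

    return "\\" + "\\".join(parts)


example = storage_path("www.catabatic.co.in/index.htm")
```

With the URL from the example, the function returns the same path as the one shown above.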

Compressor Bot
Compressor Bot reads a central database to work out which sections of the filesystem require compression. The compression criterion is currently a minimum uncompressed size of 5 MB. The bot analyzes the type of content in the web pages and selects an appropriate compression method.
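The size criterion could be applied along these lines. This is only a sketch: it assumes the central database can supply the uncompressed size of each section in bytes, that 1 MB means 1024 × 1024 bytes, and the names used (sections_to_compress and the sample paths) are invented for the example.

```python
# Assumption: 5 MB is interpreted as 5 * 1024 * 1024 bytes.
MIN_UNCOMPRESSED_BYTES = 5 * 1024 * 1024


def sections_to_compress(section_sizes):
    """Return the sections whose uncompressed size meets the 5 MB minimum.

    section_sizes maps a section path to its uncompressed size in bytes.
    """
    return sorted(
        path
        for path, size in section_sizes.items()
        if size >= MIN_UNCOMPRESSED_BYTES
    )


# Hypothetical sizes, as they might be read from the central database.
sizes = {
    r"\netcopy\www\c": 12 * 1024 * 1024,
    r"\netcopy\www\d": 300 * 1024,
    r"\netcopy\www\e": 5 * 1024 * 1024,
}
selected = sections_to_compress(sizes)
```

Choosing the compression method per content type is left out, since the post does not say which methods are used.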

Indexer Bot
Indexer Bot reads a database of “ready-2b-parsed” URLs, extracts information from the compressed filesystem and stores it in the search database. The data from the pages is analyzed semantically and fed into the search database using predefined rules.

Archiver Bot
Archiver bot is a tool that extracts “older” versions of web pages and stores them in an archive filesystem. The archive filesystem is highly compressed.

Limitations of the Crawler bot

Spider traps
Crawler bot is presently unable to detect all spider traps and often indexes dummy pages, although the non-aggressive nature of the bot keeps the damage to a minimum. Junk data detection is not yet foolproof and only generates warnings. The warnings are based on a comparison of the website's traffic analysis reports with the pages that appear on the website. Generated warnings are manually inspected by editors, and trapping websites are marked as trappers.

URL redirects
302 temporary redirects are a known issue with the crawler and are being actively researched. With a 302 redirect, the redirecting page tells the search crawler that the target URL is temporary and that the real page is the redirector page. The target URL can be a third-party website.

Page Parser
Page parser is a collection of Perl scripts that read the content of a page and perform several analytical operations on it to extract and build an index for that page in a central database. The central database is a filesystem.

Readable content
Readable content is the content of a page that you see in the web browser. The page parser removes the HTML code of the page to extract the human-readable information. This information is then analyzed as follows:
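The tag-stripping step itself might look like the sketch below. The actual page parser is written in Perl and is not shown here; this Python version uses the standard html.parser module, and the names readable_text and _TextExtractor are invented for the example. Skipping script and style content is an assumption about what counts as "readable".

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the content of script and style tags."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def readable_text(html):
    """Return the human-readable text of a page with whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.chunks).split())


sample = "<h1>Hi</h1><p>there <b>you</b></p><script>x=1</script>"
text = readable_text(sample)
```

The extracted text is what the keyword analysis in the following sections operates on.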

Primary keywords
Primary keywords are the keywords that have the maximum weightage in a page. These keywords are generated from the page content and NOT from the keywords meta attribute.

Secondary keywords
Secondary keywords are all keywords except primary keywords and supportive words (such as conjunctions, articles etc.).

Tertiary keywords
Tertiary keywords are keywords semantically linked to primary keywords.

For instance, David Beckham is a tertiary keyword to soccer.

Spam keywords
Spam keywords are keywords that appear in the readable text far more often than normal. This is tracked on the basis of the size of the readable text and the number of times the keyword appears in it.
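A simple way to express this check is keyword density: the count of a keyword divided by the total number of words in the readable text. The sketch below is an illustration, not the parser's actual rule. It assumes that a keyword above 10% of the word count is spam (the upper bound mentioned later in this post), and the tiny list of supportive words is a placeholder.

```python
import re
from collections import Counter

# Placeholder list of supportive words; a real list would be much longer.
SUPPORTIVE_WORDS = {"a", "an", "and", "for", "of", "or", "the", "to", "we"}

# Assumption: a keyword above 10% of the total word count is treated as spam.
MAX_SHARE = 0.10


def keyword_density(text):
    """Map each word to its share of the total word count."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def spam_keywords(text, max_share=MAX_SHARE):
    """Return keywords whose density exceeds max_share, sorted by name."""
    return sorted(
        word
        for word, share in keyword_density(text).items()
        if share > max_share and word not in SUPPORTIVE_WORDS
    )


page_text = "cheap pills " * 5 + "we sell quality products online for everyone today"
flagged = spam_keywords(page_text)
```

In the sample text, "cheap" and "pills" each make up 5 of 18 words, so both are flagged, while the words that appear once are not.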

Links
Links are HTML pointers to another web page. The page parser can read the following types of links:

  • <A> links
  • <Iframes>
  • Image maps

Apart from self-pointing links, all other outbound links are categorized as

  • Internal links
  • External links
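The categorization could be sketched as follows with Python's standard urllib.parse module. It assumes that "internal" means the link target has the same host name as the page and that a link is self-pointing when it resolves to the page's own URL (ignoring any #fragment); the post does not spell out either rule, and classify_link is a name invented for the example.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def classify_link(page_url, href):
    """Classify a link found on page_url as 'self', 'internal' or 'external'."""
    # Resolve relative links against the page and drop any #fragment.
    target = urldefrag(urljoin(page_url, href)).url
    if target == urldefrag(page_url).url:
        return "self"
    # Assumption: same host name means the link is internal.
    if urlparse(target).netloc.lower() == urlparse(page_url).netloc.lower():
        return "internal"
    return "external"


page = "http://example.com/a.html"
kinds = [
    classify_link(page, "#top"),
    classify_link(page, "/b.html"),
    classify_link(page, "http://other.example.org/"),
]
```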

Internal links

  • Internal links are counted as links within the same
  • Votes are added for each internal link
  • Max 500 votes
  • Internal page priority votes (if defined)

External links

  • Votes are added to each external link
  • The votes carried by an external link are multiplied by the votes accumulated by the linking page, divided by the number of links that page contains
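One possible reading of these voting rules is sketched below. It assumes one vote per internal link, capped at 500, and that each external link contributes the votes of the linking page divided by the number of links on that page. The exact formula is not given in this post, so both functions and their names are illustrative only.

```python
MAX_INTERNAL_VOTES = 500


def internal_votes(internal_link_count):
    """One vote per internal link (an assumption), capped at 500 votes."""
    return min(internal_link_count, MAX_INTERNAL_VOTES)


def external_votes(linking_pages):
    """Sum the share of votes passed on by each externally linking page.

    linking_pages is a list of (votes_of_linking_page, links_on_that_page).
    """
    return sum(votes / links for votes, links in linking_pages if links > 0)


capped = internal_votes(620)
# Two linking pages: 100 votes spread over 4 links, 30 votes over 10 links.
received = external_votes([(100, 4), (30, 10)])
```

Under this reading, a link from a page that has many votes but few outbound links is worth more than a link from a page that links to everything.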

Determining important keywords for a web page

Keywords within the page
Zitku web crawler looks for the important keywords of a page, so that the page can be “represented” by a small set of keywords. These keywords are extracted as follows:

  • Keywords that appear within <h1>, <h2> and <h3> tags: These tags are likely to contain the most important keyphrases of a document.
  • Keywords that appear within <b> tags: These are collected as “secondary keywords” and are considered more important than regular text.
  • Keywords that are within 2% to 10% of the total word count of the page: Each page is analysed in the form of a word array that includes the occurrence count for each keyword, followed by a “weight” for its position. This array is then converted into a simple array marking each keyword as primary, secondary or spam.
  • Keywords within the URL of the page: Zitku web crawler reads the URL of the page to extract readable keywords and then compares them with the keyword array from the page content. A keyword present in the URL carries the same weight as one within an <h1> tag. The indexing bot treats a URL with more than 15 keywords as spam, in which case all keywords from the URL are rejected altogether.
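The URL rule in the last item can be sketched like this. The limit of 15 keywords comes from the text above; how the URL is split into keywords and which tokens are ignored (www, com, html and so on) are assumptions made for the example, as is the function name url_keywords.

```python
import re
from urllib.parse import urlparse

MAX_URL_KEYWORDS = 15

# Placeholder set of tokens that carry no meaning as keywords.
IGNORED_TOKENS = {"www", "com", "htm", "html", "php", "index"}


def url_keywords(url):
    """Extract readable keywords from a URL, or none if the URL looks spammy."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc} {parsed.path}".lower()
    words = [word for word in re.split(r"[^a-z0-9]+", raw) if word]
    words = [word for word in words if word not in IGNORED_TOKENS]
    # More than 15 keywords: reject every keyword from this URL.
    if len(words) > MAX_URL_KEYWORDS:
        return []
    return words


normal = url_keywords("http://www.example.com/soccer-news/world-cup.html")
stuffed = url_keywords("http://example.com/" + "-".join(f"k{i}" for i in range(16)))
```

The surviving keywords would then be compared with the keyword array built from the page content.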

Keywords outside the page

Keywords contained in links to a page are probably the best way to describe a page, and Zitku web crawler gives these keywords the same weightage as keywords in the URL or the <h1> tags.

Internal links: Link text that appears on internal links. Same importance as <h2> tags.

External links: Link text that appears on external links. Same importance as <h1> tags.
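The equivalences described above (URL and external link text weigh the same as <h1>, internal link text the same as <h2>) can be written as a small weight table. The numeric values 10 and 6 are placeholders chosen for the example; this post only states which sources are equal, not what the weights are.

```python
# Placeholder values: only the equalities between sources come from the text.
H1_WEIGHT = 10
H2_WEIGHT = 6

SOURCE_WEIGHT = {
    "h1": H1_WEIGHT,
    "url": H1_WEIGHT,              # same weight as <h1>
    "external_anchor": H1_WEIGHT,  # same importance as <h1>
    "h2": H2_WEIGHT,
    "internal_anchor": H2_WEIGHT,  # same importance as <h2>
}


def score_keywords(occurrences):
    """Add up the weight of every (keyword, source) occurrence."""
    scores = {}
    for keyword, source in occurrences:
        scores[keyword] = scores.get(keyword, 0) + SOURCE_WEIGHT[source]
    return scores


scores = score_keywords([
    ("soccer", "h1"),
    ("soccer", "internal_anchor"),
    ("news", "url"),
])
```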

Traffic
Traffic is analysed using the Zitku tracker. This data is taken into account only for websites where the tracker is installed. Websites that do not have the tracker installed are given an average rating on this vertical.

Unique visitors
Unique visitors are tracked using their IP address and other HTTP friendly information.

Bot filtering
The tracker is able to distinguish between human visitors and bot visitors. Bot visits and crawling, whether by the Zitku crawler or a third-party crawler, are kept separately from the main traffic results.

Track superficial visits
Zitku tracker uses IP address, speed of access and duration of access to determine whether a visit is a human visit, a bot visit or a visit performed using an automated tool to archive the website.

Regional visitors
The tracker uses IP geolocation to determine where the visitors of a website come from. This data is collated to produce search results for other visitors from the same regions. A web page popular among users in Paris has a higher weightage for other Paris users than for users from India.

Uptime Monitor Bot (Proctor)

Uptime votes
Zitku employs an uptime monitor bot to track the uptime of websites. Websites with the highest uptimes are given more votes than websites that disappear often.

Downtime negative votes

  • Every downtime recorded for a website is counted as 100 negative votes.

Uptime error prevention
Proctor uses several known “always online” resources to establish the credibility of its own network connectivity before assigning negative votes to other pages.
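The connectivity check can be sketched as below. The penalty of 100 negative votes comes from the text; requiring every reference resource to respond before a penalty is assigned is an assumption, and the reference checks are passed in as plain functions so that the sketch does not depend on any particular network library. The name downtime_votes is invented for the example.

```python
DOWNTIME_PENALTY = -100


def downtime_votes(site_is_up, reference_checks):
    """Return the votes to record for one uptime probe of a website.

    reference_checks is a list of functions that return True when a known
    "always online" resource responds. If any of them fails, Proctor's own
    connectivity is in doubt and no penalty is assigned (an assumption).
    """
    if not reference_checks or not all(check() for check in reference_checks):
        return 0
    return 0 if site_is_up else DOWNTIME_PENALTY


# Stand-ins for real probes of "always online" resources.
def reachable():
    return True


def unreachable():
    return False


penalty = downtime_votes(False, [reachable, reachable])
skipped = downtime_votes(False, [reachable, unreachable])
```

When one of the reference resources does not respond, the downtime is ignored rather than blamed on the monitored website.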

2 Responses to “Zitku bots”

  1. Eric J. Says:

    Are these bots open source? Can I please get a link to download them if they are open source? If they are not open source, do you have any plans to make them open source? Where can I find more details on Zitku?

    I am looking for some help with search engine crawlers for my university project. If the authors can spare some time to talk with me about the above bots, I’ll really appreciate it and of course publish your names in my report.

    thanks,
    –Eric.

  2. Amit Soni Says:

    Hello Eric,

    The bots are currently not OpenSource and may become OpenSource in future (subject to approval from sponsors), they will become open source within the next year.

    I’ll be glad to assist you in your report as long as my work does not conflict with our NDA agreements with the sponsors.

    –Amit.
