Archive for April, 2007

What can I do as a COTE?

Wednesday, April 25th, 2007

You can do a lot. COTE is about humanity and about how to run the world. Imagine yourself as the king of the entire world, then suggest a strategy to improve one aspect of the world, the way you would do it. That’s it! That’s your first contribution.

Important: Please note that the moment you make a suggestion, many other COTEs will attack it with criticism, find faults, complain and crib about just about anything. Don’t feel disheartened or unwanted. COTE is a very open-minded community in which all ideas are subject to debate and no single individual is important. Please leave your ego at home while contributing to the COTE project. Start by posting comments on a COTE article; eventually, you might be invited to become an author for COTE.

Quick SEO guidelines

Friday, April 13th, 2007

The SEO checklist is a “do it quick” method of determining if your website is “ready” to shoot up in the search results for prominent keywords and phrases.

The checklist includes a list of freely available tools to ensure that your website is free from logical and technical mistakes that can prevent it from achieving top ranks in search engines.

The intended audience of this article includes

  • SEO specialists
  • Website owners
  • Freelance search engine marketeers
  • Students

Checklist for individual pages

SEO using Page URL

Page URL (Uniform Resource Locator) is the address of a page. An internet user can access a page only through its URL, in one of the following ways:

  • By typing the URL in the web browser
  • By clicking the URL link on another page

Both of the above methods are recognized by search engines, directly or indirectly.

By typing the URL in the web browser directly

A page visit of this kind might be recognized by search engines only if

  • A search engine toolbar is installed in the browser, such as the Google Toolbar, Yahoo Toolbar or Alexa Toolbar.
  • You use a tool that is provided by a search engine, such as Google Desktop, Google Web Accelerator, Yahoo Messenger or Gtalk.

If any of the above tools require you to log in, the search engine might use your identity to match it against the information that you have been accessing.

  • The web page itself has an embedded link to a search engine analytical tool such as Google Analytics, Yahoo Badge or Zitku tracker.

SEO on Page URL Do’s

  • Include keywords in URL domain
  • Include keywords in URL path
  • Separate multiple keywords using the “-” (hyphen) character.
  • Phrases in paths and URLs are better than individual keywords put together.

SEO on Page URL Don’ts

  • Do not overuse keywords in the URL or path.
  • Do not use too many “-” hyphens, or the URL may be marked as spam.
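
As a rough illustration of the do’s and don’ts above, the sketch below turns a phrase into a hyphen-separated URL path segment. The function name keyword_slug and the cap of five words are assumptions made for this example; they are not a rule taken from any search engine.

```python
import re

def keyword_slug(phrase, max_words=5):
    """Build a hyphen-separated path segment from a phrase (illustrative only)."""
    # Lowercase the phrase and keep only runs of letters and digits.
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    # Cap the word count (an assumed limit) so the URL is not stuffed
    # with keywords or hyphens.
    return "-".join(words[:max_words])

print(keyword_slug("Quick SEO Guidelines"))  # quick-seo-guidelines
```

Keeping the words in their original order preserves the phrase, which matches the advice that phrases work better than individual keywords put together.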

Page Title SEO guidelines

The page title is the text that appears in the title bar of the web browser. The following is a checklist for page titles.

Page title SEO Do’s

  • Page title should include keywords.
  • Keywords should NOT be overused in the title.
  • Each page of the website should have a unique title most appropriate for that page.

Page title SEO Don’ts

  • Don’t overuse keywords in the title. Too many keywords diminish the importance of the real keywords.
  • Don’t repeat keywords in the title. Keep the title “readable” for humans.

Page Meta tags

Page meta is the internal code of the page, which includes two important tags: the Keyword tag and the Description tag. Even though some search engines DO NOT use these tags, many still do.

Page meta tags guidelines

  • Keyword Meta: Keyword meta is your opportunity to list the important keywords of your webpage. Note that adding too many keywords to this meta is likely to reduce their importance. Include only significant keywords in this attribute.
  • Description Meta: Description meta is a tag that should be added to the page to include a description of the page in less than 200 characters.
    * Note that the description should describe the page and not the entire website.
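
The title and meta guidelines above can be checked mechanically. The following is a minimal sketch using Python’s standard html.parser module; the names HeadAudit and audit_head are invented for this example, and the only checks are the ones named in this checklist (a title is present, a description meta under 200 characters, a keywords meta).

```python
from html.parser import HTMLParser

class HeadAudit(HTMLParser):
    """Collects the <title> text and named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

def audit_head(html):
    """Return a list of problems found in the page title and meta tags."""
    parser = HeadAudit()
    parser.feed(html)
    parser.close()
    problems = []
    if not parser.title.strip():
        problems.append("missing title")
    description = parser.meta.get("description", "")
    if not description:
        problems.append("missing description meta")
    elif len(description) >= 200:
        problems.append("description meta is 200 characters or longer")
    if "keywords" not in parser.meta:
        problems.append("missing keywords meta")
    return problems
```

An empty list means the page passed these particular checks; it says nothing about whether the title and description are actually well written.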

Page Content SEO guidelines

Page content is the actual webpage that is shown in the web browser and is visible to the visitor. The page content is the most important section of a page and should be handled with care.

SEO for Page Content Do’s

  • Write original text.
  • Keyword density should not be less than 5% of the total text.
  • Keywords/keyphrases should appear in <h1>, <h2> and <h3> tags.
  • Use important keywords near the top of the page (in the HTML).
  • Try to anticipate the phrase that the search engine user is likely to search for, and use that phrase in the same order in your text.
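
Keyword density, as used in the list above, is simply the share of words on the page that match a keyword. A small sketch of that arithmetic follows; the function name keyword_density is an assumption, and the example handles single-word keywords only, not phrases.

```python
import re

def keyword_density(text, keyword):
    """Return the percentage of words in `text` that equal `keyword`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)

sample = "seo tips and more seo advice for seo beginners today"
# 3 of the 10 words are "seo", so this prints 30.0
print(keyword_density(sample, "seo"))
```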

SEO for Page Content Dont’s

  • Don’t use too much JavaScript.
  • Don’t use “large” images at the top of a page.
  • Don’t use animations.
  • Don’t use Flash for the entire content. Flash is mostly unreadable by search engines.
  • Don’t use text in the form of images.

SEO for Images on the website Do’s

  • Always use “Alt text” as well as “title” attribute to describe the photograph. “Alt text” alone is not sufficient.
  • Image file names should be descriptive (apply Page URL + Path do’s and don’ts)
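
To find images that lack the attributes recommended above, a page can be scanned for <img> tags. This is a minimal sketch with Python’s standard html.parser; the names ImageAudit and images_missing_text are invented for the example.

```python
from html.parser import HTMLParser

class ImageAudit(HTMLParser):
    """Records every <img> that lacks a non-empty alt or title attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []  # list of (src, [names of missing attributes])

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        absent = [name for name in ("alt", "title") if not attrs.get(name)]
        if absent:
            self.missing.append((attrs.get("src"), absent))

def images_missing_text(html):
    audit = ImageAudit()
    audit.feed(html)
    audit.close()
    return audit.missing
```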

SEO for Images on the website Don’ts

  • Do not put too much text or important keywords in images. They cannot be read by search engines.
  • Do not use images to make text invisible, such as black text on a “plain” black background.

One sport event in many stadiums - LIVE !

Thursday, April 12th, 2007

Sports events have been one of the most successful crowd pullers in the history of this planet.

A soccer match between renowned rivals can attract over 50,000 people in a single stadium. Millions watch it at home.

The same is true of baseball and cricket. Tickets get sold out within a matter of hours and several million unfortunates are forced to watch the match on TV!

What if there was a way to let everyone watch the same event LIVE in a stadium ?

Read on for how it can be done !

A friend of mine was unable to procure tickets for the most popular cricket match in the world: the India vs Pakistan World Cup 07 cricket match. He came around cribbing, and I thought perhaps someone could do something about it. I couldn’t get him a ticket, but I did write this article to outline how it could be achieved a few years from now.

The solution lies in holographic projectors and specialized cinematography.

A special set of cameras captures a live event in a stadium, and the video feed is transmitted to stadiums all across the world. The feed is picked up by giant holographic projectors, which are then used to project the event live in those stadiums!

If the above is implemented, a lot of people have a lot to gain !

  • The ticket collections will multiply by several hundred times (if not thousands).
  • The advertising space will be multiplied.
  • More people will be able to watch the event in stadiums.

Challenges

  1. No one has made a giant holographic projector as yet.
  2. Keeping direct sunlight out of the projection stadiums.
  3. 3D cameras have not yet been invented, except for a few successful implementations with stills.

Solutions [stub]

Giant Holographic projectors

Holographic projectors are currently being used to display small products in showrooms etc.

What if it were possible to hold the match in one stadium and then reproduce an exact holographic presentation of the original in a hundred other stadiums?

Coming soon…

  • Time slicing technique for holography.

Zitku bots

Thursday, April 12th, 2007

Web crawler bots

Zitku employs several bots to keep its search operations functioning. These bots run in parallel and communicate with each other using a common database.

Crawler Bot
Crawler bot is an automated script that reads a database of “to-be-crawled” URLs, extracts each URL and saves the page into a filesystem. The page is stored as it is, without any compression. The bot obeys all rules of robots.txt and indexes web pages at a slow pace on multiple servers, so that no single ISP or datacenter is overwhelmed by the crawler. The crawler is capable of reading

  • ROR files
  • Google Sitemaps and
  • Yahoo Url lists.
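
Obeying robots.txt, as described above, can be sketched with Python’s standard urllib.robotparser module. This is only an illustration and not Zitku’s actual crawler code: the function name allowed_urls and the user agent “ExampleCrawler” are assumptions, and the rules are parsed from a string so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser

def allowed_urls(robots_txt, urls, user_agent="ExampleCrawler"):
    """Return only the URLs that the given robots.txt text permits."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in urls if rules.can_fetch(user_agent, url)]

robots = "User-agent: *\nDisallow: /private/\n"
queue = [
    "http://www.example.com/index.htm",
    "http://www.example.com/private/a.htm",
]
print(allowed_urls(robots, queue))  # only the index.htm URL remains
```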

Data storage
The filesystem is a hierarchical, alphabetically arranged file system.
For instance www.catabatic.co.in/index.htm is stored in
\netcopy\www\c\a\t\a\b\a\t\i\c\dot\co\dot\in\index.htm
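
The mapping rule is not spelled out, so the sketch below infers one from the single example above: the first host label is kept whole, the second label is split into one directory per character, and each remaining label is preceded by a “dot” directory. The function name storage_path and this inferred rule are assumptions; the real bot may behave differently for other URLs.

```python
def storage_path(url, root="\\netcopy"):
    """Map a scheme-less URL to a storage path, following the one example given."""
    host, _, path = url.partition("/")
    labels = host.split(".")
    parts = [root, labels[0]]              # e.g. "www" is kept whole
    if len(labels) > 1:
        parts.extend(labels[1])            # one directory per character
    for label in labels[2:]:
        parts.extend(["dot", label])       # ".co" becomes "dot\co"
    parts.extend(segment for segment in path.split("/") if segment)
    return "\\".join(parts)

print(storage_path("www.catabatic.co.in/index.htm"))
```

Splitting the name into single-character directories keeps each directory small, which is presumably the point of the alphabetical arrangement.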

Compressor Bot
Compressor Bot reads a central database to calculate which sections of the file system require compression. The compression criterion is currently defined as a minimum uncompressed size of 5MB. The compressor analyzes the type of content in the webpages and selects the most appropriate compression method.

Indexer Bot
Indexer Bot reads a database of “ready-2b-parsed” URLs, extracts information from the compressed filesystem and then stores it in the search database. The data from the pages is analyzed semantically and fed into the search database using pre-defined rules.

Archiver Bot
Archiver bot is a tool that extracts “older” versions of the webpages and stores them in an archive filesystem. The archive filesystem is a highly compressed filesystem.

Limitations of the Crawler bot

Spider traps
Crawler bot is presently unable to detect all spider traps and often indexes dummy pages, although the non-aggressive nature of the bot allows it to minimize the damage. Junk data detection is currently not foolproof and only generates warnings. The warnings are generated by comparing the traffic analysis reports of the website against the pages that appear in the website. Generated warnings are manually inspected by editors, and trapping websites are marked as trappers.

URL redirects
302 temporary redirects are currently a known issue with the crawler and are being aggressively researched. With a 302 URL redirect, the redirecting page advises the search crawler that the target URL is a temporary URL and that the real page is the redirector page. The target URL can be a third-party website.

Page Parser
Page parser is a collection of Perl scripts that read the content of a page and perform several analytical operations on it to extract and build an index for that page in a central database. The central database is a filesystem.

Readable content
Readable content is the content of a page that you see in the web browser. The page parser deletes the HTML code of the page to extract human-readable information from the page. This information is then analyzed as follows:

Primary keywords
Primary keywords are those keywords that have the maximum weightage in a page. These keywords are generated from the page content and NOT from the keywords meta attribute.

Secondary keywords
Secondary keywords are all keywords except primary keywords and supportive words (such as conjunctions, articles etc.)

Tertiary keywords
Tertiary keywords are keywords that are semantically linked to primary keywords.

For instance, David Beckham is a tertiary keyword to soccer.

Spam keywords
Spam keywords are those keywords that are over-present in a readable text as compared to normal. This tracking is based on the size of the readable text and the count of the keyword appearing in the content.
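
The first two steps described above, stripping the HTML to get readable content and then counting keywords without supportive words, can be sketched as follows. This is a Python stand-in written for illustration, not the Perl scripts the page parser actually uses; the names ReadableText, readable_content and keyword_counts are invented, and the SUPPORTIVE_WORDS list is a deliberately tiny assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, very short list of "supportive" words (articles, conjunctions).
SUPPORTIVE_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in"}

class ReadableText(HTMLParser):
    """Keeps the text a visitor would see; drops tags, scripts and styles."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def readable_content(html):
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

def keyword_counts(text):
    """Count every word in the text except the supportive words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(w for w in words if w not in SUPPORTIVE_WORDS)
```

The resulting counts, together with the total size of the readable text, are the two inputs that the spam keyword tracking above relies on.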

Links
Links are HTML pointers to another webpage. The page parser can recognize the following types of links

  • <A> links
  • <Iframes>
  • Image maps

Apart from self pointing links all other outbound links are categorized as

  • Internal links
  • External links
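
The three categories (self-pointing, internal, external) can be told apart by resolving each link against the page URL and comparing hosts. The sketch below uses Python’s standard urllib.parse; the function name classify_link is invented, and treating “same host name” as “internal” is an assumption about how Zitku draws the line.

```python
from urllib.parse import urljoin, urlparse

def classify_link(page_url, href):
    """Classify a link found on page_url as 'self', 'internal' or 'external'."""
    target = urljoin(page_url, href)          # resolve relative links
    page, link = urlparse(page_url), urlparse(target)
    # Ignore the #fragment when checking whether a link points at its own page.
    if link._replace(fragment="") == page._replace(fragment=""):
        return "self"
    if link.netloc.lower() == page.netloc.lower():
        return "internal"
    return "external"

page = "http://www.example.com/a.htm"
print(classify_link(page, "#top"))                       # self
print(classify_link(page, "/b.htm"))                     # internal
print(classify_link(page, "http://other.example.org/"))  # external
```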

Internal links

  • Internal links are counted as links within the same website
  • Votes are added for each internal link
  • Max 500 votes
  • Internal page priority votes (if defined)

External links

  • Votes are added to each external link
  • External link votes are multiplied by votes accumulated by them divided by the links that page contains
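
The voting rules above are terse, so the following sketch is one possible reading rather than Zitku’s actual formula. It assumes one vote per internal link up to the stated cap of 500, and it assumes an external link passes on the linking page’s accumulated votes divided by the number of links that page contains. The function names and the votes_per_link parameter are invented for the example.

```python
MAX_INTERNAL_VOTES = 500  # cap stated in the list above

def internal_votes(internal_link_count, votes_per_link=1):
    """Votes from internal links; one vote per link is an assumption."""
    return min(internal_link_count * votes_per_link, MAX_INTERNAL_VOTES)

def external_vote(linking_page_votes, linking_page_link_count):
    """Vote passed on by one external link, under the reading described above."""
    if linking_page_link_count <= 0:
        return 0.0
    return linking_page_votes / linking_page_link_count

print(internal_votes(1200))    # 500, because of the cap
print(external_vote(200, 40))  # 5.0
```

Under this reading, a link from a page with many outbound links is worth less than a link from an equally voted page with few links.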

Determining important keywords for a web page

Keywords within the page
Zitku web crawler looks for important keywords of a page that can be used to “represent” the page in only a set of keywords. These keywords are extracted as follows

  • Keywords that appear within <h1>,<h2> and <h3> tags: These tags are likely to contain the most important keyphrases of a document.
  • Keywords that appear within <b> tags: These are collected as “secondary keywords” and are considered more important as compared to regular text.
  • Keywords that are within 2% to 10% of the total word count of the page: Each page is analysed in the form of a word array including the occurrence count for each keyword, followed by a “weight” of the position. This array is then converted into a simple array marking a keyword as primary, secondary or spam.
  • Keywords within the URL of the page: Zitku web crawler reads the keywords present in the URL of the page to extract readable keywords and then compares them with the keyword array from the page content. The weight of a keyword that is present in the URL is the same as that of one within an <h1> tag. The web crawler indexing bot is able to detect spam if the URL has more than 15 keywords, in which case all keywords from the URL are rejected altogether.
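
Two of the rules above lend themselves to a short sketch: rejecting URL keywords when there are more than 15 of them, and labelling keywords by their share of the total word count. The article does not say exactly how the 2% to 10% band maps to the three labels, so the mapping used here (above 10% is spam, 2% to 10% is primary, below 2% is secondary) is an assumption, as are the function names and the small URL_NOISE set. Position weights are left out.

```python
import re
from urllib.parse import urlparse

MAX_URL_KEYWORDS = 15               # limit stated in the list above
URL_NOISE = {"www", "htm", "html"}  # assumption: tokens ignored in this sketch

def url_keywords(url):
    """Extract readable keywords from a URL; reject them all if there are too many."""
    parts = urlparse(url)
    words = re.findall(r"[a-z0-9]+", (parts.netloc + parts.path).lower())
    words = [w for w in words if w not in URL_NOISE]
    return [] if len(words) > MAX_URL_KEYWORDS else words

def classify_keywords(counts, total_words, low=2.0, high=10.0):
    """Label each keyword by its percentage share of the total word count."""
    if total_words <= 0:
        return {}
    labels = {}
    for word, count in counts.items():
        share = 100.0 * count / total_words
        if share > high:
            labels[word] = "spam"
        elif share >= low:
            labels[word] = "primary"
        else:
            labels[word] = "secondary"
    return labels
```

For example, with 100 words on the page, a keyword that occurs 6 times would be labelled primary and one that occurs 30 times would be labelled spam under these assumed thresholds.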

Keywords outside the page

Keywords contained in links to a page are probably the best way to describe a page, and Zitku web crawler acknowledges these keywords with the same weightage as is given to keywords in the URL or the <h1> tags.

Internal links: Link text that appears on internal links. Same importance as <h2> tags.

External links: Link text that appears on external links. Same importance as <h1> tags.

Traffic
Traffic is analysed using the Zitku tracker. This data is taken into account only for those websites where the tracker is installed. Websites that do not have the tracker installed are given an average rating on this vertical.

Unique visitors
Unique visitors are tracked using their IP address and other HTTP friendly information.

Bot filtering
The tracker is able to distinguish between human visitors and bot visitors. Bot visits and crawling, whether by the Zitku crawler or a third-party crawler, are kept separately from the main traffic results.

Track superficial visits
Zitku tracker uses IP address, speed of access and duration of access to determine whether the visit is a human visit, a bot visit or a visit performed using an automated tool to archive the website.

Regional visitors
The tracker uses IP geolocation to determine where the visitors of a website come from. This data is collated to produce search results for other visitors from the same regions. A web page popular among users in Paris has a higher weightage for other Paris users than for users from India.

Uptime Monitor Bot (Proctor)

Uptime votes
Zitku employs an uptime monitor bot to track the uptime of websites. Websites with the highest uptimes are given more votes than those websites which disappear often.

Downtime negative votes

  • Every downtime recorded for a website is counted as 100 negative votes.

Uptime error prevention
Proctor uses several known “always online” resources to establish the credibility of its own network connectivity before assigning negative votes to other pages.
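
The downtime rule and the connectivity check can be sketched together. In this example the probe function is passed in by the caller so that no real network access is needed; the function name downtime_votes and the choice to require every “always online” reference to be reachable before penalising a site are assumptions, not Proctor’s documented behaviour.

```python
DOWNTIME_PENALTY = -100  # votes per recorded downtime, as stated above

def downtime_votes(site, references, probe):
    """Return the vote change for one check of `site`.

    `probe` is any callable that takes a URL and returns True when the
    URL is reachable. `references` lists the "always online" resources.
    """
    if probe(site):
        return 0
    # If a reference is unreachable too, our own connection is suspect,
    # so the site is not penalised (assumed policy).
    if not all(probe(ref) for ref in references):
        return 0
    return DOWNTIME_PENALTY

reachable = {"http://always-online.example/"}
probe = lambda url: url in reachable
print(downtime_votes("http://down.example/", ["http://always-online.example/"], probe))  # -100
```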

Zitku - the new search engine

Thursday, April 12th, 2007

Zitku is a search engine. It uses data from the world’s two largest community-edited projects, DMOZ and Wikipedia.

DMOZ is the world’s largest human-edited directory of websites. Wikipedia is the world’s largest human-edited encyclopedia. Zitku combines the resources of these two data repositories to create accurate and spam-free search results.

In addition to the two repositories, Zitku does have its own set of software infrastructure to support crawling, storage, compression and search of the “entire internet”. Yes you read it correctly, Zitku is designed to store multiple “copies” of the entire publicly accessible internet.

Zitku employs several bots to perform the tedious tasks of storing the internet. The crawler visits every webpage on the internet and then stores them in an ever expanding distributed filesystem.

Each stored page is parsed for keywords and readable content which is then sorted and important sections are put into a highly compressed search repository. When a user performs a search the results are produced from this repository.

The search engine, currently in its alpha stages of development, is inventing methods to perform large-scale data processing with minimal hardware requirements.

Zitku the amalgamation of all directories in one

Thursday, April 12th, 2007

Zitku is a revival strategy for all open web directories that have either gone dead or are unable to cope with the growth of the internet. How does it unite the efforts of all editors across all open directories, including the mighty DMOZ?


New search engine in town

Thursday, April 12th, 2007

Zitku is a search engine that produces search results from the world’s two largest human-edited databases, DMOZ and Wikipedia. In addition, Zitku runs a background crawler on all websites listed in DMOZ to perform free-text searches.

Zitku tracker is a utility that webmasters can use to view analytical data about their website traffic and popularity.
