Archive for the ‘Search Engines’ Category

Automatic Semantic Link Builder

Tuesday, May 8th, 2007

Semantically most useful links for your webpage and others - Automatically !

The Automatic Semantic Link Builder (ASLB) is a very complex algorithm that is very simple to use. The script can be installed in your .net, .php or plain HTML pages and left to function on its own.

The script will generate a list of URLs on your page that will be visible to your visitors as well as to the search engine crawlers that crawl your page. So does that mean that the script will drive traffic out of your website? Yes, it will.

But copies of this script will also bring in visitors and improve the ranking of your page.

How does it work ?

The ASLB front script will publish links on your website to other pages on the internet. You will be allowed to prevent certain links (such as those of your direct competition) from showing on your website. The links to your pages will be shown on third-party website pages that contain data relevant to your webpage’s content.

What happens at the backend ?

The main ASLB servers continuously distribute links for websites and decide which links to show where. While doing this, the following rules are applied.

  1. No reciprocal links are generated.
  2. Inbound is always equal to outbound.
  3. PageRanks of involved pages play a part in the inbound:outbound ratios.
  4. Links remain where they are for a minimum of 30 days, unless removed for spamming.
  5. Only a single ASLB script can be installed in a webpage.
  6. HTML header tags play an important role.

Can I use ASLB on my website ?

No. Not yet, but soon. Bookmark this page. The Automatic Semantic Link Builder is currently in its alpha stages of development and is being tested on over 500,000 webpages. The ASLB is expected to be released in the month of March, 2008 for public installations.

Is there a manual way to do such linking ?

Yes, let an SEO company help you.

Reciprocal Link Building

Friday, May 4th, 2007

Reciprocal Link Building is the process of exchanging links between your website and other websites on the internet that focus on the same subject as yours.

Why do you need to do Reciprocal Link Building ?

  • Increases visibility and reach of your website on the internet for human visitors as well as search engine crawlers.
  • Boosts rankings in search engines, which consider your website important if many websites on the internet point to your web pages.
  • These links drive traffic to your website from those websites that place your link.

How to do Reciprocal Link Building ?

  1. Prepare a list of those keywords that you think “your website visitors” will search on search engines.
  2. Manually perform these searches on the internet and note down the list of websites that appear within the top 500 results on popular search engines.
  3. Create links to these websites within your own website on relevant pages.
  4. Write to the webmasters of the websites that you have linked to and request them to put a link to your website on their website.
    1. If they agree to put a link to your website, thank them and move on.
    2. Otherwise add a rel="nofollow" attribute to their link. ;o)
    3. Write to the next webmaster!

Pitfalls and things you need to be aware of

  1. Do not exchange links with “link farms”. Link farms are websites where most pages contain more than 100 links each. These websites usually do not contain much content of their own; instead, they are link repositories. Exchanging links with such websites will cause search engines to penalize your website for participating in link spamming.
  2. Concentrate only on those websites that focus on the same subject as your website. Links from unrelated websites will not do any good.
  3. Understand the difference between web pages and a website. A link exchange is done with a web page, not with a website.
  4. Keep checking the websites with which you have exchanged links. If your link has disappeared from a website, write to the webmaster to restore it!
  5. Total reciprocal links should not be more than 5% of total outbound links from your website.
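
Some of these checks can be scripted. The following is a minimal sketch in Python (standard library only) that counts the links on a partner page, flags a possible link farm using the 100-link figure from point 1, and reports whether the page still links back to you. The names LinkAudit and audit_page and the example domain are hypothetical and are used only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkAudit(HTMLParser):
    """Collects the href and rel values of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []  # list of (href, rel) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if attrs.get("href"):
                self.links.append((attrs["href"], attrs.get("rel") or ""))


def audit_page(html, my_domain, farm_threshold=100):
    """Summarise a partner page: link count, link-farm flag, backlink status."""
    parser = LinkAudit()
    parser.feed(html)
    # Links whose host name matches our own domain count as backlinks.
    backlinks = [(href, rel) for href, rel in parser.links
                 if urlparse(href).netloc.lower() == my_domain]
    return {
        "total_links": len(parser.links),
        "looks_like_link_farm": len(parser.links) > farm_threshold,
        "links_to_me": bool(backlinks),
        "nofollow": any("nofollow" in rel.lower() for _, rel in backlinks),
    }


page = '<p><a href="http://www.example.com/">Example</a> <a href="/about">About</a></p>'
report = audit_page(page, "www.example.com")
```

Running audit_page on a saved copy of each partner page at regular intervals is one way to notice when a link has disappeared (point 4).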

Is there an easy way out on Reciprocal Link Building ?

Yes! Let a professional handle the task for you, while you concentrate on your main business. Professional SEO services

Quick SEO guidelines

Friday, April 13th, 2007

The SEO checklist is a “do it quick” method of determining if your website is “ready” to shoot up in the search results for prominent keywords and phrases.

The checklist includes a list of freely available tools to ensure that your website is free from logical and technical mistakes that can prevent it from achieving top ranks in search engines.

The intended audience of this article includes

  • SEO specialists
  • Website owners
  • Freelance search engine marketeers
  • Students

Checklist for individual pages

SEO using Page URL

Page URL (Uniform Resource Locator) is the address of a page. An internet user can access a page only through the page URL. This includes the following methods

  • By typing the URL in the web browser
  • By clicking the URL link on another page

Both of the above methods are recognized by search engines, directly or indirectly.

By typing the URL in the web browser directly

This page visit by a user might be recognized by search engines only if

  • A tool bar of the search engine is installed in the browser. Such as Google tool bar, Yahoo tool bar or Alexa tool bar
  • You use a tool that is provided by a search engine such as Google Desktop, Google web accelerator, Yahoo Messenger or Gtalk.

If any of the above tools require you to log in, the search engine might use your identity to match it against the information that you have been accessing.

  • The web page itself has an embedded link to a search engine analytics tool such as Google Analytics, Yahoo Badge or Zitku tracker.

SEO on Page URL Do’s

  • Include keywords in URL domain
  • Include keywords in URL path
  • Separate multiple keywords using “-” hyphen character.
  • Phrases in paths and URLs are better than individual keywords put together.

SEO on Page URL Don’ts

  • Do not over use keywords in the URL or path.
  • Do not use too many “-” hyphens to avoid being marked as spam.
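
As an illustration of the do’s and don’ts above, here is a minimal Python sketch that turns a key phrase into a hyphen-separated URL path segment. The function name keyword_slug and the limit of five words are assumptions made for this example; the checklist itself does not say how many hyphens are too many.

```python
import re


def keyword_slug(phrase, max_words=5):
    """Build a hyphen-separated URL path segment from a key phrase."""
    # Keep only letters and digits, lower-cased, in their original order.
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    # Assumed limit: truncate long phrases so the path does not fill up with hyphens.
    return "-".join(words[:max_words])


slug = keyword_slug("Reciprocal Link Building!")
```

Keeping the words in their original order preserves the phrase, which the checklist prefers over individual keywords put together.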

Page Title SEO guidelines

Page title is the text that appears on the title bar of the web browser. The following is a checklist for page titles.

Page title SEO Do’s

  • Page title should include keywords.
  • Keywords should NOT be over present in the title.
  • Each page of the website should have a unique title most appropriate for that page.

Page title SEO Don’ts

  • Don’t over use keywords in title. Too many keywords diminish the importance of real keywords.
  • Don’t repeat keywords in title. Keep title “readable” for humans.

Page Meta tags

Page meta is the internal code of the website that includes two important tags: the Keyword tag and the Description tag. Even though some search engines DO NOT use these tags, many still do.

Page meta tags guidelines

  • Keyword Meta: The keyword meta tag is your opportunity to list the important keywords of your webpage. Note that adding too many keywords to this tag is likely to reduce their importance, so include only significant keywords.
  • Description Meta: The description meta tag should be added to the page to describe the page in less than 200 characters.
    * Note that the description is a description of the page and not of the entire website.
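
The 200-character guideline is easy to check automatically. The sketch below, in Python with the standard library only, reads the two meta tags from a page and reports whether the description is present and short enough. The names MetaReader and check_meta are hypothetical.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collects the content of the keywords and description meta tags."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attrs.get("content") or ""


def check_meta(html, max_description=200):
    """Report on the description length and the number of meta keywords."""
    reader = MetaReader()
    reader.feed(html)
    description = reader.meta.get("description", "")
    keywords = [k.strip() for k in reader.meta.get("keywords", "").split(",") if k.strip()]
    return {
        "has_description": bool(description),
        "description_ok": 0 < len(description) < max_description,
        "keyword_count": len(keywords),
    }


head = ('<head><meta name="keywords" content="seo, checklist">'
        '<meta name="description" content="Quick SEO checklist for individual pages."></head>')
result = check_meta(head)
```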

Page Content SEO guidelines

Page content is the actual webpage that is shown in the web browser and is visible to the visitor. The page content is the most important section of a page and should be handled with care.

SEO for Page Content Do’s

  • Write original text.
  • Keyword density should not be less than 5% of the total text.
  • Keywords/keyphrases should appear in <h1>, <h2> and <h3> tags.
  • Use important keywords near the top of the page (in the HTML).
  • Try to anticipate the phrase that the search engine user is likely to search for, and use the phrase in the same order in your text.
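
Keyword density can be measured in more than one way. The following Python sketch assumes one common definition: the words that belong to occurrences of the phrase, as a percentage of all words in the text. The function name keyword_density is hypothetical.

```python
import re


def keyword_density(text, phrase):
    """Percentage of the words in `text` that belong to occurrences of `phrase`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    target = re.findall(r"[a-z0-9']+", phrase.lower())
    if not words or not target:
        return 0.0
    n = len(target)
    # Count every position where the phrase appears with its words in the same order.
    hits = sum(1 for i in range(len(words) - n + 1) if words[i:i + n] == target)
    return 100.0 * hits * n / len(words)


density = keyword_density("seo tips and more seo", "seo")
```

In this example the text has five words and the keyword appears twice, so the density is 40%.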

SEO for Page Content Don’ts

  • Don’t use too much javascript.
  • Don’t use “large” images on the top of a page.
  • Don’t use animations.
  • Don’t use Flash for the entire content. Flash is mostly unreadable.
  • Don’t use text in the form of images.

SEO for Images on the website Do’s

  • Always use “Alt text” as well as “title” attribute to describe the photograph. “Alt text” alone is not sufficient.
  • Image file names should be descriptive (apply Page URL + Path do’s and don’ts)
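
A page can be checked for images that lack either attribute. This is a minimal Python sketch using the standard library; the names ImageAudit and images_missing_text and the sample file names are hypothetical.

```python
from html.parser import HTMLParser


class ImageAudit(HTMLParser):
    """Records every <img> tag that lacks a non-empty alt or title attribute."""

    def __init__(self):
        super().__init__()
        self.problems = []  # list of (src, [missing attribute names])

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attrs = dict(attrs)
            missing = [name for name in ("alt", "title") if not attrs.get(name)]
            if missing:
                self.problems.append((attrs.get("src") or "", missing))


def images_missing_text(html):
    """Return the images on a page that need an alt text, a title, or both."""
    audit = ImageAudit()
    audit.feed(html)
    return audit.problems


html = ('<img src="red-bicycle.jpg" alt="Red bicycle" title="Red bicycle">'
        '<img src="img001.jpg" alt="Photo">')
problems = images_missing_text(html)
```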

SEO for Images on the website Don’ts

  • Do not put too much text or important keywords in images. They cannot be read by search engines.
  • Do not use images to make text invisible, for example a “plain” black background with black text.

Zitku bots

Thursday, April 12th, 2007

Web crawler bots

Zitku employs several bots to keep its search operations functioning. These bots run in parallel and communicate with each other using a common database.

Crawler Bot
Crawler bot is an automated script that reads a database of “to-be-crawled” URLs, extracts a URL and saves that page into a filesystem. The page is stored as it is, without any compression. The bot obeys all rules of robots.txt and indexes web pages at a slow pace on multiple servers, such that no single ISP or datacenter is overwhelmed by the crawler. The crawler is capable of reading

  • ROR files
  • Google Sitemaps and
  • Yahoo URL lists.
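
To illustrate what obeying robots.txt involves, here is a minimal sketch using Python's standard urllib.robotparser module. It is not the Zitku crawler's code; the user agent name ExampleBot, the sample rules and the example URLs are assumptions made for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ok = allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/index.htm")
blocked = allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/private/page.htm")
```

A polite crawler would run a check like this before every fetch and skip any URL for which it returns False.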

Data storage
The filesystem is a hierarchical, alphabetically arranged file system.
For instance www.catabatic.co.in/index.htm is stored in
\netcopy\www\c\a\t\a\b\a\t\i\c\dot\co\dot\in\index.htm
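
The mapping can be sketched in Python as follows. The rules are inferred from the single example above and are therefore assumptions: a leading "www" label becomes its own directory, the next label is split into one directory per character, and every further label is preceded by a "dot" directory. The function name storage_path is hypothetical.

```python
def storage_path(url, root="\\netcopy"):
    """Map a URL (without scheme) to a path in the alphabetical file system."""
    host, _, path = url.partition("/")
    labels = host.split(".")
    parts = [root]
    # Assumption: a leading "www" label is kept whole as the first directory.
    if labels and labels[0] == "www":
        parts.append(labels.pop(0))
    if labels:
        # One directory per character of the first remaining label.
        parts.extend(labels[0])
        # Each further label is stored after a literal "dot" directory.
        for label in labels[1:]:
            parts.extend(["dot", label])
    if path:
        parts.extend(path.split("/"))
    return "\\".join(parts)


stored = storage_path("www.catabatic.co.in/index.htm")
```

For the URL above, the sketch reproduces the path shown in the example.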

Compressor Bot
Compressor Bot reads a central database to calculate which sections of the file system require compression. The compression criterion is currently defined as a minimum uncompressed size of 5 MB. The compressor analyzes the type of content in the webpages and selects the most appropriate compression method.

Indexer Bot
Indexer Bot reads a database of “ready-2b-parsed” URLs, extracts information from the compressed filesystem and stores it in the search database. The data from the pages is analyzed semantically and fed into the search database using pre-defined rules.

Archiver Bot
Archiver bot is a tool that extracts “older” versions of the webpages and stores them in an archive filesystem. The archive filesystem is a highly compressed filesystem.

Limitations of the Crawler bot

Spider traps
Crawler bot is presently unable to detect all spider traps and often indexes dummy pages, although the non-aggressive nature of the bot allows it to minimize the damage. The junk data detection is currently not foolproof and only generates warnings. The warnings are generated by comparing the traffic analysis reports of a website against the pages that appear in that website. Generated warnings are manually inspected by editors, and trapping websites are marked as trappers.

URL redirects
302 temporary redirects are currently a known issue with the crawler and are being aggressively researched. With a 302 redirect, the redirecting page advises the search crawler that the target URL is a temporary URL and that the real page is the redirector page. The target URL can be a third-party website.

Page Parser
Page parser is a collection of Perl scripts that read the content of a page and perform several analytical operations on the content to extract and build an index for that page in a central database. The central database is a filesystem.

Readable content
Readable content is that content of a page which you see in the web browser. The page parser deletes the HTML code of the page to extract human-readable information from the page. This information is then analyzed as follows:

Primary keywords
Primary keywords are those keywords that have the maximum weightage in a page. These keywords are generated from the page content and NOT from the keywords meta attribute.

Secondary keywords
Secondary keywords are all keywords except primary keywords and supportive words (such as conjunctions, articles etc.)

Tertiary keywords
Tertiary keywords are keywords that are semantically linked to primary keywords.

For instance, David Beckham is a tertiary keyword to soccer.

Spam keywords
Spam keywords are those keywords that are over-present in a readable text as compared to normal. This tracking is based on the size of the readable text and the count of the keyword appearing in the content.

Links
Links are HTML pointers to another webpage. The page parser can read the following types of links

  • <A> links
  • <Iframes>
  • Image maps

Apart from self-pointing links, all other outbound links are categorized as

  • Internal links
  • External links

Internal links

  • Internal links are counted as links within the same website
  • Votes are added for each internal link
  • Max 500 votes
  • Internal page priority votes (if defined)

External links

  • Votes are added to each external link
  • External link votes are multiplied by votes accumulated by them divided by the links that page contains
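
The voting rules above are brief, so the following Python sketch is only one possible reading of them. It assumes that each internal link adds one vote, that priority votes count towards the cap of 500, and that each external link passes on the votes accumulated by the linking page divided by the number of links that page contains. All names in the sketch are hypothetical.

```python
INTERNAL_VOTE_CAP = 500  # "Max 500 votes" from the list above


def internal_votes(internal_link_count, priority_votes=0):
    """One vote per internal link plus optional priority votes, capped at 500."""
    # Assumption: priority votes are included in the cap.
    return min(internal_link_count + priority_votes, INTERNAL_VOTE_CAP)


def external_votes(linking_pages):
    """Sum the votes passed on by external pages.

    linking_pages is an iterable of (votes_of_linking_page, links_on_that_page).
    Assumption: a page shares its accumulated votes equally among its links.
    """
    return sum(votes / links for votes, links in linking_pages if links > 0)


capped = internal_votes(620)
passed_on = external_votes([(100, 4), (30, 3)])
```

Under this reading, a link from a page with 100 votes and 4 links is worth 25 votes, while a link from a page with 30 votes and 3 links is worth 10.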

Determining important keywords for a web page

Keywords within the page
The Zitku web crawler looks for the important keywords of a page, a small set of keywords that can be used to “represent” the page. These keywords are extracted as follows

  • Keywords that appear within <h1>,<h2> and <h3> tags: These tags are likely to contain the most important keyphrases of a document.
  • Keywords that appear within <b> tags: These are collected as “secondary keywords” and are considered more important as compared to regular text.
  • Keywords that are within 2% to 10% of the total word count of the page: Each page is analysed in the form of a word array including the occurrence count for each keyword, followed by a “weight” of the position. This array is then converted into a simple array marking a keyword as primary, secondary or spam.
  • Keywords within the URL of the page: Zitku web crawler reads the URL of the page to extract readable keywords and then compares them with the keyword array from the page content. The weight of a keyword that is present in the URL is the same as that of one within an <h1> tag. The web crawler indexing bot is able to detect spam if the URL has more than 15 keywords, in which case all keywords from the URL are rejected altogether.
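
The word-count rule and the URL rule can be sketched in Python as below. The mapping of ranges to labels is an assumption: a keyword above 10% of the word count is marked as spam, one between 2% and 10% as primary, and anything below 2% as secondary. The sketch ignores position weights and does not filter supportive words, and the names classify_keywords and url_keywords are hypothetical.

```python
import re
from collections import Counter


def classify_keywords(text, low=2.0, high=10.0):
    """Label each word as primary, secondary or spam by its share of the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = len(words)
    labels = {}
    for word, count in Counter(words).items():
        share = 100.0 * count / total
        if share > high:
            labels[word] = "spam"        # assumed: over-present keyword
        elif share >= low:
            labels[word] = "primary"     # within the 2% to 10% band
        else:
            labels[word] = "secondary"   # assumed: below the band
    return labels


def url_keywords(url, limit=15):
    """Extract keywords from a URL; reject them all if there are more than 15."""
    without_scheme = re.sub(r"^[a-z]+://", "", url.lower())
    words = re.findall(r"[a-z0-9]+", without_scheme)
    return [] if len(words) > limit else words


from_url = url_keywords("http://www.example.com/reciprocal-link-building")
```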

Keywords outside the page

Keywords contained in links to a page are probably the best way to describe a page, and Zitku web crawler acknowledges these keywords with the same weightage as is given to keywords in the URL or the <h1> tags.

Internal links:  Link text that appears on internal link. Same importance as <h2> tags.

External links:  Link text that appears on the external links. Same importance as <h1> tags.

Traffic
Traffic is analysed using the Zitku tracker. This data is taken into account only for those websites where the tracker is installed. Websites that do not have the tracker installed are given an average rating on this vertical.

Unique visitors
Unique visitors are tracked using their IP address and other HTTP friendly information.

Bot filtering
The tracker is able to distinguish between human visitors and bot visitors. Bot visits and crawling, either by the Zitku crawler or by a third-party crawler, are kept separately from the main traffic results.

Track superficial visits
Zitku tracker uses IP address, speed of access and duration of access to determine whether the visit is a human visit, a bot visit or a visit performed using an automated tool to archive the website.

Regional visitors
The tracker uses IP geolocation to determine the region of the visitors of a website. This data is collated to produce search results for other visitors from the same regions. A web page popular among users in Paris has a higher weightage for other Paris users as compared to users from India.

Uptime Monitor Bot(Proctor)

Uptime votes
Zitku employs an uptime monitor bot to track the uptime of websites. Websites with the highest uptimes are given more votes as compared to those websites which disappear often.

Downtime negative votes

  • Every downtime recorded for a website is counted as 100 negative votes.

Uptime error prevention
Proctor uses several known “always online” resources to establish the credibility of its own network connectivity before assigning negative votes to other pages.
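
The downtime rule and the connectivity check can be combined as in the Python sketch below. It assumes that Proctor trusts its own connection only when every reference resource responded; the text does not say whether all or only some of them must be reachable. The names DOWNTIME_PENALTY and downtime_votes are hypothetical.

```python
DOWNTIME_PENALTY = -100  # votes per recorded downtime


def downtime_votes(site_reachable, reference_results):
    """Return the votes to assign after one uptime check.

    reference_results holds one boolean per "always online" resource,
    True if that resource responded during the same check.
    """
    # Assumption: without proof of our own connectivity, nobody is penalised.
    if not reference_results or not all(reference_results):
        return 0
    return 0 if site_reachable else DOWNTIME_PENALTY


penalty = downtime_votes(False, [True, True])
```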

Zitku - the new search engine

Thursday, April 12th, 2007

Zitku is a search engine. It uses data from the world’s two largest community edited projects, DMOZ and Wikipedia.

DMOZ is the world’s largest human edited directory of websites. Wikipedia is the world’s largest human edited encyclopedia. Zitku combines the resources of these two data repositories to create accurate and spam free search results.

In addition to the two repositories, Zitku has its own software infrastructure to support crawling, storage, compression and search of the “entire internet”. Yes, you read it correctly: Zitku is designed to store multiple “copies” of the entire publicly accessible internet.

Zitku employs several bots to perform the tedious tasks of storing the internet. The crawler visits every webpage on the internet and then stores them in an ever expanding distributed filesystem.

Each stored page is parsed for keywords and readable content which is then sorted and important sections are put into a highly compressed search repository. When a user performs a search the results are produced from this repository.

The search engine, currently in its alpha stages of development, is inventing methods to perform large scale data processing with minimal hardware requirements.