Archive for the ‘Software’ Category

Website Development

Friday, June 1st, 2007

Website development is the process of creating a website. This article discusses the non-technical aspects of website development that customers of a website development company should know. A website is a collection of web pages that can be viewed in a web browser. Your website is essentially your place to publish information about what you do, what you sell or what you want to “promote”.

What is a web page ?
The smallest unit of the internet for an internet user is a web page, just as the smallest unit of a book for a reader is a single page. Unlike printed pages, however, web pages have a lot more to offer.

A web page on the internet can contain text, images, videos, links, animations, resources/files or interactive components. Just as a book allows the reader to move from one page to another, links on web pages enable a user to move from one page to another. Links (technical term: hyperlinks) are jump points that a user simply needs to “click” to reach another page. A link on the internet will usually be underlined or have a mouse-over hover effect (a hover effect means the link changes its appearance when you move your mouse over it).

Resources on a page may include links to downloadable files, documents, software etc.

In a paperback book, you can usually only read content. However, some pages in a book may also require you to fill in data, a crossword puzzle for instance. These are called interactive pages. Unlike books, however, web pages are also capable of displaying results based on the data that you fill in. Such pages that “generate” data from your input are called “dynamic” pages. Dynamic pages may also read inputs from “databases”. A database is essentially a collection of data stored in a particular format. For instance, a database of books might contain BookID, BookName, AuthorName, BookPrice and BookDescription. Information about a single book in a database is called a “record”. Therefore a database is also a collection of records! More details on databases and their connection with websites are given later in this section under the topic “database driven website”. For now, a web page that reads information from a single record in a database and displays it in the internet browser is a dynamic page.

How does the web page know which book to display on the web page ?
The book to display is identified by a parameter which is “unique” for every record. The web page accepts a “unique identifier” for a record and displays the information about that book. In this case, the unique identifier is the INPUT and the information about the book is the OUTPUT.

How does a web page accept input ?
A web page can accept input by two means: either using a “web based form”, or by passing the information through the URL of the page. The URL of a page is the unique address of a web page. Every page on the internet has a unique address, just like you have a unique email ID or a unique phone number. Remember that just as the BookID was the unique identifier for a book record, a URL is the unique identifier of a web page.
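The idea of a dynamic page that takes a unique identifier as INPUT and returns book information as OUTPUT can be sketched in PHP as follows. This is only an illustration: the $books array stands in for a real database, and the function name renderBookPage, the sample records and the address book.php?BookID=101 are all made up for this example.

```php
<?php
// Hypothetical stand-in for a database: each record is stored under its unique BookID.
$books = [
    101 => ['BookName' => 'First Example Book',  'AuthorName' => 'A. Author'],
    102 => ['BookName' => 'Second Example Book', 'AuthorName' => 'B. Writer'],
];

// Takes the parameters passed through the URL (INPUT) and returns the text to show (OUTPUT).
function renderBookPage(array $books, array $urlParameters): string
{
    // Read the unique identifier; anything that is not a number becomes 0.
    $bookId = isset($urlParameters['BookID']) ? (int) $urlParameters['BookID'] : 0;

    if (!isset($books[$bookId])) {
        return 'Book not found';
    }

    $record = $books[$bookId];
    // Escape the text so that it is safe to print inside an HTML page.
    return htmlspecialchars($record['BookName'] . ' by ' . $record['AuthorName']);
}

// On a live page the parameters would come from the URL, e.g. book.php?BookID=101,
// which PHP makes available as $_GET. Here they are passed in by hand.
echo renderBookPage($books, ['BookID' => '101']), "\n";
```

The same function could be fed from a web based form instead of the URL; only the source of the parameters changes.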


What is a website ?

A website is a collection of pages of information that are linked together by “hyperlinks”. A hyperlink is a clickable link that takes you from one page to another. A page on a website is called a web page, and it can contain text, images, videos, animations, resources or interactive components.

Static Websites
Static websites are those websites where the content of a page always remains static, which means that the page does not contain any programmatically controlled content like that discussed in the book records example. Instead, a page will change only when a website designer manually changes its content.

E-commerce websites
E-commerce websites are those websites that permit financial transactions from the website itself. A shopping cart from where you may have recently bought a gift is an e-commerce website. Your favorite website to book air tickets is an e-commerce website. An online university that provides training to students through remote video classrooms and charges a fee using a credit card is also an e-commerce website. Your online bank account is an example of e-commerce. Essentially, a website that supports the technology to exchange money between two parties in return for a product or service is an e-commerce website.

Exchanging money on a website is a specialized task and requires expertise and experience to prevent fraud. Some basics of e-commerce security are listed below. If you are planning to get an e-commerce enabled website for your company, you need to ensure that your vendor pays attention to the following:

  • Customer account information should be stored in an encrypted format.
  • All money related information should travel over SSL and not as plain text.
  • No credit card information should EVER be stored on an internet database.

A detailed explanation of the guidelines for developing an e-commerce website is beyond the scope of this article; please refer to the guidelines of developing an e-commerce website for more details.

Portal development
A web portal is a website that focuses on an agenda. The agenda can be a region, a company, an industry, a sport or an event. Another important aspect of a portal is the ability to find extensive information on the primary agenda that the portal focuses on. A portal on the travel industry should include information about leading travel agencies, modes of transport, business and leisure travel, destinations, travellers etc. Portals have contributed a lot to popularizing the internet among the masses.

Website development technology

A database driven dynamic website is a must for the success of a website. Static websites are no longer practical to maintain or rely on for your business. Your website requires the latest technology and technical expertise to become an effective tool for your business.
Several popular options are available to create a dynamic website. A detailed explanation of these technologies is beyond the scope of this article, but they have been discussed in individual articles.

LAMP (Linux, Apache, MySQL, PHP)
This is currently the most popular method of creating a website on the internet. The reason: it is cheap, easy and quick to create a website using LAMP. However, when you want to develop larger scale applications, LAMP may not be the ideal choice, and you would benefit from replacing MySQL with PostgreSQL and from developing specific utilities in a language other than PHP. It is important to note that all components of LAMP are Open Source and FREE, which has also been one of the reasons for the phenomenal success of the LAMP combination.

.NET
.NET is a proprietary platform that was launched by Microsoft as a successor to its ASP technology. .NET provides a lot of ready-made tools and components to assist developers in their work. However, the underlying idea of making your whole online business dependent on a single company does not go down very well, in my personal opinion. .NET has a lot to offer to companies who are willing to risk their existence on the promoter company, but the ever growing popularity of Open Source technologies such as PHP has created far more free support for Open Source tools than for proprietary platforms.

JSP and Java Struts
JSP and Struts are the Java platform’s answer to website development. JSP enjoys the support of very well maintained and easy-to-use IDEs. Easy availability of manpower in JSP and Struts is a key factor in their success. The syntax and

ColdFusion
ColdFusion is now a proprietary development platform from Adobe. Earlier, this platform was promoted by Macromedia (now acquired by Adobe). ColdFusion enjoys the support of a robust IDE from the parent company.

Perl
Perl is a time tested development language that was once the best tool for developing websites. However, the complex syntax of Perl code often makes it very expensive to maintain. This has been the foremost reason for the decline of the language. Perl undoubtedly remains the best language for developing non-web based scripts on Linux and text parsing utilities. Wherever data parsing is required, Perl should be your first choice.

Ruby on Rails
Ruby on Rails has been around for a few years but has recently been receiving a lot of attention. The recent arrival of a development IDE for Ruby has also contributed to its success. This platform still requires a better IDE and more free scripts on the internet to become significant competition for PHP.

Python
Python is a good choice for many web development projects. It has gained popularity ever since Google started using Python for the development of its sub-projects. Python is very easy to write, very easy to debug and very cheap to maintain. Another advantage cited for Python over PHP is that Python code can be distributed in compiled bytecode form rather than as source, although this gives only limited protection of the source code. A major deterrent to the growth of Python has been the lack of an easy-to-use IDE. Python unfortunately does not have easy availability of manpower, which forces companies to switch to alternate platforms of development. In due course of time, when training institutes start focusing on Python, it is likely to become a much more popular choice.

What platform should you insist on ?

  • If budget is very important to your project, choose LAMP. If you expect the website to have more than 50,000 visitors per week, or if your website is required to handle more than 500,000 records, go for PostgreSQL and PHP.
  • If you already have existing software in Java, go for JSP or Struts.
  • If you want easily available manpower and good commercial technical support, go for .NET.
  • Go for any of the other options only if you know what you are doing and are confident of pulling it off.

Important aspects of website development, while choosing your website development vendor.

  • Security
  • Scalability
  • Search Engine Optimization
  • Portability
  • Maintenance
  • Administration

Security

Security is a key consideration in a website development project. Choose a website development company that has experience in developing secure websites. Wherever required, the content of the website should be password protected. A rule of thumb here is that no person outside the group of webmasters or administrators should be able to modify any content on the website without approval from the administrators. When you negotiate your deal with the website development company

Scalability

Your website needs to be scalable. Scalable means that if your website is fortunate enough to reach the top in its category, it can handle the traffic of visitors. Was the website development project planned with enough foresight that the popularity of your website will not cause a loss of potential business because the development company did not anticipate it ? It is important for you to question your website development company to find out the limits of your website and the time/effort estimates required to scale the website to the next level.

Search Engine Optimization

A website needs to be accessible to search engines and optimized so that the publicly accessible portions of the website drive more visitors to it. Your website development company needs to understand this requirement and design the website in a search engine friendly manner. Your contract with the website service provider should clearly include the measures that will be taken by the website development company to ensure search engine optimized pages.

Portability

Portability of a website refers to the dependencies of a website on other technologies. A website may not be 100% portable, but it should be portable to an alternate technology with minimum effort and downtime. The world of the internet changes very quickly; operating systems and development platforms become obsolete within a matter of months. Is your website development company prepared for such an event ? What measures have they taken to ensure that the product that they develop for you is not entirely dependent on a single entity or person ?

Maintenance

Maintenance of a website is critical to its success. A common mistake by customers is to pay more attention to the development of the website than to the maintenance phase. The inclusion of a maintenance phase in your website development contract is a must. Usually the maintenance cost is 20% of the development cost and includes technical support from the development company, changes to labels etc. Maintenance usually does not include changes to functionality or additional features.

Administration

Website administration in this article refers more to the administration and management of 3rd party components that support your website. This primarily includes the upkeep of software installed on your hosting provider’s servers. The administration staff of the website require clear documentation on how to get the website up and running in case of an emergency or failure. The website development company must provide the task list for deployment of the website on a fresh server and for restoration of website data. You as a customer need to know whether the administrators clearly understand the task list and have been trained by the website development company to perform it.

Terms of use

You may use, copy and modify the content of this article for your personal, non-profit or commercial use as long as you maintain a link to this page on your page which contains this article in the same or modified manner. The link text for this article should be Article on website development. The code to place a link should be <a href="http://www.simply-geniass.com">Article on website development</a>

Automatic Semantic Link Builder

Tuesday, May 8th, 2007

Semantically most useful links for your webpage and others - Automatically !

The Automatic Semantic Link Builder (ASLB) is a very complex programming algorithm that is very simple to use. The script can be installed in your .NET, PHP or plain HTML pages and left to function on its own.

The script will generate a list of URLs on your page that will be visible to your visitors as well as to search engine crawlers that crawl your page. So does that mean that the script will drive traffic away from your website ? Yes, it will.

But copies of this script will also bring in visitors and improve the ranking of your page.

How does it work ?

The ASLB front script will publish links on your website to other pages on the internet. You will be allowed to prevent certain links from showing on your website (such as those of your direct competition). The links to your pages will be shown on third-party website pages that contain data relevant to your web page’s content.

What happens at the backend ?

The main ASLB servers continuously distribute links for websites and decide which links to show where. While doing this, the following rules are applied.

  1. No reciprocal links are generated.
  2. Inbound is always equal to outbound.
  3. PageRanks of involved pages play a part in the inbound:outbound ratios.
  4. Links remain where they are for a minimum of 30 days, unless removed for spamming.
  5. Only a single ASLB script can be installed in a webpage.
  6. HTML header tags play an important role.

Can I use ASLB on my website ?

No. Not yet, but soon. Bookmark this page. The Automatic Semantic Link Builder is currently in its alpha stages of development and is being tested on over 500,000 webpages. The ASLB is expected to be released in the month of March, 2008 for public installations.

Is there a manual way to do such linking ?

Yes, let an SEO company help you.

How to store passwords ?

Tuesday, May 8th, 2007

Passwords are secret keywords/keyphrases that are used to distinguish legitimate users from others.

Many years of research have gone into storing passwords. The industry’s best practice for storing passwords is hashing!

To establish that hashing is a good way to store passwords, let’s take a look at the other methods and compare them with hashing to find out their weaknesses.

The advantage of storing passwords as hashes is that even if someone is able to extract all the hashed passwords as well as the source code, it will not be easy to crack the passwords. If passwords are stored in plain text, then stealing the database alone will allow an outsider to log into the system.

If the passwords are stored in an encrypted format then an outsider will require both the database as well as the source code to decrypt the passwords and log into the system.

Hashing passwords will keep the passwords secure to a large extent even if an outsider is able to access the source code as well as the database.

What is a hash?

A hash is a fixed-length value that is computed from the original password. There are three distinct properties of a hash that make it a good choice for storing passwords.

  1.  Hash of a value X will always be the same.
  2. The probability of many values having the same hash value is negligible.
  3. It is computationally infeasible to find the original text from the hash itself.
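The first property, and the fixed length of a hash, can be checked with a few lines of PHP. This is a small sketch using PHP's built-in md5() function; the sample strings are arbitrary.

```php
<?php
// Property 1: hashing the same value twice gives the same hash.
$first  = md5('secret');
$second = md5('secret');
var_dump($first === $second);

// A tiny change in the input (here only the capital letter) gives a different hash.
$other = md5('Secret');
var_dump($first !== $other);

// The hash has the same length no matter how long the input is.
$shortHash = md5('a');
$longHash  = md5(str_repeat('a', 1000));
var_dump(strlen($shortHash) === strlen($longHash));
```

An MD5 hash returned by md5() is always 32 hexadecimal characters long.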

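The above three properties make a good hashing algorithm. MD5 has been one of the most widely used algorithms and is used in the examples below because it is simple to call. It is, however, very fast to compute and is no longer considered strong enough for password storage on its own; a salted, deliberately slow algorithm such as bcrypt is a safer choice.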

Storing passwords using MD5

At the time of creating a new user in your database, allow the user to enter a password in plain text. When you fill your database with the information, convert the password into a “hash” and store the hash instead of the password.

Remember that the hash is irreversible, so you cannot convert the hash back into the original password. But then how will you authenticate the user the next time he/she tries to log in ?

You will need to utilize the 1st property of a hash.

“Hash of a value X will always be the same”

Calculate the hash of the password that the user enters while trying to login and compare the newly generated hash with the stored hash to find out if the two match. If they do, you should welcome the user!

Using MD5 in PHP to store passwords

The MD5 of a string can be generated in PHP as easily as

$hash = md5($txtRawPassword);

At the time of user registration, store the $hash in the database. After that, whenever the user tries to log in using password $pwd, retrieve the hash from the database and compare it with md5($pwd).

if (!strcmp($hash, md5($pwd))) // strcmp() returns 0 when the two hashes match
{
//welcome user!
}
else
{
//send user back to login page.
}
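The same register-then-compare flow can also be written with PHP's built-in password_hash() and password_verify() functions (available since PHP 5.5), which add a random salt and use a slower algorithm than MD5. This is a swapped-in technique rather than the MD5 method described above, and the $users array and the two function names are made up for this sketch; the array stands in for the database.

```php
<?php
// Hypothetical stand-in for the users table: name => stored hash.
$users = [];

// Registration: store only the hash, never the raw password.
function registerUser(array &$users, string $name, string $rawPassword): void
{
    $users[$name] = password_hash($rawPassword, PASSWORD_DEFAULT);
}

// Login: let password_verify() compare the entered password with the stored hash.
function checkLogin(array $users, string $name, string $rawPassword): bool
{
    return isset($users[$name]) && password_verify($rawPassword, $users[$name]);
}

registerUser($users, 'alice', 'correct horse');

if (checkLogin($users, 'alice', 'correct horse')) {
    echo "welcome user!\n";
} else {
    echo "send user back to login page.\n";
}
```

Because of the random salt, hashing the same password twice with password_hash() gives two different strings, so the stored hash cannot be compared with strcmp(); password_verify() takes care of the comparison.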

Disadvantage of storing passwords as hash

If the user forgets his/her password, you will not be able to find the original password. Instead, you will need to create a new password for them and mail it to their email ID. Isn’t this what Google and Yahoo do ?

How does compression work ?

Monday, May 7th, 2007

Software compression is a technique for storing digital data in a format that occupies the least amount of space on the storage media.

Consider the following example to understand how it works.
A personal assistant is able to write at the speed of his/her boss’s speech by using shorthand and special codes to represent long words. Compression works exactly like that. The difference is that while the assistant uses codes to increase writing speed, a compression agent uses codes to reduce space usage.

The above technique of replacing longer words with codes will be efficient only if the longer words are repeated several times in the data that needs to be compressed.

It is important to note that “longer words” means that the code should be shorter than the word, and that the word should occur multiple times in the data that needs to be compressed.

The scope of this article is limited to compression on textual data. Binary data requires more complex algorithms of compression and needs a complete set of articles to discuss the topic.

Steps to create your own compression script

Step 1: Read text into a string variable

$txtOriginalString =
"May Day! May Day! Some one help us on how compression works in programming world. Will this article help us share with its pearls of wisdumb ?";

Step 2: Collect all words from the text into an array.
Split the text at every " " space character and collect all material between two spaces, throughout the string.

$arrAllWords = explode(" ", $txtOriginalString);

Step 3: Ensure that the array is “unique”. Eliminate duplicate words from your array.

$arrUniqueWords = array_unique($arrAllWords);

Step 4: Count the number of unique words.
You will require this many codes to replace the original words.

$intUniqueWordCount = count($arrUniqueWords);

Step 5: Identify the length of a “code”.
If you are using 200 characters in your code set, say the characters with codes 45 to 244, then a single-character code is sufficient if the unique word count is <= 200.

If the word count is > 200, two-character codes are used, drawn from all permutations 200P2 of the code set.

if ($intUniqueWordCount > 200) { $intCodeLength = 2; }
else { $intCodeLength = 1; }

Step 6: Assign a code to each unique word.
6.a) Generate a new code.
6.b) Assign it to the first unassigned unique word.
6.c) Repeat process for every unique word.

Step 7: Write the new string $txtCompressedString

7.a) Write the $intCodeLength into $txtCompressedString.
$txtCompressedString = $intCodeLength;

7.b) Write a separator to $txtCompressedString.

$txtCompressedString .= "###Separator###";

7.c) Write the original words and their codes in CSV format to $txtCompressedString; the codes go after the words.

$txtCode = ""; // last code issued; newCode() is the code generator from Step 6
$arrCodeArr = array();
foreach ($arrUniqueWords as $key => $value) // generate a code for each unique word
{
    $txtCode = newCode($txtCode);
    $arrCodeArr[$key] = $txtCode;
    $txtCompressedString .= $value . ","; // write words to the compressed string as CSV
}
$txtCompressedString .= "###Separator###"; // separate words from codes

foreach ($arrCodeArr as $value)
{
    $txtCompressedString .= $value . ","; // write codes to the compressed string as CSV
}

$txtCompressedString .= "###Separator###";

Step 8: Generate $txtCodeString

8.a) Replace all occurrences of each unique word in $txtOriginalString with their assigned codes and store the result in $txtCodeString.

8.b) Replace all space characters " " in $txtCodeString with a blank "".

8.c) Append $txtCodeString to $txtCompressedString.
$txtCompressedString .= $txtCodeString;

That’s it !

Uncompressing the file…

Step 1: Read the string.

Step 2: Explode the string using "###Separator###".

$arrData = explode('###Separator###', $txtCompressedString);

$intCodeLength = $arrData[0];
$strCSVUniqueWords = $arrData[1];
$strCSVCodes = $arrData[2];
$strCodeString = $arrData[3];

Step 3: Replace codes with a space character and the original word.
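The word-to-code idea can be tried out end to end with the simplified sketch below. It is not the exact script described in the steps: it keeps the parts in a PHP array instead of one separator-delimited string, and it maps codes back to words by their position in the word list. The function names indexToCode, compressWords and decompressWords are made up for this illustration.

```php
<?php
// Turn a word number (0, 1, 2, ...) into a fixed-length code made of
// characters with codes 45 to 244 (a code set of 200 characters).
function indexToCode(int $index, int $codeLength): string
{
    $code = '';
    for ($i = 0; $i < $codeLength; $i++) {
        $code  = chr(45 + $index % 200) . $code;
        $index = intdiv($index, 200);
    }
    return $code;
}

function compressWords(string $text): array
{
    $words       = explode(' ', $text);
    $uniqueWords = array_values(array_unique($words));
    // One character is enough for up to 200 unique words, otherwise use two.
    $codeLength  = count($uniqueWords) > 200 ? 2 : 1;

    // Assign a code to each unique word.
    $codeOfWord = [];
    foreach ($uniqueWords as $index => $word) {
        $codeOfWord[$word] = indexToCode($index, $codeLength);
    }

    // Write the codes one after another, without spaces.
    $codeString = '';
    foreach ($words as $word) {
        $codeString .= $codeOfWord[$word];
    }

    return ['codeLength' => $codeLength, 'words' => $uniqueWords, 'codes' => $codeString];
}

function decompressWords(array $packed): string
{
    $words = [];
    // Cut the code string into pieces of codeLength characters each.
    foreach (str_split($packed['codes'], $packed['codeLength']) as $code) {
        $index = 0;
        foreach (str_split($code) as $char) {
            $index = $index * 200 + (ord($char) - 45);
        }
        $words[] = $packed['words'][$index];
    }
    return implode(' ', $words);
}

$txtOriginalString = 'May Day! May Day! Some one help us';
$packed = compressWords($txtOriginalString);
var_dump(decompressWords($packed) === $txtOriginalString);
```

In this sketch every word, repeated or not, costs one code character in the code string, so the saving comes only from words that occur more than once.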

Mistakes and issues unaddressed in the above algorithm.

If you read the article carefully, you would notice the following mistakes.

1) The uncompressed file will always contain the last character as a space.

2) If the first character of the file was a ” “. It will be lost !

3) What is the maximum value of $intUniqueWordCount for which this script will work ?

How to tackle them ?

This is where you come into the picture. Your task will be to analyze the above article and prove your geniass by…

1) Find out more errors in the above logic.

AND / OR

2) Propose solution to issues pointed out by you or others.

Multiple comments are not a problem, we’ll track them. But for each incorrect issue that you point out your points will be reduced, and for each geniass issue that you point out your chances to feature in the “Simply Geniass - Hall of geniasses” will increase !

Send in your entries now !

Reciprocal Link Building

Friday, May 4th, 2007

Reciprocal Link Building is the process of exchanging links between your website and other websites on the internet that focus on the same subject as yours.

Why do you need to do Reciprocal Link Building ?

  • Increases visibility and reach of your website on the internet for human visitors as well as search engine crawlers.
  • Boosts rankings in search engines, which consider your website important if many websites on the internet point to your web pages.
  • These links drive traffic to your website from those websites that place your link.

How to do Reciprocal Link Building ?

  1. Prepare a list of those keywords that you think “your website visitors” will search on search engines.
  2. Manually perform these searches on the internet and note down the list of websites that appear within the top 500 results on popular search engines.
  3. Create links of these websites within your own website on relevant pages.
  4. Write to the webmasters of the websites that you have linked to and request them to put a link about your website on their website.
    1. If they agree to put a link to your website, thank them and move on.
    2. Otherwise add a rel="nofollow" attribute to their link. ;o)
    3. Write to the next webmaster!

Pitfalls and things you need to be aware of

  1. Do not exchange links with “link farms”. Link farms are those websites where most pages contain more than 100 links per page. These websites usually do not contain much content of their own; instead they are link repositories. Exchanging links with such websites will cause search engines to penalize your website for participating in link spamming.
  2. Concentrate only on those websites that focus on the same subject as your website. Links from unrelated websites will not do any good.
  3. Understand the difference between web pages and a website. A link exchange is done with a web page, not with a website.
  4. Keep checking with websites with whom you have exchanged links. If your link has disappeared from a website, write to the webmaster to restore it !
  5. Total reciprocal links should not be more than 5% of total outbound links from your website.
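The “more than 100 links per page” rule of thumb from the first pitfall can be checked roughly with a few lines of PHP. This is only a sketch: the function name looksLikeLinkFarm is made up, and counting <a href=...> tags with a regular expression is a simplification of what a real HTML parser would do.

```php
<?php
// Returns true when a page contains more links than the given limit.
function looksLikeLinkFarm(string $html, int $maxLinks = 100): bool
{
    // Count opening <a ...> tags that carry an href attribute.
    $linkCount = preg_match_all('/<a\s[^>]*href\s*=/i', $html);
    return $linkCount > $maxLinks;
}

// A made-up page with 101 links, for illustration.
$page = str_repeat('<a href="http://example.com/">link</a> ', 101);
var_dump(looksLikeLinkFarm($page));
```

A page that trips this check is only a candidate; it is still worth looking at it by hand before deciding not to exchange links.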

Is there an easy way out on Reciprocal Link Building ?

Yes! Let a professional handle the task for you, while you concentrate on your main business. Professional SEO services

Quick SEO guidelines

Friday, April 13th, 2007

The SEO checklist is a “do it quick” method of determining if your website is “ready” to shoot up in the search results for prominent keywords and phrases.

The checklist includes a list of freely available tools to ensure that your website is free from logical and technical mistakes that can prevent it from achieving top ranks in search engines.

The intended audience of this article includes

  • SEO specialists
  • Website owners
  • Freelance search engine marketeers
  • Students

Checklist for individual pages

SEO using Page URL

Page URL (Uniform Resource Locator) is the address of a page. An internet user can access a page only through the page URL. This includes the following methods

  • By typing the URL in the web browser
  • By clicking the URL link on another page

Both of the above methods are recognized by search engines, directly or indirectly.

By typing the URL in the web browser directly

This page visit by a user might be recognized by search engines only if

  • A tool bar of the search engine is installed in the browser, such as Google tool bar, Yahoo tool bar or Alexa tool bar.
  • You use a tool that is provided by a search engine, such as Google Desktop, Google web accelerator, Yahoo Messenger or Gtalk.
  • The web page itself has an embedded link to a search engine analytical tool such as Google Analytics, Yahoo Badge or Zitku tracker.

If any of the above tools require you to login, the search engine might use your identity to match it against the information that you have been accessing.

SEO on Page URL Do’s

  • Include keywords in URL domain
  • Include keywords in URL path
  • Separate multiple keywords using “-” hyphen character.
  • Phrases in paths and URLs are better than individual keywords put together.

SEO on Page URL Dont’s

  • Do not over use keywords in the URL or path.
  • Do not use too many “-” hyphens to avoid being marked as spam.
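A keyword phrase can be turned into a hyphen-separated URL path with a small helper such as the one below. This is a minimal sketch; the function name makeUrlSlug is made up for this example and it only handles plain English letters and digits.

```php
<?php
// Turn a phrase into a lowercase, hyphen-separated piece of a URL path.
function makeUrlSlug(string $phrase): string
{
    $slug = strtolower(trim($phrase));
    // Replace every run of characters other than a-z and 0-9 with a single hyphen.
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);
    // Remove hyphens left over at the start or the end.
    return trim($slug, '-');
}

echo makeUrlSlug('Quick SEO Guidelines!'), "\n"; // quick-seo-guidelines
```

Because runs of punctuation and spaces collapse into one hyphen, the helper never produces several hyphens in a row.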

Page Title SEO guidelines

Page title is the text that appears on the title bar of the web browser. The following is a checklist for page titles.

Page title SEO Do’s

  • Page title should include keywords.
  • Keywords should NOT be overused in the title.
  • Each page of the website should have a unique title most appropriate for that page.

Page title SEO Dont’s

  • Don’t over use keywords in title. Too many keywords diminish the importance of real keywords.
  • Don’t repeat keywords in title. Keep title “readable” for humans.

Page Meta tags

Page meta tags are part of the internal code of a web page. Two of them are important: the Keyword tag and the Description tag. Even though some search engines DO NOT use these tags, many still do.

Page meta tags guidelines

  • Keyword Meta: Keyword meta is your opportunity to list the important keywords of your web page. It is important to note that adding too many keywords to this meta is likely to reduce their importance. Include only significant keywords in this attribute.
  • Description Meta: Description meta is a tag that should be added to the page to include a description of the page in less than 200 characters.
    * It is important to note that the description is a description of the page and not of the entire website.
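The two tags can be generated from PHP as sketched below. The function name buildMetaTags and the sample values are made up for this example, and the cut at 199 characters simply enforces the “less than 200 characters” guideline above.

```php
<?php
// Build the keywords and description meta tags for a single page.
function buildMetaTags(array $keywords, string $description): string
{
    // Keep the description under 200 characters.
    // Note: substr() counts bytes, so this assumes plain single-byte text.
    $description = substr($description, 0, 199);
    $keywordList = implode(', ', $keywords);

    return '<meta name="keywords" content="' . htmlspecialchars($keywordList) . '">' . "\n"
         . '<meta name="description" content="' . htmlspecialchars($description) . '">';
}

echo buildMetaTags(
    ['website development', 'e-commerce'],
    'An introduction to website development for customers.'
), "\n";
```

Each page would call the helper with its own keywords and its own description.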

Page Content SEO guidelines

Page content is the actual webpage which is shown on the web browser and is visible to the visitor. The page content is the most important section of a page and should be dealt with care.

SEO for Page Content Do’s

  • Write original text.
  • Keyword density should not be less than 5% of the total text.
  • Keywords/keyphrases should appear in <h1>, <h2> and <h3> tags.
  • Use important keywords near the top of the page (in the HTML).
  • Try to anticipate the phrase that the search engine user is likely to search for, and use the phrase in the same order in your text.
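Keyword density, as used in the list above, is the number of times a keyword appears divided by the total number of words, expressed as a percentage. The sketch below measures it for a single-word keyword; the function name keywordDensity is made up for this example, and it relies on PHP's str_word_count() to split plain English text into words.

```php
<?php
// Percentage of the words in $text that are exactly $keyword (case-insensitive).
function keywordDensity(string $text, string $keyword): float
{
    $words = str_word_count(strtolower($text), 1); // list of words in the text
    if (count($words) === 0) {
        return 0.0;
    }
    $hits = count(array_keys($words, strtolower($keyword), true));
    return 100 * $hits / count($words);
}

// "seo" appears 2 times in 6 words.
printf("%.1f%%\n", keywordDensity('SEO tips and more SEO tips', 'seo'));
```

A keyphrase of several words would need a different count, for example counting how often the whole phrase occurs in the text.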

SEO for Page Content Dont’s

  • Don’t use too much javascript.
  • Don’t use “large” images on the top of a page.
  • Don’t use animations.
  • Don’t use Flash for the entire content. Flash is mostly unreadable to search engines.
  • Don’t use text in the form of images.

SEO for Images on the website Do’s

  • Always use “Alt text” as well as “title” attribute to describe the photograph. “Alt text” alone is not sufficient.
  • Image file names should be descriptive (apply Page URL + Path do’s and don’ts)

SEO for Images on the website Don’ts

  • Do not use too much text or important keywords in images. They can’t be read by search engines.
  • Do not use images to make text invisible such as “plain” black background with black text.

Zitku bots

Thursday, April 12th, 2007

Web crawler bots

Zitku employs several bots to keep its search operations functioning. These bots run parallel to each other and communicate with each other using a common database.

Crawler Bot
Crawler bot is an automated script that reads a database of “to-be-crawled” URLs, fetches each URL and saves the page into a filesystem. The page is stored as it is without any compression. The bot obeys all rules of robots.txt and indexes web pages at a slow pace from multiple servers so that no single ISP or datacenter is overwhelmed by the crawler. The crawler is capable of reading

  • ROR files
  • Google Sitemaps and
  • Yahoo Url lists.

Data storage
The filesystem is hierarchical and arranged alphabetically.
For instance www.catabatic.co.in/index.htm is stored in
\netcopy\www\c\a\t\a\b\a\t\i\c\dot\co\dot\in\index.htm
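The mapping in this example can be reproduced with the PHP sketch below. The rule is inferred from the single example above and is therefore an assumption: the first host label is kept whole, the second label is split into one folder per letter, and every remaining label is preceded by a “dot” folder. The function name urlToStoragePath is made up, and the sketch expects a URL with at least two host labels and a file part.

```php
<?php
// Map a URL such as www.catabatic.co.in/index.htm to its storage path.
function urlToStoragePath(string $url): string
{
    // Split the host from the file part at the first "/".
    [$host, $file] = explode('/', $url, 2);
    $labels = explode('.', $host);

    $parts = ['netcopy', array_shift($labels)];                     // e.g. "www"
    $parts = array_merge($parts, str_split(array_shift($labels))); // one folder per letter
    foreach ($labels as $label) {                                   // e.g. "co", "in"
        $parts[] = 'dot';
        $parts[] = $label;
    }
    $parts[] = str_replace('/', '\\', $file);

    return '\\' . implode('\\', $parts);
}

echo urlToStoragePath('www.catabatic.co.in/index.htm'), "\n";
```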

Compressor Bot
Compressor Bot reads a central database to calculate which sections of the file system require compression. The compression criterion is currently a minimum uncompressed size of 5MB. The compressor analyzes the type of content in the web pages and selects the most appropriate compression method.

Indexer Bot
Indexer Bot reads a database of “ready-2b-parsed” URLs, extracts information from the compressed filesystem and stores it in the search database. The data from the pages is analyzed semantically and fed into the search database using pre-defined rules.

Archiver Bot
Archiver bot is a tool that extracts “older” versions of the web pages and stores them in an archive filesystem. The archive filesystem is a highly-compressed filesystem.

Limitations of the Crawler bot

Spider traps
Crawler bot is presently unable to detect all spider traps and often indexes dummy pages, although the non-aggressive nature of the bot minimizes the damage. Junk data detection is currently not foolproof and only generates warnings. The warnings are generated by comparing the traffic analysis reports of a website against the pages that appear in that website. Generated warnings are manually inspected by editors, and trapping websites are marked as trappers.

URL redirects
302 temporary redirects are currently a known issue with the crawler and are being aggressively researched. With a 302 redirect, the redirecting page advises the search crawler that the target URL is temporary and that the real page is the redirecting page. The target URL can be a third party website.

Page Parser
Page parser is a collection of Perl scripts that read the content of a page and perform several analytical operations on it to extract and build an index for that page in a central database. The central database is a filesystem.

Readable content
Readable content is the content of a page that you see in the web browser. The page parser strips the HTML code from the page to extract the human readable information. This information is then analyzed as follows:
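Stripping the markup can be sketched as below. The post says the real parser is a set of Perl scripts; this is an independent Python sketch using the standard html.parser module. The class and function names are hypothetical, and the decision to skip script and style blocks is an assumption.

```python
from html.parser import HTMLParser


class ReadableText(HTMLParser):
    """Collect the text of a page, ignoring tags and script/style blocks."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped block.
        if not self._skip_depth:
            self.chunks.append(data)


def readable_content(html):
    """Return the human readable text of an HTML page as one string."""
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    # Join the pieces and collapse runs of whitespace into single spaces.
    return " ".join(" ".join(parser.chunks).split())


page = "<html><body><h1>Hotels</h1><p>Rooms in <b>Goa</b></p></body></html>"
print(readable_content(page))
```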

Primary keywords
Primary keywords are those keywords that have the maximum weightage in a page. These keywords are generated from the page content and NOT from the keywords meta attribute.

Secondary keywords
Secondary keywords are all keywords except primary keywords and supportive words (such as conjunctions, articles etc.)

Tertiary keywords
Tertiary keywords are keywords that are semantically linked to primary keywords.

For instance, David Beckham is a tertiary keyword to soccer.

Spam keywords
Spam keywords are those keywords that are over-present in a readable text as compared to normal. This tracking is based on the size of the readable text and the number of times the keyword appears in the content.
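The size-versus-count check can be sketched as a keyword density calculation. In this Python sketch, the 10% limit is borrowed from the “2% to 10%” band mentioned later in this post. Treating everything above that band as spam, the tiny stop word list and the function names are all assumptions.

```python
import re
from collections import Counter

# A tiny, hypothetical list of supportive words to leave out of the count.
STOP_WORDS = {"a", "an", "and", "in", "of", "or", "the", "with"}


def keyword_density(text):
    """Return each keyword's share of all keywords in the readable text."""
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower())
             if w not in STOP_WORDS]
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def spam_keywords(text, max_density=0.10):
    """Return the keywords whose density exceeds max_density (assumed 10%)."""
    densities = keyword_density(text)
    return sorted(w for w, d in densities.items() if d > max_density)


text = ("cheap cheap cheap cheap hotels in goa with a pool and a garden "
        "near the beach and the market")
print(spam_keywords(text))
```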

Links
Links are HTML pointers to another web page. The page parser can read the following types of links:

  • <A> links
  • <Iframes>
  • Image maps

Apart from self-pointing links, all other outbound links are categorized as

  • Internal links
  • External links
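This categorization can be sketched with Python’s urllib.parse. The function name classify_link is hypothetical. Treating “internal” as “same host name” and ignoring the #fragment when detecting self-pointing links are assumptions.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def classify_link(page_url, href):
    """Classify a link found on page_url as 'self', 'internal' or 'external'."""
    # Resolve relative links against the page and drop any #fragment.
    target = urldefrag(urljoin(page_url, href)).url
    source = urldefrag(page_url).url
    if target == source:
        return "self"
    # Assumption: a link is internal when it stays on the same host.
    if urlparse(target).netloc.lower() == urlparse(source).netloc.lower():
        return "internal"
    return "external"


page = "http://www.example.com/hotels/index.htm"
for href in ["#top", "rooms.htm", "http://other.example.org/"]:
    print(href, classify_link(page, href))
```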

Internal links

  • Internal links are counted as links within the same website
  • Votes are added for each internal link
  • Max 500 votes
  • Internal page priority votes (if defined)

External links

  • Votes are added to each external link
  • External link votes are multiplied by votes accumulated by them divided by the links that page contains
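One possible reading of these voting rules is sketched below in Python. The rules above are brief, so most of this is interpretation. It assumes one vote per internal link, that priority votes count towards the 500 cap, and that each external link contributes the linking page’s accumulated votes divided by the number of links on that page. All function names are hypothetical.

```python
MAX_INTERNAL_VOTES = 500  # the "Max 500 votes" rule from the text


def internal_votes(internal_link_count, priority_votes=0):
    """One vote per internal link plus optional priority votes, capped at 500."""
    return min(internal_link_count + priority_votes, MAX_INTERNAL_VOTES)


def external_vote(linking_page_votes, linking_page_link_count):
    """Share of a linking page's votes that one of its links passes on."""
    if linking_page_link_count <= 0:
        return 0.0
    return linking_page_votes / linking_page_link_count


def external_votes(linking_pages):
    """Sum the shares from every (votes, link_count) pair that links to us."""
    return sum(external_vote(votes, links) for votes, links in linking_pages)


print(internal_votes(620))
print(external_votes([(100, 4), (30, 10)]))
```

Dividing by the number of links on the linking page means a link from a page with few outbound links passes on more votes than a link from a page with hundreds of them.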

Determining important keywords for a web page

Keywords within the page
The Zitku web crawler looks for the important keywords of a page, so that the page can be “represented” by a small set of keywords. These keywords are extracted as follows:

  • Keywords that appear within <h1>,<h2> and <h3> tags: These tags are likely to contain the most important keyphrases of a document.
  • Keywords that appear within <b> tags: These are collected as “secondary keywords” and are considered more important as compared to regular text.
  • Keywords that are within 2% to 10% of the total word count of the page: Each page is analysed in the form of a word array that includes the occurrence count for each keyword, followed by a “weight” for its position. This array is then converted into a simple array marking each keyword as primary, secondary or spam.
  • Keywords within the URL of the page: The Zitku web crawler reads the URL of the page to extract readable keywords and then compares them with the keyword array from the page content. The weight of a keyword that is present in the URL is the same as that of one within an <h1> tag. The web crawler indexing bot is able to detect spam if the URL has more than 15 keywords, in which case all keywords from the URL are rejected altogether.
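The URL rule in the last item can be sketched as follows. This Python sketch assumes that keywords are taken from the path only (not the host name), that the file extension is dropped, and that any run of non-alphanumeric characters separates two keywords. The name url_keywords is hypothetical.

```python
import posixpath
import re
from urllib.parse import urlparse

MAX_URL_KEYWORDS = 15  # more keywords than this is treated as spam


def url_keywords(url):
    """Extract readable keywords from a URL path, or [] if it looks like spam."""
    path = urlparse(url).path
    # Drop a file extension such as ".htm" (an assumption of this sketch).
    path, _extension = posixpath.splitext(path)
    words = [w for w in re.split(r"[^a-z0-9]+", path.lower()) if w]
    if len(words) > MAX_URL_KEYWORDS:
        # Reject all keywords from the URL, as described in the text.
        return []
    return words


print(url_keywords("http://www.example.com/travel/cheap-hotels-goa.htm"))
```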

Keywords outside the page

Keywords contained in links to a page are probably the best way to describe a page, and the Zitku web crawler gives these keywords the same weightage as keywords in the URL or in <h1> tags.

Internal links:  Link text that appears on internal link. Same importance as <h2> tags.

External links:  Link text that appears on the external links. Same importance as <h1> tags.

Traffic
Traffic is analysed using the Zitku tracker. This data is taken into account only for those websites where the tracker is installed. Websites that do not have the tracker installed are given an average rating on this vertical.

Unique visitors
Unique visitors are tracked using their IP address and other information available in the HTTP request.

Bot filtering
The tracker is able to distinguish between human visitors and bot visitors. Bot visits and crawling, whether by the Zitku crawler or a third party crawler, are kept separate from the main traffic results.

Track superficial visits
Zitku tracker uses IP address, speed of access and duration of access to determine whether a visit is a human visit, a bot visit or a visit performed using an automated tool to archive the website.

Regional visitors
The tracker uses geolocation IP tracking to determine the region of each visitor to a website. This data is collated to produce search results for other visitors from the same region. A web page popular among users in Paris has a higher weightage for other Paris users as compared to users from India.

Uptime Monitor Bot (Proctor)

Uptime votes
Zitku employs an uptime monitor bot to track the uptime of websites. Websites with the highest uptimes are given more votes as compared to those websites which disappear often.

Downtime negative votes

  • Every downtime recorded for a website is counted as 100 negative votes.

Uptime error prevention
Proctor uses several known “always online” resources to establish the credibility of its own network connectivity before assigning negative votes to other pages.
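The check-before-penalizing idea can be sketched in Python as below. The function name record_check and the injected is_reachable callable are hypothetical, so the sketch needs no real network access. Treating the monitor’s own connectivity as confirmed when at least one reference resource responds is an assumption.

```python
DOWNTIME_PENALTY = -100  # every recorded downtime counts as 100 negative votes


def record_check(site_url, reference_urls, is_reachable):
    """Return the votes for one uptime check of site_url.

    Returns 0 when the site is up, DOWNTIME_PENALTY when it is down, and
    None when the monitor cannot confirm its own connectivity.
    """
    if is_reachable(site_url):
        return 0
    # The site looks down: confirm our own connectivity before blaming it.
    if not any(is_reachable(ref) for ref in reference_urls):
        return None  # inconclusive, so no negative votes are assigned
    return DOWNTIME_PENALTY


# Usage with a fake reachability check instead of real HTTP requests.
online = {"http://reference-one.example"}
references = ["http://reference-one.example", "http://reference-two.example"]
print(record_check("http://site.example", references, lambda url: url in online))
```

Passing the reachability check in as a function keeps the voting rule separate from the way a real monitor would probe a site.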

Zitku - the new search engine

Thursday, April 12th, 2007

Zitku is a search engine. It uses data from the world’s two largest community edited projects, DMOZ and Wikipedia.

DMOZ is the world’s largest human edited directory of websites. Wikipedia is the world’s largest human edited encyclopedia. Zitku combines the resources of these two data repositories to create accurate and spam free search results.

In addition to the two repositories, Zitku has its own software infrastructure to support crawling, storage, compression and search of the “entire internet”. Yes, you read it correctly: Zitku is designed to store multiple “copies” of the entire publicly accessible internet.

Zitku employs several bots to perform the tedious tasks of storing the internet. The crawler visits every webpage on the internet and then stores them in an ever expanding distributed filesystem.

Each stored page is parsed for keywords and readable content which is then sorted and important sections are put into a highly compressed search repository. When a user performs a search the results are produced from this repository.

The search engine, currently in its alpha stages of development, is inventing methods to perform large scale data processing with minimal hardware requirements.

Using MySQL Full text search in PHP

Wednesday, March 21st, 2007

Full text search is the ability to perform a text based search in a MySQL database without matching exact text.

Exact text searches such as “Starts with”, “Ends with” or “contains” can be performed using the LIKE operator in MySQL. However, this kind of search requires the searcher to know the exact sequence of characters or words while performing the search.

MySQL full text search is an automatic, quick and easy way to perform a search when the searcher may or may not type the words/characters in the same order as in the database.

Quick information

  • Full text search can be performed on fields that are of “varchar” or “text” type.
  • Full text search requires a “full text index” to be present in the database.
  • A full text index can be created in a single command as
    • CREATE FULLTEXT INDEX <index_name> ON <table_name>(<column_name1>,<column_name2>)
  • Full text index creation example
    • CREATE FULLTEXT INDEX hotels ON hotel(name,description,review)
  • The column list in MATCH() must be exactly the same as the column list of a full text index, so the queries below need their own indexes on (name) and on (name,description).
  • Full text search query
    • Example: Perform a full text search on the hotel name
    • Query: SELECT * FROM `hotel` WHERE MATCH(`name`) AGAINST ('leela');
    • Example: Perform a full text search on the hotel name as well as the description
    • Query: SELECT * FROM `hotel` WHERE MATCH(`name`,`description`) AGAINST ('leela');

MySQL can also determine how relevant a search result is to the “searched text”. Here is a quick way to harness MySQL’s built-in relevance support. This query needs a full text index on the description column alone.

SELECT MATCH(`description`) AGAINST ('leela') AS `Relevance` FROM `hotel` WHERE MATCH(`description`) AGAINST ('leela') ORDER BY `Relevance` DESC

PHP variable not getting passed from an HTML form

Wednesday, March 21st, 2007

This is a pretty common issue and can be resolved quite easily.

If a textfield in your HTML form is called “name”, the same should be available in your PHP script as $name. If the value is not getting transferred from the HTML form, then instead try to use $_REQUEST['name'].

$_REQUEST is an associative array that contains all variables received by a script through the GET or POST method (and, by default, cookies). A security setting in the PHP ini file (register_globals, when it is turned off) prevents the HTML form fields from becoming available as variables in the PHP script. The fix is to access the same variable from the $_REQUEST array.

An alternate way is to run the following code at the start of your php script which will recreate $name from the $_REQUEST array for you automatically.

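<?php

// Create $name from $_REQUEST['name'], and likewise for every other request variable.
// Note: this overwrites any existing variable that has the same name as a request field.
foreach ($_REQUEST as $key => $value)
{
    $$key = $value;
}

?>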