Software compression is a technique for storing digital data in a format that takes up the least amount of space on the storage media.
Consider the following example to understand how it works.
A personal assistant is able to write at the speed of his or her boss's speech by using shorthand and special codes to represent long words. Compression works exactly like that. The difference is that while the assistant uses codes to increase writing speed, a compression agent uses codes to reduce space usage.
The above technique of representing longer words as codes is efficient only if the longer words are repeated several times in the data that needs to be compressed.
It is important to note that “longer words” means that the code must be shorter than the word it replaces, and that the word must occur multiple times in the data being compressed.
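The trade-off can be put into rough numbers. The sketch below is only an illustration: estimateSaving() is a hypothetical helper, and it assumes the dictionary stores each word and each code once, each followed by a comma. It ignores the separators and any space characters that get removed.

```php
<?php
// Hypothetical helper (not part of the article's script): net characters
// saved by replacing every occurrence of one word with a code.
function estimateSaving(string $txtWord, int $intOccurrences, int $intCodeLength = 1): int
{
    // Each occurrence shrinks from the word's length to the code's length.
    $intSavedInText = $intOccurrences * (strlen($txtWord) - $intCodeLength);

    // Assumption: the dictionary holds the word and the code once,
    // each followed by one comma.
    $intDictionaryCost = strlen($txtWord) + 1 + $intCodeLength + 1;

    return $intSavedInText - $intDictionaryCost;
}
```

Under these assumptions, estimateSaving('compression', 1) is negative because a single occurrence does not pay for its dictionary entry, while estimateSaving('compression', 5) is clearly positive.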
The scope of this article is limited to compression of textual data. Binary data requires more complex compression algorithms and needs a complete set of articles of its own.
Steps to create your own compression script
Step 1: Read text into a string variable
$txtOriginalString =
"May Day! May Day! Some one help us on how compression works in programming world. Will this article help us share with its pearls of wisdumb ?";
Step 2: Collect all words from the text into an array.
Split the text on the space character " " and collect everything between two consecutive spaces, throughout the string.
$arrAllWords = explode(" ", $txtOriginalString);
Step 3: Ensure that the array is “unique”. Eliminate duplicate words from your array.
$arrUniqueWords = array_unique($arrAllWords);
Step 4: Count the number of unique words.
You will need this many codes to replace the original words.
$intUniqueWordCount = count($arrUniqueWords);
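Steps 1 to 4 can be run together as a short script. This is a minimal sketch using the article's sample text. The call to array_values() is an addition of this sketch: array_unique() keeps the original array keys, so renumbering them gives the unique words consecutive positions 0, 1, 2 and so on.

```php
<?php
// Step 1: the sample text from the article.
$txtOriginalString = "May Day! May Day! Some one help us on how compression works in programming world. Will this article help us share with its pearls of wisdumb ?";

// Step 2: split on single space characters.
$arrAllWords = explode(" ", $txtOriginalString);

// Step 3: drop duplicate words. array_values() (an addition in this
// sketch) renumbers the keys left behind by array_unique().
$arrUniqueWords = array_values(array_unique($arrAllWords));

// Step 4: the number of codes that will be needed.
$intUniqueWordCount = count($arrUniqueWords);
```

For the sample text this yields 27 words in total, of which 23 are unique. Punctuation stays attached to its word, so "Day!" and "world." count as words in their own right.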
Step 5: Identify the length of a “code”.
Suppose your code set uses 200 characters, say extended ASCII codes 45 to 244. Then a “single digit” (one-character) code is sufficient if the unique word count is <= 200.
If the unique word count is > 200, two-character codes are needed, which covers all permutations of 200P2.
if ($intUniqueWordCount > 200) { $intCodeLength = 2; }
else {$intCodeLength =1;}
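The same idea can be generalised beyond two characters. The following is a sketch only: codeLengthFor() is a hypothetical helper, and it assumes a code may repeat a character, so a two-character code gives 200 × 200 combinations rather than the 200P2 permutations mentioned above.

```php
<?php
// Hypothetical generalisation of Step 5: the smallest code length whose
// code space (set size to the power of length) covers every unique word.
// Assumes $intSetSize is at least 2 and characters may repeat in a code.
function codeLengthFor(int $intUniqueWordCount, int $intSetSize = 200): int
{
    $intCodeLength = 1;
    $intCapacity = $intSetSize; // number of distinct codes of the current length

    while ($intCapacity < $intUniqueWordCount) {
        $intCapacity *= $intSetSize;
        $intCodeLength++;
    }

    return $intCodeLength;
}
```

With the default set size, codeLengthFor(200) gives 1 and codeLengthFor(201) gives 2, matching the if/else above for those cases.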
Step 6: Assign a code to each unique word.
6.a) Generate a new code.
6.b) Assign it to the first unassigned unique word.
6.c) Repeat process for every unique word.
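The article calls newCode() later on but never shows it, so here is one possible sketch of such a generator. Everything in it is an assumption: the constants CODE_FIRST and CODE_COUNT, the optional second parameter, and the choice to treat a code as a number in base 200 whose digits are the characters from code 45 upward.

```php
<?php
// Assumed code set: 200 characters starting at character code 45.
const CODE_FIRST = 45;
const CODE_COUNT = 200;

// Hypothetical generator: returns the code that follows $txtPrevCode.
// Pass null or "" to get the very first code.
function newCode(?string $txtPrevCode, int $intCodeLength = 1): string
{
    // Read the previous code as a base-200 number; -1 means "no code yet".
    $intIndex = -1;
    if ($txtPrevCode !== null && $txtPrevCode !== '') {
        $intIndex = 0;
        foreach (str_split($txtPrevCode) as $chr) {
            $intIndex = $intIndex * CODE_COUNT + (ord($chr) - CODE_FIRST);
        }
    }

    $intIndex++;
    if ($intIndex >= CODE_COUNT ** $intCodeLength) {
        throw new OverflowException('No codes left for this code length');
    }

    // Write the number back as a fixed-length code, last character first.
    $txtCode = '';
    for ($i = 0; $i < $intCodeLength; $i++) {
        $txtCode = chr(CODE_FIRST + $intIndex % CODE_COUNT) . $txtCode;
        $intIndex = intdiv($intIndex, CODE_COUNT);
    }

    return $txtCode;
}
```

Called as newCode(null) it returns "-" (character 45), and newCode("-") returns "." (character 46). Because the set starts at 45, neither the comma (44) nor "#" (35) can ever appear in a code.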
Step 7: Write the new string $txtCompressedString
7.a) Write the $intCodeLength into $txtCompressedString
$txtCompressedString = $intCodeLength;
7.b) Write a separator to $txtCompressedString
$txtCompressedString .= "###Separator###";
7.c) Write the original words and their codes in a CSV format to $txtCompressedString, codes go after the words.
$txtCode = ""; // No code has been generated yet.
$arrCodeArr = [];
foreach ($arrUniqueWords as $key => $value) // Generate a code for each unique word.
{
    // newCode() is assumed to return the code that follows $txtCode.
    $txtCode = newCode($txtCode);
    $arrCodeArr[$key] = $txtCode;
    $txtCompressedString .= $value . ","; // Write words to the compressed string in CSV.
}
$txtCompressedString .= "###Separator###"; // Separate words from codes.
foreach ($arrCodeArr as $value)
{
    $txtCompressedString .= $value . ","; // Write codes to the compressed string in CSV.
}
$txtCompressedString .= "###Separator###";
Step 8: Generate $txtCodeString
8.a) Replace all occurrences of each unique word in $txtOriginalString with its assigned code, storing the result in $txtCodeString.
8.b) Replace all space characters " " in $txtCodeString with an empty string "".
8.c) Append $txtCodeString to $txtCompressedString.
$txtCompressedString .= $txtCodeString;
That's it!
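For reference, the whole compression pass can be sketched as one function. This is not the article's exact script. compressText() is a hypothetical name, codes are derived directly from each word's position instead of calling newCode(), and the text is encoded word by word rather than by search and replace, so that a code already written can never be mistaken for part of a later word. It assumes that no word contains a comma or the separator text, and that there are at most 200 × 200 unique words.

```php
<?php
// Sketch of Steps 1 to 8 under the assumptions stated above.
function compressText(string $txtOriginalString, string $txtSeparator = '###Separator###'): string
{
    // Steps 2 to 5: words, unique words, code length.
    $arrAllWords = explode(' ', $txtOriginalString);
    $arrUniqueWords = array_values(array_unique($arrAllWords));
    $intCodeLength = count($arrUniqueWords) > 200 ? 2 : 1;

    // Step 6: give the i-th unique word the i-th code (characters 45 upward).
    $arrCodeOf = [];
    foreach ($arrUniqueWords as $i => $txtWord) {
        $arrCodeOf[$txtWord] = $intCodeLength === 1
            ? chr(45 + $i)
            : chr(45 + intdiv($i, 200)) . chr(45 + $i % 200);
    }

    // Step 8: encode word by word; concatenating the codes drops the spaces.
    $txtCodeString = '';
    foreach ($arrAllWords as $txtWord) {
        $txtCodeString .= $arrCodeOf[$txtWord];
    }

    // Step 7: code length, words CSV, codes CSV (each entry followed by a
    // comma, as in the article), then the code string.
    return $intCodeLength . $txtSeparator
        . implode(',', $arrUniqueWords) . ',' . $txtSeparator
        . implode(',', array_values($arrCodeOf)) . ',' . $txtSeparator
        . $txtCodeString;
}
```

For a very short input such as the sample sentence, the result comes out longer than the original, because the word list and separators outweigh the savings. That is consistent with the earlier point that the technique pays off only when long words repeat often.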
Uncompressing the file…
Step 1: Read the string.
Step 2: Explode the string using "###Separator###".
$arrData = explode('###Separator###', $txtCompressedString);
$intCodeLength = $arrData[0];
$strCSVUniqueWords = $arrData[1];
$strCSVCodes = $arrData[2];
$strCodeString = $arrData[3];
Step 3: Replace codes with a space character and the original word.
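One way to carry out these three steps is sketched below. decompressText() is a hypothetical name. It assumes the layout written in Step 7 of the compression script: code length, words CSV, codes CSV and code string, with every CSV entry followed by a comma. Instead of literally replacing each code with a space plus a word, it cuts the code string into pieces of $intCodeLength characters and joins the matching words with single spaces, which is a deliberate change from the step as worded.

```php
<?php
// Sketch of the decompression steps under the assumptions stated above.
function decompressText(string $txtCompressedString, string $txtSeparator = '###Separator###'): string
{
    // Step 2: a limit of 4 keeps the code string in one piece.
    [$intCodeLength, $strCSVUniqueWords, $strCSVCodes, $strCodeString]
        = explode($txtSeparator, $txtCompressedString, 4);

    // Every CSV entry is followed by a comma, so explode() always leaves
    // one empty element at the end; drop it.
    $arrUniqueWords = explode(',', $strCSVUniqueWords);
    array_pop($arrUniqueWords);
    $arrCodes = explode(',', $strCSVCodes);
    array_pop($arrCodes);

    // Look-up table from code to original word.
    $arrWordOf = array_combine($arrCodes, $arrUniqueWords);

    // Step 3: cut the code string into fixed-length codes and translate.
    $arrWords = [];
    foreach (str_split($strCodeString, (int) $intCodeLength) as $txtCode) {
        $arrWords[] = $arrWordOf[$txtCode];
    }

    return implode(' ', $arrWords);
}
```

Joining with implode() places a space only between words, so this sketch neither adds a space at the end nor loses one at the start.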
Mistakes and issues unaddressed in the above algorithm.
If you read the article carefully, you will notice the following mistakes.
1) The uncompressed file will always end with a space character.
2) If the first character of the file was a space " ", it will be lost!
3) What is the maximum value of $intUniqueWordCount for which this script will work?
How to tackle them?
This is where you come into the picture. Your task is to analyze the above article and prove your geniass by…
1) Finding more errors in the above logic.
AND / OR
2) Proposing solutions to issues pointed out by you or others.
Multiple comments are not a problem, we’ll track them. But for each inaccurate mistake that you point out, your points will be reduced, and for each geniass issue you point out, your chances to feature in the “Simply Geniass - Hall of geniasses” will increase!
Send in your entries now!