Hmmm..
sai_cool - Basic text compression works on the morse-code principles. An E in a morse code is just one character (Reason is because E is used most often of all characters). While a Q has a higher length.
E = .
Q = --.-
Now you can see why this is acceptable - E is used a lot and needs to be transmitted quick and Q hardly appears generally and is not that important so its given a longer length representation.
Similarly text compression can be done in 3 steps:
1. Get the input. Now check the input for all its characters and find the probability of each. Like if the length is 4 ("
ABCA") then A's probability of occurance is 2/4, B's and C's is 1/4 each.
2. Assign unique code words for each of these characters. Like for example am gonna assign a binary bit 0 for A and 10 for B and 11 for C. You can see that when I thus make the string in binary compressed form as
0.10.11.0 the total length reduces to 6 bits (This binary format 010110) from 4 bytes (Each character is a byte) of ABCA.
(Neglect the dots, they are for better visualizing)
3. To decode this format you have to make the decoder know what symbol is assigned to what. And once your decoder knows that its easy for you to decode.
What I've just demonstrated above is the Variable Length (Size of the unique code word for each character changes) method. This is the most basic style of text compression. There are 2-3 algorithms to assist you in generating the code word table, they are
Huffman Coding method and
Shannon-Fano coding method.
Explaining them to you is beyond the scope of this post since they are relatively simple methods you can check in the links provided and ask certain large doubts if you've got them.
(Assuming you know how trees work, else learn basic data structures first and then get into such programming levels)
Tip: Shannon Fano coding is fairly easier than Huffman coding.