Large-scale compression

As you may (or may not) know, text is usually stored compressed. The most efficient way to compress text at the per-character level is Huffman coding, which assigns shorter encodings to the characters that are used most often. There are ways to compress text further, but these all involve looking at larger patterns in the document. The strategy we’ll be using is shortening the representations of the most common groups of characters in the file.

Puzzle

Write two programs: An encoder, which takes an uncompressed text file and produces a file that has the compressed text and the information needed to decode it. A decoder, which takes the compressed file generated by the encoder and produces a replica of the original, uncompressed file. Here is an example text file where part of each line is repeated. You can use this as input for your program.

Hints and Walk-through

Encoder

Identify the TARGET

The first thing your encoder needs is a way to identify what common string is going to be shortened (your TARGET). Depending on how hard you want the puzzle to be, you can:

hard-code a value
write a method to find a string that is repeated between lines (this will work for our example file)
write a method to find the most-repeated string of length > 3 in the file (this may be difficult) I recommend the second option. For our sample file, the repeated string in each line is “.google.com”

Replace the TARGET

Next, your encoder needs a function to find all instances of the TARGET and replace them with a unicode character. Most languages have a “find” method that you can use here. For this example, we’ll use the “Δ” symbol. After we’ve replaced the target, the first line of our sample file will look like this: accountsΔ

Record what you did

In order for the decoder to be able to reverse the process, you need to include a key for what you did to compress the file. A simple way to do this would be to write “Δ=.google.com” on the first line of your file, but you can do it however you see fit.

By using the process described here, I was able to roughly halve the size of the example file.

Decoder

Read the key

The decoder needs to read the key you wrote in the previous step in order to find out what the placeholder and the TARGET are.

Replace the placeholder with the TARGET

Very similar to the step in the encoder

Remove the key from the file

once you’ve done all of that, you should have a replica of the original file.