Greedy Programming Revisited

Throughout this course (and before it in 186 as well!) Greedy Programming approaches have gotten a bad rap.

It seems like, from the sample of problems we've seen thus far, there's always a more efficient, more optimal (oxymoron?), or otherwise better approach than the one offered by Greedy Algorithms.

It's time, then, for them to have their day!


Motivating Example


Consider each of the following boxes, and then capture an image of each separately.

(I didn't want it to seem like I was cheating by making these 2 different images -- use an image capture tool like Snipping Tool on Windows machines or the CMD+SHIFT+4 image capture on Macs to test).


Having captured an image of each box individually, which image file(s) do you think will have greater filesize(s) than the others, or will they all have the same?

A and C will have roughly the same size; B will be much larger than the others!

Why do you think this is the case?

A and C are two images with few unique colors!

In the most naive case of storing any arbitrary image file, we could record the RGB value of every pixel at its \((X, Y)\) coordinate in some sort of 2D array.

RGB values for colors are encoded as a spectrum in which each of the R, G, and B color channels is some value in \([0, 255]\), for a total of \(256 \times 256 \times 256\) different possible combinations.

This is why (for those familiar with web-design perhaps from CMSI 185) we often denote colors by their Hexadecimal code like: #FF0000 because 6 hex digits provide \(16^6 = 256 \times 256 \times 256\) color combinations!
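
Since this arithmetic matters in a moment, here's a quick sanity check in Python (a throwaway snippet of my own, not any standard tool) showing how a hex code decomposes into its RGB channels:

  # Each pair of hex digits is one byte: a channel value in [0, 255]
  def hex_to_rgb(hex_code):
      hex_code = hex_code.lstrip("#")
      return tuple(int(hex_code[i:i+2], 16) for i in (0, 2, 4))

  print(hex_to_rgb("#FF0000"))  # (255, 0, 0) -- pure red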

This naive 2D-array approach might look like the following:
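
For a code-level illustration, here's a minimal Python sketch of the same idea (a toy of my own making -- real image formats are more sophisticated):

  # Naive representation: a 2D array where EVERY pixel stores its
  # full 3-byte (R, G, B) value, even in a 2-color image.
  RED   = (255, 0, 0)
  WHITE = (255, 255, 255)

  image = [
      [RED,   RED,   WHITE, WHITE],
      [RED,   WHITE, WHITE, RED  ],
      [WHITE, WHITE, RED,   RED  ],
  ]

  rows, cols = len(image), len(image[0])
  print(rows * cols * 3)  # 36 bytes for a mere 12 two-color pixels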


You may have noticed that Google even has their own tool to help you find the RGB / hex values corresponding to a desired color:

Now, imagine that *every* pixel in an image has to know its RGB value. That can take up a lot of space!

How many bytes does a single RGB value / hex code take to store?

\(16 \times 16 = 256 = 2^8\), meaning that every 2 hex digits create one byte (one color channel's value). Therefore, a 6-digit hex / RGB code representing a color takes up 3 bytes.

So, if we represent an \(N \times M\) image as a 2D array of 3-byte RGB values, we can imagine its filesize is \(N \times M \times 3\) bytes. Why is this wasteful, as exemplified above?

Not all images have all \(256^3\) different colors!

As such, today we'll examine more parsimonious ways of encoding data that isn't so wasteful!


Data Compression


This idea of parsimonious data encoding extends beyond images and applies to many different facets of computing.

Data Compression is the process of encoding data using fewer bits than the original representation.

[Reflect] Why is data compression important and what are some example applications?

Importance: memory and bandwidth are limited resources, so conserving them can be vital!

Example Applications:

  • Image Compression: finding concise ways of storing images is important lest we burn tons of memory in wasted space!

  • Zip Archives: finding ways to compress groups of files in a portable, concise format is vital for transmission and in some cases, security.

  • Text Transmission: ever played a real-time multiplayer game? Bandwidth can only transfer so many bytes at a time, so finding concise ways of transmitting vital information can reduce lag!


To make use of any compressed data (image or otherwise) we must first decompress it into its original format, or something close to it.

Note the asterisk on the decompressed version of the file.

The decompressed version may be either an exact replica or something close to it depending on the type of compression used:

There are 2 different categories of compression algorithms based on guarantees about the decompressed file:

  • Lossy algorithms provide approximations of the original and are typically used for images and video in which perfect fidelity is not necessary, but speed or transmission size is of highest priority.

  • Lossless algorithms provide perfect recreations of the original once decompressed.

Ever seen an image on a website that looks really blurry or washed out? It's likely it was reconstructed using lossy compression -- you can still see what it looks like, but it may not be as crisp as the OG.

Lossless algorithms, on the other hand, are good for transmitting text or bank account information, since it's pretty necessary to get these spot-on!


Approach to Data Compression


Alright, seems like there are a lot of useful applications of compression. Let's think about how to go about implementing it!

We'll start, per usual, with a bit of conceptual overview:

[Consider] Is it possible to have a Universal Compression Algorithm (UCA) that, given any data, can compress it further?

No, silly! If that were possible, then we could take any original file, compress it, then compress that compressed version etc. until we were left with 1 bit!

To visualize why this is not possible, consider that we have, as our original file, Webster's Dictionary, run through some hypothetical UCA over and over!

As such, there is no universal compression algorithm: a file of only a few bits cannot possibly decompress back into every larger original that supposedly compressed down to it.

BUT, that doesn't mean we can't try to perform compression on at least some original source data!

Using intuition from our 3-box example from above, how do you think we can go about this?

Reduce the size of the representations of oft-repeated colors and let the representations of rarer colors take up more space!

Compression Intuition: find concise ways of representing common data components at the cost of rarer components being represented with more space.

In other words: why bother with a pixel format capable of representing \(256^3\) colors if an image is only using 2 of them?!

In an image's case, compression can re-represent the color spectrum using only those colors that are relevant to the particular image.

So, our approach is going to take the *most common* data components and assign them the *smallest* representation -- what kind of an algorithm is this?

Greedy!

Finally, we get to see a case in which greed is good! You heard it here first, folks -- the real take-away from this class!

Let's take a look at how to accomplish this in a simpler domain with text, even though the strategy for images is actually the same.



Compression Goals

Rather than start with the more complex problem of image compression, we'll see the same techniques applied to text compression!


Text Encoding


Consider the extended-ASCII encoding, by which every character corresponds to an 8-bit (= 1 byte) representation:


An encoding schema defines the rules for translating information in one format into another.

In the cases of character encodings, we are mapping character symbols in text to their bytecodes, i.e., the bits that are stored in memory and can then be translated to the letters we see on the screen.

There are also different encodings for images and pretty much any file or information you store on a computer!

To go the more analog route, Morse Code is another type of encoding, in which tones / symbols of different lengths translate to letters.

ASCII encoding is a type of fixed-length code, since every character takes up the same amount of space.

For instance:

Decimal | Bytecode  | Character
--------|-----------|----------
65      | 0100 0001 | A
33      | 0010 0001 | !


Suppose we are storing some text file in which some characters appear much more frequently than others; why might the above be wasteful?

Just like with our RGB pixel encodings for images with 2 colors, we might be wasting space representing every character with 8 bits when we could compress it!

Can you suggest an approach to do so?

Use a variable-length character encoding that compresses by frequency of each character!

Consider the following string: $$AAAABBBCCD$$

Using extended ASCII, it would take \(10 \times 8 = 80\) bits to represent these 10 letters -- perhaps we can do better!

Observing that A appears most frequently, perhaps we can instead represent it using something short like 0, and B (which appears second-most) as something like 10!

Let's explore how to go about that...


Variable-Length & Prefix-Free Codes


Having codes of different lengths that represent different characters can be tricky.

Consider the following encoding for several letters and determine: would it be appropriate for lossless text compression?

Code | Character
-----|----------
0    | A
1    | B
10   | C
11   | D


What is wrong with the above encoding?

In any given sequence to decode, we cannot uniquely recover the original string! For example: $$"BD" \Rightarrow 111 \Rightarrow ?$$ Above, the code 111 could be decompressed as "BD", "DB", or even "BBB".

This is a problem! Any variable-length encoding that we use to compress some text will have to avoid such ambiguity if it's to be lossless, otherwise we will not be able to reconstruct the original text during decompression.

What property / guarantee should our encoding make so that the above never happens?

Ensure that no character's code is a prefix of another's!

Prefix Codes (AKA Prefix-Free / Huffman Codes) are those in which no whole code is a prefix of another code in the system.

Although this is a limiting property in the amount of compression we can do, it is also a necessary one for being lossless!
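
To make the property concrete, here's a small Python check of my own devising that tests whether a code map is prefix-free:

  # True if no character's code is a prefix of another's
  def is_prefix_free(code_map):
      codes = list(code_map.values())
      return not any(a != b and b.startswith(a)
                     for a in codes for b in codes)

  print(is_prefix_free({"A": "0", "B": "1", "C": "10", "D": "11"}))   # False: "1" prefixes "10"
  print(is_prefix_free({"A": "0", "B": "10", "C": "110", "D": "111"}))  # True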


So, to recap, we want to create an encoding schema that:

  • Compresses some corpus of text to require fewer bits to represent it than it takes in its original format.

  • Performs this compression by finding a variable-length, prefix-free encoding that is influenced by the rarity of each character in the corpus.

  • Is lossless such that the original string can be reconstructed with perfect accuracy during decompression.

We've got our work cut out for us! Mercifully, there's a really elegant solution that does all of the above.



Huffman Coding - Compression

There's a reason those "Prefix-Free" codes are AKA "Huffman Codes" -- turns out this Huffman fellow had a lot to say about this problem.

The fact that this technique (which he invented as a student while at MIT in the 50s) is still in use today is a testament to how cool and brilliant it is.

Huffman Coding is a procedure for finding a prefix-free, lossless, variable-length compression code for some data (image, text, or otherwise) by encoding the most common data components with the most concise representation.

The steps of Huffman Coding Compression are depicted as follows (we'll worry about decompression later):


Let's do the above on one sufficiently-sized example:

Find the Huffman Encoding of the following string: $$ACADACBABE$$


Step 1: Find Character Frequencies


Since we want to use the most concise encoding for the most common characters in the string, the first step is to find the distribution over each character in the corpus.

This is a simple task:

  1. Perform a linear pass over the string, tallying each instance of each character.

  2. Divide the final counts by the total string length to find their distribution.
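
In Python, those two steps might look like the following (using the standard library's collections.Counter; the function name is my own):

  from collections import Counter

  def char_distribution(corpus):
      counts = Counter(corpus)  # Step 1: one linear pass of tallying
      # Step 2: divide counts by total length for the distribution
      return {char: count / len(corpus) for char, count in counts.items()}

  print(char_distribution("ACADACBABE"))
  # {'A': 0.4, 'C': 0.2, 'D': 0.1, 'B': 0.2, 'E': 0.1}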

So, in our example String \(ACADACBABE\), we have \(N=10\) characters over which:

Character | Count | Probability
----------|-------|------------
A         | 4     | 0.40
B         | 2     | 0.20
C         | 2     | 0.20
D         | 1     | 0.10
E         | 1     | 0.10

Easy part's done!

Quick sanity check: which character do we want to have the smallest code? (i.e., represented with fewest bits?)

A, because it appears most frequently.


Step 2: Find Encoding Map


With this distribution in hand, we now need to find a mapping of each character to the bit-code used to represent it.

This part is a little unintuitive, but we can at least motivate the tools we're going to employ:

[Observation 1] We need something to manage the order in which we encode each character; what tool might be useful here?

A priority queue in which priority is the character's probability!

[Observation 2] We need some way of finding a parsimonious code for each character in some sort of hierarchy of that character's prevalence; what structure might be useful here?

A tree!

Let's see how these observations combine in what is actually a deceptively simple algorithm.

The Huffman Encoding Map is generated as a binary trie (another word for a Prefix Tree, AKA a Huffman Trie) in which:

  • Nodes are binary (at most 2 children) and contain a priority equal to the proportion of characters in their subtrees.

  • Leaves are allocated to each character being encoded.

  • Edges represent bits in each character's encoding: left = 0, right = 1 (by convention).


Construction of the Huffman Trie proceeds as follows:

  for each character to encode:
      create leaf node and add to priority queue
  while more than 1 node in queue:
      remove 2 smallest probability nodes from queue
      create new parent node of these 2 removed with sum of their probabilities
      enqueue new parent
  remaining node is the root
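
For the curious, here's one way that pseudocode might be realized in Python, using the standard heapq module as the priority queue (class / function names are my own, and ties may break differently than in the hand-drawn examples that follow):

  import heapq
  from itertools import count

  class HuffNode:
      def __init__(self, prob, char=None, left=None, right=None):
          self.prob, self.char = prob, char
          self.left, self.right = left, right

  def build_trie(distribution):
      order = count()  # tie-breaker, since HuffNodes aren't comparable
      queue = [(prob, next(order), HuffNode(prob, char))
               for char, prob in distribution.items()]
      heapq.heapify(queue)
      while len(queue) > 1:
          # Remove the 2 smallest-probability nodes...
          p1, _, n1 = heapq.heappop(queue)
          p2, _, n2 = heapq.heappop(queue)
          # ...and enqueue a new parent with the sum of their probabilities
          parent = HuffNode(p1 + p2, left=n1, right=n2)
          heapq.heappush(queue, (p1 + p2, next(order), parent))
      return queue[0][2]  # remaining node is the root

  def encoding_map(node, prefix="", codes=None):
      # Depth-first traversal: left edges add a 0, right edges add a 1
      codes = {} if codes is None else codes
      if node.char is not None:
          codes[node.char] = prefix
      else:
          encoding_map(node.left, prefix + "0", codes)
          encoding_map(node.right, prefix + "1", codes)
      return codes

  # e.g., encoding_map(build_trie(char_distribution("ACADACBABE")))
  # yields one valid (optimal) map, though ties may differ from the figures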

Construct a Huffman Trie associated with the corpus: $$ACADACBABE$$

(when you're doing these by hand, it's convenient to start with the leaves at the top since you're essentially building from the leaves up and don't know how tall the tree will get).

(which is why I'm drawing it "upside down," even though [let's be honest] this kinda looks more like an actual tree with the leaves at the top).


Some notes on the above:

  • Each leaf node has a character and priority / probability associated with it, but inner nodes just have the priority (no character).

  • The circled, orange numbers indicate the order in which parents were generated for the children from the priority queue.

  • The Encoding Map is generated from the Huffman Trie by tracing the path from the root to the character's corresponding leaf, which can be created simply by using a depth-first traversal.


So, let's double check our understanding:

Is the code generated using the Huffman Trie going to be optimal? i.e., will it reduce the code length of each character optimally according to its probability?

Yes! A proof by induction works here in that no subtree that occurs more frequently will ever have a parent added before a subtree that occurs less frequently (where addition of a parent means greater depth in the tree, and therefore more bits in the code).

Note: a victory for Greedy Algorithms!

Is the code generated using the Huffman Trie going to be prefix-free? Why or why not?

Yes! Since characters live only at the leaves, one character's code could be a prefix of another's only if its leaf were an ancestor of the other's -- impossible, since leaves have no children!

Is the encoding generated using the Huffman Trie unique?

It is *not* unique! In fact, any time we have a tie in the priority queue, we could have chosen another order in which to generate children!

For instance, we could have also generated the following Huffman Trie:

Does it matter which we use in terms of optimal compression?

It does not! In terms of the proportions of seeing each character, the number of bits expended will be the same in the compressed format.

Let's see why that's the case...


Step 3: Compress Corpus Using Map


With the Encoding Maps created from the previous step, finding the compressed version of the corpus is easy:

To encode a text corpus using a Huffman Encoding Map, simply replace each character with its corresponding code in the map!
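
In Python terms (continuing the hypothetical sketch from the trie construction), that replacement is a one-liner:

  def compress(corpus, codes):
      # Concatenate each character's variable-length code
      return "".join(codes[char] for char in corpus)

  # Using Trie 1's encoding map (A = 0, B = 100, C = 101, D = 110, E = 111):
  trie_1_map = {"A": "0", "B": "100", "C": "101", "D": "110", "E": "111"}
  print(compress("ACADACBABE", trie_1_map))
  # 0101011001011000100111 -- 22 bits!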

Provide the compressed version of the running example corpus \(ACADACBABE\) using each of the 2 Huffman Encoding Maps above.

Things to note from the above:

  • Both encoding maps (although different codes) lead to the same compression rate.

  • Compare the number of bits that would be required by Extended-ASCII to encode this corpus (10 characters × 8 bits / character = 80 bits) to the 22 bits of the compressed bitstring associated with the Huffman Encoding. Impressive savings on even a small string!
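
We can double-check that count with a little arithmetic: weighting each character's code length (here, Trie 1's lengths of 1 bit for A and 3 bits for each of B, C, D, and E) by its number of appearances gives

$$4(1) + 2(3) + 2(3) + 1(3) + 1(3) = 22~\text{bits}$$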



Huffman Coding - Decompression

So great! We've compressed something... how do we get it back?!

Decompressing a Huffman Coded string works in the reverse, taking a Huffman Coded bitstring and reconstructing the original.

In our case, that will mean taking a bitstring and then recovering the Extended-ASCII characters that composed the original string.

When in possession of the Huffman Trie and compressed bitstring, the decompression process is trivial:

  1. Start at the root of the trie and the first bit in the bitstring; follow the left reference whenever a 0 is encountered in the bitstring, and the right reference whenever a 1 is encountered.

  2. Add the letter corresponding to a leaf node to the output whenever the above traversal hits a leaf.

  3. Begin again at the root for the next letter to decompress.
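
As a Python sketch (assuming the HuffNode trie from the construction sketch earlier), that walk looks like:

  def decompress(bits, root):
      output, node = [], root
      for bit in bits:
          node = node.left if bit == "0" else node.right  # follow the edge
          if node.char is not None:  # hit a leaf: record its letter...
              output.append(node.char)
              node = root            # ...and begin again at the root
      return "".join(output)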

Consider how to decompress the bitstring we generated using Trie 1 from the previous section: $$0101~0110~0101~1000~1001~11$$


How do we guarantee that this decompression process will uniquely translate back to the original?

Because, when in possession of the Huffman Trie that generated it, we have the guarantee that it is a prefix-free encoding such that there is a unique path from the root to each character.

Is there a rather large assumption (with regards to decompression) that we're making with the above?

Yes! We're assuming that, upon decompression, we have access to the Huffman Trie that generated the code!

This may be a bit presumptuous: consider the case wherein we are compressing some corpus of text to transmit to another computer via some network:

This issue can also crop up for files that have been compressed (a la .zip files), in which the encoding map may be different from file to file.


Conferring the Encoding Map


So, how do we confer the encoding map that was used to compress the corpus to the machine decompressing it?

[Brainstorm] How would you go about ensuring that the decompression process always has access to the compression encoding map?

Include it (in some way) as part of the bitstring!

Now just how we include it as part of the bitstring will be a bit of a question mark for now, but we can consider that any bitstring we need to decode could be formatted into two parts:

A file / bitstring can be decomposed into two components: (1) the header, containing meta-data about the file and (importantly, for decompressing files) a means of recovering the encoding map, and (2) the content, which is what needs to be decoded.

Why would it be insufficient to send the characters and their frequencies in the header as a means of recovering the encoding map?

Since frequencies / Huffman Codings are not unique for a particular corpus, we may not be able to recover the exact Huffman Trie that was used to compress it!

For instance, in our two encoding maps from the previous section, 110 is the code for "C" in Map 0 but "D" in Map 1. Since both Map 0 and 1 are the results of valid Huffman Tries on the same corpus, the frequencies of each character alone will not suffice in telling us which it was that was sent: Map 0 or 1?

As such, instead, we can try to port the Huffman Trie *itself* as part of the bitstring's header!


Encoding the Huffman Trie


Disclaimer: Although a workable solution for how to encode the Huffman Trie follows, it is a simplified one meant to give you a gist for how conferring the encoding map can be done in practice.

We can adopt a simple approach to converting the Huffman Trie into the bitstring's header by performing a preorder traversal of the trie:

  encodeTrie (Node n):
      if n is a leaf:
          add 1 to header
          add decompressed character to header
      else:
          add 0 to header
          encodeTrie(n.left)
          encodeTrie(n.right)

Things to note from this strategy:

  • We use 0 as a flag to indicate an internal node and 1 as a flag to indicate a leaf node.

  • When a leaf node is encountered, by "add decompressed character to header," we mean that (if the original corpus was encoded in Extended-ASCII) we would add the bytecode for the character that the leaf represents directly following the flagged 1 bit.
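
Continuing our running Python sketch (a simplification, remember -- real formats pack actual bits rather than "0" / "1" characters in a string):

  def encode_trie(node, header):
      if node.char is not None:
          header.append("1")  # leaf flag, followed by the character's
          header.append(format(ord(node.char), "08b"))  # 8-bit ASCII bytecode
      else:
          header.append("0")  # internal node flag, then recurse
          encode_trie(node.left, header)
          encode_trie(node.right, header)

  # Usage: header = []; encode_trie(root, header); "".join(header)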

This is more wordy than it is difficult; let's see how that would look!

Encode the Huffman Trie corresponding to Encoding Map 1 found in the section above as it might be transmitted in this simple header format.

If seeing the "A" and "B" in the middle of that bitstring is jarring, just note that I've saved you the work of having to look up 0100 0001 in the Extended-ASCII table, even though that is what would actually appear in the header.


Decoding the Huffman Trie


So we've sent the full file / message consisting of both header and content -- yay!

How do we, as the recipient / machine opening the compressed file, actually read the trie encoded in the header?

To reconstruct the Huffman Trie from its header encoded bitstring, we simply walk-back the preorder traversal:

  • Left nodes are always visited before right.

  • Leaf nodes are always first flagged by a 1 bit.

  • Internal nodes are always flagged by a 0 bit, meaning we recurse.
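
In our running Python sketch, this walk-back can consume the header's bits through an iterator, so that each recursive call picks up exactly where the last one left off:

  def decode_trie(bits):
      # bits: an iterator over the header string, e.g., iter(header)
      # (probabilities aren't needed for decoding, hence prob = 0 below)
      if next(bits) == "1":  # leaf: the next 8 bits are the character
          bytecode = "".join(next(bits) for _ in range(8))
          return HuffNode(0, char=chr(int(bytecode, 2)))
      left = decode_trie(bits)   # internal node: left subtree first (preorder)...
      right = decode_trie(bits)  # ...then right
      return HuffNode(0, left=left, right=right)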

Trie to (yes that was on purpose) recover the Trie associated with the following header:

Fantastic! No ambiguity, the full Huffman Trie is at our fingertips, and we can continue to decode the message at our leisure!



Practice

Whew! What a whirlwind of information!

How about a few practice problems to see if you've got everything down pat?

[Practice 1] Encode the Huffman Trie associated with Encoding Map 0 from the examples above.

[Practice 2] Decode the following message! If you've been following along this far, you must be ____: $$\text{Header:}~01A01B01C01D1E~~~~\text{Content:}~1110~1111~0~1110$$


