Huffman Coding

DSA Advanced: Huffman Coding

Key Takeaways: Huffman Coding is a greedy algorithm used for lossless data compression. It reduces file sizes by assigning shorter binary codes to frequently used characters, and longer codes to rarely used characters.

Data compression is the magic that allows us to stream high-definition movies, download massive games, and send images instantly over the internet.

When we compress data, we reduce its original size. There are two main types of data compression:

  • Lossless compression: the original data can be rebuilt perfectly from the compressed version.
  • Lossy compression: some detail is permanently thrown away for even smaller sizes (common for images, audio, and video).

Huffman Coding is one of the most foundational algorithms for lossless compression. It is actively used today as a core component in formats like ZIP, GZIP, and PNG!


1. The Problem with Standard Encoding

Normally, text is stored in computers using standard encodings like ASCII or UTF-8.

In UTF-8, a standard English letter always takes up exactly 8 bits (1 byte) of memory. If we have the word "lossless", it has 8 letters. 8 letters × 8 bits = 64 bits total.
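A quick sanity check of that arithmetic in Python (the variable names are just for illustration):

```python
text = "lossless"
# Every ASCII letter is stored as exactly 1 byte (8 bits) in UTF-8
size_in_bits = len(text.encode("utf-8")) * 8
print(size_in_bits)  # 64
```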

In UTF-8, the letter s (which appears 4 times) takes 8 bits. The letter e (which appears only 1 time) also takes 8 bits.

Huffman Coding asks a brilliant question: Why should a highly common letter take up just as much space as a rare letter?


2. How Huffman Coding Works

Huffman Coding solves this by using variable-length bit codes. It assigns very short codes (like 0 or 10) to characters that appear constantly, and longer codes to characters that rarely appear.

The Step-by-Step Algorithm

  1. Count Frequencies: Count exactly how many times each character appears in the text.
  2. Build a Binary Tree:
    • Start with the two characters that have the lowest count.
    • Merge them together into a new "parent" node. The parent's count is the sum of the two children.
    • Repeat this merging process until there is only one single "Root" node left.
  3. Assign Bits: For every parent node, label the edge going to its left child as 0, and the edge going to its right child as 1.
  4. Extract the Codes: To find the code for any letter, start at the Root and follow the path down to the letter, writing down the 0s and 1s along the way.

The Huffman Tree for the word "lossless". The most frequent letter ('s') ends up closest to the root, giving it the shortest code!


3. Creating a Huffman Code Manually

Let's manually compress the word lossless using the algorithm.

Step 1: Count Frequencies

  • s: 4
  • l: 2
  • o: 1
  • e: 1

Step 2: Build the Tree

  • Merge the two rarest letters, o (1) and e (1), into a parent node with count 2.
  • Merge l (2) with the (o, e) node (2) into a parent node with count 4.
  • Merge s (4) with that node (4) into the Root node with count 8.

Step 3: Extract the Codes

By following the left (0) and right (1) branches from the root down to the letters, we get our new dictionary:

  • s = 0
  • l = 10
  • o = 110
  • e = 111

Step 4: The Final Compression

Instead of 64 bits of UTF-8, our new compressed binary string for lossless (l-o-s-s-l-e-s-s) is:

10 110 0 0 10 111 0 0 -> 10110001011100

We compressed the data from 64 bits down to just 14 bits! That is a massive storage improvement.
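We can double-check the 14-bit result in Python, using the code dictionary derived above:

```python
codes = {'s': '0', 'l': '10', 'o': '110', 'e': '111'}
encoded = ''.join(codes[char] for char in "lossless")
print(encoded, len(encoded))  # 10110001011100 14
```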


4. The Prefix Rule (Why it doesn't break)

Looking at the continuous binary string 10110001011100, you might wonder: How does the computer know where one letter stops and the next begins?

This works perfectly because of the Prefix Rule: In Huffman Coding, no code is ever the prefix (the exact starting sequence) of another code.

Imagine a broken dictionary:

  • a = 1
  • b = 10

If you receive the string 100110, how do you decode it? The first bit is 1. Is that the letter a, or is it the start of the letter b (10)? You have no idea!

Because Huffman builds a strict Binary Tree where data only lives on the leaf nodes (the very bottom), it is mathematically impossible for a code to be a prefix of another. When the computer reads a sequence, there is only one valid way to decode it.
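If you want to test this property yourself, here is a small sketch (the helper name is_prefix_free is our own) that checks whether any code starts with another code:

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of another code."""
    values = list(codes.values())
    return not any(a != b and b.startswith(a) for a in values for b in values)

print(is_prefix_free({'s': '0', 'l': '10', 'o': '110', 'e': '111'}))  # True
print(is_prefix_free({'a': '1', 'b': '10'}))                          # False (broken dictionary)
```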

Decoding Example

Let's decode 100110110 using a valid Huffman dictionary (a = 0, b = 10, n = 11):

  1. Read 1. No match.
  2. Read 10. Match! -> b
  3. Read 0. Match! -> a
  4. Read 11. Match! -> n
  5. Read 0. Match! -> a
  6. Read 11. Match! -> n
  7. Read 0. Match! -> a

The decoded word is banana!
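The walk above translates directly into a short loop; the dictionary is the one from the example (a = 0, b = 10, n = 11):

```python
code_to_char = {'0': 'a', '10': 'b', '11': 'n'}
current = ''
decoded = []
for bit in "100110110":
    current += bit
    if current in code_to_char:  # thanks to the Prefix Rule, only one code can match
        decoded.append(code_to_char[current])
        current = ''
print(''.join(decoded))  # banana
```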


5. Python Implementation (Encoding & Decoding)

To program this, we need to create a Node class to build our tree, calculate the frequencies, and generate the binary codes.

Because we also need to decode the compressed data, we must create a reverse dictionary mapping the binary string back to the original characters.

Huffman Coding Algorithm (Python)

class Node:
    def __init__(self, char=None, freq=0):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

def build_huffman_tree(word):
    # 1. Count Frequencies
    frequencies = {}
    for char in word:
        frequencies[char] = frequencies.get(char, 0) + 1

    # Create a list of single-node trees
    nodes = [Node(char, freq) for char, freq in frequencies.items()]

    # 2. Build the Tree by repeatedly merging the two smallest nodes
    while len(nodes) > 1:
        # Sort nodes by frequency (a Priority Queue/Min-Heap is better for O(N log N))
        nodes.sort(key=lambda x: x.freq)
        left = nodes.pop(0)
        right = nodes.pop(0)

        # Create merged parent node
        merged = Node(freq=left.freq + right.freq)
        merged.left = left
        merged.right = right
        nodes.append(merged)

    return nodes[0]  # Return the Root node

def generate_codes(node, current_code, codes):
    if node is None:
        return
    # If it's a leaf node, save the generated code
    if node.char is not None:
        codes[node.char] = current_code
    # Traverse left (add '0') and right (add '1')
    generate_codes(node.left, current_code + '0', codes)
    generate_codes(node.right, current_code + '1', codes)

def huffman_encoding(word):
    root = build_huffman_tree(word)
    codes = {}
    generate_codes(root, '', codes)
    # Build the final compressed string
    encoded_word = ''.join(codes[char] for char in word)
    return encoded_word, codes

def huffman_decoding(encoded_word, codes):
    # Invert the codes dictionary for instant lookups
    code_to_char = {v: k for k, v in codes.items()}
    current_code = ''
    decoded_chars = []
    for bit in encoded_word:
        current_code += bit
        if current_code in code_to_char:
            decoded_chars.append(code_to_char[current_code])
            current_code = ''  # Reset for the next letter
    return ''.join(decoded_chars)

# --- Execution ---
text = "lossless"
compressed_string, dictionary = huffman_encoding(text)
decompressed_string = huffman_decoding(compressed_string, dictionary)

print(f"Original Text: {text} ({len(text) * 8} bits)")
print(f"Huffman Dictionary: {dictionary}")
print(f"Compressed Data: {compressed_string} ({len(compressed_string)} bits)")
print(f"Decoded Text: {decompressed_string}")


6. Time and Space Complexity

Counting frequencies takes O(n) time for a text of n characters. The tree is built from k leaf nodes, where k is the number of distinct characters; the repeated list sort used above costs roughly O(k² log k), while a priority queue (min-heap) reduces tree building to O(k log k). Encoding and decoding are both linear in the size of their input. The tree and the code dictionary need O(k) extra space.
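The implementation in Section 5 re-sorts the node list on every merge. As a sketch of the faster approach (the function name build_huffman_tree_heap is our own, and the Node class mirrors the one in Section 5), Python's heapq module keeps the two smallest nodes available with O(log k) pushes and pops:

```python
import heapq
from collections import Counter

class Node:  # same shape as the Node class in Section 5
    def __init__(self, char=None, freq=0):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

def build_huffman_tree_heap(word):
    # Each heap entry is (frequency, tie-breaker, node); the tie-breaker
    # keeps Python from ever comparing two Node objects directly.
    heap = [(freq, i, Node(char, freq))
            for i, (char, freq) in enumerate(Counter(word).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # smallest frequency
        f2, _, right = heapq.heappop(heap)  # second smallest
        merged = Node(freq=f1 + f2)
        merged.left, merged.right = left, right
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]  # the Root node

root = build_huffman_tree_heap("lossless")
print(root.freq)  # 8
```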


Exercise 1 of 2


Why is Huffman Coding considered a "lossless" compression algorithm?

Exercise 2 of 2


What prevents the computer from getting confused when decoding a continuous stream of variable-length bits (e.g., 101100010)?