Key Takeaways: Huffman Coding is a greedy algorithm used for lossless data compression. It reduces file sizes by assigning shorter binary codes to frequently used characters, and longer codes to rarely used characters.
Data compression is the magic that allows us to stream high-definition movies, download massive games, and send images instantly over the internet.
When we compress data, we reduce its original size. There are two main types of data compression: lossless, where the original data can be reconstructed perfectly, and lossy, where some detail is permanently discarded in exchange for a smaller size.
Huffman Coding is one of the most foundational algorithms for lossless compression. It is actively used today as a core component in formats like ZIP, GZIP, and PNG!
Normally, text is stored in computers using standard encodings like ASCII or UTF-8.
In UTF-8, a standard English letter always takes up exactly 8 bits (1 byte) of memory.
If we have the word "lossless", it has 8 letters.
8 letters × 8 bits = 64 bits total.
In UTF-8, the letter s (which appears 4 times) takes 8 bits. The letter e (which appears only 1 time) also takes 8 bits.
Huffman Coding asks a brilliant question: Why should a highly common letter take up just as much space as a rare letter?
Huffman Coding solves this by using variable-length bit codes. It assigns very short codes (like 0 or 10) to characters that appear constantly, and longer codes to characters that rarely appear.
To generate these codes, Huffman Coding builds a binary tree. We label the edge going to each node's left child as 0, and the edge going to its right child as 1. A character's code is simply the path from the root down to its leaf, collecting the 0s and 1s along the way. In the Huffman Tree for the word "lossless", the most frequent letter ('s') ends up closest to the root with the shortest code!
Let's manually compress the word lossless using the algorithm.
Step 1: Count Frequencies
s : 4
l : 2
o : 1
e : 1

Step 2: Build the Tree
We repeatedly merge the two nodes with the smallest frequencies:

The two smallest nodes are o (1) and e (1). Merge them into a parent (2).
The remaining nodes are s (4), l (2), and Parent_oe (2).
The two smallest are now l (2) and Parent_oe (2). Merge them into a parent (4).
The remaining nodes are s (4) and Parent_loe (4). Merge them into the root (8).

Step 3: Extract the Codes
By following the left (0) and right (1) branches from the root down to the letters, we get our new dictionary:
s = 0 (Only 1 bit!)
l = 10 (2 bits)
o = 110 (3 bits)
e = 111 (3 bits)

Step 4: The Final Compression
Instead of 64 bits of UTF-8, our new compressed binary string for lossless (l-o-s-s-l-e-s-s) is:
10 110 0 0 10 111 0 0 -> 10110001011100
We compressed the data from 64 bits down to just 14 bits! That is a massive storage improvement.
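You can verify the 14-bit figure directly from the frequency table and the code lengths. A minimal sketch (the dictionaries below simply restate the codes and counts derived above):

```python
# Codes and frequencies from the "lossless" example above
codes = {'s': '0', 'l': '10', 'o': '110', 'e': '111'}
frequencies = {'s': 4, 'l': 2, 'o': 1, 'e': 1}

# Total compressed size = sum over characters of (frequency x code length)
compressed_bits = sum(frequencies[c] * len(codes[c]) for c in codes)

# Original size: 8 bits per letter in UTF-8
original_bits = sum(frequencies.values()) * 8

print(compressed_bits, original_bits)  # 14 64
```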
Looking at the continuous binary string 10110001011100, you might wonder: How does the computer know where one letter stops and the next begins?
This works perfectly because of the Prefix Rule: In Huffman Coding, no code is ever the prefix (the exact starting sequence) of another code.
Imagine a broken dictionary:
a = 1
b = 10
n = 11

If you receive the string 100110, how do you decode it? The first bit is 1. Is that the letter a, or is it the start of the letter b (10)? You have no idea!
Because Huffman builds a strict Binary Tree where data only lives on the leaf nodes (the very bottom), it is mathematically impossible for a code to be a prefix of another. When the computer reads a sequence, there is only one valid way to decode it.
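You can also test the Prefix Rule mechanically. Here is a small sketch (the helper name is my own) that flags any dictionary where one code is the prefix of another:

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of another code."""
    code_list = list(codes.values())
    for i, a in enumerate(code_list):
        for j, b in enumerate(code_list):
            if i != j and b.startswith(a):
                return False
    return True

# The "broken dictionary" from above fails the rule...
print(is_prefix_free({'a': '1', 'b': '10', 'n': '11'}))  # False ('1' is a prefix of '10')

# ...while a valid Huffman dictionary passes it
print(is_prefix_free({'a': '0', 'b': '10', 'n': '11'}))  # True
```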
Let's decode 100110110 using a valid Huffman dictionary:
a = 0
b = 10
n = 11

Reading the bits left to right:

1 -> No match.
10 -> Match! -> b
0 -> Match! -> a
11 -> Match! -> n
0 -> Match! -> a
11 -> Match! -> n
0 -> Match! -> a

The decoded word is banana!
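The walkthrough above is exactly the greedy left-to-right scan a decoder performs: read bits into a buffer until it matches a code, emit the letter, and reset. A quick sanity check of that trace:

```python
# The valid dictionary from the example, inverted for lookups
code_to_char = {'0': 'a', '10': 'b', '11': 'n'}

decoded = []
buffer = ''
for bit in '100110110':
    buffer += bit
    if buffer in code_to_char:        # the Prefix Rule guarantees a unique match
        decoded.append(code_to_char[buffer])
        buffer = ''                   # reset for the next letter

print(''.join(decoded))  # banana
```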
To program this, we need to create a Node class to build our tree, calculate the frequencies, and generate the binary codes.
Because we also need to decode the compressed data, we must create a reverse dictionary mapping the binary string back to the original characters.
```python
class Node:
    def __init__(self, char=None, freq=0):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

def build_huffman_tree(word):
    # 1. Count Frequencies
    frequencies = {}
    for char in word:
        frequencies[char] = frequencies.get(char, 0) + 1

    # Create a list of single-node trees
    nodes = [Node(char, freq) for char, freq in frequencies.items()]

    # 2. Build the Tree by repeatedly merging the two smallest nodes
    while len(nodes) > 1:
        # Sort nodes by frequency (a Priority Queue/Min-Heap is better for O(N log N))
        nodes.sort(key=lambda x: x.freq)
        left = nodes.pop(0)
        right = nodes.pop(0)

        # Create merged parent node
        merged = Node(freq=left.freq + right.freq)
        merged.left = left
        merged.right = right
        nodes.append(merged)

    return nodes[0]  # Return the Root node
```
```python
def generate_codes(node, current_code, codes):
    if node is None:
        return

    # If it's a leaf node, save the generated code.
    # The `or '0'` handles the edge case of a single-character input,
    # where the root itself is a leaf and the path would be empty.
    if node.char is not None:
        codes[node.char] = current_code or '0'
        return

    # Traverse left (add '0') and right (add '1')
    generate_codes(node.left, current_code + '0', codes)
    generate_codes(node.right, current_code + '1', codes)
```
```python
def huffman_encoding(word):
    root = build_huffman_tree(word)
    codes = {}
    generate_codes(root, '', codes)

    # Build the final compressed string
    encoded_word = ''.join(codes[char] for char in word)
    return encoded_word, codes
```
```python
def huffman_decoding(encoded_word, codes):
    current_code = ''
    decoded_chars = []

    # Invert the codes dictionary for instant lookups
    code_to_char = {v: k for k, v in codes.items()}

    for bit in encoded_word:
        current_code += bit
        if current_code in code_to_char:
            decoded_chars.append(code_to_char[current_code])
            current_code = ''  # Reset for the next letter

    return ''.join(decoded_chars)
```
```python
# --- Execution ---
text = "lossless"
compressed_string, dictionary = huffman_encoding(text)
decompressed_string = huffman_decoding(compressed_string, dictionary)
```
```python
print(f"Original Text: {text} ({len(text) * 8} bits)")
print(f"Huffman Dictionary: {dictionary}")
print(f"Compressed Data: {compressed_string} ({len(compressed_string)} bits)")
print(f"Decoded Text: {decompressed_string}")
```
Time Complexity: O(N log N). Finding the frequencies takes O(N) time. However, the cost of repeatedly extracting the two smallest nodes depends on how they are stored: re-sorting a plain list on every iteration (as in the code above) is simple but slower, while a Priority Queue (Min-Heap) builds the tree in O(N log N) time.

Space Complexity: O(N). We must store the unique characters in a Hash Map and generate a Binary Tree with up to 2N - 1 nodes.

Why is Huffman Coding considered a "lossless" compression algorithm?
What prevents the computer from getting confused when decoding a continuous stream of variable-length bits (e.g., 101100010)?
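Finally, the Priority Queue (Min-Heap) construction mentioned in the complexity notes can be sketched with Python's built-in heapq module. This is my own variant (the function name is made up for illustration); heap entries carry a tie-breaking counter because our Node objects are not comparable:

```python
import heapq
import itertools

class Node:
    def __init__(self, char=None, freq=0):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

def build_huffman_tree_heap(word):
    """O(N log N) tree construction using a min-heap instead of repeated sorting."""
    frequencies = {}
    for char in word:
        frequencies[char] = frequencies.get(char, 0) + 1

    counter = itertools.count()  # tie-breaker so heapq never compares Nodes
    heap = [(freq, next(counter), Node(char, freq)) for char, freq in frequencies.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)   # smallest frequency
        freq2, _, right = heapq.heappop(heap)  # second smallest
        merged = Node(freq=freq1 + freq2)
        merged.left = left
        merged.right = right
        heapq.heappush(heap, (merged.freq, next(counter), merged))

    return heap[0][2]  # the root node

root = build_huffman_tree_heap("lossless")
print(root.freq)  # 8 (the total character count)
```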