Key Takeaways: Huffman Coding is a greedy algorithm used for lossless data compression. It reduces file sizes by assigning shorter binary codes to frequently used characters, and longer codes to rarely used characters.
Data compression is the magic that allows us to stream high-definition movies, download massive games, and send images instantly over the internet.
When we compress data, we reduce its original size. There are two main types of data compression: lossless, where the original data can be reconstructed perfectly, and lossy, where some detail is permanently discarded in exchange for a smaller size.
Huffman Coding is one of the most foundational algorithms for lossless compression. It is actively used today as a core component in formats like ZIP, GZIP, and PNG!
Normally, text is stored in computers using standard encodings like ASCII or UTF-8.
In UTF-8, a standard English letter always takes up exactly 8 bits (1 byte) of memory.
If we have the word "lossless", it has 8 letters.
8 letters × 8 bits = 64 bits total.
In UTF-8, the letter s (which appears 4 times) takes 8 bits. The letter e (which appears only 1 time) also takes 8 bits.
Huffman Coding asks a brilliant question: Why should a highly common letter take up just as much space as a rare letter?
Huffman Coding solves this by using variable-length bit codes. It assigns very short codes (like 0 or 10) to characters that appear constantly, and longer codes to characters that rarely appear.
To generate these codes, Huffman Coding builds a binary tree. We label the edge going to each node's left child as 0, and the edge going to its right child as 1. A character's code is simply the path from the root down to its leaf, collecting the 0s and 1s along the way. In the Huffman Tree for the word "lossless", the most frequent letter ('s') ends up closest to the root with the shortest code!
Let's manually compress the word lossless using the algorithm.
Step 1: Count Frequencies
s : 4
l : 2
o : 1
e : 1

Step 2: Build the Tree
We repeatedly merge the two nodes with the smallest frequencies:

The two smallest nodes are o (1) and e (1). Merge them into a parent (2).
The remaining nodes are s (4), l (2), and Parent_oe (2).
The two smallest are now l (2) and Parent_oe (2). Merge them into a parent (4).
The remaining nodes are s (4) and Parent_loe (4). Merge them into the root (8).

Step 3: Extract the Codes
By following the left (0) and right (1) branches from the root down to the letters, we get our new dictionary:
s = 0 (Only 1 bit!)
l = 10 (2 bits)
o = 110 (3 bits)
e = 111 (3 bits)

Step 4: The Final Compression
Instead of 64 bits of UTF-8, our new compressed binary string for lossless (l-o-s-s-l-e-s-s) is:
10 110 0 0 10 111 0 0 -> 10110001011100
We compressed the data from 64 bits down to just 14 bits! That is a massive storage improvement.
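You can verify the 14-bit figure directly from the frequency table and the code lengths. A minimal sketch (the dictionaries below simply restate the codes and counts derived above):

```python
# Codes and frequencies from the "lossless" example above
codes = {'s': '0', 'l': '10', 'o': '110', 'e': '111'}
frequencies = {'s': 4, 'l': 2, 'o': 1, 'e': 1}

# Total compressed size = sum over characters of (frequency x code length)
compressed_bits = sum(frequencies[c] * len(codes[c]) for c in codes)

# Original size: 8 bits per letter in UTF-8
original_bits = sum(frequencies.values()) * 8

print(compressed_bits, original_bits)  # 14 64
```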
Looking at the continuous binary string 10110001011100, you might wonder: How does the computer know where one letter stops and the next begins?
This works perfectly because of the Prefix Rule: In Huffman Coding, no code is ever the prefix (the exact starting sequence) of another code.
Imagine a broken dictionary:
a = 1
b = 10
n = 11

If you receive the string 100110, how do you decode it? The first bit is 1. Is that the letter a, or is it the start of the letter b (10)? You have no idea!
Because Huffman builds a strict Binary Tree where data only lives on the leaf nodes (the very bottom), it is mathematically impossible for a code to be a prefix of another. When the computer reads a sequence, there is only one valid way to decode it.
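You can also test the Prefix Rule mechanically. Here is a small sketch (the helper name is my own) that flags any dictionary where one code is the prefix of another:

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of another code."""
    code_list = list(codes.values())
    for i, a in enumerate(code_list):
        for j, b in enumerate(code_list):
            if i != j and b.startswith(a):
                return False
    return True

# The "broken dictionary" from above fails the rule...
print(is_prefix_free({'a': '1', 'b': '10', 'n': '11'}))  # False ('1' is a prefix of '10')

# ...while a valid Huffman dictionary passes it
print(is_prefix_free({'a': '0', 'b': '10', 'n': '11'}))  # True
```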
Let's decode 100110110 using a valid Huffman dictionary:
a = 0
b = 10
n = 11

Reading the bits left to right:

1 -> No match.
10 -> Match! -> b
0 -> Match! -> a
11 -> Match! -> n
0 -> Match! -> a
11 -> Match! -> n
0 -> Match! -> a

The decoded word is banana!
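The walkthrough above is exactly the greedy left-to-right scan a decoder performs: read bits into a buffer until it matches a code, emit the letter, and reset. A quick sanity check of that trace:

```python
# The valid dictionary from the example, inverted for lookups
code_to_char = {'0': 'a', '10': 'b', '11': 'n'}

decoded = []
buffer = ''
for bit in '100110110':
    buffer += bit
    if buffer in code_to_char:        # the Prefix Rule guarantees a unique match
        decoded.append(code_to_char[buffer])
        buffer = ''                   # reset for the next letter

print(''.join(decoded))  # banana
```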
To program this, we need to create a Node class to build our tree, calculate the frequencies, and generate the binary codes.
Because we also need to decode the compressed data, we must create a reverse dictionary mapping the binary string back to the original characters.
```python
class Node:
    def __init__(self, char=None, freq=0):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

def build_huffman_tree(word):
    # 1. Count Frequencies
    frequencies = {}
    for char in word:
        frequencies[char] = frequencies.get(char, 0) + 1

    # Create a list of single-node trees
    nodes = [Node(char, freq) for char, freq in frequencies.items()]

    # 2. Build the Tree by repeatedly merging the two smallest nodes
    while len(nodes) > 1:
        # Sort nodes by frequency (a Priority Queue/Min-Heap is better for O(N log N))
        nodes.sort(key=lambda x: x.freq)
        left = nodes.pop(0)
        right = nodes.pop(0)

        # Create merged parent node
        merged = Node(freq=left.freq + right.freq)
        merged.left = left
        merged.right = right
        nodes.append(merged)

    return nodes[0]  # Return the Root node
```
```python
def generate_codes(node, current_code, codes):
    if node is None:
        return

    # If it's a leaf node, save the generated code.
    # The `or '0'` handles the edge case of a single-character input,
    # where the root itself is a leaf and the path would be empty.
    if node.char is not None:
        codes[node.char] = current_code or '0'
        return

    # Traverse left (add '0') and right (add '1')
    generate_codes(node.left, current_code + '0', codes)
    generate_codes(node.right, current_code + '1', codes)
```
```python
def huffman_encoding(word):
    root = build_huffman_tree(word)
    codes = {}
    generate_codes(root, '', codes)

    # Build the final compressed string
    encoded_word = ''.join(codes[char] for char in word)
    return encoded_word, codes
```
```python
def huffman_decoding(encoded_word, codes):
    current_code = ''
    decoded_chars = []

    # Invert the codes dictionary for instant lookups
    code_to_char = {v: k for k, v in codes.items()}

    for bit in encoded_word:
        current_code += bit
        if current_code in code_to_char:
            decoded_chars.append(code_to_char[current_code])
            current_code = ''  # Reset for the next letter

    return ''.join(decoded_chars)
```
```python
# --- Execution ---
text = "lossless"
compressed_string, dictionary = huffman_encoding(text)
decompressed_string = huffman_decoding(compressed_string, dictionary)
```
```python
print(f"Original Text: {text} ({len(text) * 8} bits)")
print(f"Huffman Dictionary: {dictionary}")
print(f"Compressed Data: {compressed_string} ({len(compressed_string)} bits)")
print(f"Decoded Text: {decompressed_string}")
```
Time Complexity: O(N log N). Finding the frequencies takes O(N) time. However, the cost of repeatedly extracting the two smallest nodes depends on how they are stored: re-sorting a plain list on every iteration (as in the code above) is simple but slower, while a Priority Queue (Min-Heap) builds the tree in O(N log N) time.

Space Complexity: O(N). We must store the unique characters in a Hash Map and generate a Binary Tree with up to 2N - 1 nodes.

Why is Huffman Coding considered a "lossless" compression algorithm?
What prevents the computer from getting confused when decoding a continuous stream of variable-length bits (e.g., 101100010)?
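Finally, the Priority Queue (Min-Heap) construction mentioned in the complexity notes can be sketched with Python's built-in heapq module. This is my own variant (the function name is made up for illustration); heap entries carry a tie-breaking counter because our Node objects are not comparable:

```python
import heapq
import itertools

class Node:
    def __init__(self, char=None, freq=0):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

def build_huffman_tree_heap(word):
    """O(N log N) tree construction using a min-heap instead of repeated sorting."""
    frequencies = {}
    for char in word:
        frequencies[char] = frequencies.get(char, 0) + 1

    counter = itertools.count()  # tie-breaker so heapq never compares Nodes
    heap = [(freq, next(counter), Node(char, freq)) for char, freq in frequencies.items()]
    heapq.heapify(heap)

    while len(heap) > 1:
        freq1, _, left = heapq.heappop(heap)   # smallest frequency
        freq2, _, right = heapq.heappop(heap)  # second smallest
        merged = Node(freq=freq1 + freq2)
        merged.left = left
        merged.right = right
        heapq.heappush(heap, (merged.freq, next(counter), merged))

    return heap[0][2]  # the root node

root = build_huffman_tree_heap("lossless")
print(root.freq)  # 8 (the total character count)
```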