Huffman Compression - 霍夫曼压缩

主要思想:放弃文本文件的普通保存方式:不再使用7位或8位二进制数表示每一个字符,而是用较少的比特表示出现频率最高的字符,用较多的比特表示出现频率低的字符

使用变长编码来表示字符串,势必会导致编解码时码字的唯一性问题,因此需要一种编解码方式唯一的前缀码,而表示前缀码的一种简单方式就是使用单词查找树,其中最优前缀码即为Huffman首创。

以符号F, O, R, G, E, T为例,其出现的频次如以下表格所示。

Symbol F O R G E T
Frequence 2 3 4 4 5 7
Code 000 001 100 101 01 11

则对各符号进行霍夫曼编码的动态演示如下图所示。基本步骤是将出现频率由小到大排列,组成子树后频率相加作为整体再和其他未加入二叉树中的节点频率比较。加权路径长为节点的频率乘以树的深度。

Huffman

Python 实现

  1. """
  2. Use serveral ways to compress string `everyday is awesome!`
  3. 1. use simple bits to replace ASCII value
  4. 2. use huffman coding
  5. """
  6. import heapq
  7. import collections
  8. def get_rate(compressed_binary, uncompressed_bits):
  9. return len(compressed_binary) * 100 / uncompressed_bits
  10. class SimpleCompression:
  11. def __init__(self, string):
  12. self.symbols = set(string)
  13. self.bit_len = 1
  14. while 2**self.bit_len < len(self.symbols):
  15. self.bit_len += 1
  16. self.string = string
  17. self.s2b = {}
  18. self.b2s = {}
  19. i = 0
  20. for s in self.symbols:
  21. b = bin(i)[2:]
  22. if len(b) < self.bit_len:
  23. b = (self.bit_len - len(b)) * '0' + b
  24. self.s2b[s] = b
  25. self.b2s[b] = s
  26. i += 1
  27. def compress(self):
  28. bits = ''
  29. for s in self.string:
  30. bits += self.s2b[s]
  31. return bits
  32. def uncompress(self, bits):
  33. string = ''
  34. for i in xrange(0, len(bits), self.bit_len):
  35. string += self.b2s[bits[i:i + self.bit_len]]
  36. return string
  37. class HuffmanCompression:
  38. class Trie:
  39. def __init__(self, val, char=''):
  40. self.val = val
  41. self.char = char
  42. self.coding = ''
  43. self.left = self.right = None
  44. def __cmp__(self, other):
  45. return self.val - other.val
  46. def __init__(self, string):
  47. self.string = string
  48. counter = collections.Counter(string)
  49. heap = []
  50. for char, cnt in counter.items():
  51. heapq.heappush(heap, HuffmanCompression.Trie(cnt, char))
  52. while len(heap) != 1:
  53. left = heapq.heappop(heap)
  54. right = heapq.heappop(heap)
  55. trie = HuffmanCompression.Trie(left.val + right.val)
  56. trie.left, trie.right = left, right
  57. heapq.heappush(heap, trie)
  58. self.root = heap[0]
  59. self.s2b = {}
  60. self.bfs_encode(self.root, self.s2b)
  61. def bfs_encode(self, root, s2b):
  62. queue = collections.deque()
  63. queue.append(root)
  64. while queue:
  65. node = queue.popleft()
  66. if node.char:
  67. s2b[node.char] = node.coding
  68. continue
  69. if node.left:
  70. node.left.coding = node.coding + '0'
  71. queue.append(node.left)
  72. if node.right:
  73. node.right.coding = node.coding + '1'
  74. queue.append(node.right)
  75. def compress(self):
  76. bits = ''
  77. for char in self.string:
  78. bits += self.s2b[char]
  79. return bits
  80. def uncompress(self, bits):
  81. string = ''
  82. root = self.root
  83. for bit in bits:
  84. if bit == '0':
  85. root = root.left
  86. else:
  87. root = root.right
  88. if root.char:
  89. string += root.char
  90. root = self.root
  91. return string
  92. if __name__ == '__main__':
  93. s = 'everyday is awesome!'
  94. # ASCII
  95. bits = len(s) * 8
  96. print('Total bits: %d' % bits)
  97. # simple compression
  98. sc = SimpleCompression(s)
  99. compressed = sc.compress()
  100. print('Compressed binary: ' + compressed)
  101. print('Uncompressed: ' + sc.uncompress(compressed))
  102. print(sc.s2b)
  103. print('Simple Compression-compress rate: %d%%' % get_rate(compressed, bits))
  104. print('===================')
  105. # huffman compression
  106. hc = HuffmanCompression(s)
  107. compressed = hc.compress()
  108. print('Compressed binary: ' + compressed)
  109. print('Uncompressed: ' + hc.uncompress(compressed))
  110. print(hc.s2b)
  111. print('Huffman Compression-compress rate: %d%%' % get_rate(compressed, bits))
  112. """
  113. Total bits: 160
  114. Compressed binary: 00101011001010001100001100001100000101000111000100001010001001110110010100101001
  115. Uncompressed: everyday is awesome!
  116. {'a': '0000', ' ': '0001', 'e': '0010', 'd': '0011', 'i': '0100', 'm': '0101', 'o': '0110', 's': '0111', 'r': '1000', '!': '1001', 'w': '1010', 'v': '1011', 'y': '1100'}
  117. Simple Compression-compress rate: 50%
  118. ===================
  119. Compressed binary: 011001011011110011010111111000010000111000111111010011110100011011010001
  120. Uncompressed: everyday is awesome!
  121. {'!': '0001', ' ': '001', 'e': '01', 'd': '11010', 'i': '0000', 'm': '11011', 'o': '1000', 's': '1110', 'r': '1011', 'a': '1111', 'w': '1010', 'v': '1001', 'y': '1100'}
  122. Huffman Compression-compress rate: 45%
  123. """

源码分析

简单压缩: 根据字符串出现的字符,将ASCII替换成更短的表示形式
霍夫曼压缩: 根据字符串出现频率,构建Trie树, 对每个tree node进行定义,使得频率越高的字符离root节点越近

有关霍夫曼编码的具体步骤可参考 Huffman 编码压缩算法 | 酷 壳 - CoolShell.cn霍夫曼编码 - 维基百科,自由的百科全书,清晰易懂。