Data compression -- Huffman tree and Huffman compression

The idea of Huffman compression: use fewer bits to represent frequent characters and use more bits to represent fewer characters. This reduces the total number of bits used to represent the string.

Premise: all character encodings will not be prefixes of other character encodings. Using Huffman tree can guarantee the premise.

Construct the Huffman tree:

First, the node class of Huffman tree is defined

private static class Node implements Comparable<Node> {
        private final char ch;
        private final int freq;
        private final Node left, right;

        Node(char ch, int freq, Node left, Node right) {
            this.ch    = ch;
            this.freq  = freq;
            this.left  = left;
            this.right = right;
        }

        private boolean isLeaf() {
            return (left == null) && (right == null);
        }

        public int compareTo(Node that) {
            return this.freq - that.freq;
        }
    }

Then construct the Huffman tree:

Huffman tree is a two round algorithm, it needs to scan the target string twice to compress it. The first scan counts the frequency of each character, and the second scan compresses based on the generated compiled table.

The construction process is as follows: create an independent node for each character (it can be regarded as a tree with only one node). First, find two nodes with the lowest frequency, and then create a new node with these two nodes as children (the frequency value of the new node is the sum of the frequency values of the two children); this operation will reduce the number of trees in the forest by one. Repeat the process until there is only one tree left. In this way, the path from the root node of the tree to the leaf node is the Huffman code corresponding to the characters in the leaf node.

 private static Node buildTrie(int[] freq) {
        MinPQ<Node> pq = new MinPQ<Node>();
        for (char i = 0; i < R; i++)
            if (freq[i] > 0)
                pq.insert(new Node(i, freq[i], null, null));

        while (pq.size() > 1) {
            Node left  = pq.delMin();
            Node right = pq.delMin();
            Node parent = new Node('\0', left.freq + right.freq, left, right);
            pq.insert(parent);
        }
        return pq.delMin();
    }

Decoding operation:

According to the Huffman tree, the bit stream is decoded: according to the input of the bit stream, it moves downward from the root node (when it encounters 0 left child node, when it encounters 1 right child node), when it encounters the leaf node, it outputs the characters of the leaf node and returns to the root node.

 public static void expand() {
        Node root = readTrie(); 
        int length = BinaryStdIn.readInt();

        for (int i = 0; i < length; i++) {
            Node x = root;
            while (!x.isLeaf()) {
                boolean bit = BinaryStdIn.readBoolean();
                if (bit) x = x.right;
                else     x = x.left;
            }
            BinaryStdOut.write(x.ch);
        }
        BinaryStdOut.close();
    }

Compression operation:

The compression operation is implemented according to the constructed compiled table. According to the Hoffman tree, a table of binary strings corresponding to characters and paths is established, and then the target strings are scanned. Each time a character is read in, the corresponding binary strings are obtained by looking up the table and output.

To build a compiled table:

 private static void buildCode(String[] st, Node x, String s) {
        if (!x.isLeaf()) {
            buildCode(st, x.left,  s + '0');
            buildCode(st, x.right, s + '1');
        }
        else {
            st[x.ch] = s;
        }
    }

Use compiled tables for compression:

for (int i = 0; i < input.length; i++) {
        String code = st[input[i]];
        for (int j = 0; j < code.length(); j++) {
            if (code.charAt(j) == '0')
                BinaryStdOut.write(false);
            else (code.charAt(j) == '1') 
                BinaryStdOut.write(true);
        }
    }

For any prefix code, the length of the encoded bit string is equal to the weighted external path length of the Huffman tree.

Given a set of r symbols and their frequencies, the prefix codes constructed by Huffman algorithm are optimal.

Posted on Sun, 03 May 2020 20:27:15 -0700 by Sterculius