Assignment 3a: Huffman Encoding

$24.99 $18.99

Description For this project, you will implement a program to encode text using [Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding). Huffman coding is a method of lossless (the original can be reconstructed perfectly) data compression. Each character is assigned a variable length code consisting of ’0’s and ’1’s. The length of the code is based on how frequently the character…

5/5 – (2 votes)

You’ll get a: zip file solution

 

Categorys:

Description

5/5 – (2 votes)

Description

For this project, you will implement a program to encode text using

[Huffman coding](https://en.wikipedia.org/wiki/Huffman_coding).

Huffman coding is a method of lossless (the original

can be reconstructed perfectly) data compression. Each character is

assigned a variable length code consisting of ’0’s and ’1’s. The length

of the code is based on how frequently the character occurs; more

frequent characters are assigned shorter codes.

The basic idea is to use a binary tree, where each leaf node represents

a character and frequency count. The Huffman code of a character is then

determined by the unique path from the root of the binary tree to that

leaf node, where each ’left’ accounts for a ’0’ and a ’right’ accounts

for a ’1’ as a bit. Since the number of bits needed to encode a

character is the path length from the root to the character, a character

occurring frequently should have a shorter path from the root to their

node than an “infrequent” character, i.e. nodes representing frequent

characters are positioned less deep in the tree than nodes for

infrequent characters.

The key problem is therefore to construct such a binary tree based on

the frequency of occurrence of characters. A detailed example of how to

construct such a Huffman tree is provided here: [Huffman_Example.pdf](./Huffman_Example.pdf)

Note:

* You must provide test cases for all functions (Only test the provided

functions, with the exact name provided)

* Use descriptive names for data structures and helper functions.

Source Files

Start with the source files provided to you as part of GitHub Classroom,

or the ones in the [src directory](./src/) of this repository.

Functions

The following bullet points provide a guide to implement some of the

data structures and individual functions of your program. Start by

creating a file huffman.py and add functions and data definitions to

this file, if not otherwise stated. You should develop incrementally,

building test cases for each function as it is implemented.

Count Occurrences: cnt_freq(filename)

* Implement a function called cnt_freq(filename) that opens a text file

with a given file name (passed as a string) and counts the frequency

of occurrences of all the characters within that file. Use the

built-in Python List data structure of size 256 for counting the

occurrences of characters. This will provide efficient access to a

given position in the list. (In non-Python terminology you want an

array.) You can assume that in the input text file there are only

8-bit characters resulting in a total of 256 possible character

values. This function should return the 256 item list with the counts

of occurrences. There can be issues with extra characters in some

systems.

* Suppose the file to be encoded contained: ddddddddddddddddccccccccbbbbaaff

* Numbers in positions of freq counts [96:104] = [0, 2, 4, 8, 16, 0, 2, 0]

* For an empty file, this function should return a list of size 256 filled with all 0s.

Data Definition for Huffman Tree

A Huffman Tree is a binary tree of HuffmanNodes.

* A HuffmanNode class represents either a leaf or an internal node

(including the root node) of a Huffman tree. A HuffmanNode contains a

character (stored as an ASCII value) an occurrence count for that

character, as well as references to left and right Huffman subtrees,

each of which is a binary tree of HuffmanNodes. The character value

and occurrence count of an internal node are assigned as described

below. You may add fields in your HuffmanNode definition if you feel

it is necessary for your implementation. Do not change the names of

the fields specified in the huffman.py starter file given to you via

GitHub.

Build a Huffman Tree

Since the code depends on the order of the left and right branches take

in the path from the root to the leaf (character node), it is crucial to

follow a specific convention about how the tree is constructed. To do

this we need an ordering on the Huffman nodes.

* Start by defining the __lt__ (self, other) method for HuffmanNode

objects that returns true if self should come before other when added

to an OrderedList. A HuffmanNode ‘a’ should come before HuffmanNode

b’ if the occurrence count of ‘a’ is smaller than that of ‘b’. In

case of equal occurrence counts, break the tie by using the ASCII

value of the character to determine the order. If, for example, the

characters ’d’ and ’k’ appear exactly the same number of times in your

file, then ’d’ comes before ’k’, since the ASCII value of character

d’ is less than that of ’k’.

* Write a function that builds a Huffman tree from a given list of the

number of occurrences of characters returned by cnt_fref() and returns

the root node of that tree. Call this function

create_huff_tree(list_of_freqs).

* Start by creating an OrderedList (your implementation from Lab 4) of

individual Huffman trees each consisting of a single HuffmanNode

containing the character and its occurrence counts. Building the

actual tree involves removing the two nodes with the lowest

frequency count from the sorted list and connecting them to the left

and right field of a new created Huffman Node as in the example

provided. The node that comes before the other node should go in

the left field.

* Note that when connecting two HuffmanNodes to the left and right

field of a new parent node, that this new node is also a

HuffmanNode, but does not contain an actual character to

encode. Instead this new parent node should contain an occurrence

count that is the sum of the left and right child occurrence counts

as well as the minimum of the left and right character

representation in order to resolve ties in the __lt__ method.

* Once a new parent node has been created from the two nodes with the

lowest occurrence count as described above, that parent node is

inserted into the list of sorted nodes.

* This process of connecting nodes from the front of the sorted list

is continued until there is a single node left in the list, which is

the root node of the Huffman tree. create_huff_tree(list_of_freqs)

then returns this node.

* If there are no entries in the list of occurrences passed to this

function, it should return None.

* If there is only one entry in the list of occurrences passed to this

function, the tree returned by this function will consist of a

single node.

Build an Array for the Character Codes

* We have completed our Huffman tree, but we are still lacking a way to

get our Huffman codes. Implement a function named

create_code(root_node) that traverses the Huffman tree that was passed

as an argument and returns an array (using a Phython list) of 256

strings. Use the character’s respective integer ASCII representation

as the index into the arrary, with the resulting Huffman code for that

character stored at that location. Traverse the tree from the root to

each leaf node and adding a ’0’ when we go ’left’ and a ’1’ when we go

right’ constructing a string of 0’s and 1’s. You may want to:

* use the built-in ’+’ operator to concatenate strings of ’0’s and ’1’s here.

* You may want to initialize a Python list of strings that is

initially consists of 256 empty strings in create_code. When

create_code completes, this list will store for each character

(using the character’s respective integer ASCII representation as

the index into the list) the resulting Huffman code for the

character. The code will be represented by a sequence of ’0’s and

1’s in a string. Note that many entries in this list may still be

the empty string. Return this list.

2.5 Huffman Encoding

* Write a function called huffman_encode(in_file, out_file) (use that

exact name) that reads an input text file and writes to an output file

the following:

* A header (see below for format) on the first line in the file

(should end with a newline)

* Using the Huffman code, the encoded text into an output file.

* Write a function called create_header(list_of_freqs) that takes as

parameter the list of freqs previously determined from

cnt_freq(filename). The create_header function returns a string of the

ASCII values and their associated frequencies from the input file

text, separated by one space. For example,

create_header(list_of_freqs) would return “97 3 98 4 99 2” for the

text “aaabbbbcc”

* The huffman_encode function accepts two file names in that order:

input file name and output file name, represented as strings. If the

specified output file already exists, its old contents will be erased.

See example files in the test cases provided to see the format.

* Note: Writing the generated code as a string for each character into a

file will actually enlarge the file size instead compress it. The

explanation is simple: although we encoded our input text characters

in sequences of ’0’s and ’1’s, representing actual single bits, we

write them as individual ’0’ and ’1’ characters, i.e. 8 bits.

* To actually obtain a compressed the file you will also write the ’0’

and ’1’ characters as individual bits. Do this by taking the out_file

string and:

* Add _compressed to the name of the file before the .txt file. (Ex:

if out_file is named “output.txt”, you will create the string

output_compressed.txt”)

* Use the filename created above to create a HuffmanBitWriter object

using the huffman_bit_writer module.

* Use the methods in the HuffmanBitWriter object to write the header

and individual bits of the character codes to the compressed file.

Tests

* Write sufficient tests using unittest to ensure full functionality and

correctness of your program. Start by using the supplied

huffman_tests.py, and the various text files (e.g. file1.txt and

file1_soln.txt) to begin testing your programs. You may want to

initially comment out some of the tests, or ignore failures when

testing functionality that hasn’t been fully implemented.

* When testing, always consider edge conditions like the following cases:

* If the input file consists only of some number of a single

character, say “aaaaa”, it should write just that character followed

by a space followed by the number of occurrences with a newline: “97

5\n”

* In case of an empty input text file, your program should also

produce an empty file.

* If an input file does not exist, your program should raise a

FileNotFoundError.

Some Notes

When writing your own test files or using copy and paste from an editor,

take into account that most text editors will add a newline character to

the end of the file. If you end up having an additional newline

character ’\n’ in your text file, that wasn’t there before, then this

\n’ character will also be added to the Huffman tree and will therefore

appear in your generated string from the Huffman tree. In addition, an

actual newline character is treated different, depending on your

computer platform. Different operating systems have different codes for

“newline”. Windows uses ’\r\n’ = 0x0d 0x0a, while Unix and Mac use ’\n’

= 0x0a. It is always useful to use a hex editor to verify why certain

files that appear identical in a regular text editor are actually not

identical.  

Submission

You must submit (commit and push) the following files:

* huffman.py, containing the functions specified and any helper

functions necessary

* cnt_freq(filename): returns a Python list of 256 integers

representing the frequencies of characters in file (indexed by ASCII

value of characters)

* create_huff_tree(list_of_freqs): returns the root node of a Huffman

Tree, a Huffman node object

* create_code(root_node): returns a Python list of 256 strings

representing the code for each character (indexed by ASCII value of

the character). Note: Use an empty string to represent the code for

ASCII characters that do not appear in the file being compressed.

* create_header(list_of_freqs): returns a header for the output file

with ascii values and their associated counts, space separated.

* huffman_encode(in_file, out_file): encodes in_file and writes the it

to out_file

* huffman_tests.py, containing testcases for the functions specified by

the assignment

* This file should contain tests only for the functions specified

above and should only test functionality required by the

specification. These tests must run properly on any valid solution

and will be tested to see if they can catch bugs in incorrect

solutions.

* huffman_helper_tests.py, containing all testcases used in developing

your solution

* This file should contain all tests for your solution, including

tests for any helper functions that you may have used. These tests

will be used to verify that you have 100% coverage of your

solution. Your solution must pass these tests.

* ordered_list.py, your correct implementation of the OrderedList class

from Lab 4.

* huffman_bit_writer.py, unmodified, as provided to you.

Assignment 3a: Huffman Encoding
$24.99 $18.99