Huffman compressor

Pages: 12
Hello!
I have made a huffman compressor and before moving on to making the decompressor i wanted to ask a few questions about the design:
-i broke the compressor into functions was it a good decision ?
-if i were to get rid of the private functions and just put the code inside of the compress function it would allow me to declare the member variables locally inside of the function instead of declaring them in the private area should i consider doing so even tough i don't have a memory limit? will it impact performance if i decide to multithread ?
-are there parts of code that can be changed to significantly increase the performance?
-why does the code only allow me to compress .txt files ?

NOTE: I am programming as a hobby and i know the code is far from perfect.
-A destructor will be added soon once i figure out how i want to implement the decompressor
-the code works as i intended but only with .txt files other formats seem to not be supported
-will be separated into .h and .cpp files once finished

the code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
#include<iostream>
#include<fstream>
#include<map>
#include<list>
#include<vector>
#include<string>

struct node {
	unsigned char m_data{};
	uint64_t m_frequency{};
	node* m_left;
	node* m_right;

	node() : m_left{ nullptr }, m_right{ nullptr } {};
	node(node* left, node* right) :m_left{ left }, m_right{ right } {
		m_frequency = m_left->m_frequency + m_right->m_frequency;
	};
};

struct functor {
	bool operator()(const node* one, const node* two) const { return one->m_frequency < two->m_frequency; }
};

class Huffman {
private:
	node* m_root{ nullptr };//use in destructor to free the memory
	std::map<unsigned char, uint64_t> m_table;
	std::list<node*> m_list;
	std::ifstream m_inFile; 
	std::vector<bool> m_code;
	std::map<unsigned char, std::vector<bool> > m_key;
	std::string fileName;
	std::string fileExten;

	/*compressor related functions*/
	void setNameAndExten(std::string filePath);
	void setTable();
	void storeInList();
	node* makeTree();
	void encode(node* root);
	void createBinaryFile(std::string filePath);
	void createKey(std::string filePath);
	/*compressor related functions*/

public:
	void compress(std::string filePath, std::string locToCreateKey, std::string locToCompress);
};

void Huffman::setNameAndExten(std::string filePath) {
	std::size_t foundName = filePath.find_last_of("/\\");
	std::size_t foundExten = filePath.find_last_of('.');
	fileName = filePath.substr(foundName + 1, foundExten - foundName - 1);
	fileExten = filePath.substr(foundExten);
}

void Huffman::setTable() {
	while (m_inFile.good()) {
		unsigned char data = m_inFile.get();
		if (!m_inFile.good()) break;
		m_table[data]++;
	}//end of while loop
	m_inFile.clear();
	m_inFile.seekg(0, std::ios::beg);
}

void Huffman::storeInList() {
	for (std::map<unsigned char, uint64_t>::iterator index = m_table.begin(); index != m_table.end(); index++) {
		node* leaf = new node;
		leaf->m_data = index->first;
		leaf->m_frequency = index->second;
		m_list.push_back(leaf);
	}//end of for loop
}

node* Huffman::makeTree() {
	while (m_list.size() > 1) {
		m_list.sort(functor());
		node* leftSon = m_list.front();
		m_list.pop_front();
		node* rightSon = m_list.front();
		m_list.pop_front();
		node* parent = new node(leftSon, rightSon);
		m_list.push_back(parent);
	}//end of while loop
	return m_list.front();
}

void Huffman::encode(node* root) {
	if (root->m_left != nullptr) {
		m_code.push_back(0);
		encode(root->m_left);
	}//end of if
	if (root->m_right != nullptr) {
		m_code.push_back(1);
		encode(root->m_right);
	}//end of if 
	if (root->m_data) m_key[root->m_data] = m_code;
	if(!m_code.empty()) m_code.pop_back();
}

void Huffman::createBinaryFile(std::string filePath) {
	int offSet{}; unsigned char buffer{};
	std::ofstream outFile(filePath + fileName+ "Compressed.bin", std::ios::binary);
	while (m_inFile.good()) {
		buffer = m_inFile.get();
		if (!m_inFile.good()) break;
		m_code = m_key[buffer];
		buffer = 0;
		for (int index{}; index < m_code.size(); index++) {
			buffer = buffer | m_code[index] << (7 - offSet);
			++offSet;
			if (offSet == 8) {
				offSet = 0;
				outFile.put(buffer);
				buffer = 0;
			}//end of if
		}//end of for loop
	}//end of while loop
	outFile.close();
}


void Huffman::createKey(std::string filePath) {
	std::ofstream key(filePath + fileName + "Key.txt");
	for (std::map<unsigned char, std::vector<bool> >::iterator index = m_key.begin(); index != m_key.end(); index++) {
		key.put(index->first);
		for (int i{}; i < index->second.size();i++) {
			if(index->second[i]) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
	key << fileExten;
	key.close();
}


void Huffman::compress(std::string filePath, std::string locToCreateKey, std::string locToCompress) {
	setNameAndExten(filePath);
	std::cout << fileName << std::endl;
	std::cout << fileExten << std::endl;
	m_inFile.open(filePath, std::ios::binary);
	setTable();
	storeInList();
	m_root = makeTree();
	encode(m_root);
	createBinaryFile(locToCompress);
	m_inFile.close();
	createKey(locToCreateKey);
}

int main() {
	Huffman huff;
	huff.compress("C:/Users/User/Desktop/test.txt", "C:/Users/User/Desktop/", "C:/Users/User/Desktop/");
	return 0;
}
Hi,

-i broke the compressor into functions was it a good decision ?


Yes, that's almost always a good idea.

-if i were to get rid of the private functions and just put the code inside of the compress function it would allow me to declare the member variables locally inside of the function instead of declaring them in the private area should i consider doing so even tough i don't have a memory limit?


If you were to do that, the variables go out of scope as soon as the function ends, it's better to have them private.


-will be separated into .h and .cpp files once finished


Try to avoid doing that after the fact :+). When you create a new class, your IDE should create the .cpp and .hpp files for you, no reason for it to be any harder that way :+)

Pass string by const ref where you can:

1
2
3
void Huffman::compress(const std::string& filePath, 
                      const std::string& locToCreateKey, 
                      const std::string& locToCompress) {


I find that formatting easier when there are multiple parameters :+)

Rather than iterators, try a range based for loop, when you want to iterate the whole container:
http://en.cppreference.com/w/cpp/language/range-for

1
2
3
4
5
6
7
8
9
10
11
12
13
14
void Huffman::createKey(std::string filePath) {
	std::ofstream key(filePath + fileName + "Key.txt");
for (std::map<unsigned char, std::vector<bool> >::iterator index = m_key.begin(); index != m_key.end(); index++) {
for (const auto& index : m_key) {

		key.put(index->first);
		for (int i{}; i < index->second.size();i++) {
			if(index->second[i]) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
key << fileExten;
	key.close();
}


A fancier version of that uses a forwarding reference (the auto&&) :

void createKey(std::string filePath, const std::map<unsigned char, std::vector<bool>>& m_keyArg);

1
2
3
4
5
6
7
8
9
10
11
12
13
14
void Huffman::createKey(std::string filePath, const std::map<unsigned char, std::vector<bool>>& m_keyArg) {
	std::ofstream key(filePath + fileName + "Key.txt");

for ( auto&& index : m_keyArg) {

		key.put(index.first);
		for (std::size_t i{}; i < index.second.size();i++) { //could do ranged based for here too
			if(index.second[i]) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
key << fileExten;
	key.close();
}


147
148
149
m_inFile.close();
	createKey(locToCreateKey, m_key);
}


Forwarding references are the safest because they do the right thing for const, non-const, lvalues and rvalues.

Notice I use the dot notation rather than -> because it's reference now, as opposed to an iterator.

In member function 'void Huffman::createBinaryFile(std::string)':
109:27: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
In member function 'void Huffman::createKey(std::string, const std::map<unsigned char, std::vector<bool> >&)':
143:19: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]


These warnings arise because size functions return a std::size_t , so use that type in the for loop, rather than int

Note what I said in the other topic about the efficiency of vector versus other containers. Do some testing :+D Hopefully you have version control, it's easier if you are going alter types here and there.

There you go, some ideas :+)
Thanks for the response!
I have implemented the range FOR instead if using iterators and was just curious why people dislike them ?

If you were to do that, the variables go out of scope as soon as the function ends, it's better to have them private.

I don't really need the variables once the compress function finishes executing is it still worth leaving them private? (NOTE: there is no memory restriction)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
void Huffman::createKey(std::string filePath, const std::map<unsigned char, std::vector<bool>>& m_keyArg) {
	std::ofstream key(filePath + fileName + "Key.txt");

for ( auto&& index : m_keyArg) {

		key.put(index.first);
		for (std::size_t i{}; i < index.second.size();i++) { //could do ranged based for here too
			if(index.second[i]) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
key << fileExten;
	key.close();
}


I don't understand why you passed in the m_key by const reference if it can be seen by the function without being passed in wouldn't it be the same if i declared the function as following ?:
void Huffman::createKey(const std::string& filePath) const;

Just wanted to post the change i made for the createKey critique is needed :)
1
2
3
4
5
6
7
8
9
10
11
12
void Huffman::createKey(const std::string& filePath) const {
	std::ofstream key(filePath + fileName + "Key.txt");
	for (auto&& index : m_key) {
		key.put(index.first);
		for (const auto& itr : index.second) {
			if (itr) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
	key << fileExten;
	key.close();
}


Last edit for today: i am going to change the way i access the files, i think i know why i cant read other formats and i'm going to go over this
Note what I said in the other topic about the efficiency of vector versus other containers. Do some testing :+D Hopefully you have version control, it's easier if you are going alter types here and there.
today. I have a long sleepless night ahead :)
Last edited on
Hi,
Thanks for the response!


Thanks for taking notice of what I mentioned :+D


I have implemented the range FOR instead if using iterators and was just curious why people dislike them ?


Range based for is safer because it is never going to go out bounds - it is controlled by the container's begin and end iterators which are obtained implicitly. Also it's less typing and easier to read once the idiom is mastered.

I don't really need the variables once the compress function finishes executing is it still worth leaving them private?


Ok, if if you really don't need them - then that is fine.

I don't understand why you passed in the m_key by const reference if it can be seen by the function without being passed in wouldn't it be the same if i declared the function as following ?:
void Huffman::createKey(const std::string& filePath) const;


I wanted to make the container const, one can't have const auto&& , silly me - you fixed that by making the whole function const :+D Well done !

If you do happen to find yourself doing this one day:

void Huffman::createKey(std::string filePath, const std::map<unsigned char, std::vector<bool>>& m_keyArg) {

then it can be made easier with:
1
2
using KeyContainer =  std::map<unsigned char, std::vector<bool>>;
void Huffman::createKey(std::string filePath, const KeyContainer& m_keyArg) {


Also, if you do use iterators but not range for, this:

for (std::map<unsigned char, uint64_t>::iterator index = m_table.begin(); index != m_table.end(); index++) {

can be made a lot easier with auto:

for (auto index = m_table.begin(); index != m_table.end(); index++) {

Thanks for the replay once again!
Is it ok if i post the code again once i finish optimizing and ask for your critique ?
Yeah sure :+D but hopefully some answers from others too, I am only a back yard basher!

Well here's the updated code:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
#include<iostream>
#include<fstream>
#include<map>
#include<list>
#include<vector>
#include<string>
#include<time.h>

struct node {
	unsigned char m_data{};
	uint64_t m_frequency{};
	node* m_left;
	node* m_right;

	node() : m_left{ nullptr }, m_right{ nullptr } {};
	node(node* left, node* right) :m_left{ left }, m_right{ right } {
		m_frequency = m_left->m_frequency + m_right->m_frequency;
	};
};

struct functor {
	bool operator()(const node* one, const node* two) const { return one->m_frequency < two->m_frequency; }
};

class Huffman {
private:
	node* m_root{ nullptr };
	std::map<unsigned char, uint64_t> m_table;
	std::list<node*> m_list;
	std::vector<bool> m_code;
	std::map<unsigned char, std::vector<bool> > m_key;
	std::string m_fileName;
	std::string m_fileExten;
	std::string m_fileContent;

	/*compressor related functions*/
	void readFile(const std::string& filePath);
	void setNameAndExten(const std::string& filePath);
	void setTable();
	void storeInList(const std::map<unsigned char, uint64_t>& table);
	node* makeTree();
	void encode(const node* const &root);
	void createBinaryFile(const std::string& filePath);
	void createKey(const std::string& filePath) const;
	/*compressor related functions*/

public:
	void compress(const std::string& filePath, const std::string& locToCreateKey, const std::string& locToCompress);
};

void Huffman::readFile(const std::string& filePath) {
	std::ifstream inFile(filePath, std::ios::binary);
	auto const start_pos = inFile.tellg();
	inFile.ignore(std::numeric_limits<std::streamsize>::max());
	auto const char_count = inFile.gcount();
	inFile.seekg(start_pos);
	m_fileContent = std::string(char_count, unsigned char{});
	inFile.read(&m_fileContent[0], m_fileContent.size());
	inFile.close();
}

void Huffman::setNameAndExten(const std::string& filePath) {
	std::size_t foundName = filePath.find_last_of("/\\");
	std::size_t foundExten = filePath.find_last_of('.');
	m_fileName = filePath.substr(foundName + 1, foundExten - foundName - 1);
	m_fileExten = filePath.substr(foundExten);
}

void Huffman::setTable() {
	for (const auto& data : m_fileContent) {
		m_table[data]++;
	}//end of for loop
}

void Huffman::storeInList(const std::map<unsigned char, uint64_t>& table) {
	for (auto&& index : table) {
		node* leaf = new node;
		leaf->m_data = index.first;
		leaf->m_frequency = index.second;
		m_list.push_back(leaf);
	}//end of for loop
}

node* Huffman::makeTree() {
	while (m_list.size() > 1) {
		m_list.sort(functor());
		node* leftSon = m_list.front();
		m_list.pop_front();
		node* rightSon = m_list.front();
		m_list.pop_front();
		node* parent = new node(leftSon, rightSon);
		m_list.push_back(parent);
	}//end of while loop
	return m_list.front();
}

void Huffman::encode(const node* const &root) {
	if (root->m_left != nullptr) {
		m_code.push_back(0);
		encode(root->m_left);
	}//end of if
	if (root->m_right != nullptr) {
		m_code.push_back(1);
		encode(root->m_right);
	}//end of if 
	if (root->m_data) m_key[root->m_data] = m_code;
	if(!m_code.empty()) m_code.pop_back();
}

void Huffman::createBinaryFile(const std::string& filePath) {
	int offSet{}; unsigned char buffer{};
	std::ofstream outFile(filePath + m_fileName+ "Compressed.bin", std::ios::binary);
	for(const auto& data:m_fileContent ){
		buffer = data;
		m_code = m_key[buffer];
		buffer = 0;
		for (const auto& index : m_code) {
			buffer = buffer | index << (7 - offSet);
			++offSet;
			if (offSet == 8) {
				offSet = 0;
				outFile.put(buffer);
				buffer = 0;
			}//end of if
		}//end of for loop
	}//end of for loop
	outFile.close();
}

void Huffman::createKey(const std::string& filePath) const {
	std::ofstream key(filePath + m_fileName + "Key.txt", std::ios::binary);
	for (auto&& index : m_key) {
		key.put(index.first);
		for (const auto& itr : index.second) {
			if (itr) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
	key << m_fileExten;
	key.close();
}

void Huffman::compress(const std::string& filePath,const std::string& locToCreateKey,const std::string& locToCompress) {
	setNameAndExten(filePath);
	readFile(filePath);
	setTable();
	storeInList(m_table);
	m_root = makeTree();
	encode(m_root);
	createBinaryFile(locToCompress);
	createKey(locToCreateKey);
}

int main() {
	clock_t tStart = clock();
	Huffman huff;
	huff.compress("C:/Users/User/Desktop/gsm.jpg", "C:/Users/User/Desktop/", "C:/Users/User/Desktop/");
	std::cout << (double)(clock() - tStart) / CLOCKS_PER_SEC << "(s)" << '\n';
	return 0;
}


The problem is the time it takes to compress:
5.8MB .jpg picture: 2min to compress
383KB .docx world file: 9 seconds to compress
225KB .txt text file: 4 seconds to compress

About the .jpg file i can understand that its no a good algorithm to compress pictures however the time it takes for the <1MB files isn't normal.
Now the aim is to increase efficiency.
Silly question: Did you compile to release mode?

Trying to compress a jpg file is fairly pointless, I am afraid.

I am trying to work through your new code, at the moment there are quite a few warnings ......
Hi,

The warnings I had were mainly sign conversion ones - to do with the unsigned char type. There are a couple of options:

1. Put static_cast<unsigned char >() everywhere; or
2. change unsigned char to char.

Use of unsigned char sounds like a good idea: signed semantics can be a problem. But with option 2, cppreference says this (near the bottom underneath the range of values table):

cppreference wrote:
1 ↑ As of C++14, char must represent 256 distinct values, bijectively convertible to the values 0..255 of unsigned char, which may require a wider range of values.


http://en.cppreference.com/w/cpp/language/types

I am not sure, does this mean we can use char instead of unsigned char ?

The other things worth doing is change std::list to std::vector, and use unique_ptr instead of new.

With the casting, prefer C++ casting rather than C casts. The latter has a bunch of problems.

std::cout << (double)(clock() - tStart) / CLOCKS_PER_SEC << "(s)" << '\n';

std::cout << static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC << "(s)" << '\n';

I have been compiling with these options:

g++ -std=c++14 -Wall -Wextra -pedantic-errors -Wswitch-default -Wswitch-enum -Wuninitialized -Wfloat-equal -Wconversion -Wsign-conversion -Wfloat-conversion
Well i forgot to put it on release mode lol now it does alot better :
5.8MB .jpg picture: 1.2(s) to compress
383KB .docx world file: 0.08 seconds to compress
225KB .txt text file: 0.039 seconds to compress

recent updates:
-The compression key of .jpg was wrong (written in japanese somewhy) was fixed by changing all of the unsigned char's to regular char (i think the problem was extended ascii with unsigned)
-the only warning i het now is line 57 :
Source.cpp(57): warning C4244: 'argument': conversion from 'const std::streamsize' to 'unsigned int', possible loss of data

i understand what the warning means but i cant seem to find how to fix it (i don't know which variable is of type const::std::streamsize so that i can static_cast the other i will experiment atm)

UPDATE:
changing line 55 to:
auto const char_count = static_cast<unsigned int>(inFile.gcount());
got rid of the warning
Last edited on
i understand what the warning means but i cant seem to find how to fix it (i don't know which variable is of type const::std::streamsize so that i can static_cast the other i will experiment atm)


I had this:

1
2
3
4
5
6
7
8
9
10
void Huffman::readFile(const std::string& filePath) {
	std::ifstream inFile(filePath, std::ios::binary);
	auto const start_pos = inFile.tellg();
	inFile.ignore(std::numeric_limits<std::streamsize>::max());
	std::streamsize  char_count = inFile.gcount();
	inFile.seekg(start_pos);
	m_fileContent = std::string(static_cast<std::size_t>(char_count ),  '0');
	inFile.read(&m_fileContent[0], static_cast<std::streamsize> (m_fileContent.size()) );
	inFile.close();
}


One can work what the types are by the return values of the function calls.

Just wondering what size the jpg file went down to?
The jpg went from 5.8MB to 5.6MB not really worth it.

The other things worth doing is change std::list to std::vector

I will be replacing the list with vector (list just felt more natural in the beginning) .

and use unique_ptr instead of new.

Never used unique_ptr will have to learn this one before implementing.
just a quick question do i release the memory with delete as if it was operator new?

will gladly use:
1
2
3
4
5
6
7
8
9
10
void Huffman::readFile(const std::string& filePath) {
	std::ifstream inFile(filePath, std::ios::binary);
	auto const start_pos = inFile.tellg();
	inFile.ignore(std::numeric_limits<std::streamsize>::max());
	std::streamsize  char_count = inFile.gcount();
	inFile.seekg(start_pos);
	m_fileContent = std::string(static_cast<std::size_t>(char_count ),  '0');
	inFile.read(&m_fileContent[0], static_cast<std::streamsize> (m_fileContent.size()) );
	inFile.close();
}

with your permission ofcourse :)
I was just trying to change things to unique_ptr, but there is a problem: shared ownership between the table and list. Apparently shared_ptr is something to be ideally avoided. Am going to mention a different tack: try to do it with references. Will need std::reference_wrapper because one can't have a container of references.
http://en.cppreference.com/w/cpp/utility/functional/reference_wrapper

So that way we will have no new and no pointers.

Never used unique_ptr will have to learn this one before implementing.
just a quick question do i release the memory with delete as if it was operator new?


No, that is the main idea of smart pointers - to avoid the need for delete.

with your permission ofcourse :)


It was your code, I just changed a couple of things :+)

Did you manage to find the equivalent of -Wconversion and -Wsign-conversion in your compiler? It was those which identified the problem in the first place.

I was going to rename some of the variables as well. I used to pre-pend m_ to member variables, but have stopped doing that awhile back. More importantly I was going to change m_table to freq_table or similar. The reason was I looking at the node struct, and erroneously thought the frequency wasn't being changed. I might have realised if the variable name was changed as well as the function setTable to UpdateFreqTable say.

The storeInList is a little problem now the type is changed - so try to avoid putting types into function names :+)

Anyway, it's so late here it's now very early !! I might not be back for awhile ZZZZZZzzzzzz.........
Does this mean we can use char instead of unsigned char ?

At least on line 57 of @globaltourist's code:
m_fileContent = std::string(char_count, unsigned char{});

You are reading the file into a std::string. std::string is an alias for std::basic_string<char, std::char_traits<char>>. If you wish the code to remain portable, you must use a char there: it won't compile everywhere right now.
@mbozzi

Splendid, I was hoping someone would reply to that :+)

What I really getting at was with this part:

cppreference wrote:
1 ↑ As of C++14, char must represent 256 distinct values, bijectively convertible to the values 0..255 of unsigned char, which may require a wider range of values.


Does that mean the compiler will implicitly convert a signed char (say) to a unsigned char? And that we can use char without having to worry about sign problems? The thing is the default for x86 systems is signed char. I am guessing (hoping ) this problem was the whole point of making that change in C++14.

Regards :+)
code update:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
]
#include<iostream>
#include<fstream>
#include<map>
#include<algorithm>
#include<vector>
#include<string>
#include<time.h>

struct node {
	char     m_data{};
	uint64_t m_frequency{};
	node*    m_left;
	node*    m_right;

	node() : m_left{ nullptr }, m_right{ nullptr } {};
	node(node* left, node* right) :m_left{ left }, m_right{ right } {
		m_frequency = m_left->m_frequency + m_right->m_frequency;
	};
};

struct functor {
	bool operator()(const node* one, const node* two) const { return one->m_frequency > two->m_frequency; }
};

class Huffman {
private:
	std::map<char, uint64_t>           m_freq_table;
	std::map<char, std::vector<bool> > m_key;
	std::vector<node*>                 m_node_vec;
	std::vector<bool>                  m_code;
	std::string                        m_fileName;
	std::string                        m_fileExten;
	std::string                        m_fileContent;
	node*                              m_root{ nullptr };

	/*compressor related functions*/
	void storeFreqTable(const std::map<char, uint64_t>& table);
	void createBinaryFile(const std::string& filePath);
	void createKey(const std::string& filePath) const;
	void setNameAndExten(const std::string& filePath);
	void readFile(const std::string& filePath);
	void encode(const node* const &root);
	void UpdateFreqTable();
	node* makeTree();
	/*compressor related functions*/

public:
	void compress(const std::string& filePath, const std::string& locToCreateKey, const std::string& locToCompress);
};

void Huffman::storeFreqTable(const std::map<char, uint64_t>& table) {
	m_node_vec.reserve(m_freq_table.size());
	for (auto&& index : table) {
		node* leaf = new node;
		leaf->m_data = index.first;
		leaf->m_frequency = index.second;
		m_node_vec.emplace_back(leaf);
	}//end of for loop
}

void Huffman::createBinaryFile(const std::string& filePath) {
	int offSet{}; char buffer{};
	std::ofstream outFile(filePath + m_fileName + "Compressed.bin", std::ios::binary);
	for (const auto& data : m_fileContent) {
		buffer = data;
		m_code = m_key[buffer];
		buffer = 0;
		for (const auto& index : m_code) {
			buffer = buffer | index << (7 - offSet);
			++offSet;
			if (offSet == 8) {
				offSet = 0;
				outFile.put(buffer);
				buffer = 0;
			}//end of if
		}//end of for loop
	}//end of for loop
	outFile.close();
}

void Huffman::createKey(const std::string& filePath) const {
	std::ofstream key(filePath + m_fileName + "Key.txt", std::ios::binary);
	for (auto&& index : m_key) {
		key.put(index.first);
		for (const auto& itr : index.second) {
			if (itr) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
	key << m_fileExten;
	key.close();
}

void Huffman::setNameAndExten(const std::string& filePath) {
	std::size_t foundName = filePath.find_last_of("/\\");
	std::size_t foundExten = filePath.find_last_of('.');
	m_fileName = filePath.substr(foundName + 1, foundExten - foundName - 1);
	m_fileExten = filePath.substr(foundExten);
}

void Huffman::readFile(const std::string& filePath) {
	std::ifstream inFile(filePath, std::ios::binary);
	auto const start_pos = inFile.tellg();
	inFile.ignore(std::numeric_limits<std::streamsize>::max());
	std::streamsize  char_count = inFile.gcount();
	inFile.seekg(start_pos);
	m_fileContent = std::string(static_cast<std::size_t>(char_count), '0');
	inFile.read(&m_fileContent[0], static_cast<std::streamsize> (m_fileContent.size()));
	inFile.close();
}

void Huffman::encode(const node* const &root) {
	if (root->m_left != nullptr) {
		m_code.push_back(0);
		encode(root->m_left);
	}//end of if
	if (root->m_right != nullptr) {
		m_code.push_back(1);
		encode(root->m_right);
	}//end of if 
	if (root->m_data) m_key[root->m_data] = m_code;
	if (!m_code.empty()) m_code.pop_back();
}

void Huffman::UpdateFreqTable() {
	for (const auto& data : m_fileContent) {
		m_freq_table[data]++;
	}//end of for loop
}

node* Huffman::makeTree() {
	while (m_node_vec.size() > 1) {
		std::sort(m_node_vec.begin(), m_node_vec.end(), functor());
		node* leftSon = m_node_vec.back();
		m_node_vec.pop_back();
		node* rightSon = m_node_vec.back();
		m_node_vec.pop_back();
		node* parent = new node(leftSon, rightSon);
		m_node_vec.emplace_back(parent);
	}//end of while loop
	m_node_vec.shrink_to_fit();
	return m_node_vec.front();
}

void Huffman::compress(const std::string& filePath,const std::string& locToCreateKey,const std::string& locToCompress) {
	setNameAndExten(filePath);
	readFile(filePath);
	UpdateFreqTable();
	storeFreqTable(m_freq_table);
	m_root = makeTree();
	encode(m_root);
	createBinaryFile(locToCompress);
	createKey(locToCreateKey);
}

int main() {
	clock_t tStart = clock();
	Huffman huff;
	huff.compress("C:/Users/User/Desktop/big.txt", "C:/Users/User/Desktop/", "C:/Users/User/Desktop/");
	std::cout << static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC << "(s)" << '\n';
	return 0;
}


I was just trying to change things to unique_ptr, but there is a problem: shared ownership between the table and list. Apparently shared_ptr is something to be ideally avoided. Am going to mention a different tack: try to do it with references. Will need std::reference_wrapper because one can't have a container of references.

I will have to release the allocated memory at the second the compress function finishes which doesn't mean that the object will get out of scope until than so i think the options are:
a) call the destructor to release the memory explicitly (will have to make one now :) ).
b) make a function that uses m_root to release the memory the same way you would make a destructor for a binary tree.

I was going to rename some of the variables as well. I used to pre-pend m_ to member variables, but have stopped doing that awhile back. More importantly I was going to change m_table to freq_table or similar. The reason was I looking at the node struct, and erroneously thought the frequency wasn't being changed. I might have realised if the variable name was changed as well as the function setTable to UpdateFreqTable say.

Not ready to get over the m_ stage yet :)

The storeInList is a little problem now the type is changed - so try to avoid putting types into function names :+)

Changed (and reorganized the code so it looks a little cleaner).


Did you manage to find the equivalent of -Wconversion and -Wsign-conversion in your compiler? It was those which identified the problem in the first place.

Didn't find those in visual studio 2015 although the compiler does tell you if you do type conversion
I will have to release the allocated memory at the second the compress function finishes which doesn't mean that the object will get out of scope until than so i think the options are:
a) call the destructor to release the memory explicitly (will have to make one now :) ).
b) make a function that uses m_root to release the memory the same way you would make a destructor for a binary tree.


If one can get rid of the new and use the STL containers in a pure fashion, then there shouldn't be a need to worry about manually doing memory management at all. That's why I mentioned using a

std::vector<std::reference_wrapper<node>> node_data

Note: keep types out variable names as well.

With line 70, I minimally changed it to this:

buffer |= index << (7 - offSet);

But there is still a warning about sign conversion because (7 - offSet) could be negative. Although you prevent that a few lines later, one could put that in an if statement, just to stop the warning:

1
2
3
4
if (offSet < 8) {
   buffer |=   index << (7 - offSet);
   ++offset;
}


Didn't find those in visual studio 2015 although the compiler does tell you if you do type conversion


Can you enable C4018 ? Not sure if that is the right one, but it will be in there somewhere :+)
Last edited on
So I changed your code a bit. I was misleading saying to use std::vector. Another efficient container is std::deque and it has functions such pop_front() Sorry about that, it is a result of me being an amateur coder.

Here is the code I have, it compiles but I haven't tested it:

Huffman.hpp

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
#ifndef HUFFMAN_H
#define HUFFMAN_H

#include<fstream>
#include<map>
#include<list>
#include<vector>
#include<string>
#include <memory>
#include <functional>
#include <deque>


struct node {
	 char m_data{};
	uint64_t m_frequency{};
	node* m_left;
	node* m_right;

	node() : m_left{ nullptr }, m_right{ nullptr } {}
	node(node* left, node* right) :m_left{ left }, m_right{ right } {
		m_frequency = m_left->m_frequency + m_right->m_frequency;
	}
};

struct functor {
	bool operator()(const node one, const node two) const { return one.m_frequency < two.m_frequency; }
};

class Huffman {
private:
	node m_root;
	std::map< char, uint64_t> m_table;
	//std::list<node*> NodeData;
	std::deque<std::reference_wrapper<node>> NodeData;
	std::vector<bool> m_code;
	std::map< char, std::vector<bool> > m_key;
	std::string m_fileName;
	std::string m_fileExten;
	std::string m_fileContent;

	/*compressor related functions*/
	void readFile(const std::string& filePath);
	void setNameAndExten(const std::string& filePath);
	void setTable();
	void storeInList(const std::map<char, uint64_t> &table);
	node& makeTree();
	void encode(const node& root);
	void createBinaryFile(const std::string& filePath);
	void createKey(const std::string& filePath) const;
	/*compressor related functions*/

public:
	void compress(const std::string& filePath, const std::string& locToCreateKey, const std::string& locToCompress);
};


#endif // HUFFMAN_H 


Huffman.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
#include "Huffman.hpp"

#include <limits>
#include <algorithm>


void Huffman::readFile(const std::string& filePath) {
	std::ifstream inFile(filePath, std::ios::binary);
	auto const start_pos = inFile.tellg();
	inFile.ignore(std::numeric_limits<std::streamsize>::max());
	std::streamsize  char_count = inFile.gcount();
	inFile.seekg(start_pos);
	m_fileContent = std::string(static_cast<std::size_t>(char_count ),  '0');
	inFile.read(&m_fileContent[0], static_cast<std::streamsize> (m_fileContent.size()) );
	inFile.close();
}

void Huffman::setNameAndExten(const std::string& filePath) {
	std::size_t foundName = filePath.find_last_of("/\\");
	std::size_t foundExten = filePath.find_last_of('.');
	m_fileName = filePath.substr(foundName + 1, foundExten - foundName - 1);
	m_fileExten = filePath.substr(foundExten);
}

void Huffman::setTable() {
	for (const auto& data : m_fileContent) {
		m_table[data]++;
	}//end of for loop
}

void Huffman::storeInList(const std::map< char, uint64_t>& table) {
	for (auto&& index : table) {
		node leaf;
		leaf.m_data = index.first;
		leaf.m_frequency = index.second;
		NodeData.push_back(leaf);
	}
}

node& Huffman::makeTree() {
	while (NodeData.size() > 1) {

		std::sort(NodeData.begin(), NodeData.end(), functor());

		auto leftSon = NodeData.front();
		NodeData.pop_front();

		auto rightSon = NodeData.front();
		NodeData.pop_front();

		node parent;
		parent.m_left = &leftSon.get();
		parent.m_right = &rightSon.get();
		NodeData.push_back(parent);


	}
	return NodeData.front();
}

void Huffman::encode(const node& root) {
	if (root.m_left != nullptr) {
		m_code.push_back(0);
		encode(*root.m_left);
	}
	if (root.m_right != nullptr) {
		m_code.push_back(1);
		encode(*root.m_right);
	}
	if (root.m_data) m_key[root.m_data] = m_code;
	if(!m_code.empty()) m_code.pop_back();
}

void Huffman::createBinaryFile(const std::string& filePath) {
	 int offSet{};
	 int buffer{};
	std::ofstream outFile(filePath + m_fileName+ "Compressed.bin", std::ios::binary);

	for(const auto& data : m_fileContent ){
		buffer = data;
		m_code = m_key[static_cast<char>(buffer)];
		buffer = 0;
		for (const auto& index : m_code) {
			buffer |=   index << (7 - offSet);
			++offSet;
			if (offSet == 8) {
				offSet = 0;
				outFile.put(static_cast<char>(buffer));
				buffer = 0;
			}//end of if
		}//end of for loop
	}//end of for loop
	outFile.close();
}

void Huffman::createKey(const std::string& filePath) const {
	std::ofstream key(filePath + m_fileName + "Key.txt", std::ios::binary);
	for (auto&& index : m_key) {
		key.put(index.first);
		//for (const auto& itr : index.second) {
		for ( auto&& itr : index.second) {
			if (itr) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
	key << m_fileExten;
	key.close();
}

void Huffman::compress(const std::string& filePath,
		       const std::string& locToCreateKey,
		       const std::string& locToCompress) {
	setNameAndExten(filePath);
	readFile(filePath);
	setTable();
	storeInList(m_table);
	m_root = makeTree();
	encode(m_root);
	createBinaryFile(locToCompress);
	createKey(locToCreateKey);
}


main.cpp
1
2
3
4
5
6
7
8
9
10
11
12
13
#include <iostream>

#include "Huffman.hpp"


int main() {
	clock_t tStart = clock();
	Huffman huff;
	huff.compress("C:/Users/User/Desktop/gsm.jpg","C:/Users/User/Desktop/","C:/Users/User/Desktop/");

	std::cout << static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC << "(s)" << '\n';
	return 0;
}


Can you enable C4018 ? Not sure if that is the right one, but it will be in there somewhere :+)

Is enabled by default in visual studio 2015

But there is still a warning about sign conversion because (7 - offSet) could be negative.

does the warning really matter if i handle it afterwards ?

In the code you posted:
1) Thanks for putting in your time to help me out ! I will definitely give you credit when i will post the whole code with the decompressor.
2) i have some things that confuse me:

line 76 in Huffman.cpp you did this int buffer{}; and than you used static_cast<char>() every time you use it. So why not just make it char buffer{}; from the beginning ?

lines 52-53 in Huffman.cpp
1
2
parent.m_left = &leftSon.get();
parent.m_right = &rightSon.get();

where did the get() method come from ?

3) things that i don't know at all :
how this :std::deque<std::reference_wrapper<node>> NodeData; works.
the only thing i understood is that a deque is a doubly linked que that is most efficient to insert elements and delete them on request.
std::reference_wrapper<node> never worked with this one before still need to learn it
what are these includes for ?
1
2
3
#include <limits>
#include <memory>
#include <functional> 

The way you insert new elements into the deque aren't they still raw pointers ?
Pages: 12