Huffman compressor

Pages: 12
Line 76 in Huffman.cpp you did this int buffer{}; and than you used static_cast<char>() every time you use it. So why not just make it char buffer{}; from the beginning ?


Ok, good thinking :+) See next post.

where did the get() method come from ?


It's a member function of std::reference_wrapper. I needed that to make it work.

how this :std::deque<std::reference_wrapper<node>> NodeData; works.


A container can't hold references. It would be undefined as to what would happen if something happen to that reference, so it isn't allowed. So we have reference_wrapper which basically turns it into pointer to reference, so we can have a container of those. So now we can have reference semantics.

the only thing i understood is that a deque is a doubly linked que that is most efficient to insert elements and delete them on request.


A std::deque is a double ended queue, not a double linked. It is efficient for inserting / removing at the front and back. But there are tradeoffs:
http://en.cppreference.com/w/cpp/container/deque

what are these includes for ?
1
2
3
#include <limits>
#include <memory>
#include <functional>  


Could get rid of <memory> I had that while experimenting with unique_ptr <functional> is for std::reference_wrapper
Not sure where the limits came from :+)

The way you insert new elements into the deque aren't they still raw pointers ?


I push_back a parent, which is a node and it becomes a reference, as far as I understand it.

Oh hell , maybe I am discombobulated here, should there just be a std::deque<node> and just use emplace_back ? I will try that :+)
Last edited on
line 76 in Huffman.cpp you did this int buffer{}; and than you used static_cast<char>() every time you use it. So why not just make it char buffer{}; from the beginning ?


This gives this warning:

./Huffman/Huffman.cpp:85:36: warning: conversion to 'char' from 'int' may alter its value [-Wconversion]
buffer |= index << (7 - offSet);
So I tried std::deque<node> , and it is compiling:

1
2
3
4
5
6
7
class Huffman {
private:
	node m_root;
	std::map< char, uint64_t> m_freq_table;
	
	std::deque<node> NodeData;
// ...... 


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
node& Huffman::makeTree() {
	while (NodeData.size() > 1) {

		std::sort(NodeData.begin(), NodeData.end(), functor());

		auto leftSon = NodeData.front();
		NodeData.pop_front();

		auto rightSon = NodeData.front();
		NodeData.pop_front();

		NodeData.emplace_back(&leftSon, &rightSon);

	}
	return NodeData.front();
}



Also added a another constructor for node, so we can use emplace_back in the storeFreqTable function, this avoiding push_back which copies.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
struct node {
	 char m_data{};
	std::size_t m_frequency{};
	node* m_left;
	node* m_right;

	node() : m_left{ nullptr }, m_right{ nullptr } {}
	node(node* left, node* right) :m_left{ left }, m_right{ right } {
		m_frequency = m_left->m_frequency + m_right->m_frequency;
	}
	node (char DataArg,
	      std::size_t FrequencyArg,
	      node* LeftChildArg,
	      node* RightChildArg
	      )
	  : m_data(DataArg),
	    m_frequency( FrequencyArg),
	    m_left(LeftChildArg),
	    m_right(RightChildArg)
	{}
};


1
2
3
4
5
6
7
8
9
void Huffman::storeFreqTable(const std::map< char, uint64_t>& table) {
	for (auto&& index : table) {
//		node leaf;
//		leaf.m_data = index.first;
//		leaf.m_frequency = index.second;
//		NodeData.push_back(leaf);
	    NodeData.emplace_back(index.first, index.second, nullptr, nullptr);
	}
}
Well you cleared a lot for me but i still have some annoying questions :(
-just was wondering how portable would the reference_wrapper be ?

This gives this warning:

./Huffman/Huffman.cpp:85:36: warning: conversion to 'char' from 'int' may alter its value [-Wconversion]
buffer |= index << (7 - offSet);

-Well as far as i can understand it warns me that it can potentially try to modify a 10th bit for example if offset = -3 and it would force a type conversion but as i take care of it with the latter if statement i don't see it happening (as well as a situation like this: 7 - (-3) wouldn't happen as offset will never go negative).
-Other than that i try to avoid the generic int because it can cause a portability issue between architectures.

-Another thought i had is should i make the m_code variable a deque<bool> ? as it only does insertion and deletion and should be more efficient in theory than a vector.

-Should i change all of the uint64_t declarations to std::size_t ? (note that the concern is portability to other architectures)

-In my code i used shrink_to_fit() before returning the front element from the maketree function because it freed the memory used by the vector, shouldn't i do it with a deque aswell before returning the front element?
Hi,

-Well as far as i can understand it warns me that it can potentially try to modify a 10th bit for example if offset = -3 and it would force a type conversion but as i take care of it with the latter if statement i don't see it happening (as well as a situation like this: 7 - (-3) wouldn't happen as offset will never go negative).


Well that is fine, but the way I had it there were no warnings. For me that is a milestone moment: I am not finished until there are no warnings / errors. Some people compile with -Werror , which turns all warnings into errors.

-Other than that i try to avoid the generic int because it can cause a portability issue between architectures.


Does it? I don't know, maybe someone else can comment on how to do these things in an independent way. I would have thought that a program that works on one system would work the same on another system. By that I mean normal C++ expressions, not for anything written specifically for a particular OS.

It's worthwhile to compile your code on different systems and with different compilers. You could look at getting some type of Linux system: maybe Cygwin which will run on Windows; or if you have a spare disk partition / HD, try a full Linux system. One can put the whole thing on a USB stick in order to try it without installing on the HD. If you like it you can install from there. I have Fedora25, Ubuntu is popular, there are many others. If you really want to be bold, have a go at building the upstream versions of gcc (7.0) and clang++ (5.0), then you can have the full implementation c++17 :+)

-Another thought i had is should i make the m_code variable a deque<bool> ? as it only does insertion and deletion and should be more efficient in theory than a vector.


So there is only push_back and pop_back, so I doubt if it will make any difference. Note there are costs with having a deque. If there was push_front and pop_front , then it would be worth it. On the other hand, std::vector<bool> is a bit of infamous container, one of the alternatives is std::deque<bool>. But all you are doing is pushing and popping the end, that is, not anything tricky like iterating it for example.

-Should i change all of the uint64_t declarations to std::size_t ? (note that the concern is portability to other architectures)


Well std::size_t is usually the largest unsigned integer type the system has, so for 64 bit it is the same as uint64_t This is still the case on 32 bit, unsigned long long is still 64 bit. So I guess it's a moot point.
http://en.cppreference.com/w/cpp/language/types

Are you really looking at running this on 32 bit, it's very nearly dead isn't it?

-In my code i used shrink_to_fit() before returning the front element from the maketree function because it freed the memory used by the vector, shouldn't i do it with a deque aswell before returning the front element?


Yes that is fine, AFAIK.

I have another idea: On the wiki page it talks about using a priority queue, the STL has std::priority_queue. It probably does the same thing as your code, but anything in the STL is likely to be better in some way.
http://en.cppreference.com/w/cpp/container/priority_queue

Regards :+)

Hello again!
Iv'e had a short break because exams....
Going back to coding now.

Does it? I don't know, maybe someone else can comment on how to do these things in an independent way. I would have thought that a program that works on one system would work the same on another system. By that I mean normal C++ expressions, not for anything written specifically for a particular OS.

An int is an int on any os but the difference is the processor the program is ran on and it will impact how much bytes are given to the type for example:
on one pc int can be 4bytes
on another 2bytes
and on some microprocessors can be 3 bytes
that's what i meant by portable to other architectures (char is 1 byte on any architecture).

Other than that i will be working on the small patches atm.
-Should i change all of the uint64_t declarations to std::size_t ? (note that the concern is portability to other architectures)

Do you realize that most of the fixed sized types are optional?
http://en.cppreference.com/w/cpp/types/integer

Do you realize that size_t is an implementation defined unsigned type. On many systems it is the same an unsigned int, however it could be an unsigned long, or an unsigned long long. If you're really concerned about portability then you should use a size_t since this size is guaranteed to be large enough to hold the size of any object the system can accommodate. If you use an uint64_t instead of a size_t you may risk allocation failures if the system's size_t is a 32 bit type.


Well i tried running this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
#include<iostream>
#include<fstream>
#include<map>
#include<algorithm>
#include<vector>
#include<string>
#include<deque>
#include<time.h>

struct node {
	char        m_data{};
	std::size_t m_frequency{};
	node*       m_left;
	node*       m_right;

	node() : m_left{ nullptr }, m_right{ nullptr } {}

	node(node* left, node* right) :m_left{ left }, m_right{ right } {
		m_frequency = m_left->m_frequency + m_right->m_frequency;
	}

	node(char dataArg, std::size_t frequencyArg, node* leftChild = nullptr, node* rightChild = nullptr) :
		m_data{ dataArg }, m_frequency{ frequencyArg }, m_left{ leftChild }, m_right{ rightChild } {}
};

struct functor {
	bool operator()(const node& one, const node& two) const { return one.m_frequency > two.m_frequency; }
};

class Huffman {
private:
	std::map<char, std::size_t>        m_freq_table;
	std::map<char, std::vector<bool> > m_key;
	std::deque<node>				   m_nodeData;
	std::vector<bool>                  m_code;
	std::string                        m_fileName;
	std::string                        m_fileExten;
	std::string                        m_fileContent;
	node                              m_root;

	/*compressor related functions*/
	void storeFreqTable(const std::map<char, std::size_t>& table);
	void createBinaryFile(const std::string& filePath);
	void createKey(const std::string& filePath) const;
	void setNameAndExten(const std::string& filePath);
	void readFile(const std::string& filePath);
	void encode(const node* const &root);
	void UpdateFreqTable();
	node& makeTree();
	/*compressor related functions*/

public:
	void compress(const std::string& filePath, const std::string& locToCreateKey, const std::string& locToCompress);
};

void Huffman::storeFreqTable(const std::map<char, std::size_t>& table) {
	for (auto&& index : table) {
		m_nodeData.emplace_back(index.first, index.second);
	}//end of for loop
}

void Huffman::createBinaryFile(const std::string& filePath) {
	char offSet{}; char buffer{};
	std::ofstream outFile(filePath + m_fileName + "Compressed.bin", std::ios::binary);
	for (const auto& data : m_fileContent) {
		buffer = data;
		m_code = m_key[buffer];
		buffer = 0;
		for (const auto& index : m_code) {
			buffer |= index << (7 - offSet);
			++offSet;
			if (offSet == 8) {
				offSet = 0;
				outFile.put(buffer);
				buffer = 0;
			}//end of if
		}//end of for loop
	}//end of for loop
	outFile.close();
}

void Huffman::createKey(const std::string& filePath) const {
	std::ofstream key(filePath + m_fileName + "Key.txt", std::ios::binary);
	for (auto&& index : m_key) {
		key.put(index.first);
		for (const auto& itr : index.second) {
			if (itr) key.put('1');
			else key.put('0');
		}//end of for loop
	}//end of for loop
	key << m_fileExten;
	key.close();
}

void Huffman::setNameAndExten(const std::string& filePath) {
	std::size_t foundName = filePath.find_last_of("/\\");
	std::size_t foundExten = filePath.find_last_of('.');
	m_fileName = filePath.substr(foundName + 1, foundExten - foundName - 1);
	m_fileExten = filePath.substr(foundExten);
}

void Huffman::readFile(const std::string& filePath) {
	std::ifstream inFile(filePath, std::ios::binary);
	auto const start_pos = inFile.tellg();
	inFile.ignore(std::numeric_limits<std::streamsize>::max());
	std::streamsize  char_count = inFile.gcount();
	inFile.seekg(start_pos);
	m_fileContent = std::string(static_cast<std::size_t>(char_count), '0');
	inFile.read(&m_fileContent[0], static_cast<std::streamsize> (m_fileContent.size()));
	inFile.close();
}

void Huffman::encode(const node* const &root) {
	if (root->m_left != nullptr) {
		m_code.push_back(0);
		encode(root->m_left);							 
	}//end of if
	if (root->m_right != nullptr) {
		m_code.push_back(1);
		encode(root->m_right);
	}//end of if 
	if (root->m_data) m_key[root->m_data] = m_code;
	if (!m_code.empty()) m_code.pop_back();
}

void Huffman::UpdateFreqTable() {
	for (const auto& data : m_fileContent) {
		m_freq_table[data]++;
	}//end of for loop
}

node& Huffman::makeTree() {
	while (m_nodeData.size() > 1) {
		std::sort(m_nodeData.begin(), m_nodeData.end(), functor());
		auto leftSon = m_nodeData.front();
		m_nodeData.pop_front();
		auto rightSon = m_nodeData.front();
		m_nodeData.pop_front();
		m_nodeData.emplace_back(&leftSon, &rightSon);
	}//end of while loop
	m_nodeData.shrink_to_fit();
	return m_nodeData.front();
}

void Huffman::compress(const std::string& filePath, const std::string& locToCreateKey, const std::string& locToCompress) {
	setNameAndExten(filePath);
	readFile(filePath);
	UpdateFreqTable();
	storeFreqTable(m_freq_table);
	m_root = makeTree();
	encode(&m_root);
	createBinaryFile(locToCompress);
	createKey(locToCreateKey);
}

int main() {
	clock_t tStart = clock();
	Huffman huff;
	huff.compress("C:/Users/User/Desktop/test.txt", "C:/Users/User/Desktop/", "C:/Users/User/Desktop/");
	std::cout << static_cast<double>(clock() - tStart) / CLOCKS_PER_SEC << "(s)" << '\n';
	return 0;
}


It compiles but crashes as soon as i run it.

**HAVE BEEN EDITED**
Last edited on
I've just realized i big mistake i made with the key generation:
lets say if the characters compressed are a, 1 and 0 and their code is:
a 00
1 01
0 11

the key would be:
a00101011.txt

and here the problem happens we just lost the characters '1' and '0' code.

this is the format i'l be moving to:
a 00 1 01 0 11.txt
this way we can clearly separate the characters from their respective codes. I am open to any other ideas.
It compiles but crashes as soon as i run it.


Time to break out the debugger :+) I hope it hasn't come from something I have done.

I had in the compress function:

m_root = makeTree();

because makeTree returns a reference, but you have:

*m_root = makeTree();

That didn't compile for me, (although it did on cpp.sh, not sure why):

../Huffman/Huffman.cpp: In member function 'void Huffman::compress(const string&, const string&, const string&)':
../Huffman/Huffman.cpp:113:2: error: no match for 'operator*' (operand type is 'node')


Another thing: #include <ctime> , not ctime.h

I will have a go at executing the code, and any debugging that may be required :+)
Well for me m_root = makeTree(); wont even compile it tells that there is no suitable conversion from node to node* even though we return a reference...

Another thing: #include <ctime> , not ctime.h
didn't make any problems for me atleast.

Btw i am working on a copy of the code that uses dynamic allocations for nodes and all seems to work (finally made a destructor).
Well for me m_root = makeTree(); wont even compile


I had :

1
2
3
class Huffman {
private:
	node m_root;


Where you have a pointer there :+) May not be a problem either way.

didn't make any problems for me atleast.


It's just that <ctime> is the C++ header, not the C one.

Btw i am working on a copy of the code that uses dynamic allocations for nodes and all seems to work (finally made a destructor).


Sounds good, I was distracted today by a C++17 issue.


Edit:

What do you mean by dynamic allocations?
Last edited on
If you set up version control using git say, you could create separate branches for your own, and my code. One can get it track different versions of both of them, and show differences between them.
What do you mean by dynamic allocations?


I mean using new and delete

This is the implementation for the deletion:
1
2
3
4
5
6
7
void Huffman::deleteTree(node* root) {
	if (root != nullptr) {
		deleteTree(root->m_left);
		deleteTree(root->m_right);
		delete root;
	}//end of if
}


If you set up version control using git say, you could create separate branches for your own, and my code. One can get it track different versions of both of them, and show differences between them.

That's a good idea.

Where you have a pointer there :+) May not be a problem either way.

I have updated the code above to use node still doesn't work.

NOTE: to prevent the warning in the buffer |= index << (7 - offSet);
i have made both offSet and buffer char doesn't give any warnings for me now.

Btw i decided to keep the vector<bool> inside of the encode function because the performance there would be similar to a deque with a big differance in the memory allocated, in other words a deque takes more memory per object than a vector.

I also implemented the new structure of the keys generated and now i want to make them as .bin files because they would take less memory for example:
lets say a has the code 11101, this code would take 8*5 bytes of memory whereas in a .bin file we can just fill in 11101 and it would be 5bit.
Iv'e decided to make that decision because the bigger the file is the key would most likely increase in size aswell dramatically and i would like it to occupy as little as possible
Last edited on
I mean using new and delete


Ah I thought so, but that is something to be avoided. One of the big misconceptions is the "need" to use dynamic (heap) memory. The thing is that the STL containers already put their data on the heap. One of the motivations for me changing your code, was to get rid of new and delete. The main problem with delete is that something throws an exception, even if it is handled, the code never makes it to the delete, so we have a memory leak.

I have updated the code above to use node still doesn't work.


It probably doesn't matter, but my code compiles without pointers there. Actually, another goal might be to get rid of pointers in the node (left and right) or at least make them smart pointers. Edit: They may be fine, they are owning pointers.

i have made both offSet and buffer char doesn't give any warnings for me now.


It still does for me -Wconversion, I have them both with int, no warnings. However, I tried to have char seven = 7; and buffer |= index << (seven - offSet); along with buffer and index being char. I thought this might be an improvement because we have removed the int that comes from the literal 7 we had before. But I still get a -Wconversion warning there. Anyway, the way I have it there are no warnings :+) The other thing to remember is that we are using different compilers, the respective set set of warnings we have enabled may not be the same. If one compiler A gives a warning and the other B doesn't - that means the warning is somehow not enabled or has different behaviour on compiler B.

@others I know this all sounds pedantic, but I for one would like to know why that change above still gives a warning when all the variables are char? Is it because of the subtraction?

Btw i decided to keep the vector<bool> inside of the encode function because the performance there would be similar to a deque with a big differance in the memory allocated, in other words a deque takes more memory per object than a vector.


That's fine to keep it, but I am not so sure about the space used by vector<bool> I have a vague idea about vector<bool> storing each bool as a single byte, this was to get around the problems of vector<bool> not satisfying all the requirements for a container. I might be wrong, maybe someone else can comment on that? Another alternative is boost dynamic bitset:
http://www.boost.org/doc/libs/1_64_0/libs/dynamic_bitset/dynamic_bitset.html

STL has std::bitset, but the size has to be known at compile time.

I also implemented the new structure of the keys generated and now i want to make them as .bin files because they would take less memory for example:


Sounds good :+)
Last edited on
TheIdeasMan wrote:
@others I know this all sounds pedantic, but I for one would like to know why that change above still gives a warning when all the variables are char? Is it because of the subtraction?


I found out why there are warnings when all the variables are char:

http://en.cppreference.com/w/cpp/language/implicit_conversion#Integral_promotion

cppreference wrote:
Numeric promotions
Integral promotion

prvalues of small integral types (such as char) may be converted to prvalues of larger integral types (such as int). In particular, arithmetic operators do not accept types smaller than int as arguments, and integral promotions are automatically applied after lvalue-to-rvalue conversion, if applicable. This conversion always preserves the value.


So this happens because of the bitwise << operator, and |=, and originally the literal 7. It's also why I had no warnings when I made the types int. But it also meant casting back to char where necessary - so no warnings Happy Days :+) ++ThingsLearnt;
It still does for me -Wconversion

Visual studio is pretty known to change the original code when set to release mode to make it more efficient its why i'm sure it does the casts itself because if i by accident convert int to char it will throw a warning right away.

boost dynamic bitset

I will start exploring the way its used and performance vs vector<bool>

It probably doesn't matter, but my code compiles without pointers there.

It compiles for me aswell but it doesn't execute the program it just crashes.
Hi,

I spent quite a bit of time trying to find out why I had a crash in my code. Unfortunately I am none the wiser :+( Not sure why, I seem to have the same logic as you have.

I have looked at the createBinaryFile, and am confused about why it does what it does - it looks more like encryption. Wouldn't one just write the m_key.second (The vector<bool>) to the file? Why apply the buffer = buffer | index << (7 - offSet); ? Even if it was supposed to be encryption, that operation is not reversible given the implicit conversion back to char type.

Hi,

I think i understood why your code crashed, if you look at makeTree:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
node& Huffman::makeTree() {
	while (NodeData.size() > 1) {

		std::sort(NodeData.begin(), NodeData.end(), functor());

		auto leftSon = NodeData.front();
		NodeData.pop_front();

		auto rightSon = NodeData.front();
		NodeData.pop_front();

		NodeData.emplace_back(&leftSon, &rightSon);

	}
	return NodeData.front();
}


leftSon and rightSon are temporary objects which you store by reference and when the function ends they get out of scope and get deleted, however because you passed them by reference their address is valid but the content isn't (this option is available in c++ to prevent dangling pointers). Next on we try to encode that invalid data:
1
2
m_root = makeTree();
encode(&m_root);

The problem here is that the encode function tries to access invalid content through valid addresses and this leads to crashes.

I have looked at the createBinaryFile, and am confused about why it does what it does - it looks more like encryption. Wouldn't one just write the m_key.second (The vector<bool>) to the file? Why apply the buffer = buffer | index << (7 - offSet); ? Even if it was supposed to be encryption, that operation is not reversible given the implicit conversion back to char type.

Firstly, the point behind this is because vector<bool> stores single bits and you cant just put a single bit in because its not even a char for example what would be the ascii of 1 if there is no 0's to buffer it to a char like so: 0000 0001 and even if we do buffer it we just added useless data.
Secondly, there is no operator << overload for vector<bool> so we cant just write it in.
Thirdly, we cant expect the code to be not longer than a char in size lets say the code in the vector is 0101 1111 11 we cant write it in because there is no type of this size its 10bits so
buffer = buffer | index << (7 - offSet); takes care of this for us.
Well thanks for helping me out this long i have finished the compressor and im happy with the performance at the moment i will be moving on to making the decompressor in a seperate post.
Topic archived. No new replies allowed.
Pages: 12