Binary Compression

I'm currently taking a beginners C++ class and this program has been throwing me for a loop. I've completed a section of code but am clueless on what to do next. I haven't had the best of teachers and have pretty much used the book we are using to get this far. Any help is appreciated. I will post the code I've completed so far on the next post. Below is the program requirements and purpose.

PURPOSE:

A ebook text files are to be compressed into a small, portable file format with the following characteristics.

1) The file should be compressed in such a way that it need not be entirely decompressed in order to be used. Unlike a ZIP or other general-purpose compression algorithm, this one will compress the data without inter-dependency of the bytes.

2) The compressed file should be searchable and displayable as-is, without a decompression phase. This allows very large files to be displayed and searched without storing the entire file in memory.

APPROACH:

The compression technique to be used will be based on a word-to-byte look-up table consisting of a one-byte encoding for the (approximately) 255 most common words in the text, with the 256th byte value representing an ‘escape’ character indicating that the following characters, up to the next delimiter, are not encoded. Following is a tiny example, where the notation <153> represents a single byte containing the value 153 decimal.

Given some text file, an encoding table of the most commonly occurring words might be constructed like:
<256> the
<255> a
<254> an
... of
<240>
... this
<200>
<199> is
... interesting
<169>
... This means ‘following text is uncompressed’
<0>
etc.

A sentence fragment such as the following:

...the interesting part of a problem like this is the use of an
algorithmic...


could be encoded as (spaces included below only for readability):

<256> <169> <0> part <240> <255> <0> problem <0> like <200> <199> <256> <0> use <240> <254> <0> algorithmic

The uncompressed text is about 72 characters (bytes), whereas the compressed version would require 43 bytes.

_____________________________________________________________________________

Other requirements:
1. Execution- Console program, accepts file name ending in TXT or
CMP; with no parameter, or other extension, prints USAGE explanation

2. Execution- If the TXT or CMP file named does not exist, print ‘File
not found’ error message OR if the .KEY file matching the .CMP does not
exist, print ‘Matching KEY file does not exist’.

3. Execution- Same TXT behavior; If the filename has CMP extension,
the program will open the CMP and KEY files, and create a TXT file
which is an exact copy of the original TXT file.
10
4. Execution- Compression of TXT file is based on # occurrences X
length of word, with approximately the 255 most frequent product
results words represented by a single byte. Capitalization, implied
spaces, other considerations are left to the programmer’s discretion.

5. Execution- In either mode, after compression or decompression, the
program will display:
Uncompressed bytes: 9999 Compressed bytes: 9999 99%
where the % value (1 or 2 digits) is compressed / uncompressed bytes

 
Here is the code I've completed so far.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
#include <iostream>
#include <map> // a map will be used to count the words.
#include <fstream> // will be used to read from a file.
#include <string> // allows strings to be used.
#include <iomanip> //used to help make the output nicer by using setw();
using namespace std;

// function checkFileExtension will determine if the file entered is a valid file type (.TXT or .CMP)
int checkFileExtension(string filename)
{
	int len = filename.length(); // determines the length of the file
	string str = filename.substr(len - 3, 3); // take the file name length and move to the third position from the right to compare last three letters
	if (str.compare("txt") == 0) // if txt is found, return one and execute correct following function
	{
		return 1;
	}
	else if (str.compare("cmp") == 0) // if cmp, return -1 to execute correct follwing function
	{
		return -1;
	}
	else return 0; // if not a valid type, send to error message
}

// function countWordOccurrences determines if a word has been read in yet and help keeps count of each occurence
void countWordOccurrences(ifstream& file, map<string, int> &words, map<char, int> &symbols)
{
	string tmpword = ""; // default string value
	char tmpchar; // variable to read character by character

	while (file >> noskipws >> tmpchar) // noskipws makes sure the str does not get rid of/ skip whitespaces
	{
		// check to see if the character is a valid character
		if ((tmpchar >= 'A' && tmpchar <= 'Z') || (tmpchar >= 'a' && tmpchar <= 'z'))
		{
			tmpword = tmpword + (char)tolower(tmpchar); // build tmword to be checked
		}
		else if (tmpchar == ' ') // check tmpword for occurrence once a whitespace is detected
		{
			if (words.find(tmpword) == words.end()) // the word is not present in the map, make its count 1
			{
				words[tmpword] = 1;
			}
			else // the word is already present, increment its count by 1
			{
				words[tmpword] = words[tmpword] + 1;
			}
			tmpword.clear(); // empty out tmpword for the next iteration
			tmpword = ""; // declare default tmword value to be empty again
		}
		else // if it is any other character
		{
			if (tmpword.compare("") != 0)
			{
				if (words.find(tmpword) == words.end())
				{
					words[tmpword] = 1;
				}
				else
				{
					words[tmpword] = words[tmpword] + 1;
				}
				tmpword.clear();
				tmpword = "";
			}
			if (symbols.find(tmpchar) == symbols.end()) // determine character was a symbol
			{
				symbols[tmpchar] = 1; // the symbol is not present in the map, make its count 1
			}
			else
			{
				symbols[tmpchar] = symbols[tmpchar] + 1; // the symbol is already present, increment its count by 1
			}
		}
	}
}

// function used to print out map used to find occurences
void printToFile(map<string, int> words, map<char, int> symbols)
{
	map<string, int>::iterator w_iter;
	map<char, int>::iterator s_iter;

	ofstream out("WealthOfNations.cmp", ofstream::out); // will send output to file named WeathOfNations.cmp

	out << "--WORDS AND THEIR COUNTS--" << endl;
	for (w_iter = words.begin(); w_iter != words.end(); w_iter++) // loop to output each word and occurence number
	{
		out << w_iter->first << "\t" << w_iter->second << setw(20) << endl; // output to file, with setw(20) to make numbers line up nicely
	}
	out << endl;
	out << "--SYMBOLS AND THEIR COUNTS--" << endl;
	for (s_iter = symbols.begin(); s_iter != symbols.end(); s_iter++) // loop to output each symbol and occurence number
	{
		out << s_iter->first << "\t" << s_iter->second << setw(20) << endl; // output to file, with setw(20) to make numbers line up nicely
	}
	out.close(); // close WealthOfNations.cmp file
}

int main(int argc, char *argv[])
{
	for (int i = 0; i < argc; i++)
	{
		cout << "USAGE: " << argv[i];
		cout << endl;

	}

	if (argc != 2) // argc should be 2 for correct execution
	{
		cout << "USAGE: " << argv[0] << " <filename>\n"; // print argv[0] assuming it is the program name
		cout << endl;
		system("pause");
		return 0;
	}

	int returncode = 0;
	returncode = checkFileExtension(argv[1]); // assign checkFileExtension return value to returncode in order to determine what the program will do next

	if (returncode == -1) // if returncode is -1 the file was a .cmp and will be "decoded", however this assignment does not require that yet
	{
		cout << "You have specified a CMP file, there is no use for this yet! Please specify a file of type TXT" << endl;
		system("pause");
		return 0;
	}
	else if (returncode == 0) // if returncode is 0 send error message to enter valid txt name
	{
		cout << "Incorrect file specified, check file type/name!" << endl;
		system("pause");
		return 0;
	}
	else // if returncode is 1 execute below code
	{
		ifstream inputfile(argv[1]); // read file
		if (!inputfile.is_open()) // if file cannot be opened or does not exist send error message
		{
			cout << "Couldn't open the file! Please check if file with the specified name exists at the specified location!" << endl;
			system("pause");
			return 0;
		}
		cout << endl;
		cout << "Running successfully. Please check for WeathOfNations.cmp output file once command box closes.";
		map <string, int> words; // declaration of map for word count
		map <char, int> symbols; // declaration of map for symbol count

		countWordOccurrences(inputfile, words, symbols); // call countWordOccurrences function
		printToFile(words, symbols); // call printToFile function

		inputfile.close(); // close file that was read into the program

		return 0;
	}
}
First, you are doing text compression, not binary. A nit, but there, I said it.

------

I think you misunderstood the instruction about usage. Make yourself a function that displays usage, and just call it if anything doesn't match up.

1
2
3
4
5
6
int usage( const char* appname )
{
  cerr << "usage:\n  " << appname << " FILENAME.txt\t\tCompress FILENAME.txt to FILENAME.cmp\n  "
                     << appname << "FILENAME.cmp\t\tDecompress FILENAME.cmp to FILENAME.txt\n";
  return 0;
}
1
2
3
4
int main( ... )
{
  if (argc != 2) return usage( argv[0] );
  ...


------

Your checkFileExtension() function can be fooled. I recommend you use some more robust utilities, like these:
http://www.cplusplus.com/forum/general/34348/#msg185786

You can change the delimiter argument to delimiters for all of those:

 
string ExtractDirectory( const string& path, const string& delimiter = "\\/" )

And use it easily enough:

 
  if (ExtractFileExtension(filename) == ".txt") ...

You can also apply it usefully to get your working file names:

1
2
3
  string txtfilename = ChangeFileExtension(argv[1], ".txt");
  string cmpfilename = ChangeFileExtension(argv[1], ".cmp");
  string keyfilename = ChangeFileExtension(argv[1], ".key");


------

You might also want to consider using an enumeration to help with the checkFileExtension() stuff:

1
2
3
4
5
6
enum FileType { CompressedFile, TextFile, InvalidFile };

FileType checkFileExtension( string filename )
{
  ...
}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
  switch (checkFileExtension(argv[1]))
  {
    CompressedFile: 
      cout << "Alas, I can't decompress yet!\n";
      return 1;

    TextFile:
      CompressFile(txtfilename, cmpfilename, keyfilename);
      break;

    UnknownFile:
    default:
      return usage(argv[1]);
  }


------

You have created a very nice symbol file -- but your homework doesn't actually ask for all that.

When compressing, you should read the text file two times. Once to get the names (and frequencies) of the words in your histogram. (Which you do nicely.) Trim your histogram to contain only the 255 most common words.
Then read the source text file again to use your histogram to compress the file.

When compressing, you should write two files. First, the key file, which simply lists the words you are using in numerical order. Second, the cmp file.

To create the cmp file, read the next word from the source file (just like you do when creating the histogram). If the word appears in the histogram map, write the cout as a single character (byte). If not, write an "uncompressed" code, followed by the exact text that is not compressed, followed by a terminating code.

You might want to ask your professor about code collisions with uncompressed text. He hasn't been clear about it, and (if I read the algorithm correctly), they can happen easily.

(If you use all 255 available codes, then the uncompressed text will necessarily be indeterminate -- meaning, there will be no way for you to determine where the uncompressed text ends and compressed codes begin.

Not unless you terminate the uncompressed text block with the same code that initialized it, which I would make a zero, but your assignment asks it to be value 255 (which is the 256th value).

Hope this helps.
That has helped quite a bit so far. Thank you!
Topic archived. No new replies allowed.