Extracting text from sgm file.

I want to read from a file and extract the text between two keywords. I've been trying with this:
 bool found = false;
    char symbols[3] = {'<', '>', '/'};
    auto pos = Words.find(symbols);
    auto endpos = Words.find(symbols, pos + sizeof(symbols));
    extracted = Words.substr(pos + sizeof(symbols), endpos - sizeof(symbols) - pos);
    if (!symbols)
    {
        cout << " " << extracted << " ";
    }


But "pos" and "endpos" are basically useless: they have exactly the same value every time I compile and run the code, and an exception is thrown. The text I want to extract is bounded by the two strings "<BODY>" and "</BODY>", and I am only able to use iostream, fstream, string, and algorithm. Help is needed. Also, I am not an expert in programming, so please be patient with me.
"Words" is a global string used for oop.
Extracting text between <BODY> and </BODY> was asked and answered here http://www.cplusplus.com/forum/general/275203/

You are also misunderstanding .find(). It finds the first occurrence of the specified char/string, starting the search at pos. You are trying to find the first occurrence of the whole sequence "<>/" (assuming symbols is null-terminated, which is not guaranteed because the array has no terminating null).

What you probably mean is to use find_first_of() that searches the string for the first character that matches any of the characters specified in its arguments. Again though, the specified chars need to be null-terminated for a c-style string.

const char * const symbols {"<>/"};

auto pos = Words.find_first_of(symbols);
...


See http://www.cplusplus.com/reference/string/string/find_first_of/
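
A minimal sketch of the difference between the two calls, using a made-up sample string; find_demo is a hypothetical name used only for this illustration. find() looks for the whole sequence, while find_first_of() treats its argument as a set of single characters:

```cpp
#include <string>

// Returns true when find() and find_first_of() behave as described in the comments.
bool find_demo()
{
	const std::string text {"a <BODY> b"};

	// find() searches for the whole sequence "<>/", which does not occur in text.
	const bool no_sequence {text.find("<>/") == std::string::npos};

	// find_first_of() matches any single character from the set: '<' at index 2.
	const bool first_symbol {text.find_first_of("<>/") == 2};

	// Searching again from index 3 finds the next symbol, '>' at index 7.
	const bool next_symbol {text.find_first_of("<>/", 3) == 7};

	// For a whole tag, find() is the call that matches the complete string.
	const bool whole_tag {text.find("<BODY>") == 2};

	// As a character set, "<BODY" also matches a lone 'B'.
	const bool set_matches_b {std::string {"xBODY"}.find_first_of("<BODY") == 1};

	return no_sequence && first_symbol && next_symbol && whole_tag && set_matches_b;
}
```

So for locating a tag such as <BODY>, find() with the full tag text is probably the better fit; find_first_of() is more useful when any one of several delimiters will do.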
@seeplus yes, I know but that forum did not work out for me since everybody was using 3rd party libraries and I didn't make myself clear enough I suppose. And I know that find() and find_first_of() differ but it is still throwing exception out_of_range and I can't seem to fix it.
  while (file >> Words) {
                size_t found = Words.find_first_of("<BODY");
                string endDel = Words.substr(found);
                found = endDel.find_first_of(">");
                endDel = endDel.substr(found+1);
                size_t end_found = endDel.find_first_of("</BODY");
                string findString = endDel.substr(0,end_found);
                cout << " " << findString << " ";

                }


I tried this after reading the xml parse thread.
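You need to check that found is actually != string::npos before calling Words.substr(found) on line 3; substr throws out_of_range when the start position is past the end of the string.
I assume not every word contains <BODY.
Also, it would be helpful if you showed us the input file.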
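
A small sketch of that check, assuming the words are read one at a time as in the loop above; from_tag and substr_throws_on_npos are hypothetical helper names. The second function only demonstrates the failure mode:

```cpp
#include <stdexcept>
#include <string>

// Returns the part of `word` starting at `tag`, or an empty string if `tag` is absent.
std::string from_tag(const std::string& word, const std::string& tag)
{
	const std::size_t found {word.find(tag)};

	if (found == std::string::npos)
		return {};	// Without this check, substr(npos) would throw std::out_of_range

	return word.substr(found);
}

// Shows the failure mode: substr with a start position past the end throws.
bool substr_throws_on_npos(const std::string& word)
{
	try {
		(void)word.substr(std::string::npos);
	} catch (const std::out_of_range&) {
		return true;
	}
	return false;
}
```

With a word such as "hello", from_tag(word, "<BODY") simply returns an empty string, whereas the unchecked substr call ends in the exception.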
> I know but that forum did not work out for me since everybody was using 3rd party libraries


Incorrect. My posted code only used standard C++.

If you want to do it this way (and the text between <BODY> and </BODY> can't have any white space chars), then consider:

#include <fstream>
#include <iostream>
#include <string>
#include <algorithm>
#include <sstream>

int main()
{
	std::istringstream file {"<BODY>thetext</BODY> not used <BODY>moretext</BODY>"};

	for (std::string Words; file >> Words; ) {
		const size_t found1 {Words.find("<BODY")};

		if (found1 != std::string::npos) {
			std::string endDel {Words.substr(found1 + 5)};	// Add size of <BODY
			const size_t found2 {endDel.find(">")};

			endDel = endDel.substr(found2 + 1);
			const size_t end_found {endDel.find("</BODY")};

			if (end_found != std::string::npos) {
				const std::string findString {endDel.substr(0, end_found)};
				std::cout << findString << '\n';
			}
		}
	}
}



thetext
moretext
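
The word-by-word loop above cannot handle a body that contains spaces or spans several lines. A minimal sketch of one way around that, assuming the whole file has already been read into a single std::string (for example by appending each std::getline result plus '\n'); extract_bodies and its parameters are names made up for this illustration:

```cpp
#include <string>

// Appends the text of every <BODY>...</BODY> pair in `text` to `out`,
// each followed by '\n'. Returns how many bodies were found.
std::size_t extract_bodies(const std::string& text, std::string& out)
{
	const std::string opent {"<BODY>"};
	const std::string closet {"</BODY>"};
	std::size_t count {};

	for (std::size_t pos {}; ; ) {
		const std::size_t start {text.find(opent, pos)};
		if (start == std::string::npos)
			break;	// No further opening tag

		const std::size_t from {start + opent.size()};
		const std::size_t end {text.find(closet, from)};
		if (end == std::string::npos)
			break;	// Unterminated <BODY>: stop rather than guess

		out += text.substr(from, end - from);
		out += '\n';
		++count;
		pos = end + closet.size();	// Continue after the closing tag
	}

	return count;
}
```

This is still plain string matching, not SGML parsing, so it will be fooled by tags that appear inside comments or attributes.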


> I am only able to use iostream, fstream, string, and algorithm.
Why?

I'm sure it's fine for a homework toy, but XML parsing is far more complicated than simple string matching.
 
std::istringstream file {"<DIV>thetext</DIV><!-- was <BODY> --> not used <BODY>moretext</BODY>"};


> I am only able to use iostream, fstream, string, and algorithm.
Pretty sure my code met that requirement. I don't see how this thread is significantly different from the last (other than you actually wrote code this time, so good for you).

The folks here using stringstreams are just using it as a temporary replacement for fstreams because all stream objects can be operated on in the same way in C++.
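
A short sketch of that point: a function written against std::istream& accepts a file stream and a string stream alike. count_words and count_words_in are hypothetical names for this illustration:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Counts whitespace-separated words from any input stream.
std::size_t count_words(std::istream& in)
{
	std::size_t n {};

	for (std::string word; in >> word; )
		++n;

	return n;
}

// Feeds a string to the same function through an in-memory stream.
// A std::ifstream opened on a real file could be passed to count_words in the same way.
std::size_t count_words_in(const std::string& text)
{
	std::istringstream iss {text};
	return count_words(iss);
}
```

Swapping the std::istringstream for a std::ifstream needs no change to count_words itself, which is why the posted examples are easy to adapt to real files.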
@thmm this is an example input file.
<REUTERS ... >
<DATE>26-FEB-1987 15:01:01.79</DATE><TOPICS><D>cocoa</D></TOPICS>
<PLACES><D>el-salvador</D><D>usa</D><D>uruguay</D></PLACES><PEOPLE></PEOPLE>
<ORGS></ORGS><EXCHANGES></EXCHANGES><COMPANIES></COMPANIES><UNKNOWN> ... </UNKNOWN><TEXT> ...
<TITLE>BAHIA COCOA REVIEW</TITLE>
<DATELINE> SALVADOR, Feb 26 - </DATELINE>
<BODY>
Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since
...
...
Brazilian Cocoa Trade Commission after carnival which ends midday on February 27.
Reuter
&#3;
</BODY></TEXT>
</REUTERS>




The code has to read an entire directory of 21 similar files (with much more text) and find the top 10 most repeated words in the entire directory (it is an article split into 21 .sgm files) using only pointers and arrays. I must repeat myself: I am not experienced whatsoever, so bear with me.
Yes, as per my code in http://www.cplusplus.com/forum/general/275203/. That code allows the file name(s) to be specified at run time on the command line. If you set it up so that wildcards are expanded, as per later posts in that thread, then you can obtain the required data from all the stipulated files.

That code just displays the extracted data. It seems that you want to show the top 10 most repeated words?

That's fairly easy using a std::map

Why say only pointers and arrays?
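
For comparison, a minimal sketch of the std::map approach, assuming the body text is already in a std::string and that <map>, <vector>, and <sstream> are allowed (which goes beyond the headers listed earlier in the thread); top_words is a hypothetical name:

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Returns up to `n` (word, count) pairs, most frequent first.
// Words with equal counts stay in alphabetical order.
std::vector<std::pair<std::string, std::size_t>> top_words(const std::string& text, std::size_t n)
{
	std::map<std::string, std::size_t> counts;
	std::istringstream iss {text};

	for (std::string word; iss >> word; )
		++counts[word];	// operator[] creates a zero count the first time a word is seen

	// Copy to a vector so the entries can be ordered by count instead of by key.
	std::vector<std::pair<std::string, std::size_t>> sorted(counts.begin(), counts.end());

	std::stable_sort(sorted.begin(), sorted.end(),
		[](const auto& a, const auto& b) { return a.second > b.second; });

	if (sorted.size() > n)
		sorted.resize(n);

	return sorted;
}
```

Calling top_words(body, 10) would give the ten most frequent words; case folding and punctuation stripping are left out here to keep the sketch short.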

A rough idea in pseudocode using only C-like things:

struct word_t
{
    char word[1000]{};
    int count{};
};

int main()
{
    int used{}; // keep track of how many words you have so you don't search empty words.
    word_t *wordcounter = new word_t[100000];
    for(all the text)
    {
        split out a 'word'
        convert the word to upper or lower case
        search the array to see if it is already in it. // search can be a for loop.
        if not, put it in and wordcounter[used].count++ and also used++
        else if you found it, count++ at that location
    }
}


When that is all done, sort the array on count and the top 10 biggest are right there to be used.
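
A runnable sketch of that idea, under a few assumptions: std::string replaces the fixed char buffer, the table is a plain array passed by pointer, and new words are dropped once the table is full. add_word and sort_by_count are hypothetical names:

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

struct word_t
{
	std::string word;
	int count {};
};

// Adds one occurrence of `raw` (lower-cased) to `table`, which currently holds
// `used` entries out of `capacity`. Returns the new number of used entries.
std::size_t add_word(word_t* table, std::size_t used, std::size_t capacity, const std::string& raw)
{
	assert(used <= capacity);

	std::string word;
	for (const char ch : raw)
		word += static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));

	if (word.empty())
		return used;

	for (std::size_t i = 0; i < used; ++i)	// Linear search, as in the pseudocode
		if (table[i].word == word) {
			++table[i].count;
			return used;
		}

	if (used < capacity) {	// New words are dropped once the table is full
		table[used].word = word;
		table[used].count = 1;
		++used;
	}

	return used;
}

// Orders the used part of the table by descending count.
void sort_by_count(word_t* table, std::size_t used)
{
	std::sort(table, table + used,
		[](const word_t& a, const word_t& b) { return a.count > b.count; });
}
```

After sort_by_count, the first ten used entries (or fewer, if the table holds fewer) are the most frequent words. The linear search makes this quadratic in the number of distinct words, which is likely acceptable for an exercise of this size.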
Just using an array to store the word cnts, consider:

#include <iostream>
#include <string>
#include <utility>
#include <fstream>
#include <sstream>
#include <cctype>
#include <functional>
#include <algorithm>

const size_t maxwrds {500};

struct Words {
	size_t cnt {};
	std::string wrd;
};

using WrdsCnt = Words[maxwrds];

std::string tolower(const std::string& str)
{
	std::string low;
	low.reserve(str.size());

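	for (const auto ch : str)
		if (const auto uch = static_cast<unsigned char>(ch); !std::ispunct(uch))	// Ignore punctuation; the cast avoids undefined behaviour for negative char values
			low += static_cast<char>(std::tolower(uch));	// Make lower case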

	return low;
}

size_t getwrd(const std::string& line, WrdsCnt& wc)
{
	static size_t nowrds {};
	std::istringstream iss(line);

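	for (std::string wrd; iss >> wrd; ) {
		const auto w = tolower(wrd);	// Normalise before searching so that "The" matches "the"

		if (w.empty())
			continue;	// Word was punctuation only

		bool got {};

		for (size_t i = 0; !got && i < nowrds; ++i)
			if (wc[i].wrd == w) {
				++wc[i].cnt;
				got = true;
			}

		if (!got && nowrds < maxwrds) {	// Don't write past the end of the array
			wc[nowrds].wrd = w;
			++wc[nowrds++].cnt;
		}
	}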

	return nowrds;
}

int main(int argc, char* argv[])
{
	const std::string opent {"<BODY>"};
	const std::string closet {"</BODY>"};

	WrdsCnt wrdcnts;
	size_t nowrds {};

	std::cout << "Processing files - ";

	for (int a = 1; a < argc; ++a) {
		std::ifstream ifs(argv[a]);

		if (ifs) {
			std::string body;

			std::cout << argv[a] << "  ";

			for (auto [text, gotbod] {std::pair {std::string{}, false}}; std::getline(ifs, text); )
				for (size_t fnd {}, pos {}; fnd != std::string::npos; )
					if (gotbod)
						if (fnd = text.find(closet, pos); fnd != std::string::npos) {
							gotbod = false;
							body += text.substr(pos, fnd - pos);
							pos = fnd + closet.size();	// Continue searching after the closing tag
							nowrds = getwrd(body, wrdcnts);
							body.clear();
						} else
							body += text.substr(pos) + " ";
					else
						if (fnd = text.find(opent, pos); fnd != std::string::npos) {
							gotbod = true;
							pos = fnd + opent.size();
						}
		} else
			std::cout << "\nCannot open file " << argv[a] << '\n';
	}

	std::sort(std::begin(wrdcnts), std::begin(wrdcnts) + nowrds, [](const auto& a, const auto& b) {return a.cnt > b.cnt; });

	std::cout << '\n';
	for (size_t top10 = 0; const auto& [cnt, wrd] : wrdcnts)
		if (top10++ < 10)
			std::cout << wrd << "  " << cnt << '\n';
		else
			break;
}


Hello again, I am trying to read from a file. I am getting this error from the part of the code marked below: "no operator ">>" matches these operands; operand types are std::ifstream >> std::string"
string FileReader::ReadFile(const string fileName) const
{
	cout << "Please insert file with proper directory!" << endl; 
	ifstream file(fileName); 
	if (!file)
	{
		cerr << "ERROR: file not found!" << endl; 
		exit(1); 
	}
	if (file.is_open())
	{
		while (!file.eof()) {
			while (file >> Words) // here is the error
			{
				return Words; 
			}
		}
	}
	file.close(); 
}

Am I doing something wrong? The weird part is that it worked before, and I have no clue why it is not working right now. I tried changing fileName to const char* fileName[] but that didn't work either.
What type is Words? What is the layout of the file?

Also, you'd code this as:

string FileReader::ReadFile(const string fileName) const
{
	cout << "Please insert file with proper directory!" << endl; 
	ifstream file(fileName); 
	if (!file)
	{
		cerr << "ERROR: file not found!" << endl; 
		exit(1); 
	}

        // This is wrong. It will only try to read one item of type Words
	while (file >> Words) {
		return Words; 
	}
}

@seeplus Words is a private string in class fileReader, layout of file is .sgm, I figured since Words is a string it should read the entire thing. Tried reading line by line using getline but I still get the same error.
If Words is indeed of type std::string, then file >> Words will obtain one word from the file stream. The function will then return with that word.

Are you missing an include of string somewhere?
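
One other possible cause, assuming Words really is a non-static, non-mutable std::string member: ReadFile is declared const, so inside it Words is treated as a const std::string, and there is no operator>> that reads into a const string. The sketch below drops const from the reading function. ReadAll and demo_read are hypothetical names, and the function takes a std::istream& rather than a file name only so that it can be tried without a file on disk:

```cpp
#include <istream>
#include <sstream>
#include <string>

class FileReader {
public:
	// Not const: the function assigns to the member Words.
	std::string ReadAll(std::istream& file)
	{
		Words.clear();

		// operator>> reads one whitespace-separated word at a time,
		// so loop until the stream is exhausted instead of returning early.
		for (std::string word; file >> word; ) {
			if (!Words.empty())
				Words += ' ';
			Words += word;
		}

		return Words;
	}

private:
	std::string Words;
};

// Usage with an in-memory stream standing in for the .sgm file.
std::string demo_read()
{
	std::istringstream file {"<BODY>\nShowers continued\n</BODY>"};
	FileReader reader;
	return reader.ReadAll(file);
}
```

If the function has to stay const for other reasons, reading into a local std::string and returning that would avoid touching the member at all.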