Parse question

I'm trying to parse a text file which will have many lines similar to this:

2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211

and I want to write an output file that starts where the '830' is seen. At the moment I have the code below, but I'm not sure how to start writing once the regex matches. Any suggestions? At the moment it of course just writes every line where the regex matches.

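#include <cstdio>	// getchar
#include <fstream>
#include <iostream>
#include <regex>
#include <string>

void parse()
{
	std::string key{ "830" };

	std::ifstream inputFile;
	inputFile.open("source.txt");
	if (!inputFile.is_open())
	{
		std::cout << "Error opening input file";
		return;	// nothing to read
	}

	std::ofstream outputFile;
	outputFile.open("dest.txt");
	if (!outputFile.is_open())
	{
		std::cout << "Error opening output file";
		return;	// nowhere to write
	}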

	std::regex e(key);
	std::string line{};

	while (!inputFile.eof())
	{	
		std::getline(inputFile, line);
		bool match = std::regex_search(line, e);
		if (match)
		{
			outputFile << line << std::endl;
		}
	}
	inputFile.close();
	outputFile.close();
}

int main(void)
{
	parse();
	getchar();
	return 0;
}
What do you want to see if this is your input?

2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:522] - 83001731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:233] - 83041731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:333] - 83031731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:433] - 83021731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:633] - 83011731101211


Only lines matching 8305

2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211


Lines between 8305 pairs.

2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:233] - 83041731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:333] - 83031731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:433] - 83021731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211


Everything from the first 8305.

2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:233] - 83041731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:333] - 83031731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:433] - 83021731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:533] - 83051731101211
2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:633] - 83011731101211



while (!inputFile.eof())
	{	
		std::getline(inputFile, line);

And this should be
 
while ( std::getline(inputFile, line) )

eof doesn't work as you expect. You need to test the actual file read operation for success.
eof looks to the past, it does not predict the future.


Thanks for the reply,

if the input is:

2019-03-25 08:43:12,628 [23 ] DEBUG [1731101211] [Msg:522] - 83001731101211

the expected output would be:

83001731101211

so I need to start writing to the output file at the point the regex is true, with anything starting "830"
https://en.cppreference.com/w/cpp/regex/regex_search
Well you want to add in a std::match_results m parameter as well.

And change your regex to be say

std::string key{ " - (830[0-9]+)" };

The actual matched text, i.e. the part between the ( ), should be in m[1].
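
A minimal sketch of that suggestion, using std::smatch (the std::match_results specialisation for std::string). The wrapper name extract_code is made up here, and returning an empty string for "no match" is just one possible convention:

```cpp
#include <regex>
#include <string>

// Hypothetical wrapper: returns the digits captured after " - ", or "" if none.
std::string extract_code(const std::string& line)
{
    static const std::regex re(" - (830[0-9]+)");
    std::smatch m;
    if (std::regex_search(line, m, re))
        return m[1].str();   // m[0] is the whole match, m[1] the first ( ) group
    return {};
}
```

Anchoring on " - " keeps the regex from picking up an "830" that happens to appear earlier in the line.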
Hmm, I currently have it as below, but it is only outputting the match plus 1 character. How can I tell it to read the rest of the line after the match?

	std::string key{ "830[0-9]" };
	std::regex e(key);
	std::smatch m;
	std::string line{};
	int lines_processed = 0;

	std::chrono::steady_clock::time_point tp = std::chrono::steady_clock::now();

	while (std::getline(inputFile, line))
	{	
		++lines_processed;
		bool match = std::regex_search(line, m, e);
		if (match)
		{
			outputFile << m[0] << std::endl;
		}
	}

	auto end = std::chrono::steady_clock::now();
	// elapsed time is end minus start; the other way round gives a negative count
	std::cout << "\n\nLines Processed: " << lines_processed <<
		"\nSeconds Taken: " << std::chrono::duration_cast<std::chrono::seconds>(end - tp).count();
That's why my regex is [0-9]+ and not just [0-9]

The + makes all the difference.
It matches 1 or more of any digit.
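
A quick way to see the effect of the + is to run both patterns over the same text and compare what m[0] holds. The helper name first_match below is invented for the comparison:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the first match of pattern in text, or "" if none.
std::string first_match(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);
    std::smatch m;
    return std::regex_search(text, m, re) ? m[0].str() : std::string{};
}
```

With the text "[Msg:533] - 83051731101211", the pattern "830[0-9]" should give "8305" (830 plus exactly one digit), while "830[0-9]+" should give "83051731101211".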
Thanks Salem, it is working great now.

Do you have any tips on performance at all? The processing seems to take a lot longer than I expected of C++

edit: it also stops reading the line once a non-digit character is hit. What is the syntax to have it match any ASCII character as well, instead of just 0-9?
The first step would be to time this.
while (std::getline(inputFile, line))
	{	
		++lines_processed;
	}

This is your base time to just read the file.

Then perhaps time this.
while (std::getline(inputFile, line))
	{	
		++lines_processed;
		bool match = std::regex_search(line, m, e);
		if (match)
		{
			++lines_matched;
		}
	}


If it turns out that 90% of your time is spent just in file I/O, there isn't a lot you can do about it.
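
Both measurements can share one function, so the only difference between the two runs is the regex work. A rough sketch, where ScanStats and timed_scan are names invented for this example; pass a null pointer to time the read alone:

```cpp
#include <chrono>
#include <istream>
#include <regex>
#include <string>

// Hypothetical result type for one timed pass over the input.
struct ScanStats
{
    long lines = 0;
    long matched = 0;
    double seconds = 0.0;
};

// re == nullptr measures file reading only; otherwise each line is also searched.
ScanStats timed_scan(std::istream& in, const std::regex* re)
{
    ScanStats s;
    std::string line;
    std::smatch m;
    const auto start = std::chrono::steady_clock::now();
    while (std::getline(in, line))
    {
        ++s.lines;
        if (re && std::regex_search(line, m, *re))
            ++s.matched;
    }
    const auto end = std::chrono::steady_clock::now();
    s.seconds = std::chrono::duration<double>(end - start).count();   // end - start, in seconds
    return s;
}
```

Bear in mind that a second pass over the same file will probably be served from the OS cache, so it is worth running each variant more than once before comparing the numbers.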

If your file is on rotating storage (a hard disk), the difference between the nanoseconds of your processor instructions and the milliseconds it takes the hard disk heads to move to another track is vast. There's no point improving the code just to spend longer waiting for more data.

The OS usually makes a good job of caching file reads ahead of need, but it's also easily possible for your code to catch up to the read-ahead and then be stuck waiting.
edit: it also stops reading the line once a non-digit character is hit. What is the syntax to have it match any ASCII character as well, instead of just 0-9?

Can you clarify? If you want all ASCII, don't check it at all, unless you are trying to weed out unprintable characters or upper ASCII (> 127), in which case test for > 127 and < 32 (space is 0x20, i.e. 32 decimal, so the control characters are 0-31, plus 127 for DEL).
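
If "don't check it at all" is what you want, one option is to skip the regex and take everything from the key to the end of the line. A sketch with a made-up helper name, rest_from_key. The regex equivalent would be something like "830.*", since . matches any character except a line terminator in the default ECMAScript grammar:

```cpp
#include <string>

// Hypothetical helper: returns the line from the first occurrence of key onward,
// or "" if key is not present.
std::string rest_from_key(const std::string& line, const std::string& key)
{
    const std::string::size_type pos = line.find(key);
    return pos == std::string::npos ? std::string{} : line.substr(pos);
}
```

Note that find() stops at the first "830" anywhere in the line, so if that text could also appear in an earlier field, searching for " - 830" and skipping the three-character prefix would be safer. A plain substring search is also likely to be cheaper than std::regex, though that is worth measuring rather than assuming.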
