Filter text enclosed in specific tags from a text file

I have a huge text file in the following format:
{Bunch of text
.
.
}
<tag> Text1 </tag>
{Bunch of text
.
.
}
<tag> Text2 </tag>
{Bunch of text
.
.
}
<tag> Text3 </tag>
{Bunch of text
.
.
}
<tag> Text4 </tag>
// This continues till about Text600

I want to extract Text1, Text2, Text3, Text4, ..., Text600 into an output file. How can I achieve this?

/* BTW, I am not trying to get my homework done here. I am an ex-programmer who moved to marketing some time ago, and today I ran into this problem, which I believe can be solved easily through programming. */
Something like the following?

#include <fstream>
#include <string>

int main() {

  std::ifstream inputFile( "myinputfile.txt" );
  std::ofstream outputFile( "myoutputfile.txt" );

  std::string file_content;
  std::string line_content;

  // Read the whole file into memory, keeping the line breaks
  while ( std::getline( inputFile, line_content ) ) {
    file_content.append( line_content );
    file_content.push_back( '\n' );
  }

  const std::string open_tag = "<tag>";
  const std::string close_tag = "</tag>";

  std::size_t start_pos = file_content.find( open_tag );
  while ( start_pos != std::string::npos ) {

    start_pos += open_tag.size(); // skip the <tag> characters
    std::size_t end_pos = file_content.find( close_tag, start_pos );
    if ( end_pos == std::string::npos )
      break; // unmatched <tag>: stop instead of searching from a wrapped-around position

    // Write the enclosed text, one entry per output line
    outputFile << file_content.substr( start_pos, end_pos - start_pos ) << '\n';

    start_pos = file_content.find( open_tag, end_pos + close_tag.size() );
  }

  return 0; // both streams close themselves when they go out of scope
}


EDIT: I haven't checked for the boundary conditions though. You will need to debug it.
One way:

#include <fstream>
#include <iostream>
#include <regex>
#include <string>

int main()
{
    std::regex expression("<tag>([^<]*)</tag>");

    std::ifstream in("test.txt");

    std::string line;
    while (std::getline(in, line))
    {
        std::smatch results;
        if (std::regex_search(line, results, expression))
            std::cout << results[1] << '\n';
    }
}


[Edit: If you're using GCC, I would suggest boost::regex, since GCC releases before 4.9 shipped an incomplete std::regex.]
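Note that std::regex_search reports only the first match on a line. If a line might hold more than one <tag>...</tag> pair, something along these lines should pick up all of them. This is only a sketch: extract_tagged is a made-up helper name, and it still assumes each pair opens and closes on the same line.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the posts above): returns the text of every
// <tag>...</tag> pair found in `line`, in order of appearance.
std::vector<std::string> extract_tagged(const std::string& line)
{
    static const std::regex expression("<tag>([^<]*)</tag>");

    std::vector<std::string> found;
    // std::sregex_iterator walks over successive non-overlapping matches.
    for (std::sregex_iterator it(line.begin(), line.end(), expression), end;
         it != end; ++it)
    {
        found.push_back((*it)[1].str()); // capture group 1 is the enclosed text
    }
    return found;
}
```

Inside the getline loop you would then print each element of extract_tagged(line) instead of just results[1].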
Filter text enclosed in specific tags from a text file

Are you talking about getting multiple values from a single type of tag (I mean, tags with the same name)? Or was the repeated use of "<tag>" just for illustrative purposes? (Your post title does refer to tags.)

Does the text between tags only ever occupy a single line?

And how HUGE is huge??

Andy

PS Is it an actual XML file, or just text with some tags? And do you happen to know the encoding of the file?