regex - searching XML that doesn't contain a newline

Josuttis gives an example of a regex that searches XML (that doesn't contain a newline).

The XML string that is searched and the regex are as follows:

    string data = "<person>"
                  "<first>Andrew</first>"
                  "<last>Ant</last>"
                  "</person>";

    regex reg("<(.*)>([^<]*)</(\\1)>");


The program uses the regex_search() function to iterate over the subtags.
http://cpp.sh/9hvvn

The output is:


match:  <first>Andrew</first>
 tag:   first
 value: Andrew
match:  <last>Ant</last>
 tag:   last
 value: Ant



How exactly does this regex work?

The subpattern is: ([^<]*)

This means that (after separating out the outer tag), the subpattern searches for any char other than <.

Yet it matches <first>Andrew</first>.

What exactly is happening here?

Thanks.
The subpattern does not match <first>Andrew</first>. The subpattern corresponds to the output line: value: ... and that is: Andrew. The pattern as a whole matches <first>Andrew</first>.
Yes, what you have said is basically correct.

I am giving below the detailed steps of what exactly happens:

a) The regex_search() function searches the entire data string for the 1st match with the regex.

b) It first considers the outer tag <person> as a possible match for the 1st subpattern <(.*)>.

However, the 2nd subpattern ([^<]*) (which searches for the value) looks for any character except <, any number of times. The < character is found in the inner tag <first>. Since this doesn’t match the 2nd subpattern, the function also rejects the outer tag <person> as a probable match for the 1st subpattern.

c) The function now considers the inner tag <first> as a possible match for the 1st subpattern <(.*)>. This time, "Andrew" does match the 2nd subpattern for the value. Therefore, the function also accepts <first> as a match for the 1st subpattern.

d) Successive iterations in the program find subsequent inner tags; in this case <last> and "Ant".

"However, the 2nd subpattern ([^<]*) (which searches for the value) looks for any character except <, any number of times. The < character is found in the inner tag <first>. Since this doesn’t match the 2nd subpattern, the function also rejects the outer tag <person> as a probable match for the 1st subpattern."


Actually, the 2nd subpattern is satisfied with an empty string (zero occurrences still count as "any number of times"). The regex then tries to match <first> against </(\\1)> (where \\1 evaluates to "person"), and that is what fails.

So, since the regex could not find a match on the string starting with "<person>", it moves on to the inner tags.


Was your second post your own answer to your first post, or is there a question that has not been answered?