Find positions of punctuation marks with Boost regex



I am trying to find the positions of ALL punctuation marks in a text string so that I can divide the text into sentences. How can I do this with Boost regex?

I tried the following code, but it failed:




    string text="I am a student, living in Washington. ";
    boost::regex expression("[,|;|."|!]");
    boost::match_result Result;
    std::string::const_iterator s=text.begin();
    std::string::const_iterator e=text.end();
    boost::regex_search(s, e, expression, boost::match_defalut)
    {
    .....
    }

You could iterate over the matches (or non-matches, as case may be):

#include <iostream>
#include <string>
#include <iterator>
#include <boost/regex.hpp>

int main()
{
    std::string text="I am a student, living in Washington. ";
    // Inside a character class '|' is a literal pipe, not alternation, so it is dropped here.
    boost::regex expression("[,;.\"!]");

    for(boost::sregex_token_iterator i = boost::sregex_token_iterator(text.begin(), text.end(), expression, -1); i != boost::sregex_token_iterator();  ++i)
        std::cout << "'" << *i << "'\n";
}


Although if your goal is to divide the string, Boost.Tokenizer might be less unwieldy: http://www.boost.org/doc/libs/release/libs/tokenizer/
Also, the string algorithms library has split_regex(), which takes a regex: http://www.boost.org/doc/libs/release/doc/html/boost/algorithm/split_regex.html
How to do this with boost regex.

I tried but failed with the following codes

Given the example on the Boost website, for regex_search, it doesn't appear that you tried very hard...

#include <iostream>
#include <string>
#include <iterator>
#include <boost/regex.hpp>

int main()
{
    //string text="I am a student, living in Washington. ";
    std::string text =
        "I am trying to find ALL positions of punctuations, in a textual string, so"
        " that I can divide the text into sentence. How to do this with boost regex?\n"
        "\n"
        "I tried but failed with the following codes!";

    // Was "[,|;|."|!]"; inside [...] '|' is a literal pipe, not alternation.
    boost::regex expression("[,;.!?]");
    boost::match_results<std::string::const_iterator> Results;
    std::string::const_iterator s=text.begin();
    std::string::const_iterator e=text.end();
    while(boost::regex_search(s, e, Results, expression, boost::match_default))
    {
        std::string t(s, Results[0].second);
        std::cout << "'" << t << "'\n";
        s = Results[0].second;
    }

    return 0;
}


'I am trying to find ALL positions of punctuations,'
' in a textual string,'
' so that I can divide the text into sentence.'
' How to do this with boost regex?'
'

I tried but failed with the following codes!'
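
The loop above prints the text up to each punctuation mark. If the positions themselves are what you need, each match can report its own offset. Here is a minimal sketch of that idea. It uses std::regex from the standard library so that it compiles without Boost; the assumption is that swapping in boost::regex and boost::sregex_iterator gives the same loop. The name punctuation_positions is made up for this example.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the zero-based offset of every punctuation mark in text.
std::vector<std::size_t> punctuation_positions(const std::string& text)
{
    // Characters inside [...] are literal, so no '|' separators are needed.
    static const std::regex punctuation("[,;.!?]");

    std::vector<std::size_t> positions;
    // Walk over every match; position() is the offset of the match from the start of text.
    for (std::sregex_iterator it(text.begin(), text.end(), punctuation), end; it != end; ++it)
        positions.push_back(static_cast<std::size_t>(it->position()));
    return positions;
}
```

For the string from the question, "I am a student, living in Washington. ", this should return the offsets 14 and 36.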


Cubbi's iterator-based solution is more succinct (as well as wider...), but I am not convinced that regex is the right way to go. Is there a particular reason you want to use it?

The Boost.Tokenizer that Cubbi mentioned (used with char_separator) should be able to do what you require to start with (though I'd probably just use string::find_first_of and string::substr for prototyping).
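
As a rough illustration of the find_first_of/substr approach, something along these lines might do for prototyping. The helper name split_after and its default delimiter set are assumptions made for this sketch, not part of any library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: cuts text after each delimiter character, keeping the
// delimiter at the end of the piece it terminates.
std::vector<std::string> split_after(const std::string& text,
                                     const std::string& delims = ",;.!?")
{
    std::vector<std::string> pieces;
    std::size_t start = 0;
    std::size_t pos;
    while ((pos = text.find_first_of(delims, start)) != std::string::npos)
    {
        pieces.push_back(text.substr(start, pos - start + 1)); // include the delimiter
        start = pos + 1;
    }
    if (start < text.size())
        pieces.push_back(text.substr(start)); // trailing text with no delimiter
    return pieces;
}
```

Each value of pos inside the loop is also the position of a punctuation mark, so the same loop can collect positions instead of substrings.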

Also, only .!? actually end sentences; the others demarcate clauses, separate items in lists, etc. So really you should be treating .!? differently from the others. This would probably be easier with custom code than with a convoluted regex, especially when things get more complicated and you try to deal with the likes of "Mr. Jones went shopping. His grocery bill came to $23.45.", "The colors of the American flag are red, white, and blue.", and reported speech.
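
To make that concrete, here is a sketch of custom code that splits only on .!? and only when the mark is followed by whitespace or the end of the text. That rule is an assumption chosen for illustration: it keeps "$23.45" in one piece and ignores commas, but it still splits after "Mr.", so abbreviations and reported speech would need extra handling. The name split_sentences is made up for this example.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits text into sentences at '.', '!' or '?', but only
// when the mark is followed by whitespace or ends the text.
std::vector<std::string> split_sentences(const std::string& text)
{
    std::vector<std::string> sentences;
    std::size_t start = 0; // first character of the current sentence
    std::size_t pos = 0;   // where to resume searching for a terminator
    while ((pos = text.find_first_of(".!?", pos)) != std::string::npos)
    {
        const std::size_t next = pos + 1;
        if (next == text.size() || std::isspace(static_cast<unsigned char>(text[next])))
        {
            sentences.push_back(text.substr(start, next - start));
            // Skip the whitespace between sentences.
            start = next;
            while (start < text.size() && std::isspace(static_cast<unsigned char>(text[start])))
                ++start;
        }
        pos = next;
    }
    if (start < text.size())
        sentences.push_back(text.substr(start)); // trailing text without a terminator
    return sentences;
}
```

With the first example sentence above, this yields three pieces, "Mr.", "Jones went shopping." and "His grocery bill came to $23.45.", which shows both what the whitespace rule buys you and where it falls short.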

Andy
