Judge utf8 string is chinese

Hello, I have a UTF-8 string. How can I use C++ regex to check that the string contains only Chinese characters, English letters (a-zA-Z), and digits (0-9), and to count how many characters it contains?
For example, "hello饿货不哭12" contains only the characters mentioned above, and its length is 11.

Or is there any other way that would be simpler and cleaner?

TheIdeasMan (6781)

I found this:

http://en.cppreference.com/w/cpp/regex/regex_traits/lookup_classname

It might work for you, but I really don't know anything about it.

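The size() member function works for strings of every character width (std::string, std::wstring, std::u16string, std::u32string), but it counts code units, so on a UTF-8 std::string it returns the number of bytes rather than the number of characters.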

Regards :+)

Cubbi (4774)

If you're looking for the number of code points in a UTF-8 string, it's not a job for regular expressions.

Here is a C++11, locale-independent solution that works on Windows (when compiled with /utf-8) and Linux:

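#include <codecvt>   // std::codecvt_utf8 (deprecated since C++17)
#include <iostream>
#include <locale>    // std::wstring_convert (deprecated since C++17)
#include <string>

int main() {
  std::string in = "hello饿货不哭12";
  // Decode the UTF-8 bytes into a wide string; its size() is the code point
  // count wherever wchar_t is 32 bits wide. Where wchar_t is 16 bits (Windows),
  // the count is right for Basic Multilingual Plane characters like these.
  std::cout << "There are "
            << std::wstring_convert<std::codecvt_utf8<wchar_t>>{}.from_bytes(in).size()
            << " code points in " << in << '\n';
}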

live demo https://wandbox.org/permlink/vK6r3zjJz8DntyZE


There are 11 code points in hello饿货不哭12
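
Since std::wstring_convert and <codecvt> are deprecated as of C++17, a hand-rolled count may be worth having as a fallback. The following is only a rough sketch, not part of the live demo above: the function name count_code_points is made up for illustration, and it assumes the input is already valid UTF-8 (it does no validation). It relies on the fact that every code point has exactly one non-continuation byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string that is ASSUMED to be
// valid UTF-8. Continuation bytes always look like 10xxxxxx, so every byte
// that does not match that pattern starts a new code point.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Calling count_code_points on the string from the question should give the same 11 as the conversion-based version, because the four Chinese characters take three bytes each but only one lead byte each.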

If you actually need regular expressions to classify a Unicode string (for example, to say how many code points are Chinese characters and how many are English letters), you will need boost.regex, because C++ regex (essentially a 14-year-old version of boost.regex) doesn't do Unicode classification.
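
If the check from the original question is all that's needed, it can also be done without any regex library by decoding the UTF-8 by hand. What follows is a minimal sketch under stated assumptions, not a definitive implementation: the names decode_next, is_allowed and check_and_count are invented here, "Chinese character" is assumed to mean the CJK Unified Ideographs block (U+4E00 to U+9FFF) only, and the decoder rejects malformed byte sequences but does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decodes one code point starting at s[i] and advances i past it.
// Returns false on a bad lead byte, a truncated sequence, or a bad
// continuation byte. Overlong forms are NOT rejected in this sketch.
bool decode_next(const std::string& s, std::size_t& i, std::uint32_t& cp) {
    assert(i < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;  // number of continuation bytes expected
    if (lead < 0x80)                { cp = lead;        extra = 0; }
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
    else return false;  // stray continuation byte or invalid lead byte
    if (s.size() - i <= extra) return false;  // sequence runs off the end
    for (std::size_t k = 1; k <= extra; ++k) {
        const unsigned char c = static_cast<unsigned char>(s[i + k]);
        if ((c & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (c & 0x3F);
    }
    i += extra + 1;
    return true;
}

// ASSUMPTION: "Chinese" means the CJK Unified Ideographs block only.
bool is_allowed(std::uint32_t cp) {
    return (cp >= U'0' && cp <= U'9') ||
           (cp >= U'A' && cp <= U'Z') ||
           (cp >= U'a' && cp <= U'z') ||
           (cp >= 0x4E00 && cp <= 0x9FFF);
}

// Returns true if every code point is a digit, an ASCII letter, or a CJK
// ideograph; on success `length` holds the number of code points. On
// failure `length` holds the count reached before the offending character.
bool check_and_count(const std::string& utf8, std::size_t& length) {
    length = 0;
    std::size_t i = 0;
    while (i < utf8.size()) {
        std::uint32_t cp = 0;
        if (!decode_next(utf8, i, cp) || !is_allowed(cp)) return false;
        ++length;
    }
    return true;
}
```

For "hello饿货不哭12" this should return true with a length of 11, while a string such as "hi!" is rejected at the '!'. Widening is_allowed to cover the CJK extension blocks or punctuation is a matter of adding more ranges.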
