Anyone good with regex?

I need a regex that will return true for this "1,000.00".
And return false for this "100.000,0000.2345"

I'm looking for a regex that will identify numbers formatted in the common US way: N,NNN,NNN.0000
but will not return true for just a series of digits, commas, and dots that would not be considered a number in the US.

jonnin (11350)

You can experiment with it online, for example at https://regex101.com/

A lot of US software also accepts the reversed European format just to be 'friendly', though it's uncommon to see it here. That is, 1.000.000,42 instead of 1,000,000.42 (comma and period swapped). Whether you care about that depends on what it's for.

https://regexland.com/regex-decimal-numbers/ would get you started but I don't think that one has the commas. Or try to find one that does it all.

kbw (9488)

You probably need to spell out what you think is wrong with the false example. I guess it's that, having used a . as the separator, you also see a , and then there's the group of four zeros ... If you can express what's wrong, it'll go a long way toward coming up with a solution.

I'm not good enough to come up with a regex quickly for that, but I think I might manage it with some time and testing. In any event, such a regex will have significant runtime cost, but you won't notice it on modern kit.

Personally, I'd write a parser, for me it'd be quicker (because I'm not good with regex'), and it'll run faster. Not an issue for a UI, but might be when validating input from streams.
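
A minimal sketch of the hand-written parser idea, assuming the format from the question: one to three digits, then zero or more ",ddd" groups, then an optional period followed by at least one digit. The names is_us_number and check_us_number are made up for illustration, and choices such as rejecting "1000" (no grouping) or "1." (bare period) are assumptions, not requirements stated in the thread.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not from the thread: true if `s` is 1-3 digits,
// then zero or more ",ddd" groups, then an optional ".d+" fraction.
bool is_us_number(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();
    auto is_digit = [&](std::size_t k) {
        return k < n && std::isdigit(static_cast<unsigned char>(s[k])) != 0;
    };

    // Leading group: one to three digits.
    std::size_t lead = 0;
    while (lead < 3 && is_digit(i)) { ++i; ++lead; }
    if (lead == 0) return false;

    // Thousands groups: a comma followed by exactly three digits.
    while (i < n && s[i] == ',') {
        if (!(is_digit(i + 1) && is_digit(i + 2) && is_digit(i + 3))) return false;
        i += 4;
    }

    // Optional fraction: a period followed by at least one digit.
    if (i < n && s[i] == '.') {
        ++i;
        if (!is_digit(i)) return false;
        while (is_digit(i)) ++i;
    }

    // Anything left over (a fourth digit, a second period, ...) is an error.
    return i == n;
}

// Example checks using the two strings from the question.
void check_us_number() {
    assert(is_us_number("1,000.00"));
    assert(!is_us_number("100.000,0000.2345"));
}
```

Each stage consumes what it recognises and the final comparison rejects leftovers, so a failure can be traced to one specific stage in a debugger.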

mbozzi (3914)

Nobody teaches this, but I sometimes find it easier to actually write out a grammar.

We might compose a grammar like this:

number -> one-to-three-digits , three-digit-group-seq . digit-seq
number -> one-to-three-digits , three-digit-group-seq
number -> one-to-three-digits . digit-seq
number -> one-to-three-digits
three-digit-group-seq -> three-digit-group
three-digit-group-seq -> three-digit-group , three-digit-group-seq

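This is a CF (context-free) grammar, not a regular one, but at least it's a precise definition of what you (might) want, and easier to work with than a regular expression. The language it describes is still regular, so a regular expression for it does exist.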

Hopefully the conversion to a regular expression is more doable from here.

\d\d?\d?             # first, a sequence of one-to-three digits
  (                  # then, 
    (,\d\d\d)+|      # either a comma followed by three-digit-groups; or
    (,\d\d\d)+\.\d+| # the same followed by a period and a sequence of digits; or
    \.\d+|           # just a period and a sequence of digits.
    $                # nothing (the end-of-string)
  )

Corresponding to
std::regex r1 { R"eos(\d\d?\d?((,\d\d\d)+|(,\d\d\d)+\.\d+|\.\d+|$))eos" };

Now we can simplify by using the regex engine's support for empty rules, the "optional" and "zero-or-more" quantifiers, ? and *:

std::regex r2 { R"eos(\d\d?\d?(,\d\d\d)*(\.\d*)?)eos" };

This program seems to pass the smoke test:

#include <regex>
#include <string>
#include <iostream>

int main()
{
  std::regex r1 { R"eos(\d\d?\d?((,\d\d\d)+|(,\d\d\d)+\.\d+|\.\d+|$))eos" };
  std::regex r2 { R"eos(\d\d?\d?(,\d\d\d)*(\.\d*)?)eos" };  // nearly equivalent; also accepts a trailing '.'

  for (std::string word; std::cin >> word; )
    std::cout << std::boolalpha << std::regex_match(word, r2) << '\n';
}
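
A small sketch for comparing the two patterns on fixed inputs rather than on typed words. The helper names matches_r1, matches_r2 and check_patterns are made up here. Under std::regex_match the patterns agree on the strings from the question, but r2 ends in (\.\d*)? and so also accepts a bare trailing period, which r1 does not.

```cpp
#include <cassert>
#include <regex>
#include <string>

// The two patterns from the post above, unchanged.
const std::regex r1 { R"eos(\d\d?\d?((,\d\d\d)+|(,\d\d\d)+\.\d+|\.\d+|$))eos" };
const std::regex r2 { R"eos(\d\d?\d?(,\d\d\d)*(\.\d*)?)eos" };

// Hypothetical helpers: whole-string match against each pattern.
bool matches_r1(const std::string& s) { return std::regex_match(s, r1); }
bool matches_r2(const std::string& s) { return std::regex_match(s, r2); }

// Example checks (assert aborts the program if one fails).
void check_patterns() {
    assert(matches_r1("1,000.00") && matches_r2("1,000.00"));
    assert(!matches_r1("100.000,0000.2345") && !matches_r2("100.000,0000.2345"));
    assert(!matches_r1("1.") && matches_r2("1."));  // a case where they differ
}
```

If a trailing period should be rejected, changing \d* to \d+ in r2 would be one way to do it; whether that is wanted depends on the data.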

jonnin (11350)

kbw wrote: "Personally, I'd write a parser, for me it'd be quicker (because I'm not good with regex'), and it'll run faster. Not an issue for a UI, but might be when validating input from streams."

This ^^ I don't know about performance and all, but reading lines 7 and 8 of the program above (the two regex definitions) in a debugging session is the stuff of nightmares if you are not good with them. It's fine when it's working.

closed account G21UpDi1 (73)

@mbozzi

Thank you.

It is going to be a couple of hours until I can put it to use, but I appreciate it.

I'm going to have to look up what CF grammar means.

My blind spot was that I wasn't treating (,\d{3}) as one atom.

closed account G21UpDi1 (73)

I'm a novice programmer, but I'm quite proud of how well this turned out, so I'm going to post my program hoping for criticism. Please don't feel obligated to look it over, but if you have the time, any criticism would be appreciated.

Thank you.

#include <iostream>
#include <algorithm>
#include <stdexcept>
#include <vector>
#include <fstream>
#include <sstream>
#include <regex>

std::vector<std::string> get_input();
std::vector<std::string> process_record(std::string r);
void print_record(std::vector<std::string> vs, std::fstream& ofh);

int main() try {

    auto vs = get_input();

    std::fstream ifh; 
    ifh.open(vs[0], std::ios::in); 
    if (!ifh) 
        throw std::runtime_error("File not opened: " + vs[0]);

    std::fstream ofh; 
    ofh.open(vs[1], std::ios::out); 
    if (!ofh) 
        throw std::runtime_error("File not opened: " + vs[1]); 

    for(std::string str; std::getline(ifh,str);) { 

        AGAIN:
        size_t n = std::count(str.begin(), str.end(), '"'); //ensuring all double quotes are closed
        if ( (n % 2)) { 
                //If they're not closed I'm grabbing the next record and replacing
                //the return character with a literal \n.
                std::string next_line;
                std::getline(ifh,next_line);
                str += R"(\n)";
                str += next_line;
                goto AGAIN;
        }

        auto record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , std::regex(R"(^ +| +$|( ) +)"), "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            std::regex r { R"(\$?\d\d?\d?(,\d{3})*(\.\d*)?)" };
            if (std::regex_match(s, sm, r)) { 
                s = std::regex_replace (s, std::regex(R"(,)"), ""); //remove commas in numbers
                s = std::regex_replace (s, std::regex(R"(\$)"), ""); //remove dollar sign
            }

            s = std::regex_replace (s, std::regex(R"(,)"), ";"); //replacing commas with semi-colons
            
        }

        print_record(record, ofh);

    }

    return 0;
}
catch (std::exception& e) {
    std::cerr << e.what() << '\n';
    return 1;
}
catch (...) {
    std::cerr << "uncaught" << '\n';
    return 1;
}

void print_record(std::vector<std::string> vs, std::fstream& ofh) {
    std::string str;
    for(auto s : vs) {
        str += s;
        str += ',';
    }

    str = std::regex_replace (str, std::regex(R"(,+$)"), ""); //removing trailing commas
    ofh << str << std::endl;
}

std::vector<std::string> process_record(std::string r) {
    std::stringstream ss{r};
    std::vector<std::string> rv;
    char ch;

    std::string field{};
    while (ss.get(ch)) { 

        if(ch == '"') { 
            while (ss.get(ch)) {

                if (ch != '"') {
                    field += ch; 
                }
                else { 
                    ss.get(ch); 
                    break;
                }
            }
        }

        if(ch == ',') {
            rv.push_back(field);
            field.clear();
            continue;
        }
        field += ch;
        
    }

    return rv;
}

std::vector<std::string> get_input() {
    std::vector<std::string> return_vstr;
    std::fstream ifh;

    ifh.open(R"(input_info.xml)", std::ios::in);
	if (!ifh) 
        throw std::runtime_error("File not opened--check to see if input_info.xml is correct");

    std::string info;
    char ch;

    while (ifh.get(ch)) { 
        if (ch == '\n') ch = ' '; 
        info.push_back(ch); 
    } 

    std::smatch sm;  
    std::regex in(R"(.*?<input1>(.*?)</input1>.*)");
    std::regex out(R"(.*?<output1>(.*?)</output1>.*)");

    std::regex_match (info, sm, in); 
    return_vstr.push_back(sm[1]);

    std::regex_match (info, sm, out); 
    return_vstr.push_back(sm[1]);

    return return_vstr;

}

Here is how I give it the inputs. I find it easier than using the command line.

<root>
    <input1>c:\tmp.csv</input1>
    <output1>c:\clean.csv</output1>
</root>

jonnin (11350)

process_record should take a const reference rather than a copy. Anything larger than a register should be passed by reference, so int/double/float/char etc. are fine to copy, but objects like a string should be passed by reference. It's a microscopic performance hit once, but when you call a function in a tight loop, it can add up fast.

goto in C++ is powerful and nice, but it is still frowned upon as 'do not use it' by most coders. The rule of thumb: if you can do it without goto, do not use goto. There are one or two things that are hard to do cleanly without it; the usual cited example is to 'break' out of nested loops.
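
One way the AGAIN/goto retry could be written as a plain loop, as a sketch only. The names read_logical_record and check_joining are made up. The sketch also stops when the file ends while a quote is still open; in the posted program a failed getline there would keep appending and jumping back forever.

```cpp
#include <algorithm>
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one CSV record that may span several physical
// lines. Lines are joined with a literal backslash-n, as in the program above.
// Returns false when there is nothing left to read.
bool read_logical_record(std::istream& in, std::string& record) {
    if (!std::getline(in, record)) return false;

    // An odd number of double quotes means a quoted field is still open.
    while (std::count(record.begin(), record.end(), '"') % 2 != 0) {
        std::string next_line;
        if (!std::getline(in, next_line)) break;  // EOF: keep what we have
        record += R"(\n)";
        record += next_line;
    }
    return true;
}

// Example check on an in-memory stream instead of a file.
void check_joining() {
    std::istringstream in("a,\"b\nc\",d\ne,f\n");
    std::string rec;
    assert(read_logical_record(in, rec) && rec == "a,\"b\\nc\",d");
    assert(read_logical_record(in, rec) && rec == "e,f");
    assert(!read_logical_record(in, rec));
}
```

The main loop would then read: for (std::string str; read_logical_record(ifh, str); ) { ... }.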

It seems like line 91 (if(ch == '"')) should use \" ? If it works, maybe it's OK. I'm not sure whether the compiler is being nice to you or whether that is fine as is; honestly, I always use \"

Style (minor): 'main at the bottom' lets you drop the prototypes, which are clutter when writing a one-file program.

Not sure you need regex for taking out a comma. There is probably a better way, like find_last_of or some sort of reverse iteration from the back of the string, then remove. And for sure a plain replace (comma to ;) would be better than the regex. Look at what strings can do: find, replace, substring, etc. can handle some of this regex work and are faster, more readable, and generally a better approach unless your purpose is to practice regex.
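
A sketch of what the string-based versions might look like. All function names here are made up. Note that trim_spaces only strips the ends; the regex in the posted program also collapses runs of interior spaces, which this does not attempt.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: strips leading and trailing spaces only.
std::string trim_spaces(const std::string& s) {
    const auto first = s.find_first_not_of(' ');
    if (first == std::string::npos) return "";  // empty or all spaces
    const auto last = s.find_last_not_of(' ');
    return s.substr(first, last - first + 1);
}

// Hypothetical helper: replaces every comma with a semicolon, in place.
void commas_to_semicolons(std::string& s) {
    std::replace(s.begin(), s.end(), ',', ';');
}

// Hypothetical helper: drops any run of commas at the end of the string.
// find_last_not_of returns npos when there is nothing but commas, and
// npos + 1 wraps to 0, so the whole string is erased in that case.
void strip_trailing_commas(std::string& s) {
    s.erase(s.find_last_not_of(',') + 1);
}

// Example checks.
void check_string_helpers() {
    assert(trim_spaces("  a b  ") == "a b");
    std::string t = "x,y,z";
    commas_to_semicolons(t);
    assert(t == "x;y;z");
    std::string u = "a,b,,,";
    strip_trailing_commas(u);
    assert(u == "a,b");
}
```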

Consider saving some of these reusable regex expressions by documenting exactly what each one does and putting it into a function wrapper with an appropriate name. Even an expert does not want to have to rewrite the same ones over and over, and the ones you have look handy.
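
For instance, the number test might be wrapped like this. The names looks_like_us_amount and check_amounts are made up; the pattern is the one from the posted program, compiled once on first use.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical wrapper: true if `field` looks like a US-formatted amount,
// e.g. "$1,234.50": optional '$', 1-3 digits, ",ddd" groups, optional fraction.
bool looks_like_us_amount(const std::string& field) {
    // Built once on first call; same pattern as in the program above.
    static const std::regex pattern { R"(\$?\d\d?\d?(,\d{3})*(\.\d*)?)" };
    return std::regex_match(field, pattern);
}

// Example checks.
void check_amounts() {
    assert(looks_like_us_amount("$1,234.50"));
    assert(looks_like_us_amount("1,000.00"));
    assert(!looks_like_us_amount("100.000,0000.2345"));
}
```

The call site then documents itself: if (looks_like_us_amount(s)) { ... }.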

You can be proud of this. It probably has some minor things I didn't see in the drive-by, but apart from that rather disturbing use of goto, it is very well done.

closed account G21UpDi1 (73)

Updated program using suggestions.

#include <iostream>
#include <algorithm>
#include <stdexcept>
#include <vector>
#include <fstream>
#include <sstream>
#include <regex>

std::vector<std::string> get_input();
std::vector<std::string> process_record(std::string& r);
void print_record(std::vector<std::string> vs, std::fstream& ofh);

int main() try {

    auto vs = get_input();

    std::fstream ifh; 
    ifh.open(vs[0], std::ios::in); 
    if (!ifh) 
        throw std::runtime_error("File not opened: " + vs[0]);

    std::fstream ofh; 
    ofh.open(vs[1], std::ios::out); 
    if (!ofh) 
        throw std::runtime_error("File not opened: " + vs[1]); 

    //precompiling regex
    std::regex leading_trailing_spaces(R"(^ +| +$|( ) +)");
    std::regex looks_like_number { R"(\$?\d\d?\d?(,\d{3})*(\.\d*)?)" };

    for(std::string str; std::getline(ifh,str);) { 

        for (size_t n{}; (n = std::count(str.begin(), str.end(), '"') % 2);){ //ensuring all double quotes are closed
            //If they're not closed I'm grabbing the next record and replacing
            //the return character with a literal \n.
            std::string next_line;

            std::getline(ifh, next_line);
            str += R"(\n)" + next_line;
        }

        auto record = process_record(str); //putting the individual fields in a vector.

        for(auto& s : record) { 
            s = std::regex_replace(s , leading_trailing_spaces, "$1"); //remove leading and trailing spaces
            
            std::smatch sm;
            if (std::regex_match(s, sm, looks_like_number)) { 
                std::remove(s.begin(), s.end(), ','); //remove commas in numbers
                std::remove(s.begin(), s.end(), '$'); //remove dollar sign
            }

            std::replace(s.begin(), s.end(), ',', ';'); //replacing commas with semi-colons
            
        }

        print_record(record, ofh);

    }

    return 0;
}
catch (std::exception& e) {
    std::cerr << e.what() << '\n';
    return 1;
}
catch (...) {
    std::cerr << "uncaught" << '\n';
    return 1;
}

void print_record(std::vector<std::string> vs, std::fstream& ofh) {
    std::string str;
    for(auto s : vs) {
        str += s;
        str += ',';
    }

    str = std::regex_replace (str, std::regex(R"(,+$)"), ""); //removing trailing commas
    ofh << str << std::endl;
}

std::vector<std::string> process_record(std::string& r) {
    std::stringstream ss{r};
    std::vector<std::string> rv;
    char ch;

    std::string field{};
    while (ss.get(ch)) { 

        if(ch == '"') { 
            while (ss.get(ch)) {

                if (ch != '"') {
                    field += ch; 
                }
                else { 
                    ss.get(ch); 
                    break;
                }
            }
        }

        if(ch == ',') {
            rv.push_back(field);
            field.clear();
            continue;
        }
        field += ch;
        
    }

    return rv;
}

std::vector<std::string> get_input() {
    std::vector<std::string> return_vstr;
    std::fstream ifh;

    ifh.open(R"(input_info.xml)", std::ios::in);
	if (!ifh) 
        throw std::runtime_error("File not opened--check to see if input_info.xml is correct");

    std::string info;
    char ch;

    while (ifh.get(ch)) { 
        if (ch == '\n') ch = ' '; 
        info.push_back(ch); 
    } 

    std::smatch sm;  
    std::regex in(R"(.*?<input1>(.*?)</input1>.*)");
    std::regex out(R"(.*?<output1>(.*?)</output1>.*)");

    std::regex_match (info, sm, in); 
    return_vstr.push_back(sm[1]);

    std::regex_match (info, sm, out); 
    return_vstr.push_back(sm[1]);

    return return_vstr;

}

Warnings I get:

Consolidate compiler generated dependencies of target process_comma_delimited
[ 50%] Building CXX object CMakeFiles/process_comma_delimited.dir/main.cpp.obj
cl : Command line warning D9025 : overriding '/W3' with '/W4'
main.cpp
C:\main.cpp(49): warning C4834: discarding return value of function with 'nodiscard' attribute
C:\main.cpp(50): warning C4834: discarding return value of function with 'nodiscard' attribute
C:\main.cpp(33) : warning C4706: assignment within conditional expression
[100%] Linking CXX executable process_comma_delimited.exe
[100%] Built target process_comma_delimited

I would like to get rid of these warnings, but I'm not quite sure how I can.

jonnin (11350)

Line 3 of that output (D9025) is fine: it upgraded your warning level. Something in your build asked for both level 3 and level 4, so it used 4 and told you so. This isn't code; it's your build settings.

Lines 49 and 50: std::remove returns an iterator that you don't really need. You can absorb it into a junk variable, but that moves the warning from what you have now to "blah blah assigned a value that was never used". Using it in a bogus statement gets rid of that one. This is a lot of hoops just to clear a dumb warning, though: you don't need the value. This may be an effect of level 4 (/W4) warnings, which are *excessive* and produce a few warnings that simply are not worth 'fixing'. Maybe look at forcing the build to use level 3 (the standard level)?

Line 33: did you mean n == ? It's OK and valid to assign in a condition, but it is a very common typo, so the compiler flags it. If you meant to do it, proceed. The correct thing to do when it is intentional is to put a comment saying so. You can also move the assignment out of the condition and change the condition to silence it. It may be a little cleaner to do that: purists will say to remove all level 3 warnings even if that means bloating the code to avoid one, and many would say NOT to assign in a condition just because it's weird and less readable/harder to follow. I am indifferent: it's legal, as long as you understand what it does (the condition is true if the value assigned is not zero).
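
One way to take the assignment out of the condition is to give the test a name, sketched below. has_unclosed_quote and check_quotes are made-up names; n was only ever used as a truth value in that loop, so nothing is lost by dropping it.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: true if `s` contains an odd number of double quotes,
// i.e. a quoted field is still open.
bool has_unclosed_quote(const std::string& s) {
    return std::count(s.begin(), s.end(), '"') % 2 != 0;
}

// Example checks.
void check_quotes() {
    assert(has_unclosed_quote("a,\"b"));
    assert(!has_unclosed_quote("a,\"b\",c"));
    assert(!has_unclosed_quote("no quotes"));
}
```

The loop header on line 33 would then become while (has_unclosed_quote(str)) { ... }, with no assignment for the compiler to question.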

thmm (703)

On line 49 & 50 you need to cast it to void.

static_cast<void>(std::remove(s.begin(), s.end(), ',')); //remove commas in numbers

jonnin (11350)

Nice, I did not know that one. This is c++ at its finest... you can use advanced techniques to define a function that will complain if it does not have a giant, ugly cast added to it :) If that isn't progress, I don't know what is.

kidding aside, OP, do you know about https://en.wikipedia.org/wiki/Erase%E2%80%93remove_idiom
v.erase(std::remove(v.begin(), v.end(), blah), v.end());

They probably made it grumble to ensure you remembered to do the above when appropriate.

seeplus (6479)

Lines 49-50: you do need to use the return value; that's why std::remove is now marked nodiscard. std::remove() doesn't actually remove anything, as it can't change the size of the container. It just moves the kept elements toward the front and returns an iterator to the new effective end. You then need to call erase() to remove what is left after that point.

Consider:

#include <iostream>
#include <string>
#include <algorithm>

int main() {
	std::string s {"qwe,kj$h"};

	const auto e1 {std::remove(s.begin(), s.end(), ',')};	//remove commas in numbers
	s.erase(std::remove(s.begin(), e1, '$'), s.end());		//remove dollar sign

	std::cout << s << '\n';

}

which displays as expected:


qwekjh

However:

#include <iostream>
#include <string>
#include <algorithm>

int main() {
	std::string s {"qwe,kj$h"};

	static_cast<void>(std::remove(s.begin(), s.end(), ','));	//remove commas in number
	static_cast<void>(std::remove(s.begin(), s.end(), '$'));	//remove dollar sign

	std::cout << s << '\n';

}

displays:


qwekjhhh

which isn't what is required.
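
Applied to lines 49 and 50 of the updated program, the two removals could also be folded into a single pass with std::remove_if, as in this sketch. The names strip_commas_and_dollars and check_strip are made up.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: removes every ',' and '$' from `s` in a single pass.
void strip_commas_and_dollars(std::string& s) {
    // remove_if moves the kept characters to the front and returns the
    // new logical end; erase then drops the leftover tail.
    const auto new_end = std::remove_if(s.begin(), s.end(),
        [](char c) { return c == ',' || c == '$'; });
    s.erase(new_end, s.end());
}

// Example check.
void check_strip() {
    std::string s {"$1,234,567.89"};
    strip_commas_and_dollars(s);
    assert(s == "1234567.89");
}
```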

closed account G21UpDi1 (73)

Thank you. I almost didn't post the warnings, but now I'm glad I did.
