Fastest way to split a string

Hey,

what is the fastest way to split a string with a delimiter into a vector/array with C/C++ and STL? The only limitation is that I don't want to use boost.
I already tried different approaches - including strtok and stringstream - but it's always terrible slow compared to Java String.split().

Greetings!
Last edited on
closed account (o3hC5Di1)
Hi there,

I don't know if it's the fastest way - but using std::string::find_first_of gives you a complexity of O(n) with minimal memory overhead if you make use of c++11's move semantics:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
#include <iostream>
#include <vector>
#include <iterator>
#include <algorithm>

std::vector<std::string> string_split(std::string s, const char delimiter)
{
    size_t start=0;
    size_t end=s.find_first_of(delimiter);
    
    std::vector<std::string> output;
    
    while (end <= std::string::npos)
    {
	    output.emplace_back(s.substr(start, end-start));

	    if (end == std::string::npos)
	    	break;

    	start=end+1;
    	end = s.find_first_of(delimiter, start);
    }
    
    return output;
}

int main()
{
    std::string s = "test_a,test_b,test_c,test_d";
    std::vector<std::string> container = string_split(s, ',');
    
    std::ostream_iterator<std::string> ostr_it(std::cout, "\n");
    std::copy(container.cbegin(), container.cend(), ostr_it);
}


Note that I haven't benchmarked it, I thought I'd leave comparison up to you since it would make more sense to compare to your existing results. Don't forget to turn on your compiler's optimizations too.

Perhaps one of the experts around here could give a better, perhaps more low level approach that would be faster.

All the best,
NwN
Yours benchmarks faster than mine by about 117%.

(Here I load a 10,715 character, 354-line source code file and split it iterations times by '\n'.)

Compiled with g++ -O2
iterations? 100000
test1: duthomhas::split()
354 lines
0:00:18.52
test2: NwN's string_split() with const ref argument
354 lines
0:00:15.92


Compiled with g++ -O3
iterations? 100000
test1: duthomhas::split()
354 lines
0:00:18.44
test2: NwN's string_split() with const ref argument
354 lines
0:00:15.81

I changed your argument to const std::string& s though.

[edit] I also benchmarked Boost's split(). It was faster than both of ours.

test3: boost::split()
354 lines
0:00:12.60


[edit 2] FAQ: http://www.cplusplus.com/faq/sequences/strings/split/
Last edited on
First, let me thank you for your answers!

I tried NwN's Method and it seems a little faster, though it still takes about 30 seconds to split a string with 3.2 Million chars.
I tried the same in Java, and figured out that loading the string from a textfile takes much longer than in c++, but splitting is done in, well, less than 1/10 of a second. The string has to be represented in a special way when it's loaded, right? I mean, all the splitting algorithms I tried don't even come close to this performance.

I don't have the time to give you exact benchmarks right now, but I'll do them when I get home today and let you know.

Greetings!
IIRC, Java represents strings using ropes, which makes a difference.
Also IIRC, there was talk about doing that to C++? (Or was it Tcl? I don't recall.)
closed account (o3hC5Di1)
Thanks very much for your link to that faq-entry Duoas, very informative and well written as always.
I really should make some time to read through all of the articles in there.

Would you please be so kind as to explain how ropes could help make this faster?
For the record, I just learnt about them by you mentioning them, but wikipedia compares them to array-based strings: http://en.wikipedia.org/wiki/Rope_(data_structure)#Comparison_with_monolithic_arrays

From other articles, it seems like ropes only become beneficial when used on large strings, so it seems (like usual) there is no "one fastest way", it depends on the size of string you're intending to split.

I tried to have a look at the boost::split code:
http://boost.cvs.sourceforge.net/viewvc/boost/boost/boost/algorithm/string/split.hpp?view=markup
http://boost.cvs.sourceforge.net/viewvc/boost/boost/boost/algorithm/string/iter_find.hpp?view=markup
http://boost.cvs.sourceforge.net/viewvc/boost/boost/boost/algorithm/string/finder.hpp?view=markup

But I fear I'm not familiar enough with boost / c++ to really grasp why it is (that much) faster.

All the best,
NwN
Last edited on
+1 on the tip with the rope! This really makes
sense now. What a pitty that there is no STL rope.
Would be a really great feature in C++ because the performance increase is really noticeable on large strings (see my last post above).
closed account (o3hC5Di1)
Hi there,

Registred wrote:
What a pitty that there is no STL rope.


Actually - there is (but beware that it is not in the standard, so it's probably not portable):

http://www.sgi.com/tech/stl/Rope.html
http://www.sgi.com/tech/stl/ropeimpl.html‎

http://stackoverflow.com/questions/7811572/sgi-stl-rope-in-g

All the best,
NwN
Last edited on
Topic archived. No new replies allowed.