Splitting strings

Hi,

I have a string stored in a character array, like:
char test[256] = "This is a test";

I now want to split this, so I have an array with 4 words:
this
is
a
test

I've Googled a bit and found out this function should work for this:
http://www.cplusplus.com/reference/clibrary/cstring/strtok/

However, I don't understand it. How does this work, as the result is stored in a char variable? How does the function know where a word ends (I understand the use of delimiters, but it stores the result in a char variable)?

Also, what is this part of the example doing:
while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }


One more thing... when I use the function myself, Visual Studio gives me the warning "warning C4996: 'strtok': This function or variable may be unsafe. Consider using strtok_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details.".

Shouldn't I use this function or can I ignore this warning?

Thank you
Return Value

A pointer to the last token found in string.
A null pointer is returned if there are no tokens left to retrieve.

There's a big difference between char and char*.

First of all, strtok does not store the result in a char variable. pch is a pointer to char (char*), and it points at the first character of the token inside your original array.

The part of the example you posted prints the current token, asks strtok for the next one, and repeats until there are none left:
while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }



strtok_s is for wide characters, which are Unicode; char is for ASCII. It's OK to ignore the warning.


I strongly recommend reading the tutorials on this website to learn more about cstring, arrays and pointers.
Hey,

I still don't understand it (the pointer part). I just don't understand WHAT is stored in pch. I know it stores a pointer to the last token, but WHAT is the token and how/where is it saved? For example, why can you just print pch without dereferencing?

I tried several combinations, but I just can't get it to work because I don't really understand how the pch pointer works.

Thank you
String streams can parse out words delimited by white space. In the following example, each word is extracted and stored as a string in a vector. Note that istringstream::operator>> can extract the words into a char array if you prefer, although that isn't what I would recommend. The vector, rather than an array, handles memory for you.

This is a C++ approach rather than using C functions.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;

int main( int argc, char * args[] )
{
    // original data
    char test[] = "This is a test";
    
    // string stream to parse out words delimited by white space
    istringstream ss( test );
    
    // new container of words
    vector<string> words;
    
    // parse out words using string stream and store words in container
    string word;
    while( ss >> word )
    {
        words.push_back( word );
    }

    // dump all words to stdout, delimited by spaces
    copy( words.begin(), words.end(), ostream_iterator<string>( cout, " " ) );

    return 0;
}
Thank you for the reply, moorecm. But I would like to finish this with the strtok function for now, I'll look at your method later =)
I thought you didn't know pointers. Anyway, here's the explanation.

/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  printf ("Splitting string \"%s\" into tokens:\n",str);
  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    printf ("%s\n",pch);
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}


What the function does is return a pointer to the start of the next token. So on the first call, in this case, the pointer points to the char 'T', and the function marks the end of the token with a null character, '\0'. So the ',' is replaced with a '\0', and the string actually looks like this at this point:
"- This\0 a sample string."



When printing a char array or char pointer, printf and streams print each character until a '\0' is found, so printf will only print "This".

On the next call the string is actually
"- This\0 a\0sample string."
but this time the function expects to receive a null pointer, which tells it to continue with the string you were already working on.

Hope this helps, and go easy on my English.
Thank you for the explanation.

But the more I read about this, the more I get confused...

pch = strtok (NULL, " ,.-");
How does the compiler know it should continue working on the string here, as you don't pass it as a parameter anymore? Also, where is the end of the token marked (I mean, where is the end stored)?


I've messed a bit with this code:
char test[] = "Dit is een test";
	char* pointer = test;

	cout << pointer << endl;

This actually works, but why? When I dereference pointer (cout << *pointer), it displays "D", but why?

char test[] = "Dit is een test";
	char* pointer = &test;

Why doesn't this work?

I'm still getting confused with pointers. I did read the documentation about it here and on other websites/tutorials multiple times, but I still get stuck with them. If anyone has any good exercises or something, I would really appreciate them =)

Thank you
That is because when you assign a char array to a pointer, the pointer actually points to index 0 of the array.

Also note that when you declare a char array like this
char test[] = "Dit is een test";
a null character is automatically added at the end, so you actually declared
"Dit is een test\0", and the length of this is 16.

-hope that helps, i'll post again tomorrow it's 3am here. bye
Hi,

I now have this:
char test[] = "Dit is een test";
	char * pch;


	pch = strtok (test," ");
	pch = strtok (NULL," ");


	cout << pch << endl;
	cout << test << endl;


I really don't understand the output:
is
Dit

First of all, how does the compiler know it's working with the char array on this line:
pch = strtok (NULL," ");
Why don't you need to specify the variable test anymore?

I understand the output of pch, but not the output of test. It outputs "Dit", but why? (I expected it would also output "is".)

Also, when I use the while loop, like in the example, and try to output pch afterwards, my program crashes. Why?


My goal is to display the LAST word in the string. So it should display "test", that is what I'm trying to do here.

Thanks

This actually works, but why?


- 'test' is an array name.
- an array name without its brackets is a pointer to the array (technically this isn't true, but for now just think of it that way)

Therefore:

pointer = test;  // this makes 'pointer' point to the 'test' array

// that is the same as this:

pointer = &test[0];  // point to the first character in the 'test' array 



When I dereference pointer (cout << *pointer), it displays "D", but why?


- pointer points to the first character of the 'test' string.
- The dereference operator (*) is an alternative way to write the bracket operator ([]) with index 0

cout << *pointer;

// is the same as

cout << pointer[0];


Why doesn't this work?


- an array name without brackets is a pointer to the array (again, technically not true, but bear with me)
- The & operator gets the address of (a pointer to) a variable

ie:
pointer = test;  // works
pointer = &test; // doesn't work because 'test' is already a
                   // pointer, so you're getting a pointer to a pointer
                   //  (note:  technically not true, again -- I'm trying to keep this simple) 
strtok_s is for wide characters, which are Unicode; char is for ASCII. It's OK to ignore the warning.
strtok_s() has nothing to do with Unicode. It takes an extra context argument that holds the position between calls, instead of relying on hidden internal state, so it can be used on several strings at once and from several threads.

First of all, how does the compiler know it's working with the char array on this line:
strtok() maintains internal state that persists between calls.
Try this, for example:
int f(int a){
    static int b;
    if (a>10)
        return b=a;
    return b+a;
}


it outputs "dit", but why?
strtok() modifies the char array to which its first parameter points. Specifically, it writes a '\0' after each token found.

Also, when I use the while loop, like in the example, and try to output pch afterwards, my program crashes. Why?
Because pch is NULL at that point, as guaranteed by the while condition. Dereferencing a null pointer is undefined behavior, and in practice it usually crashes.

My goal is to display the LAST word in the string.
You have to keep the last pointer obtained from strtok() in a different pointer, so that when strtok() finally returns zero, the pointer doesn't get overwritten.
Thank you Helios and Disch, I'm starting to understand it.

So how can I store the last pointer obtained in a different pointer? I'm thinking about putting it in the while, just before the function is called. The problem is that I'm stuck again when I want the SECOND last word that way.

Like this:
/* strtok example */
#include <stdio.h>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  char* pointer;

  pch = strtok (str," ,.-");
  while (pch != NULL)
  {
    pointer = pch; // I'm pretty sure I'm doing it wrong again...
    pch = strtok (NULL, " ,.-");
  }
  return 0;
}


Also, is it possible to store the contents of pch (or the new pointer variable, whatever) in a new char array?

Thanks
Bump =)
With regard to the first question - that of finding the second last string - that is fairly straightforward:
/* strtok example */
#include <iostream>
#include <string.h>

int main ()
{
  char str[] ="- This, a sample string.";
  char * pch;
  const char *pointer, *secondLast;   // const: they may point at a string literal
  const char *delimiters =" ,.-";

  // Start with empty strings in case str is empty or consists only of delimiters.
  secondLast = pointer = "";
  pch = strtok (str, delimiters);
  while (pch != NULL)
  {
    secondLast = pointer;  // the previous token becomes the second last
    pointer = pch;         // the current token is the last one seen so far
    std::cout << pointer << std::endl;
    pch = strtok (NULL, delimiters);
  }

  std::cout << "second last: " << secondLast << std::endl;
  return 0;
}


With regard to the second part, saving the strings or the pointers: that could be a bit messy, given that the string to be tokenized can be of arbitrary length and so can produce an arbitrary number of tokens (anywhere from 0 to about stringlength/2, depending on where the delimiters fall).
Thank you, that was just something stupid I didn't come up with...

So my current code is this:
#include <iostream>
#include <cstring>  // for strtok

using namespace std;

int main() 
{


	char test[] = "Dit is een test ";
	char* pch;
	char* pointer;
	char result[256];

	pch = strtok (test," ");

	while (pch != NULL)
    {
		pointer = pch;
		pch = strtok (NULL, " ");
    }

	cout << pointer << endl;
	
	return 0;

}


It now outputs what I want ("test"), but I now want to store it in the result variable, just like the original string.

Is this possible?

Thank you
To answer one of your original questions, I recommend against using strtok() unless you really, really have to. It is important to understand strtok() and it sounds like you are getting there, but there is a better option.

The Boost String Algorithms library has this functionality and it is much easier to use.

http://www.boost.org/doc/libs/1_41_0/doc/html/string_algo/usage.html#id1701774

#include <iostream>
#include <vector>
#include <string>

#include <boost/algorithm/string.hpp>

typedef std::vector<std::string> split_result_t;

int main()
{
    const std::string test = "this is a test";
    
    split_result_t result;
    boost::split(result, test, boost::is_any_of(" "));
    
     for (split_result_t::const_iterator it = result.begin();
        it != result.end();
        ++it)
    {
        std::cout << *it << std::endl;
    }

    return 0;
}

Hi,

Thank you for the reply.

So what's wrong with strtok()?

I'll look at "Boost String Algorithms library" tomorrow, thanks =)
Well, dealing with character arrays, pointers and pointer arrays is, as you are experiencing, pretty painful. The advantage of using C++ over C is that it has these nice data types, containers and algorithms to make programming a little more enjoyable.

The reason that Microsoft warns of the use of strtok() is that it is easy to use in an unsafe manner. The behavior of strtok() is undefined when the first call is passed a NULL pointer. It is also not thread safe -- you have to remember to use the thread-safe version.
Hey,

I've looked at the code and I think I understand it (didn't test it yet), but the only thing confusing me is why a pointer is used on this line:
std::cout << *it << std::endl;

Why?

Thanks =)