Reading in a passage word by word from a text file

Hello,
I have a question about how to read a text file word by word. I know that reading tokens separated by a space or newline character is simple, but how can I get rid of the punctuation so that I can store and process each word as I want? For example, with this sentence:

Hello, this is a sentence.

I'd like to read it in and get Hello, this, is, a and sentence separately. Hello and sentence shouldn't have the comma or full stop attached to them.
One thing I have tried is to process each token. For example, when reading in "Hello," I work inwards from the beginning and the end of the token and remove anything that is not a letter or a digit.
But that doesn't work with something like this:

Hello(why are we greeting again), Brad-who is my cousin-is sleeping.

So how can I deal with all these situations and read in word by word from a text file efficiently?
Thank you for your help
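
For reference, the trimming approach described in the question can be sketched roughly as follows. The helper name `trim_token` is made up for illustration and does not come from the thread. It only strips characters from the two ends of a token, which is exactly why punctuation in the middle (as in "Hello(why") survives.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: strip leading and trailing characters that are
// neither letters nor digits. Characters inside the token are untouched.
std::string trim_token(const std::string& token)
{
    std::size_t first = 0;
    std::size_t last = token.size();

    // Advance past non-alphanumeric characters at the front.
    while (first < last &&
           !std::isalnum(static_cast<unsigned char>(token[first])))
        ++first;

    // Step back over non-alphanumeric characters at the end.
    while (last > first &&
           !std::isalnum(static_cast<unsigned char>(token[last - 1])))
        --last;

    return token.substr(first, last - first);
}
```

Calling `trim_token` on each token read with `in >> token` handles "Hello," and "sentence." but leaves "Hello(why" as a single token.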


tipaye (535)

Try strtok()

http://www.cplusplus.com/reference/cstring/strtok/
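
A minimal sketch of what the strtok() suggestion might look like in practice. The function name `split_with_strtok` and the delimiter list are assumptions chosen for this example, so adjust the delimiters to the punctuation you expect. Note that std::strtok modifies the buffer it scans and keeps internal state between calls, so it works on a copy here and is not suitable for nested or multi-threaded tokenising.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: split one line into words using std::strtok.
// The line is taken by value because strtok overwrites delimiters with '\0'.
std::vector<std::string> split_with_strtok(std::string line)
{
    // Assumed delimiter set: whitespace plus common punctuation.
    const char* delims = " \t\n.,;:!?()-\"";

    std::vector<std::string> words;
    for (char* tok = std::strtok(&line[0], delims);
         tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        words.push_back(tok);
    }
    return words;
}
```

With this delimiter set, the second example sentence from the question is split at the parentheses and hyphens as well as at the spaces.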

MiiNiPaa (8886)

Here is a way to make a stream treat punctuation as whitespace:
http://www.cplusplus.com/forum/general/79412/
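
One common way to do this, which may or may not match the linked thread exactly, is to imbue the stream with a locale whose ctype facet classifies punctuation as whitespace. The names `punct_as_space` and `read_words` below are invented for this sketch. It assumes C++11 and plain `char` streams.

```cpp
#include <cstddef>
#include <istream>
#include <locale>
#include <sstream>
#include <string>
#include <vector>

// A ctype facet whose classification table marks every punctuation
// character as whitespace, so operator>> skips over it.
struct punct_as_space : std::ctype<char>
{
    punct_as_space() : std::ctype<char>(make_table()) {}

    static const mask* make_table()
    {
        // Copy the classic "C" table once, then patch the punctuation entries.
        static std::vector<mask> table(classic_table(),
                                       classic_table() + table_size);
        for (std::size_t c = 0; c < table_size; ++c)
            if (table[c] & punct)
                table[c] = space;
        return table.data();
    }
};

// Hypothetical helper: read every word from a stream using the facet above.
std::vector<std::string> read_words(std::istream& in)
{
    // The locale takes ownership of the facet and deletes it when done.
    in.imbue(std::locale(in.getloc(), new punct_as_space));

    std::vector<std::string> words;
    for (std::string w; in >> w; )
        words.push_back(w);
    return words;
}
```

Passing an opened `std::ifstream` to `read_words` reads the whole file word by word. Because hyphens and apostrophes are also punctuation, "hard-working" comes out as two words and "I'd" as "I" and "d".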

Bubiche (24)

Thank you guys, you have helped me take a huge step towards achieving my goal.
There's still a small issue for me though. Consider this sentence:

Andy-a hard-working student-is absent today.

Using your suggestions, I can separate this sentence word by word into: Andy, a, hard, working, student, is, absent, today. But how can I keep "hard-working" as a whole word while "Andy" and "a" are still separated? And are there any similar cases for other punctuation marks?

Chervil (7320)

Outside of C++, I'd consider this sentence to be incorrect:

Andy-a hard-working student-is absent today.

I would expect it to be written as (note the additional spaces):

Andy - a hard-working student - is absent today.

To some extent, if you treat "student-is" as two separate words, that is making a correction or change to what was actually written - a bit like the way Google search will often return results for something other than what was actually entered.

An earlier example had similar errors:

Hello(why are we greeting again),

should be

Hello (why are we greeting again),

again with an additional space.

So, that's something to think about - should the program be making these kinds of alterations to what was written? If that is not regarded as a problem, then other than some sort of dictionary listing common hyphenated words, such as "hard-working", it could be difficult to devise simple rules to distinguish between a valid and invalid hyphenated word.
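
The dictionary idea mentioned above could be sketched as follows. The function name `split_keep_known_hyphens` and the contents of the word set are assumptions for illustration. A token containing hyphens is kept whole only if it appears in the set, otherwise it is split at every hyphen. The lookup is case-sensitive and a run such as "a-hard-working" would still be split completely, so this is a starting point rather than a full solution.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: split text into words, keeping a hyphenated token
// whole only when it is listed in the caller-supplied set of known words.
std::vector<std::string> split_keep_known_hyphens(
    const std::string& text, const std::set<std::string>& known)
{
    std::vector<std::string> words;
    std::string token;

    // Emit the current token, either whole or split at its hyphens.
    auto flush = [&]() {
        if (token.empty()) return;
        if (known.count(token)) {
            words.push_back(token);
        } else {
            std::string part;
            for (char c : token) {
                if (c == '-') {
                    if (!part.empty()) words.push_back(part);
                    part.clear();
                } else {
                    part += c;
                }
            }
            if (!part.empty()) words.push_back(part);
        }
        token.clear();
    };

    // Letters, digits and hyphens build a token; anything else ends it.
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c)) || c == '-')
            token += c;
        else
            flush();
    }
    flush();
    return words;
}
```

With a set containing only "hard-working", the sentence about Andy yields Andy, a, hard-working, student, is, absent, today.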

And so far we haven't discussed the use of the apostrophe versus the single quote :)

Bubiche (24)

Thank you mate, you have cleared a lot of things up for me. I guess the best choice for me is to split everything like hard-working into hard and working then.
