Hey guys, I have a few data files that I would like to extract data from. The originals got damaged, and the only copy I have contains the data I need, but it also has symbols and other punctuation marks mixed in.
I tried using inFile >> and setting up conditions to copy every char to a new text file, excluding the symbols I don't need. The only problem is that I'm not pulling in the blank spaces, so now I have one continuous sentence.
Would I need to use getline? Or can I check one character at a time with a different method?
//example text sample
//from:
thisÚËfi‘ is a sample`VLB8.$
¸ÚËfi‘ ¿µ™üîâ~sh]RG<1& sentence98@
//to:
this is a sample sentence
As a thought you could use "std::getline()" to read each line of the file and use a for loop to check each character with the "isalpha" and "isspace" functions. When you find a match add this character to a new string.
For something fancy you could set up something to add the necessary punctuation at the end.
Looks like some afternoon fun. I will see what I can come up with. If you could give me more than a one line example of what the file actually looks like that would help.
Hope that helps,
Andy
Edit: Forgot to mention that "isalpha" and "isspace" are declared in the header <cctype>.
Yes, use getline(fin, line). Then scan the line and only output what you want (including the spaces). Don't forget to output the newline at the end of the line.
This horrible hack of mine seems to work on your sample.
It removes all Unicode multi-byte characters.
It removes all characters with a '\x1e' or '\x1f' character after it.
It removes all characters that are not alphabetic or whitespace.
OK, I was just wondering if there was something already in the Standard Template Library. What tpb put up looks interesting; I will give it a try. I tried what you put, Andy, and it is a good start. It did, however, stop working midway, but I think that's because the file is way too big. Regardless, it gave me an idea: I might have to give it a couple of passes to clean it up.
I'm thinking I'm just going to ignore every char that is not a space, a-z, A-Z, or a period for now.
@Handy, in what sense did it not work?
I just copied the code from the post above, pasted it in an editor, saved it, compiled it, ran it and it printed:
this is a sample sentence
Whooo... hey, I tried your code, tpb. This is probably the best output I have gotten from the file. I like how it takes the Unicode out yet leaves a blank where the Unicode was; this really improves readability.
lastchance, your code was kind of what I was doing, but yours is shorter and more practical.
I think I can get what I need from the files with what you guys have provided. Now I just have to sort through about 6 files with 50MB of text each :)
#include <iostream>
#include <iomanip>
#include <sstream>
using namespace std;

// Classify a byte by its role in a UTF-8 sequence: 2, 3, or 4 for a lead
// byte of that length, -1 for a continuation byte, 1 for plain ASCII.
int nbytes(char ch) {
    int n = (unsigned char)ch;
    if (n >= 240) return 4;
    else if (n >= 224) return 3;
    else if (n >= 192) return 2;
    else if (n >= 128) return -1;
    return 1;
}

int main() {
    istringstream sin(
        "thisÚËfi‘ is a sample`VLB8.$""¸ÚËfi‘ ¿µ™üîâ~sh]R""G<1& sentence98@"
    );
    char c;
    // Print each byte's value, its nbytes() class, and the byte itself.
    while (sin.get(c))
        cout << setw(3) << (int)(unsigned char)c << ' '
             << setw(2) << nbytes(c) << ' ' << c << '\n';
}
The output follows. As you can see, there are definitely unicode multi-byte characters and also a bunch of 14's and 15's that need to be removed. And if a 14 or 15 appears, the character before it needs to go, too, even if it's a normal alphabetic character.