extracting data from file

closed account (NCRLwA7f)
Hey guys, I have a few data files that I need to recover data from. The originals got damaged, and the only copies I have contain the data I need but also symbols and other punctuation marks mixed in.
I tried using inFile >> and setting up conditions to copy every char to a new text file, excluding the symbols I don't need. The only problem is that I'm not pulling in the blank spaces, so now I have one continuous sentence.
Would I need to use getline? Or can I check one character at a time with a different method?

  //example text sample
//from:
thisÚËfi‘ is a sample`VLB8.$
¸ÚËfi‘ ¿µ™üîâ~sh]RG<1& sentence98@
//to:
this is a sample sentence
Hello DaRealFonz,

As a thought you could use "std::getline()" to read each line of the file and use a for loop to check each character with the "isalpha" and "isspace" functions. When you find a match add this character to a new string.

For something fancy you could set up something to add the necessary punctuation at the end.

Looks like some afternoon fun. I will see what I can come up with. If you could give me more than a one line example of what the file actually looks like that would help.

Hope that helps,

Andy

Edit: Forgot to mention "isalpha" and "isspace" is in the header file <cctype>.
Yes, use getline(fin, line). Then scan the line and only output what you want (including the spaces). Don't forget to output the newline at the end of the line.
This horrible hack of mine seems to work on your sample.
It removes all Unicode multi-byte characters.
It removes every character that is followed by a '\x0e' or '\x0f' character (bytes 14 and 15).
It removes all characters that are not alphabetic or whitespace.
#include <iostream>
#include <iomanip>
#include <sstream>
#include <cctype>   // isspace, isalpha
using namespace std;

int nbytes(char ch) {
    int n = (unsigned char)ch;
    if      (n >= 240) return  4;
    else if (n >= 224) return  3;
    else if (n >= 192) return  2;
    else if (n >= 128) return -1;
    return 1;
}

int main() {

    istringstream sin(
    "thisÚËfi‘ is a sample`VLB8.$"
    "¸ÚËfi‘ ¿µ™üîâ~sh]RG<1& sentence98@"
    );

    char c, prev;
    sin.get(prev);
    while (sin.get(c)) {
        int n = nbytes(c);
        if (n != 1) {
            for (int i = 1; i < n; i++)
                sin.get(c);
            sin.get(c);
            if (c != '\xe')
                sin.unget();
        }
        else if (c != '\xe' && c != '\xf') {
            if (isspace((unsigned char)prev) || isalpha((unsigned char)prev))  // cast: prev may hold a byte >= 128
                cout << prev;
            prev = c;
        }
        else {
            sin.get(prev);
        }
    }

//    while (sin.get(c))
//        cout << setw(3) << (int)(unsigned char)c << ' '
//             << setw(2) << nbytes(c) << ' ' << c << '\n';;
    
    cout << '\n';
}
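
The three rules listed before the program can also be sketched on a whole string instead of a stream. This is not a line-for-line equivalent of the program above, only an illustration under the assumption that the rules are applied in the order given; the names isMarker and stripDamage are made up for the example:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// True for the two control bytes (14 and 15) that tag damaged characters.
bool isMarker(unsigned char b) { return b == 0x0e || b == 0x0f; }

// Hypothetical helper applying the three rules to one string.
std::string stripDamage(const std::string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char b = static_cast<unsigned char>(in[i]);

        // Rule 1: every byte of a UTF-8 multi-byte sequence is >= 0x80.
        if (b >= 0x80) continue;

        // Rule 2: drop a character that is followed by a 14 or 15 byte.
        if (i + 1 < in.size() &&
            isMarker(static_cast<unsigned char>(in[i + 1])))
            continue;

        // Rule 3: keep only letters and whitespace
        // (this also drops the marker bytes themselves).
        if (std::isalpha(b) || std::isspace(b))
            out += in[i];
    }
    return out;
}
```

Working on a string keeps the look-ahead for rule 2 simple, at the cost of holding the whole input in memory.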

Hello DaRealFonz,

I tried tpb's program and it did not work for me. Sorry tpb.

I did come up with this that works to a point:
#include <iostream>
#include <iomanip>  //  setw(), fixed, setprecision.
#include <string>
#include <fstream>
#include <cctype>
#include <cstdlib>  // exit()
#include <chrono>  // For sleep time code
#include <thread>  // For sleep time code

int main()
{
	unsigned char c{};  // unsigned, so bytes above 127 are not seen as negative values

	std::string line, newLine;
	std::string iFileName{ "Input Data.txt" };

	std::ifstream inFile;

	inFile.open(iFileName);

	if (inFile.is_open())
	{
		std::cout << "\n File " << iFileName << " is open" << std::endl;
		std::this_thread::sleep_for(std::chrono::seconds(0));  // <--- Needs header files "chrono" and "thread".
	}
	else
	{
		std::cout << "\n File " << iFileName << " did not open" << std::endl;
		std::this_thread::sleep_for(std::chrono::seconds(3));  // <--- Needs header files "chrono" and "thread".
		exit(1);
	}

	while (std::getline(inFile, line))
	{
		std::cout << "\n" << line << std::endl;

		for (size_t lc = 0; lc < line.size(); lc++)
		{
			c = line[lc];

			if (c > 128) continue;

			if (std::isalpha(c))
				newLine += line[lc];

			if (std::isspace(c))
				newLine += line[lc];
		}
	}

	std::cout << "\n " << newLine << std::endl;

	std::cout << "\n\n Press Enter to continue";
	std::cin.get();

	return 0;
}

This works, except that it keeps anything that is a letter, even stray letters from the damaged data that you do not want in the final output.

Something you can start with.

Hope that helps,

Andy
closed account (NCRLwA7f)
OK, I was just wondering if there was something already in the Standard Template Library. What tpb put up looks interesting; I will give it a try. I tried what you posted, Andy, and it is a good start. It did, however, stop working midway, but I think that's because the file is way too big. Regardless, that gave me an idea: I might have to make a couple of passes to clean it up.

I'm thinking I'm just going to ignore every char that is not a space, a-z, A-Z, or a period for now.
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cctype>
#include <algorithm>
#include <iterator>
using namespace std;


bool filter( char c )
{
   const string allowed = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .\n";    // choose what to keep

   bool ok = allowed.find( c ) != string::npos;
   return ok;
}


int main()
{
// ifstream in( "input.txt" );
// ofstream out( "output.txt" );
   stringstream in( "thisÚËfi‘ is a sample`VLB8.$¸ÚËfi‘ ¿µ™üîâ~sh]RG<1&sentence98@" );

   istream_iterator<char> it{ in >> noskipws };
   ostream_iterator<char> ot{ cout };
   copy_if( it, {}, ot, filter );
}
@Handy, In what sense did it not work?
I just copied the code from the post above, pasted it in an editor, saved it, compiled it, ran it and it printed:
this is a sample sentence
Try running this on your file. Set the input and output file names, of course.
#include <iostream>
#include <iomanip>
#include <fstream>
#include <cctype>
using namespace std;

int nbytes(char ch) {
    int n = (unsigned char)ch;
    if      (n >= 240) return  4;
    else if (n >= 224) return  3;
    else if (n >= 192) return  2;
    else if (n >= 128) return -1;
    return 1;
}

int main() {
    ifstream fin("inputfile");
    ofstream fout("outputfile");

    char c, prev;
    fin.get(prev);
    while (fin.get(c)) {
        int n = nbytes(c);
        if (n != 1) {
            for (int i = 1; i < n; i++)
                fin.get(c);
            fin.get(c);
            if (c != '\xe')
                fin.unget();
        }
        else if (c != '\xe' && c != '\xf') {
            if (isspace((unsigned char)prev) || isalpha((unsigned char)prev))  // cast: prev may hold a byte >= 128
                fout << prev;
            prev = c;
        }
        else {
            fin.get(prev);
        }
    }
    fout << prev;
}

closed account (NCRLwA7f)
Whoa... hey, I tried your code, tpb, and this is probably the best output I have gotten from the file so far. I like how it takes the Unicode out yet leaves a blank where the Unicode was; this really improves readability.

lastchance, your code was kind of what I was doing, but yours is shorter and more practical.

I think I can get what I need from the files with what you guys have provided. Now I just have to sort through about 6 files with 50MB of text each :)
If you're on a Linux system, consider using sed (-E enables extended regular expressions, which the + quantifier needs):
sed -E -e '1,$s/[^a-zA-Z]+/ /g' < inputFile > outputFile
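
For anyone without sed, the intent of that one-liner (replace each run of non-letters with a single space) can be sketched in C++ roughly as follows. The function name collapseNonLetters is made up for the example, and it works on one line at a time, as sed does:

```cpp
#include <string>

// Hypothetical helper: replaces every run of characters outside
// a-z / A-Z with a single space.
std::string collapseNonLetters(const std::string& line)
{
    std::string out;
    bool inRun = false;  // true while inside a run of non-letters
    for (char ch : line) {
        bool isLetter = (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z');
        if (isLetter) {
            out += ch;
            inRun = false;
        } else if (!inRun) {
            out += ' ';  // first non-letter of a run becomes one space
            inRun = true;
        }
    }
    return out;
}
```

Note that this keeps stray letters such as the "VLB" in the sample, so it is a rough first pass rather than a full cleanup.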
My analysis was based on the output of this program:
#include <iostream>
#include <iomanip>
#include <sstream>
using namespace std;

int nbytes(char ch) {
    int n = (unsigned char)ch;
    if      (n >= 240) return  4;
    else if (n >= 224) return  3;
    else if (n >= 192) return  2;
    else if (n >= 128) return -1;
    return 1;
}

int main() {
    istringstream sin(
        "thisÚËfi‘ is a sample`VLB8.$"
        "¸ÚËfi‘ ¿µ™üîâ~sh]R"
        "G<1& sentence98@"
    );

    char c;
    while (sin.get(c))
        cout << setw(3) << (int)(unsigned char)c << ' '
             << setw(2) << nbytes(c) << ' ' << c << '\n';
}

The output follows. As you can see, there are definitely Unicode multi-byte characters, and also a bunch of 14s and 15s that need to be removed. And if a 14 or 15 appears, the character before it needs to go too, even if it's a normal alphabetic character.

116  1 t
104  1 h
105  1 i
115  1 s
195  2 �
154 -1 �
 14  1 
195  2 �
139 -1 �
 14  1 
239  3 �
172 -1 �
129 -1 �
 14  1 
226  3 �
128 -1 �
152 -1 �
 32  1  
105  1 i
115  1 s
 32  1  
 97  1 a
 32  1  
115  1 s
 97  1 a
109  1 m
112  1 p
108  1 l
101  1 e
 96  1 `
 15  1 
 86  1 V
 15  1 
 76  1 L
 15  1 
 66  1 B
 15  1 
 56  1 8
 15  1 
 46  1 .
 15  1 
 36  1 $
 15  1 
 15  1 
 16  1 
 15  1 
  6  1 
 14  1 
194  2 �
184 -1 �
 14  1 
195  2 �
154 -1 �
 14  1 
195  2 �
139 -1 �
 14  1 
239  3 �
172 -1 �
129 -1 �
 14  1 
226  3 �
128 -1 �
152 -1 �
 14  1 
 32  1  
 14  1 
194  2 �
191 -1 �
 14  1 
194  2 �
181 -1 �
 14  1 
226  3 �
132 -1 �
162 -1 �
 14  1 
195  2 �
188 -1 �
 14  1 
195  2 �
174 -1 �
 14  1 
195  2 �
162 -1 �
 14  1 
126  1 ~
 14  1 
115  1 s
 14  1 
104  1 h
 14  1 
 93  1 ]
 14  1 
 82  1 R
 14  1 
 71  1 G
 14  1 
 60  1 <
 14  1 
 49  1 1
 14  1 
 38  1 &
 14  1 
 27  1 
 14  1 
 16  1 
 14  1 
  5  1 
 32  1  
115  1 s
101  1 e
110  1 n
116  1 t
101  1 e
110  1 n
 99  1 c
101  1 e
 57  1 9
 56  1 8
 64  1 @
