How do I get rid of ∩╗┐, ΓÇö, and other mystery characters?

I am trying to read a txt file, but the output starts with "∩╗┐" and all of the long dashes are replaced with "ΓÇö". I wasn't able to find an adequate solution on Stack Overflow; all I know is that it has to do with ASCII characters and UTF-8 defaults. Any assistance in identifying and solving this problem would be greatly appreciated. Here's my code:


#include <fstream>
#include <iostream>
#include <string>
using namespace std;

int main() {
    ifstream file("C:\\Users\\Jacob\\IdeaProjects\\Summary\\OED2(12-14).txt");
    if (!file) {
        cerr << "Could not open the file" << endl;
        return 1;
    }

    // Print the first 100 lines of the file, then stop.
    int count = 0;
    string curr;
    while (count < 100 && getline(file, curr)) {
        cout << curr << endl;
        count++;
    }

    return 0;  // the ifstream closes itself when it goes out of scope
}
What do you mean "get rid of"?

Remove all the UTF-8 encoded characters and
- replace them with nothing
- replace them with a marker
- replace them with crude ascii approximations

Or output them as their properly rendered glyphs?
Perhaps
https://en.cppreference.com/w/cpp/io/basic_ios/imbue

For example:
    ifstream file("C:\\Users\\Jacob\\IdeaProjects\\Summary\\OED2(12-14).txt");
    file.imbue(std::locale("en_US.UTF8"));
    cout.imbue(std::locale("en_US.UTF8"));  // maybe this as well? 


From a command prompt, does this display the text file properly?
https://docs.microsoft.com/en-us/windows-server/administration/windows-commands/type
 
type C:\Users\Jacob\IdeaProjects\Summary\OED2(12-14).txt
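If `type` shows the same garbled characters, the file is probably fine and the console is simply decoding the UTF-8 bytes with a legacy code page. One possible fix, sketched below, is to switch the console's output code page to UTF-8 before printing. `useUtf8Console` is a hypothetical helper name; `SetConsoleOutputCP` and `CP_UTF8` come from the Windows API, and whether the glyphs then render also depends on the console font.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: asks the Windows console to interpret program output
// as UTF-8. On other platforms it does nothing and reports success.
bool useUtf8Console() {
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;  // nonzero means the call succeeded
#else
    return true;
#endif
}
```

Calling `useUtf8Console()` once at the top of `main`, before the first `cout`, should be enough; the strings read by `getline` are not changed by it.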




I want to eventually pass the lines of the txt file through a lineSeparaterFunction, and the location of the long dash is where each line will be separated. But Visual Studio does not even recognize the mystery character, so I can't use it as a replacement.

I just want my string to be identical to the line in the txt file. Is there no way to get Visual Studio to recognize the long dash? Do you think I should try changing the long dashes to something else in the txt file?
Perhaps you want to use these, if you want a long dash to be preserved in your string as something you can look for later on.
https://www.cplusplus.com/reference/fstream/wifstream/
https://www.cplusplus.com/reference/string/wstring/
https://www.cplusplus.com/reference/iostream/wcout/

∩╗┐
UTF-8 bytes: EF BB BF
ZERO WIDTH NO-BREAK SPACE (at the start of a file this is the byte order mark, BOM)
Hex code point: FEFF

ΓÇö
UTF-8 bytes: E2 80 94
EM DASH  —
Hex code point: 2014
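
Since a `std::string` just holds bytes, the em dash is already in the string exactly as it is in the file: the three bytes E2 80 94. A minimal sketch of splitting a line at that byte sequence follows; `splitOnEmDash` and `kEmDash` are made-up names for illustration, and the bytes are written as hex escapes so the result does not depend on how the compiler reads the source file.

```cpp
#include <string>
#include <vector>

// UTF-8 encoding of U+2014 EM DASH, spelled out as explicit bytes.
const std::string kEmDash = "\xE2\x80\x94";

// Hypothetical helper: splits `line` at every em dash and returns the pieces.
std::vector<std::string> splitOnEmDash(const std::string& line) {
    std::vector<std::string> parts;
    std::size_t start = 0;
    std::size_t pos;
    while ((pos = line.find(kEmDash, start)) != std::string::npos) {
        parts.push_back(line.substr(start, pos - start));
        start = pos + kEmDash.size();  // skip all three bytes of the dash
    }
    parts.push_back(line.substr(start));  // text after the last em dash
    return parts;
}
```

A line with no em dash comes back as a single piece, so the function can be applied to every line read by `getline` without checking first.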

If these are the only two non-ASCII characters in the text, then by far the easiest approach is to replace them with alternate characters.

You can replace the ZERO WIDTH NO-BREAK SPACE with a regular space.

As for the EM DASH, if a regular dash is not sufficient because other regular dashes exist in the text, then you could replace it with something that isn't in the text, perhaps a tilde (~) or perhaps two dashes in a row (--).
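
The two replacements described above could be done in the program rather than by editing the file. This is only a sketch: `replaceAll` and `toAscii` are hypothetical helper names, and the tilde is just one of the stand-ins suggested above.

```cpp
#include <string>

// Hypothetical helper: replaces every occurrence of `from` in `text` with `to`.
void replaceAll(std::string& text, const std::string& from, const std::string& to) {
    if (from.empty()) return;  // avoid an infinite loop on an empty pattern
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue searching after the inserted text
    }
}

// Applies both substitutions to one line of the file.
std::string toAscii(std::string line) {
    replaceAll(line, "\xEF\xBB\xBF", " ");  // U+FEFF (BOM) -> regular space
    replaceAll(line, "\xE2\x80\x94", "~");  // U+2014 EM DASH -> tilde
    return line;
}
```

Calling `toAscii(curr)` on each line right after `getline` leaves plain ASCII that any console displays, and the tilde can then serve as the separator.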