Unicode character check

Hi. I'm using Visual C++ 2010 and I'm trying to read from a Unicode file, check for a newline at a certain point, and replace it with a space character.
Here's my code:
#include "stdafx.h"
#include "fstream"
#include "string"
#include "iostream"

int main()
{
   using namespace std;

   fstream iofile("Sample.dat", ios::in | ios::out);
   if (!iofile)
   {
      cout << "Uh oh, Sample.dat could not be opened!" << endl;
      exit(1);
   }

   char chChar;
   int numchar=0;

   while (iofile.get(chChar))
   {
       if(chChar=='U+000A' && numchar % 76==0)
       {
            iofile.seekg(-1, ios::cur);
            iofile << 'U+0020';
            iofile.seekg(iofile.tellg(), ios::beg);
            numchar--;
       }
       else if(chChar=='U+000A' && numchar % 76!=0)
           numchar==0;

       numchar++;
   }

   return 0;
}


I get these errors:
line 1 error C3872: '0x2028': this character is not allowed in an identifier
line 1 error C2014: preprocessor command must start as first nonwhite space
line 2 fatal error C1004: unexpected end-of-file found
Please help!
Looks like you have a problem with the file format of your code, not the code itself. The error seems to be saying that there is a Unicode line separator up in your #includes.
It does look like that. The Unicode character 2028 is a line separator, so I'm not sure what it wants. How would I fix the file format of the code?
Offhand, I'd throw the source code into a hex editor or another text editor that can show non-printing characters and get rid of it. I don't really see any reason for it to be there, anyway.
It's the enter character. I need that.
Well, according to your compiler, you have it at the beginning of your code (before the first #), which is preventing it from understanding your preprocessor command.
Yeah, but there isn't anything before the first #.
I probably have to change a setting somewhere, I'm just not sure where.
From my understanding, with VC++ you need to put the standard library headers (the ones that live in your VC++ folder) inside <>'s. So:

#include <fstream>
#include <string>
#include <iostream>

Stdafx.h shouldn't require it. Of course, normally you'd be receiving a different error for something like that, but it's worth a shot.
It might be that it is trying to interpret your file as ASCII instead of Unicode. Try saving it and see if it prompts you to save it as Unicode instead. If that doesn't work, try opening it in another text editor like Notepad++ and saving it as Unicode specifically.
I've already saved it as Unicode; that's when this error appeared. Before that, it said
error C2015: too many characters in constant

I also changed the quotes to <>, but that didn't help.
Hrm, I have no idea then... I've used Unicode files before and they work fine. :/

As for that error (C2015), you can't write a character constant like that. A "char" in C++ is a single byte, and unfortunately the standard support for Unicode is quite bad, so I'd try doing a search and see if you can find anything to help you with reading the files.
I've tried; the forum was my last resort.
A wild guess: are you sure that it is not a BOM (U+FEFF) that's bugging the IDE, and the compiler message is just misleading you? A BOM is usually prepended to Unicode files, but it would be very strange if you saved the file with the IDE itself and the compiler bundled with it could not recognize it.

Seriously though, open the file with a hex editor. I recommend HxD: http://mh-nexus.de/en/hxd/

Regards
I opened it with the hex editor you suggested, and there aren't any hidden characters.
I don't exactly know what a BOM is, so I have no idea how to check for it...
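For anyone else wondering how to check: a BOM is just a fixed byte pattern at the very start of the file, so it can be tested for directly. Below is a minimal sketch, not a complete encoding detector. The names detect_bom and read_prefix are made up for illustration, and only the UTF-8 and UTF-16 marks are recognised (a UTF-32 LE mark would be reported as UTF-16 LE here).

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: the byte at index i, as an unsigned value so that
// bytes >= 0x80 compare correctly against hex constants.
static unsigned char byte_at(const std::string& bytes, std::size_t i)
{
    return static_cast<unsigned char>(bytes[i]);
}

// Hypothetical helper: names the byte-order mark at the start of `bytes`,
// or returns "none" if no UTF-8 / UTF-16 mark is present.
std::string detect_bom(const std::string& bytes)
{
    if (bytes.size() >= 3 && byte_at(bytes, 0) == 0xEF &&
        byte_at(bytes, 1) == 0xBB && byte_at(bytes, 2) == 0xBF)
        return "UTF-8";       // EF BB BF
    if (bytes.size() >= 2 && byte_at(bytes, 0) == 0xFF && byte_at(bytes, 1) == 0xFE)
        return "UTF-16 LE";   // FF FE
    if (bytes.size() >= 2 && byte_at(bytes, 0) == 0xFE && byte_at(bytes, 1) == 0xFF)
        return "UTF-16 BE";   // FE FF
    return "none";
}

// Hypothetical helper: reads up to n raw bytes from the start of a file.
// Binary mode matters, otherwise the runtime may translate some bytes.
std::string read_prefix(const char* path, std::size_t n)
{
    std::ifstream in(path, std::ios::binary);
    std::string buf(n, '\0');
    if (n > 0)
        in.read(&buf[0], static_cast<std::streamsize>(n));
    buf.resize(static_cast<std::size_t>(in.gcount()));
    return buf;
}
```

Calling detect_bom(read_prefix("yourfile.cpp", 4)) on the source file (the file name is a placeholder) would tell you which of these marks, if any, the editor wrote.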
New MS compilers are out of my element, but are you sure that you can specify Unicode literals with this syntax (even with wchar_t)? Even if it is not the cause of the issue, it still makes me wonder. Looking at
MSDN: http://msdn.microsoft.com/en-us/library/6aw8xdf2.aspx and
this post: http://stackoverflow.com/questions/1826426/unicode-string-literals-in-c-vs-c-cli
I don't see this syntax supported anywhere.

The most you can do, if everything else fails and no one gives you better advice, is to dump the preprocessed output of the .cpp file (sorry, but you'll have to find the option yourself or someone must tell you how to do it) and then search it with the hex editor for the sequence 0xE2 0x80 0xA8, which is this character in UTF-8. Alternatively, you can search for 0x2028 directly. Keep in mind that compilers sometimes report errors that don't show the true source of the problem, so first make sure that your code is OK by MS standards.
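If searching by hand in the hex editor is tedious, the same search can be scripted. This is only a sketch: find_line_separators and read_all_bytes are illustrative names, and it looks only for the UTF-8 form (0xE2 0x80 0xA8), so a file saved as UTF-16 would need a different byte pattern.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: returns every byte offset at which the UTF-8
// encoding of U+2028 LINE SEPARATOR (E2 80 A8) starts in `bytes`.
std::vector<std::size_t> find_line_separators(const std::string& bytes)
{
    const std::string needle("\xE2\x80\xA8");
    std::vector<std::size_t> offsets;
    for (std::size_t pos = bytes.find(needle);
         pos != std::string::npos;
         pos = bytes.find(needle, pos + 1))
    {
        offsets.push_back(pos);
    }
    return offsets;
}

// Hypothetical helper: loads a whole file as raw bytes (binary mode, so
// nothing is translated on the way in). Returns an empty string if the
// file cannot be opened.
std::string read_all_bytes(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Running find_line_separators(read_all_bytes(path)) on the .cpp file or on the preprocessed output gives the offsets to jump to in the hex editor; an empty result means the UTF-8 form of the character is not in that file.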

This link is for reference: http://www.fileformat.info/info/unicode/char/2028/index.htm

Regards
I tried using wchar_t and the L prefix before, but my compiler didn't like the L, and iofile.get(chChar) wouldn't take a wchar_t. Also, I tried saving it as non-Unicode, where it complains that my string literals are too long... I'm not sure if that's helpful though. I think I'm just going to try to find a different way of doing this.
Well, I don't know why you encountered problems, but the proper way to use wide character support is to use wfstream instead of fstream, the L prefix, and the wchar_t type. Now, whether wide character support will give you Unicode support is an entirely different issue and I cannot help you there. (Although I seriously doubt that it will, unless you exploit the platform and encoding specifics.)
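To make the wide-character suggestion concrete, here is a rough sketch of the loop written against wide streams. Several things in it are assumptions rather than the original poster's method: the name join_wrapped_lines is invented, it writes to a second stream instead of patching the file in place, and it guesses that the intent is "a newline that arrives after exactly `width` characters is a soft wrap and becomes a space".

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: copies `in` to `out`. A newline read after exactly
// `width` characters on the current line is written as a space (assumed to
// be a soft wrap); any other newline is copied unchanged. The character
// count restarts after every newline.
void join_wrapped_lines(std::wistream& in, std::wostream& out, int width)
{
    wchar_t ch;
    int count = 0;  // characters seen since the last newline
    while (in.get(ch))
    {
        if (ch == L'\n')  // L'\n' is U+000A; L'\x000A' is the same value
        {
            out.put(count == width ? L' ' : L'\n');
            count = 0;
        }
        else
        {
            out.put(ch);
            ++count;
        }
    }
}

// Convenience overload for trying the logic on an in-memory string.
std::wstring join_wrapped_lines(const std::wstring& text, int width)
{
    std::wistringstream in(text);
    std::wostringstream out;
    join_wrapped_lines(in, out, width);
    return out.str();
}
```

With files, the two streams would be a std::wifstream and a std::wofstream. How the bytes on disk are turned into wchar_t values depends on the locale imbued on the stream, which is the part the codecvt link above deals with, so this sketch by itself does not guarantee correct handling of a UTF-8 or UTF-16 file.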

Two links that may or may not help you:
http://forum.osdev.org/viewtopic.php?p=17836#p17836
http://www.boost.org/doc/libs/1_38_0/libs/serialization/doc/codecvt.html
I tried this on the non-Unicode version, and the complaint has switched to

warning C4066: characters beyond first in wide-character constant ignored

But it succeeded :D
I'll debug tomorrow; I think the warning is probably important.
Bye
As I said, I haven't seen the L'U+code' syntax before; if it is incorrect, that would explain the compiler complaining about too many characters. Why don't you try the L'\xcode' syntax tomorrow?
This warning was with the L'U+code' syntax and a wchar_t. Before that, it said that the string literals were too long. This is the version not saved as Unicode, though. The version saved as Unicode still complains about the first line.
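For reference, a short sketch of what the escape syntax looks like. Inside single quotes, 'U+000A' is six separate characters (U, +, 0, 0, 0, A), which fits both the "too many characters in constant" error and the C4066 warning. A hex escape names one character by its code. The constant and function names below are made up for the example.

```cpp
// Each constant is a single wide character written with a hex escape.
const wchar_t kLineFeed      = L'\x000A';  // U+000A, same value as L'\n'
const wchar_t kSpace         = L'\x0020';  // U+0020, same value as L' '
const wchar_t kLineSeparator = L'\x2028';  // U+2028 LINE SEPARATOR

// Hypothetical helper showing a comparison against one of the constants.
bool is_line_feed(wchar_t ch)
{
    return ch == kLineFeed;
}
```

For the two characters the original code cares about, plain L'\n' and L' ' would work just as well; the hex form is mainly useful for characters such as U+2028 that have no short escape.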