UTF-8 file to string

Hi,

I have a UTF-8 file that looks like this:

/****** Script for SelectTopNRows command from SSMS  ******/
SELECT COUNT(*)
FROM [a].[dbo].[Events]
WHERE [a].[dbo].[Events].EventID NOT IN(
	SELECT
	[qs].EventID 
	FROM [a].[dbo].[Qualifiers] AS [qs] INNER JOIN [a].[dbo].[Events] ON [qs].EventID = [Events].EventID
	WHERE typeID = 1 
	AND qualifier IN(2,5,6,107,123,124)
	GROUP BY [qs].EventID
)
AND typeID = 1


I read it in like so:

std::ifstream streamqe("C:\\path\\queries.txt");  // stack object, so nothing is leaked
std::string szq((std::istreambuf_iterator<char>(streamqe)), std::istreambuf_iterator<char>());


So the string szq is fine, except that it contains the characters ï»¿ at the beginning. Why is that, and what can I do to get rid of them?

Thanks!

C
If you interpret your std::string as UTF-8, you will see that it is actually U+FEFF, which is meaningless in a utf8 file: http://en.wikipedia.org/wiki/Byte_order_mark#UTF-8

just ignore it
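To see where the three bytes come from, here is a rough sketch of how a code point below U+10000 is turned into UTF-8. The helper name encode_utf8_bmp is made up for this illustration. It assumes the input is at most U+FFFF and does not reject surrogates, so treat it as a demonstration rather than a full encoder.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not a standard function): encode one code point
// in the range U+0000..U+FFFF as UTF-8. Larger values are not handled.
std::string encode_utf8_bmp(std::uint32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        // One byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Calling encode_utf8_bmp(0xFEFF) gives the bytes EF BB BF, which is the sequence sitting at the front of the string.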
Well, if I just ignore it, I can't use the string, because the query doesn't work with ï»¿ in front of it. I guess I can delete that part of the string, though ;)

If I understand you right, this will be at the beginning of every utf-8 file, right?

So, I can just go by deleting the first 3 characters?
No, normally UTF-8 files don't use it; it's a peculiarity of whatever program you used to create it. Check whether the first three bytes are \xef\xbb\xbf, or alternatively whether the first character is \ufeff, and delete it if so.
I was actually able to change the default encoding of the editor (Sql Server Query Editor) to UTF-8 without signature, now I can continue without that "crap" ;)

Thanks Cubbi, you sent me in a good direction for this!
No, normally UTF-8 files don't use it, it's the peculiarity of whatever program you used to create it.

That's not quite correct.

UTF-8 files don't need it, but a good number of programs do use it. (The most notorious of these is Notepad on Windows.)

Since it is valid at the head of a UTF-8 stream, whenever you have to handle any UTF stream, whether it be 8, 16, 32, 7, whatever -- you must pay attention to the possibility of a BOM. If nothing else, it will tell you whether or not you can continue processing the stream safely.

UTF-8 has no byte-order problems, but the BOM does identify the stream as UTF-8 Unicode text, instead of some random ASCII or whatever-the-originator's-codepage-was text.

The correct thing to do is to assume that a BOM may be present. If it is, get rid of it.

if (szq.compare( 0, 3, "\xEF\xBB\xBF" ) == 0)  // Is the file marked as UTF-8?
{
  hey_this_is_a_unicode_file = true;  // (If it matters to take note.)
  szq.erase( 0, 3 );                  // Now get rid of the BOM.
}

// If there is any possibility that it may be another UTF stream, you should check here for them.
else if (...)
{
  ...
}

// Finally, carry on as usual. 
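
For the "check for other UTF streams" branch, something along these lines might do. It is a sketch only: Bom, BomInfo, has_prefix and detect_bom are names invented for this example. It covers the UTF-8, UTF-16 and UTF-32 marks and nothing else, and it assumes that FF FE 00 00 should be read as UTF-32 LE rather than as a UTF-16 LE BOM followed by a NUL character.

```cpp
#include <cstddef>
#include <string>

// All names below are hypothetical, chosen for this example.
enum class Bom { None, Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomInfo {
    Bom kind;            // which mark was found
    std::size_t length;  // how many bytes to erase from the front
};

// True if s begins with the n given bytes (which may include NUL bytes).
static bool has_prefix(const std::string& s, const char* bytes, std::size_t n)
{
    return s.size() >= n && s.compare(0, n, bytes, n) == 0;
}

BomInfo detect_bom(const std::string& s)
{
    // UTF-32 must be tested before UTF-16: FF FE 00 00 also starts with FF FE.
    if (has_prefix(s, "\xFF\xFE\x00\x00", 4)) return {Bom::Utf32LE, 4};
    if (has_prefix(s, "\x00\x00\xFE\xFF", 4)) return {Bom::Utf32BE, 4};
    if (has_prefix(s, "\xEF\xBB\xBF", 3))     return {Bom::Utf8, 3};
    if (has_prefix(s, "\xFF\xFE", 2))         return {Bom::Utf16LE, 2};
    if (has_prefix(s, "\xFE\xFF", 2))         return {Bom::Utf16BE, 2};
    return {Bom::None, 0};
}
```

With that, szq.erase(0, detect_bom(szq).length) removes whichever mark is present, and the kind field tells you whether the rest of the data is something you can process as UTF-8 at all.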

Hope this helps.