Parsing data slow in Windows

I've posted here recently about reading from a text file and parsing the data into a vector. After running a few tests, I've noticed that reading the file is not the issue; parsing the data is much quicker in Linux than in Windows.

Here's a link to the code and the tests I've run while trying to get some answers:
https://stackoverflow.com/questions/57642866/parsing-data-quicker-in-ubuntu-than-windows-10

Also, an older post of mine from when I thought it was a reading issue caused by the filesystem:
http://www.cplusplus.com/forum/general/254030/

I'm just trying to understand if the parsing issue is something that can be solved with the code or if it's something that Linux just does better than Windows.

Thanks
Compile both programs to assembly and compare whether they do anything differently.
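(If it helps, dumping assembly is just a compiler switch; the file name here is a placeholder.)

g++ -O2 -S -std=c++17 parse.cpp        (writes parse.s)
cl /O2 /std:c++17 /EHsc /FA parse.cpp  (writes parse.asm)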

Other than that, I have heard that MSVC and GCC deal with push_back/emplace_back differently; pretty much, use emplace_back when you can (convert push_back(object(a,b)) to emplace_back(a, b)).
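Rough illustration of the difference (Record here is just a made-up type):

#include <vector>

struct Record
{
    int a;
    double b;
    Record(int a_, double b_) : a(a_), b(b_) {}
};

int main()
{
    std::vector<Record> records;

    records.push_back(Record(1, 2.0)); // builds a temporary Record, then moves/copies it in
    records.emplace_back(1, 2.0);      // constructs the element in place, no temporary
}
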
The vector isn't the issue at all... even just parsing the data while streaming it from the file, Windows is much slower.

I'm not sure how to compile both to assembly and check that, though. I'll look into it and see if I can find anything.
Well, from reading your old thread it seems that C++ has pretty much the same performance as C unless you are using the >> operator (which makes sense, since >> does tons of locale conversion checks and calls a virtual function for every segment).
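Roughly what the "fgets+sscanf" row below is doing (my own sketch, not salem c's actual code; the file name and format string are guesses, so adjust them to your data):

#include <cstdio>
#include <vector>

struct Row { unsigned date, time, mil; double last, bid, ask; unsigned volume; };

int main()
{
    std::vector<Row> rows;
    std::FILE *f = std::fopen("data.txt", "r"); // placeholder file name
    if (!f) return 1;

    char line[256];
    Row r;
    while (std::fgets(line, sizeof line, f))
    {
        // %u/%lf do plain numeric conversions without the locale and
        // virtual-call machinery that operator>> goes through.
        if (std::sscanf(line, "%u %u %u,%lf,%lf,%lf,%u",
                        &r.date, &r.time, &r.mil, &r.last, &r.bid, &r.ask, &r.volume) == 7)
            rows.push_back(r);
    }
    std::fclose(f);
}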

Copied and pasted from salem c:
$ cl /EHsc bar.cpp
$ bar.exe
C fgets, 5242880 records, 2.00438 seconds
C fgets+sscanf, 5242880 records, 13.4581 seconds
C++ getline, 5242880 records, 40.5704 seconds
C++ extractors, 5242880 records, 92.5081 seconds

$ cl /EHsc /O2 bar.cpp
$ bar.exe
C fgets, 5242880 records, 2.00834 seconds
C fgets+sscanf, 5242880 records, 13.5875 seconds
C++ getline, 5242880 records, 3.56614 seconds
C++ extractors, 5242880 records, 41.97 seconds


Edit: Oh, I guess this doesn't help much; I misread your question.

Assembly won't show you the problem if the problem is inside the standard library (it could still give hints, though).

I guess different compiler and library implementations just perform differently... not much you can do about that other than finding alternative solutions.
If you're currently using Visual Studio, it would be interesting to compare the performance against GCC (MinGW on Windows).

Edit: Also show a minimal example so we can see the performance ourselves.
Edit 2: Oh, you already put that in the SO post; never mind.

Maybe Windows is just slow, or perhaps anti-virus software is interfering, but I won't jump to conclusions too quickly.

It's not clear if you're using the same/equivalent optimization levels. If you used GCC on Windows you could more easily compare apples to apples.
@Ganado

I used GCC (MinGW) on Windows and the results are similar, but I'm not sure how to check optimization levels... or even what those really are. Any help with that would be appreciated. Thanks
You should google it.

Or you should at least have noticed that salem c's example benchmark is itself a demonstration of optimizations (compare the /EHsc and /EHsc /O2 runs).
If you are not specifically asking for compiler optimizations, you aren't getting any. Both the Windows and Linux builds will be significantly improved speed-wise if you turn on full optimizations, and you won't notice any difference in speed between the two systems.

Also, make sure you select the most modern language standard you can; differences in library implementations for older standards can also affect speed.
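
For example, on the command line (the file name is a placeholder):

g++ -O2 -std=c++17 parse.cpp -o parse
cl /O2 /std:c++17 /EHsc parse.cpp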

Hope this helps.
Turning on full optimizations didn't make much of a difference on Windows. It's still much slower than in Ubuntu. As salem c said:

Maybe so, but the elephant in the room is the god-awful slowness of parsing the data once it's in memory.

Shaving a few uS off the time by memory mapping the file directly ain't gonna change that.


Maybe it's just a Windows issue. I've tried everything, and it just seems to be slow at parsing.
I dislike having to give such a blunt response, but the time differences you are posting are straight-up unbelievable if you maintain the position that it is the compiler/OS's fault.

What remains is your code.

You have been given a very important suggestion that you have not responded to: profile.

If you want us to do it for you, post your code somewhere we can look at it (and fix it). As the big boys say, put up or shut up.

Because blaming a tool for failure is always the wrong approach.
I appreciate the blunt response but I'm just trying to understand the responses from others about the OS.

I'm relatively new to programming and still don't know much about it all, so I'm not sure what profiling is. I looked up some things but was unsure what I was doing.

I posted it in the links at the start of this thread, but I'll post the code here. If anyone would like to profile it or do whatever they want, please do; I really would just like to figure out the issue and fix it. I'm sure it's something with the code, but I really don't know what to fix or what's causing it, given that it's the same code in both OSes.

#include <iostream>
#include <vector>
#include <string>
#include <fstream>

using namespace std;

struct Data
{
    unsigned int date, time, mil;
    double last, bid, ask;
    unsigned int volume;
};

void readData(ifstream &dataFile)
{
    string line;
    int number{0};

    if (dataFile.is_open())
    {
        while (getline(dataFile, line))
        {
            number++;
        }
    }

    else
    {
        cout << "File did not open" << endl;
    }

    cout << number << endl;
}

void createData(vector<Data> &matrix, ifstream &dataFile)
{
    unsigned int date, time, mil;
    double last, bid, ask;
    unsigned int volume;
    char del;

    if (dataFile.is_open())
    {
        while (dataFile >> date >> time >> mil >> del >> last >> del >> bid >> del >> ask >> del >> volume)
        {
            matrix.push_back({date, time, mil, last, bid, ask, volume});
        }
    }

    else
    {
        cout << "No file" << endl;
    }
}


I'd like to clarify that even without adding elements to the vector, the speed in my tests doesn't change much.
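
My main() isn't shown above; for anyone who wants to try it, a minimal driver along these lines (a sketch, not my exact code) should be enough to reproduce the timings:

#include <chrono>

// Data, readData and createData as defined above.

int main()
{
    ifstream dataFile("data.txt"); // placeholder file name
    vector<Data> matrix;

    auto start = chrono::steady_clock::now();
    createData(matrix, dataFile);
    auto stop = chrono::steady_clock::now();

    cout << matrix.size() << " records in "
         << chrono::duration<double>(stop - start).count()
         << " seconds" << endl;
}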

Here is a link to the file:

https://drive.google.com/open?id=1m2q2F6alDYgcgW3sGALRf_ruD85tSuVl


Smaller file:

https://drive.google.com/open?id=1T6MJm494bG3PM2wmVIc_P1DRaCQcgeGJ
Hey, well, your code isn't bad, actually. You can shave a little by getting rid of the temporaries.

void loadData(vector<Data> &matrix, ifstream &dataFile)
{
  Data d;
  char c;
  while (dataFile >> d.date >> d.time >> d.mil >> c >> d.last >> c >> d.bid >> c >> d.ask >> c >> d.volume)
    matrix.emplace_back(d);
}

On my 12-year-old PC it loads your file in about 2.2 seconds on average.

Next thing to look at is your compiler and optimization settings. I compiled with:

cl /EHsc /Ox /std:c++17 a.cpp

And

clang++ -O3 -std=c++17 a.cpp

I have yet to reinstall Linux, so no comparison there, but it would really surprise me if it didn't run in nearly the same time.

If you really want to shave time, a lot of it is in the string-to-number conversion, so you need to move that elsewhere.

Load the file as strings, then perform conversions on the data as needed and/or in a separate thread. This only works if you don't need the entire dataset immediately, of course.
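
Something along these lines is what I mean (my own sketch, reusing the Data struct you posted; istringstream needs <sstream>):

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Data as posted earlier in the thread.

// Pass 1: just pull the raw lines into memory; no numeric conversion yet.
std::vector<std::string> loadLines(std::ifstream &dataFile)
{
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(dataFile, line))
        lines.push_back(line);
    return lines;
}

// Pass 2 (later, or on another thread): convert only the rows you actually need.
Data parseLine(const std::string &line)
{
    Data d{};
    char del;
    std::istringstream in(line);
    in >> d.date >> d.time >> d.mil >> del >> d.last >> del >> d.bid >> del >> d.ask >> del >> d.volume;
    return d;
}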

Hope this helps.
You, sir, have helped a ton. This is actually making a difference; it's the first thing I've seen that helps. I'm still not getting the speeds you are, though: I'm at about 20 seconds on average on Windows (which is a huge improvement), and I know it has to do with the optimizations.

Like I said before, I'm still new to all this, but how would I enable all these optimizations in Visual Studio and/or Qt? If you could provide me with that information, I think I'll be all set, and I'd really appreciate it.

I was able to set /Ox in Visual Studio, but I'm not sure about the rest. I looked it up but was still a little confused.

I also thought about streaming it all into strings and converting as I needed, but I do need access to all the data. That's something I'll have to look into as well.
From memory:

In VS, make sure the drop-down box at the top left reads "Release" and select your computer's architecture (64-bit if you have it, 32-bit otherwise).

Also make sure you are linking with "Release" (not "Debug") versions of Qt and any other external libraries you are using.
In VS, I have it set to the Release build with 64-bit, but I'm still not getting your speeds, although it is much quicker.

Did you use an IDE or how did you run the code?
Choice of IDE/console compilation will not affect the program's runtime.

I do not know what else could be an issue here.

Are you going through the standard streams for any of this?
Do you have any unusual environmental considerations?

The only options at this point are to either release all your code somewhere people can look at it (preferably with very complete information about your environment as well), or to profile the code yourself to see what is taking the time.

For the profiling, you can do that from VS using the debugger. I am the wrong person to ask about using the debugger, though.
What I posted is literally all the code, excluding the actual function calls in main.

I'm not sure what you mean by standard streams and I shouldn't have anything unusual. I just installed VS and ran the code. Nothing else.

After running the profiler, the majority (87.6%) of the CPU usage is in:

while (dataFile >> d.date >> d.time >> d.mil >> c >> d.last >> c >> d.bid >> c >> d.ask >> c >> d.volume)


Most of that is coming from the called functions msvcp140.dll!0x007ffbf050849f and msvcp140.dll!0x007ffbf0506bef.

Edit:

I also ran a profile to see what is taking the time: the >> operator is taking 43.18% and istream 29.26% of CPU time.

I don't know if that means much, but that seems to me to be what is causing the issue.
Your file is literally a list of numbers (integers and floating-point values), and your code converts every single one of them.

So that is naturally going to be where most of your program’s time is spent.
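
If you ever want to squeeze that, <charconv> is about the cheapest conversion the standard library offers. A rough sketch (the line layout is guessed from your extractor chain, and the floating-point from_chars overloads need a fairly recent standard library: current MSVC has them, libstdc++ from GCC 11):

#include <charconv>
#include <string>
#include <system_error>

// Data as defined earlier in the thread.

// Parses one "date time mil,last,bid,ask,volume" line. Returns false on bad input.
bool parseLine(const std::string &line, Data &d)
{
    const char *p   = line.data();
    const char *end = p + line.size();

    auto number = [&](auto &out) {
        while (p != end && (*p == ' ' || *p == ',' || *p == '\t')) ++p; // skip separators
        auto res = std::from_chars(p, end, out);
        p = res.ptr;
        return res.ec == std::errc();
    };

    return number(d.date) && number(d.time) && number(d.mil)
        && number(d.last) && number(d.bid) && number(d.ask)
        && number(d.volume);
}
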
Topic archived. No new replies allowed.