Strangest Random Bug Ever Encountered..ever

Built/Tested on Win 7 64bit using VS2015 Pro.

Project Solutions:
http://s000.tinyupload.com/index.php?file_id=79811894669876614450

I dare anyone here to:
1. Build and launch the TestHost.exe console app.
2. Type unittest_1|1000
3. Wait for the console to freeze.
4. Try to explain why setting the ticket status is randomly and completely ignored in the threadRun_ method of ThreadManager.

Sorry, I don't know how to post the entire project code, so I wrapped it in a zip.

// Ramblings below;

Intro
I seriously have no idea what is going on here. I have written a dll and a test application that tests the dll. Everything works absolutely fine! Except this random error. This project is not large, but too large to throw code all over your screen. I have no trouble sharing it as it's more a learning experience than anything.

I will try to explain as simply as possible;

Idea
The dll keeps track of all unique IDs in use with a ticketStatusList map.
When an item is queued, its ID is added to that map with a default status (READY).
When an item is processed, the status for that ID is updated (DONE).

This lets me easily check whether the data has been processed and can be retrieved.
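
For readers following along, the idea above can be sketched roughly like this. It is a minimal illustration, not the project's actual code: the class name TicketStatusList, the Status enum values, and the UNKNOWN fallback are assumptions made for the example. The only point is that every read and write of the shared map goes through one mutex.

```cpp
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical names throughout; the real ThreadManager may differ.
enum class Status { UNKNOWN, READY, WAIT, DONE };

class TicketStatusList {
public:
    // Record a newly queued ticket with the default READY status.
    void add(std::uint64_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        statuses_[id] = Status::READY;
    }

    // Update the status of a ticket.
    void set(std::uint64_t id, Status s) {
        std::lock_guard<std::mutex> lock(mutex_);
        statuses_[id] = s;
    }

    // Look up a ticket; ids that were never queued report UNKNOWN.
    Status get(std::uint64_t id) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = statuses_.find(id);
        return it == statuses_.end() ? Status::UNKNOWN : it->second;
    }

private:
    mutable std::mutex mutex_;                  // guards statuses_
    std::map<std::uint64_t, Status> statuses_;  // ticket id -> status
};
```

Note that a mutex like this only protects each individual map access. It does not by itself decide the order in which two threads perform their updates.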

Problem
The status of a random task item will not update at all. Even with specific checks and re-checks, it's as if there were more than one version of the ticket status map, where one version is updated but not the other.

Bug Reproduction (As close as possible)
Open the cb_ff_dbi solution, set it to a release build, and build. Run it and type unittest_1|1000 in the console window with no spaces. If the bug occurred, you will notice the console freeze while CPU usage stays the same. It will not ask for more input. If it does ask, just re-enter unittest_1|1000.


I feel I cannot proceed in my C++ learning until I work out the cause of this problem. It feels like I am missing something severely fundamental.

Thanks for any help!
I managed to build and run it.
Worked well with unittest_1|1000. Took a few seconds but no freeze.
Repeated once, again no problem.
Tried once with unittest_1|100, again no problem.
Tried with unittest_1|10, see output
Output:

Enter Request: unittest_1|10
submitted 10 tickets
rec: 0|#success|data = 41 bytes
rec: 4|Multi-Part (4) for 19169bytes.
rec: 6|#success|data = 11 bytes
rec: 7|#success|data = 11 bytes
rec: 1|Multi-Part (1) for 6334bytes.
rec: 2|#success|data = 11 bytes
rec: 8|#success|data = 11 bytes
rec: 3|#success|data = 11 bytes
rec: 5|#success|data = 11 bytes
rec: 9|#success|data = 11 bytes
submitted 10 tickets
rec: 10|#success|data = 11 bytes
rec: 14|Multi-Part (1) for 5705bytes.
rec: 15|Multi-Part (5) for 23281bytes.
rec: 9|Multi-Part (2) for 11478bytes.
rec: 12|#success|data = 11 bytes
rec: 11|Multi-Part (6) for 26962bytes.
rec: 16|Multi-Part (2) for 9961bytes.
rec: 13|#success|data = 11 bytes
rec: 17|#success|data = 2995 bytes
rec: 18|Multi-Part (1) for 4827bytes.
submitted 10 tickets
rec: 20|#success|data = 11 bytes
rec: 21|#success|data = 3902 bytes
rec: 24|Multi-Part (4) for 17421bytes.
rec: 18|Multi-Part (8) for 32391bytes.
rec: 22|#success|data = 11 bytes
rec: 19|#success|data = 11 bytes
rec: 23|#success|data = 292 bytes
rec: 25|#success|data = 11 bytes
rec: 26|#success|data = 11 bytes
rec: 27|#success|data = 11 bytes
submitted 10 tickets
rec: 27|#success|data = 11 bytes
rec: 31|Multi-Part (1) for 5447bytes.
rec: 29|#success|data = 11 bytes
rec: 34|Multi-Part (6) for 25667bytes.
rec: 28|#success|data = 11 bytes
rec: 30|Multi-Part (4) for 19718bytes.
rec: 36|Multi-Part (7) for 28703bytes.
rec: 32|Multi-Part (3) for 14771bytes.
rec: 33|#success|data = 1869 bytes
rec: 35|Multi-Part (4) for 17035bytes.
submitted 10 tickets
rec: 39|#success|data = 11 bytes
rec: 37|Multi-Part (7) for 31322bytes.
rec: 36|#success|data = 11 bytes
rec: 40|#success|data = 11 bytes
rec: 42|Multi-Part (4) for 17673bytes.
rec: 38|#success|data = 11 bytes
rec: 41|#success|data = 11 bytes
rec: 43|#success|data = 11 bytes
rec: 44|Multi-Part (3) for 15141bytes.
rec: 45|#success|data = 11 bytes
submitted 10 tickets
rec: 46|Multi-Part (6) for 25547bytes.
rec: 45|Multi-Part (7) for 28253bytes.
rec: 48|Multi-Part (8) for 32662bytes.
rec: 54|Multi-Part (3) for 12316bytes.
rec: 47|#success|data = 11 bytes
rec: 49|#success|data = 11 bytes
rec: 52|#success|data = 11 bytes
rec: 50|Multi-Part (5) for 20037bytes.
rec: 53|Multi-Part (6) for 27529bytes.
rec: 51|Multi-Part (2) for 8723bytes.
submitted 10 tickets
rec: 54|#success|data = 11 bytes
rec: 56|#success|data = 11 bytes
rec: 59|#success|data = 11 bytes
rec: 58|#success|data = 288 bytes
rec: 55|#success|data = 11 bytes
rec: 57|Multi-Part (5) for 22190bytes.
rec: 60|Multi-Part (2) for 9040bytes.
rec: 61|Multi-Part (4) for 19264bytes.
rec: 62|#success|data = 11 bytes
rec: 63|#success|data = 11 bytes
submitted 10 tickets
rec: 68|Multi-Part (6) for 24370bytes.
rec: 65|#success|data = 11 bytes
rec: 67|Multi-Part (4) for 15890bytes.
rec: 70|Multi-Part (6) for 24393bytes.
rec: 63|#success|data = 11 bytes
rec: 64|Multi-Part (6) for 27446bytes.
rec: 66|#success|data = 11 bytes
rec: 69|Multi-Part (3) for 15006bytes.
rec: 71|#success|data = 11 bytes
rec: 72|#success|data = 11 bytes
submitted 10 tickets
rec: 77|Multi-Part (4) for 18756bytes.
rec: 78|Multi-Part (1) for 4966bytes.
rec: 72|#success|data = 11 bytes
rec: 73|Multi-Part (4) for 19629bytes.
rec: 74|#success|data = 11 bytes
rec: 79|Multi-Part (3) for 13931bytes.
rec: 76|Multi-Part (6) for 24084bytes.
rec: 81|Multi-Part (4) for 16944bytes.
rec: 75|#success|data = 11 bytes
rec: 80|#success|data = 11 bytes
submitted 10 tickets
rec: 81|#success|data = 11 bytes
rec: 87|Multi-Part (1) for 5537bytes.
rec: 89|Multi-Part (5) for 22929bytes.
rec: 84|#success|data = 11 bytes
rec: 82|#success|data = 11 bytes
rec: 86|#success|data = 11 bytes
rec: 83|Multi-Part (6) for 24626bytes.
rec: 85|#success|data = 11 bytes
rec: 88|Multi-Part (4) for 16118bytes.
rec: 90|#success|data = 11 bytes
UT-END---------------------------------------------
Enter Request:

Entered quit and the program terminated.
Thank you for testing it! really means a lot!

My own thoughts
I have noticed that if I build for WinXP and run in compatibility mode, it won't freeze!
I'm seriously wondering whether something is wrong with my computer.

Can anyone think of any reason why using the WinXP shim (Program Compatibility XP SP3) would make a difference?

Virtual Machine Tests
I tried it in Windows virtual machine setups:
Windows 7 x86 Ultimate: same issue (freezes randomly; the ticket status doesn't update for 1 ticket in n tickets).
Windows XP Pro SP3: no problems whatsoever. I can run any number of tickets as fast as I want with no freeze.

Trying to catch the bug, I updated threadRun_ like so (notice the extra status check at the end):
void ThreadManager::threadRun_()
{
	while (!isQueueEmpty())
	{
		// Get queue item
		ThreadTaskItemUnique * taskItem = &threadFrontPopQ_();

		// set status wait
		threadSetTicketStatus_(taskItem->id(), ThreadTaskItemUnique::Status::WAIT);

		// execute queue item
		std::string result = (cbHostInstance_->*taskItem->cbFuncPtr())(taskItem->data());

		// set status done
		threadSetTicketStatus_(taskItem->id(), ThreadTaskItemUnique::Status::DONE);

		// store result
		threadInsertResult_(taskItem->id(), result);

		// check status actually is done
		if (ticketStatusById(taskItem->id()) == ThreadTaskItemUnique::Status::READY)
		{
			threadSetTicketStatus_(taskItem->id(), ThreadTaskItemUnique::Status::DONE);
			//throw EXCEPTION_DEBUG_EVENT;
		}
	}
};

I set a breakpoint above the exception line in an attempt to catch this problem, and it does catch it, randomly! It will break the first and maybe the second time the status fails to update, but then the program will randomly freeze again as if even this check is ignored. No break happens.

Conclusion
I'm starting to think there is something inherently different about Windows 7 on my computer. If anyone tests this, can they tell me what processor and operating system they tested it on?

Thanks people!
Maybe it really is my CPU... But then why wouldn't the double check work? Why does it skip all those checks? I could put a while loop in there and it would still skip it as if the ticket were done, when it is actually still only "ready".

My system specs;
VS2015 Ultimate
Windows 7 Ultimate x64
I5-2500K (has been running high temps... pc even restarts attempting performance index)
Confirmed: the freeze happens on my friend's Windows 7 machine too, even more often than on mine.

There really seems to be something inherently wrong here, but I just can't see it. It seems to simply ignore anything to do with the ticket status for a random ticket now and then.

I don't even know where to start tracking down this bug. Maybe the mutex locks? Maybe the method I'm using to add/update the map item (the [] operator)? But I did try using an iterator and got the same thing...

Something that Windows 7 has or does, and that XP doesn't have or do, is causing this, so maybe that can narrow it down for some computer science genius lol.

Grrrrr so stuck on this I will not let this go until I know whyyyyyyy xD.
Thanks again.


edit --
I'm thinking it could also be something to do with my setup. There have been various builds of Visual Studio from different versions on this computer, along with other IDEs for other languages like Java. Perhaps something in my build configuration is causing this? But even then, why would it be so random, yet also happen on other computers without Visual Studio?

Also, it is pretty random: sometimes it freezes after 100 tickets, sometimes you have to go through 100,000 tickets to get a freeze.
Can't help much, ChemicalBliss, other than to say it's a classic memory corruption bug. I live in dread of these things. One I recall cost me a whole week. Another, three weeks. Yet another took two months, and that particular one I never really solved, other than that I finally tracked it down to a particular version of a compiler I was using. Changing compiler versions solved it. And that one only occurred on Win 2000/XP, not on Win 7. Yet another one existed in important code I have and randomly struck every couple of years, and folks using my program would lose data. It was a handheld data collector program in Windows CE. Finally we started using another model of data recorder, and the bug surfaced on every program run, so I was able to find it and kill it. It turned out to be a variable I had failed to initialize. These things are serious and can ruin big pieces of one's life. They've taught me to code pretty defensively, and to use the simplest techniques I can to do things.
Perfect thanks freddie! That's just what I needed, someone with experience with this kind of problem.

I can tell already it will probably take me a long, long time to work out exactly what's wrong here (tearing out chunks of code to reproduce the problem locally, or whatever).
But I guess I can check a few things first, like the fact that it only happens on Windows 7, or at least not on Windows XP.

Thoughts

It's only to do with the ticket status map AFAIK. Everything works fine - it is literally just this ticket status map that has the issue and anything related to it (the if checks/switches).

So I guess I'll try to get someone else to compile this project with as many different compilers as possible. That way I should be able to rule out a compiler issue (I have my doubts that it is one).

The only other thing (luckily it's a relatively small application) I can think of doing is learning assembly and stepping through the entire program to make sure all variables are initialized, no memory leaks etc.

I will find this though, I will find the culprit if it takes me a year, I need to.

Ideas/Plan going forward
If anyone can think of any best practices for debugging this issue (aside from aforementioned) please tell me, anything that will help me find this bug, any ideas even fringe cases, at least I'll be able to rule them out.

Things I have ruled out so far-
1. Hardware issue (Reproduced on separate machine)
2. Logic issue (Checked the code, logic is fine)

Things it could still be -
1. Compiler
2. Code Semantic/mechanic (could be uninitialized variable I guess)

If anyone can help me think of any other reasons this type of memory corruption could occur, so I can grow my "things it could be" list, that would be appreciated.
Any advice on how to check the code thoroughly, on defensive programming, etc. would also be nice.

Thanks again guys!
I have no experience with this kind of code but some ideas - kind of brain storm.

- use a log file. If it freezes the last log entry could give you a hint.

- use of assertions and exceptions

- use of Unit Tests

- check the loops - sometimes a freeze comes from an endless loop

- try to move the code into a normal program
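
As one concrete form the assertion idea could take, here is a small sketch that checks every status change. The names TicketState, isForwardTransition and setState are made up for this example, and the rule that a status may only move forward (READY, then WAIT, then DONE) is an assumption about how the tickets are meant to behave.

```cpp
#include <cassert>

// Hypothetical status enum, ordered so that later stages compare greater.
enum class TicketState { READY = 0, WAIT = 1, DONE = 2 };

// Statuses may only move forward: READY -> WAIT -> DONE.
inline bool isForwardTransition(TicketState from, TicketState to) {
    return static_cast<int>(to) > static_cast<int>(from);
}

// Guarded setter: in a debug build the assert stops the program at the
// exact moment a ticket is pushed backwards (for example DONE -> READY).
inline void setState(TicketState& current, TicketState next) {
    assert(isForwardTransition(current, next) && "ticket status moved backwards");
    current = next;
}
```

Keep in mind that assert compiles to nothing when NDEBUG is defined, which is the usual case for a release build. If the freeze only reproduces in release, the same check written as a plain if with a log line would be needed instead.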
Thanks Thomas. Mate, I could use any help xD. Let me clarify a few things:

Clarification of Problem
I know exactly where the problem with the infinite loop lies (it's in testhost.exe). The main program tries to get the ticket, and the dll then checks whether the status of that ticket ID is DONE. If it isn't, the dll returns the status of the ticket to the main program (testhost.exe).

The unit test (in testhost.exe) doesn't throw away any tickets; it keeps asking for the last ticket in its list. (You can see that this test submits 10 tickets at a time, then requests random ones until it has all ten, then loops again for X tries, where X is the number after the pipe symbol.)

So the loop happens because the ticket status never changes. That is the problem: the ticket status map in the dll (Thread class).

Conclusion
I have tried logging everything the program does, and it works. It actually seems to fix the corruption issue: the bug doesn't show itself in millions of unit test cycles. It's like the observer effect in quantum theory. Really, really strange haha.

I will try to set up some extra specific unit tests for parts of the program but it really has to go through the whole process to be fully tested imo, data is passed through many objects/methods so the problem is most likely arising in that situation.

Assertions: I'm going to look into those tonight. I've noticed a few asserts being used in other code for sanitization of data, so I have the basic concept. But I already tried that debug_exception or debugger_break_exception or w/e (code above) with an if statement; you can read what the result was in that post. Strange again.

And yes, I am going to try to isolate this problem as much as I can, by stripping code as thin as possible and by ruling out possibilities of compiler problems etc with other testers building/testing the application.

If anyone can think of anything else, even if you're inexperienced, please let me know your thoughts :).

Thanks
I second what Thomas just said. Try to keep eliminating things from the code so as to rule them out. Output log files are my primary method of debugging. I print out tons of stuff. When the going gets tough, that means the value and address of just about every variable. Be wary of eliminating something from the code, seeing the bug gone, and assuming you've found the culprit. What can happen is that when code is eliminated, it changes how the compiler lays things out in the binary or in memory, and something can go wrong somewhere else. Yes, it truly is sinister.

I wish I could help, but I don't do anything with the C++ Standard Library. I exclusively use the C Standard Library and my own C++ library code I've developed. So I'm not familiar with C++ Std. Lib. lists, maps, etc.
Thanks freddie, but I tried logging stuff to a file, and I don't get the problem anymore as far as I can tell. But I don't trust it to be fixed just by logging. Therein lies the dilemma.

edit- SUCCESS
Ok so actually managed to get this to happen with logging.

The problem was that the main thread called addTask, which added the task item to the shared queue and only then set the ticket status to READY.
For one ticket out of many, the main thread was apparently delayed, so the READY line executed after several tickets (including the one just added) had already been processed to DONE. So it was setting the status to READY after it had been set to DONE.

Setting the ticket to READY before the task item is added to the queue seems to resolve this issue.
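
A rough sketch of that ordering, for illustration only. The class name TaskBoard and its methods are hypothetical, and holding a single mutex across both steps in addTask is an assumption of this example rather than something taken from the project. With it, the worker can never see a queued task whose status has not been published yet.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <queue>

enum class TicketStatus { READY, WAIT, DONE };

class TaskBoard {
public:
    // Producer side: publish READY first, then expose the task to the
    // worker, both under one lock so the worker cannot run in between.
    void addTask(std::uint64_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        status_[id] = TicketStatus::READY;
        queue_.push(id);
    }

    // Worker side: take one task and mark it DONE.
    // Returns false when the queue is empty.
    bool processOne() {
        std::uint64_t id;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (queue_.empty()) return false;
            id = queue_.front();
            queue_.pop();
            status_[id] = TicketStatus::WAIT;
        }
        // ... the real work would run here, outside the lock ...
        std::lock_guard<std::mutex> lock(mutex_);
        status_[id] = TicketStatus::DONE;
        return true;
    }

    // Throws std::out_of_range for ids that were never added.
    TicketStatus status(std::uint64_t id) {
        std::lock_guard<std::mutex> lock(mutex_);
        return status_.at(id);
    }

private:
    std::mutex mutex_;                              // guards both members
    std::map<std::uint64_t, TicketStatus> status_;  // ticket id -> status
    std::queue<std::uint64_t> queue_;               // pending ticket ids
};
```

If the two steps were swapped and not covered by the same lock, the worker could pop the task and write DONE before the producer's late READY write lands, which matches the stuck tickets described above.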
---

Marking solved. Now to find out why the VS debugger isn't tracking variable data properly...

Thanks again guys I'm so glad I figured this out finally!

-- edit 1 -
Argh, now after 500k cycles it froze again... dammit.
And of course adding the logger now stops the problem occurring in 1,000,000 cycles.

I think it is the same problem as above, but why would an instruction like that (set ticket status) end up after the add-to-queue method call? hmmmmmmm.

-- edit 2 -
51,960,000 cycles, with log = no freeze/errors/problems at all.

-- edit 3 - This was one of two multi threading issues.
So it seems that the other problem was that the worker thread would not yet be waiting for the signal just after a task was inserted into the queue, so the signal was missed. Sometimes it would not process the last item.

For ex:
        _Thread1_                    _Thread 2_
                        --   Thread->WaitforSignal
Thread->InsertQueueItem --
Thread->SignalWorker    --
                        --   Thread->WhileQueueHasItems (1)
                        --   Thread->processNextItem
                        --   Thread->WhileQueueHasItems (0)
Thread->InsertQueueItem --
Thread->SignalWorker    --
                        --   Thread->WaitforSignal



Temp fix: By signalling worker on every request if waiting.
Perm fix: Check for "InsertingQueueItem" flag in "WhileQueueHasItems" loop?

Maybe...
while (somevar)                 // was "somevar = true", which assigns instead of testing
{
      if (QueueHasItems() || InsertingQueueItem)   // logical OR rather than bitwise |
      {
            //...
      }
      else
      {
            somevar = false; // break;
      }
}
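
Another common way to close this kind of gap, assuming the signal is (or could be replaced by) a std::condition_variable, is to wait with a predicate. The sketch below is not the project's code; WorkQueue, close and sumWithWorker are names invented for the example. Because the predicate is evaluated while holding the lock, an item pushed before the worker starts waiting is still seen.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

class WorkQueue {
public:
    // Producer: enqueue under the lock, then notify the worker.
    void push(int item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(item);
        }
        cv_.notify_one();
    }

    // Ask the worker to stop once the queue has drained.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Worker: blocks until an item is available or the queue is closed.
    // Returns false only when the queue is closed and empty.
    bool pop(int& out) {
        std::unique_lock<std::mutex> lock(mutex_);
        // The predicate is checked under the lock before sleeping, so a
        // push that happened before this call is never missed.
        cv_.wait(lock, [this] { return !items_.empty() || closed_; });
        if (items_.empty()) return false;
        out = items_.front();
        items_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<int> items_;
    bool closed_ = false;
};

// Demo: one worker thread sums everything the producer pushes.
inline long sumWithWorker(int count) {
    WorkQueue q;
    long total = 0;
    std::thread worker([&] {
        int v;
        while (q.pop(v)) total += v;
    });
    for (int i = 1; i <= count; ++i) q.push(i);
    q.close();
    worker.join();  // after join, reading total is safe
    return total;
}
```

The design choice here is that the queue's state, not the signal itself, decides whether the worker sleeps. A notification that arrives while the worker is busy is harmless, because the worker re-checks the queue before it waits again.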


Side note
Also, I was setting the ticket ID status to DONE BEFORE inserting the ticket data. Not a good idea, considering a client might request data that has not been inserted yet.
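
A small sketch of the safer ordering, with hypothetical names (ResultStore, complete, tryGet) and the assumption that the result and the done flag are guarded by the same mutex. The result is stored first, and a client can only obtain it once the ticket is flagged as done.

```cpp
#include <cstdint>
#include <map>
#include <mutex>
#include <string>
#include <utility>

class ResultStore {
public:
    // Worker: store the result first, then flag the ticket as done,
    // both under the same lock.
    void complete(std::uint64_t id, std::string result) {
        std::lock_guard<std::mutex> lock(mutex_);
        results_[id] = std::move(result);
        done_[id] = true;
    }

    // Client: succeeds only once the ticket is done, in which case the
    // result is guaranteed to be present.
    bool tryGet(std::uint64_t id, std::string& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = done_.find(id);
        if (it == done_.end() || !it->second) return false;
        out = results_.at(id);
        return true;
    }

private:
    std::mutex mutex_;                            // guards both maps
    std::map<std::uint64_t, std::string> results_;
    std::map<std::uint64_t, bool> done_;
};
```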

6*(10^6) tickets later, no problems. I think i finally cracked it haha.

Resolved

Why this didn't show up on WinXP I'm not certain. Perhaps stricter thread control?
That's not at all unusual. I've had it happen many times that bugs show up when moving to the next Win OS. Kind of scary really. We write code and do our best, but really have no idea what punishment we're dishing out to the OS, and it's just absorbing it for us. Then with the next OS, BAM!
Thanks again freddie and grats on your 1337(th) post ;).

I guess the only real method of testing code on different platform versions is to brute-force test it on the platforms concerned.

I guess it was just luck that XP didn't show this problem, as it seems a fairly reasonable assumption that this would happen on WinXP machines too (especially under XP compatibility shims on Win7). Ah well, Test Test Test... haha

cheers