Code simply stops randomly?

Hello guys,

I am experiencing strange code behavior.

At my university institute we have a "cluster", i.e. thousands of CPUs that we can use to run our simulations with different input parameters simultaneously.

In my case, I use different random seeds for a random number generator.

However, it seems that on some of the cluster nodes my code does not complete. There is no error message or anything (well, apart from a cluster error message that the task unexpectedly stopped). From the output that is written to a file, I can see that the code randomly stops at a point in the program where nothing should go wrong. It does not happen with all tasks; on most cluster nodes the program finishes just fine.

When I try to reproduce the error on my own computer, using the same input parameters (including randomseed) of one of the flawed cluster runs, the code completes just fine.

Now, my randomseed generates random numbers in an interval given by two numbers that are obtained via floating-point arithmetic (see my other thread), so even if I use the same randomseed on my own computer, the execution might not be completely equivalent.

Still, it makes no sense. What does it mean when a program simply stops without an error message and without finishing properly?
I am compiling the code on my own computer and only executing it on the cluster. Maybe I should try to compile it on the cluster nodes themselves...

Just wanted to ask whether people here are familiar with such behavior.


Best,
PhysicsIsFun.

Still, it makes no sense. What does it mean when a program simply stops without an error message and without finishing properly?


Most often this is due to some 'undefined behavior' issue - code that doesn't work occasionally, but does most often, must be doing something unpredictable.

Without code, however, I can't see what that might be.
That's the signature of undefined behavior. Most likely there is an error in your code that only shows up in specific circumstances.
It could also be an infinite loop. Are the programs sharing anything? Could it be a deadlock? Those three things (undefined behavior, infinite loop, or deadlock) are all possible here.
Thanks for your input.

Would an infinite loop not keep running... like... infinitely?^^


edit: Deadlock seems to be a concept from multithreading, right? I am not doing that, each simulation is only executed by a single CPU.
The best way for us to help is for you to post your code. It's nearly impossible otherwise.

In your other thread, you mention that the random number generator takes a and b which are the result of some floating point calculations. Could you be dividing by zero or otherwise creating invalid values?
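Since we can't see your code, here is only a rough sketch of the kind of check I mean. It assumes the bounds are plain doubles called a and b and that they end up in something like std::uniform_real_distribution; the function name drawFromInterval is made up for illustration.

```cpp
#include <cmath>
#include <random>
#include <stdexcept>

// Hypothetical helper: a and b stand for the interval bounds that come out of
// the floating-point calculations. Refuse to continue if they are unusable.
double drawFromInterval(std::mt19937& gen, double a, double b)
{
    // inf or nan bounds are the typical result of x / 0.0 or 0.0 / 0.0.
    if (!std::isfinite(a) || !std::isfinite(b))
        throw std::invalid_argument("interval bound is inf or nan");

    // uniform_real_distribution requires a <= b; a < b is needed for a
    // non-empty range. This comparison is also false if either value is nan.
    if (!(a < b))
        throw std::invalid_argument("interval is empty or reversed");

    std::uniform_real_distribution<double> dist(a, b);
    return dist(gen);
}
```

If a check like this ever throws on the cluster, you at least get a message that says which input was bad instead of a silent stop.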

Does the cluster give you a core file for the stopped programs? You could examine that with a debugger to find the problem.

Programs exit with a return code (the value returned from main(), or a larger value indicating the reason that the program was terminated by the operating system). Does the cluster tell you the return code?
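If the cluster does not report it, a small wrapper could. The following is just a sketch, assuming a POSIX system (fork and waitpid are not available on Windows); describeStatus and runAndDescribe are hypothetical names, and a real wrapper would exec your simulation instead of calling a function.

```cpp
#include <string>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Turn the status reported by waitpid() into a readable description.
std::string describeStatus(int status)
{
    if (WIFEXITED(status))    // normal exit: value returned from main() or passed to exit()
        return "exited with code " + std::to_string(WEXITSTATUS(status));
    if (WIFSIGNALED(status))  // terminated by the operating system via a signal
        return "killed by signal " + std::to_string(WTERMSIG(status));
    return "stopped or unknown";
}

// Run `task` in a child process and report how that child ended.
std::string runAndDescribe(void (*task)())
{
    pid_t pid = fork();
    if (pid < 0)
        return "fork failed";
    if (pid == 0) {
        task();
        _exit(0);  // leave the child without running the parent's cleanup handlers
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return "waitpid failed";
    return describeStatus(status);
}
```

A "killed by signal" result would point towards a crash or the scheduler killing the job, while a nonzero exit code means the program itself decided to stop.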

If you can't get a core file then you may need to add some debugging output to help see what the program is doing.

But if you post your code, someone here might be able to locate the problem.

I misread something; it's not an infinite loop if it crashed.
If it crashed, don't rule out disk full or out of memory or other local problems that have nothing to do with your code (directly).
@dhayden

The core output file for the defect runs simply states that the task died...
There is another output file that simply prints what otherwise would be given by "cout<<...". From this file, I can see that it happens at completely random points in the code (and not, say, always during the same procedure).

Wouldn't numerical errors like dividing by zero give a runtime error or something like that?

I can't post the code. On one hand, it is too large. On the other hand, I am not sure whether I am allowed to publish it here^^"
If you can't directly use a debugger, then all I can suggest is to add more print statements. Since it sounds like you are redirecting to output files, make sure you flush the stream after each print statement; otherwise buffered output can be lost when the program dies, and the file will not show where it actually crashed.
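Something along these lines would do, as a minimal sketch; logStep is a made-up name and the file name in the usage note is only a placeholder.

```cpp
#include <fstream>  // only needed for the std::ofstream in the usage note below
#include <ostream>
#include <string>

// Write one progress line and flush right away, so the line reaches the
// file even if the process is killed immediately afterwards.
void logStep(std::ostream& out, const std::string& message)
{
    out << message << '\n' << std::flush;
}
```

Usage would look like `std::ofstream log("progress.log"); logStep(log, "entering sweep 17");`. Writing `std::endl` instead of `'\n' << std::flush` has the same effect.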

Edit:
Possibly more important: I have seen no information about what your environment is, other than that it's single-threaded. What is the OS? How much memory do you have to work with? Is this on a virtual machine?

If this is Windows, dump files can be generated when an application crashes.
https://docs.microsoft.com/en-us/windows/desktop/wer/collecting-user-mode-dumps
You can then examine the crash dump file to figure out which part of the stack you were on, and other states.

I don't know what capabilities other OSes have.

Edit 2:
You also should conduct some sanity tests. Does a simpler program that just runs in a busy loop of some sort, doing basic computation, also crash when run in this environment?
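As a rough idea of what such a sanity test could look like (function names are hypothetical): do some deterministic integer work, repeat it, and check that every round gives the same answer. On healthy hardware the rounds must always agree.

```cpp
#include <cstdint>

// Deterministic busy work: a 64-bit linear congruential recurrence using
// the multiplier and increment from Knuth's MMIX generator.
// The same inputs must always give the same result.
std::uint64_t busyChecksum(std::uint64_t seed, std::uint64_t iterations)
{
    std::uint64_t x = seed;
    for (std::uint64_t i = 0; i < iterations; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

// Repeat the computation and report whether every round agreed with the first.
bool roundsAgree(std::uint64_t seed, std::uint64_t iterations, int rounds)
{
    const std::uint64_t expected = busyChecksum(seed, iterations);
    for (int r = 1; r < rounds; ++r)
        if (busyChecksum(seed, iterations) != expected)
            return false;
    return true;
}
```

If a trivial job like this also dies or disagrees with itself on some nodes, the problem is very unlikely to be in your simulation code.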
The core output file for the defect runs simply states that the task died...
Can you get a stack trace from the core file? That alone may point to the problem. Better yet would be to examine the core file with a debugger.
Hi guys,

thank you again! Unfortunately, I cannot answer most of your questions. But I will keep testing my stuff on the cluster and talk to the admin later, also with respect to your suggestions!

Thanks!
I believe whether it crashes on a divide by zero depends on a compiler flag. It may or may not; divide by zero sets a floating point nan value that usually propagates. Eventually everything you print will have 'nan' in it. You can check your compiler flags and see if you can find the control for this if you think it happened.
divide by zero sets a floating point nan value
I suppose I'm being picky, but the result of divide by zero is actually +/- infinity unless it's 0.0 / 0.0, which is "not-a-number" (nan). I agree that the result tends to be contagious. Almost any expression containing inf or nan will result in inf or (more likely) nan.
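A quick way to see this for yourself, as a small sketch that assumes IEEE 754 doubles (the static_assert checks that) and default compiler settings without aggressive floating-point options; divide is just an illustrative helper.

```cpp
#include <cmath>
#include <limits>

// The inf/nan results described above are only guaranteed for IEEE 754 doubles.
static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 floating point");

// Plain division. Taking the operands as parameters avoids writing a
// literal division by zero that the compiler might warn about.
double divide(double numerator, double denominator)
{
    return numerator / denominator;
}
```

With that, `std::isinf(divide(1.0, 0.0))` and `std::isnan(divide(0.0, 0.0))` are both true, and the contagion shows up in something like `divide(1.0, 0.0) - divide(1.0, 0.0)`, which is inf minus inf and therefore nan.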
Hi again,

just to let you know:

The problem seemed to stem from the cluster hardware. It was really hot in my part of the world at that time, and the current theory is that the hardware got too hot.
The error suddenly stopped appearing.


@PhysicsIsFun,

The error suddenly stopped appearing.


ooooooooo I hate it when that happens!

"Undefined behavior from the HARDWARE!"

...was wondering what happened to you...glad it makes sense.

Check the Freon, clean the coils :)

Heh... ascribing hard-to-diagnose problems to "thermal issues" became a bit of a running joke at a previous job :)
Well, as long as the error does not reappear, I take it as a convenient explanation and move on with my life :D
And keeping a close eye on the weather forecast :P