Running MPI on Linux

Hello,

I have parallelized my code using MPI with the Master-Slave scheme. The code is too large to post here.

It works fine on my Mac, but I have a problem when I run it on Linux.

When I run the code on Linux for a short run of about a hundred steps, it works fine and generates all the output files. For thousands of steps, however, it seems to hang. When I run squeue, the job is no longer in the queue, which usually means it has finished, but none of the output files have been generated. When I check the SLURM output file, it looks as if the job were still running.

I don't have this problem when I run it on my Mac. I compile it with the latest gcc and g++ compilers.

I am confused! Does anyone have any idea what might be going wrong with the code?

Thanks,
M

'squeue' is a SLURM tool that shows jobs that are running or waiting in the queue. It does not show completed, failed, or killed jobs.

The commands 'sacct' and 'seff' can report on completed jobs (depending on how SLURM has been configured).


Based on the information you have given so far, the simplest assumption is that there is nothing wrong with the code.

Thank you for your reply. I worked on it a little more, and I think it runs out of memory.
I am broadcasting a lot of data to all the slave nodes in each iteration of a large loop.
It seems that the buffer is not being freed. I tried sending a finalize signal to the slaves to tell them that the data transfer is over, but the broadcast only happens in the first iteration, and then the program waits forever.

Any suggestions on how I can avoid this?

Thanks,
M