Distributed computation

I have an easily parallelizable workload that I would like to run on multiple computers on the same network.
Ideally I'd like it to work something like this:

* A manager process at a known location listens for incoming connections from workers.
* The manager handles the details of work assignment to online workers, storing results, networking, etc. It merely asks a loaded module for details of how to partition the workload.
* A worker process connects to the manager and spins up threads according to the computer's capabilities, calling into a module in each thread to do the computation and then sending the completed results back to the manager.
* Again, the worker takes care of the low-level details such as networking and threading.

The code in the loaded module (my code) should be something not much more complex than this:
//Called from manager
Workload get_workload(){
    ...
}

//Called from worker
Result process_workload(Workload &){
    ...
}

//Called from manager
void results_complete(const std::vector<Result> &){
    ...
}

Does anyone know of a system/library/whatever that does something like this, or how I could search for it?
Message Passing Interface (MPI) could implement manager-workers, though not quite the way you describe. MPI is a library.

Schedulers like HTCondor and SLURM could launch jobs (including MPI) at remote machines.
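
For flavor, here's a minimal MPI manager-worker sketch. It's a toy (the "workload" is just an int), and it assumes an MPI implementation such as Open MPI is installed:

//Compile: mpic++ demo.cpp -o demo
//Run:     mpirun -np 4 ./demo
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0){
        //Manager: hand one work item to each worker...
        for (int w = 1; w < size; ++w){
            int workload = w * 100;
            MPI_Send(&workload, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        }
        //...then collect one result from each.
        for (int w = 1; w < size; ++w){
            int result;
            MPI_Recv(&result, 1, MPI_INT, w, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("result from worker %d: %d\n", w, result);
        }
    }else{
        //Worker: receive a work item, "process" it, send the result back.
        int workload;
        MPI_Recv(&workload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        int result = workload + 1; //stand-in for the real computation
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
}

The main difference from your design: mpirun launches all the processes together, rather than workers connecting to a long-running manager at a known address.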
Thanks, I'll look into it. What keywords should I use when searching for stuff like this? "Remote scheduler"?
I'm interested in this as well. I have a similar use case for which I currently use boost::interprocess (I can't just use threads because a library I rely on for the computation is not thread-safe), but I want to move to a distributed version. I searched for 'RPC HPC' and 'actor model HPC' and came across Charm++ ( http://charmplusplus.org/ ). It looks nice, but I haven't had time to evaluate it yet. In any case, if you settle on something, please share.
helios wrote:
I have an easily parallelizable workload that I would like to run on multiple computers on the same network.

You should look at the source code for the BOINC Manager (Berkeley Open Infrastructure for Network Computing).

https://boinc.berkeley.edu/trac/wiki/SourceCodeGit

Volunteer Mac developers are needed; the current solo one is leaving the project.

https://boinc.berkeley.edu/trac/wiki/MacDeveloper

Note, this is non-paying. Good bullet point on the resume, though.

I run several BOINC projects on my PCs when I am not using them.
HTCondor and SLURM have their own pages:
https://research.cs.wisc.edu/htcondor/
https://slurm.schedmd.com/overview.html

They are not directly tied to what your jobs actually run. Note, though, that SLURM assumes every node sees the same (shared) filesystem paths.
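
For example, a minimal SLURM batch script (the file and program names here are made up) could look like:

#!/bin/bash
#SBATCH --job-name=demo      # name shown in the queue
#SBATCH --ntasks=4           # number of tasks (e.g. MPI ranks) to launch
#SBATCH --time=00:10:00      # wall-clock limit

srun ./demo                  # srun runs the tasks across the allocated nodes

You'd submit it with "sbatch job.sh" and watch it with "squeue".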


An MPI application uses the MPI library for interprocess communication (over the network). If one process fails, the whole application is bound to fail due to race conditions. (I presume; I haven't touched MPI-based code in this millennium.)

Some MPI applications consist of single-threaded processes, with all communication going through MPI.
Others run one multi-threaded process per machine and use MPI only for inter-machine communication.
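
The hybrid style looks roughly like this sketch (it assumes the MPI implementation supports MPI_THREAD_FUNNELED; compile with something like mpic++ -std=c++11 hybrid.cpp -pthread):

#include <mpi.h>
#include <thread>
#include <vector>
#include <cstdio>

int main(int argc, char **argv)
{
    int provided;
    //FUNNELED: we promise that only the main thread will make MPI calls.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    //One thread per hardware core; no MPI calls inside the threads.
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n; ++i)
        pool.emplace_back([rank, i]{
            std::printf("rank %d, thread %u working\n", rank, i);
        });
    for (auto &t : pool)
        t.join();

    //Inter-machine communication via MPI would happen here, on the main thread.
    MPI_Finalize();
}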
The Intel Performance Libraries include an MPI library; I guess you could check out their docs and sample code...
http://www.cplusplus.com/forum/lounge/250217/