Distributed IMP

We would like to provide a simple mechanism for describing (and implementing) weakly communicating sampling protocols in IMP. For non-communicating protocols, we already have the structure given for biological systems where the sample scripts are given an index and the total number of jobs (eg run job 0 of 10) and then can divide up the work appropriately. These scripts can be run on a cluster to parallelize the work. This approach doesn't work when the various jobs need to communicate with one another. Examples of this communication include

Limitations

Discussion points

What are other communication requirements?

Various modeller things spawn small, communication free jobs and amalgamate the result.

A proposal

We build things around the assumption that there is a fixed Model/Restraints for everything and the tasks simply involve a Configuration input, a ConfigurationSet output and a sampling script to go from one to the other. This architecture allows the setup to be done once and then used for many sampling tasks. In order to provide more flexibility, one can have "contexts" which are simply setup environments. When executing a task, you can pick which context you want it to be executed in

A python interface could look something like

class Task {
ConfigurationSet *operator()(Model *m, PythonMap data);
};

class Setup {
(Model*, PythonMap) operator()();
}

class Manager: public Object {
  Context add_setup_script(Callable setup);
  TaskID create_task(Callable task, Configuration config_to_start_from,
                     Context=default_context);
  /** Blocking call to get the result */
  ConfigurationSet* get_configuration_set(TaskID);
};

Usage would be something like (replace IMP.Manager with a specific implementation eg SGEManager).

# get the locations of the slave processes from the command line or some other way.
man= IMP.Manager(sys.argv)
def my_setup():
  m= IMP.Model()
  ...create_particles...
  ...create_restraints...
  ...
  return (m, {'molecule':molecule_particle, "parameters":sim_param})

context=man.add_setup_script(my_setup)

def MDTask(temperature):
  def func(m, data):
     # we could just set this in the configuration, but it makes swaps easier
     data['parameters'].set_temperature(temperature)
     ...
     return cs
  return func

(m, data)= my_setup()
tasks=[]
for t in temperatures:
  randomize_particles(m, data)
  c= IMP.Configuration(m)
  tasks.append(man.create_task(MDTask(t), c, context))
while True:
  css=[]
  energies=[]
  temperatures=[]
  for t in tasks:
    cs= man.get_configuration_set(t)
    cs.load_state(0)
    energies.append(m.evaluate(False))
    temperatures.append(data['sim_params'].get_temperature())
    css.append(cs.get_configuration(0))
  ...decide who needs to swap and swap the temperature values...
  ... spawn again...

Implementation

Communication can (hopefully) be done using pyro. If not, raw sockets.

To implement it, we should lock the model so the set of restraints/weights can't be changed. We can make restraint sets store their weights in particles so that restraints can be turned on and off.

The map should be clone before being passed to a task so that changes to it are lost.

General communication model discussion

Control model

Master/slave

Defined as a master node and a number of slaves which have persistent state. The master controls the slaves by sending asynchronous messages of some sort or another

advantages

disadvantages

task based

Defined as having one controlling node which creates tasks. The tasks are then farmed out to slave processes. The interface provides no direct control over what tasks are done where. This is the nicest model to program against, I think and the easiest to provide an implementation for. T

advantages

disadvantages

Peer-to-peer

advantages

disadvantages

How would fault tolerance be handled?

The consensus seems to be that it must be handled transparently by the framework if it is to be handled (that is, people weren't much interested in writing fault tolerance code themselves). This makes peer to peer and fault tolerance incompatible.

C++ vs Python

Python at least at first. For one thing, transmitting C++ code over the network is hard...

C++ would be nice though.

Other Interface proposals

Sampling specific peer to peer

One idea for providing this functionality would be to have a Communicator class. It would provide functionality to send sets of particles and messages to other jobs. The communication could be implemented via files or via MPI as appropriate (and as we get motivation). Such a class could look something like

class Communicator: public Object {
public:
/** Create a communicator with the given ID and with the assumption that there are number_of_jobs total jobs */
Communicator(unsigned int id, unsigned int number_of_jobs);

unsigned int get_id() const;
unsigned int get_number_of_jobs();
void write_message(unsigned int target, Message message);
/** Spin while there are no messages.
*/
Message read_message(unsigned int source);
// communicate particles to another job
void write_particles(ParticlesTemp ps, unsigned int target);

// get particles from another job
void read_particles(unsigned int source, ParticlesTemp ps);
};

Message are Keys. The advantage (over just using ints) is that they are typed and that there is a natural, and failsafe, mechanism for translating them into pretty printable strings (so that it is harder to mix them up and the Communicator can enforce invariants if we want, such as the same set of messages being defined on both ends).

Usage would look something like this (for replica exchange)

id= int(sys.argv[1])
num_jobs= int(sys.argv[2])
com= IMP.Communicator(id, num_jobs)

no_message= IMP.add_message("no")
yes_message= IMP.add_message("yes")

initialize things

while not done:
  do some sampling
  orig_config= IMP.Configuration(m)
  # note this doesn't really work, we need to divide into odd and even sets or something
  com.write_particles(relevant_particles, id+1)
  com.write_particles(relevant_particles, id-1)
  com.read_particles(id+1, relevant_particles)
  if we should swap with +1:
     com.write_message(id+1, yes_message)
     figure out how we want to ignore the particles
  else:
     com.write_message(no_message, id+1)
     message= com.read_message(id-1)
     if message==IMP.Message("yes"):
       com.read_particles(id-1, relevant_particles)

IMP: distributed_computations (last edited 2011-11-24 23:39:21 by newacct)