Ambiguity in IMP
On this page we'll address the elusive notion of ambiguity.
What is ambiguity
Several distinct types of ambiguities can be discriminated, that are usually mixed with one another during a modelling process. Three major cases are detailed hereafter.
Ambiguous identities
Experimental data usually provide knowledge on protein types, whereas when building a model, one is ultimately concerned with "instances" of proteins. For example, a complex may be composed of two proteins of type A. When building a model, one will ultimately distinguish both instances A1 and A2 of protein type A.
One part of the model building process consists in converting (or interpreting) "type-based" data coming from experimental sources into a 3D "instance-based" description.
Ambiguous connectivity mechanism
Certain experiments tell you that two proteins interact, but not how. For example, you could know that two proteins touch, but not what the interface is which they touch on.
Ambiguous connectivity graph
Some experimental techniques (such as affinity purification or pull-outs) provide informations on interconnectivity between a set of proteins. In such cases, the only information that is accessible is that these proteins are somehow connected with one another, but the exact arrangement between proteins is unknown.
Handle ambiguity when modelling
Methods to handle ambiguities have been proposed in the following publications :
- Frank Alber et al., “Integrating diverse data for structure determination of macromolecular assemblies,” Annual Review of Biochemistry 77 (2008): 443-477.
- Frank Alber et al., “Determining the architectures of macromolecular assemblies,” Nature 450, n°. 7170 (Novembre 29, 2007): 683-694.1.
Keren Lasker et al., “Integrative structure modeling of macromolecular assemblies from proteomics data,” Molecular & Cellular Proteomics: MCP 9, n°. 8 (Août 2010): 1689-1702.
How is ambiguity handled in IMP ?
IMP provides some tools to help one handling ambiguities:
- Ambiguous connectivity mechanism is handled by the IMP::core::KClosePairsPairScore which can score two complex objects based on the closest interacting parts. This result is correct in that the score will be 0 if and only if all the entities are connected.
Ambiguous connectivity graph in the absence of ambiguity about identity can be handled thoroughly by the IMP::core::ConnectivityRestraint. The implementation of this restraint follows the propositions in ref[1] : The restraint contains references to the set of particles that compose the composite object. The complete graph of the composite is built (i.e. the graph that joins every possible pair of particle in the composite). Each edge in that graph is then weighted with a configurable pair-score that depends on the distance between both particles connected by that edge. This weight is updated each time the model changes, and the interconnectivity is then evaluated as the minimal score that can be obtained by summing weights over a path interconecting each particle in the composite (namely, finding the minimum spanning tree or MST). This result is correct in that the score will be 0 if and only if all the entities are connected.
Ambiguous identity is also currently handled by IMP::core::ConnectivityRestraint, but less well. All entities of the same type are grouped together and edges are given the weight of the minimum weight between any pair of the two types of the endpoints. For example, with proteins type A, B, C, D, the graph used has 4 nodes (one for each of A, B, C, D) and the weight of edge A,B is the minimum length connected any protein of type A with and of type B. The resulting score is 0 if the complex is connected, but the other direction does not hold as the A-B and B-C edges could be using different members of B.
In all of these, certain edges can be explicitly disallowed by dropping them from the candidate set or setting their cost to infinity.
Towards better handling of ambiguous data
Cases where we would like a better tree to be computed:
- we would like to enforce that a graph is connected
- we would like to be able to generate multiple equivalent graphs in the case of symmetry
- we would like to be able to limit stoichiometry in the graph?
The first is relatively easy. Instead of computing the MST, one simply grows a tree from each of the copies in a particular class by adding edges (if desired) in sorted order and takes the cheapest tree. The score will be 0 if and only if the relevant things are connected.
The second can be done if one of the types has the same number of copies as the number of trees. Then you grow a tree from each copy. The cost of an edge is given by the cost of the minimum matching between the copies of the endpoints in the tree with any of the copies of the appropriate type not in the tree.
The last is hard and seems to require heuristic search (eg A*).