The RMF file format (short for Rich Molecular Format) stores hierarchical data about a molecular structure in a binary file. This data can include
For example, a protein can be stored as a hierarchy where the root is the whole molecule. The root has one node per chain, each chain has one node per residue, and each residue has one node per atom. Each node in the hierarchy has the appropriate data stored along with it: a chain node has the chain identifier, a residue node has the residue type, and atom nodes have coordinates, atom type and element. Bonds between atoms or coarser elements are stored explicitly, since having to consult external databases to generate bonds is the source of much of the difficulty of dealing with other formats such as PDB.
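The hierarchy described above can be sketched in a few lines of Python. This is only an illustration of the data model, not the RMF library API: the Node class, the walk helper and all attribute names (chain_id, residue_type and so on) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for one node of the hierarchy (not the RMF API)."""
    name: str
    data: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def add_child(self, name, **data):
        """Create a child node carrying the given data and return it."""
        child = Node(name, dict(data))
        self.children.append(child)
        return child


def walk(node, depth=0):
    """Yield (depth, node) pairs in depth-first order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)


# molecule -> chain -> residue -> atom
root = Node("protein")
chain = root.add_child("chain A", chain_id="A")
residue = chain.add_child("residue 1", residue_type="ALA", residue_index=1)
n_atom = residue.add_child("N", element="N", atom_type="N",
                           coordinates=(0.0, 0.0, 0.0))
ca_atom = residue.add_child("CA", element="C", atom_type="CA",
                            coordinates=(1.5, 0.0, 0.0))

# Bonds are kept as explicit pairs of nodes instead of being inferred later.
bonds = [(n_atom, ca_atom)]

for depth, node in walk(root):
    print("  " * depth + node.name, node.data)
```

Each level carries only the data that belongs to it, so the chain identifier is stored once on the chain node and not repeated on every atom.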
The file might also include a pair for storing the r-value for a FRET measurement between two residues as well as extra markers to highlight key parts of the molecule.
Multiple conformations of the hierarchy are stored as frames. Each frame has the same hierarchical structure, but some aspects of the data (e.g. coordinates) can have one value for each frame (or no value for a particular frame if they happen not to be applicable then).
A hierarchical storage format was chosen since
See simple.rmf for an XML dump of the RMF generated from simple.pdb. For a larger example, see 3U7W.rmf from 3U7W.pdb. Note that viewing XML files works much better with Firefox or Google Chrome than with Safari. For more information about the library, see RMF. For the standard data storage schemes, see standard categories.
More technically, each node in the RMF hierarchy has
One accesses nodes in the hierarchy using handles, RMF::NodeHandle and RMF::NodeConstHandle. The root handle can be fetched from the RMF::FileHandle using RMF::FileHandle::get_root_node().
Each attribute is identified by a key (RMF::Key) and is defined by a unique combination of
On a per-RMF basis, the data associated with a given key can have either one value for each node that has the attribute, or one value per frame for each such node. The methods in RMF::NodeHandle to get and set the attributes take an optional frame number.
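As a rough illustration of the two storage modes, the following Python sketch keeps one value per node for some keys and one value per frame per node for others. The AttributeStore class and its method names are assumptions made for this example; they do not reproduce the RMF::NodeHandle interface.

```python
class AttributeStore:
    """Hypothetical model of key-based storage; not the RMF library API."""

    def __init__(self):
        self._per_frame_keys = {}  # key name -> True if one value per frame
        self._static = {}          # (node, key) -> value
        self._framed = {}          # (node, key, frame) -> value

    def add_key(self, key, per_frame):
        """Declare a key and fix whether its values vary by frame."""
        self._per_frame_keys[key] = per_frame

    def set_value(self, node, key, value, frame=None):
        if self._per_frame_keys[key]:
            if frame is None:
                raise ValueError(f"key {key!r} needs a frame number")
            self._framed[(node, key, frame)] = value
        else:
            # The frame argument is ignored for keys with a single value.
            self._static[(node, key)] = value

    def get_value(self, node, key, frame=None, default=None):
        if self._per_frame_keys[key]:
            return self._framed.get((node, key, frame), default)
        return self._static.get((node, key), default)


store = AttributeStore()
store.add_key("element", per_frame=False)
store.add_key("coordinates", per_frame=True)

store.set_value("CA", "element", "C")
store.set_value("CA", "coordinates", (1.5, 0.0, 0.0), frame=0)
store.set_value("CA", "coordinates", (1.6, 0.1, 0.0), frame=1)
```

In this sketch a missing per-frame value simply returns the default, which mirrors the idea that an attribute may have no value in a frame where it does not apply.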
A number of data categories and attributes have been defined so far. New ones can be added as needed, without affecting existing files. See the documentation for each category for more information:
In addition, arbitrary data can be associated with sets of nodes, accessed via, e.g., RMF::NodePairHandle and RMF::NodePairConstHandle. This mechanism is used to store bond information (which is always stored in the file to avoid the difficulties associated with parsing PDB files). Information about bond angles or torsion angles could be stored using the same mechanism. Currently, the library API supports sets up to size 4, but we can add arbitrarily sized sets when needed.
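The idea of attaching data to sets of nodes can be sketched as a mapping from tuples of node identifiers to attributes. The NodeSets class below is hypothetical and is not the RMF::NodePairHandle API; the upper limit of 4 follows the text, while the lower limit of 2 and the integer node identifiers are assumptions of this example.

```python
class NodeSets:
    """Hypothetical container for data attached to ordered tuples of node ids."""

    MAX_SIZE = 4  # mirrors the current limit mentioned in the text

    def __init__(self):
        self._data = {}  # tuple of node ids -> dict of attributes

    def add(self, nodes, **attributes):
        """Attach attributes to a set of 2 to MAX_SIZE nodes."""
        nodes = tuple(nodes)
        if not 2 <= len(nodes) <= self.MAX_SIZE:
            raise ValueError("sets must contain 2 to 4 nodes")
        self._data.setdefault(nodes, {}).update(attributes)

    def of_size(self, size):
        """Return all stored sets with exactly the given number of nodes."""
        return {nodes: attrs for nodes, attrs in self._data.items()
                if len(nodes) == size}


sets = NodeSets()
sets.add((0, 1), kind="bond")                # pair: a bond between two atoms
sets.add((0, 1, 2), kind="angle")            # triple: a bond angle
sets.add((0, 1, 2, 3), kind="torsion")       # quad: a torsion angle
```

Pairs, triples and quads share one mechanism here, which is why angles and torsions need no separate storage scheme.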
Attributes on a parent node should be thought of as being inherited by their children when appropriate, unless the child overrides them with a value of its own. For example, if a node in the hierarchy has a color stored in it, the children should be considered as having the same color unless they have a different color of their own. Similarly, atoms inherit their residue index and type from their residue parents, and all of those get the chain identifier from their parent.
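One way to read this inheritance rule is as a lookup that walks from a node up through its ancestors and returns the first value found. The sketch below assumes a minimal Node class with a parent pointer and a hypothetical resolve helper; neither is part of the RMF library.

```python
class Node:
    """Hypothetical node with a parent pointer and its own attributes."""

    def __init__(self, name, parent=None, **attributes):
        self.name = name
        self.parent = parent
        self.attributes = dict(attributes)


def resolve(node, key, default=None):
    """Return the nearest value for key, checking node and then its ancestors."""
    while node is not None:
        if key in node.attributes:
            return node.attributes[key]
        node = node.parent
    return default


chain = Node("chain A", chain_id="A", color="red")
residue = Node("residue 1", chain, residue_index=1, residue_type="ALA")
ca = Node("CA", residue)                 # no color of its own
cb = Node("CB", residue, color="blue")   # overrides the inherited color

print(resolve(ca, "color"), resolve(cb, "color"), resolve(ca, "chain_id"))
```

The lookup only goes upward, so a value stored on a residue is visible to its atoms but not to the chain above it.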
When adding data to an RMF file that is just to be used for internal consumption, one should create a new category. For example, IMP defines an "imp" category for storing arbitrary particle data.
If, instead, the data is likely to be of general interest, it probably makes sense to add it to the documentation of this library so that the names used can be standardized.
The RMF data is stored in a single HDF5 group in the file on disk. As a result, one could easily put multiple RMF "files" in a single HDF5 archive, as well as store other data (such as electron density maps). However, adding extra data sets within the RMF HDF5 group is not supported.
HDF5 was chosen over the other candidates as it
The RMF data is spread over various data sets, divided up into classes based on the RMF::Category, the data type, whether the particular attribute has one value per frame or just one for the whole file, and whether the data is for one node or a set of nodes. Each node has space allocated where it can store information about whether it has attributes in a given class, and if so, where in the corresponding data set the attributes are stored.
Space is allocated in the appropriate table if an attribute in a particular class is used in a node. A special marker value is used to signify when a particular attribute in a class is not found for a particular node (e.g. a -1 is used to signify that a node does not have an index attribute).
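The allocation scheme can be illustrated with a small table that gives a node a row only once it uses an attribute of that class, and fills unused cells with a marker value. This is a sketch under stated assumptions, not the on-disk layout: the IndexTable class, the attribute names and the use of Python lists in place of HDF5 data sets are all hypothetical.

```python
MISSING = -1  # marker meaning "this node has no value for this attribute"


class IndexTable:
    """Hypothetical table for one class of integer attributes."""

    def __init__(self, attribute_names):
        self.columns = {name: i for i, name in enumerate(attribute_names)}
        self.rows = []         # one row per node that uses this class
        self.row_of_node = {}  # node id -> row number in self.rows

    def set_value(self, node, name, value):
        # Allocate a row the first time a node uses an attribute in this class.
        if node not in self.row_of_node:
            self.row_of_node[node] = len(self.rows)
            self.rows.append([MISSING] * len(self.columns))
        self.rows[self.row_of_node[node]][self.columns[name]] = value

    def get_value(self, node, name):
        row = self.row_of_node.get(node)
        if row is None:
            return None  # the node has no attributes in this class at all
        value = self.rows[row][self.columns[name]]
        return None if value == MISSING else value


table = IndexTable(["residue_index", "copy_index"])
table.set_value(7, "residue_index", 42)
print(table.rows)
```

A marker value only works when it can never be a legitimate value, which is why -1 suits index attributes in this sketch.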
To get an idea of the data layout in a file, see the dump (produced by h5dump) of a tiny RMF, simple.rmf. For a larger example, see 3U7W.rmf.