I recently read “The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized, and pair HMMs” by Keibler, Arumugam and Brent which was published in Bioinformatics 23(5):545-554 (2007). This paper outlines an algorithm for optimal decoding of HMMs which reduces memory requirements compared to the more standard Viterbi decoding. Furthermore, unlike the Hirschberg algorithm (aka Myers-Miller) it does not result in a significant increase in runtime.

The two key ideas which make their algorithm work are garbage collection and early decoding. By taking advantage of the structure of the model (dynamic programming recursions) they can identify when a cell of the matrix is no longer reachable. Once unreachable, the cell can not be part of the final optimal path and is freed (garbage collection). Likewise, nodes along the optimal path can be freed (starting from the root) if the path leading to that node becomes clearly defined. This is the essence of early decoding. Taken together, Keibler applies these concepts to N-SCAN (a gene predictor) and Pairagon (cDNA-to-genome alignment) with dramatic results. In the discussion they highlight how these ideas can also be applied to more generalized graphical models to exploit locality when it exists within the model.

In general, probabilistic models are excellent tools for providing principled solutions to difficult problems. The are particularily useful in cases where the model can reflect prior knowledge about the problem domain (e.g. the triplet codon nature of genes for example). Unfortunately, when applied to large datasets, the inference algorithms often require an expensive calculation — either in time, memory or both. Examples include (but aren’t limited to) chromosomal sequence alignment, genefinding, covariance model searches, and the simultaneous alignment and folding of RNAs (Sankoff algorithm).

There have been a number of papers lately which have engineered methods of constraint (some heurstic and others rigorous) into their probabilistic models in order to tackle large datasets. This is particularly true for problems involving RNAs where these approaches are critical, such as :

- Weinberg and Ruzzo’s use of rigorous filtering for searching for non-coding RNAs using covariance models
- Nawrocki and Eddy’s use of query-dependent dynamic programming for reducing the runtime of covariance models
- Many efforts on making simultaneous align and fold of RNAs practical such as Ian Holmes work on constraints on pairSCFGs; Harmanci et. al. ‘s recent improvements to Dynalign; and Havgaard et. al.’s work on constraining the dynamic programming of Foldalign.
*There are many others working on this, but I want to be brief …*

Two things to note here:

- Large datasets are becoming the norm in biology, so constraints are simply necessary to be practical.
- When using a heuristic methods one needs to understand the tradeoffs inherent in the method to assess the risk/rewards of setting the particular parameters which balance accuracy vs speed.

Keibler, E., Arumugam, M., Brent, M.R. (2007). The Treeterbi and Parallel Treeterbi algorithms: efficient, optimal decoding for ordinary, generalized and pair HMMs. Bioinformatics, 23(5), 545-554. DOI: 10.1093/bioinformatics/btl659

## Leave a Reply