ABI Innovation: Algorithms and Models for Distributed Computation of Bayesian Phylogenetics

This project is funded by the National Science Foundation under grant number 1355998.

Project Description

Given a set of molecular (DNA, RNA, or amino acid) sequences obtained from a number of organisms, phylogenetic inference is the task of reconstructing the evolutionary relationships among the sequences. These relationships are represented as an evolutionary tree, or, more generally, a directed, acyclic graph (DAG) when horizontal or vertical transmission of genetic material is modeled. The proposed project investigates the popular Bayesian approach to phylogenetic inference. Currently, all popular software packages for Bayesian inference utilize Markov Chain Monte Carlo (MCMC) algorithms. The problem is that these algorithms can take months to converge on large inference problems.The project's aim is to develop parallel algorithms for MCMC that are suitable for use in a modern cluster compute environment, such as Amazon.com's EC2 service.

To ensure broader impacts of this work, the software, along with a user's manual and tutorial, will be open-sourced and the PIs will make special effort in recruiting students from underrepresented groupsFor a phylogenetic inference algorithm to scale to thousands of organisms, parallelization is mandatory. For example, one idea that the project will expore is the so-called Bayesian Forest (BF) approach. In the BF approach, the MCMC algorithm maintains not one tree or DAG, but dozens, hundreds, or thousands of trees or DAGs at the same time, where all of them work together and share information with one another as the algorithm runs. This tends to avoid problems associated with converging to poor, locally optimal solutions because as long as one tree/DAG has avoided a problem configuration, over time, that tree/DAG can pull the entire forest away from the poor solution. Crucially, since the most expensive computations over the individual trees/DAGs are independent of one another, it is easy to send them to different machines. The project will also consider distributed algorithms for challenging variants of the problem, such as multi-loci data sets and vertical and horizontal transmission of genetic material, as well as co-estimation of gene and species trees. All software, along with a user's manual and tutorial, will be open-sourced.

Project Participants

The following people are currently working on this project:

Publications

Software