1 Background

The primary goal of this project is to improve comparative biogeographical estimation of shared (synchronous) population divergences or size changes across taxa using genomic data. We have implemented the models we will be testing in the software package ecoevolity.

The background information below will assume some knowledge of the types of comparative biogeographic models implemented in ecoevolity. The documentation provides a brief and gentle introduction to such models. For more details, please refer to Oaks (2019), Oaks, Siler, et al. (2019), & Oaks et al. (2020).

Ecoevolity estimates the relative timing of two types of evolutionary events among taxa:

Population divergence (Oaks 2019).
Change in the effective size of a population (Oaks et al. 2020).

Throughout this documentation, we will focus on divergences, but all the comparative models we discuss and test also apply to changes in effective population sizes (or a mix of divergences and size changes).

1.1 A hypothetical example

As a hypothetical example, let’s imagine three pairs of lizard species that are co-distributed across two islands. We want to test the hypothesis that these three pairs of species diverged when the islands were fragmented by rising sea levels 130,000 years ago (Figure 1.1). This predicts that all three pairs of populations diverged at roughly the same time on an evolutionary timescale.

Figure 1.1: Three pairs of lizard species co-speciating when their island is fragmented

To evaluate the probability of such a model of co-divergence, we also need to consider other possible explanations (i.e., models). For example, perhaps all of these populations were founded by random over-water dispersal events, which we would not expect to be temporally clustered. Or, perhaps two pairs of populations co-diverged when rising seas separated the islands, but the third pair of lizards diverged at a different time via dispersal.

With 3 pairs of lizards, there are 5 possible models to explain their history of divergence times:

All 3 diverged at the same time
All 3 diverged independently
2 diverged together and the third independently; the independent diverger could have been any of the 3 pairs, so these comprise 3 models

Below, Figure 1.2 illustrates these 5 models, where \(\tau\) represents independent divergence-time parameters.

Three pairs of lizard species co-speciating — All 3 pairs co-diverge

Two pairs of lizard species co-speciating — All 3 pairs co-diverge

The number of possible models is equal to the number of possible set partitions of the taxa, which increases very rapidly as the number of taxa increases.

1.2 Bayesian inference of divergence models

Bayesian model choice offers a way to determine the probability of divergence models. Such a Bayesian approach has been used with simulation-based (Hickerson et al. 2006; Huang et al. 2011) and full-likelihood (Oaks 2019; Oaks et al. 2020) methods. It has been shown that these methods can be biased toward over-clustering (i.e., underestimating the number of divergence times) (Oaks et al. 2013, 2014).

Oaks et al. (2014) proposed this tendency to prefer models with too few divergence-time parameters was caused by the prior probability density placed on the divergence times. In Bayesian model choice, the likelihood of a model is averaged over the priors, making the prior a “penalty” for adding parameters to the model. Let’s cover some basics of Bayesian inference to gain some intuition for why this is.

1.3 Basics of Bayesian statistical inference

Let’s focus on the simplest possible model to explain the divergence of our 3 pairs of lizard species. This simplest mode has a single parameter, \(\tau\), the time of divergences shared by all 3 pairs. Bayes’ rule tells us that the posterior probability density of \(\tau\) is proportional to the likelihood multiplied by the prior probability density

\[ p(\tau | D) = \frac{ p(D | \tau) p(\tau)}{ p(D) } \text{,} \]

where \(D\) is the data. The normalizing constant in the denominator (\(p(D)\)) is the marginal probability of the data under our model:

\[ p(D) = \int_{\tau} p(D | \tau) p(\tau) d\tau \text{.} \]

This is an average over the entire parameter space of the likelihood weighted by the prior. This averaging is only in the normalizing constant, so the posterior probability density of \(\tau\) tends to be pretty robust to our choice of prior. So, generally speaking, Bayesian parameter estimation tends to be pretty robust to the priors on the parameters, assuming we have informative data.

However, this changes once we start comparing models. To see this, let’s first add some notation to our equations above to make it explicit that the posterior is for our simple 1-parameter model, which we’ll denote \(M_1\):

\[ p(\tau | D, M_1) = \frac{ p(D | \tau, M_1) p(\tau | M_1)}{ p(D | M_1) } \text{,} \]

where

\[ p(D | M_1) = \int_{\tau} p(D | \tau, M_1) p(\tau | M_1) d\tau \text{.} \]

These equations are identical to the ones above. We’ve just added some bookkeeping to make it explicit which model we’re talking about.

Now, let’s consider a 3-parameter model, \(M_2\), in which all 3 pairs diverge independently:

\[ p(\tau_1, \tau_2, \tau_3 | D, M_2) = \frac{ p(D | \tau_1, \tau_2, \tau_3, M_2) p(\tau_1 | M_2) p(\tau_2 | M_2) p(\tau_3 | M_2)}{ p(D | M_2) } \text{,} \]

where

\[ p(D | M_2) = \int_{\tau_1}\int_{\tau_2}\int_{\tau_3} p(D | \tau_1, \tau_2, \tau_3, M_2) p(\tau_1 | M_2) p(\tau_2 | M_2) p(\tau_3 | M_2) d\tau_1 d\tau_2 d\tau_3 \text{.} \]

Notice, for the marginal likelihood of \(M_2\), we are now averaging over three prior densities. Assuming we are using the same prior on all three divergence times, we are cubing the space over which the likelihood is averaged. Again, if we just want to estimate the three divergence times under this model, the posteriors of these three parameters will be pretty robust to our choice of prior on \(\tau\) (assuming our data are informative).

However, if we want to compare \(M_1\) and \(M_2\), the prior plays a much bigger role. Bayesian model choice is also based on Bayes’ rule, but now we are interested in the posterior probability of each model:

\[ p(M_1 | D) = \frac{ p(D | M_1) p(M_1)}{ p(D | M_1)p(M_1) + p(D | M_2)p(M_2) } \text{.} \]

Notice, the marginal likelihood of the model is now in the numerator of Bayes’ rule; it’s no longer just a normalizing constant! By changing our focus from parameter estimation to model comparison, we have moved into a world of averaged likelihoods. Now, the priors we are averaging the likelihood over are super important, no matter how informative our data are. Choose a prior that places a bunch of density in regions of parameter space with low likelihood, and you can create a crippling penalty against models with more parameters.

So, if we don’t know much a priori about the divergence times of our pairs, and we choose a “vague” prior on divergence times to represent that uncertainty, we can create a strong penalty against models with more divergence time priors.

Let’s look at a silly example of coin flipping to gain more intuition for the differing influence of priors we place on parameters between Bayesian parameter estimation and model choice

1.4 A coin flipping example

This example is stolen from Oaks, Cobb, et al. (2019). Let’s use \(\theta\) to represent the rate of a coin landing heads-side up. We will flip the coin 100 times and record the number of times it lands head up, which we will model as a random outcome from a binomial distribution. Before seeing the coin and flipping it, let’s say we want to compare four models that vary in our prior assumptions about the coin’s rate of heads:

\(M_1\): All values are equally probable; \(\theta \sim \text{Beta}(1,1)\)
\(M_2\): It might be a trick coin, but we don’t know if it’s weighted toward heads or tails; \(\theta \sim \text{Beta}(0.6,0.6)\)
\(M_3\): The coin is probably fair; \(\theta \sim \text{Beta}(5,5)\)
\(M_4\): The coin is probably weighted toward tails; \(\theta \sim \text{Beta}(1,5)\)

If we happen to flip 50/100 heads, Figure 1.3 (from Oaks, Cobb, et al. (2019)) shows what our posterior distributions of \(\theta\) will look like, along with the marginal likelihoods of the four models.

Results of our coin-flipping experiment — Figure 1.3: Results of our hypothetical coin-flip experiment. From Oaks, Cobb, et al. (2019).

Notice, how the posterior estimates of \(\theta\) are nearly identical across all four models, but the average (marginal) likelihood of the models are very different. While silly, this is an example of how differently our choice of prior on a parameter affects Bayesian parameter estimation vs model choice.

1.5 What to do?

Okay, so putting a “vague” prior distribution on divergence times can create a downward bias on the number of divergence times. What can we do about this bias?

We could take an empirical Bayesian approach and use our data to inform the “prior” we put on divergence times. This is double-dipping, which could be okay if we just want point estimates of parameters. For example, if we cheated in our coin-flipping experiment above and added a 5th model that uses the posterior from \(M_1\) as the prior, the mode of the posterior under this 5th model will be nearly identical to the other four models. However, when comparing models, using the data twice can have a big influence on the marginal likelihoods of the models.

All is not lost though. We can use a hierarchical Bayesian approach to make estimating the prior on divergence times part of the model.

1.6 A hierarchical Bayesian approach

By using a hierarchical model, we can let the data tell us what the prior on divergence times should be, without double dipping. To get a sense for how this can work, let’s go back to our 3-parameter model from above, but add another level to the model. Let’s use \(\mu\) to represent the mean of the prior on all the divergence time parameters. Dropping the \(M_2\) notation from above for simplicity, our hierarchical 3-parameter model will look like

\[ p(\tau_1, \tau_2, \tau_3, \mu | D) = \frac{ p(D | \tau_1, \tau_2, \tau_3) p(\tau_1 | \mu) p(\tau_2 | \mu) p(\tau_3 | \mu) p(\mu)}{ p(D) } \text{,} \]

where

\[ p(D) = \int_{\tau_1}\int_{\tau_2}\int_{\tau_3}\int_{\mu} p(D | \tau_1, \tau_2, \tau_3) p(\tau_1 | \mu) p(\tau_2 | \mu) p(\tau_3 | \mu) p(\mu) d\tau_1 d\tau_2 d\tau_3 d\mu \text{.} \]

Notice that \(\mu\) (the mean of the divergence-time prior) is one-level removed from the data. So, integrating \(\mu\) does not directly involve the model likelihood. This will allow the values of \(\tau\) informed by our data (which remember, tend to be robust to the prior) to inform the mean of the prior. If our data suggest our prior on \(\tau\) is “off”, the second-level of the model can learn this indirectly from the data.

Also notice that we have the new term \(p(\mu)\) in our model. This is the prior probability of \(\mu\), the mean of the divergence-time prior distribution. When we want to estimate a parameter of a prior distribution, we have to put a prior on it.

Above, we use the mean of the prior distribution on divergence times as an example. In theory, we can do this for any parameter of the prior distribution on divergence times (e.g., shape, scale, mean, standard deviation, min, max, etc.).

1.6.1 What’s “hyper”? Terminology of hierarchical Bayesian models

In the parlance of Bayesian statistics, when we estimate the parameter of a prior distribution, we call it a hyperparameter. The prior distribution we place on a hyperparameter (i.e., the prior on the parameter of a prior) is called a hyperprior. For this project, we will be exploring hyperpriors on divergence times in the ecoevolity software pacakge. Hence the name of this project.

1.7 In summary (tl;dr)

In comparative biogeographic methods like msbayes and ecoevolity, prior distributions on divergence times with lots of uncertainty can create a strong penalty against models with more divergence-time parameters, and thus cause a downward bias in the number of divergence events across the taxa being compared. For this project, we will explore whether a hierarchical Bayesian approach can mitigate such a bias. Specifically, we will explore using hyperpriors on the parameters of the prior on divergence times to see if the model can learn about the divergence-time prior during the analysis and better calibrate the “penalty” against adding divergence-time parameters to the model.

Hickerson, Michael J., Eli A. Stahl, and H. A. Lessios. 2006. “Test for Simultaneous Divergence Using Approximate Bayesian Computation.” Evolution 60 (12): 2435–53.

Huang, Wen, Naoki Takebayashi, Yan Qi, and Michael J. Hickerson. 2011. “MTML-msBayes: Approximate Bayesian Comparative Phylogeographic Inference from Multiple Taxa and Multiple Loci with Rate Heterogeneity.” BMC Bioinformatics 12: 1. https://doi.org/10.1186/1471-2105-12-1.

Oaks, Jamie R. 2019. “Full Bayesian Comparative Phylogeography from Genomic Data.” Systematic Biology 68 (3): 371–95. https://doi.org/10.1093/sysbio/syy063.

Oaks, Jamie R., Kerry A. Cobb, Vladimir N. Minin, and Adam D. Leaché. 2019. “Marginal Likelihoods in Phylogenetics: A Review of Methods and Applications.” Systematic Biology 68 (5): 681–97. https://doi.org/10.1093/sysbio/syz003.

Oaks, Jamie R., Nadia L’Bahy, and Kerry A. Cobb. 2020. “Insights from a General, Full-Likelihood Bayesian Approach to Inferring Shared Evolutionary Events from Genomic Data: Inferring Shared Demographic Events Is Challenging.” Evolution 74 (10): 2184–206. https://doi.org/10.1111/evo.14052.

Oaks, Jamie R., Charles W. Linkem, and Jeet Sukumaran. 2014. “Implications of Uniformly Distributed, Empirically Informed Priors for Phylogeographical Model Selection: A Reply to Hickerson Et Al.” Evolution 68 (December): 3607–17. https://doi.org/10.1111/evo.12523.

Oaks, Jamie R., Cameron D. Siler, and Rafe M. Brown. 2019. “The Comparative Biogeography of Philippine Geckos Challenges Predictions from a Paradigm of Climate-Driven Vicariant Diversification Across an Island Archipelago.” Evolution 73 (6): 1151–67. https://doi.org/10.1111/evo.13754.

Oaks, Jamie R., Jeet Sukumaran, Jacob A. Esselstyn, et al. 2013. “Evidence for Climate-Driven Diversification? A Caution for Interpreting ABC Inferences of Simultaneous Historical Events.” Evolution 67 (4): 991–1010. https://doi.org/10.1111/j.1558-5646.2012.01840.x.