The Configuration File¶
Configuration File Components
You specify most of the settings for a phycoeval analysis in a YAML-formatted configuration file. YAML is an “anti-markup” language and data standard that is much more human friendly than alternatives like XML.
Note
The website http://www.yamllint.com/ is a nice tool for debugging YAML syntax. You can copy and paste your config file their to check of you’re using valid YAML syntax.
The simplest possible configuration file for phycoeval would look something like:
---
data:
constant_sites_removed: false
alignment:
path: alignment/Gekko.nex
This specifies the path to the nexus file containing the aligned character data for the species we want to infer a phylogeny for, and specifies whether we have removed constant sites from the alignment. The defaults for all other settings would be used in this case, which we strongly discourage.
Now, let’s look at an example that is more thoroughly specified:
---
data:
ploidy: 2
constant_sites_removed: false
alignment:
genotypes_are_diploid: true
markers_are_dominant: false
population_name_is_prefix: false
population_name_delimiter: ' '
path: alignment/Gekko.nex
# This is a comment!
tree_model:
tree_space: generalized
starting_tree: comb
tree_prior:
uniform_root_and_betas:
parameters:
root_height:
estimate: true
prior:
gamma_distribution:
shape: 4.0
mean: 0.01
alpha_of_node_height_beta_prior:
value: 1.0
estimate: false
branch_parameters:
population_size:
equal_population_sizes: true
value: 0.0002
estimate: true
prior:
gamma_distribution:
shape: 4.0
mean: 0.0002
mutation_parameters:
freq_1:
value: 0.5
estimate: false
mutation_rate:
value: 1.0
estimate: false
mcmc_settings:
chain_length: 15000
sample_frequency: 10
All the settings are hierarchically nested by the indent spacing.
For example,
data
, tree_model
, branch_parameters
,
mutation_parameters
, and mcmc_settings
are all at the highest level of the hierarchy, and have various settings nested
within them.
Across a given level of the hierarchy, order does not matter. E.g., the
top-level groups of settings listed above can be arranged in any order.
Anything proceeded by a ‘#’ is a comment that is ignored by phycoeval.
General parameter syntax¶
Before we get into the specific syntax for each setting in the config file,
let’s first
look at the general syntax that applies
to all parameters in the tree_model
, branch_parameters
, and
mutation_parameters
sections of the config.
The general syntax for a parameter is:
parameter_name:
estimate: true # or false
value: 1.0
prior:
a_valid_distribution:
distribution_parameter: 1.0
So, for any parameter, you can specify:
Whether or not the parameter should be fixed (
estimate: false
) or estimated (estimate: true
).A value for the parameter. This is only the starting value if
estimate
istrue
, or is the fixed value ifestimate
isfalse
. If a value is not specified, the starting value is drawn from the prior.The prior probability distribution. This is ignored if
estimate
isfalse
.
tree_model¶
Next, let’s walk through how to specify the tree model in the configration file.
tree_model:
tree_space: generalized
starting_tree: comb
tree_prior:
uniform_root_and_betas:
parameters:
root_height:
estimate: true
prior:
gamma_distribution:
shape: 4.0
mean: 0.01
alpha_of_node_height_beta_prior:
value: 1.0
estimate: false
tree_space¶
There are three options for the tree_space
setting:
tree_space: generalized
This tells phycoeval to use a generalized tree distribution that allows for shared and/or multifurcating divergences.
tree_space: bifurcating
This specifies a “standard” tree distribution that assumes all divergences are independent and bifurcating.
tree_space: fixed
This tells phycoeval that you want to fix the tree topology. To do this, you must also provide a starting tree with the topology you wish to hold constant during the analyses.
starting_tree¶
phycoeval will first check to see if starting_tree
is set to one of two
options:
starting_tree: comb
This specifies that the MCMC chain should start with the tree topology with only one multifurcating internal node.
starting_tree: random
This specifies that the MCMC chain should start with the random topology with only independent, bifurcating divergences; i.e., a random topology with internal nodes, where is the number of tips (populations/species).
If the setting for starting_tree
is not comb
or random
,
phycoeval will next check to see if the string provided is a valid path
to a file.
If it is, the starting tree will be read from the file, which can be in
phylip/newick format or nexus format.
If the setting for starting_tree
is not comb
, random
, nor a path,
phycoeval will try to interpret the setting as a newick-formatted tree string.
So, in summary, you can specify a specific starting tree using a newick string or a path to a newick or nexus formatted tree, or you can specify the MCMC chain start with the comb tree or a random bifurcating tree.
tree_prior¶
Currently, the only tree_prior
implemented in phycoeval is
uniform_root_and_betas
.
This prior assumes that all topologies are equally probable a priori,
the age of the root follows a parametric distribution (e.g., a gamma or
exponential distribution), and all other divergence times follow scaled beta
distributions.
When, tree_space
is set to generalized
, all possible non-reticulating
topologies (including topologies with shared and/or multifurcating divergences)
are a priori equally probable.
When, tree_space
is set to bifurcating
, all possible topologies with
independent, bifurcating divergences (i.e., internal nodes)
are equally probable.
The prior settings for the root age (height) and the beta distributions
for all other divergence times are specified under the parameters
section fo the uniform_root_and_betas
tree prior:
uniform_root_and_betas:
parameters:
root_height:
estimate: true
prior:
gamma_distribution:
shape: 4.0
mean: 0.01
alpha_of_node_height_beta_prior:
value: 1.0
estimate: false
root_height¶
The root_height
parameter follows the
general parameter syntax discussed above.
Valid continuous probability distributions for the prior
include:
exponential_distribution
For the
exponential_distribution
, you can specify themean
orrate
(the rate = 1/mean), and anoffset
. By default, theoffset
is zero, and the lower bound of the exponential distribution is zero. If you specify a positive offset, this shifts the entire distribution so that the lower bound is equal tooffset
. NOTE: If you specify amean
and anoffset
, the mean is interpreted as the mean before shifting the distribution.gamma_distribution
For the
gamma_distribution
, you can specify theshape
andmean
OR theshape
andscale
, where the . As with the exponential distribution, you can also specify anoffset
, which will shift the gamma distribution to have a lower bound equal tooffset
. NOTE: When you specify anoffset
, thescale
ORmean
is interpreted as the scale or mean before shifting the distribution.uniform_distribution
The
uniform_distribution
has two parameters that can be specified: themin
andmax
.
alpha_of_node_height_beta_prior¶
The uniform_root_and_betas
tree prior assumes (a priori)
that all non-root divergence times follow a scaled beta distribution
between the present and the age of the youngest parent node of any
node mapped to the divergence time
(see the figure below for an example.
These are “scaled” betas, because, normally, a beta distribution has a lower
and upper bound of 0 and 1, respectively.
The beta distributions are scaled to be proper probability distributions (i.e.,
integrate to one) between zero (the “present”) and the age of the youngest
parent node.
By default, phycoeval sets both the alpha and beta parameters of these beta
distributions to one, which makes them equivalent to uniform distributions
between zero and the age of the youngest parent node.
However, phycoeval allows you to control the settings for the alpha parameter of
these beta distributions via the alpha_of_node_height_beta_prior
option:
alpha_of_node_height_beta_prior:
value: 1.0
estimate: false
Above, we are simply fixing the alpha parameter of the beta distributions on
the non-root divergence times to 1 (this is the default for phycoeval).
In general, the default is probably fine for most applications, and estimating
alpha_of_node_height_beta_prior
has not been well tested with simulated
data.
branch_parameters¶
branch_parameters:
population_size:
equal_population_sizes: true
value: 0.0002
estimate: true
prior:
gamma_distribution:
shape: 4.0
mean: 0.0002
population_size¶
The population_size
parameter follows the
general parameter syntax discussed above,
with the exception of one additional setting: equal_population_sizes
.
If equal_population_sizes
is set to true
, then all the branches in the
tree will share the same population_size
parameter.
If equal_population_sizes
is set to false
, then each branch in the tree
gets its own population_size
parameter.
If you set the mutation rate to 1, then the effective population sizes will be scaled by the mutation rate (). Alternatively, if you specify an actual rate of mutation per site per generation, then the population size will be in units of the effective number of diploid individuals or haploid genomes (), if the ploidy is 2 or 1, respectively.
Note
Important: In ecoevolity, the population_size
is related to, but not
equal to (; the genetic
diversity, or more precisely, the expected number of differences per base
between two randomly selected haploid genomes).
The relationship between population_size
(represented by
) and is:
(1)¶
Thus, if you have prior expectation that and you’ve
set the mutation_rate
to 1.0 (i.e., ) and
ploidy
to 2, then your prior expectation for population_size
is,
The relationship above is also very important to keep in mind if you
specify a mutation rate in years.
For example, let’s assume that for your taxa, you believe 0.004 is a
reasonable value for the average number of differences per base between two
randomly selected gene copies, and you’ve told phycoeval that the ploidy =
2
.
If you’ve set the mutation rate to 1e-8
, then you can
use Equation (1) to adjust your prior
expectation for population_size
accordingly:
mutation_parameters¶
mutation_parameters:
freq_1:
value: 0.5
estimate: false
mutation_rate:
value: 1.0
estimate: false
mutation_rate¶
The mutation_rate
settings are for rate of mutation across the tree.
.. How you scale this is up to you, but you need to make sure you are consistent
.. in how you scale time and effective population sizes.
If you set the mutation rate to 1, then time and effective
population sizes will be scaled by the mutation rate.
Specifically, time will be in units of (i.e.,
expected substitutions per site), and effective population size will be measured
in .
Alternatively, if you specify an actual rate of mutation per site per
generation, then time will be in units of generations,
and population size will be in units of the effective number of diploid
individuals or gene copies () if the ploidy is 2 or 1,
respectively.
To help ensure the population sizes are scaled correctly, it can help to
remember that should
equal the expected differences per base between two randomly selected genomes
from a population.
freq_1¶
The freq_1
parameter is the equilibrium frequency of the “1” allele (or 1
minus the frequency of the “0” allele).
If you are using nucleotide data, we recommend that you fix the frequencies
of the “0” and “1” states to be equal:
freq_1:
value: 0.5
estimate: false
This is because there is no natural way to recode the 4 nucleotide states to two states. Thus, if you try to estimate frequencies of the two states, your results will be sensitive to the vagaries related to how you decided to code your nucleotides as binary.
However, if the characters you are using are truly biallelic, then it might make sense to estimate the frequencies of the two states:
freq_1:
estimate: true
prior:
beta_distribution:
alpha: 2.5
beta: 1.2
Another option is:
freq_1:
value: empirical
estimate: false
which fixes the frequencies of the two states to their empirical frequencies (i.e., the frequencies at which they appear in your data). Again, you shouldn’t do this for nucleotide data.
Note
The empirical
option for the value only works for the freq_1
parameter. You should get an error if you try to use it for any other
parameters.
data¶
This section of the configuration file tells phycoeval about the data we are asking it to analyze, and where it can find those data.
ploidy: 2
This is the ploidy of the organisms (e.g., 1 = haploid, 2 = diploid), and
determines the meaning of the population_size
parameters covered
under branch_parameters
below.
If ploidy
is set to one, then population_size
will be the
(potentially mutation-scaled) haploid effective population size
(i.e., the effective number of haploid genomes).
If your taxa are diploid and you set ploidy
to two, then
population_size
will be the (potentially mutation-scaled) effective number
of diploid individuals.
constant_sites_removed: false
This tells phycoeval whether or not you have removed all constant
characters/sites.
If true
, phycoeval will correct the likelihood for having only
sampled variable characters.
The alignment
section provides phycoeval with information about
the data you wish to analyse:
alignment:
genotypes_are_diploid: true
markers_are_dominant: false
population_name_is_prefix: false
population_name_delimiter: ' '
path: alignment/Gekko.nex
Below, we cover each of the possible settings nested under alignment
.
genotypes_are_diploid: true
This tells ecoevolity how you have encoded your characters.
Does each cell of your character matrix represent the state of both alleles of
a diploid individual?
If so, genotypes_are_diploid
should be true
.
If each cell represents the state of a particular gene copy, then
genotypes_are_diploid
should be false
.
If you have a code(s) to represent a heterozygote, then
genotypes_are_diploid
should definitely be true
.
markers_are_dominant: false
This specifies whether your markers are dominant.
If the same code is used to designate the character state of a heterozygote and
one of the two possible homozygotes, then markers_are_dominant
should be
true
.
If you can tell the difference among a heterozygote and both homozygotes, then
markers_are_dominant
should be false
.
population_name_delimiter: " "
population_name_is_prefix: false
For each row (individual or gene copy) of your character matrix, you need to tell ecoevolity which population it was sampled from. You do this by using prefixes or suffixes for each row label in your nexus file, and the prefix or suffix needs to be delimited by a character. So, if your matrix looks like:
Begin data;
Dimensions ntax=20 nchar=40000;
Format datatype=standard symbols="01" missing=? gap=-;
Matrix
'population1-lizard-001' 0010...
'population1-lizard-002' 0010...
'population2-lizard-003' 0000...
'population2-lizard-004' 0011...
.
.
.
then you should specify:
population_name_delimiter: "-"
population_name_is_prefix: true
and ecoevolity will know that the data in the first two rows came from “population1” and the data in the third and forth rows came from “population2”.
Note
Underscore gotcha!
The nexus format standard interprets underscores as spaces, unless the labels are quoted. So if you have:
Begin data;
Dimensions ntax=20 nchar=40000;
Format datatype=standard symbols="01" missing=? gap=-;
Matrix
population1_jro-001 0010...
population1_jro-002 0010...
population2_jro-003 0000...
.
.
.
you need to specify:
population_name_delimiter: " " # just a space!
population_name_is_prefix: true
mcmc_settings¶
The mcmc_settings
specify how long to run the MCMC
chain, and how often to record a sample from it. For example:
mcmc_settings:
chain_length: 15000
sample_frequency: 10
tells phycoeval to run the chain for 15,000 generations, recording a sample every 10th generation (15,000/10 = 1500 total samples). This is a good starting point. For most datasets we have analyzed so far, this has been sufficient. If the chain is having mixing problems, then you can try increasing these numbers. We recommend running several independent chains (analyses) to:
Confirm the chains are converging.
Increase the number of samples from the posterior distribution (assuming the chains converged).