# The Configuration File¶

Configuration File Components

Most of the settings for an ecoevolity analysis you specify in a configuration file. We use a YAML-formatted configuration file. YAML is an “anti-markup” language and data standard that is much more human friendly than alternatives like XML.

Note

The website http://www.yamllint.com/ is a nice tool for debugging YAML syntax. You can copy and paste your config file their to check of you’re using valid YAML syntax.

The simplest possible configuration file for ecoevolity would look something like:

```
---
comparisons:
- comparison:
path: "alignments/G-crombota-rossi-BabuyanClaro-Calayan.nex"
- comparison:
path: "alignments/G-mindorensis-mindorensis-Lubang-Luzon.nex"
- comparison:
path: "alignments/G-mindorensis-mindorensis-MaestreDeCampo-Masbate.nex"
- comparison:
path: "alignments/G-sp_a-sp_b-Dalupiri-CamiguinNorte.nex"
```

This specifies the path to the nexus file containing the character data for
each of your population pairs.
The defaults for all other settings would be used in this case, which we
**strongly discourage**.

Now, let’s look at an example that is more thoroughly specified:

```
---
# This is a comment
event_model_prior:
dirichlet_process:
parameters:
concentration:
value: 7.5
estimate: false
event_time_prior:
exponential_distribution:
rate: 100.0
mcmc_settings:
chain_length: 75000
sample_frequency: 50
operator_settings:
auto_optimize: true
auto_optimize_delay: 1000
operators:
ConcentrationScaler:
weight: 3.0
scale: 0.1
ModelOperator:
weight: 10.0
number_of_auxiliary_categories: 4
TimeSizeRateMixer:
weight: 5.0
scale: 0.02
TimeSizeRateScaler:
weight: 0.0
scale: 0.02
TimeRootSizeMixer:
weight: 3.0
scale: 0.05
EventTimeScaler:
weight: 1.0
scale: 0.02
global_comparison_settings:
ploidy: 2
genotypes_are_diploid: true
markers_are_dominant: false
population_name_delimiter: " "
population_name_is_prefix: false
constant_sites_removed: false
equal_population_sizes: false
parameters:
population_size:
estimate: true
prior:
gamma_distribution:
shape: 4.0
scale: 0.001
root_relative_population_size:
estimate: true
prior:
gamma_distribution:
shape: 100.0
scale: 0.01
offset: 0.0
freq_1:
value: 0.5
estimate: false
mutation_rate:
value: 1.0
estimate: false
operators:
RootPopulationSizeScaler:
weight: 1.0
scale: 0.05
LeafPopulationSizeScaler:
weight: 1.0
scale: 0.05
comparisons:
- comparison:
path: "../alignments/G-crombota-rossi-BabuyanClaro-Calayan.nex"
- comparison:
path: "../alignments/G-mindorensis-mindorensis-Lubang-Luzon.nex"
- comparison:
path: "../alignments/G-mindorensis-mindorensis-MaestreDeCampo-Masbate.nex"
- comparison:
path: "../alignments/G-sp_a-sp_b-Dalupiri-CamiguinNorte.nex"
```

Note

In Version 1.0.0 and above, you can specify exponential and gamma distributions using the mean. For example:

```
exponential_distribution:
rate: 100.0
```

Is equivalent to:

```
exponential_distribution:
mean: 0.01
```

And:

```
gamma_distribution:
shape: 100.0
scale: 0.01
```

Is equivalent to:

```
gamma_distribution:
shape: 100.0
mean: 1.0
```

All the settings are hierarchically nested by the indent spacing.
For example,
`event_model_prior`

, `event_time_prior`

, `mcmc_settings`

,
`operator_settings`

, `global_comparison_settings`

, and `comparison`

are all at the highest level of the hierarchy, and have various settings nested
within them.
Across a given level of the hierarchy, order does not matter. E.g., the
top-level groups of settings listed above can be arranged in any order.
Anything proceeded by a ‘#’ is a comment that is ignored by ecoevolity.

## event_model_prior¶

The `event_model_prior`

sets up the prior probabilities for all the different
ways we can cluster the comparisons together (or not).
Thus, the term “event model” is being used to refer to each of these
possibilities.

### Dirichlet-process prior¶

Prior to Version 1.0.0,
the `dirichlet_process`

(aka “DPP”) was the only option for the
`event_model_prior`

, and it has a single setting: the `concentration`

parameter.
(see this background section for more DPP details).

The settings:

```
event_model_prior:
dirichlet_process:
parameters:
concentration:
value: 7.5
estimate: false
```

specify that the concentration parameter of the Dirichlet process should be fixed to a value of 7.5. Alternatively, you can put a gamma-distributed prior on the concentration parameter, for example:

```
event_model_prior:
dirichlet_process:
parameters:
concentration:
value: 7.5
estimate: true
prior:
gamma_distribution:
shape: 2.0
scale: 3.75
```

Note

In Version 1.0.0 and above, you can specify the same gamma distribution using the mean:

```
gamma_distribution:
shape: 2.0
mean: 7.5
```

will allow the concentration parameter to be estimated. Generally, if you have a large number of comparisons (say 6 or more), it can be helpful to allow the concentration parameter to vary.

Note

The prior on the concentration parameter must be a gamma distribution.

If you’ve installed ecoevolity, you should have a command line tool called
`dpprobs`

that can help you choose a value for the concentration parameter.
Typing:

```
$ dpprobs -h
```

on the command line should provide the help menu, but the basic usage is:

```
$ dpprobs -p concentration 7.5 4
```

which requests the prior probabilities of all possible numbers of divergence events when the concentration parameter is 7.5 and there are 4 comparisons.

We often set the concentration so that 50% of the prior probability is on the maximum number of events. The idea is that by placing most of the prior probability on the model with no shared events, if our results indicate a shared event, we can be more confident that the data are driving that result. But, that is just an arbitrary preference (i.e., there is no fundamental mathematical justification for it).

### Pitman-Yor process prior¶

In Version 1.0.0 and above,
the Pitman-Yor process (PYP) can also be used
(see this background section for more PYP details).
It has two parameters: the `concentration`

and `discount`

parameters.

Here is an example of the configuration of the `pitman_yor_process`

,
where we estimate and put priors on both parameters:

```
event_model_prior:
pitman_yor_process:
parameters:
concentration:
value: 3.58
estimate: true
prior:
gamma_distribution:
shape: 2.0
mean: 3.58
discount:
value: 0.2
estimate: true
prior:
beta_distribution:
alpha: 1.0
beta: 4.0
```

In Version 1.0.0 and above, you can also use the `dpprobs`

tool to
help get a feel for how the PYP distributes prior probabilities over
the possible number of categories.
For example, we can modify our `dpprobs`

command above to give us
some information about a PYP prior with a discount of 0.5:

```
$ dpprobs -d 0.5 -p concentration 7.5 4
```

Or, we can use the following command to get a summary for our
`pitman_yor_process`

settings above:

```
$ dpprobs --shape 2.0 --scale 1.79 --discount-alpha 1.0 --discount-beta 4.0 1 4
```

(the parameter value `1`

[the second to last argument] is ignored for this
command, but is necessary to avoid an error).

### Uniform prior¶

In Version 1.0.0 and above,
a uniform distribution can also be used.
In its simplest form, this gives all possible ways to assign comparisons to
events equal probability, *a priori*:

```
event_model_prior:
uniform:
parameters:
split_weight:
value: 1.0
estimate: false
```

However, the `split_weight`

can be adjusted or estimated under a prior
distribution.
See this background section for more details.

In Version 1.0.0 and above, there is also a `swprobs`

tool (the mnemonic for
this is “s”plit “w”eight “prob”abilities).
For example:

```
$ swprobs 1 4
```

This will summarize a uniform prior distribution with a split weight of 1 for 4 comparisons.

## event_time_prior¶

The `event_time_prior`

specifies the prior on the divergence times.
For example:

```
event_time_prior:
exponential_distribution:
rate: 100.0
```

Note

In Version 1.0.0 and above, you can use the mean to specify the same exponential distribution:

```
exponential_distribution:
mean: 0.01
```

specifies an exponential distribution with a rate of 100.0 (thus the mean of the exponential prior is 1/rate = 1/100.0 = 0.01). A gamma or uniform distribution can also be used.

If the mutation rate of one or more comparison is set to 1.0, then time is in
units of expected number of substitutions per site, *relative* to the
comparsion(s) with a mutation rate set to 1.0.
If actual mutation rates are specified for your comparisons, then the
units of time will be in whatever unit the rates are in.
For example, if you give mutation rates in substitutions per site per million
years, time will be in units of millions of years.

## mcmc_settings¶

The `mcmc_settings`

simply specify how long to run the MCMC
chain, and how often to record a sample from it. For example:

```
mcmc_settings:
chain_length: 75000
sample_frequency: 50
```

tells ecoevolity to run the chain for 75,000 generations, recording a sample every 50th generation (75000/50 = 1500 total samples). This is a good starting point. For most datasets we have analyzed so far, this has been sufficient. If the chain is having mixing problems, then you can try increasing these numbers. We recommend running several independent chains (analyses) to:

Confirm the chains are converging.

Increase the number of samples from the posterior distribution (assuming the chains converged).

## operator_settings¶

The `operator_settings`

control the behavior of the “global” MCMC operators
that update the values of the model’s parameters.
By “global,” we mean that these operators affect all the comparisons (we’ll
discuss operators that only operate on a specific comparison further below).

Generally, the default values for the `operator_settings`

are sensible
and will work fine for many datasets.
We recommend trying the defaults first (i.e., simply do not specify the
`operator_settings`

section in the configuration file), and make adjustments
if your chains do not mix and/or converge well.

```
operator_settings:
auto_optimize: true
auto_optimize_delay: 1000
```

This specifies that the MCMC operators should automatically adjust their tuning
parameters to try an optimize mixing. The `auto_optimize_delay: 1000`

tells
the operators to wait until they have been used 1000 times before they start
auto-tuning (this gives them data on their acceptance rate).
Generally, auto optimization should always be used, and a delay of 1000 seems
to work well.

```
operator_settings:
operators:
ConcentrationScaler:
weight: 3.0
scale: 0.1
ModelOperator:
weight: 10.0
number_of_auxiliary_categories: 4
TimeSizeRateMixer:
weight: 5.0
scale: 0.02
TimeSizeRateScaler:
weight: 0.0
scale: 0.02
EventTimeScaler:
weight: 1.0
scale: 0.02
```

These settings control the “global” MCMC operators.
The weights are relative and control how often each operator is used.
For example, an operator with `weight: 2`

will be used twice as often (on
average) than an operator with `weight: 1`

.
`Mixer`

and `Scaler`

operators have a `scale`

parameter, which
controls how large of changes it will propose for the value of
model parameters.
If auto optimization is turned on, these are starting values, and the values of
the `scale`

parameters will be adjusted during the MCMC chain to try to
optimize mixing.

The `ModelOperator`

has a setting for the `number_of_auxiliary_categories`

.
This controls how many “extra” event categories the Gibbs sampler uses when
proposing changes to the number of events and which comparisons are assigned
to each event.
More “extra” categories can improve mixing, but slows down the MCMC chain;
fewer categories will take less time, at the risk of poorer mixing.
We do not suspect you would ever need more than 4, but you may very well be able
to use 3 or 2, and still have good mixing.
The default is 4.

Note

If you specify a weight for an operator that only updates parameters that are fixed (not estimated), ecoevolity will automatically “turn off” the operator (i.e., change the weight to zero).

**BUT**, if you set the weight of an operator to zero for a free parameter,
ecoevolity will **not** automatically “turn on” that operator. So, unless there
is another operator that updates the parameter, it will effectively be
fixed to the starting value (it will not be updated during the MCMC).

Note

You do not have to specify all settings within a group. For example you can use:

```
operator_settings:
operators:
ModelOperator:
number_of_auxiliary_categories: 2
```

in your config file to change the number of auxiliary categories to 2, but leave all other operator settings at their default values.

## global_comparison_settings¶

The `global_comparison_settings`

is an optional section that can
be useful for specifying settings to be applied to all of your
comparisons, unless otherwise overridden.
All of the settings within `global_comparison_settings`

can
also be specified for each comparison. For example:

```
global_comparison_settings:
ploidy: 2
equal_population_sizes: false
comparisons:
- comparison:
path: "species1.nex"
- comparison:
path: "species2.nex"
- comparison:
path: "species3.nex"
ploidy: 1
equal_population_sizes: true
- comparison:
path: "species4.nex"
```

specifies that Species 1, 2, and 4 are diploid organisms for which you want to estimate the root (ancestor) and leaf (descendant) effective population sizes separately, and Species 3 is haploid and you want to constrain the root and leaf populations to be the same size.

In other words, `global_comparison_settings`

allow you to specify default
settings for your comparisons that you can override.

```
ploidy: 2
```

This is the ploidy of the organisms (i.e., 1 = haploid, 2 = diploid).

```
genotypes_are_diploid: true
```

This tells ecoevolity how you have encoded your characters.
Does each cell of your character matrix represent the state of both alleles of
a diploid individual?
If so, `genotypes_are_diploid`

should be `true`

.
If each cell represents the state of a particular gene copy, then
`genotypes_are_diploid`

should be `false`

.
If you have a code(s) to represent a heterozygote, then
`genotypes_are_diploid`

should definitely be `true`

.

```
markers_are_dominant: false
```

This specifies whether your markers are dominant.
If the same code is used to designate the character state of a heterozygote and
one of the two possible homozygotes, then `markers_are_dominant`

should be
`true`

.
If you can tell the difference among a heterozygote and both homozygotes, then
`markers_are_dominant`

should be `false`

.

```
population_name_delimiter: " "
population_name_is_prefix: false
```

For each row (individual or gene copy) of your character matrix, you need to tell ecoevolity which population it was sampled from. You do this by using prefixes or suffixes for each row label in your nexus file, and the prefix or suffix needs to be delimited by a character. So, if your matrix looks like:

```
Begin data;
Dimensions ntax=20 nchar=40000;
Format datatype=standard symbols="01" missing=? gap=-;
Matrix
'population1-lizard-001' 0010...
'population1-lizard-002' 0010...
'population2-lizard-003' 0000...
'population2-lizard-004' 0011...
.
.
.
```

then you should specify:

```
population_name_delimiter: "-"
population_name_is_prefix: true
```

and ecoevolity will know that the data in the first two rows came from “population1” and the data in the third and forth rows came from “population2”.

Note

**Underscore gotcha!**

The nexus format standard interprets underscores as spaces, unless the labels are quoted. So if you have:

```
Begin data;
Dimensions ntax=20 nchar=40000;
Format datatype=standard symbols="01" missing=? gap=-;
Matrix
population1_jro-001 0010...
population1_jro-002 0010...
population2_jro-003 0000...
.
.
.
```

you need to specify:

```
population_name_delimiter: " " # just a space!
population_name_is_prefix: true
```

```
constant_sites_removed: false
```

This tells ecoevolity whether or not you have removed all constant characters/sites.

```
equal_population_sizes: false
```

If `true`

the effective sizes of the root and leaf populations are
constrained to be equal (but their shared size can still be estimated).
If `false`

the effective sizes of the root and leaf populations are estimated
separately (assuming they are not fixed to particular values).

### The parameters section¶

The following section controls the settings for the parameters for each
comparison. Again, these can be specified in the `global_comparison_settings`

section and/or for each comparison:

```
parameters:
population_size:
estimate: true
prior:
gamma_distribution:
shape: 4.0
scale: 0.001
root_relative_population_size:
estimate: true
prior:
gamma_distribution:
shape: 100.0
scale: 0.01
offset: 0.0
freq_1:
value: 0.5
estimate: false
mutation_rate:
value: 1.0
estimate: false
```

Note

In Version 1.0.0 and above, you can specify exponential and gamma distributions using the mean. For example:

```
gamma_distribution:
shape: 4.0
scale: 0.001
```

Is equivalent to:

```
gamma_distribution:
shape: 4.0
mean: 0.004
```

This allows you to specify whether or not you want estimate each parameter, and if so, what prior to use.

#### General parameter syntax¶

Before we discuss each parameter, let’s look at the general syntax that applies to all parameters (including the concentration parameter of the Dirichlet process that we saw above).

The general syntax for a parameter is:

```
parameter_name:
estimate: true # or false
value: 1.0
prior:
a_valid_distribution:
distribution_parameter: 1.0
```

So, you can specify

Whether or not the parameter should be fixed or estimated.

A value for the parameter. This is only the starting value if

`estimate`

is`true`

, or is the fixed value if`estimate`

is`false`

. If a value is not specified, the starting value is drawn from the prior.The prior probability distribution. This is ignored if

`estimate`

is`false`

.

#### mutation_rate¶

The `mutation_rate`

settings are for the mutation rate
() of the comparison.
How you scale this is up to you, but you need to make sure you are consistent
in how you scale time and effective population sizes.
For example, if you set the mutation rate to 1, then time and effective
population sizes will be scaled by the mutation rate.
Specifically, time will be in units of (i.e.,
expected substitutions per site), and effective population size will be measured
in units of .
Alternatively, if you specify an actual rate of mutation per site per
generation, then time will be in units of generations,
and population size will be in units of the effective number of diploid
individuals or gene copies () if the ploidy is 2 or 1,
respectively.
Differences in generation times among pairs can also be accounted for
via the `mutation_rate`

parameters, with the appropriate scaling
of the effective population sizes.
To help ensure the population sizes are scaled correctly, it can help to
remember that should
equal the expected differences per base between two randomly selected genomes
from a population.

#### population_size¶

The `population_size`

settings are for the effective population sizes of the
leaf (descendant) population(s) of a comparison.
If you set the mutation rate to 1, then the effective population sizes
will be scaled by the mutation rate ().
Alternatively, if you specify an actual rate of mutation per site per
generation, then the population size will be in units of the effective number
of diploid individuals or gene copies (), if the ploidy is 2
or 1, respectively.

Note

**Important**: In ecoevolity, the `population_size`

is related to, but **not**
equal to (; the genetic
diversity, or more precisely, the expected number of differences per base
between two randomly selected haploid genomes).
The relationship between `population_size`

(represented by
) and is:

(1)¶

Thus, if you have prior expectation that and you’ve
set the `mutation_rate`

to 1.0 (i.e., ) and
`ploidy`

to 2, then your prior expectation for `population_size`

is,

The relationship above is also very important to keep in mind if you
specify a mutation rate in years, or scale the mutation rate to account
for differences in rate and/or generation time from another comparison.
For example, let’s assume that for your taxon, you believe 0.004 is a
reasonable value for the average number of differences per base between two
randomly selected gene copies, and you’ve told ecoevolity that the ```
ploidy =
2
```

.
If you’ve set the mutation rate to 0.5 for one of your comparisons to account
for differences in mutation rate and/or generation time compared to another
comparison for which you’ve set the mutation rate to 1.0, then you can
use Equation (1) to adjust your prior
expectation for `population_size`

accordingly:

#### root_relative_population_size¶

The population size of the root (ancestral) population is parameterized a bit
differently.
You specify a prior on the effective population size of the root *relative* to
the mean population size of the leaf (descendant) populations.
For example:

```
root_relative_population_size:
value: 1.0
estimate: false
```

Constrains the root population to always have an effective population size that is equal to the mean size of the leaf populations. Thus, it is not an estimated (free) parameter; it is a deterministic function of the leaf population sizes. Similarly,

```
root_relative_population_size:
value: 2.0
estimate: false
```

constrains the effective population size of the root to be twice the mean effective population size of the leaves. Alternatively,

```
root_relative_population_size:
estimate: true
prior:
gamma_distribution:
shape: 100.0
scale: 0.01
```

Note

In Version 1.0.0 and above, you can specify the same gamma distribution using the mean:

```
gamma_distribution:
shape: 100.0
mean: 1.0
```

allows the effective population size of the root to be estimated, and centers the prior on its relative size on 1 (i.e., centers the prior expectation for the actual root effective population size on the mean of the leaf sizes); the mean of a gamma distribution is the product of the shape and scale parameters: Similarly

```
root_relative_population_size:
estimate: true
prior:
gamma_distribution:
shape: 100.0
scale: 0.02
```

Note

In Version 1.0.0 and above, you can specify the same gamma distribution using the mean:

```
gamma_distribution:
shape: 100.0
mean: 2.0
```

allows the effective population size of the root to be estimated, and centers the prior on its relative size on 2 (i.e., centers the prior expectation for the actual root effective population size on twice the mean of the leaf sizes).

The hope of this parameterization is to allow you to specify a more informative prior on the root effective population size. There is usually a lot of prior uncertainty in the actual value of the root population size, but we might have good reason to expect that it is similar to the mean of the leaf sizes.

#### freq_1¶

The `freq_1`

parameter is the equilibrium frequency of the “1” allele (or 1
minus the frequency of the “0” allele).
If you are using nucleotide data, we recommend that you fix the frequencies
of the “0” and “1” states to be equal:

```
freq_1:
value: 0.5
estimate: false
```

This is because there is no natural way to recode the 4 nucleotide states to two states. Thus, if you try to estimate frequencies of the two states, your results will be sensitive to the vagaries related to how you decided to code your nucleotides as binary.

However, if the characters you are using are truly biallelic, then it might make sense to estimate the frequencies of the two states. Another option is:

```
freq_1:
value: empirical
estimate: false
```

which fixes the frequencies of the two states to their empirical frequencies (i.e., the frequencies at which they appear in your data).

Note

The `empirical`

option for the value only works for the `freq_1`

parameter. You should get an error if you try to use it for any other
parameters.

### The operators section¶

The `operators`

settings control the behavior of the MCMC operators that act
upon the parameters of a comparison.
I.e., each comparison gets its own copy of the specified operators, as opposed
to the “global” operators discussed above.
As with the “global” operators, the default settings will likely work fine.
We recommend you try the defaults first (simply do not specify operator
settings), and resort to adjustments if you have poor mixing.

The following operators can only be specified here (listing them in the global
`operator_settings`

will result in an error:

```
operators:
RootPopulationSizeScaler:
weight: 1.0
scale: 0.05
LeafPopulationSizeScaler:
weight: 1.0
scale: 0.05
FreqMover:
weight: 1.0
window: 0.1
MutationRateScaler:
weight: 1.0
scale: 0.3
```

Other operators that can be specified here are:

```
operators:
TimeSizeRateMixer:
weight: 5.0
scale: 0.02
TimeRootSizeMixer:
weight: 3.0
scale: 0.05
TimeSizeRateScaler:
weight: 0.0
scale: 0.02
EventTimeScaler:
weight: 1.0
scale: 0.02
```

These three can also be “global” operators that act on all comparisons (see
`operator_settings`

section above).
The only time you might need to apply these operators to *each* comparison
is if you are having mixing trouble.
For example, the “global” `EventTimeScaler`

might not work well for
the divergence time of a particular pair.
Giving that pair its own `EventTimeScaler`

might improve mixing in such a
case.

## comparisons¶

In its simplest form, the `comparisons`

section simply is a list of the paths
to the nexus-formatted files containting the character alignments:

```
comparisons:
- comparison:
path: "../alignments/G-crombota-rossi-BabuyanClaro-Calayan.nex"
- comparison:
path: "../alignments/G-mindorensis-mindorensis-Lubang-Luzon.nex"
- comparison:
path: "../alignments/G-mindorensis-mindorensis-MaestreDeCampo-Masbate.nex"
- comparison:
path: "../alignments/G-sp_a-sp_b-Dalupiri-CamiguinNorte.nex"
```

However, as noted above, *all* of the options discussed for the
`global_comparison_settings`

section can also be applied to each comparison.
For example:

```
global_comparison_settings:
ploidy: 2
genotypes_are_diploid: true
markers_are_dominant: false
population_name_delimiter: " "
population_name_is_prefix: false
constant_sites_removed: false
equal_population_sizes: false
parameters:
population_size:
estimate: true
prior:
gamma_distribution:
shape: 4.0
scale: 0.001
operators:
TimeRootSizeMixer:
weight: 3.0
scale: 0.05
comparisons:
- comparison:
path: "alignments/G-crombota-rossi-BabuyanClaro-Calayan.nex"
- comparison:
path: "alignments/G-mindorensis-mindorensis-Lubang-Luzon.nex"
ploidy: 1
genotypes_are_diploid: false
population_name_delimiter: "-"
population_name_is_prefix: true
constant_sites_removed: true
parameters:
population_size:
estimate: true
value: 0.005
prior:
gamma_distribution:
shape: 2.0
scale: 0.0025
operators:
TimeRootSizeMixer:
weight: 5.0
scale: 0.01
- comparison:
path: "alignments/G-mindorensis-mindorensis-MaestreDeCampo-Masbate.nex"
- comparison:
path: "alignments/G-sp_a-sp_b-Dalupiri-CamiguinNorte.nex"
```

Here, we have overridden many of the “global” settings for the second comparison.