The Data¶
To use phycoeval, you will need biallelic genetic characters aligned across the species for which you wish to infer a phylogeny. Phycoeval assumes these characters are unlinked, but, depending on your data, you might be better off violating this assumption of the model (more about this below). Biallelic means that each character has two possible states. But, most genomic data are composed of nucleotides, which have four states. This can be accommodated by coding the nucleotides as biallelic.
Coding nucleotide characters¶
Phycoeval will read your nucleotide characters, and recode them as biallelic. If phycoeval finds a polyallelic character with more than two states across your samples, it will report an error and stop running. However, you can tell phycoeval to recode these polyallelic characters as either being the first state (0) or a different state (1). Your other option is to exclude these sites.
Note
If you are analyzing nucleotide data as biallelic characters, we strongly recommend that you do not try and estimate the frequencies of the two states. Instead, fix the frequencies to be equal. This is because there are many ways to recode the 4 nucleotide states to two states. Thus, if you try to estimate frequencies of the two states, your results can be sensitive to how you decided to code your nucleotides as binary.
Unlinked¶
Unlinked means that each character is assumed to have evolved along a gene tree that is independent from the others (conditional on the population history). In other words, it assumes that the characters are far enough apart from one another in the genome that they segregate independently during meiosis.
Many genomic data sets consist of many loci, each of which is a stretch of linked nucleotides. Examples include “RADseq” or sequence-capture data. What should we do with such data?
What to do with linked characters?¶
What if you have loci comprising sequences of linked nucleotides? Based on simulations [14][18][16][20], we recommend that you analyze all of your characters (including the constant ones) and violate the assumption of unlinked characters. Ecoevolity and phycoeval perform better when you use all of the sites (including linked and constant ones) compared to reducing the data to only one variable character per locus. So, your best bet is to include all of the sites from loci.
Formatting your data¶
Currently, phycoeval accepts two input formats for genetic characters: Nexus and YAML.
Nexus format¶
Standard haploid data¶
You can represent your data in a “standard” 0/1 format. Here’s an example of three species from which we’ve sampled 4 genomes each (2 diploid individuals):
#NEXUS
BEGIN TAXA;
DIMENSIONS NTAX=12;
TAXLABELS
lizard-953-a-speciesA
lizard-953-b-speciesA
lizard-954-a-speciesA
lizard-954-b-speciesA
lizard-152-a-speciesB
lizard-152-b-speciesB
lizard-154-a-speciesB
lizard-154-b-speciesB
lizard-331-a-speciesC
lizard-331-b-speciesC
lizard-338-a-speciesC
lizard-338-b-speciesC
;
END;
BEGIN CHARACTERS;
DIMENSIONS NCHAR=273658;
FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;
MATRIX
lizard-953-a-speciesA 001010...
lizard-953-b-speciesA 001010...
lizard-954-a-speciesA 101010...
lizard-954-b-speciesA 001011...
lizard-152-a-speciesB 101110...
lizard-152-b-speciesB 101110...
lizard-154-a-speciesB 001011...
lizard-154-b-speciesB 101010...
lizard-331-a-speciesC 001010...
lizard-331-b-speciesC 001010...
lizard-338-a-speciesC 011010...
lizard-338-b-speciesC 001011...
;
END;
Note, we don’t need separate TAXA and CHARACTER blocks like above. Instead, we can specify a DATA block:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=273658;
FORMAT DATATYPE=STANDARD SYMBOLS="01" MISSING=? GAP=-;
MATRIX
lizard-953-a-speciesA 001010...
lizard-953-b-speciesA 001010...
lizard-954-a-speciesA 101010...
lizard-954-b-speciesA 001011...
lizard-152-a-speciesB 101110...
lizard-152-b-speciesB 101110...
lizard-154-a-speciesB 001011...
lizard-154-b-speciesB 101010...
lizard-331-a-speciesC 001010...
lizard-331-b-speciesC 001010...
lizard-338-a-speciesC 011010...
lizard-338-b-speciesC 001011...
;
END;
Both examples above would be equivalent for phycoeval, but the Nexus Class Library used by phycoeval will report a message about an implicit TAXA block if you use the latter format. Either way, in your phycoeval config file, you need to tell phycoeval that the states, or genotypes, are haploid by declaring:
genotypes_are_diploid: false
Standard diploid data¶
Above, each cell in our matrix represented which state was present for the character in a particular haploid genome. We can also represent the same data where each cell represents the genotype of a diploid individual:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=6 NCHAR=273658;
FORMAT DATATYPE=STANDARD SYMBOLS="012" MISSING=? GAP=-;
MATRIX
lizard-953-speciesA 002020...
lizard-954-speciesA 102021...
lizard-152-speciesB 202220...
lizard-154-speciesB 102021...
lizard-331-speciesC 002020...
lizard-338-speciesC 012021...
;
END;
Now, “0” represents that the individual has two copies with the 0 state, “2” represents two copies of the 1 state, and “1” represents a heterozygote. Again, in your phycoeval config file, you need to tell phycoeval that the states, or genotypes, are diploid by declaring:
genotypes_are_diploid: true
Nucleotide data¶
If you have nucleotide data, the easiest thing is provide the nucleotide characters to phycoeval as is, and let it recode them as biallelic. Here’s an example where we are providing nucleotides as haploid (each cell is a haploid genotype):
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=273658;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
lizard-953-a-speciesA ACGTAG...
lizard-953-b-speciesA ACGTAG...
lizard-954-a-speciesA GCGTAG...
lizard-954-b-speciesA ACGTAA...
lizard-152-a-speciesB GCGCAG...
lizard-152-b-speciesB GCGCAG...
lizard-154-a-speciesB ACGTAA...
lizard-154-b-speciesB GCGTAG...
lizard-331-a-speciesC ACGTAG...
lizard-331-b-speciesC ACGTAG...
lizard-338-a-speciesC ATGTAG...
lizard-338-b-speciesC ACGTAA...
;
END;
This is sometimes referred to as “phased” data. Again, if we are providing a matrix where each cell represents a haploid genotype, we need to tell phycoeval this is so via the config file:
genotypes_are_diploid: false
We can also represent the same data as “unphased”, where each cell represents a diploid genotype:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=6 NCHAR=273658;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
lizard-953-a-speciesA ACGTAG...
lizard-954-a-speciesA RCGTAR...
lizard-152-a-speciesB GCGCAG...
lizard-154-a-speciesB RCGTAR...
lizard-331-a-speciesC ACGTAG...
lizard-338-a-speciesC AYGTAR...
;
END;
We need to indicate this in the config file:: accordingly:
genotypes_are_diploid: true
Population labels¶
In our nexus character matrix, we need to indicate which species (or population) each row corresponds to. We can do this with either using a prefix or suffix in the row (or taxon) labels. For example, in this nexus data file:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=6 NCHAR=273658;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
lizard-953-a-speciesA ACGTAG...
lizard-954-a-speciesA RCGTAR...
lizard-152-a-speciesB GCGCAG...
lizard-154-a-speciesB RCGTAR...
lizard-331-a-speciesC ACGTAG...
lizard-338-a-speciesC AYGTAR...
;
END;
we are using the suffixes to indicate that the first two samples came
from a species we are calling speciesA
,
the next to samples came from a species called speciesB
,
and
the last two samples came from a species we are calling
speciesC
.
In our phycoeval config file
we have to indicate this with:
population_name_delimiter: "-"
population_name_is_prefix: false
This tells phycoeval to look for the last bit of each row label that is separated by a “-” to figure out the population label.
Note
If you like to use underscores as a population label delimiter, just watch out for a gotcha related to how the nexus format treats underscores
YAML format¶
The YAML-formatted data format is a lot more efficient (much smaller file sizes). Instead of a full alignment, it only contains a list of allele count patterns, followed by a list of the weight of each pattern (i.e., how many times the allele pattern occurs in the alignment). For a very small example let’s convert the following nexus alignment and convert it to the YAML format that ecoevolity accepts as input:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=12 NCHAR=6;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
lizard-953-a-speciesA ACGTAG
lizard-953-b-speciesA ACGTAG
lizard-954-a-speciesA GCGTAG
lizard-954-b-speciesA ACGTAA
lizard-152-a-speciesB GCGCAG
lizard-152-b-speciesB GCGCAG
lizard-154-a-speciesB ACGTAA
lizard-154-b-speciesB GCGTAG
lizard-331-a-speciesC ACGTAG
lizard-331-b-speciesC ACGTAG
lizard-338-a-speciesC ATGTAG
lizard-338-b-speciesC ACGTAA
;
END;
To convert these data to YAML format, we will assume the first nucleotide (from the top) in each column is state “0”, and the second nucleotide (if any) is state “1”. Doing so gives us the following allele-count patterns in YAML format:
---
markers_are_dominant: false
population_labels:
- speciesA
- speciesB
- speciesC
allele_count_patterns:
- [[1,4], [3,4], [0,4]]
- [[0,4], [0,4], [1,4]]
- [[0,4], [0,4], [0,4]]
- [[0,4], [2,4], [0,4]]
- [[1,4], [1,4], [1,4]]
pattern_weights:
- 1
- 1
- 2
- 1
- 1