Varying dataset size¶
In the Simulation-based analyses section, all of the datasets we simulated and analyzed consisted of 500,000 characters. Now, we are going to simulate and analyze datasets with 5,000, 10,000, 50,000, and 100,000 characters to see how our models perform.
We saw from the results of the analyses in the Simulation-based analyses section that the most challenging simulation conditions for the models we are comparing were those in which all 10 pairs of populations diverge independently (i.e., the constrained model defined in the fixed-pairs-10-independent-time-1_0-0_05.yml config file).
As a result, for the analyses in this section, we will simulate all of the datasets under this model.
We will then analyze these datasets under the three models we are comparing:
Dirichlet-process (DP) prior
Pitman-Yor process (PYP) prior
Uniform distribution with a split-weight parameter (SW)
Setup our environment¶
Before anything else, navigate to the project directory (if you are not already there):
cd /path/to/your/copy/of/ecoevolity-model-prior
If you haven’t already, let’s activate the Python environment for this project:
conda activate ecoevolity-model-prior-project
Create simulation scripts¶
Now, let’s cd into the project’s scripts directory:
cd scripts
Use the create_new_batch_of_simcoevolity-scripts.py
script to create
simcoevolity scripts for generating a new batch of simulated datasets with
varying numbers of characters under the “independent divergences” model:
python create_new_batch_of_simcoevolity-scripts.py -n 10 simcoevolity-scripts/template-simcoevolity-*-chars-*00.template
The output should confirm the creation of 4 new scripts for running simcoevolity (one for each of the four dataset sizes), and report a batch ID:
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-chars-100000-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-chars-10000-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-chars-50000-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-chars-5000-batch-308303035.sh'
Simcoevolity scripts successfully written.
Batch ID:
308303035
Your batch ID number will be different.
IMPORTANT: Make a note of your batch ID number; you will need it moving forward.
For all of the commands below, use your batch ID number in place of 308303035.
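If you lose track of your batch ID, it can be recovered from the names of the generated scripts. The following is a minimal sketch, assuming the script names follow the pattern shown in the output above; extract_batch_id is a hypothetical helper and is not part of the project:

```python
import re

# Hypothetical helper (not part of the project): the batch ID is assumed to be
# the run of digits between "-batch-" and the ".sh" extension.
BATCH_PATTERN = re.compile(r"-batch-(\d+)\.sh$")


def extract_batch_id(script_name):
    """Return the batch ID embedded in a simcoevolity script name."""
    match = BATCH_PATTERN.search(script_name)
    if match is None:
        raise ValueError(f"No batch ID found in {script_name!r}")
    return match.group(1)


example = ("simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-"
           "chars-5000-batch-308303035.sh")
print(extract_batch_id(example))
```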
Commit simulation scripts¶
Before we run the simcoevolity scripts, let’s add them to the staging area of the project Git repository:
git add simcoevolity-scripts/*308303035.sh
Then, commit them to the repository:
git commit
A good commit message might look something like:
Adding batch 308303035 of simcoevolity scripts.
Adding shell scripts generated by:
create_new_batch_of_simcoevolity-scripts.py
These scripts will run simcoevolity to generate a batch of 10
simulated datasets each for 4 different sizes of datasets.
Lastly, push the new scripts to the remote repository hosted on GitHub:
git push origin master
Run simulation scripts¶
Next, cd
into the simcoevolity-scripts
directory:
cd simcoevolity-scripts
If you are working on AU’s Hopper cluster, use a for loop to submit the four simcoevolity scripts to the queue:
for script_path in *308303035.sh; do ../../bin/psub "$script_path"; done
Note
If you are working on a different cluster, you will need to either update ../../bin/psub to work for your system, or replace ../../bin/psub with whatever command is used on your cluster to submit jobs.
If you are not on a cluster, you can simply run the scripts directly:
for script_path in *308303035.sh; do bash "$script_path"; done
After submitting the scripts with the for loop, go ahead and cd
out of the
simcoevolity-scripts
directory, which will put you back up in the
scripts
directory:
cd ..
Assuming you are on the Hopper cluster, you can monitor the progress of the jobs by using:
qstat
The simulation files generated by simcoevolity
will be output to subdirectories of the ecoevolity-simulations
directory.
When the jobs finish, from the scripts
directory you can type:
ls ../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-*00
This should show 4 directories, one for each of the different sized datasets we simulated:
fixed-pairs-10-independent-time-1_0-0_05-chars-10000
fixed-pairs-10-independent-time-1_0-0_05-chars-100000
fixed-pairs-10-independent-time-1_0-0_05-chars-5000
fixed-pairs-10-independent-time-1_0-0_05-chars-50000
Within each of these 4 directories should be a directory associated with your batch number, batch-308303035.
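The same check can be scripted. This is a rough sketch, assuming the directory layout listed above; missing_batch_dirs is a hypothetical helper and is not one of the project's scripts:

```python
from pathlib import Path

# Dataset sizes (numbers of characters) simulated in this section.
DATASET_SIZES = (5000, 10000, 50000, 100000)


def missing_batch_dirs(sim_root, batch_id, sizes=DATASET_SIZES):
    """Return the expected batch directories that do not exist (hypothetical helper)."""
    missing = []
    for size in sizes:
        batch_dir = (
            Path(sim_root)
            / f"fixed-pairs-10-independent-time-1_0-0_05-chars-{size}"
            / f"batch-{batch_id}"
        )
        if not batch_dir.is_dir():
            missing.append(batch_dir)
    return missing


# Run from the scripts directory, and replace the batch ID with your own.
for path in missing_batch_dirs("../ecoevolity-simulations", "308303035"):
    print(f"Missing: {path}")
```

If nothing is printed, all four batch directories were found.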
Analyzing simulated data¶
Next, we will use ecoevolity to analyze each simulated dataset four times under each of 6 different configurations (the DP, PYP, and SW models, each using and ignoring the constant characters).
Given that we simulated 10 datasets each of 4 different sizes, this will be 10 × 4 × 6 × 4 = 960 ecoevolity analyses.
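As a quick check of the arithmetic, the total number of analyses follows from the counts stated above (the variable names are only illustrative):

```python
# Counts taken from the text above.
n_datasets_per_size = 10   # simulated datasets for each dataset size
n_dataset_sizes = 4        # 5,000, 10,000, 50,000, and 100,000 characters
n_configurations = 6       # DP, PYP, and SW, each with and without constant characters
n_runs_per_config = 4      # each dataset is analyzed four times per configuration

n_analyses = (n_datasets_per_size * n_dataset_sizes
              * n_configurations * n_runs_per_config)
print(n_analyses)  # 960
```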
If you are on the Hopper cluster, we will use a script that will run all of these analyses as a single job array.
Note
If you are not on the Hopper cluster, the submit_sim_analyses.sh
script we use below will not work on your system.
You will either need to update that script to work with your system,
or simply submit all these analyses “manually.”
This can be done easily with a for loop. For example:
for script_path in ../ecoevolity-simulations/*/batch-308303035/*qsub.sh; do echo "$script_path"; done
Just change “echo” to whatever command is necessary to submit jobs on your system (and remember your batch ID number is different).
To do this, make sure you are in the scripts
directory of the project and
enter:
bash submit_sim_analyses.sh ../ecoevolity-simulations/*/batch-308303035
This will produce a lot of output, because the script first looks for results of each analysis, and, if the results are absent (or incomplete), it adds a job to the job array to run the analysis and writes you a message as to why. Because this is our first time running the job-submission script, no results are present, and all analyses will be run (hence all the output).
On Hopper, you can monitor the job array using:
qstat -t
which will show the full list of jobs in the array that are running or waiting to run:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[1].hopper-mgt ...n_job_array-1 jro0014 00:00:53 R general
1942324[2].hopper-mgt ...n_job_array-2 jro0014 00:00:53 R general
1942324[3].hopper-mgt ...n_job_array-3 jro0014 00:00:52 R general
1942324[4].hopper-mgt ...n_job_array-4 jro0014 00:00:52 R general
1942324[5].hopper-mgt ...n_job_array-5 jro0014 00:00:29 R general
1942324[6].hopper-mgt ...n_job_array-6 jro0014 00:00:29 R general
1942324[7].hopper-mgt ...n_job_array-7 jro0014 00:00:28 R general
1942324[8].hopper-mgt ...n_job_array-8 jro0014 00:00:28 R general
1942324[9].hopper-mgt ...n_job_array-9 jro0014 00:00:27 R general
1942324[10].hopper-mgt ..._job_array-10 jro0014 00:00:12 R general
1942324[11].hopper-mgt ..._job_array-11 jro0014 00:00:13 R general
...
The submit_sim_analyses.sh script restricts the job array so that at most 400 of the jobs run at a time.
The job array will cycle through all 960 jobs until they all finish.
If you just want to know how many jobs are actively running, you can pipe the output of qstat -t to grep and then to wc:
qstat -t | grep -i "R gen" | wc -l
You can change this to get the number of jobs the array currently has waiting to run:
qstat -t | grep -i "Q gen" | wc -l
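If you would rather get both counts at once, the qstat -t output can be tallied by job state. This is only a sketch: count_job_states is a hypothetical helper, and it assumes the six-column layout shown in the example output above. The third job line in the sample is made up to illustrate a queued job.

```python
from collections import Counter


def count_job_states(qstat_output):
    """Tally job-array entries by state (e.g., R or Q); hypothetical helper."""
    states = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Skip the header, the separator, and anything that is not an
        # array job; array job IDs look like 1942324[1].hopper-mgt
        if len(fields) < 6 or "[" not in fields[0]:
            continue
        # Assumes the state is the second-to-last column.
        states[fields[-2]] += 1
    return states


# The first two job lines come from the example above; the third is invented.
sample = """\
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[1].hopper-mgt ...n_job_array-1 jro0014 00:00:53 R general
1942324[2].hopper-mgt ...n_job_array-2 jro0014 00:00:53 R general
1942324[3].hopper-mgt ...n_job_array-3 jro0014 0 Q general
"""

counts = count_job_states(sample)
print(f"running: {counts['R']}, queued: {counts['Q']}")
```

To use this on a live cluster, the output of qstat -t would be passed to the function in place of the sample text.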
The job array will create a lot of output files in your scripts
directory.
If all is working well, you can get rid of these using the following command
from within the scripts
directory of the project:
rm spawn_job_array.*
If all is not going well, these output files might have content to help you figure out what the problem is.
Once the qstat -t
command is showing that all of your analyses have
finished, we will run the submit_sim_analyses.sh
script again.
If any of the 960 jobs failed, they will get re-run:
bash submit_sim_analyses.sh ../ecoevolity-simulations/*/batch-308303035
Note
Only re-run this command after all the analyses started
by this command the first time are no longer running.
In other words, the qstat -t command should produce no output (assuming you are not running analyses for other projects) before you re-run this command.
If most of your analyses finished successfully, the script will seem like it’s running slowly.
Just be patient; it is checking the output of all the analyses, and only writes
a message to the screen if it finds an analysis that didn’t finish
successfully.
So, if it seems like nothing is happening, that’s a good thing (i.e., the
script is finding lots of successfully completed analyses).
Your output from running the submit_sim_analyses.sh script for a second time might look something like the following (this example output shows a different batch ID, 579088984):
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-100000/batch-579088984'
Incomplete stdout: /home/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-100000/batch-579088984/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-07-config-run-3-qsub.sh
Incomplete stdout: /home/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-100000/batch-579088984/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-07-config-run-4-qsub.sh
Incomplete stdout: /home/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-100000/batch-579088984/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-08-config-run-1-qsub.sh
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-10000/batch-579088984'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-50000/batch-579088984'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-5000/batch-579088984'
Submitting analyses to queue...
../bin/psub -t 00:30:00 -a 1-3 ../bin/spawn_job_array /home/jro0014/ecoevolity-model-prior/scripts/spawn_job_array.ahk9ldpOomQI
qsub -q general -j oe -l nodes=1:ppn=1,walltime=00:30:00 -t 1-3 ../bin/spawn_job_array -F "/home/jro0014/ecoevolity-model-prior/scripts/spawn_job_array.ahk9ldpOomQI"
1977829[].hopper-mgt
This output tells us that a small number (3) of the analyses failed to finish correctly, and are being re-run via a new job array.
Again, you can monitor the progress of your re-analyses using qstat -t, and once they finish, go ahead and run the submit_sim_analyses.sh script for a third time (from within the scripts directory):
bash submit_sim_analyses.sh ../ecoevolity-simulations/*/batch-308303035
Hopefully the third time, your output will look like:
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-100000/batch-579088984'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-10000/batch-579088984'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-50000/batch-579088984'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05-chars-5000/batch-579088984'
All analyses are complete and clean!
This confirms that all of your analyses have successfully finished!
Note, the job failure rate of the Hopper cluster
fluctuates.
So, you might have some failures that get re-run during your third use
of submit_sim_analyses.sh
above.
If so, just monitor those re-runs with qstat -t, and run the submit_sim_analyses.sh script again after they finish (as we did three times above).
Eventually, you should get the “All analyses are complete and clean!” message.
Go ahead and clean out all the output files from the job array from inside the
scripts
directory:
rm spawn_job_array.*
Summarizing the results¶
After the submit_sim_analyses.sh script confirms that “All analyses are complete and clean!”, it is time for us to summarize the results from all 960 analyses we ran.
Our results are currently scattered across 960 log files output by
ecoevolity
during these analyses.
These log files contain MCMC samples collected from the posterior distribution
of the respective model given each simulated dataset.
We will use the Python script scripts/parse_sim_results.py
to parse all
these log files (posterior samples) and summarize them in tab-delimited tables.
We will run the parse_sim_results.py
Python script using the
parse_sim_results.sh
Bash script, so that we can submit it as a job to the
cluster’s queue.
Assuming you are on the Hopper cluster and in the scripts
directory of your
copy of the project, run:
../bin/psub parse_sim_results.sh ../ecoevolity-simulations/*/*308303035
Note
If you are not on the Hopper cluster, you can simply run the Python script directly:
python parse_sim_results.py ../ecoevolity-simulations/*/*308303035
Just make sure you have the ecoevolity-model-prior-project
conda
environment activated.
Use the qstat
command to monitor the progress of the job.
Once the output of qstat
confirms the script has finished running,
we can count the number of lines in each of the tab-delimited text files that were created:
wc -l ../ecoevolity-simulations/*/batch-308303035/*results.tsv
Each of these files should have 11 lines (10 lines of results, plus a line with the column headers).
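With this many files, it can be easier to list only the ones that do not have the expected number of lines. A minimal sketch, assuming the uncompressed results.tsv files are still in place; unexpected_line_counts is a hypothetical helper and is not part of the project:

```python
import glob


def unexpected_line_counts(paths, expected=11):
    """Map each path whose line count differs from `expected` to its line count."""
    bad = {}
    for path in paths:
        with open(path) as stream:
            n_lines = sum(1 for _ in stream)
        if n_lines != expected:
            bad[path] = n_lines
    return bad


# Run from the scripts directory, and replace the batch ID with your own.
pattern = "../ecoevolity-simulations/*/batch-308303035/*results.tsv"
for path, n_lines in unexpected_line_counts(glob.glob(pattern)).items():
    print(f"{path}: {n_lines} lines (expected 11)")
```

If nothing is printed, every table that was found has a header line plus 10 lines of results.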
Now, let’s gzip
these files:
gzip ../ecoevolity-simulations/*/batch-308303035/*results.tsv
Now, add them to the staging area of the project Git repository:
git add ../ecoevolity-simulations/*/batch-308303035/*results.tsv.gz
And, commit them to the repository database:
git commit
A good commit message might look something like:
Adding batch 308303035 of simulation results.
Adding gzipped, tab-delimited files. Each file summarizes the results of
ecoevolity analyses of 10 simcoevolity simulation replicates. Adding these
files for simulations of 4 different dataset sizes, each analysed with 3
models, with and without constant characters.
Finally, push your new results to the remote repository hosted on a GitHub server:
git push origin master
Cleaning up¶
After we have committed and pushed the results of our analyses, let’s clean up the thousands of files that were generated during the simulations and analyses:
bash archive_sim_files.sh ../ecoevolity-simulations/*/*308303035
This script will copy these files into compressed archives and remove the original files.
Now, we can add these archives to the git repository:
git add ../ecoevolity-simulations/*/batch-308303035/*.tar.gz
And, commit them to the repository database:
git commit
A good commit message might look something like:
Adding archives of sim files for batch 308303035.
Adding compressed archives of all the ``simcoevolity`` and
``ecoevolity`` files for batch 308303035 of simulation replicates.
These files are handled by Git LFS, so only a reference to the
files is stored in the git database.
Finally, push everything to the remote repository on GitHub:
git push origin master