Simulation-based analyses¶
As described in the Background and The ecoevolity configs sections, we will be simulating datasets under five different models of divergences among the 10 pairs of populations:
Dirichlet-process (DP) prior
Pitman-Yor process (PYP) prior
Uniform distribution with a split-weight parameter (SW)
Simultaneous divergence
Independent divergences
The settings for these models are contained in five ecoevolity configuration files
in the ecoevolity-configs
directory of the project. These
are, respectively:
pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05.yml
pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05.yml
pairs-10-unif-sw-0_55-7_32-time-1_0-0_05.yml
fixed-pairs-10-simultaneous-time-1_0-0_05.yml
fixed-pairs-10-independent-time-1_0-0_05.yml
We will use simcoevolity to simulate these datasets in a format ready to
analyze with ecoevolity.
We will generate scripts in the scripts/simcoevolity-scripts
directory that
will run these simulations in batches of manageable numbers of replicates.
Set up our environment¶
Before anything else, navigate to the project directory (if you are not already there):
cd /path/to/your/copy/of/ecoevolity-model-prior
If you are working on AU’s Hopper cluster, this will be:
cd /scratch/YOUR-AU-USERNAME/ecoevolity-model-prior
If you haven’t already, let’s activate the Python environment for this project:
conda activate ecoevolity-model-prior-project
Create simulation scripts¶
Now, let’s cd into the project’s scripts directory:
cd scripts
Use the create_new_batch_of_simcoevolity-scripts.py
Python script to create
simcoevolity scripts for generating a new batch of 10 simulated datasets for
each of the 5 models (ecoevolity config files):
python create_new_batch_of_simcoevolity-scripts.py -n 10 simcoevolity-scripts/template-simcoevolity-*05.template
The output should confirm the creation of 5 new scripts for running
simcoevolity (one for each config in ecoevolity-configs), and report a
batch ID:
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-simultaneous-time-1_0-0_05-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-pairs-10-unif-sw-0_55-7_32-time-1_0-0_05-batch-308303035.sh'
Simcoevolity scripts successfully written.
Batch ID:
308303035
But, your batch ID number should be different.
IMPORTANT: Make a note of your batch ID number; you will need it moving forward.
For all of the commands below, use your batch ID number in place of 308303035.
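If you would rather not copy the batch ID by hand, it can be pulled out of the script names reported above. The following is a minimal Python sketch, assuming the "-batch-<digits>.sh" suffix shown in the output; the function name `extract_batch_ids` is hypothetical and is not part of the project's scripts.

```python
import re

# Matches the "-batch-<digits>.sh" suffix seen in the script names above.
BATCH_PATTERN = re.compile(r"-batch-(\d+)\.sh")


def extract_batch_ids(output_text):
    """Return the set of batch IDs found in the script-creation output."""
    return set(BATCH_PATTERN.findall(output_text))


# Two of the lines reported by create_new_batch_of_simcoevolity-scripts.py above.
example_output = """\
Script written to 'simcoevolity-scripts/simcoevolity-fixed-pairs-10-independent-time-1_0-0_05-batch-308303035.sh'
Script written to 'simcoevolity-scripts/simcoevolity-pairs-10-unif-sw-0_55-7_32-time-1_0-0_05-batch-308303035.sh'
"""

batch_ids = extract_batch_ids(example_output)
print(batch_ids)
```

A single ID in the resulting set confirms that all of the scripts belong to the same batch.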
Commit simulation scripts¶
Before we run the simcoevolity scripts, let’s add them to the staging area
of the project Git repository:
git add simcoevolity-scripts/*308303035.sh
Then, commit them to the repository:
git commit
A good commit message might look something like:
Adding batch 308303035 of simcoevolity scripts.
Adding shell scripts generated by:
create_new_batch_of_simcoevolity-scripts.py
These scripts will run simcoevolity to generate a batch of 10
simulated datasets.
Lastly, push the new scripts to the remote repository hosted on GitHub:
git push origin master
Run simulation scripts¶
Next, cd
into the simcoevolity-scripts
directory:
cd simcoevolity-scripts
If you are working on AU’s Hopper cluster, use a for loop to submit the five
simcoevolity
scripts to the queue:
for script_path in *308303035.sh; do ../../bin/psub "$script_path"; done
Note
If you are working on a different cluster, you will need
to either update ../../bin/psub to work for your system,
or replace ../../bin/psub with whatever command is used on your
cluster to submit jobs.
If you are not on a cluster, you can simply run the scripts directly:
for script_path in *308303035.sh; do bash "$script_path"; done
After submitting the scripts with the for loop, go ahead and cd
out of the
simcoevolity-scripts
directory, which will put you back up in the
scripts
directory:
cd ..
Assuming you are on the Hopper cluster, you can monitor the progress of the jobs by using:
qstat
When the jobs are waiting in queue to start, the output will look like:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942030.hopper-mgt ...-308303035.sh jro0014 0 Q general
1942031.hopper-mgt ...-308303035.sh jro0014 0 Q general
1942032.hopper-mgt ...-308303035.sh jro0014 0 Q general
1942033.hopper-mgt ...-308303035.sh jro0014 0 Q general
1942034.hopper-mgt ...-308303035.sh jro0014 0 Q general
When the jobs are running, the output will look like:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942030.hopper-mgt ...-308303035.sh jro0014 00:02:19 R general
1942031.hopper-mgt ...-308303035.sh jro0014 00:02:19 R general
1942032.hopper-mgt ...-308303035.sh jro0014 00:02:19 R general
1942033.hopper-mgt ...-308303035.sh jro0014 00:02:19 R general
1942034.hopper-mgt ...-308303035.sh jro0014 00:02:19 R general
When the jobs are complete, the output will briefly look like the following
(a few minutes after completing, the jobs will disappear from the output of
qstat):
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942030.hopper-mgt ...-308303035.sh jro0014 00:05:43 C general
1942031.hopper-mgt ...-308303035.sh jro0014 00:05:42 C general
1942032.hopper-mgt ...-308303035.sh jro0014 00:05:23 C general
1942033.hopper-mgt ...-308303035.sh jro0014 00:05:24 C general
1942034.hopper-mgt ...-308303035.sh jro0014 00:05:23 C general
Each of these simcoevolity scripts does the following:
1. Uses the simcoevolity tool to simulate datasets and output them into an ecoevolity-simulations directory in the project directory.
2. Creates YAML-formatted config files for analyzing each dataset with ecoevolity. For each dataset, 6 config files are created: one for each of the three models we wish to compare for the project (the DP, PYP, and SW models defined in the config files in the ecoevolity-configs directory), and another for each of these three models, but configured to ignore constant characters (i.e., only use characters that vary among the genomes sampled from the two populations).
3. For each config file created in Step 2, generates 4 Bash scripts for analyzing the respective dataset 4 times with ecoevolity (i.e., four independent MCMC chains for each analysis). Thus, for each dataset simulated by simcoevolity there are 6 config files and 24 Bash scripts for running ecoevolity analyses.
All of the files created during these 3 steps are output into
an ecoevolity-simulations directory in the project directory.
If you are still in the scripts
directory, you can list the contents of
this directory using:
ls ../ecoevolity-simulations
This should show 5 directories, one for each config file in ecoevolity-configs:
fixed-pairs-10-independent-time-1_0-0_05
fixed-pairs-10-simultaneous-time-1_0-0_05
pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05
pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05
pairs-10-unif-sw-0_55-7_32-time-1_0-0_05
Let’s look into the first one:
ls ../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05
You should see a directory associated with your batch number (your number will be different from mine):
batch-308303035
If you look in this directory:
ls ../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035
You will see a very long list of files, so I won’t show the output here.
For each simcoevolity simulation replicate there are:
10 data files (one for each of the pairs of populations). The names of these files end with “chars.txt”.
1 file containing the true values of all the parameters that simcoevolity used to simulate the data files. These files end with “-true-values.txt”.
6 ecoevolity config files. Two each for the DP, PYP, and SW models. Two each, because we will run analyses for each model both using and ignoring constant characters in the simulated data files. These files end with “-config.yml”.
24 Bash scripts for analyzing the dataset with ecoevolity. Four independent analyses (MCMC chains) for each of the 6 config files. These files end with “-qsub.sh”.
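To make the 24 scripts per replicate concrete, here is a short Python sketch that enumerates their expected names. The naming pattern is an assumption inferred from the script paths that appear in the submit_sim_analyses.sh output further down this page. Only DP-config names appear there, so the PYP and SW names, and the two-digit replicate index, are assumed to follow the same pattern. The function name `expected_qsub_scripts` is hypothetical.

```python
from itertools import product

# Names of the three analysis configs listed in ecoevolity-configs above.
ANALYSIS_CONFIGS = [
    "pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05",
    "pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05",
    "pairs-10-unif-sw-0_55-7_32-time-1_0-0_05",
]


def expected_qsub_scripts(sim_index, number_of_runs=4):
    """List the qsub script names expected for one simulation replicate.

    Assumed pattern: [var-only-]<config>-sim-<NN>-config-run-<R>-qsub.sh
    where the "var-only-" prefix marks analyses ignoring constant characters.
    """
    names = []
    for prefix, config, run in product(
        ("", "var-only-"), ANALYSIS_CONFIGS, range(1, number_of_runs + 1)
    ):
        names.append(
            f"{prefix}{config}-sim-{sim_index:02d}-config-run-{run}-qsub.sh"
        )
    return names


scripts = expected_qsub_scripts(0)
print(len(scripts))
print(scripts[0])
```

Comparing such a list against the actual directory listing is one way to spot a replicate whose scripts were not all written.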
Analyzing simulated data¶
Next, we need to run all those Bash scripts to analyze each simulation
replicate with ecoevolity
four times under 6 different configurations.
Given that we simulated 10 datasets under 5 different models, this
will be 5 × 10 × 6 × 4 = 1200 ecoevolity analyses.
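The total follows from multiplying the counts given so far: models simulated, datasets per model, configs per dataset, and MCMC chains per config. A quick check in Python (the variable names are illustrative):

```python
number_of_models_simulated = 5  # ecoevolity configs used to simulate data
datasets_per_model = 10         # the "-n 10" batch size used above
configs_per_dataset = 6         # DP, PYP, SW, each with and without constant characters
chains_per_config = 4           # independent MCMC chains per config

total_analyses = (
    number_of_models_simulated
    * datasets_per_model
    * configs_per_dataset
    * chains_per_config
)
print(total_analyses)  # 1200
```

If you generate a batch with a different number of replicates, only `datasets_per_model` changes.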
If you are on the Hopper cluster, we will use a script that will run all of these analyses as a single job array. Hopper imposes a limit of 500 jobs per user, so we will use the job array to run only 400 of these analyses at a time, and cycle through them until they are all done.
Note
If you are not on the Hopper cluster, the submit_sim_analyses.sh
script we use below will not work on your system.
You will either need to update that script to work with your system,
or simply submit all these analyses “manually.”
This can be done easily with a for loop. For example:
for script_path in ../ecoevolity-simulations/*/batch-308303035/*qsub.sh; do echo "$script_path"; done
Just change “echo” to whatever command is necessary to submit jobs on your system (and remember your batch ID number is different).
To do this, make sure you are in the scripts
directory of the project and
enter:
bash submit_sim_analyses.sh ../ecoevolity-simulations/*/batch-308303035
This will produce a lot of output similar to (but with many more lines in place of the ellipses):
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035'
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-2-qsub.sh
.
.
.
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-simultaneous-time-1_0-0_05/batch-308303035'
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-simultaneous-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-simultaneous-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-2-qsub.sh
.
.
.
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05/batch-308303035'
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-2-qsub.sh
.
.
.
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05/batch-308303035'
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-2-qsub.sh
.
.
.
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-unif-sw-0_55-7_32-time-1_0-0_05/batch-308303035'
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/pairs-10-unif-sw-0_55-7_32-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/pairs-10-unif-sw-0_55-7_32-time-1_0-0_05/batch-308303035/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-00-config-run-2-qsub.sh
.
.
.
Submitting analyses to queue...
../bin/psub -t 00:30:00 -a 1-1200%400 ../bin/spawn_job_array /scratch/jro0014/ecoevolity-model-prior/scripts/spawn_job_array.2JjV3idzUInN
qsub -q general -j oe -l nodes=1:ppn=1,walltime=00:30:00 -t 1-1200%400 ../bin/spawn_job_array -F "/scratch/jro0014/ecoevolity-model-prior/scripts/spawn_job_array.2JjV3idzUInN"
1942324[].hopper-mgt
Why all the output complaining about “No stdout”?
Well, this script first looks for the results of all the analyses, and only
runs the analyses for the scripts that lack complete results (all of them in
our case, since we are running them for the first time).
This allows us to re-run this script after all the analyses are finished, and
it will re-run any analyses that failed
(Hopper has a depressingly high rate of job failures).
On Hopper, you can monitor the job array using:
qstat
which shows the status of the entire job array on one line:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[].hopper-mgt spawn_job_array jro0014 0 R general
To see the individual jobs within the array, use:
qstat -t
which will show the full list of jobs in the array that are running or waiting to run:
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[1].hopper-mgt ...n_job_array-1 jro0014 00:00:53 R general
1942324[2].hopper-mgt ...n_job_array-2 jro0014 00:00:53 R general
1942324[3].hopper-mgt ...n_job_array-3 jro0014 00:00:52 R general
1942324[4].hopper-mgt ...n_job_array-4 jro0014 00:00:52 R general
1942324[5].hopper-mgt ...n_job_array-5 jro0014 00:00:29 R general
1942324[6].hopper-mgt ...n_job_array-6 jro0014 00:00:29 R general
1942324[7].hopper-mgt ...n_job_array-7 jro0014 00:00:28 R general
1942324[8].hopper-mgt ...n_job_array-8 jro0014 00:00:28 R general
1942324[9].hopper-mgt ...n_job_array-9 jro0014 00:00:27 R general
1942324[10].hopper-mgt ..._job_array-10 jro0014 00:00:12 R general
1942324[11].hopper-mgt ..._job_array-11 jro0014 00:00:13 R general
...
This list will be longer than 400 jobs, but the job array will make sure at most 400 run at any given time. It will also be shorter than the total number of jobs in the array (1200), because the array will keep adding them into the wait list as it cycles through all the analyses.
If you just want to know how many jobs are actively running, you can
pipe the output of qstat -t to grep and then to wc:
qstat -t | grep -i "R gen" | wc -l
My output was:
239
So, 239 of my analyses are currently running. You can change this to get the number of jobs the array currently has waiting to run:
qstat -t | grep -i "Q gen" | wc -l
If the job array is still adding unlisted jobs to the wait list, this number is usually around 300. If it’s less, this probably means the array is “out of” jobs (they are all running or waiting to run).
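The running and queued tallies can also be computed in one pass by parsing the qstat -t output. Below is a minimal Python sketch, assuming the six-column row layout shown above, with the job state in the second-to-last column. The function name `count_job_states` is hypothetical, and the queued row in the sample text is made up for illustration.

```python
from collections import Counter


def count_job_states(qstat_output):
    """Tally the state column (Q, R, C, ...) of qstat-style output."""
    counts = Counter()
    for line in qstat_output.splitlines():
        fields = line.split()
        # Job rows have at least six columns and begin with a numeric job ID;
        # this skips the header, the dashed separator, and blank lines.
        if len(fields) >= 6 and fields[0][0].isdigit():
            counts[fields[-2]] += 1
    return counts


# Sample text in the layout shown above (the last row is illustrative only).
sample_output = """\
Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[1].hopper-mgt     ...n_job_array-1 jro0014         00:00:53 R general
1942324[2].hopper-mgt     ...n_job_array-2 jro0014         00:00:53 R general
1942324[3].hopper-mgt     ...n_job_array-3 jro0014                0 Q general
"""

state_counts = count_job_states(sample_output)
print(dict(state_counts))
```

On the cluster, the text to parse could be obtained with `subprocess.run(["qstat", "-t"], capture_output=True, text=True).stdout`.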
The job array will create a lot of output files in your scripts
directory.
If all is working well, you can get rid of these using the following command
from within the scripts
directory of the project:
rm spawn_job_array.*
If all is not going well, these output files might have content to help you figure out what the problem is.
Once the qstat -t
command is showing that all of your analyses have finished,
run the same command from within your scripts
directory again:
bash submit_sim_analyses.sh ../ecoevolity-simulations/*/batch-308303035
Note
Only re-run this command after all the analyses started
by this command the first time are no longer running.
In other words, qstat -t should produce no output (assuming you are
not running analyses for other projects) before you re-run this command.
If most of your analyses finished successfully, the script will seem like
it’s running slowly.
Just be patient; it is checking the output of all the analyses, and only writes
a message to the screen if it finds an analysis that didn’t finish
successfully.
So, if it seems like nothing is happening, that’s a good thing (i.e., the
script is finding lots of successfully completed analyses).
Here is my output from the submit_sim_analyses.sh
script:
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035'
Incomplete stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/var-only-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-08-config-run-4-qsub.sh
Incomplete stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/var-only-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-09-config-run-1-qsub.sh
Incomplete stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/var-only-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-09-config-run-2-qsub.sh
Incomplete stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/var-only-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-09-config-run-3-qsub.sh
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-simultaneous-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-unif-sw-0_55-7_32-time-1_0-0_05/batch-308303035'
Submitting analyses to queue...
../bin/psub -t 00:30:00 -a 1-4 ../bin/spawn_job_array /scratch/jro0014/ecoevolity-model-prior/scripts/spawn_job_array.QmIIkUO2GxMY
qsub -q general -j oe -l nodes=1:ppn=1,walltime=00:30:00 -t 1-4 ../bin/spawn_job_array -F "/scratch/jro0014/ecoevolity-model-prior/scripts/spawn_job_array.QmIIkUO2GxMY"
1942363[].hopper-mgt
This output is telling me that four of the analyses (of the 1200 I submitted
the first time) did not finish (their standard output was incomplete).
The output also confirms that these failed analyses are being re-run via a new
job array.
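If you save the script's screen output, the flagged analyses can be grouped by the reason reported for them. The following Python sketch assumes that "No stdout:" and "Incomplete stdout:" are the only message prefixes of interest, since those are the only two shown on this page. The function name `collect_flagged_scripts` is hypothetical.

```python
import os
from collections import defaultdict

# The two message prefixes that appear in the output shown on this page.
PROBLEM_PREFIXES = ("No stdout:", "Incomplete stdout:")


def collect_flagged_scripts(script_output):
    """Group flagged qsub script paths by the reason reported for them."""
    flagged = defaultdict(list)
    for line in script_output.splitlines():
        line = line.strip()
        for prefix in PROBLEM_PREFIXES:
            if line.startswith(prefix):
                path = line[len(prefix):].strip()
                flagged[prefix.rstrip(":")].append(path)
                break
    return dict(flagged)


# A few lines copied from the second run's output above.
sample_output = """\
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035'
Incomplete stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/var-only-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-08-config-run-4-qsub.sh
Incomplete stdout: /scratch/jro0014/ecoevolity-model-prior/ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035/var-only-pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05-sim-09-config-run-1-qsub.sh
Submitting analyses to queue...
"""

flagged = collect_flagged_scripts(sample_output)
for reason, paths in flagged.items():
    print(reason, len(paths), [os.path.basename(p) for p in paths])
```

The number of flagged paths should match the size of the job array that the script submits (four in the run shown above).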
Again, you can monitor the progress of your re-analyses using qstat -t,
and once they finish, go ahead and run the following command for the
third time (from within the scripts directory):
bash submit_sim_analyses.sh ../ecoevolity-simulations/*/batch-308303035
Hopefully the third time, your output will look like:
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-independent-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/fixed-pairs-10-simultaneous-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-dpp-conc-2_0-2_71-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-pyp-conc-2_0-1_79-disc-1_0-4_0-time-1_0-0_05/batch-308303035'
Beginning to vet and consolidate sim analysis files in:
'../ecoevolity-simulations/pairs-10-unif-sw-0_55-7_32-time-1_0-0_05/batch-308303035'
All analyses are complete and clean!
This confirms that all of your analyses have successfully finished!
Note, the job failure rate of the Hopper cluster
fluctuates.
So, you might have had some failures that got resubmitted during your third use
of submit_sim_analyses.sh
above.
If so, just monitor those re-runs with qstat -t, and run the
submit_sim_analyses.sh script again after they finish (as we did three
times above).
Eventually, you should get the All analyses are complete and clean!
message.
Note
Analyses that need to be re-run are re-run exactly (i.e., with the exact same data and starting seed for the random number generator).
These jobs are not failing due to any issues with ecoevolity. Our cluster almost always has some small failure rate when running lots of jobs, no matter how simple the jobs are. So, we can simply run them again, exactly as before, and they will work fine.
I say this, because if we were re-running analyses with different simulated datasets or different starting seeds, we could be creating subtle biases in our analyses. That is not the case here. We are only re-running analyses because our cluster’s queue/scheduler system is … less than ideal.
Go ahead and clean out all the output files from the job array from inside the
scripts
directory:
rm spawn_job_array.*
Summarizing the results¶
After the submit_sim_analyses.sh script confirms that “All analyses are complete
and clean!”, it is time for us to summarize the results from all 1200 analyses we ran.
Our results are currently scattered across 1200 log files output by
ecoevolity
during these analyses.
These log files contain MCMC samples collected from the posterior distribution
of the respective model given the simulated dataset.
We will use the Python script scripts/parse_sim_results.py
to parse all
these log files (posterior samples) and summarize them in tab-delimited tables.
We will run the parse_sim_results.py
Python script using the
parse_sim_results.sh
Bash script, so that we can submit it as a job to the
queue.
Assuming you are on the Hopper cluster and in the scripts
directory of your
copy of the project, run:
../bin/psub parse_sim_results.sh ../ecoevolity-simulations/*/*308303035
Note
If you are not on the Hopper cluster, you can simply run the Python script directly:
python parse_sim_results.py ../ecoevolity-simulations/*/*308303035
Just make sure you have the ecoevolity-model-prior-project
conda
environment activated.
Use the qstat
command to monitor the progress of the job.
Once the output of qstat
confirms the script has finished running,
we can take a look at all the tab-delimited text files it created
that summarize all the results:
ls ../ecoevolity-simulations/*/batch-308303035/*results.tsv
You will notice that each batch
directory of simulations has 6 files that
end with “results.tsv”.
Each one contains the summarized results for one of the three models (DP, PYP,
or SW) while using or ignoring the constant characters.
Each line of these files summarizes the results for one of the simulation
replicates.
So each of these files should have 11 lines (10 lines of results, plus a line
with the column headers).
We can easily confirm this using wc:
wc -l ../ecoevolity-simulations/*/batch-308303035/*results.tsv
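The same check can be scripted so that only files with an unexpected number of lines are reported. Here is a minimal Python sketch; the function names are hypothetical, and the demo runs on throwaway files rather than real results. It reads both plain and gzipped files, so it also works once the tables have been compressed.

```python
import glob
import gzip
import os
import tempfile


def count_lines(path):
    """Count the lines in a plain or gzipped text file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as stream:
        return sum(1 for _ in stream)


def find_unexpected_line_counts(pattern, expected=11):
    """Return {path: line_count} for matching files without `expected` lines."""
    unexpected = {}
    for path in sorted(glob.glob(pattern)):
        number_of_lines = count_lines(path)
        if number_of_lines != expected:
            unexpected[path] = number_of_lines
    return unexpected


# Self-contained demo on throwaway files (not real results).
with tempfile.TemporaryDirectory() as demo_dir:
    complete = os.path.join(demo_dir, "complete-results.tsv")
    with open(complete, "w") as out:
        out.write("header\n" + "row\n" * 10)  # 11 lines, as expected
    truncated = os.path.join(demo_dir, "truncated-results.tsv.gz")
    with gzip.open(truncated, "wt") as out:
        out.write("header\n" + "row\n" * 9)   # 10 lines, one short
    unexpected = find_unexpected_line_counts(
        os.path.join(demo_dir, "*results.tsv*")
    )

print({os.path.basename(p): n for p, n in unexpected.items()})
```

For the real files, the pattern would be "../ecoevolity-simulations/*/batch-308303035/*results.tsv*" when run from the scripts directory; an empty result means every table has 11 lines.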
Now, let’s gzip
these files:
gzip ../ecoevolity-simulations/*/batch-308303035/*results.tsv
Now, add them to the staging area of the project Git repository:
git add ../ecoevolity-simulations/*/batch-308303035/*results.tsv.gz
And, commit them to the repository database:
git commit
A good commit message might look something like:
Adding batch 308303035 of simulation results.
Adding gzipped, tab-delimited files. Each file summarizes the
results of ecoevolity analyses of 10 simcoevolity simulation
replicates. Adding these files for simulations under 5 models
analysed with 3 models, with and without constant characters.
Note
Git handles the versioning of text files very well, but not zipped files. So, we usually want to avoid adding zipped files to a Git repository. If we have large files we want to keep in a Git repo, it’s better to use an extension like Git LFS.
However, in this case we are adding files that we never want to version control (we shouldn’t be editing our results files!). So, it is not a problem that Git will not be able to track line-by-line changes to these files.
Finally, push your new results to the remote repository hosted on a GitHub server:
git push origin master
Cleaning up¶
After we have committed and pushed the results of our analyses, let’s clean up all those thousands of files that were generated during the simulations and analyses:
bash archive_sim_files.sh ../ecoevolity-simulations/*/*308303035
This script will copy these files into compressed archives and remove the original files.
Now, we can add these archives to the git repository:
git add ../ecoevolity-simulations/*/batch-308303035/*.tar.gz
And, commit them to the repository database:
git commit
A good commit message might look something like:
Adding archives of sim files for batch 308303035.
Adding compressed archives of all the ``simcoevolity`` and
``ecoevolity`` files for batch 308303035 of simulation replicates.
These files are handled by Git LFS, so only a reference to the
files is stored in the git database.
Note
As discussed above, Git handles the versioning of text files very
well, but not large, compressed files like the ones we just added.
So, why did we add them? Well, we have configured
Git LFS to handle any files that end with “.tar.gz” (this configuration is in
the .gitattributes file in the base directory of the project).
Git LFS works by only storing references to these files, rather than the
files themselves. So, git doesn’t track the contents of these large,
compressed files, which is good for us; we aren’t going to be making
edits to these files!
Finally, push everything to the remote repository on GitHub:
git push origin master
Reflection¶
That’s it! You’ve just contributed a batch of simulation-based analyses to this project. Take a moment to reflect on what you did and why (the Background and The ecoevolity configs sections might help for this). Can you think of other models or simulation conditions that would be good to explore for this project?