Simulation-based analyses

As described in the Background and The ecoevolity configs sections, we will be simulating datasets …

We will use simcoevolity to simulate these datasets in a format ready to analyze with ecoevolity. We will generate scripts in the scripts/simcoevolity-scripts directory that will run these simulations in batches of manageable numbers of replicates.

Set up our environment

Before anything else, navigate to the project directory (if you are not already there):

cd /path/to/your/copy/of/codiv-sanger-bake-off

For example, if the project is in your home directory, this will be:

cd ~/codiv-sanger-bake-off

If you haven’t already, activate the Python virtual environment for this project:

source pyenv/bin/activate

Create simulation scripts

Now, let’s cd into the project’s scripts directory:

cd scripts

Use the create_new_batch_of_simcoevolity_scripts.py Python script to create simcoevolity scripts for generating a new batch of 20 simulated datasets for each of the 12 ecoevolity config files (i.e., each combination of model and dataset size):

python create_new_batch_of_simcoevolity_scripts.py ../configs/*.yml

The output should confirm the creation of 12 new scripts for running simcoevolity (one for each config in configs), and report a batch ID:

Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-independent-pairs-05-sites-00500-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-independent-pairs-05-sites-01000-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-independent-pairs-05-sites-02500-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-independent-pairs-05-sites-10000-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-simultaneous-pairs-05-sites-00500-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-simultaneous-pairs-05-sites-01000-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-simultaneous-pairs-05-sites-02500-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/fixed-simultaneous-pairs-05-sites-10000-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/pairs-05-sites-00500-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/pairs-05-sites-01000-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/pairs-05-sites-02500-locus-500-batch-266950032.sh'
Script written to '/home/jamie/Dropbox/projects/codiv-sanger-bake-off/scripts/simcoevolity-scripts/pairs-05-sites-10000-locus-500-batch-266950032.sh'

Simcoevolity scripts successfully written.

Batch ID:
    266950032

However, your batch ID number will be different.

IMPORTANT: Make a note of your batch ID number; you will need it moving forward. For all of the commands below, use your batch ID number in place of 266950032.
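
If you find it tedious to keep retyping the batch ID, one optional convenience (not part of the project’s workflow) is to store it in a shell variable and let Bash substitute it for you; the variable name BATCH_ID below is just an illustration:

# store YOUR batch ID (not this example value); the variable only lives in your current shell session
BATCH_ID=266950032

# any of the commands below can then reference it, for example:
git add simcoevolity-scripts/*"${BATCH_ID}".sh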

Commit simulation scripts

Before we run the simcoevolity scripts, let’s add them to the staging area of the project’s Git repository:

git add simcoevolity-scripts/*266950032.sh

Then, commit them to the repository:

git commit

A good commit message might look something like:

Adding batch 266950032 of simcoevolity scripts.

Adding shell scripts generated by:

    create_new_batch_of_simcoevolity_scripts.py

These scripts will run simcoevolity to generate a batch of 20
simulated datasets for each config.

Lastly, push the new scripts to the remote repository hosted on GitHub:

git push origin main

If you get a message that looks something like:

! [rejected]        main -> main (fetch first)
error: failed to push some refs to '/home/jamie/git-fun/local1/../remote'
hint: Updates were rejected because the remote contains work that you do
hint: not have locally. This is usually caused by another repository pushing
hint: to the same ref. You may want to first integrate the remote changes
hint: (e.g., 'git pull ...') before pushing again.
hint: See the 'Note about fast-forwards' in 'git push --help' for details.

OR a more cryptic message that looks something like (it’s more cryptic due to Git LFS):

ref main:: Error in git rev-list --stdin --objects --not --remotes=origin --: exit status 128 fatal: bad object 19dd47de1e8368e425ffbec1a00c8f500f76976a

This simply means that your copy of the project repository is behind the copy on GitHub (i.e., someone else has pushed since you last pulled). This is not a problem; all you need to do is pull to update your copy:

git pull origin main

This will likely create a new commit that merges the updates on GitHub with your new content. This is common, and Git will create a default commit message for you. After you save and close the commit message, the new merged commit will be finalized. Then you should be able to push:

git push origin main
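
If you would rather check whether your local copy is behind before attempting to push, you can fetch from GitHub and inspect the status first; these are standard Git commands and do not change your working files:

git fetch origin
git status

If git status reports that your branch is behind origin/main, do the git pull origin main described above before pushing.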

Run simulation scripts

Next, cd into the simcoevolity-scripts directory:

cd simcoevolity-scripts

If you are working on AU’s Hopper cluster, use a for loop to submit the 12 simcoevolity scripts to the queue:

for script_path in *266950032.sh; do ../../bin/psub "$script_path"; done

Note

If you are working on a different cluster, you will need to either update the ../../bin/psub script to work on your system, or replace ../../bin/psub with whatever command your cluster uses to submit jobs.
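
For example, on a Slurm-based cluster the submission command is typically sbatch, so a rough sketch of the same loop might look like the following; this is illustrative only, and because the generated scripts contain Torque/PBS-style settings, you may need to adapt them (or pass resource requests to sbatch yourself):

# illustrative only: submit each script with Slurm's sbatch instead of psub
for script_path in *266950032.sh; do sbatch "$script_path"; done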

If you are not on a cluster, you can simply run the scripts directly:

for script_path in *266950032.sh; do bash "$script_path"; done

After submitting the scripts with the for loop, go ahead and cd out of the simcoevolity-scripts directory, which will put you back up in the scripts directory:

cd ..

Assuming you are on the Hopper cluster, you can monitor the progress of the jobs by using:

qstat

When the jobs are waiting in the queue to start, the output will look like:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942030.hopper-mgt         ...-266950032.sh jro0014                0 Q general
1942031.hopper-mgt         ...-266950032.sh jro0014                0 Q general
1942032.hopper-mgt         ...-266950032.sh jro0014                0 Q general
1942033.hopper-mgt         ...-266950032.sh jro0014                0 Q general
1942034.hopper-mgt         ...-266950032.sh jro0014                0 Q general
.
.
.

When the jobs are running, the output will look like:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942030.hopper-mgt         ...-266950032.sh jro0014         00:00:19 R general
1942031.hopper-mgt         ...-266950032.sh jro0014         00:00:19 R general
1942032.hopper-mgt         ...-266950032.sh jro0014         00:00:19 R general
1942033.hopper-mgt         ...-266950032.sh jro0014         00:00:19 R general
1942034.hopper-mgt         ...-266950032.sh jro0014         00:00:19 R general
.
.
.

When the jobs are complete, the output will briefly look like the following (a few minutes after they complete, the jobs will disappear from the output of qstat):

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942030.hopper-mgt         ...-266950032.sh jro0014         00:00:43 C general
1942031.hopper-mgt         ...-266950032.sh jro0014         00:00:42 C general
1942032.hopper-mgt         ...-266950032.sh jro0014         00:00:23 C general
1942033.hopper-mgt         ...-266950032.sh jro0014         00:00:24 C general
1942034.hopper-mgt         ...-266950032.sh jro0014         00:00:23 C general
.
.
.

Each of these simcoevolity scripts does two things:

  1. Runs the simcoevolity tool to simulate datasets.

  2. For each simulated dataset, generates 4 Bash scripts for analyzing that dataset 4 times with ecoevolity (i.e., four independent MCMC chains per dataset).

All of the files created during these 2 steps are output into a simulations directory in the project directory. If you are still in the scripts directory, you can list the contents of this directory using:

ls ../simulations

This should show 12 directories, one for each config file in configs:

fixed-independent-pairs-05-sites-00500-locus-500
fixed-independent-pairs-05-sites-01000-locus-500
fixed-independent-pairs-05-sites-02500-locus-500
fixed-independent-pairs-05-sites-10000-locus-500
fixed-simultaneous-pairs-05-sites-00500-locus-500
fixed-simultaneous-pairs-05-sites-01000-locus-500
fixed-simultaneous-pairs-05-sites-02500-locus-500
fixed-simultaneous-pairs-05-sites-10000-locus-500
pairs-05-sites-00500-locus-500
pairs-05-sites-01000-locus-500
pairs-05-sites-02500-locus-500
pairs-05-sites-10000-locus-500

Let’s look into the first one:

ls ../simulations/fixed-independent-pairs-05-sites-00500-locus-500

You should see a directory associated with your batch number (your number will be different from mine):

batch-266950032

If you look in this directory:

ls ../simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032

You will see a very long list of files, so I won’t show the output here. For each simcoevolity simulation replicate there are the following files (a quick way to tally them is shown after this list):

  • 5 data files (one for each of the pairs of populations). The names of these files end with “chars.txt”.

  • 1 file containing the true values of all the parameters that simcoevolity used to simulate the data files. These files end with “-true-values.txt”.

  • 1 ecoevolity config file. These files end with “-config.yml”.

  • 4 Bash scripts for analyzing the dataset with ecoevolity (i.e., four independent analyses, or MCMC chains, for each dataset). These files end with “-qsub.sh”.
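
If you want to convince yourself that these counts add up, a quick, optional way to tally each file type in a batch directory is to count the files matching the suffixes listed above (using the directory from this example):

BATCH_DIR=../simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032

ls "$BATCH_DIR"/*chars.txt | wc -l          # data files: 5 per replicate
ls "$BATCH_DIR"/*-true-values.txt | wc -l   # true parameter values: 1 per replicate
ls "$BATCH_DIR"/*-config.yml | wc -l        # ecoevolity configs: 1 per replicate
ls "$BATCH_DIR"/*-qsub.sh | wc -l           # analysis scripts: 4 per replicate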

Analyzing simulated data

Next, we need to run all those Bash scripts to analyze each simulated dataset with ecoevolity four times. Given that we simulated 20 datasets under 12 different settings and we will analyze each dataset 4 times, this will be 20 × 12 × 4 = 960 ecoevolity analyses.

If you are on the Hopper cluster, we will use a script that will run all of these analyses as a single job array. Hopper imposes a limit of 500 jobs per user, so we will use the job array to run only 400 of these analyses at a time, and cycle through them until they are all done.
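
The throttling is handled by the array-request syntax that gets passed to qsub (you will see it in the output below): the %N suffix on the -t option caps how many array tasks run at once. For example, a request like the following (the script name is just a placeholder) asks for 960 tasks but never runs more than 400 of them simultaneously:

# illustrative Torque/PBS array request: tasks 1-960, at most 400 running at a time
qsub -t 1-960%400 some_job_script.sh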

Note

If you are not on the Hopper cluster, the submit_sim_analyses.sh script we use below will not work on your system. You will either need to update that script to work with your system, or simply submit all these analyses “manually.” This can be done easily with a for loop. For example:

for script_path in ../simulations/*/batch-266950032/*qsub.sh; do echo "$script_path"; done

Just change “echo” to whatever command is necessary to submit jobs on your system (and remember your batch ID number is different).

To submit the job array on Hopper, make sure you are in the scripts directory of the project and enter:

bash submit_sim_analyses.sh ../simulations/*/batch-266950032

This will produce a lot of output similar to (but with many more lines in place of the ellipses):

Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032'
No stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032/simcoevolity-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032/simcoevolity-sim-00-config-run-2-qsub.sh
.
.
.
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-01000-locus-500/batch-266950032'
No stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/fixed-independent-pairs-05-sites-01000-locus-500/batch-266950032/simcoevolity-sim-00-config-run-1-qsub.sh
No stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/fixed-independent-pairs-05-sites-01000-locus-500/batch-266950032/simcoevolity-sim-00-config-run-2-qsub.sh
.
.
.
Submitting analyses to queue...
../bin/psub -t 00:30:00 -a 1-960%400 ../bin/spawn_job_array /scratch/jro0014/codiv-sanger-bake-off/scripts/spawn_job_array.MIzTDrKgMzZ0
qsub -q general -j oe -l nodes=1:ppn=1,walltime=00:30:00 -t 1-960%400 ../bin/spawn_job_array -F  "/scratch/jro0014/codiv-sanger-bake-off/scripts/spawn_job_array.MIzTDrKgMzZ0"
2059031[].hopper-mgt

Why all the output complaining about “No stdout”? Well, this script first looks for the results of all the analyses, and only runs the analyses for the scripts that lack complete results (all of them in our case, since we are running them for the first time). This allows us to re-run this script after all the analyses are finished, and it will re-run any analyses that failed (Hopper has a depressingly high rate of job failures).

On Hopper, you can monitor the job array using:

qstat

which shows the status of the entire job array on one line:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[].hopper-mgt       spawn_job_array  jro0014                0 R general

To see the individual jobs within the array, use:

qstat -t

which will show the full list of jobs in the array that are running or waiting to run:

Job ID                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
1942324[1].hopper-mgt      ...n_job_array-1 jro0014         00:00:53 R general
1942324[2].hopper-mgt      ...n_job_array-2 jro0014         00:00:53 R general
1942324[3].hopper-mgt      ...n_job_array-3 jro0014         00:00:52 R general
1942324[4].hopper-mgt      ...n_job_array-4 jro0014         00:00:52 R general
1942324[5].hopper-mgt      ...n_job_array-5 jro0014         00:00:29 R general
1942324[6].hopper-mgt      ...n_job_array-6 jro0014         00:00:29 R general
1942324[7].hopper-mgt      ...n_job_array-7 jro0014         00:00:28 R general
1942324[8].hopper-mgt      ...n_job_array-8 jro0014         00:00:28 R general
1942324[9].hopper-mgt      ...n_job_array-9 jro0014         00:00:27 R general
1942324[10].hopper-mgt     ..._job_array-10 jro0014         00:00:12 R general
1942324[11].hopper-mgt     ..._job_array-11 jro0014         00:00:13 R general
...

This list will be longer than 400 jobs, but the job array will make sure at most 400 run at any given time. It will also be shorter than the total number of jobs in the array (960), because the array keeps adding jobs to the wait list as it cycles through all of the analyses.

If you just want to know how many jobs are actively running, you can pipe the output of qstat -t to grep and then to wc:

qstat -t | grep -i "R gen" | wc -l

My output was:

239

So, 239 of my analyses are currently running. You can change this to get the number of jobs the array currently has waiting to run:

qstat -t | grep -i "Q gen" | wc -l

Note that the number output by this command might not include all the jobs left to run, because the job array may not have put all of the remaining jobs in the queue yet.
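
If you want both numbers at once, an optional one-liner that simply combines the two commands above is:

echo "running: $(qstat -t | grep -ci "R gen"), queued: $(qstat -t | grep -ci "Q gen")"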

The job array will create a lot of output files in your scripts directory. If all is working well, you can get rid of these using the following command from within the scripts directory of the project:

rm spawn_job_array.o*-*

If all is not going well, these output files might have content to help you figure out what the problem is.
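
If you want to skim those output files for problems before deleting them, one simple, optional check (just grep, not a project script) is to list any of them that mention an error:

# list job-array output files that contain the word "error" (case-insensitive)
grep -il "error" spawn_job_array.o*-* | head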

Once the qstat -t command shows that all of your analyses have finished, run the same command again from within your scripts directory:

bash submit_sim_analyses.sh ../simulations/*/batch-266950032

Note

Only re-run this command after all of the analyses started by it the first time are no longer running. In other words, qstat -t should produce no output (assuming you are not running analyses for other projects) before you re-run this command.

If most of your analyses finished successfully, the script will seem like it’s running slowly. Just be patient; it is checking the output of all the analyses, and only writes a message to the screen if it finds an analysis that didn’t finish successfully. So, if it seems like nothing is happening, that’s a good thing (i.e., the script is finding lots of successfully completed analyses). Here is my output from the submit_sim_analyses.sh script:

Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-01000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-02500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-10000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-00500-locus-500/batch-266950032'
No stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/fixed-simultaneous-pairs-05-sites-00500-locus-500/batch-266950032/simcoevolity-sim-09-config-run-1-qsub.sh
Incomplete stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/fixed-simultaneous-pairs-05-sites-00500-locus-500/batch-266950032/simcoevolity-sim-23-config-run-3-qsub.sh
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-01000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-02500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-10000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-00500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-01000-locus-500/batch-266950032'
Incomplete stdout: /scratch/jro0014/codiv-sanger-bake-off/simulations/pairs-05-sites-01000-locus-500/batch-266950032/simcoevolity-sim-58-config-run-1-qsub.sh
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-02500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-10000-locus-500/batch-266950032'
Submitting analyses to queue...
../bin/psub -t 00:30:00 -a 1-1 ../bin/spawn_job_array /scratch/jro0014/codiv-sanger-bake-off/scripts/spawn_job_array.MTEnmy8gwlY9
qsub -q general -j oe -l nodes=1:ppn=1,walltime=00:30:00 -t 1-1 ../bin/spawn_job_array -F  "/scratch/jro0014/codiv-sanger-bake-off/scripts/spawn_job_array.MTEnmy8gwlY9"
2059058[].hopper-mgt

This output is telling me that three of the analyses (of the 960 I submitted the first time) did not finish (their standard output was either missing or incomplete). The output also confirms that these failed analyses are being re-run with a new job array. Again, you can monitor the progress of your re-analyses using qstat -t, and once they finish, go ahead and run the following command for the third time (from within the scripts directory):

bash submit_sim_analyses.sh ../simulations/*/batch-266950032

Hopefully by the third time (or, if you’re lucky, the second time), your output will look like:

Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-00500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-01000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-02500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-independent-pairs-05-sites-10000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-00500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-01000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-02500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/fixed-simultaneous-pairs-05-sites-10000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-00500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-01000-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-02500-locus-500/batch-266950032'
Beginning to vet and consolidate sim analysis files in:
  '../simulations/pairs-05-sites-10000-locus-500/batch-266950032'
All analyses are complete and clean!

This confirms that all of your analyses have successfully finished! Note that the job failure rate of the Hopper cluster fluctuates, so you might have some failures that get resubmitted during your third use of submit_sim_analyses.sh above. If so, just monitor those re-runs with qstat -t, and run the submit_sim_analyses.sh script again after they finish (as we did three times above). Eventually, you should get the All analyses are complete and clean! message.

Note

Analyses that need to be re-run are repeated exactly (i.e., with the exact same data and starting seed for the random number generator).

These jobs are not failing due to any issues with ecoevolity. Our cluster almost always has some small failure rate when running lots of jobs, no matter how simple the jobs are. So, we can simply run them again, exactly as before, and they will work fine.

I say this, because if we were re-running analyses with different simulated datasets or different starting seeds, we could be creating subtle biases in our analyses. That is not the case here. We are only re-running analyses because our cluster’s queue/scheduler system is … less than ideal.

Go ahead and clean out all the output files from the job array from inside the scripts directory:

rm spawn_job_array.o*-*

Summarizing the results

After the submit_sim_analyses.sh script confirms that All analyses are complete and clean! it is time for us to summarize the results from all 960 analyses we ran. Our results are currently scattered across 960 log files output by ecoevolity during these analyses. These log files contain MCMC samples collected from the posterior distribution of the model given the simulated dataset. We will use the Python script scripts/parse_sim_results.py to parse all these log files (posterior samples) and summarize them in tab-delimited tables. We will run the parse_sim_results.py Python script using the parse_sim_results.sh Bash script, so that we can submit it as a job to the queue. Assuming you are on the Hopper cluster and in the scripts directory of your copy of the project, let’s use a for-loop to parse the results for each of the 12 config files simultaneously:

for batch_dir in ../simulations/*/batch-266950032; do ../bin/psub parse_sim_results.sh "$batch_dir"; done

Note

If you are not on the Hopper cluster, you can simply run the Python script directly:

python parse_sim_results.py ../simulations/*/*266950032

Just make sure you have the project’s Python virtual environment activated.

Use the qstat command to monitor the progress of the parsing jobs. Once the output of qstat confirms the jobs have finished running, we can take a look at all the tab-delimited text files it created that summarize all the results:

ls ../simulations/*/batch-266950032/*results.tsv

You will notice that each batch directory of simulations has its own simcoevolity-results.tsv file. Each one contains the results summarized for 20 datasets simulated according to the settings (i.e., model and dataset size) specified in the corresponding ecoevolity config file. Each line of these files summarizes the results for one of the simulation replicates. So each of these files should have 21 lines (20 lines of results, plus a line with the column headers). We can easily confirm this using wc:

wc -l ../simulations/*/batch-266950032/*results.tsv
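
If you would rather have the check flag only problem files, here is a small, optional Bash loop (assuming 20 replicates per config, as above) that warns about any results file that does not have exactly 21 lines:

# flag any results table that does not have a header line plus 20 result lines
for f in ../simulations/*/batch-266950032/*results.tsv; do
    n=$(wc -l < "$f")
    if [ "$n" -ne 21 ]; then
        echo "WARNING: $f has $n lines"
    fi
done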

Now, let’s gzip these files:

gzip ../simulations/*/batch-266950032/*results.tsv
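
If you later want to glance at one of these tables without decompressing it, zcat (or gunzip -c) works well; for example, to see the header and first row of results for one of the configs:

zcat ../simulations/pairs-05-sites-00500-locus-500/batch-266950032/*results.tsv.gz | head -n 2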

Now, add them to the staging area of the project Git repository:

git add ../simulations/*/batch-266950032/*results.tsv.gz

And, commit them to the repository database:

git commit

A good commit message might look something like:

Adding batch 266950032 of simulation results.

Adding gzipped, tab-delimited files. Each file summarizes the results of
ecoevolity analyses of 20 simcoevolity simulation replicates.  Adding
results of simulations of 4 different dataset sizes simulated under 3
models (DPP, independent divergences, and simultaneous divergence); all
datasets were analyzed with the same model (DPP).

Note

Git handles the versioning of text files very well, but not zipped files. So, we usually want to avoid adding zipped files to a Git repository. If we have large files we want to keep in a Git repo, it’s better to use an extension like Git LFS.

However, in this case we are adding files that we never want to version control (we shouldn’t be editing our results files!). So, it is not a problem that Git will not be able to track line-by-line changes to these files.

Finally, push your new results to the remote repository hosted on GitHub:

git push origin main

If you get a “rejected” or “error” message, your copy of the repository is most likely behind the remote copy on GitHub. You will need to pull before you push; see the Commit simulation scripts section above for more details.

Cleaning up

After we have committed and pushed the results of our analyses, let’s use a for-loop to clean up all of the thousands of files that were generated during the simulations and analyses:

for batch_dir in ../simulations/*/batch-266950032; do ../bin/psub archive_sim_files.sh "$batch_dir"; done

This for loop runs the archive_sim_files.sh script on each of the directories containing your batch of simulation files. The archive_sim_files.sh script copies these files into compressed archives and removes the original files.
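
Once those archiving jobs finish, it is worth confirming that each batch directory now contains a compressed archive before adding anything to Git; a simple listing is enough:

# you should see one or more .tar.xz archives per batch directory
ls ../simulations/*/batch-266950032/*.tar.xz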

Now, we can add these archives to the git repository:

git add ../simulations/*/batch-266950032/*.tar.xz

And, commit them to the repository database:

git commit

A good commit message might look something like:

Adding archives of sim files for batch 266950032.

Adding compressed archives of all the ``simcoevolity`` and
``ecoevolity`` files for batch 266950032 of simulation replicates.
These files are handled by Git LFS, so only a reference to the
files is stored in the git database.

Note

As discussed above, Git handles the versioning of text files very well, but not large, compressed files like the ones we just added. So, why did we add them? Well, we have configured Git LFS to handle any files that end with “.tar.xz” (this configuration is in the .gitattributes file in the base directory of the project).

Git LFS works by storing only references to these files in the Git database, rather than the files themselves, so Git doesn’t track the contents of these large, compressed files. That is good for us; we aren’t going to be making edits to these files, and it would slow Git down to always be checking them for changes.
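
For reference, the kind of rule that tells Git LFS to handle these archives looks like the following (this is what the command git lfs track "*.tar.xz" writes; check the project’s .gitattributes for the exact entry):

*.tar.xz filter=lfs diff=lfs merge=lfs -text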

Finally, push everything to the remote repository on GitHub:

git push origin main

If you get a “rejected” or “error” message, your copy of the repository is most likely behind the remote copy on GitHub. You will need to pull before you push; see the Commit simulation scripts section above for more details.

Reflection

That’s it! You’ve just contributed a batch of simulation-based analyses to this project. Take a moment to reflect on what you did and why (the Background and The ecoevolity configs sections might help for this). Can you think of other models or simulation conditions that would be good to explore for this project?