Last updated: 2020-08-07
To access the main functions included in waterway, you need the following:
In addition, some optional Qiime2 plugins are supported by waterway:
Once waterway has been downloaded from GitHub, run the following commands:
unzip waterway-master
cd waterway-master
chmod 700 waterway.bash
./waterway.bash --add-waterway
You should be prompted to add waterway as a command to your .bashrc file. Accept by typing y. After accepting, please do not move the waterway folder.
First, activate a Qiime2 environment through Conda. Make a new folder to run waterway in, then run waterway once to generate the config files. For example:
mkdir waterway_config
cd waterway_config
waterway
This will generate two files in the folder waterway_config: config.txt and optional_analyses.txt. Every time waterway runs, it reads data from a config file and an optional analyses file specific to one dataset. Whenever we run waterway from now on, we can specify the location of the config file we want waterway to read from. This way, we can keep multiple config files for different datasets in different folders without problems.
The only file we need to edit for now is config.txt, so open the file in your preferred editor. The first five lines should look like this:
#Filepaths here
projpath=/home/username/folder with raw-data, metadata, and outputs folders/
filepath=/home/username/folder with raw-data, metadata, and outputs folders/raw-data
qzaoutput=/home/username/folder with raw-data, metadata, and outputs folders/outputs/
metadata_filepath=/home/username/folder with raw-data, metadata, and outputs folders/metadata/metadata.tsv
Any line starting with a # is ignored in config.txt and optional_analyses.txt. These lines are used to delimit option blocks, which makes the files easier to read. Lines 2-5 contain the settings that we need to set. When entering filepaths or changing settings, always change the text to the right of the equals sign.
In our case, let’s imagine that our folder structure looks like the following:
                               |---- raw_fastq_data/ ---- lots of fastq files here
                               |
                               |---- metadata/ ---------- metadata.tsv
/home/user/project_folder/ ----|
                               |---- waterway_config/ --- config.txt
                               |                      |
                               |                      |-- optional_analyses.txt
                               |---- outputs/
The projpath on line 2 of config.txt refers to the overarching folder containing all dataset subfolders and files. In our case, this is /home/user/project_folder/.
The filepath on line 3 refers to the folder containing all our raw fastq files, or /home/user/project_folder/raw_fastq_data/.
The qzaoutput on line 4 refers to the folder where we want waterway to write all analysis files. It is strongly recommended that this be an empty folder. In our case, this is /home/user/project_folder/outputs/.
The metadata_filepath on line 5 refers to the Qiime2-compatible metadata file for our entire dataset, or /home/user/project_folder/metadata/metadata.tsv.
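Putting the four settings together, the top of config.txt for the example folder structure might look like the sketch below. The paths are the ones assumed in this walkthrough, so substitute your own.

```shell
#Filepaths here
projpath=/home/user/project_folder/
filepath=/home/user/project_folder/raw_fastq_data/
qzaoutput=/home/user/project_folder/outputs/
metadata_filepath=/home/user/project_folder/metadata/metadata.tsv
```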
Save the config file, then continue to the next section if you need to make a manifest file for your data. Otherwise, skip the next section and go directly to Importing the data into a Qiime2-readable format.
If your fastq files are not in a Qiime2-compatible filename format, then you’ll need to make a manifest file. A manifest file is a three-column file containing the sample ID, the filepath to the fastq file containing the forward reads, and the filepath to the fastq file containing the reverse reads. waterway can automatically generate this manifest file if the filepath variable in the config file has been set.
When generating a manifest file, the fastq files containing forward reads must be distinguishable from those containing reverse reads for all samples. Ideally, all files containing forward reads should be distinguished by a trailing R1 in the filename, with reverse read files distinguished by an R2. This follows the conventional Casava 1.8 format output by Illumina machines. If your files follow a different pattern, this can be changed in config.txt, under the variables called FPattern and RPattern.
Once the forward and reverse reads have been distinguished, making the manifest file is a simple operation of:
waterway . -M
This will create a file called manifest.tsv in the folder containing the config.txt file. The variable called manifest in config.txt should now be set to the path to manifest.tsv.
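To make the three-column layout concrete, here is a minimal sketch of how forward and reverse files could be paired by their R1/R2 markers. It is not waterway's own implementation: make_manifest is a hypothetical helper, the column headers follow Qiime2's paired-end manifest convention and may differ from what waterway writes, and the sample ID is assumed to be everything before the first underscore in a Casava-style filename.

```shell
# Hypothetical helper (not part of waterway): print a tab-separated,
# three-column manifest for the paired-end fastq files found in a folder.
# Pass the folder as an absolute path so the printed filepaths are absolute.
make_manifest() {
    dir=$1        # folder holding the raw fastq files
    fpattern=$2   # text marking forward-read files, e.g. _R1
    rpattern=$3   # text marking reverse-read files, e.g. _R2

    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    for fwd in "$dir"/*"$fpattern"*; do
        [ -e "$fwd" ] || continue      # the glob matched nothing
        base=${fwd##*/}                # filename without the folder
        sample=${base%%_*}             # assumed: ID ends at first underscore
        # Swap the forward marker for the reverse marker in the filename only
        rev_base=$(printf '%s\n' "$base" | sed "s/$fpattern/$rpattern/")
        rev=$dir/$rev_base
        if [ -e "$rev" ]; then
            printf '%s\t%s\t%s\n' "$sample" "$fwd" "$rev"
        else
            printf 'warning: no reverse file for %s\n' "$fwd" >&2
        fi
    done
}
```

Calling make_manifest /home/user/project_folder/raw_fastq_data _R1 _R2 prints the rows to the terminal, which can be a quick way to check that your filename patterns pair up as expected before running waterway.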
To import the data WITHOUT using a manifest file, first navigate to the folder containing config.txt and run the command: waterway .
To import the data WITH a manifest file, first navigate to the folder containing config.txt and run the command: waterway . -m
Once the data has been imported, the console should print “Finished import block” in green, and the files imported_seqs.qza and imported_seqs.qzv should be created in the folder specified in the variable qzaoutput.
To find the optimal trimming/truncation lengths, first open imported_seqs.qzv in Qiime2View. As a general rule of thumb, you want the forward and reverse trimming lengths to be identical (usually around 6), while truncation lengths should differ between forward and reverse reads. Forward reads generally have a longer truncation point, due to their overall slightly better quality scores. The threshold for a “good” quality score is generally around 20, although this can vary.
Once the forward and reverse trimming/truncation lengths have been found, open config.txt and enter the respective numbers into trimF, trimR, truncF, and truncR. Note that you can choose multiple truncation lengths to test at this point, and waterway will try all possible combinations of truncF-truncR lengths. For example, if we wanted to test forward truncation values of 221 and 232, as well as reverse truncation values of 198 and 201, the config file would read:
truncF=(221 232)
truncR=(198 201)
waterway would then try out all four possible truncF-truncR combinations, and analysis will be run on all of them. At this point, our outputs folder would contain four new folders: 221-198, 221-201, 232-198, and 232-201. We can then survey the results in each folder, and delete the three that we don’t need anymore. Note that if DADA2 is unable to pass any reads using a truncF-truncR combination, a text file called NoOutputs.txt will be created in the folder for the combination that didn’t work, and waterway will ignore that combination for all subsequent steps.
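The pairing of truncation lengths can be pictured as a nested loop. The sketch below only enumerates the folder names described above; list_trunc_combinations is a hypothetical name used for illustration and is not a waterway command.

```shell
# Illustration only: enumerate every truncF-truncR pairing and print the
# output folder name that corresponds to each one.
list_trunc_combinations() {
    truncF_values=$1   # space-separated forward truncation lengths
    truncR_values=$2   # space-separated reverse truncation lengths
    # The variables are left unquoted on purpose so each length is one word
    for f in $truncF_values; do
        for r in $truncR_values; do
            printf '%s-%s\n' "$f" "$r"
        done
    done
}

list_trunc_combinations "221 232" "198 201"
```

With two forward and two reverse values this prints four names, and the count grows multiplicatively, so testing many lengths at once means many DADA2 runs.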
Once our trimming and truncation lengths have been determined and added to the config file, DADA2 can be run by simply using the waterway command. waterway will automatically stop after DADA2 has been completed on all truncF-truncR combinations.
Once the execution of waterway has stopped and DADA2 has completed, we can view the file called denoising-stats.qzv in the outputs folder in Qiime2View. The objective now is to find an optimal sampling depth. The sampling depth is the number of reads that all samples are rarefied down to. Rarefying is the process of randomly sampling a certain number of reads from each sample. To visualize this, let us assume we have the following samples, which have variable numbers of denoised reads:
Sample Name     Denoised Reads
-------------------------------
Sample-1        20,000
Sample-2        25,000
Sample-3        30,000
Sample-4        50,000
In the above case, if we rarefy to 20,000 reads, Sample-1 would remain as-is. However, Sample-2/3/4 would each have 20,000 reads randomly chosen, and would end up containing 20,000 reads each. If we instead rarefy to 30,000 reads, Sample-1/2 would be excluded as they both have fewer than 30,000 reads in total, Sample-3 would remain as-is, and Sample-4 would be rarefied down to 30,000 reads.
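The effect of a candidate sampling depth on each sample can be checked with a few lines of shell. This is a sketch for reasoning about the choice, not part of waterway: rarefy_outcome is a hypothetical helper that only classifies samples and does not perform the random subsampling itself.

```shell
# Illustration only: report what rarefying to a given depth does to each
# sample. Input lines on stdin are "<sample name> <denoised read count>".
rarefy_outcome() {
    depth=$1
    while read -r name reads; do
        [ -n "$name" ] || continue     # skip blank lines
        if [ "$reads" -lt "$depth" ]; then
            printf '%s\texcluded (%s reads < %s)\n' "$name" "$reads" "$depth"
        elif [ "$reads" -eq "$depth" ]; then
            printf '%s\tkept as-is\n' "$name"
        else
            printf '%s\tsubsampled from %s to %s reads\n' "$name" "$reads" "$depth"
        fi
    done
}

# The four example samples, checked against a depth of 30,000 reads
rarefy_outcome 30000 <<'EOF'
Sample-1 20000
Sample-2 25000
Sample-3 30000
Sample-4 50000
EOF
```

Running the same input with several depths shows the trade-off directly: a higher depth keeps more reads per sample but excludes more samples.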
For 16S analysis, it is generally recommended to have the sampling depth exceed 20,000 reads at minimum.
Once the sampling depth has been determined, it should be entered into config.txt under the variable sampling depth.
A group for beta diversity comparisons can be designated in config.txt under the variable beta_diversity_group. Note that further beta diversity comparisons can be done after waterway completes the initial analyses, by looking in optional_analyses.txt.
If missing metadata is labelled with a certain string (such as “NotAvailable”) in metadata.tsv, designate it under the variable missing_samples. Otherwise, set missing_samples to the string “SDLFKJAWEOIJREWR”, or another nonsensical string that doesn’t appear in your metadata.
Before waterway can be run again, a classifier file must be designated. A classifier file lets Qiime2 label the DNA reads with taxonomic labels, i.e. it allows us to know how many species of microbes we detect in each sample, and how many microbes per species we’ve found.
Classifiers can be found on the Qiime2 classifier file list here. Note that classifier files are specific to your Qiime2 version. Designate the location of the classifier file in config.txt under the variable classifierpath.
Once all variables are designated, waterway can be run again, and should complete all main analyses.