waterway

Last updated: 2020-08-07

Requirements for running waterway

To access the main functions included in waterway, you need the following:

- A command line Qiime2 installation via a Conda environment (refer to the Qiime2 installation tutorial here)
- An installation of the R language (at least version 4.0.1), with the following libraries:
  - tidyverse
  - phyloseq
  - qiime2R
  - data.table

In addition, some optional Qiime2 plugins are supported by waterway:

Enable waterway as a command

Once waterway has been downloaded from GitHub, run the following commands:

unzip waterway-master
cd waterway-master
chmod 700 waterway.bash
./waterway.bash --add-waterway

You should be prompted to add waterway as a command to your .bashrc file. Accept by typing y. After accepting, please do not move the waterway folder.
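As a quick sanity check, the snippet below reports whether the shell can resolve the waterway command. It is only a sketch: `check_waterway` is a hypothetical helper name, the messages are illustrative, and it assumes an interactive terminal opened after the prompt was accepted (so that .bashrc has been re-read).

```shell
# Hypothetical helper: report whether the shell can resolve "waterway".
check_waterway() {
    if command -v waterway >/dev/null 2>&1; then
        echo "waterway is available"
    else
        echo "waterway not found: open a new terminal and try again"
    fi
}

check_waterway
```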

Importing the Fastq files

First, activate a Qiime2 environment through Conda. Make a new folder to run waterway in, then run waterway once to generate the config files. For example:

mkdir waterway_config
cd waterway_config
waterway

This will generate two files in the folder waterway_config: config.txt and optional_analyses.txt. Every time waterway runs, it reads data from a config file and an optional analyses file specific to one dataset. Whenever we run waterway from now on, we can specify the location of the config file we want waterway to read from. This way, we can keep multiple config files for different datasets in different folders without any problems.

The only file we need to edit for now is config.txt, so open the file in your preferred editor. The first five lines should look like this:

#Filepaths here
projpath=/home/username/folder with raw-data, metadata, and outputs folders/
filepath=/home/username/folder with raw-data, metadata, and outputs folders/raw-data
qzaoutput=/home/username/folder with raw-data, metadata, and outputs folders/outputs/
metadata_filepath=/home/username/folder with raw-data, metadata, and outputs folders/metadata/metadata.tsv

Any line starting with a # in config.txt or optional_analyses.txt is ignored; these lines are used to delimit option blocks, which makes the files easier to read. Lines 2-5 contain the settings that we need to fill in. When entering filepaths or changing settings, always edit the text to the right of the equals sign.
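To illustrate the format, here is a minimal sketch of reading one setting from such a file while skipping comment lines. `config_value` is a hypothetical helper written for this example and is not how waterway itself parses the file.

```shell
# Hypothetical helper: print the value stored for one key in a config file.
# Lines that start with # are dropped before the key is looked up.
config_value() {
    key=$1
    file=$2
    grep -v '^#' "$file" | sed -n "s/^$key=//p" | head -n 1
}
```

For example, `config_value projpath config.txt` would print whatever text sits to the right of the equals sign on the projpath line.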

In our case, let’s imagine that our folder structure looks like the following:

                                |---- raw_fastq_data/ ---- lots of fastq files here
                                |
                                |---- metadata/ ---------- metadata.tsv
/home/user/project_folder/  ----|
                                |---- waterway_config/ --- config.txt
                                |                      |
                                |                      |-- optional_analyses.txt
                                |---- outputs/

The projpath on line 2 of config.txt refers to the overarching folder containing all dataset subfolders and files. In our case, this is /home/user/project_folder/.

The filepath on line 3 refers to the folder containing all our raw fastq files, or /home/user/project_folder/raw_fastq_data/.

The qzaoutput on line 4 refers to the folder where we want waterway to write all analysis files. It is strongly recommended that this is an empty folder. In our case, this is /home/user/project_folder/outputs/.

The metadata_filepath on line 5 refers to the Qiime2-compatible metadata file for our entire dataset, or /home/user/project_folder/metadata/metadata.tsv.
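Putting the four values together, the first five lines of config.txt for the example folder structure would read roughly as follows. The paths are the ones from the example above; substitute your own.

```shell
#Filepaths here
projpath=/home/user/project_folder/
filepath=/home/user/project_folder/raw_fastq_data/
qzaoutput=/home/user/project_folder/outputs/
metadata_filepath=/home/user/project_folder/metadata/metadata.tsv
```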

Save the config file, then continue to the next section if you need to make a manifest file for your data. Otherwise, skip the next section and go directly to Importing the data into a Qiime2-readable format.

Making a manifest file for your Fastq files

If your fastq files are not in a Qiime2-compatible filename format, then you’ll need to make a manifest file. A manifest file is a three-column file containing the sample ID, the filepath to the fastq file containing the forward reads, and the filepath to the fastq file containing the reverse reads. waterway can automatically generate this manifest file if the filepath variable in the config file has been set.

When generating a manifest file, the Fastq files containing forward reads must be distinguishable from those containing reverse reads for every sample. Ideally, all files containing forward reads should be marked by a trailing R1 in the filename, and all files containing reverse reads by a trailing R2. This follows the conventional Casava 1.8 formatted standard output from Illumina machines. If your files follow a different pattern, this can be changed in config.txt under the variables FPattern and RPattern.

Once the forward and reverse reads have been distinguished, making the manifest file is a simple operation of:

waterway . -M

This will create a file called manifest.tsv in the folder containing the config.txt file. The variable called manifest in config.txt should now be set to the path to manifest.tsv.
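For readers curious what such a file contains, the sketch below builds a three-column, tab-separated manifest by pairing forward and reverse files. It is not waterway's own code: `make_manifest` is a hypothetical helper, the header names follow Qiime2's paired-end manifest format, and the sample ID is assumed to be the part of the filename before the first underscore.

```shell
# Hypothetical helper, not waterway's implementation: pair forward and
# reverse fastq files and print a tab-separated manifest to stdout.
make_manifest() {
    fastq_dir=$1   # absolute path to the folder holding the raw fastq files
    fpattern=$2    # marker for forward reads, e.g. R1
    rpattern=$3    # marker for reverse reads, e.g. R2

    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n'
    for fwd in "$fastq_dir"/*"$fpattern"*; do
        [ -e "$fwd" ] || continue          # no forward files matched
        name=${fwd##*/}
        # Swap the first occurrence of the forward marker for the reverse one.
        prefix=${name%%"$fpattern"*}
        suffix=${name#*"$fpattern"}
        rev="$fastq_dir/$prefix$rpattern$suffix"
        [ -e "$rev" ] || continue          # skip samples without a reverse file
        # Assumption: the sample ID is everything before the first underscore.
        printf '%s\t%s\t%s\n' "${name%%_*}" "$fwd" "$rev"
    done
}
```

Calling `make_manifest /home/user/project_folder/raw_fastq_data R1 R2` would print one row per sample that has both a forward and a reverse file.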

Importing the data into a Qiime2-readable format

To import the data WITHOUT using a manifest file, first navigate to the folder containing config.txt and run the command: waterway .

To import the data WITH a manifest file, first navigate to the folder containing config.txt and run the command: waterway . -m

Once the data has been imported, you should see the console say “Finished import block” in green, and the files imported_seqs.qza and imported_seqs.qzv should be created in the folder specified in the variable qzaoutput.

Finding the trimming and truncation lengths for DADA2

To find the optimal trimming/truncation lengths, first open imported_seqs.qzv in Qiime2View. As a general rule of thumb, you want the forward and reverse trimming lengths to be identical (usually around 6), while truncation lengths should differ between forward and reverse reads. Forward reads generally have a longer truncation point, due to their slightly better overall quality scores. The threshold for a “good” quality score is generally around 20, although this can vary.

Once the forward and reverse trimming/truncation lengths have been found, open config.txt and enter the respective numbers into trimF, trimR, truncF, and truncR. Note that you can choose multiple truncation lengths to test at this point, and waterway will try all possible combinations of truncF-truncR lengths. For example, if we wanted to test forward truncation values of 221 and 232, as well as reverse truncation values of 198 and 201, the config file would read:

truncF=(221 232)
truncR=(198 201)

waterway then tries all four possible truncF-truncR combinations and runs the analysis on each of them. Afterwards, our outputs folder contains four new folders: 221-198, 221-201, 232-198, and 232-201. We can then survey the results in each folder and delete the three that we don’t need. Note that if DADA2 is unable to pass any reads using a truncF-truncR combination, a text file called NoOutputs.txt is created in the folder for that combination, and waterway ignores that combination in all subsequent steps.
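The way the four folder names arise can be sketched with a nested loop over the two arrays. This is an illustration only: `list_combinations` is a hypothetical function and does not run DADA2 or create any folders.

```shell
# Values copied from the config.txt example above (bash array syntax).
truncF=(221 232)
truncR=(198 201)

# Print one folder name per forward/reverse pair, forward value first.
list_combinations() {
    for f in "${truncF[@]}"; do
        for r in "${truncR[@]}"; do
            printf '%s-%s\n' "$f" "$r"
        done
    done
}

list_combinations
```

Adding a third value to either array would raise the number of combinations to six, so each extra value multiplies the amount of DADA2 work.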

Running DADA2

Once our trimming and truncation lengths have been determined and added to the config file, DADA2 can be run by simply using the waterway command. waterway will automatically stop after DADA2 has been completed on all truncF-truncR combinations.

Finding an optimal sampling depth

Once the execution of waterway has stopped and DADA2 has completed, we can view the file called denoising-stats.qzv in the outputs folder in Qiime2View. The objective now is to find an optimal sampling depth. The sampling depth is the threshold that all samples are rarefied down to. Rarefying is the process of randomly sampling a fixed number of reads from each sample. To visualize this, let us assume we have the following samples, which have variable numbers of denoised reads:

Sample Name   Denoised Reads
-------------------------------
 Sample-1         20,000
 Sample-2         25,000
 Sample-3         30,000
 Sample-4         50,000

In the above case, if we rarefy to 20,000 reads, Sample-1 would be kept as-is, while Sample-2, Sample-3, and Sample-4 would each have 20,000 reads randomly chosen and end up containing 20,000 reads each. If we instead rarefy to 30,000 reads, Sample-1 and Sample-2 would be excluded because they both have fewer than 30,000 reads, Sample-3 would be kept as-is, and Sample-4 would be rarefied down to 30,000 reads.
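The bookkeeping for a candidate depth can be sketched as below, using the sample names and read counts from the table. `rarefy_report` is a hypothetical helper; it only reports which samples would be kept, excluded, or subsampled and does not perform the random sampling itself.

```shell
# Hypothetical helper: report the effect of a sampling depth on each sample.
rarefy_report() {
    depth=$1
    # Sample names and denoised read counts from the table above.
    while read -r sample reads; do
        if [ "$reads" -lt "$depth" ]; then
            printf '%s excluded (%s reads < %s)\n' "$sample" "$reads" "$depth"
        elif [ "$reads" -eq "$depth" ]; then
            printf '%s kept as-is\n' "$sample"
        else
            printf '%s subsampled from %s to %s reads\n' "$sample" "$reads" "$depth"
        fi
    done <<'EOF'
Sample-1 20000
Sample-2 25000
Sample-3 30000
Sample-4 50000
EOF
}

rarefy_report 30000
```

Running the report for a few candidate depths makes the trade-off visible: a higher depth keeps more reads per sample but drops more samples.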

For 16S analysis, it is generally recommended to have the sampling depth exceed 20,000 reads at minimum.

Once the sampling depth has been determined, it should be entered into config.txt under the variable sampling depth.

Identify a group to compare beta diversities between

A group to compare beta diversity values between can be designated in config.txt under the variable beta_diversity_group. Note that further beta diversity comparisons can be done after waterway completes the initial analyses, by looking in optional_analyses.txt.

If missing metadata is labelled with a certain string (such as “NotAvailable”) in metadata.tsv, designate it under the variable missing_samples. Otherwise, set missing_samples to “SDLFKJAWEOIJREWR” or another nonsensical string that doesn’t appear in your metadata.
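As a sketch, the two settings might look like this in config.txt. The value Treatment is a hypothetical placeholder, assuming that beta_diversity_group takes the name of a column in the metadata file; NotAvailable is the example string from above.

```shell
# Hypothetical value: "Treatment" stands in for a column in metadata.tsv.
beta_diversity_group=Treatment
# String that marks missing metadata entries in this example.
missing_samples=NotAvailable
```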

Designating a classifier database

Before waterway can be run once again, a classifier file must be designated. A classifier file lets Qiime2 assign taxonomic labels to the DNA reads, i.e. it allows us to know how many species of microbes we detect in each sample, and how many microbes per species we’ve found.

Classifiers can be found on the Qiime2 classifier file list here. Note that classifier files are specific to your Qiime2 version. Designate the location of the classifier file in config.txt under the variable classifierpath.

Once all variables are designated, waterway can be run again, and should complete all main analyses.