ISL RESOURCES
TOOLS – METHODS – PROTOCOL
The initiative primarily uses Oxford Nanopore sequencers to perform DNA/RNA sequencing on-site. Devices we currently use include the MK1B, MK1C, and P2 Solo. For information on how to set up a sequencing run, as well as initial data processing (basecalling, demultiplexing, etc.), the Oxford Nanopore Community has the most up-to-date information. You must create an account with ONT to access the community resources.
As hubs standardize laboratory workflows for sequence library preparation, return to this page for links to our Protocols.io account.
For sequence data analysis, a variety of software tools will be needed (Guppy, Amplicon Sorter, NGSpeciesID, NanoFilt, NanoPlot, etc.). We suggest setting up this software in one or more Docker images to avoid compatibility issues.
The Docker application allows one to run self-contained environments (containers) on their computer, using any operating system that is compatible with a desired software package. The containers are isolated from the rest of the computer. For example, one may have a MacBook that runs macOS but a Docker image with an Ubuntu or Windows operating system. As long as a computer can run the Docker application, one can move their workflow to any computer that has the appropriate hardware (i.e., sufficient RAM and CPUs/GPUs), or to a remote server. Here is a reasonably good Docker tutorial on YouTube.
You don’t have to become a Docker pro to start analyzing data
Try these simple steps to run a Docker image that some of the ISL hubs are currently using. Note that there will be some differences between running Docker on Windows, macOS, and Linux; the following example is for macOS (a simple web search on how to run Docker on another platform will clarify the equivalent steps below).
1) Download and install Docker (create a free account if required).
2) Once installed, open the Terminal application and run the following command. The image is about 8 GB, so this may take a while.
(cmd)$ docker pull insitulab/junglegenomics:latest
3) Once complete, you can start an interactive session (the ‘-it’ parameter is what makes the session interactive).
(cmd)$ docker run -it insitulab/junglegenomics:latest
If it’s working, test that you can run the following commands (everything from the ‘#’ onwards is just a comment and NOT part of the command)
$ ls -lah # everything in the root directory of the image
$ minibar.py # shows the help information for minibar
$ blastn -h # shows help information for NCBI blastN tool
$ NGSpeciesID # shows help information for NGSpeciesID
4) A self-contained environment is now running, so whatever is done inside this session stays inside and will be lost when the session ends. To access data and save work that occurs inside the image, one must connect this environment to an actual directory/folder on the computer (or server space) that is being used. To do this, exit this session with the following command
(cmd)$ exit
Now restart the container with the -v parameter, which makes the connection as follows
(cmd)$ docker run -it -v /Users/johnD/documents/data/:/data \
insitulab/junglegenomics:latest
The -v parameter takes a single argument with two parts separated by a ‘:’. To the left of the ‘:’ is the path to the directory of interest on the computer or server being used. To the right of the ‘:’ is the location where the data will appear inside the Docker container. Anything written to the “data” directory from inside the container is saved to the hard drive, and likewise, anything deposited from the computer into the mounted directory will be accessible from inside the container. Do the following to check this
(cmd)$ cd data/
(cmd)$ ls .
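If you want to double-check which half of the -v argument is which before launching, the mount spec can be split with standard shell parameter expansion. This is just a sanity check run on the host (the path is illustrative); it does not touch Docker at all:

```shell
# Illustrative -v argument: host path to the left of ':', container path to the right
MOUNT="/Users/johnD/documents/data/:/data"
echo "host directory:      ${MOUNT%%:*}"   # /Users/johnD/documents/data/
echo "container directory: ${MOUNT##*:}"   # /data
```

Confirming the mapping this way is cheap insurance against accidentally mounting the wrong folder, since Docker will happily create an empty directory at a mistyped host path.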
5) If all prior steps have been completed, then mission accomplished; good luck analyzing the data. Helpful information about this Docker image can be found at https://hub.docker.com/r/insitulab/junglegenomics
We are currently in the process of uploading all of our wildlife and eDNA sampling procedures to Protocols.io. Return to this page in the near future for links to these resources. Protocols that are close to publication include:
Wildlife Sampling:
- Chiroptera (Bats)
- Nonhuman Primates
- Aves (Birds)
- Rodentia & Marsupialia
- Medium- to Large-Size Terrestrial Mammals
The In Situ Laboratory Data System is a combination of tools to capture, organize, and analyze field data. See our work in progress on GitHub: https://github.com/insitulabs/ISL_DataSystems
Historically, biological sample collection at remote field locations and laboratory analyses have been carried out independently of one another, by separate institutions and/or groups of stakeholders. Most data systems that support these distinct parts of the research process have been developed in isolation, and remain 1) unintegrated or 2) unadaptable to diverse environments with power and data service constraints. As we endeavor to set up more fully functional molecular laboratories in the field (in situ), our goal is to seamlessly integrate biological sample collection with laboratory sample analysis and data sharing applications in an intuitive, user-friendly, and secure way. We strive to build on prior open-source initiatives, and remain committed to making all our data system tools freely available for scientific research, public health, and conservation applications. Our system can be fully replicated from the information provided on GitHub. Additionally, all ISL hubs receive support from ISL coordinators to set up an isolated version of the ISL Data System.
We are currently preparing a demo video of the data system. Once it is ready, the link will be posted here. We apologize for the inconvenience, but please return soon.
A tool to aid PCR primer design and evaluation. A full description is available on GitHub.
https://github.com/insitulabs/assessPrimers
Docker image link: https://hub.docker.com/r/insitulab/assessprimer
The primer assessment tool requires several inputs, including:
- List of nucleotide reference sequences
- A file of primers (forward and/or reverse)
- A reference protein sequence (optional)
- Prefix for output files
The inputs are used to create a non-redundant multiple sequence alignment of all reference sequences to each other, as well as to each primer pair. From this alignment, the following statistics are printed to stdout:
Primer number: The number of unique primers calculated after converting all degenerate bases to their non-degenerate equivalents.
Entropy: Cumulative entropy score for each length of k-mer along the alignment (lower entropy scores reflect more conserved sequences)
Start coordinate: The position in the alignment where each primer begins
Number mismatches: For each primer provided, histogram of the number of mismatches for each reference sequence
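The primer-number statistic can be reproduced by hand: each degenerate IUPAC base multiplies the count of non-degenerate primers by its degeneracy (R, Y, S, W, K, M = 2; B, D, H, V = 3; N = 4). The awk sketch below is not the tool itself, just that arithmetic, applied to the PMX1 primer from the example output below:

```shell
# Multiply together the IUPAC degeneracy of every base in a primer sequence
echo "GARGGNYNNTGYCARAARNTNTGGAC" | awk '
BEGIN { d["A"]=1; d["C"]=1; d["G"]=1; d["T"]=1
        d["R"]=2; d["Y"]=2; d["S"]=2; d["W"]=2; d["K"]=2; d["M"]=2
        d["B"]=3; d["D"]=3; d["H"]=3; d["V"]=3; d["N"]=4 }
{ n = 1
  for (i = 1; i <= length($0); i++) n *= d[substr($0, i, 1)]
  print n }'
# prints 32768, matching numPrimers for PMX1
```

This is a useful back-of-the-envelope check when designing degenerate primers, since the expansion count grows multiplicatively and very degenerate primers can dilute the effective concentration of any one matching sequence.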
Example output (see Github for more detail):
PMX1 coordinate: 3106 entropy: 0.27 numPrimers:32768 GARGGNYNNTGYCARAARNTNTGGAC
PMX1 captures:
45 sequences with 0 mismatches
6 sequences with 1 mismatch(es)
4 sequences with 2 mismatch(es)
PMX2 coordinate: 3202 entropy: 0.33 numPrimers:65536 GGNGAYAAYCARNYNATWGCNRTNA
PMX2 captures:
31 sequences with 0 mismatches
19 sequences with 1 mismatch(es)
3 sequences with 2 mismatch(es)
2 sequences with 3 mismatch(es)