Commit 47ea55c6 authored by Tomas Pavlik

initial commit

parent d92fa176

.gitignore

0 → 100644
+18 −0
_data/metadata_filtered.tsv
_data/silva-138-99-seqs.qza
_data/silva-138-99-tax.qza
_data/silva_new.qza
_data/silva_new_wght.qza
demultiplexed_log
filtered
illumina
kraken2
KRAKEN2_DB
qiime2
qiime2-amplicon-2023.9-py38-linux-conda.yml
quality_R1
quality_R2
_software
verify.ok
.idea
.snakemake
 No newline at end of file
README.md

0 → 100644
+71 −0
# Metagenomic analysis pipeline
### IV110 Projekt z bioinformatiky I, fall '23

---

This pipeline generates all the underlying data needed for the subsequent analysis, report and presentation. Written
in Snakemake, it automatically, reliably and, most importantly, reproducibly processes the source Illumina sequencing data,
transforming and analysing it along the way.

## A 1000-mile bird's-eye view, aka the pipeline steps

![Pipeline description](pipeline.jpg)

## Used tools

For simplicity's and learning's sake, this pipeline (written in [Snakemake](https://snakemake.readthedocs.io/en/stable/)) is almost fully integrated into the [qiime2](https://qiime2.org/) environment.
The first step, preprocessing, is done using [cutadapt](https://cutadapt.readthedocs.io/en/stable/) and qiime2's [DADA2](https://benjjneb.github.io/dada2/) plugin.
Next, taxonomic analysis is carried out using:
 - qiime2's [Naive Bayes classifiers](https://docs.qiime2.org/2023.9/data-resources/) built on the [SILVA 138_1](https://www.arb-silva.de/documentation/release-1381/) database
 - kraken2's full-scale database, custom-built from NCBI sources

The taxonomic analysis is later visualised using [Krona](https://bio.tools/krona).
Phylogenetic analyses, as well as diversity analyses, are carried out using the integrated qiime2-amplicon plugins, as is the prediction of sample categories.
The last step, metabolic pathway analysis, is carried out using [PICRUSt](https://picrust.github.io/picrust/), again as a qiime2 plugin.
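Judging by the indices in `_data/fwd.fasta` (and the `demultiplexed_log` entry in .gitignore), demultiplexing assigns each read to a sample by its 5' index. A minimal sketch of that matching logic, assuming simple prefix matching and trimming (the helper name is made up; the real work is done by cutadapt):

```python
# Minimal demultiplexing sketch: assign a read to the first index
# that matches its 5' end, then trim the index off.
# The dict mirrors the first few entries of _data/fwd.fasta.
BARCODES = {
    "I01": "AAAGCGT",
    "I02": "ACGAAGT",
    "I03": "ACCTTGT",
}

def assign_sample(read):
    """Return (sample_id, trimmed_read), or None if no index matches."""
    for sample, barcode in BARCODES.items():
        if read.startswith(barcode):
            return sample, read[len(barcode):]
    return None

# Example: a read carrying the I02 index.
print(assign_sample("ACGAAGTTTGGCCAA"))  # → ('I02', 'TTGGCCAA')
```

Real demultiplexers also tolerate mismatches and handle paired reads; this sketch only shows the core idea.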

It is possible (and was once integrated/supported) to build a custom qiime2 taxonomic classifier using [RESCRIPt](https://github.com/bokulich-lab/RESCRIPt),
but this step was omitted from the final release, as it was replaced by the full-scale kraken2 analysis.

## Running the pipeline

> Please note that due to the outdated versions of the required software, many steps of this pipeline cannot be run on `metacentrum` or similar clusters
> (technically they can, but they will fail internally) without tearing your hair out trying to import external packages. As a friendly suggestion, do not even attempt it. This pipeline was NOT built for cluster/remote execution.

The pipeline is fully functional; given a correctly set-up environment, it is sufficient to run
> $ conda activate qiime_amplicon_environment
>
> $ snakemake -c1

... and wait roughly four hours. The pipeline produces fully reproducible (and consistent) results, and specific
steps can be changed, with updates propagated wherever needed.

### Obtaining source data
Given the project input files (~25 GB), a full pipeline run produces over 400 GB of data, which can later be reduced to ~60 GB by deleting files that are not
necessary for the later analyses. For this reason, among others, this pipeline **is NOT shipped** with the data necessary to run it. Please download the following:
 -  the Illumina sequencing data from [this link](https://is.muni.cz/auth/el/fi/podzim2023/IV110/index.qwarp?prejit=11818341) into the "./illumina" folder
 -  the SILVA 138_1 non-weighted Naive Bayes classifier, [Silva 138 99% OTUs full-length sequences](https://docs.qiime2.org/2023.9/data-resources/), as "_data/silva_new.qza"
 -  the SILVA 138_1 weighted Naive Bayes classifier, [Weighted Silva 138 99% OTUs full-length sequences](https://docs.qiime2.org/2023.9/data-resources/), as "_data/silva_new_wght.qza"
 -  the SILVA 138_1 source sequences, [Silva 138 SSURef NR99 full-length sequences](https://docs.qiime2.org/2023.9/data-resources/), as "_data/silva-138-99-seqs.qza"
 -  the SILVA 138_1 source taxonomy, [Silva 138 SSURef NR99 full-length taxonomy](https://docs.qiime2.org/2023.9/data-resources/), as "_data/silva-138-99-tax.qza"

Please refer to verify.dat (or README_illumina.txt) for the exact required locations as well as the hashes of the input files.
The pipeline's first step, `verify`, checks the presence of the necessary files as well as their hashes, and fails early if source data is missing, misplaced or corrupted.
Disabling or bypassing this step is strongly discouraged.
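The kind of check that `verify` performs can be sketched as follows. This is an illustrative sketch, not the pipeline's actual code: the function names and the manifest format are assumptions, and the authoritative file list lives in verify.dat.

```python
import hashlib
from pathlib import Path

def sha256sum(path):
    """Stream a file through SHA-256 so multi-GB inputs don't exhaust RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_inputs(manifest):
    """manifest: {path: expected_hex_digest}. Collect all problems, fail early."""
    errors = []
    for rel_path, expected in manifest.items():
        path = Path(rel_path)
        if not path.is_file():
            errors.append(f"missing: {rel_path}")
        elif sha256sum(path) != expected:
            errors.append(f"corrupted or wrong file: {rel_path}")
    if errors:
        raise SystemExit("verify failed:\n" + "\n".join(errors))
```

Reporting every missing or corrupted file at once, rather than stopping at the first, saves round-trips when several downloads went wrong.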

### Setting up the environment

The Snakemake pipeline runs in a conda environment, set up according to [Natively installing QIIME 2](https://docs.qiime2.org/2023.9/install/native/).
This can get tricky and may require the full-scale conda (not miniconda alternatives), as well as some hackery while installing the necessary plugins denoted in the Snakefile (step `verify`).
Also, please note that the pipeline was written under Linux; without modifying the commands, it will most likely not execute on other operating systems.
As for the other required packages and their releases, this project uses all the necessary software available from conda/pip/AUR, at the newest versions available on the 1st of December, 2023.

### Starting the pipeline and hardware restrictions

To speed up processing, everything that can run in multiple processes does so. The pipeline may therefore be unexpectedly
killed by the OS (SIGKILL, SIGXCPU and similar) for requiring too many resources. Please refer to the Snakefile to manually disable
multiprocessing and make the commands run sequentially. In the heaviest parts of the analysis, the pipeline takes about 50 GB of RAM and uses all available processor cores.

To start the pipeline, just run the two commands from the start of this section.
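If the OS keeps killing the run, an alternative to editing the Snakefile is to hand Snakemake fewer cores in the first place. A hypothetical helper, not part of the pipeline, for picking a conservative count:

```python
import os

def conservative_cores(reserve=1):
    """Pick a core count that leaves `reserve` cores for the OS,
    making SIGKILL/SIGXCPU-style terminations less likely."""
    available = os.cpu_count() or 1
    return max(1, available - reserve)

# Pass the result to snakemake instead of -c1, e.g.:
#   snakemake -c "$(python -c 'print(conservative_cores())')"
print(conservative_cores())
```

This trades some speed for stability; it does not cap RAM, so the ~50 GB peak still applies.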

README_illumina.txt

0 → 100644
+260 −0
illumina/:
run83
run89
run90
run92
run93
run95
run96

illumina/run83:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool17_R1.fastq.gz
pool17_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

illumina/run89:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool17_R1.fastq.gz
pool17_R2.fastq.gz
pool18_R1.fastq.gz
pool18_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

illumina/run90:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool17_R1.fastq.gz
pool17_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

illumina/run92:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool17_R1.fastq.gz
pool17_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

illumina/run93:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool17_R1.fastq.gz
pool17_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

illumina/run95:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool17_R1.fastq.gz
pool17_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

illumina/run96:
pool10_R1.fastq.gz
pool10_R2.fastq.gz
pool11_R1.fastq.gz
pool11_R2.fastq.gz
pool12_R1.fastq.gz
pool12_R2.fastq.gz
pool13_R1.fastq.gz
pool13_R2.fastq.gz
pool14_R1.fastq.gz
pool14_R2.fastq.gz
pool15_R1.fastq.gz
pool15_R2.fastq.gz
pool16_R1.fastq.gz
pool16_R2.fastq.gz
pool1_R1.fastq.gz
pool1_R2.fastq.gz
pool2_R1.fastq.gz
pool2_R2.fastq.gz
pool3_R1.fastq.gz
pool3_R2.fastq.gz
pool4_R1.fastq.gz
pool4_R2.fastq.gz
pool5_R1.fastq.gz
pool5_R2.fastq.gz
pool6_R1.fastq.gz
pool6_R2.fastq.gz
pool7_R1.fastq.gz
pool7_R2.fastq.gz
pool8_R1.fastq.gz
pool8_R2.fastq.gz
pool9_R1.fastq.gz
pool9_R2.fastq.gz

Snakefile

0 → 100644
+303 −0

File added.

Preview size limit exceeded, changes collapsed.

_data/fwd.fasta

0 → 100644
+41 −0
>I01
AAAGCGT
>I02
ACGAAGT
>I03
ACCTTGT
>I04
ATAATGT
>I05
AGGGTGT
>I06
AGCCAGT
>I07
AGTTCGT
>I08
AATGCAGT
>I09
AATTTAGT
>I10
AGATTGGT
>I11
ATCCTCGT
>I12
AATAGAGT
>I13
ATCCCTGT
>I14
ACATTTGT
>I15
ATTGCGTGT
>I16
ATAAAGAGT
>I17
ATGCTGAGT
>I18
ACGGCTCGT
>I19
AGATGATGT
>I20
AATATACGT