README.md 9.12 KB
Newer Older
Matej Lexa's avatar
Matej Lexa committed
1
2
# hic-te

3
A Nextflow workflow to analyze HiC data from SRA (NCBI Short Read Archive) for 3D contacts between repeat families. Slightly biased (but not limited to) towards LTR retroTEs and plant genomes.
Monika Čechová's avatar
Monika Čechová committed
4

5
6
7
8
*SYNOPSIS*

nextflow run [FILEBASE].nf -profile LIST[,LIST...] [PARAMS...]

9
To run the pipeline, the following parameters are mandatory:
Monika Čechová's avatar
Monika Čechová committed
10

11
*DATA*  
Monika Čechová's avatar
Monika Čechová committed
12

13
**reads**  
14
Reads from a Hi-C experiment. The easiest way to provide this is by listing their SRA id.  
15
--sra_run SRR14458670  
Monika Čechová's avatar
Monika Čechová committed
16

17
**reference**  
18
Reference genome corresponding to the organism the reads belong to. The reference should be in the fasta format.  
19
--reference Athaliana_167_TAIR10.fa
Monika Čechová's avatar
Monika Čechová committed
20

21
**plantsat**  
Matej Lexa's avatar
Matej Lexa committed
22
This file contains sequences of satellite DNA (medium and long tandem repeats) in plants (http://w3lamc.umbr.cas.cz/PlantSat). However other sequences to be mapped to the reference can be added.  
23
--plantsat PlantSat_Arabidopsis.fa
Monika Čechová's avatar
Monika Čechová committed
24

25
**exons_gff3**  
26
This file contains the annotations of the exons in the chosen reference genome.  
27
--exons_gff3 Athaliana_167_TAIR10.gene_exons.gff3
Monika Čechová's avatar
Monika Čechová committed
28

29
*SOFTWARE*  
30

31
Please provide paths to the local installations or use docker/singularity containers as implemented in the workflow.  
Monika Čechová's avatar
Monika Čechová committed
32

33
BBTools, https://jgi.doe.gov/data-and-tools/bbtools/  
34
--bbmap_java_path bbmap.jar
35
Provide path to the directory containing bbmap.jar  
36
--bbmap_reformat_path reformat.sh
37
Provide path to the function reformat.sh that is part of the bbmap package  
Monika Čechová's avatar
Monika Čechová committed
38

39
40
RepeatExplorer, http://repeatexplorer.org/  
Provide path to the clustering function of the Repeat Explorer. This is important to provide reference-free annotations of repeats in the Hi-C data.  
41
--re_seqclust_path repex_tarean/seqclust
Monika Čechová's avatar
Monika Čechová committed
42

Matej Lexa's avatar
Matej Lexa committed
43
*RUNNING THE PIPELINE*  
44
45

Typical commands for running the pipeline are as follows:  
Monika Čechová's avatar
Monika Čechová committed
46

47
**test data run on the local machine**  
48
nextflow run main_TE_2.nf -profile test  
Monika Čechová's avatar
Monika Čechová committed
49

50
51
**test data run with the support of docker/singularity**  
nextflow run main_TE_2.nf -profile test,docker"  
52
nextflow run main_TE_2.nf -profile test,singularity"
Monika Čechová's avatar
Monika Čechová committed
53

Matej Lexa's avatar
Matej Lexa committed
54
*OPTIONAL PARAMETERS*
55

Matej Lexa's avatar
Matej Lexa committed
56
**alternative input to save resources**
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75

    skip_nester             =   true
    input_nester            =   "/mnt/nas/biodata/hic-te/data2/data"

    skip_bow_index          =   false
    input_bow_index         =   "/mnt/nas/biodata/hic-te/ref_Slyc"

    skip_diachromatic       =   false
    input_diachromatic      =   "${params.data_dir}"

    skip_align              =   false
    input_align_bam         =   "/mnt/nas/biodata/hic-te/output/${params.sra_run}_diachromatic_truncated_paired_uniq_ref.bam"

    skip_re                 =   false
    input_re                =   "/mnt/nas/biodata/xnguyen3/hic-te/input/RE_output"

    skip_re_contigs_map     =   false
    input_re_contigs_gff    =   "/mnt/nas/biodata/hic-te/output/annotations/Osativa_323_v7.0_contigs.annotated_filtered.gff"

Matej Lexa's avatar
Matej Lexa committed
76
**general**
77
78
79
80
81
82
83
84
85

    help                    =   false

    data_dir                =   "/mnt/nas/biodata/hic-te/output"

    report_dir              =   "${params.data_dir}/pipeline_report"

    save_intermediate       =   true

Matej Lexa's avatar
Matej Lexa committed
86
**TE-greedy-nester settings**
87
88

    gff_suffix              =   "_genome_browser"
89
90
91
92
93
94
    
**Use Repeat Masker output instead**
    
    repeat_masker_gff       =   ""
    repeat_masker_out       =   ""

95

Matej Lexa's avatar
Matej Lexa committed
96
**diachromatic**
97
98
99

    REnzyme                 =   "DpnII"

Matej Lexa's avatar
Matej Lexa committed
100
**sra**
101
102
103
104
105
106
107

    skip_prefetch_sra       =   false

    skip_convert_fastq      =   false

    fastq_dir               =   "${params.data_dir}/${params.sra_run}"

Matej Lexa's avatar
Matej Lexa committed
108
109
**annotations**

110
111
112
113
114
115
116
117
    exons_gff3              =   "/mnt/nas/biodata/hic-te/Slycopersicum_691_ITAG4.0.gene_exons.gff3"

    plantsat                =   "/mnt/nas/biodata/hic-te/PlantSat_Solanum.fa"

    mRNA                    =   "/mnt/nas/biodata/hic-te/Slycopersicum_691_ITAG4.0.mRNA.gff3"

    rm                      =   "/mnt/nas/biodata/hic-te/Slycopersicum_691_ITAG4.0.repeatmasked_assembly_SL4.0.gff3"

Matej Lexa's avatar
Matej Lexa committed
118
**repeat explorer**
119
120
121
122
123
124

    re_seqclust_path        =   "/home/lexa/git/repex_tarean/seqclust" // full path to seqclust

    clustering_threshold    =   0.01

    RE_run                  =   "RE_output_" + params.sra_run + "_" + params.clustering_threshold
125
126
127
128
129
    
**heatmap vizualization**

    norm_ratio_threshold    =   2
    min_fam_pair_count      =   30
130

131
132
*NORMALIZATION METHODS*

133
134
135
136
In reference-based analysis, the contacts between families are counted as the number of HiC paired reads (after cleaning with Diachromatic) where one read of the pair maps to a region that is annotated by the family in question (family 1) and where the other read maps to a region annotated as another family (family 2). In special cases family1 = family2 (the diagonal in the heatmap). The counting is done in the pipeline after a “bedtools intersect” command is issued, which joins the mapping reads with the annotated intervals. The counted number is the number of lines in the output of this command that have the desired combination of family1 and family2 values (counted by extract_pairs.pl).

In reference-free analysis, counting follows the same logic, except annotation of HiC reads is not done by association to mapped positions (there are none) but by association with annotations assigned by Repeat Explorer to the given HiC reads pair.

137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
After counting all valid HiC pairs in the pipeline a table is created that contains family names in two columns (family1, family2) and in cases based on the reference genome also mapped positions (pos1, pos2). The number of combinations observed between positions and repeat families contains technical and methodological biases. For example there are many more pairs observed for adjacent positions on the same chromosomes compared to long-distance or interchromosomal HiC pairs. Some kind of normalization is therefore necessary before reporting basic statistics or creating heatmap visualizations. Choosing the right normalization method is far from trivial. After careful consideration, we chose three different methods that we use in parallel in towards the end of calculations in the pipeline when a familyxfamily matrix underlying each heatmap is calculated.

**joint probability**

The baseline probability of observing a HiC pair between family A and family B is estimated as joint probability of individual probabilities for observing family A or family B in a given HiC read.

  p(A,B) = p(A).p(B). 

All counts of A-B pairs are divided by this number.

**label permutation**

Family names in the table in columns family1 and family2 are randomly assigned to random (and therefore possibly different) rows of the table. A matrix is also created from this altered table. All counts of A-B pairs are divided by corresponding values in this matrix. 

**annotation shuffling**

While creating the HiC pair table, a parallel table is made, which uses chromosomal positions randomly shuffled along the reference genome. The pair count matrix constructed from such table is used for normalization as above (see label permutation). 

155
156
157
158
159
160
*NOTES*

**adding genomes/organisms**

While the pipeline will happily run on any HiC data and the corresponding reference genome, there are some limitation when running the vanilla gitlab version in such manner. Repeat classification done by Repeat Explorer, TE-greedy-nester and inner blast annotation scripts is plant-oriented, using the Neumann et al. classification scheme. TE-greedy nester enriches the  annotations for LTR retrotransposons. Alternative repeat annotations may be prefered for other organisms. Tandem repeats are collected as input in the PlantSat(other_annotations) file. This file is mapped onto the reference genome, so any sequences can be added in FASTA format.

161
162
163
164
**using Repeat Masker for reference-based annotation instead of TE-greedy-nester**

To make analysis more meaningful for animal species were LTR-retrotransposons are not the main category of repeats, or to provide annotation of additional repeat classes, comparet to only LTR-retrotransposons annotated by TE-greedy-nester, we allow the main reference-based repeat annotation to be provided in a GFF3 file. The pipeline is specifically tuned to accept a combination of *.out and *.gff files from RepeatMasker, but can be adapted to other sources of annotation. The main requirement is for the GFF3 file to contain an annot="repeat_family" variable and for the corresponding Perl script (here enrich_rmsk_gff3_annotation.pl) to be able to add that name from available output (here *.out and *.gff produced by Repeat Masker and UCSC Genome Browser bed_to_gff3 or Genome Tools gt bed2gff3.

Matej Lexa's avatar
Matej Lexa committed
165
**envisioned modifications**
166
167
168
169

  + adding tandem repeats to PlantSat.fa
  + changing classification scheme

170
**reusing intermediate data for speed-up**
171

172
The pipeline contains a number of intermediate steps that will be repeated if another analysis is run for the same organism or the same downloaded SRR file. See #alternative input to save resources# above (under optional parameters). For example once the reference genome is analyzed by TE-greedy-nester and an index is created by bowtie2, these steps can be skipped in subsequent runs against the same organism (reference genome). Just set skip_nester and skip_bow_index to 'true'. The same goes for Repeat Explorer output (set skip-re to 'true'). In each case the path or file with pre-calculated data needs to be provided by a corresponding input_* parameter.