Commit 3a12d251 authored by Matej Lexa

Added a description of the counting procedure

parent 590996c9
+4 −0
@@ -119,6 +119,10 @@ nextflow run main_TE_2.nf -profile test,singularity"

*NORMALIZATION METHODS*

In reference-based analysis, contacts between families are counted as the number of HiC read pairs (after cleaning with Diachromatic) in which one read maps to a region annotated as the family in question (family1) and the other read maps to a region annotated as another family (family2). In the special case family1 = family2, the count falls on the diagonal of the heatmap. The pipeline performs the counting after running a "bedtools intersect" command, which joins the mapped reads with the annotated intervals. The count is the number of lines in the output of this command that have the desired combination of family1 and family2 values (counted by extract_pairs.pl).
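The counting logic can be illustrated with a minimal Python sketch. It is not the code in extract_pairs.pl: the input format (one `(pair_id, family)` row per annotated read end), the function name `count_family_pairs`, the family names, and the choice to treat (family1, family2) and (family2, family1) as the same combination are all assumptions made for illustration.

```python
from collections import Counter


def count_family_pairs(rows):
    """Count read pairs per family combination.

    rows: iterable of (pair_id, family) tuples, one per annotated read end.
    This simplified layout is hypothetical; the real "bedtools intersect"
    output has more columns.
    """
    families_by_pair = {}
    for pair_id, family in rows:
        families_by_pair.setdefault(pair_id, []).append(family)

    counts = Counter()
    for families in families_by_pair.values():
        if len(families) != 2:
            # Assumption: only pairs with exactly two annotated ends count.
            continue
        # Assumption: order of the two families does not matter.
        family1, family2 = sorted(families)
        counts[(family1, family2)] += 1
    return counts


# Hypothetical example data.
example_rows = [
    ("pair1", "Ty1_copia"),
    ("pair1", "Ty3_gypsy"),
    ("pair2", "Ty3_gypsy"),
    ("pair2", "Ty3_gypsy"),  # family1 = family2: a diagonal entry
    ("pair3", "Ty3_gypsy"),
    ("pair3", "Ty1_copia"),
    ("pair4", "LINE"),       # only one annotated end: skipped
]
pair_counts = count_family_pairs(example_rows)
```

Here `pair_counts` holds one entry per family combination, with the `("Ty3_gypsy", "Ty3_gypsy")` entry corresponding to a diagonal cell of the heatmap.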

In reference-free analysis, counting follows the same logic, except that HiC reads are annotated not by their mapped positions (there are none) but by the annotations that Repeat Explorer assigns to each HiC read pair.

After all valid HiC pairs have been counted, the pipeline creates a table that contains family names in two columns (family1, family2) and, in reference-based analyses, also the mapped positions (pos1, pos2). The observed numbers of combinations between positions and repeat families contain technical and methodological biases. For example, many more pairs are observed for adjacent positions on the same chromosome than for long-distance or interchromosomal HiC pairs. Some form of normalization is therefore necessary before reporting basic statistics or creating heatmap visualizations. Choosing the right normalization method is far from trivial. After careful consideration, we chose three different methods, which the pipeline applies in parallel towards the end of its calculations, when the family × family matrix underlying each heatmap is computed.
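As a rough illustration of the raw (not yet normalized) family × family matrix, the following Python sketch tallies rows of a (family1, family2) table into a symmetric count matrix. The function name `build_family_matrix`, the placeholder family names, and the decision to mirror off-diagonal counts are assumptions for this example, not a description of the pipeline's own code.

```python
def build_family_matrix(pairs):
    """Build a symmetric matrix of raw pair counts.

    pairs: iterable of (family1, family2) tuples, one per valid HiC pair.
    Returns the sorted family names and a list-of-lists count matrix.
    """
    pairs = list(pairs)
    families = sorted({family for pair in pairs for family in pair})
    index = {family: i for i, family in enumerate(families)}
    matrix = [[0] * len(families) for _ in families]

    for family1, family2 in pairs:
        i, j = index[family1], index[family2]
        matrix[i][j] += 1
        if i != j:
            # Assumption: contacts are undirected, so mirror the count.
            matrix[j][i] += 1
    return families, matrix


# Hypothetical example table with placeholder family names.
example_pairs = [("A", "B"), ("B", "B"), ("B", "A"), ("A", "C")]
family_names, raw_matrix = build_family_matrix(example_pairs)
```

The resulting counts are the input that a normalization method would then rescale; the sketch deliberately stops before any normalization.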

**joint probability**