\title{TE-nester: a recursive software tool for structure-based discovery of transposable elements fragmented by insertion of repetitive sequences (application to retrotransposons in plant genomes)\\
\title{TE-nester: a recursive software tool for structure-based discovery of nested transposable elements\\
\thanks{This research is funded by the Czech Grant Agency (grant No. GA18-00258S to EK and ML).}
\thanks{This research is funded by the Czech Grant Agency (grant No. GA18-00258S to EK and ML).}
}
}
@@ -57,7 +57,7 @@ kejnovsk@ibp.cz}
\maketitle
\maketitle
\begin{abstract}
\begin{abstract}
Eukaryotic genomes are generally rich in repetitive sequences. LTR retrotransposons are the most abundant class of repetitive sequences in plant genomes. They form segments of genomic sequences that accumulate via individual events and bursts of retrotransposition. The individual copies then undergo various types of evolutionary erosion as well as fixation, resulting in a complex mix of fragments present in different parts of the genome. A limited number of tools exist that can identify fragments of repetitive sequences that likely originate from a longer, originally unfragmented element, using mostly sequence similarity to guide reconstruction of fragmented sequences. Here, we test a slightly different approach based on structural (as opposed to sequence-similarity-based) detection of unfragmented full-length elements, which are then recursively eliminated from the analyzed sequence to repeatedly uncover unfragmented copies hidden underneath more recent insertions. This approach has the potential to detect relatively old and highly fragmented copies. We created a software tool for this kind of analysis called TE-nester and applied it to a number of assembled plant genomes to discover pairs of nested LTR retrotransposons of various ages and fragmentation states. We test hypotheses about genome evolution and TE life cycle and insertion history against this unique and novel dataset. The software, still under development, is available for download from a repository at \url{https://gitlab.fi.muni.cz/lexa/nested}.
Eukaryotic genomes are generally rich in repetitive sequences. LTR retrotransposons are the most abundant class of repetitive sequences in plant genomes. They form segments of genomic sequences that accumulate via individual events and bursts of retrotransposition. A limited number of tools exist that can identify fragments of repetitive sequences that likely originate from a longer, originally unfragmented element, using mostly sequence similarity to guide reconstruction of fragmented sequences. Here, we use a slightly different approach based on structural (as opposed to sequence-similarity-based) detection of unfragmented full-length elements, which are then recursively eliminated from the analyzed sequence to repeatedly uncover unfragmented copies hidden underneath more recent insertions. This approach has the potential to detect relatively old and highly fragmented copies. We created a software tool for this kind of analysis called TE-nester and applied it to a number of assembled plant genomes to discover pairs of nested LTR retrotransposons of various ages and fragmentation states. TE-nester will allow us to test hypotheses about genome evolution, TE life cycle and insertion history. The software, still under improvement, is available for download from a repository at \url{https://gitlab.fi.muni.cz/lexa/nested}.
\end{abstract}
\end{abstract}
\begin{IEEEkeywords}
\begin{IEEEkeywords}
@@ -94,7 +94,7 @@ After several rounds of design decisions we arrived at a procedure that works in
Evaluation of full-length TE candidates is done by constructing a weighted directed graph, where nodes represent required sites in a full-length element (such as domains, pbs, ppt, tsd) (Fig.~\ref{fig1}). The program tries to find a path from the left LTR to the right LTR, whilst visiting every required node in the correct order (domains are ordered differently in Gypsy and Copia families; some, such as {\it env}, are family-specific or optional). By assigning weights to the edges, we prioritize a path that has as complete a structure as possible. At the same time, we allow alternative paths with respective penalties in case of either a missing node or an incorrect order of available nodes.
Evaluation of full-length TE candidates is done by constructing a weighted directed graph, where nodes represent required sites in a full-length element (such as domains, pbs, ppt, tsd) (Fig.~\ref{fig1}). The program tries to find a path from the left LTR to the right LTR, whilst visiting every required node in the correct order (domains are ordered differently in Gypsy and Copia families; some, such as {\it env}, are family-specific or optional). By assigning weights to the edges, we prioritize a path that has as complete a structure as possible. At the same time, we allow alternative paths with respective penalties in case of either a missing node or an incorrect order of available nodes.
\caption{The weighted directed graph used to evaluate individual TE candidates.}
\caption{The weighted directed graph used to evaluate individual TE candidates.}
\label{fig1}
\label{fig1}
\end{figure}
\end{figure}
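
The path search described above can be illustrated with a minimal Python sketch. It is not the TE-nester implementation: the node names, the Gypsy-like domain order in \texttt{EXPECTED\_ORDER}, the two penalty values and the function names are all assumptions chosen for illustration. Features found in a candidate are sorted by position, every earlier feature is connected to every later one, and the cheapest path from the left LTR to the right LTR is found by dynamic programming.

```python
# Illustrative only: the order and penalty values are assumptions, not TE-nester's.
EXPECTED_ORDER = ["ltr_left", "pbs", "gag", "prot", "rt", "rh", "int", "ppt", "ltr_right"]
MISSING_PENALTY = 1.0  # charged for every required node skipped by an edge
ORDER_PENALTY = 3.0    # charged when an edge goes against the expected order


def edge_weight(src, dst):
    """Penalty for stepping from node `src` to node `dst`."""
    step = EXPECTED_ORDER.index(dst) - EXPECTED_ORDER.index(src)
    if step >= 1:
        return (step - 1) * MISSING_PENALTY
    return ORDER_PENALTY


def best_path(features):
    """Return (penalty, path) for a list of (name, position) features."""
    names = [name for name, _ in sorted(features, key=lambda f: f[1])]
    if not names or names[0] != "ltr_left" or names[-1] != "ltr_right":
        raise ValueError("candidate must start and end with an LTR")
    n = len(names)
    cost = [float("inf")] * n
    prev = [None] * n
    cost[0] = 0.0
    # Features are sorted by position, so edges only run left to right
    # and a single pass over the nodes is enough.
    for i in range(1, n):
        for j in range(i):
            candidate = cost[j] + edge_weight(names[j], names[i])
            if candidate < cost[i]:
                cost[i], prev[i] = candidate, j
    path, k = [], n - 1
    while k is not None:
        path.append(names[k])
        k = prev[k]
    return cost[-1], path[::-1]
```

Under these assumed weights a complete element scores zero, each missing required node adds one unit, and a path that visits nodes out of order is charged more than one that skips them, so the most complete and correctly ordered structure is preferred.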
@@ -102,13 +102,13 @@ Evaluation of full-length TE candidates is done by constructing a weighted direc
We also need a way to recover various subsequences of the analyzed sequence, such as the original unfragmented sequences of older TEs fragmented by nesting. The identified features must also be annotated correctly on the analyzed sequence. This is achieved by a procedure where the removed sequences are virtually returned to their positions in the genome and the coordinates of TEs and their features are adjusted for the inserted element. Once all TEs that were removed in the first phase are processed, we generate a GFF3 file with coordinates that map to the analyzed sequence (Fig.~\ref{fig2}). The final GFF3 output file can be used to visualize all the identified features with specialized software, such as GenomeTools AnnotationSketch (Fig.~\ref{fig3})\cite{gremme_etal_2013} or a genome browser, or to extract sequences for certain features using, for example, bedtools.
We also need a way to recover various subsequences of the analyzed sequence, such as the original unfragmented sequences of older TEs fragmented by nesting. The identified features must also be annotated correctly on the analyzed sequence. This is achieved by a procedure where the removed sequences are virtually returned to their positions in the genome and the coordinates of TEs and their features are adjusted for the inserted element. Once all TEs that were removed in the first phase are processed, we generate a GFF3 file with coordinates that map to the analyzed sequence (Fig.~\ref{fig2}). The final GFF3 output file can be used to visualize all the identified features with specialized software, such as GenomeTools AnnotationSketch (Fig.~\ref{fig3})\cite{gremme_etal_2013} or a genome browser, or to extract sequences for certain features using, for example, bedtools.
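
The coordinate adjustment can be sketched in a few lines of Python. This is a hedged illustration rather than the actual TE-nester code: the function names, the 0-based half-open interval convention, the \texttt{(start, length)} representation of a removed element and the \texttt{source} label in the GFF3 line are assumptions made for this example. Each removal is recorded in the coordinates of the sequence as it was just before that removal; undoing the removals in reverse order maps a feature found in a reduced sequence back to the analyzed sequence.

```python
def expand_point(pos, removals):
    """Map a 0-based position in a reduced sequence back to the original sequence.

    `removals` lists the (start, length) pairs removed before the feature was
    found, in removal order, each in the coordinates valid at removal time.
    """
    for start, length in reversed(removals):
        if pos >= start:
            pos += length  # the removed element is put back to the left of pos
    return pos


def expand_interval(start, end, removals):
    """Map a half-open interval [start, end); it grows if it spans an insertion."""
    new_start = expand_point(start, removals)
    # Map the last included base, then convert back to an exclusive end.
    new_end = expand_point(end - 1, removals) + 1
    return new_start, new_end


def gff3_line(seqid, feature_type, start, end, feature_id):
    """Format a 0-based half-open interval as a GFF3 line (1-based, inclusive)."""
    fields = [seqid, "sketch", feature_type, str(start + 1), str(end),
              ".", ".", ".", f"ID={feature_id}"]
    return "\t".join(fields)
```

For example, if a young element occupying positions 40 to 59 is removed first and an older element is then found at $[10, 70)$ in the reduced sequence, \texttt{expand\_interval(10, 70, [(40, 20)])} places the older element at $[10, 90)$ in the analyzed sequence, so that its span includes the nested insertion.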