final version prepared (eb2aef57)

doc/bibm_lexa_etal_2018.tex

+20 −31

		@@ -7,13 +7,14 @@
		\usepackage{graphicx}
		\usepackage{textcomp}
		\usepackage{xcolor}
		\usepackage{balance}
		\usepackage{url}
		\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
		T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
		\begin{document}

		\title{TE-nester: a recursive software tool for structure-based discovery of transposable elements fragmented by insertion of repetitive sequences (application to retrotransposons in plant genomes)\\
		\thanks{This research is funded by the Czech Grant Agency grant No. S-XXXXX to EK and ML).}
		\thanks{This research is funded by the Czech Grant Agency grant No. GA18-00258S to EK and ML.}
		}

		\author{\IEEEauthorblockN{1\textsuperscript{st} Matej Lexa}
		@@ -56,7 +57,7 @@ kejnovsk@ibp.cz}
		\maketitle

		\begin{abstract}
		Eukaryotic genomes are generally rich in repetitive sequences. LTR retrotransposons are the most abundant class of repetitive sequences in plant genomes. They form segments of genomic sequences that accumulate via individual events and bursts of retrotransposition. The individual copies then undergo various types of evolutionary erosion as well as fixation, resulting in a complex mix of fragments present in different parts of the genomes. A limited number of tools exist that can identify fragments of repetitive sequences that likely originate from a longer, originally unfragmented element, using mostly sequence similarity to guide reconstruction of fragmented sequences. Here, we test a slightly different approach based on structural (as opposed to sequence similarity) detection of unfragmented full-length elements, which are then recursively eliminated from the analyzed sequence to repeatedly uncover unfragmented copies hidden underneath more recent insertions. This approach has the potential to detect relatively old and highly fragmented copies. We created a software tool for this kind of analysis called TE-nester and applied it to a number of assembled plant genomes to discover pairs of nested LTR retrotransposons of various age and fragmentation state. We test hypotheses about genome evolution and TE life cycle and insertion history against this unique and novel dataset. The software is available for download from a repository at \url{https://gitlab.fi.muni.cz/lexa/nested}.
		Eukaryotic genomes are generally rich in repetitive sequences. LTR retrotransposons are the most abundant class of repetitive sequences in plant genomes. They form segments of genomic sequences that accumulate via individual events and bursts of retrotransposition. The individual copies then undergo various types of evolutionary erosion as well as fixation, resulting in a complex mix of fragments present in different parts of the genomes. A limited number of tools exist that can identify fragments of repetitive sequences that likely originate from a longer, originally unfragmented element, using mostly sequence similarity to guide reconstruction of fragmented sequences. Here, we test a slightly different approach based on structural (as opposed to sequence-similarity-based) detection of unfragmented full-length elements, which are then recursively eliminated from the analyzed sequence to repeatedly uncover unfragmented copies hidden underneath more recent insertions. This approach has the potential to detect relatively old and highly fragmented copies. We created a software tool for this kind of analysis called TE-nester and applied it to a number of assembled plant genomes to discover pairs of nested LTR retrotransposons of various ages and fragmentation states. We test hypotheses about genome evolution and TE life cycle and insertion history against this novel dataset. The software, still under development, is available for download from a repository at \url{https://gitlab.fi.muni.cz/lexa/nested}.
		\end{abstract}

		\begin{IEEEkeywords}
		@@ -66,13 +67,13 @@ bioinformatics, software, LTR-retrotransposons, sequence analysis, genome evolut
		\section{Introduction}
		Genomes of most eukaryotic organisms contain repetitive sequences present either as tandem repeat arrays or dispersed repeats created by different classes of transposable elements (or transposons)\cite{smit_1999}\cite{kapitonov_jurka_1999}. The dispersed repeats are produced throughout evolution in transposition bursts of various intensities, where some transposition events result in insertions that fragment another transposon already present at the insertion locus, and therefore in {\it nesting} of what would otherwise be separate full-length repeats. Previous estimates of this kind of nesting in plants range from no nesting detected in {\it Physcomitrella patens} to 14.6\% in {\it Oryza sativa}\cite{gao_etal_2012}.

		Many tools and approaches exist for discovery of repeated sequences and their families\cite{bergman_quesneville_2007}\cite{saha_etal_2008}. To discover nesting and make more sense of what would otherwise be a complex sequence made of a mixture of repeat fragments, people have come up with strategies to identify transposon fragments that may have originally formed a full-length element. Perhaps the most popular tool with such capability is Repeat Masker in its newer incarnations. It identifies fragments based on sequence similarity to a library of known repeats and stitches together closeby fragments that seem to be continuation of each other when mapped to a model element. Another more specialized software tool is TEnest, software for disentangling nested insertions of LTR retrotrasnposons, which however still relies on sequence similarity and classification of identified repeats into families\cite{kronmiller_wise_2008}\cite{kronmiller_wise_2013}. If closeby fragments belong to the same family, the software may assign them to the same originally full-length element and thus establish a nesting order. An alignment-based software tool called Greedier\cite{li_etal_2008} also has the ability to discover nested insertions of transposons. Interestingly, this tool is similar to our approach in to aspects. First, we also greedily identify retrotransposons in successive iterations, however we do not use sequence similarity to identify regions of repetitive sequence insertons. Second, we also create a graph data structure to find best TE candidates, however the two structures carry different types of data and are used for slightly different purposes.
		Many tools and approaches exist for discovery of repeated sequences and their families\cite{bergman_quesneville_2007}\cite{saha_etal_2008}. To discover nesting and make more sense of what would otherwise be a complex sequence made of a mixture of repeat fragments, researchers have developed strategies to identify transposon fragments that may have originally formed a full-length element. Perhaps the most popular tool with such capability is RepeatMasker in its newer incarnations\cite{smit_etal_1996}\cite{smit_etal_2015}. It identifies fragments based on sequence similarity to a library of known repeats and stitches together nearby fragments that seem to be continuations of each other when mapped to a model element. Another, more specialized software tool is TEnest, which disentangles nested insertions of LTR retrotransposons but still relies on sequence similarity and classification of identified repeats into families\cite{kronmiller_wise_2008}\cite{kronmiller_wise_2013}. If nearby fragments belong to the same family, the software may assign them to the same originally full-length element and thus establish a nesting order. An alignment-based software tool called Greedier\cite{li_etal_2008} also has the ability to discover nested insertions of transposons. Interestingly, this tool is similar to our approach in two aspects. First, we also greedily identify retrotransposons in successive iterations; however, we do not use sequence similarity to identify regions of repetitive sequence insertions. Second, we also create a graph data structure to find the best TE candidates; however, the two structures carry different types of data and are used for slightly different purposes.

		Because all of the available tools rely heavily on evaluation of sequence similarity at some key step, we set out to test an alternative approach using structure-based recognition of repetitive sequences, relying on identification of component features of a typical transposon and their relative position. Such tools are available for certain classes of repetitive sequences, such as LTR retrotransposons\cite{mccarthy_mcdonald_2003}\cite{xu_wang_2007}\cite{ellinghaus_etal_2008}; however, none of them is capable of detecting element nesting. We therefore employed a recursive approach in combination with LTR\_FINDER\cite{xu_wang_2007} and implemented it in a Python tool called TE-nester. Here, we report the results of applying this computational machinery to about a dozen plant genomes and discuss the results in the light of current knowledge on TE life cycle and genome evolution, especially in areas where it is important to know relative insertion times.

		\section{Algorithm}

		Our main goal was to design an application capable of processing sequences automatically and finding nested transposable elements in reasonable time. We needed to take into consideration specific problems related to correct detection of element nesting. First, while reasonable sensitive, the procedure should be resistant to detecting false positives. To this end we incorporate a greedy algorithm that evaluates several possible candidates for full-length TEs but ultimately picks only the best ones, based on presence of the typical full-length TE sequence features. As a result, false positives are quite rare in the beginning and may become more frequent at later stages of computation which, however, can be stopped at that point. Another requirement is the ability to detect deep nesting. In such cases, the oldest elements are barely recognizable because of ageing and the procedure must allow for certain imperfections without compromising the ability to detect the partly-eroded element.
		Our main goal was to design an application capable of processing sequences automatically and finding nested transposable elements in reasonable time. We needed to take into consideration specific problems related to correct detection of element nesting. First, while sensitive enough, the procedure should be resistant to detecting false positives. To this end we incorporate a greedy algorithm that evaluates several possible candidates for full-length TEs but ultimately picks only the best ones, based on presence of the typical full-length TE sequence features. As a result, false positives are quite rare in the beginning and may become more frequent at later stages of computation which, however, can be stopped at that point. Another requirement is the ability to detect deep nesting. In such cases, the oldest elements are barely recognizable because of ageing and the procedure must allow for certain imperfections without compromising the ability to detect the partly-eroded element.

		After several rounds of design decisions we arrived at a procedure that works in the following way:

		@@ -90,7 +91,7 @@ After several rounds of design decisions we arrived at a procedure that works in

		\end{enumerate}

		Evaluation of full-length TE candidates is done by constructing a weighted directed graph, where nodes represent required sites in a full-length element (such as domains, pbs, ppt, tsd) (Fig.~\ref{fig1}). The program is trying to find a path from left LTR to the right LTR, whilst visiting every required node in the correct order (domains are ordered differently in Gypsy and Copia families). By assigning weight to the edges, we prioriize a path that has as complete a structure as possible. At the same time, we allow alternative paths with respective penalties in case of either a missing node, or an incorrect order of available nodes.
		Evaluation of full-length TE candidates is done by constructing a weighted directed graph, where nodes represent required sites in a full-length element (such as protein domains, PBS, PPT and TSD) (Fig.~\ref{fig1}). The program tries to find a path from the left LTR to the right LTR, whilst visiting every required node in the correct order (domains are ordered differently in Gypsy and Copia families; some, like {\it env}, are family-specific or optional). By assigning weights to the edges, we prioritize a path that has as complete a structure as possible. At the same time, we allow alternative paths with respective penalties in case of either a missing node or an incorrect order of available nodes.
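The path search described above can be illustrated with a minimal sketch. It is not the TE-nester implementation: the feature names, the two domain orders, the penalty values and the use of a minimum-cost (rather than maximum-weight) path are all assumptions made for illustration. Detected features are sorted by position, every earlier feature is connected to every later one, and an edge costs one penalty per expected feature it skips, or a larger fixed penalty if it goes against the expected order.

```python
from math import inf

# Illustrative (assumed) feature orders between the two LTRs.
GYPSY = ["ltr_left", "pbs", "gag", "prot", "rt", "rnaseh", "int", "ppt", "ltr_right"]
COPIA = ["ltr_left", "pbs", "gag", "prot", "int", "rt", "rnaseh", "ppt", "ltr_right"]


def best_path_cost(features, expected, missing_penalty=1.0, order_penalty=3.0):
    """Return the cheapest cost of a path from the left LTR to the right LTR.

    features: list of (name, position) pairs for one TE candidate.
    expected: feature names in the order expected for one family.
    Penalty values are hypothetical, not taken from TE-nester.
    """
    rank = {name: i for i, name in enumerate(expected)}
    # Nodes of the graph: known features ordered by genomic position.
    nodes = sorted((f for f in features if f[0] in rank), key=lambda f: f[1])
    if not nodes or nodes[0][0] != "ltr_left" or nodes[-1][0] != "ltr_right":
        raise ValueError("candidate must start and end with an LTR")

    cost = [inf] * len(nodes)
    cost[0] = 0.0
    # Positions give a topological order, so one forward pass is enough.
    for j in range(1, len(nodes)):
        for i in range(j):
            gap = rank[nodes[j][0]] - rank[nodes[i][0]]
            if gap >= 1:
                # Edge skips (gap - 1) expected features.
                weight = (gap - 1) * missing_penalty
            else:
                # Edge goes against the expected order.
                weight = order_penalty
            cost[j] = min(cost[j], cost[i] + weight)
    return cost[-1]


# A candidate whose domains follow the Copia order.
candidate = [(name, 100 * i) for i, name in enumerate(COPIA)]
print(best_path_cost(candidate, COPIA), best_path_cost(candidate, GYPSY))
```

Under these assumptions a candidate can be scored against each family model and assigned to the one with the lower cost; a complete element in the expected order scores zero.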

		\begin{figure}[htbp]
		\centerline{\includegraphics[width=\columnwidth]{img/fig1.png}}
		@@ -98,7 +99,7 @@ Evaluation of full-length TE candidates is done by constructing a weighted direc
		\label{fig1}
		\end{figure}

		We also need a way to recover various subsequences of the analyzed sequence, such as the original unfragmented sequences of older TEs fragmented by nesting. The identified features also must be properly annotated to the analyzed sequence. This is achieved by a procedure were the removed sequences are virtually returned to their positions in the genome and the coordinates of TEs and their features are adjusted for the inserted element. Once all TEs that were removed in the first phase are processed, we generate a GFF3 file with coordinates that map to the analyzed sequence (Fig.~\ref{fig2}). The final GFF output file can be used to visualize all the identified features with specialized software, such as Genome Tools Annotation Sketch (Fig.~\ref{fig3}), a genome browser, or to extract sequences for certain features using bedtools, for example.
		We also need a way to recover various subsequences of the analyzed sequence, such as the original unfragmented sequences of older TEs fragmented by nesting. The identified features must also be correctly mapped onto the analyzed sequence. This is achieved by a procedure where the removed sequences are virtually returned to their positions in the genome and the coordinates of TEs and their features are adjusted for the inserted element. Once all TEs that were removed in the first phase are processed, we generate a GFF3 file with coordinates that map to the analyzed sequence (Fig.~\ref{fig2}). The final GFF3 output file can be used to visualize all the identified features with specialized software, such as GenomeTools AnnotationSketch (Fig.~\ref{fig3})\cite{gremme_etal_2013} or a genome browser, or to extract sequences for certain features using, for example, bedtools.
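The coordinate adjustment can be sketched as follows. This is an illustration under stated assumptions, not the TE-nester code: intervals are 0-based and half-open (GFF3 itself is 1-based and inclusive), each removed element is recorded in the coordinates of the sequence at the time of its removal, and both function names are hypothetical. Mapping an element back to the original sequence replays the earlier removals in reverse, shifting or splitting its fragments each time.

```python
def reinsert(fragments, ins_start, ins_len):
    """Map fragments from coordinates after a removal to coordinates before it.

    fragments: list of half-open (start, end) intervals.
    ins_start: position where the removed element used to begin.
    ins_len:   length of the removed element.
    """
    mapped = []
    for start, end in fragments:
        if end <= ins_start:
            # Entirely upstream of the removed element: unchanged.
            mapped.append((start, end))
        elif start >= ins_start:
            # Entirely downstream: shifted by the element length.
            mapped.append((start + ins_len, end + ins_len))
        else:
            # Spans the removal site: split into two fragments.
            mapped.append((start, ins_start))
            mapped.append((ins_start + ins_len, end + ins_len))
    return mapped


def to_original(removals):
    """Return, for each removed element, its fragments in original coordinates.

    removals: list of (start, end) in order of removal, each expressed in
    the coordinates of the sequence at the moment the element was removed.
    """
    result = []
    for k, (start, end) in enumerate(removals):
        fragments = [(start, end)]
        # Undo the earlier removals, most recent first.
        for prev_start, prev_end in reversed(removals[:k]):
            fragments = reinsert(fragments, prev_start, prev_end - prev_start)
        result.append(fragments)
    return result


# A young element at 40..60 was removed first; the older element it had
# interrupted was then found at 20..70 in the shortened sequence.
print(to_original([(40, 60), (20, 70)]))
```

In this example the older element maps back to two fragments flanking the younger one, which is the information a nested annotation needs to report.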

		\begin{figure}[htbp]
		\centerline{\includegraphics[width=\columnwidth]{img/fig2.png}}
		@@ -112,39 +113,25 @@ We also need a way to recover various subsequences of the analyzed sequence, suc
		\label{fig3}
		\end{figure}

		\section{Implementation and testing}

		xxx

		\subsection{Implementation}

		xxx

		\subsection{Testing}

		xxx

		\section{Plant genome analysis}

		To gain new insights into plant genome organization and also to identify potential weak spots of TE-nester, we set out to analyze the assembled sequences of several plant genomes. Twelve genomes of 4 monocots and 8 dicots were analysed with TE-nester as described in Methods.

		TO DO NESTER RESULTS FROM PAVEL
		To gain new insights into plant genome organization and also to identify potential weak spots of TE-nester, we set out to analyze the assembled sequences of several plant genomes. Twelve genomes of 4 monocots and 8 dicots were analysed with TE-nester as described in Methods. Approximately 2000 high-quality nested TE pairs were identified, some of them in a nesting hierarchy of several elements.

		\section{Methods}

		The TE-nester software described herein is available as a frozen repository copy from \url{http://www.fi.muni.cz/~lexa/te-nester.zip}. The latest version can always be obtained from our GitLab project homepage at \url{https://gitlab.fi.muni.cz/lexa/nested}.
		The TE-nester software described herein is available from our GitLab project homepage at \url{https://gitlab.fi.muni.cz/lexa/nested}.

		To carry out tests of the software, especially its ability to recover nested sequences, we used nested-generator, part of the code that is designed to carry out virtual insertions of TE sequences from a library into a background sequence. The precise position of each inserted sequence in the resulting test sequence is recorded, allowing us to compare the generated GFF3 file with results of analysis of the same sequence by te-nester. To make the test as similar to real world genomic data as possible, we used a library of more thn 60000 LTR retrotransposons identified in plant genomes as available in the MIPS ReDat collection \url{http://mips.redat.de/something/here.html}. The command to generate the testing data is {\it command-here}.
		To carry out tests of the software, especially its ability to recover nested sequences, we use nested-generator, part of the code that is designed to carry out virtual insertions of TE sequences from a library into a background sequence. The precise position of each inserted sequence in the resulting test sequence is recorded, allowing us to compare the generated GFF3 file with results of analysis of the same sequence by TE-nester.
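A minimal sketch of how such a test sequence could be generated is given below. It does not reproduce nested-generator: the function name, the record layout, the uniform choice of insertion point and the 0-based half-open coordinates are assumptions for illustration. Each new element is inserted at a random position, and the recorded fragments of earlier elements are shifted or split so that they always refer to the current sequence.

```python
import random


def generate_nested(background, elements, seed=0):
    """Insert elements one by one into background and track their positions.

    Returns the final sequence and one record per element, holding the
    half-open fragments that the element occupies in the final sequence.
    """
    rng = random.Random(seed)
    seq = background
    records = []
    for index, element in enumerate(elements):
        # Any position from 0 to len(seq) is a valid insertion point.
        pos = rng.randint(0, len(seq))
        size = len(element)
        seq = seq[:pos] + element + seq[pos:]
        for record in records:
            updated = []
            for start, end in record["fragments"]:
                if end <= pos:
                    updated.append((start, end))
                elif start >= pos:
                    updated.append((start + size, end + size))
                else:
                    # The new element landed inside an older one.
                    updated.append((start, pos))
                    updated.append((pos + size, end + size))
            record["fragments"] = updated
        records.append({"id": index, "fragments": [(pos, pos + size)]})
    return seq, records


library = ["ACGTACGTAC", "GGGGCCCC", "TTATTA"]
sequence, truth = generate_nested("A" * 50, library, seed=1)
for record, element in zip(truth, library):
    # Joining the fragments must give back the inserted element.
    assert "".join(sequence[s:e] for s, e in record["fragments"]) == element
```

The recorded fragments serve as ground truth against which the annotation produced for the same sequence can be compared.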

		Plant genomes were downloaded from Phytozome v12.1 at \url{https://phytozome.jgi.doe.gov/pz/portal.html} for the following species: {\it A.lyrata}, {\it A.thaliana}, {\it G.raimondii}, {\it G.max}, {\it M.truncatula}, {\it S. lycopersicum}, {\it S.tuberosum}, {\it S. bicolor}, {\it B.distachyon}, {\it O.sativa}, {\it P.patens}. Each genome was analysed by TE-nester by executing the following command on a Linux Ubuntu v.XX machine with a XYZ processor and XGB of memory: {\it nested-nester genomeassembly.fa}.
		Plant genomes were downloaded from Phytozome v12.1 at \url{https://phytozome.jgi.doe.gov/pz/portal.html} for the following species: {\it A. lyrata}, {\it A. thaliana}, {\it G. raimondii}, {\it G. max}, {\it M. truncatula}, {\it S. lycopersicum}, {\it S. tuberosum}, {\it S. bicolor}, {\it B. distachyon}, {\it O. sativa}, {\it P. patens}. Each genome was analysed by TE-nester by executing the following command on a Linux Ubuntu machine with a recent processor and 8~GB of memory: {\it nested-nester genomeassembly.fa}.

		TODO PAVEL - add the processing of the GFF output by filtering

		The resulting GFF file was then used to calculate all the statistics presented herein.
		The resulting GFF file was then used to calculate all the statistics presented herein and in the poster.

		\section*{Acknowledgment}

		We thank
		We thank Michal Jenco for his TE annotation efforts that led to the idea to design TE-nester.

		\balance

		%\section*{References}

		@@ -154,14 +141,16 @@ We thank
		\bibitem{gao_etal_2012} C. Gao, M. Xiao, X. Ren, A. Hayward, J. Yin, L. Wu, D. Fu and J. Li. Characterization and functional annotation of nested transposable elements in eukaryotic genomes. Genomics, vol. 100, pp. 222--230, 2012.
		\bibitem{bergman_quesneville_2007} C. M. Bergman, H. Quesneville. Discovering and detecting transposable elements in genome sequences. Briefings in Bioinformatics, vol. 8, pp. 382--92, 2007.
		\bibitem{saha_etal_2008} S. Saha, S. Bridges, Z. V. Magbanua, D. G. Peterson. Computational Approaches and Tools Used in Identification of Dispersed Repetitive DNA Sequences. Tropical Plant Biology, Feb 2008.
		\bibitem{smit_etal_1996} A. F. Smit, R. Hubley and P. Green RepeatMasker. Published on the web at http://www.repeatmasker.org, 1996.
		\bibitem{bedell_etal_2000} J. A. Bedell, I. Korf and W. Gish MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics, vol. 16, pp. 1040--1041, 2000.
		\bibitem{smit_etal_1996} A. F. Smit, R. Hubley and P. Green. RepeatMasker. Published on the web at http://www.repeatmasker.org, 1996.
		\bibitem{smit_etal_2015} A. F. Smit, R. Hubley and P. Green. RepeatMasker Open-4.0. Published on the web at http://www.repeatmasker.org, 2015.
		\bibitem{kronmiller_wise_2008} B. A. Kronmiller and R. P. Wise. TEnest: automated chronological annotation and visualization of nested plant transposable elements. Plant Physiol, vol. 146, pp. 45--59, 2008.
		\bibitem{kronmiller_wise_2013} B. A. Kronmiller and R. P. Wise. TEnest 2.0: computational annotation and visualization of nested transposable elements. Methods Mol Biol, vol. 1057, pp. 305--319, 2013.
		\bibitem{li_etal_2008} X. Li, T. Kahveci and A. M. Settles. A novel genome-scale repeat finder geared towards transposons. Bioinformatics, vol. 24, pp. 468--476, 2008.
		\bibitem{mccarthy_mcdonald_2003} E. M. McCarthy and J. F. McDonald. LTR\_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics, vol. 19, pp. 362--367, 2003.
		\bibitem{xu_wang_2007} Z. Xu and H. Wang. LTR\_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research, vol. 35(suppl 2), pp. W265--W268, 2007.
		\bibitem{ellinghaus_etal_2008} D. Ellinghaus, S. Kurtz and U. Willhoeft. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics, vol. 9, p. 18, 2008.
		\bibitem{gremme_etal_2013} G. Gremme, S. Steinbiss and S. Kurtz.
		GenomeTools: a comprehensive software library for efficient processing of structured genome annotations. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 10, pp. 645--656, 2013.

		\end{thebibliography}