Why run StringTie?
How long does it take to run StringTie?
StringTie is not only accurate but also very fast compared to most other transcriptome assemblers. Here we show typical running times for StringTie and Cufflinks on four large real data sets: three human RNA-seq data sets downloaded from the ENCODE project (GEO accessions GSM981256, GSM981244, and GSM984609) and one RNA-seq data set generated from nuclear RNA from a human kidney cell line (NCBI study accession SRP041943). Both programs were run on the same multi-core 2.1 GHz AMD Opteron servers using 8 threads. Times are shown as hours:minutes.
How does Cufflinks compare to StringTie?
Our focus in developing StringTie was on building a system that can assemble and quantitate transcripts regardless of whether gene annotation is available. For this task, the Cufflinks system has been the leading method since it first appeared in 2010. In our experiments, Cufflinks consistently outperformed all other transcriptome assemblers (except StringTie) on a variety of human RNA-seq data sets, in many cases by a large margin. Nonetheless, StringTie consistently outperforms Cufflinks by a substantial amount, as shown below on four real data sets: GSM981256, GSM981244, GSM984609, and SRP041943. Note that we only show comparisons to Cufflinks because all other methods that we tested performed considerably worse. See the forthcoming StringTie paper and its Supplement for details, including comparisons to other methods.
- StringTie correctly assembles 32-53% more transcripts than Cufflinks. The figure below shows Venn diagrams representing the transcripts correctly identified by StringTie, by Cufflinks, or by both. Note that this figure only counts transcripts that precisely match known genes and are presumably correct.
- Transcripts with a larger number of exons are more likely to be assembled by StringTie than by Cufflinks, as shown by these box-and-whisker plots of the number of exons in transcripts identified by StringTie but not Cufflinks, or by Cufflinks but not StringTie.
- StringTie assembles a larger number of correct isoforms per gene locus.
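The counts behind a Venn diagram like the one described above can be sketched with plain set operations. This is only an illustration, not the evaluation code used for the comparison: it assumes that "precisely match" means identical chromosome, strand, and exon coordinates (real evaluations often compare intron chains instead), and the names `transcript_key` and `compare_assemblies` as well as the toy transcripts are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Exon = Tuple[int, int]
TranscriptKey = Tuple[str, str, Tuple[Exon, ...]]


def transcript_key(chrom: str, strand: str, exons: Iterable[Exon]) -> TranscriptKey:
    """Build a hashable key; two transcripts match only if every exon is identical."""
    return (chrom, strand, tuple(sorted(exons)))


def compare_assemblies(
    reference: Set[TranscriptKey],
    assembly_a: Set[TranscriptKey],
    assembly_b: Set[TranscriptKey],
) -> Dict[str, float]:
    """Count correctly assembled transcripts found by A only, B only, or both."""
    correct_a = assembly_a & reference  # A's transcripts that match the annotation
    correct_b = assembly_b & reference  # B's transcripts that match the annotation
    if correct_b:
        percent_more = 100.0 * (len(correct_a) - len(correct_b)) / len(correct_b)
    else:
        percent_more = float("nan")  # undefined when B has no correct transcripts
    return {
        "a_only": len(correct_a - correct_b),
        "both": len(correct_a & correct_b),
        "b_only": len(correct_b - correct_a),
        "percent_more_in_a": percent_more,
    }


# Toy data (hypothetical coordinates, for illustration only).
t1 = transcript_key("chr1", "+", [(100, 200), (300, 400)])
t2 = transcript_key("chr1", "+", [(100, 200), (500, 600)])
t3 = transcript_key("chr2", "-", [(50, 150), (250, 350), (450, 550)])
t4 = transcript_key("chr2", "-", [(50, 150), (450, 550)])
novel_a = transcript_key("chr3", "+", [(10, 20), (30, 40)])
novel_b = transcript_key("chr3", "+", [(10, 20), (35, 40)])

reference = {t1, t2, t3, t4}
assembly_a = {t1, t2, t3, novel_a}
assembly_b = {t1, t2, novel_b}

result = compare_assemblies(reference, assembly_a, assembly_b)
print(result)
```

Transcripts absent from the annotation (such as `novel_a` and `novel_b` here) are ignored in every count, which mirrors the note that only transcripts matching known genes are counted.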
How many reads should I sequence?
Fig. 1. The plot shows an increase in variation (%TPM) associated with a lower number of reads aligned to the reference. The median (dark line), interquartile range (dark-red area), and whiskers (light-red area) are shown in the figure. High and low whiskers were computed as Q3 + 1.5 × IQR and Q1 - 1.5 × IQR, respectively, where IQR is the interquartile range and Q3 and Q1 are the 3rd and 1st quartiles.
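As a rough illustration of the summary statistics named in the caption, the sketch below computes the median, quartiles, IQR, and whiskers for one set of values. It is not the code from the pipeline: the function name `box_stats` and the sample values are hypothetical, the whisker factor of 1.5 is the conventional box-plot choice and is an assumption here, and the quartile interpolation method may differ from the one used to draw the figure.

```python
import statistics
from typing import Dict, Sequence


def box_stats(values: Sequence[float], whisker_factor: float = 1.5) -> Dict[str, float]:
    """Summarise values with median, quartiles, IQR, and whiskers.

    whisker_factor=1.5 is the conventional choice and an assumption here.
    """
    # "inclusive" treats the data as the full population, including min and max.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "low_whisker": q1 - whisker_factor * iqr,
        "high_whisker": q3 + whisker_factor * iqr,
    }


# Hypothetical %TPM variation values for one sequencing depth.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
stats = box_stats(sample)
print(stats)
```

Repeating this for each subsampled read depth would give one median, one IQR band, and one pair of whiskers per depth, which is the kind of summary the figure plots.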
Additional statistical evaluations (rank-order correlations, precision, and per-gene analysis) have been computed and plotted to give a better understanding of the data set and of the changes associated with lower sequencing depth. They are available in the following GitHub repository, which contains a Python application developed to automate the selection of alignments, assembly, statistical analysis, and plotting. The README file contains detailed instructions about the pipeline and its output. All plots generated in the experiment are also available in the figures directory of the repository.