Overview

StringTie is a fast and highly efficient assembler of RNA-Seq alignments into potential transcripts. It uses a novel network flow algorithm as well as an optional de novo assembly step to assemble and quantitate full-length transcripts representing multiple splice variants for each gene locus. Its input can include not only the alignments of raw reads used by other transcript assemblers, but also alignments of longer sequences that have been assembled from those reads. To identify differentially expressed genes between experiments, StringTie's output can be processed by specialized software such as Ballgown, Cuffdiff, or other programs (DESeq2, edgeR, etc.).
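
As a rough illustration of the basic workflow described above, the sketch below wraps a single assembly run in a shell function. The function name assemble_sample and all file names are hypothetical; -G (reference annotation), -o (output GTF) and -p (number of threads) are StringTie options, and the input BAM file is assumed to be sorted by coordinate.

```shell
# Hypothetical helper: assemble transcripts for one coordinate-sorted BAM file.
# $1 = input BAM, $2 = reference annotation (GTF/GFF), $3 = output GTF
assemble_sample() {
    bam=$1
    annotation=$2
    out_gtf=$3
    # -G guides the assembly with known transcripts, -o names the output GTF,
    # -p sets the number of threads (4 is an arbitrary choice for this sketch)
    stringtie -p 4 -G "$annotation" -o "$out_gtf" "$bam"
}
```

A call might look like: assemble_sample sample.sorted.bam annotation.gtf sample.gtf (all three file names are placeholders). The resulting GTF is the file that downstream tools would then consume.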

News

  • 8/11/2016 - The HISAT, StringTie and Ballgown protocol paper is published in Nature Protocols 11, 1650-1667 (2016) | doi:10.1038/nprot.2016.095
    • A new version of StringTie - v1.3.0 - with a new improved expression level estimation algorithm will be released in the next couple of weeks.
  • 7/25/2016 - v1.2.4 release
    This release provides a fix for a stability issue reported for v1.2.3, addresses a couple of GTF/GFF parsing problems, and implements some minor corrections in the output files.

  • 7/21/2016 - using DESeq2 and edgeR with StringTie's output
    Thanks to a productive summer internship of high school student David Miller, there is now a script available for converting the output of StringTie into read count tables (at gene and transcript levels), which can then be imported into packages like DESeq2 or edgeR in order to estimate differential expression for genes and transcripts. Please check the new manual section for this DESeq2 and edgeR integration.
  • 5/12/2016 - v1.2.3 release
    • StringTie now accepts SAM records (or BAM, with the new --bam option) streamed directly at stdin (when the input file is given as -), which allows easier integration with on-the-fly alignment filtering and processing tools generating SAM/BAM output; however such input must still be sorted by coordinate. This feature allows uses such as:
      • running StringTie only on a specific subset of alignments (a region) from an indexed BAM file by piping samtools view output into StringTie
      • running StringTie on alignments stored in highly compressed formats like CRAM by converting them on the fly to SAM/BAM records
    • StringTie now makes sure that the thread stack size is set to at least 8MB, thus preventing a stack overflow crash on systems configured with lower default stack size (e.g. Mac OS X systems).
    • fixed a bit vector allocation issue causing a crash on some data sets.
  • 2/18/2016 - v1.2.2 release
    Note: this version had a minor update and the download packages were rebuilt on 2/19/2016 after fixing a typo which slightly changed the output for the --merge mode when the -p option was used.
    • for -e ("estimate only") option, all the reference transcripts are now reported in the output GTF (with 0.0 coverage/FPKM/TPM if they were not found to be covered by any reads).
    • similarly, for the -A option, all the input reference genes are now reported in the gene abundance file (with 0.0 Coverage/FPKM/TPM for those not expressed in the sample).
    • minor fix in the estimation algorithm for partially covered transcripts.
    • slight adjustment of the columns reported in the -A gene abundance file (see header).
    • optimized memory usage and improved results for --merge mode in some cases with many input GTF files provided (hundreds).
    • --merge mode has more aggressive filters set for transcripts that are unlikely to be real such as low covered single exon intronic transcripts.
    • a new option -i is now available in --merge mode to allow keeping transcripts with retained introns (by default these transcripts are only reported when there is strong evidence - high expression coverage - to support them).
  • 1/14/2016 - v1.2.1 release
    • fixed a gene and transcript numbering issue that affected the output in some cases.
  • 1/3/2016 - v1.2.0 release
    • new feature: a "transcripts merge" usage mode of StringTie which is triggered by the new --merge option; with this option StringTie expects as input a list of GTF files and merges/assembles all these transcripts into a non-redundant set of transcripts. This performs a function similar to the CuffMerge script in the Cufflinks/Tuxedo suite -- please see the updated protocol in the manual.
    • an improved "estimation only" usage mode (-e option); in this mode only, StringTie now assigns reads to partially covered reference transcripts as well, therefore relaxing the previous requirement that reference transcripts need to be covered end-to-end in order to be found as expressed.
    • improvements in dealing with abnormally large bundles by filtering of likely spurious spliced alignments produced by some aligners (e.g. STAR); note that HISAT2, used with the --dta option, is now the recommended aligner to use for StringTie.
  • 11/9/2015 - v1.1.2 release
    • fixed a bug for the case when a reference transcript had a one bp overlap with a bundle of reads.
  • 10/25/2015 - v1.1.1 release
    • fixed a junction management issue which could have caused StringTie to crash in rare cases of very low coverage.
  • 10/20/2015 - minor update of v1.1.0
    • new option --version simply returns the version string at stdout
    • options --version, --help and -h send their output to stdout and exit with a 0 code.
  • 10/19/2015 - v1.1.0 release
    This StringTie release includes the following updates:
    • major memory usage improvements due to changes of internal data structures, collapsing of reads aligned in the same place, and filtering of spurious spliced alignments within large bundles -- most RNA-seq data samples now use much less than 1 GB of memory.
    • TPM is now also reported for transcripts and genes (besides FPKM)
    • new -A option provides gene abundance estimates in a separate file
    • -s option which sets the coverage saturation is deprecated starting at this version
    • modifying the tag HI in the BAM alignment file to start at 0 (as previously required with STAR produced alignment files) is no longer needed starting at this version
  • 5/18/2015 - v1.0.4 release
    This StringTie release includes the following updates:
    • fixed a strand assignment issue that was causing loss of strand information and the duplication of some transcript assemblies
    • improved coverage estimation for single exon transcripts partially overlapped by read alignments from a neighboring transcript
    • improved coverage estimation for overlapping transcripts on opposite strands
    • improved strand assignment to read alignments overlapping reference transcripts
    • added the -x option that can be used to instruct StringTie to not perform any transcript assembly on one or more reference sequences that are of no interest for the RNA-Seq analysis (e.g. one could use -x chrM if there is no interest in mitochondrial gene expression)
    • addressed a GFF parsing issue encountered for some annotation files (e.g. TAIR)
  • 4/3/2015 - v1.0.3 release
    This StringTie release includes the following updates:
    • fixed an output problem for the -B/-b option (Ballgown data) which caused e2t.ctab and i2t.ctab files to not provide the complete structure for some transcripts
    • fixed a memory de/allocation issue which could have caused StringTie to crash on some systems
    • now StringTie warns if the given annotation file does not provide any reference transcripts annotated on the genomic sequences on which the reads were mapped (naming convention mismatch)
    • added ref_gene_id and ref_gene_name attributes to the assembled transcripts if the corresponding gene annotation is available in the reference annotation file, for assembled transcripts which fully match a reference transcript
  • 3/11/2015 - v1.0.2 release
    This StringTie release includes the following changes:
    • fixed a linking issue on Ubuntu systems
    • fixed compilation errors with llvm (which also broke compilation on recent OS X versions)
    • removed the repeated warnings about transcript_id missing from Ensembl GTF non-transcript lines
    • now StringTie checks if the given read alignments are sorted by coordinate
  • 2/21/2015 - v1.0.1 release
    This StringTie release mainly provides corrections and improvements for the Ballgown *.ctab files. The following changes were implemented:
    • if the -o option includes a directory path, StringTie will attempt to create all the directories which do not exist in the specified path.
    • the functionality of the -B option has changed now to only be a command line switch (i.e. no argument is expected), which will simply enable the creation of the Ballgown table files in the same directory as the one provided with the -o option.
    • the -b option was added as a variant of -B which simply allows the creation of the Ballgown table files in a different directory.
    • the new -e option can direct StringTie to skip the processing of loci and transcripts which do not overlap any of the reference transcripts provided with -G. This option is recommended when StringTie is used simply for estimating the abundance of the provided reference transcripts and for generating Ballgown table files (-B/-b option).
    • the -S option is being phased out and it is no longer maintained.
  • 2/18/2015 - StringTie paper published
    The StringTie paper is published online in Nature Biotechnology.
  • 1/17/2015 - v1.0.0 release
    This release improves the "guided" assembly procedure, which is called when reference annotation is provided with the -G option. StringTie takes a conservative approach to using gene and transcript annotations: it only predicts the presence of transcripts whose introns are each supported by at least one spliced read alignment. (Some competing methods, by contrast, output all transcripts in the annotation regardless of the supporting read alignments.)
  • 11/25/2014 - v0.99 release
    This release adds the -B option which provides support for differential expression analysis using Ballgown, a system developed by Alyssa Frazee and Jeff Leek. This option instructs StringTie to generate the *.ctab files with coverage data for the provided reference/merged transcripts, which can then be loaded and analyzed with Ballgown. Differential expression can also be done as before, using the Cuffdiff2 system from Cole Trapnell and colleagues.
  • 10/22/2014 - 0.98 release
    This release attempts to address some cases where possibly spurious read alignments might cause excessive memory usage.
    • a new filter is introduced in order to discard very long introns that are not supported by coverage
    • paired reads that are not reachable through the splice graph are now treated as unpaired
    • there is now an upper limit on the number of transfrags to be processed at the same time
  • 10/1/2014 - 0.97 update
    Changed the default value for the -c parameter controlling the minimum reads per base coverage to 2.5.
  • 5/22/2014 - 0.97 release
    This release introduces a parameter (-S) for a more sensitive run of StringTie. Although StringTie was optimized on many simulated and real data sets to achieve the highest possible sensitivity while still maintaining high precision, one might be interested in exploring more transcripts expressed at lower levels in the data. One way to achieve this is to use the -S parameter and/or the -f parameter, which adjusts the minimum expression of the lower expressed transcripts as a fraction of the most abundant transcript at the same locus.
    A higher precision can be obtained by using the parameter -c, which controls the minimum read per base coverage for the assembled transcripts.
  • 4/10/2014 - 0.96 release
    First public release of StringTie.
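
Several features from the release notes above lend themselves to a short command-line sketch: streaming alignments at stdin (v1.2.3), the --merge mode (v1.2.0), and estimation-only runs that also write Ballgown tables (-e with -B). The function names and file names below are hypothetical. The samtools view flags used here (-h to include the header, -T to supply the reference FASTA for CRAM decoding) are standard samtools options, and the way the steps are combined is an assumption for illustration rather than a prescribed protocol.

```shell
# Assemble only the alignments from one region of an indexed, sorted BAM file.
# $1 = BAM, $2 = region (e.g. chr1:1-500000), $3 = output GTF
assemble_region() {
    # samtools view -h emits SAM records with the header;
    # the trailing "-" tells StringTie to read its input from stdin
    samtools view -h "$1" "$2" | stringtie -o "$3" -
}

# Assemble from a CRAM file by converting it to SAM records on the fly.
# $1 = CRAM, $2 = reference FASTA used for the CRAM, $3 = output GTF
assemble_cram() {
    samtools view -h -T "$2" "$1" | stringtie -o "$3" -
}

# Merge per-sample assemblies into one non-redundant transcript set.
# $1 = reference annotation, $2 = merged output GTF, remaining args = input GTFs
merge_assemblies() {
    annotation=$1
    merged=$2
    shift 2
    stringtie --merge -G "$annotation" -o "$merged" "$@"
}

# Estimate abundances for the given transcripts only (-e) and write the
# Ballgown table files (-B) in the same directory as the output GTF.
# $1 = BAM, $2 = transcripts GTF (e.g. the merged set), $3 = output GTF
estimate_sample() {
    stringtie -e -B -G "$2" -o "$3" "$1"
}
```

Streamed input must still be sorted by coordinate, so these helpers assume the BAM or CRAM file was sorted beforehand. According to the v1.2.3 notes, BAM records piped at stdin additionally require the --bam option; the sketch sidesteps this by streaming SAM.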



        Obtaining and installing StringTie

        The current version of StringTie can be downloaded as a precompiled binary or as a source package:


        In order to build and install StringTie from the source package the following steps can be taken:

        1. Unpack the downloaded StringTie source archive in a directory of your choice, e.g.:
             cd ~/src/
             tar xvfz ~/Downloads/stringtie-VER.tar.gz
                      
          A directory called stringtie-VER (where VER is the current numeric version of the program) will be created in the current directory.

        2. Change directory and build the stringtie executable:
             cd stringtie-VER
             make release
          
        3. Alternatively, the source tree can be downloaded from GitHub and built in a similar fashion:
               git clone https://github.com/gpertea/stringtie
               cd stringtie
               make release
               
        4. Optionally, the stringtie executable can be copied to one of the shell's PATH directories for easy access.
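
The optional last step can be sketched as a small shell function that copies the built executable into a directory on the shell's PATH and then checks that it runs from there. The function name install_stringtie and the directories passed to it are hypothetical; the --version option is the one described in the release notes (it prints the version string to stdout and exits with code 0).

```shell
# Hypothetical helper: copy a freshly built stringtie binary into a directory
# that is (or will be) on PATH, then confirm it can be executed from there.
# $1 = build directory containing the stringtie executable
# $2 = destination directory, e.g. a personal bin directory
install_stringtie() {
    build_dir=$1
    dest_dir=$2
    mkdir -p "$dest_dir" || return 1
    cp "$build_dir/stringtie" "$dest_dir/" || return 1
    # Prints the version string; a non-zero exit status here signals a problem
    "$dest_dir/stringtie" --version
}
```

For example: install_stringtie ~/src/stringtie-VER ~/bin, where VER stands for the version in the archive name and ~/bin is assumed to be listed in PATH.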

        For evaluating and further processing the GTF output of StringTie, the utility gffcompare can be downloaded from the GFF utilities page.
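
A minimal sketch of such an evaluation is shown below, assuming gffcompare is already installed and on PATH. The function name and file names are hypothetical; -r (reference annotation) and -o (prefix for the report files) are gffcompare options.

```shell
# Hypothetical helper: compare assembled transcripts with a reference annotation.
# $1 = reference annotation, $2 = output prefix, $3 = GTF produced by StringTie
compare_to_reference() {
    # -r names the reference annotation, -o sets the prefix of the report files
    gffcompare -r "$1" -o "$2" "$3"
}
```

For example: compare_to_reference annotation.gtf cmp sample.gtf, with all three arguments being placeholders.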



        Licensing and contact information

        StringTie is free, open source software released under an Artistic License.
        You can contact us about StringTie at: mpertea jhu edu

        For technical issues, bug reports and code contributions please use StringTie's GitHub repository.



        Publications

        Pertea M, Pertea GM, Antonescu CM, Chang TC, Mendell JT & Salzberg SL. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature Biotechnology 2015, doi:10.1038/nbt.3122

        Pertea M, Kim D, Pertea GM, Leek JT & Salzberg SL. Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nature Protocols 11, 1650-1667 (2016), doi:10.1038/nprot.2016.095

