CCB » CBCC » StringTie

Running StringTie

Run stringtie from the command line like this:
stringtie <aligned_reads.bam> [options]*

The main input of the program is a BAM file with RNA-Seq read mappings which must be sorted by their genomic location (for example the accepted_hits.bam file produced by TopHat or the output of HISAT2 after sorting and converting it using samtools as explained below).

The following optional parameters can be specified when running stringtie:
-h/--help Prints help message and exits.
-v Turns on verbose mode, printing bundle processing details.
-o [<path/>]<out.gtf> Sets the name of the output GTF file where StringTie will write the assembled transcripts. This can be specified as a full path, in which case directories will be created as needed. By default StringTie writes the GTF at standard output.
-p <int> Specify the number of processing threads (CPUs) to use for transcript assembly. The default is 1.
-G <ref_ann.gff> Use the reference annotation file (in GTF or GFF3 format) to guide the assembly process. The output will include expressed reference transcripts as well as any novel transcripts that are assembled. This option is required by options -B, -b, -e, -C (see below).
--rf Assumes a stranded library fr-firststrand.
--fr Assumes a stranded library fr-secondstrand.
-l <label> Sets <label> as the prefix for the name of the output transcripts. Default: STRG
-f <0.0-1.0> Sets the minimum isoform abundance of the predicted transcripts as a fraction of the most abundant transcript assembled at a given locus. Lower abundance transcripts are often artifacts of incompletely spliced precursors of processed transcripts. Default: 0.1
-m <int> Sets the minimum length allowed for the predicted transcripts. Default: 200
-A <gene_abund.tab> Gene abundances will be reported (tab delimited format) in the output file with the given name.
-C <cov_refs.gtf> StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).
-a <int> Junctions that don't have spliced reads that align across them with at least this amount of bases on both sides are filtered out. Default: 10
-j <float> There should be at least this many spliced reads that align across a junction (i.e. junction coverage). This number can be fractional, since some reads align in more than one place. A read that aligns in n places will contribute 1/n to the junction coverage. Default: 1
-t This parameter disables trimming at the ends of the assembled transcripts. By default StringTie adjusts the predicted transcript's start and/or stop coordinates based on sudden drops in coverage of the assembled transcript.
-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5
-g <int> Minimum locus gap separation value. Reads that are mapped closer than this distance are merged together in the same processing bundle. Default: 50 (bp)
-B This switch enables the output of Ballgown input table files (*.ctab) containing coverage data for the reference transcripts given with the -G option. (See the Ballgown documentation for a description of these files.) With this option StringTie can be used as a direct replacement of the tablemaker program included with the Ballgown distribution.
If the option -o is given as a full path to the output transcript file, StringTie will write the *.ctab files in the same directory as the output GTF.
-b <path> Just like -B this option enables the output of *.ctab files for Ballgown, but these files will be created in the provided directory <path> instead of the directory specified by the -o option. Note: adding the -e option is recommended with the -B/-b options, unless novel transcripts are still wanted in the StringTie GTF output.
-e Limits the processing of read alignments to only estimate and output the assembled transcripts matching the reference transcripts given with the -G option (requires -G, recommended for -B/-b). With this option, read bundles with no reference transcripts will be entirely skipped, which may provide a considerable speed boost when the given set of reference transcripts is limited to a set of target genes, for example.
-M <0.0-1.0> Sets the maximum fraction of muliple-location-mapped reads that are allowed to be present at a given locus. Default: 0.95.
-x <seqid_list> Ignore all read alignments (and thus do not attempt to perform transcript assembly) on the specified reference sequences. Parameter <seqid_list> can be a single reference sequence name (e.g. -x chrM) or a comma-delimited list of sequence names (e.g. -x 'chrM,chrX,chrY'). This can speed up StringTie especially in the case of excluding the mitochondrial genome, whose genes may have very high coverage in some cases, even though they may be of no interest for a particular RNA-Seq analysis. The reference sequence names are case sensitive, they must match identically the names of chromosomes/contigs of the target genome against which the RNA-Seq reads were aligned in the first place.
--merge Transcript merge mode. This is a special usage mode of StringTie, distinct from the assembly usage mode described above. In the merge mode, StringTie takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts. This mode is used in the new differential analysis pipeline to generate a global, unified set of transcripts (isoforms) across multiple RNA-Seq samples.
If the -G option (reference annotation) is provided, StringTie will assemble the transfrags from the input GTF files with the reference transcripts.

The following additional options can be used in this mode:
-G <guide_gff> reference annotation to include in the merging (GTF/GFF3)
-o <out_gtf> output file name for the merged transcripts GTF (default: stdout)
-m <min_len> minimum input transcript length to include in the merge (default: 50)
-c <min_cov> minimum input transcript coverage to include in the merge (default: 0)
-F <min_fpkm> minimum input transcript FPKM to include in the merge (default: 0)
-T <min_tpm> minimum input transcript TPM to include in the merge (default: 0)
-f <min_iso> minimum isoform fraction (default: 0.01)
-i keep merged transcripts with retained introns (default: these are not kept unless there is strong evidence for them)
-l <label> name prefix for output transcripts (default: MSTRG)

Back to top

Input files

StringTie takes as input a binary SAM (BAM) file sorted by reference position. This file contains spliced read alignments and can be produced directly by programs such as TopHat or it can be obtained by converting and sorting the output of HISAT2. We recommend using HISAT2 as it is a fast and accurate alignment program. A text file in SAM format which was produced by HISAT2 must be sorted and converted to BAM format using the samtools program:
  samtools view -Su alns.sam | samtools sort - alns.sorted

The file resulted from the above command (alns.sorted.bam) can be used as input for StringTie.

Every spliced read alignment (i.e. an alignment across at least one junction) in the input SAM file must contain the tag XS to indicate the genomic strand that produced the RNA from which the read was sequenced. Alignments produced by TopHat and HISAT2 (when ran with --dta option) already include this tag, but if you use a different read mapper you should check that this XS tag is included for spliced alignments.

Optionally, a reference annotation file in GTF/GFF3 format can be provided to StringTie. In this case, StringTie will check to see if the reference transcripts are expressed in the RNA-Seq data, and for the ones that are expressed it will compute coverage, TPM and FPKM values. Note that if option -e is not used the reference transcripts need to be fully covered by reads in order to be included in StringTie's output. In that case, other transcripts assembled from the data by StringTie and not present in the reference file will be printed as well.


Back to top

Output files

Main Outputs
  1. Stringtie's main output is a GTF file containing the assembled transcripts
  2. Gene abundances in tab-delimited format
  3. Fully covered transcripts that match the reference annotation, in GTF format
  4. Files (tables) required as input to Ballgown, which uses them to estimate differential expression
  5. In merge mode, a merged GTF file from a set of GTF files

1. StringTie's primary GTF output
The primary output of StringTie is a Gene Transfer Format (GTF) file that contains details of the transcripts that StringTie assembles from RNA-Seq data. GTF is an extension of GFF (Gene Finding Format, also called General Feature Format), and is very similar to GFF2 and GFF3. The field definitions for the 9 columns of GTF output can be found at the Ensembl site here. The following is an example of a transcript assembled by StringTie as shown in a GTF file (scroll right within the box to see the full field contents):
      seqname source      feature     start   end     score   strand  frame attributes
      chrX    StringTie   transcript  281394  303355  1000    +       .     gene_id "ERR188044.1"; transcript_id "ERR188044.1.1"; reference_id "NM_018390"; ref_gene_id "NM_018390"; ref_gene_name "PLCXD1"; cov "101.256691"; FPKM "530.078918"; TPM "705.667908";
      chrX    StringTie   exon        281394  281684  1000    +       .     gene_id "ERR188044.1"; transcript_id "ERR188044.1.1"; exon_number "1"; reference_id "NM_018390"; ref_gene_id "NM_018390"; ref_gene_name "PLCXD1"; cov "116.270836";
      ...
      

Description of each column's values:

  • seqname: Denotes the chromosome, contig, or scaffold for this transcript. Here the assembled transcript is on chromosome X.
  • source: The source of the GTF file. Since this example was produced by StringTie, this column simply shows 'StringTie'.
  • feature: Feature type; e.g., exon, transcript, mRNA, 5'UTR).
  • start: Start position of the feature (exon, transcript, etc), using a 1-based index.
  • end: End position of the feature, using a 1-based index.
  • score: A confidence score for the assembled transcript. Currently this field is not used, and StringTie reports a constant value of 1000 if the transcript has a connection to a read alignment bundle.
  • strand: If the transcript resides on the forward strand, '+'. If the transcript resides on the reverse strand, '-'.
  • frame: Frame or phase of CDS features. StringTie does not use this field and simply records a ".".
  • attributes: A semicolon-separated list of tag-value pairs, providing additional information about each feature. Depending on whether an instance is a transcript or an exon and on whether the transcript matches the reference annotation file provided by the user, the content of the attributes field will differ. The following list describes the possible attributes shown in this column:
    • gene_id: A unique identifier for a single gene and its child transcript and exons based on the alignments' file name.
    • transcript_id: A unique identifier for a single transcript and its child exons based on the alignments' file name.
    • exon_number: A unique identifier for a single exon, starting from 1, within a given transcript.
    • reference_id: The transcript_id in the reference annotation (optional) that the instance matched.
    • ref_gene_id: The gene_id in the reference annotation (optional) that the instance matched.
    • ref_gene_name: The gene_name in the reference annotation (optional) that the instance matched.
    • cov: The average per-base coverage for the transcript or exon.
    • FPKM: Fragments per kilobase of transcript per million read pairs. This is the number of pairs of reads aligning to this feature, normalized by the total number of fragments sequenced (in millions) and the length of the transcript (in kilobases).
    • TPM: Transcripts per million. This is the number of transcripts from this particular gene normalized first by gene length, and then by sequencing depth (in millions) in the sample. A detailed explanation and a comparison of TPM and FPKM can be found here, and TPM was defined by B. Li and C. Dewey here.
2. Gene abundances in tab-delimited format
If StringTie is run with the -A <gene_abund.tab> option, it returns a file containing gene abundances. The tab-delimited gene abundances output file has nine fields; here is an example of a gene abundance file produced by StringTie using reference annotation:
      Gene ID     Gene Name   Reference   Strand  Start   End     Coverage    FPKM        TPM
      NM_000451   SHOX        chrX        +       624344  646823  0.000000    0.000000    0.000000
      NM_006883   SHOX        chrX        +       624344  659411  0.000000    0.000000    0.000000
      ...
      

  • Column 1 / Gene ID: The gene identifier comes from the reference annotation provided with the -G option. If no reference is provided this field is replaced with the name prefix for output transcripts (-l).
  • Column 2 / Gene Name: This field contains the gene name in the reference annotation provided with the -G option. If no reference is provided this field is populated with '-'.
  • Column 3 / Reference: Name of the reference sequence that was used in the alignment of the reads. Equivalent to the 3rd column in the .SAM alignment.
  • Column 4 / Strand: '+' denotes that the gene is on the forward strand, '-' for the reverse strand.
  • Column 5 / Start: Start position of the gene (1-based index).
  • Column 6 / End: End position of the gene (1-based index).
  • Column 7 / Coverage: Per-base coverage of the gene.
  • Column 8 / FPKM: normalized expression level in FPKM units (see previous section).
  • Column 9 / TPM: normalized expression level in RPM units (see previous section).

3. Fully covered transcripts matching the reference annotation transcripts (in GTF format)

If StringTie is run with the -C <cov_refs.gtf> option (requires -G <reference_annotation>), it returns a file with all the transcripts in the reference annotation that are fully covered, end to end, by reads. The output format is a GTF file as described above. Each line of the GTF is corresponds to a gene or transcript in the reference annotation.

4. Ballgown Input Table Files

If StringTie is run with the -B option, it returns a Ballgown input table file, which contains coverage data for all transcripts. The output table files are placed in the same directory as the main GTF output. These tables have these specific names: (1) e2t.ctab, (2) e_data.ctab, (3) i2t.ctab, (4) i_data.ctab, and (5) t_data.ctab. A detailed description of each of these five required inputs to Ballgown can be found on the Ballgown site, at this link.

5. Merge mode: Merged GTF

If StringTie is run with the --merge option, it takes as input a list of GTF/GFF files and merges/assembles these transcripts into a non-redundant set of transcripts. This step creates a uniform set of transcripts for all samples to facilitate the downstream calculation of differentially expressed levels for all transcripts among the different experimental conditions. Output is a merged GTF file with all merged gene models, but without any numeric results on coverage, FPKM, and TPM. Then, with this merged GTF, StringTie can re-estimate abundances by running it again with the -e option on the original set of alignment files, as illustrated in the figure below.

Back to top

Evaluating transcript assemblies

A simple way of getting more information about the transcripts assembled by StringTie (summary of gene and transcript counts, novel vs. known etc.), or even performing basic tracking of assembled isoforms across multiple RNA-Seq experiments, is to use the gffcompare program. Basic usage information and download options for this program can be found on the GFF utilities page.

Back to top

Differential expression analysis

Together with HISAT and Ballgown, StringTie can be used for estimating differential expression across multiple RNA-Seq samples and generating plots and differential expression tables as described in our protocol paper. DE workflow
Our recommended workflow includes the following steps:
  1. for each RNA-Seq sample, map the reads to the genome with HISAT2 using the --dta option. It is highly recommended to use the reference annotation information when mapping the reads, which can be either embedded in the genome index (built with the --ss and --exon options, see HISAT2 manual), or provided separately at run time (using the --known-splicesite-infile option of HISAT2). The SAM output of each HISAT2 run must be sorted and converted to BAM using samtools as explained above.
  2. for each RNA-Seq sample, run StringTie to assemble the read alignments obtained in the previous step; it is recommended to run StringTie with the -G option if the reference annotation is available.
  3. run StringTie with --merge in order to generate a non-redundant set of transcripts observed in all the RNA-Seq samples assembled previously. The stringtie --merge mode takes as input a list of all the assembled transcripts files (in GTF format) previously obtained for each sample, as well as a reference annotation file (-G option) if available.
  4. for each RNA-Seq sample, run StringTie using the -B/-b and -e options in order to estimate transcript abundances and generate read coverage tables for Ballgown. The -e option is not required but recommended for this run in order to produce more accurate abundance estimations of the input transcripts. Each StringTie run in this step will take as input the sorted read alignments (BAM file) obtained in step 1 for the corresponding sample and the -G option with the merged transcripts (GTF file) generated by stringtie --merge in step 3. Please note that this is the only case where the -G option is not used with a reference annotation, but with the global, merged set of transcripts as observed across all samples. (This step is the equivalent of the Tablemaker step described in the original Ballgown pipeline.)
  5. Ballgown can now be used to load the coverage tables generated in the previous step and perform various statistical analyses for differential expression, generate plots etc.

An alternate, faster differential expression analysis workflow can be pursued if there is no interest in novel isoforms (i.e. assembled transcripts present in the samples but missing from the reference annotation), or if only a well known set of transcripts of interest are targeted by the analysis. This simplified protocol has only 3 steps (depicted below) as it bypasses the individual assembly of each RNA-Seq sample and the "transcript merge" step. DE workflow This simplified workflow attempts to directly estimate and analyze the expression of a known set of transcripts as given in the reference annotation file.



Using StringTie with DESeq2 and edgeR


DESeq2 and edgeR are two popular Bioconductor packages for analyzing differential expression, which take as input a matrix of read counts mapped to particular genomic features (e.g., genes). We provide a Python script (prepDE.py) to extract this read count information directly from the files generated by StringTie (run with the -e parameter).

There are two possible ways to provide input: one option is to provide a directory with the same structure as created in the StringTie protocol paper in preparation for Ballgown. By default (no -i option), this is located in the working directory and named ballgown (see below).

    ./ballgown/sample1/sample1.gtf
    ./ballgown/sample2/sample2.gtf
    ./ballgown/sample3/sample3.gtf
  

Alternatively, one can provide a textfile listing sample IDs and their respective paths (sample_lst.txt).

Usage: prepDE.py [options]
generates two CSV files containing the count matrices for genes and transcripts, using the coverage values found in the output of stringtie -e

Options:
-h, --helpshow this help message and exit
-i INPUT, --input=INPUT, --in=INPUTthe parent directory of the sample sub-directories or a .txt file listing sample IDs and the paths to GTF files in tab-delimited format [default: ballgown]
-g Gwhere to output the gene count matrix [default: gene_count_matrix.csv]
-t Twhere to output the transcript count matrix [default: transcript_count_matrix.csv]
-l LENGTH, --length=LENGTHthe average read length [default: 75]
-p PATTERN, --pattern=PATTERNa regular expression that selects the sample subdirectories
-c, --clusterwhether to cluster genes that overlap with different gene IDs, ignoring ones with geneID pattern (see below)
-s STRING, --string=STRINGif a different prefix is used for geneIDs assigned by StringTie [default: MSTRG]
-k KEY, --key=KEYif clustering, what prefix to use for geneIDs assigned by this script [default: prepG]
--legend=LEGENDif clustering, where to output the legend file mapping transcripts to assigned geneIDs [default: legend.csv]

These count matrices (CSV files) can then be imported into R for use by DESeq2 and edgeR (using the DESeqDataSetFromMatrix and DGEList functions, respectively).

Protocol: Using StringTie with DESeq2

Given a list of GTFs, which were re-estimated upon merging, users can follow the below protocol to use DESeq2 for differential expression analysis. To generate GTFs from raw reads follow the StringTie protocol paper (up to the Ballgown step).

As described above, prepDE.py either accepts a .txt (sample_lst.txt) file listing the sample IDs and the GTFs' paths or by default expects a ballgown directory produced by StringTie run with the -B option. The following steps leads us through generating count matrices for genes and transcripts, importing this data into DESeq2, and conducting some basic analysis.

  1. Generate count matrices using prepDE.py
    1. Assuming default ballgown directory structure produced by StringTie
      $ python prepDE.py
    2. List of GTFs sample IDs and paths (sample_lst.txt)
      $ python prepDE.py -i sample_lst.txt
  2. Install R and DESeq2. Upon installing R, install DESeq2 on R:
    > source("https://bioconductor.org/biocLite.R")
    > biocLite("DESeq2")
  3. Import DESeq2 library in R
    > library("DESeq2")
  4. Load gene(/transcript) count matrix and labels
    > countData <- as.matrix(read.csv("gene_count_matrix.csv", row.names="gene_id"))
    > colData <- read.csv(PHENO_DATA, sep="\t", row.names=1)

    Note: The PHENO_DATA file contains information on each sample, e.g., sex or population. The exact way to import this depends on the format of the file.
  5. Check all sample IDs in colData are also in CountData and match their orders
    > all(rownames(colData) %in% colnames(countData))
    [1] TRUE
    > countData <- countData[, rownames(colData)]
    > all(rownames(colData) == colnames(countData))
    [1] TRUE
  6. Create a DESeqDataSet from count matrix and labels
    > dds <- DESeqDataSetFromMatrix(countData = countData,
            colData = colData, design = ~ CHOOSE_FEATURE)
  7. Run the default analysis for DESeq2 and generate results table
    > dds <- DESeq(dds)
    > res <- results(dds)
  8. Sort by adjusted p-value and display
    > (resOrdered <- res[order(res$padj), ])

Back to top

Assembling super-reads

Although StringTie is primarily a genome-guided approach, it can borrow algorithmic techniques from de novo genome assembly to help with transcript assembly. Its input can include not only the spliced read alignments used by reference-based assemblers, but also longer contigs that it assembles de novo from unambiguous, non-branching parts of a transcript. If your RNA-Seq data is paired, you could use our method to reconstruct the RNA-Seq fragments from their end sequences, which we call super-reads. This is an optional step for preparing the alignments for the input of StringTie, but it can improve the accuracy of the transcriptome assembly. To use this step you will first need to install the MaSuRCA genome assembler. The super-read module from MaSuRCA extends every read in both directions as long as this extension is unique. The superreads.pl script that we provide here identifies pairs of reads that belong to the same super-read, and extracts the sequence containing the pair plus the sequence between them; i.e., the entire sequence of the original DNA fragment. Thus for example if the original RNA-seq data comprised paired 100-bp reads from a 300-bp fragment library, these steps will convert many of those pairs into single, 300-bp super-reads.

The usage of the superreads.pl script is documented below.
  Usage: superreads.pl <pair_read1_fastq> <pair_read2_fastq> <masurca_directory> [options]*

Arguments:
The first two arguments of the superreads.pl script are files in the fastq format containing the the sequences of the first and second read in each fragment, respecitvely. They can either plain text fastq files or compressed (with gzip or bzip2) files The third argument represents the directory where the MaSuRCA package was installed on your system.

Options:
-t <num_threads> Sets the number of threads to use. Default: 10.
-j <jf_size> MaSuRCA requires the Jellyfish program to run, and this parameter sets the Jellyfish hash size. Please see the MaSuRCA documentation for more information about how to choose this parameter. Default: 2500000000.
-s <step> As it progresses, the superreads.pl script prints the steps it successfully completed. If for any reason, the assembly process is stopped, you don't need to redo all the successfully completed steps, and you can restart the script at the first step it didn't complete. Default: 1.
-r <paired_read_prefix> Sets the prefix for the paired reads as required by MaSuRCA. Default: pe.
-f <fragment_size> Specifies the mean library insert length. Default: 300.
-d <standard_deviation> Specifies the standard deviation of the library insert length. If the standard deviation is not known, set it to approximately 15% of the mean. Default: 45.
-l <super_reads_file_name> Specifies the name for the assembled super-reads file. Default: LongReads.fq.
-u <not_assembled_reads_prefix> Specifies the prefix for the unassembled reads file names. By default it appends ".notAssembled.fq.gz" to the initial paired files.

The output files of the superreads.pl script (the assembled super-reads file, and the two files containing the unassembled paired reads) can be then aligned to a reference genome with your read mapper of preference. For instance, you can align them with TopHat2 like this:
tophat [options]* <genome_index_base> \
 PE_reads_1.notAssembled.fq.gz,LongReads.fq PE_reads_2.notAssembled.fq.gz



Back to top