Introduction

(\(\
(-.-)
o('')('')

Emending Alignment of Spliced Transcript Reads.

EASTR is a tool for detecting and eliminating spuriously spliced alignments in RNA-seq datasets. Removing these misaligned reads improves the accuracy of transcriptome assembly. EASTR accepts GTF, BED, and BAM files as input and can be applied to any RNA-seq dataset regardless of the alignment software used.

Quickstart

Install using conda:

conda install bioconda::eastr

Install using pip:

pip install eastr

Warning

If you install with pip, you must also install bowtie2 >= 2.5.2 and samtools >= 1.19 yourself.
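
A quick version check can catch an outdated tool before a run. The following is a minimal sketch, not part of EASTR: the helper names version_ge and check_tool are hypothetical, and it assumes a sort that supports -V (version sort) and the usual first-line formats of bowtie2 --version and samtools --version.

```shell
# Hypothetical helpers (not part of EASTR) for checking the minimum versions
# named above: bowtie2 >= 2.5.2 and samtools >= 1.19.

# version_ge A B: succeed if version A >= version B.
# Assumes a sort that supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_tool NAME MIN FOUND: report whether the detected version FOUND is acceptable.
check_tool() {
    name=$1; min=$2; found=$3
    if [ -z "$found" ]; then
        echo "$name: not found or version not detected (need >= $min)"
    elif version_ge "$found" "$min"; then
        echo "$name: $found (ok, >= $min)"
    else
        echo "$name: $found (too old, need >= $min)"
    fi
}

# The parsing assumes first lines such as
# ".../bowtie2-align-s version 2.5.2" and "samtools 1.19".
bt2_version=""
if command -v bowtie2 >/dev/null 2>&1; then
    bt2_version=$(bowtie2 --version 2>/dev/null | sed -n '1s/.* version \([0-9.]*\).*/\1/p')
fi
st_version=""
if command -v samtools >/dev/null 2>&1; then
    st_version=$(samtools --version 2>/dev/null | sed -n '1s/^samtools \([0-9.]*\).*/\1/p')
fi

check_tool bowtie2 2.5.2 "$bt2_version"
check_tool samtools 1.19 "$st_version"
```

A plain string comparison would rank 1.9 above 1.19, which is why the sketch relies on version sort.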

Usage

The tests directory, which contains test scripts such as run_eastr.sh and run_all.sh, is available in our GitHub repository. These scripts demonstrate two ways to run the EASTR pipeline: using a bamlist and using a GTF file. Below, we provide detailed instructions for each use case.

Bowtie2 Index Requirement

EASTR requires a Bowtie2 index. If you do not already have a Bowtie2 index for your reference genome, you can generate one using the following command:

bowtie2-build /path/to/reference_genome.fasta /path/to/output/bowtie2_index
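
Building an index for a large genome can take a while, so it can be worth checking whether one already exists for a given prefix. A minimal sketch, assuming the standard Bowtie2 naming in which the first index file is <prefix>.1.bt2 (or <prefix>.1.bt2l for large indexes); bt2_index_exists is a hypothetical helper, not an EASTR or Bowtie2 command, and the paths are the same placeholders used above:

```shell
# bt2_index_exists PREFIX: succeed if a Bowtie2 index with this prefix exists.
bt2_index_exists() {
    [ -e "$1.1.bt2" ] || [ -e "$1.1.bt2l" ]
}

# Placeholder paths; replace with real ones.
reference=/path/to/reference_genome.fasta
index_prefix=/path/to/output/bowtie2_index

if bt2_index_exists "$index_prefix"; then
    echo "Index found at $index_prefix; skipping bowtie2-build."
elif [ -f "$reference" ] && command -v bowtie2-build >/dev/null 2>&1; then
    bowtie2-build "$reference" "$index_prefix"
else
    echo "No index at $index_prefix; run bowtie2-build once the reference FASTA is in place."
fi
```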

Species-specific Considerations

For Homo sapiens and other species with pseudoautosomal regions, ensure that these regions are masked before creating a Bowtie2 index.
Additionally, exclude alternative scaffolds for Homo sapiens to prevent inflated repeat counts.

Running EASTR on a bamlist

Ensure you are in the appropriate directory containing the BAM/original folder and reference files.
Create a list of BAM files (make sure the list contains the full paths to the BAM files):
```
ls /path/to/BAM/original/*.bam > bamlist.txt
```
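
Because the list must contain full paths, it can help to validate it before launching a long run. A minimal sketch, not part of EASTR: check_bamlist is a hypothetical helper that flags entries that are not absolute paths or that do not exist.

```shell
# check_bamlist FILE: print problems found in a BAM list; succeed if none.
check_bamlist() {
    status=0
    while IFS= read -r bam || [ -n "$bam" ]; do
        [ -z "$bam" ] && continue                 # skip blank lines
        case $bam in
            /*) ;;                                # absolute path: fine
            *)  echo "not a full path: $bam"; status=1 ;;
        esac
        [ -f "$bam" ] || { echo "missing file: $bam"; status=1; }
    done < "$1"
    return $status
}

# Usage (assumes bamlist.txt is in the current directory):
if [ -f bamlist.txt ]; then
    check_bamlist bamlist.txt && echo "bamlist.txt looks fine"
fi
```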

Run the EASTR pipeline on the bamlist with the following command:

eastr \
    --bam bamlist.txt \
    --reference /path/to/reference_fasta \
    --bowtie2_index /path/to/bowtie2_index \
    --out_filtered_bam /path/to/output/BAM/filtered \
    --out_original_junctions /path/to/output/original_junctions \
    --out_removed_junctions /path/to/output/removed_junctions \
    --removed_alignments_bam \
    --verbose \
    -p 12

Only --bam, --reference, and --bowtie2_index are required; the remaining options are optional.
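
Shell line continuations cannot carry trailing comments, so when annotating or toggling the optional flags it can be easier to build the argument list step by step. A sketch using only the options shown above; the function name run_eastr_on_bamlist and the DRY_RUN variable are hypothetical conveniences, not EASTR features, and the paths are placeholders:

```shell
run_eastr_on_bamlist() {
    # Build the argument list one option at a time so each can be commented.
    set -- --bam bamlist.txt
    set -- "$@" --reference /path/to/reference_fasta
    set -- "$@" --bowtie2_index /path/to/bowtie2_index
    set -- "$@" --out_filtered_bam /path/to/output/BAM/filtered               # optional
    set -- "$@" --out_original_junctions /path/to/output/original_junctions   # optional
    set -- "$@" --out_removed_junctions /path/to/output/removed_junctions     # optional
    set -- "$@" --removed_alignments_bam                                      # optional
    set -- "$@" --verbose                                                     # optional
    set -- "$@" -p 12                                                         # optional

    # DRY_RUN (hypothetical; defaults to 1 here) prints the command instead of running it.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "eastr $*"
    else
        eastr "$@"
    fi
}

run_eastr_on_bamlist
```

Commenting out any of the lines marked optional drops that flag without disturbing the rest of the command; setting DRY_RUN=0 runs eastr for real.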

Running EASTR on a GTF

Run the EASTR pipeline on the GTF file with the following command:

eastr \
    --gtf /path/to/gtf_file \
    --reference /path/to/reference_fasta \
    --bowtie2_index /path/to/bowtie2_index \
    --out_removed_junctions /path/to/output/outfile.bed # optional

Analyzing an Example Dataset

We have included a script that demonstrates the application of the EASTR pipeline to an Arabidopsis dataset featured in our study. The sra_list_arabidopsis.txt file, located in the tests directory on our GitHub repository, lists the accession IDs of the samples analyzed.

Required Software

To successfully run the example analysis, ensure that the following software is installed:

These tools are necessary for:

Downloading FASTQ files using the get_fastq.sh script.
Converting the GFF reference annotation to GTF using the get_ref.sh script.

The EASTR pipeline takes BAM files as input. The run_all.sh script acquires FASTQ files, the FASTA reference and annotation, and then aligns the FASTQ files using HISAT2 to generate BAM files. These BAM files are subsequently used as input to EASTR. Additionally, EASTR can accept a GTF annotation file and output a BED file containing questionable junctions (executed in the last command of the run_eastr.sh script).

To execute the entire EASTR pipeline, which filters BAM files and identifies reference annotation errors, use the run_all.sh script found in the tests directory. This script ensures all necessary steps and subscripts are carried out in the correct order. To analyze the example dataset, follow these steps:

Navigate to the tests directory within the EASTR package.
Make sure all scripts are executable (chmod +x *.sh).
Run the run_all.sh script.

The script downloads the necessary FASTQ files and reference genome, then performs the alignment and the EASTR analysis. Output files are written to their respective directories within the tests folder.

When executed on 4 CPUs, the EASTR command to filter 6 BAM files completes in approximately 35 minutes, most of which is spent filtering the BAM files (a single BAM file typically takes 15 to 20 minutes to filter on one CPU). On 1 CPU, the EASTR command to identify questionable introns in an annotation takes about 30 seconds.
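
These figures fit a simple back-of-the-envelope model. It assumes (this is an illustration, not something the documentation states) that each BAM file is filtered by a single process, so that n files on p processes take ceil(n/p) rounds. The helper estimate_minutes is hypothetical:

```shell
# estimate_minutes N_BAMS N_PROCS MIN_PER_BAM: rounds * per-file time,
# where rounds = ceil(N_BAMS / N_PROCS), computed with integer arithmetic.
estimate_minutes() {
    rounds=$(( ($1 + $2 - 1) / $2 ))
    echo $(( rounds * $3 ))
}

echo "lower bound: $(estimate_minutes 6 4 15) min"   # 2 rounds * 15 min
echo "upper bound: $(estimate_minutes 6 4 20) min"   # 2 rounds * 20 min
```

Under that assumption, 6 files on 4 CPUs need two rounds, giving a range of 30 to 40 minutes, which brackets the reported 35 minutes.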

Command line options

usage: EASTR [-h] (--gtf GTF | --bed BED | --bam BAM) -r REFERENCE -i
             BOWTIE2_INDEX [--bt2_k BT2_K] [-o O]
             [--min_duplicate_exon_length MIN_DUPLICATE_EXON_LENGTH] [-a A]
             [--min_junc_score MIN_JUNC_SCORE] [--trusted_bed TRUSTED_BED]
             [--verbose] [--removed_alignments_bam] [-A A] [-B B] [-O O O]
             [-E E E] [-k K] [--scoreN SCOREN] [-w W] [-m M]
             [--out_original_junctions OUT] [--out_removed_junctions OUT]
             [--out_filtered_bam OUT] [--filtered_bam_suffix STR] [-p P]

EASTR: Emending alignments of spuriously spliced transcript reads. The script
takes GTF, BED, or BAM files as input and processes them using the provided
reference genome and BowTie2 index. It identifies spurious junctions and
filters the input data accordingly.

options:
  -h, --help            show this help message and exit
  --gtf GTF             Input GTF file containing transcript annotations
  --bed BED             Input BED file with intron coordinates
  --bam BAM             Input BAM file or a TXT file containing a list of BAM
                        files with read alignments
  -r REFERENCE, --reference REFERENCE
                        reference FASTA genome used in alignment
  -i BOWTIE2_INDEX, --bowtie2_index BOWTIE2_INDEX
                        Path to Bowtie2 index for the reference genome
  --bt2_k BT2_K         Minimum number of distinct alignments found by bowtie2
                        for a junction to be considered spurious. Default: 10
  -o O                  Length of the overhang on either side of the splice
                        junction. Default = 50
  --min_duplicate_exon_length MIN_DUPLICATE_EXON_LENGTH
                        Minimum length of the duplicated exon. Default = 27
  -a A                  Minimum required anchor length in each of the two
                        exons, default = 7
  --min_junc_score MIN_JUNC_SCORE
                        Minimum number of supporting spliced reads required
                        per junction. Junctions with fewer supporting reads in
                        all samples are filtered out if the flanking regions
                        are similar (based on mappy scoring matrix). Default:
                        1
  --trusted_bed TRUSTED_BED
                        Path to a BED file path with trusted junctions, which
                        will not be removed by EASTR.
  --verbose             Display additional information during BAM filtering,
                        including the count of total spliced alignments and
                        removed alignments
  --removed_alignments_bam
                        Write removed alignments to a BAM file
  -p P                  Number of parallel processes, default=1

Minimap2 parameters:
  -A A                  Matching score. Default = 3
  -B B                  Mismatching penalty. Default = 4
  -O O O                Gap open penalty. Default = [12, 32]
  -E E E                Gap extension penalty. A gap of length k costs
                        min(O1+k*E1, O2+k*E2). Default = [2, 1]
  -k K                  K-mer length for alignment. Default=3
  --scoreN SCOREN       Score of a mismatch involving ambiguous bases.
                        Default=1
  -w W                  Minimizer window size. Default=2
  -m M                  Discard chains with chaining score. Default=25.

Output:
  --out_original_junctions OUT
                        Write original junctions to the OUT file or directory
  --out_removed_junctions OUT
                        Write removed junctions to OUT file or directory; the
                        default output is to terminal
  --out_filtered_bam OUT
                        Write filtered bams to OUT file or directory
  --filtered_bam_suffix STR
                        Suffix added to the name of the output BAM files.
                        Default='_EASTR_filtered'
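
The gap cost formula given for -O and -E, min(O1+k*E1, O2+k*E2), can be worked through with the listed defaults (O = [12, 32], E = [2, 1]). A small sketch; gap_cost is a hypothetical helper written for illustration and is not part of EASTR:

```shell
# gap_cost K [O1 O2 E1 E2]: cost of a gap of length K under the two-piece
# model min(O1 + K*E1, O2 + K*E2). Defaults follow the help text above.
gap_cost() {
    k=$1; o1=${2:-12}; o2=${3:-32}; e1=${4:-2}; e2=${5:-1}
    short=$(( o1 + k * e1 ))
    long=$(( o2 + k * e2 ))
    if [ "$short" -le "$long" ]; then echo "$short"; else echo "$long"; fi
}

for k in 1 5 20 30; do
    echo "gap length $k -> cost $(gap_cost "$k")"
done
```

With the defaults, the first term is cheaper for gaps shorter than 20 bases and the second for longer ones (the two are equal at k = 20), so each additional base in a long gap costs less than in a short one.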

Citation

To cite EASTR in publications, please use the following reference:

Shinder I, Hu R, Ji HJ, Chao KH, Pertea M. EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes. Nat Commun. 2023 Nov 9;14(1):7223. doi: 10.1038/s41467-023-43017-4. PMID: 37940654; PMCID: PMC10632439.