Introduction
(\(\
(-.-)
o('')('')
Emending Alignment of Spliced Transcript Reads.
EASTR is a tool for detecting and eliminating spuriously spliced alignments in RNA-seq datasets. It improves the accuracy of transcriptome assembly by identifying and removing misaligned spliced alignments. The tool can process GTF, BED, and BAM files as input. EASTR can be applied to any RNA-seq dataset regardless of the alignment software used.
Quickstart
Install using conda:
conda install bioconda::eastr
Install using pip:
pip install eastr
Warning
Installing with pip requires you to install bowtie2 >= 2.5.2 and samtools >= 1.19.
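After a pip install, it can help to confirm that the external tools on your PATH meet the minimum versions stated above. The following is a minimal sketch and not part of EASTR. The helper names (parse_version, meets_minimum, tool_version) are hypothetical, and the sketch assumes that both `bowtie2 --version` and `samtools --version` print the version number on the first line of their output.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple

# Minimum versions taken from the warning above.
MINIMUMS = {"bowtie2": (2, 5, 2), "samtools": (1, 19)}


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Extract a dotted version number from the first line of a version banner."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    # Assumed banner shapes: "samtools 1.19" and "... bowtie2-align-s version 2.5.2".
    match = re.search(r"(?:version|samtools)\s+(\d+(?:\.\d+)*)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(found: Tuple[int, ...], minimum: Tuple[int, ...]) -> bool:
    """Compare two versions after padding the shorter one with zeros."""
    width = max(len(found), len(minimum))
    padded_found = found + (0,) * (width - len(found))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum


def tool_version(tool: str) -> Optional[Tuple[int, ...]]:
    """Return the installed version of a tool, or None if it is not on PATH."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, "--version"], capture_output=True, text=True)
    return parse_version(result.stdout)


if __name__ == "__main__":
    for tool, minimum in MINIMUMS.items():
        found = tool_version(tool)
        ok = found is not None and meets_minimum(found, minimum)
        print(f"{tool}: found={found} required>={minimum} ok={ok}")
```

Installing with conda should make this check unnecessary, since the warning applies only to pip installs.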
Usage
The tests directory, which contains test scripts such as run_eastr.sh and run_all.sh, is available in our GitHub repository. These scripts demonstrate two different ways to run the EASTR pipeline: using a bamlist and using a GTF file. Below, we provide detailed instructions for each use case.
Bowtie2 Index Requirement
EASTR requires a Bowtie2 index. If you do not already have a Bowtie2 index for your reference genome, you can generate one using the following command:
bowtie2-build /path/to/reference_genome.fasta /path/to/output/bowtie2_index
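To check whether an index already exists before rebuilding it, one option is to look for the files that bowtie2-build writes next to the index prefix. This is a small sketch, not an EASTR feature; the function name bowtie2_index_exists is hypothetical. It relies on Bowtie2's file naming, where a small index uses the .bt2 extension and a large index uses .bt2l.

```python
from pathlib import Path

# bowtie2-build writes six files per index, named <prefix><suffix>.bt2
# (or .bt2l for large indexes).
INDEX_SUFFIXES = (".1", ".2", ".3", ".4", ".rev.1", ".rev.2")


def bowtie2_index_exists(prefix: str) -> bool:
    """Return True if all six index files exist for the given index prefix."""
    for extension in ("bt2", "bt2l"):
        files = [Path(f"{prefix}{suffix}.{extension}") for suffix in INDEX_SUFFIXES]
        if all(path.is_file() for path in files):
            return True
    return False
```

The prefix passed here is the same value given as the second argument to bowtie2-build and later to --bowtie2_index.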
Species-specific Considerations
- For Homo sapiens and other species with pseudoautosomal regions, ensure that these regions are masked before creating a Bowtie2 index.
- Additionally, exclude alternative scaffolds for Homo sapiens to prevent inflated repeat counts.
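The two considerations above could be applied to a FASTA file before running bowtie2-build with something like the sketch below. It is illustrative only: the function names are hypothetical, it assumes UCSC-style sequence names in which alternative scaffolds end in "_alt", and it expects you to supply the regions to mask as 0-based, half-open intervals for your own assembly (no coordinates are provided here). Each sequence is held in memory, which is simple but not optimized for large genomes.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Interval = Tuple[int, int]


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mask_intervals(seq: str, intervals: Iterable[Interval]) -> str:
    """Replace each 0-based, half-open [start, end) interval with N."""
    for start, end in intervals:
        start, end = max(0, start), min(len(seq), end)
        if start < end:
            seq = seq[:start] + "N" * (end - start) + seq[end:]
    return seq


def prepare_reference(
    lines: Iterable[str],
    mask: Optional[Dict[str, List[Interval]]] = None,
    alt_suffix: str = "_alt",  # assumption: UCSC-style alternative scaffold names
) -> Iterator[Tuple[str, str]]:
    """Drop alternative scaffolds and mask the requested regions."""
    mask = mask or {}
    for header, seq in read_fasta(lines):
        name = header.split()[0]
        if name.endswith(alt_suffix):
            continue  # skip alternative scaffolds entirely
        yield header, mask_intervals(seq, mask.get(name, []))


def write_fasta(records: Iterable[Tuple[str, str]], path: str, width: int = 60) -> None:
    """Write records to a FASTA file with fixed-width sequence lines."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n")
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")
```

The resulting FASTA would then be passed to bowtie2-build in place of the original reference.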
Running EASTR on a bamlist
- Ensure you are in the appropriate directory containing the BAM/original folder and reference files.
- Create a list of BAM files (make sure the list contains the full paths to the BAM files):
ls /path/to/BAM/original/*.bam > bamlist.txt
- Run the EASTR pipeline on the bamlist with the following command:
eastr \
    --bam bamlist.txt \
    --reference /path/to/reference_fasta \
    --bowtie2_index /path/to/bowtie2_index \
    --out_filtered_bam /path/to/output/BAM/filtered \
    --out_original_junctions /path/to/output/original_junctions \
    --out_removed_junctions /path/to/output/removed_junctions \
    --removed_alignments_bam \
    --verbose \
    -p 12
Only --bam, --reference, and --bowtie2_index are required; all other options shown are optional.
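Because the bamlist must contain full paths, it can be convenient to generate it programmatically rather than with ls on a relative path. The sketch below is one way to do that; write_bamlist is a hypothetical helper, not part of EASTR.

```python
from pathlib import Path
from typing import List


def write_bamlist(bam_dir: str, out_path: str = "bamlist.txt") -> List[str]:
    """Write the absolute path of every *.bam file in bam_dir, one per line."""
    bams = sorted(str(path.resolve()) for path in Path(bam_dir).glob("*.bam"))
    if not bams:
        raise FileNotFoundError(f"no BAM files found in {bam_dir}")
    Path(out_path).write_text("\n".join(bams) + "\n")
    return bams
```

For example, write_bamlist("BAM/original") would write bamlist.txt in the current directory, which can then be passed to --bam.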
Running EASTR on a GTF
Run the EASTR pipeline on the GTF file with the following command:
eastr \
    --gtf /path/to/gtf_file \
    --reference /path/to/reference_fasta \
    --bowtie2_index /path/to/bowtie2_index \
    --out_removed_junctions /path/to/output/outfile.bed # optional
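When EASTR is called from a script, the command can be assembled as an argument list. The following is a hedged sketch: build_eastr_command is a hypothetical helper, and it uses only options that appear in the command-line help (--gtf, --bed, --bam, --reference, --bowtie2_index, --out_removed_junctions, -p). It enforces the rule from the usage line that exactly one of the three input types is given.

```python
import subprocess
from typing import List, Optional


def build_eastr_command(
    reference: str,
    bowtie2_index: str,
    gtf: Optional[str] = None,
    bed: Optional[str] = None,
    bam: Optional[str] = None,
    out_removed_junctions: Optional[str] = None,
    processes: int = 1,
) -> List[str]:
    """Assemble an eastr argument list with exactly one input type."""
    inputs = {"--gtf": gtf, "--bed": bed, "--bam": bam}
    chosen = [(flag, value) for flag, value in inputs.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("provide exactly one of gtf, bed, or bam")
    flag, value = chosen[0]
    cmd = ["eastr", flag, value,
           "--reference", reference,
           "--bowtie2_index", bowtie2_index]
    if out_removed_junctions is not None:
        cmd += ["--out_removed_junctions", out_removed_junctions]
    cmd += ["-p", str(processes)]
    return cmd


def run_eastr(cmd: List[str]) -> None:
    """Run the assembled command and raise if eastr exits with an error."""
    subprocess.run(cmd, check=True)
```

Building the list separately from running it makes the command easy to log or inspect before execution.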
Analyzing an Example Dataset
We have included a script that demonstrates the application of the EASTR pipeline to an Arabidopsis dataset featured in our study. The sra_list_arabidopsis.txt file, located in the tests directory on our GitHub repository, lists the accession IDs of the samples analyzed.
Required Software
To successfully run the example analysis, ensure that the following software is installed:
These tools are necessary for:
- Downloading FASTQ files using the get_fastq.sh script.
- Converting the GFF reference annotation to GTF using the get_ref.sh script.
The EASTR pipeline takes BAM files as input. The run_all.sh script acquires FASTQ files, the FASTA reference and annotation, and then aligns the FASTQ files using HISAT2 to generate BAM files. These BAM files are subsequently used as input to EASTR. Additionally, EASTR can accept a GTF annotation file and output a BED file containing questionable junctions (executed in the last command of the run_eastr.sh script).
To execute the entire EASTR pipeline, which filters BAM files and identifies reference annotation errors, use the run_all.sh script found in the tests directory. This script ensures all necessary steps and subscripts are carried out in the correct order. To analyze the example dataset, follow these steps:
- Navigate to the tests directory within the EASTR package.
- Make sure all scripts are executable (chmod +x *sh).
- Run the run_all.sh script. The script will download the necessary FASTQ files and reference genome, and then perform the alignment and EASTR analysis. The output files will be generated in their respective directories within the tests folder.
When executed on 4 CPUs, the EASTR command to filter 6 BAM files completes in approximately 35 minutes, with the bulk of this time spent filtering the BAM files (a single BAM file typically takes 15 to 20 minutes to filter on one CPU). On 1 CPU, the EASTR command to identify questionable introns in an annotation takes about 30 seconds.
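These timings are roughly consistent with a simple back-of-the-envelope estimate. The sketch below assumes, as a simplification not stated in this documentation, that each BAM file is filtered by a single process and that files are handled in waves of one file per CPU. The function name estimate_filter_minutes is hypothetical.

```python
import math
from typing import Tuple


def estimate_filter_minutes(
    n_bams: int,
    cpus: int,
    per_bam_minutes: Tuple[int, int] = (15, 20),  # per-file range quoted above
) -> Tuple[int, int]:
    """Estimate a (low, high) wall-clock range in minutes for BAM filtering."""
    waves = math.ceil(n_bams / cpus)  # assumption: one BAM per CPU at a time
    low, high = per_bam_minutes
    return waves * low, waves * high


print(estimate_filter_minutes(6, 4))  # (30, 40)
```

With 6 files on 4 CPUs there are two waves, giving 30 to 40 minutes, which brackets the reported figure of about 35 minutes.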
Command line options
usage: EASTR [-h] (--gtf GTF | --bed BED | --bam BAM) -r REFERENCE -i
BOWTIE2_INDEX [--bt2_k BT2_K] [-o O]
[--min_duplicate_exon_length MIN_DUPLICATE_EXON_LENGTH] [-a A]
[--min_junc_score MIN_JUNC_SCORE] [--trusted_bed TRUSTED_BED]
[--verbose] [--removed_alignments_bam] [-A A] [-B B] [-O O O]
[-E E E] [-k K] [--scoreN SCOREN] [-w W] [-m M]
[--out_original_junctions OUT] [--out_removed_junctions OUT]
[--out_filtered_bam OUT] [--filtered_bam_suffix STR] [-p P]
EASTR: Emending alignments of spuriously spliced transcript reads. The script
takes GTF, BED, or BAM files as input and processes them using the provided
reference genome and BowTie2 index. It identifies spurious junctions and
filters the input data accordingly.
options:
-h, --help show this help message and exit
--gtf GTF Input GTF file containing transcript annotations
--bed BED Input BED file with intron coordinates
--bam BAM Input BAM file or a TXT file containing a list of BAM
files with read alignments
-r REFERENCE, --reference REFERENCE
reference FASTA genome used in alignment
-i BOWTIE2_INDEX, --bowtie2_index BOWTIE2_INDEX
Path to Bowtie2 index for the reference genome
--bt2_k BT2_K Minimum number of distinct alignments found by bowtie2
for a junction to be considered spurious. Default: 10
-o O Length of the overhang on either side of the splice
junction. Default = 50
--min_duplicate_exon_length MIN_DUPLICATE_EXON_LENGTH
Minimum length of the duplicated exon. Default = 27
-a A Minimum required anchor length in each of the two
exons, default = 7
--min_junc_score MIN_JUNC_SCORE
Minimum number of supporting spliced reads required
per junction. Junctions with fewer supporting reads in
all samples are filtered out if the flanking regions
are similar (based on mappy scoring matrix). Default:
1
--trusted_bed TRUSTED_BED
Path to a BED file path with trusted junctions, which
will not be removed by EASTR.
--verbose Display additional information during BAM filtering,
including the count of total spliced alignments and
removed alignments
--removed_alignments_bam
Write removed alignments to a BAM file
-p P Number of parallel processes, default=1
Minimap2 parameters:
-A A Matching score. Default = 3
-B B Mismatching penalty. Default = 4
-O O O Gap open penalty. Default = [12, 32]
-E E E Gap extension penalty. A gap of length k costs
min(O1+k*E1, O2+k*E2). Default = [2, 1]
-k K K-mer length for alignment. Default=3
--scoreN SCOREN Score of a mismatch involving ambiguous bases.
Default=1
-w W Minimizer window size. Default=2
-m M Discard chains with chaining score. Default=25.
Output:
--out_original_junctions OUT
Write original junctions to the OUT file or directory
--out_removed_junctions OUT
Write removed junctions to OUT file or directory; the
default output is to terminal
--out_filtered_bam OUT
Write filtered bams to OUT file or directory
--filtered_bam_suffix STR
Suffix added to the name of the output BAM files.
Default='_EASTR_filtered'
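The two-piece gap cost described for the -O and -E options, min(O1+k*E1, O2+k*E2), can be worked through with the default values. The snippet below only illustrates that formula; gap_cost is a hypothetical name and the code is not taken from EASTR.

```python
from typing import Tuple


def gap_cost(
    k: int,
    gap_open: Tuple[int, int] = (12, 32),   # default -O values
    gap_extend: Tuple[int, int] = (2, 1),   # default -E values
) -> int:
    """Cost of a gap of length k under the two-piece affine model."""
    o1, o2 = gap_open
    e1, e2 = gap_extend
    return min(o1 + k * e1, o2 + k * e2)


for k in (1, 5, 20, 30):
    print(k, gap_cost(k))
# 1 14
# 5 22
# 20 52
# 30 62
```

With the defaults, the first term is cheaper for gaps shorter than 20 bases and the second for longer gaps, so long gaps are penalized less steeply per base than short ones.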
Citation
To cite EASTR in publications, please use the following reference:
Shinder I, Hu R, Ji HJ, Chao KH, Pertea M. EASTR: Identifying and eliminating systematic alignment errors in multi-exon genes. Nat Commun. 2023 Nov 9;14(1):7223. doi: 10.1038/s41467-023-43017-4. PMID: 37940654; PMCID: PMC10632439.