EDGE-pro (Estimated Degree of Gene Expression for Prokaryotes) quantifies the expression level of each gene in a bacterial organism.

INSTALLATION
------------
Download the package and untar it:

    tar -vxf EDGE.tar

If the counts.cpp code is changed, use the Makefile to compile the new code.

The Bowtie2 binaries included in this package are built for linux-x86_64 machines. If you have a different platform, please download the correct version of Bowtie2 from the Bowtie2 sourceforge website
http://sourceforge.net/projects/bowtie-bio/files/bowtie2/
or build the binaries from the source code, which is also available at the Bowtie2 sourceforge website. You may also want to download a newer version of Bowtie2, as it is under constant development.

USAGE:
------
[OMP_NUM_THREADS=n] PATH/edge.pl <-g genome> <-p ptt> <-r rnt> <-u reads> [options]

COMBINE MULTIPLE FILES:
-----------------------
Since some genomes contain multiple chromosomes and/or plasmids, the corresponding files for each chromosome/plasmid must be combined before inputting them into EDGE-pro:

Combine all genome files using
    cat *.fasta > genome.fasta
Combine all ptt files using
    cat *.ptt > genome.ptt
Combine all rnt files using
    cat *.rnt > genome.rnt

WARNING: All 3 types of files must have the same order of chromosomes/plasmids (e.g. if chr1 is before chr2 in the genome.fasta file, then chr1 must be before chr2 in the ptt and rnt files as well). If there is no ptt or rnt file for one of the chromosomes/plasmids, place this chromosome/plasmid at the end of the file.

MULTIPLE READS FILES:
---------------------
If multiple fastq files of reads exist for the same sample, the fastq files may be combined using
    cat reads1 reads2 > reads
if the reads in both files are of the same length.
Otherwise, please run the pipeline separately on each file of reads, and combine the counts from the separate runs at the end using
    ./addColumn.perl 2 2
file1 is the counts file obtained from the first run of the pipeline (it is of the form prefix.counts_x, where prefix is the PREFIX value when ./run.perl is run, and x is the order; see below for the explanation of the output files). If the reads are paired-end reads, do not combine first and second mates.

RUNNING EDGE-pro:
-----------------
The ./edge.pl script must be in the working directory, or it must be called as PATH/edge.pl, where PATH is replaced by the absolute or relative path to the directory where run.perl is located. This is necessary because the path given by -s (see optional files/parameters) only refers to scripts other than edge.pl.

MANDATORY FILES:
----------------
-g genome: fasta file containing the bacterial genome. If multiple chromosomes/plasmids exist, they must be combined into one file before running EDGE-pro (see "combine multiple files" above).

-p ptt: ptt file with coordinates of coding genes, in Genbank format. If multiple chromosomes/plasmids exist, see "combine multiple files" above.

-r rnt: rnt file with coordinates of rRNAs and tRNAs, in Genbank format. If multiple chromosomes/plasmids exist, see "combine multiple files" above.

-u reads: fastq file of reads. If multiple fastq files exist, see "multiple reads files" above.

OPTIONAL FILES/PARAMETERS:
--------------------------
-v reads2: fastq file of mates in paired-end data. If this file is not entered, single-end reads are assumed.

-m min: min is an integer value. It is the minimum insert size in the paired-end library. Default: 0.

-M max: max is an integer value. It is the maximum insert size in the paired-end library. Default: 500.

-t threads: threads is an integer value. It is the number of threads to be used by Bowtie2. Default: 1.

OMP_NUM_THREADS is an optional integer environment variable that specifies the number of threads to be used to count per-base coverage.
Note that it is entered before the command ./edge.pl. Default: 16.

-s source_dir: It is a string specifying the absolute or relative path to the directory that contains all scripts. Default: working directory.

-o prefix: It is a string specifying the prefix of all output files. Default: out.

-w window: It is an integer specifying the size of the window next to an overlapping region. The coverage inside this window approximates the coverage of each gene near the overlap, and is used to distribute the coverage of the overlapping region between the two overlapping genes. Default: 100.

-i untranslated region: It is an integer specifying the window size of the untranslated region between the initial transcription site and the start codon. Default: 40.

-x similarity: It is a decimal number specifying the percentage used to determine when two coverage values are considered similar. For example, if the similarity is x, and the coverage of a region is C, then another region is considered similarly expressed if its coverage is in the interval [(1-x)*C,(1+x)*C]. Default: 0.15.

-l read length: It is an integer specifying the read length. If the read length is not specified, the first 1000 reads are used to approximate the read length.

-c min coverage: It is an integer specifying the minimum average coverage of a gene for the gene to be considered expressed. Coverage less than the specified value is assumed to be noise, and the gene is considered not to be expressed. Default: 3.

-n count type: It is 0 or 1, specifying how to count reads that map to multiple places. 0 denotes giving a partial count to each place where the read maps. 1 denotes randomly picking one of the places where the read maps and assigning the full count to that one place. Default: 0.
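
As an illustration of the -x similarity test described above, here is a minimal shell sketch of checking whether a coverage value falls in the interval [(1-x)*C,(1+x)*C]. It is not taken from the EDGE-pro source code; the function name is_similar and its argument order are hypothetical and chosen only for this example.

```shell
# Hypothetical helper, not part of EDGE-pro: succeeds (exit status 0) when
# coverage OTHER lies in [(1-X)*C, (1+X)*C], the interval given for -x above.
# Usage: is_similar C OTHER X
is_similar() {
    awk -v c="$1" -v o="$2" -v x="$3" \
        'BEGIN { exit !(o >= (1 - x) * c && o <= (1 + x) * c) }'
}

# With the default similarity of 0.15 and C = 100, the interval is about [85, 115].
if is_similar 100 110 0.15; then echo "110: similar"; else echo "110: not similar"; fi
if is_similar 100 120 0.15; then echo "120: similar"; else echo "120: not similar"; fi
```

A larger -x value widens the interval, so more pairs of regions would be treated as similarly expressed.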

OUTPUT FILES:
-------------
All output files are named prefix.*_x, where prefix is specified by the parameter -o, x is a number that specifies the "order number" of the chromosome/plasmid to which the output corresponds (the order is assigned based on the order in which chromosomes/plasmids are listed in the reference genome's fasta file, starting with 0), and the * is replaced by one of the following:

rpkm: the most important file. It contains, for each gene: gene name, start and end coordinates of the gene, average coverage for the gene, number of reads mapping to the gene, and the RPKM value of the gene.

counts: It contains the position in the genome and the number of reads mapped to that position (i.e. coverage for each base).

uniqueCounts: Same as the 'counts' file, but the numbers are based only on the uniquely mapped reads.

multipleCounts: Same as the 'counts' file, but the numbers are based only on the reads mapped to multiple places.

alignments: Bowtie2 output file.

numberReads: Total number of mapped reads (i.e., numberUniqueReads+numberMultiReads).

numberUniqueReads: number of uniquely mapped reads.

numberMultiReads: number of reads mapped to multiple places.

rRNA.numberReads: number of reads mapped to rRNAs.

rpkm.numberReads: number of reads used in calculating RPKM values (i.e., numberReads-rRNA.numberReads).

DIFFERENTIAL EXPRESSION:
------------------------
EDGE-pro does not calculate differential expression between multiple samples. However, the EDGE-pro package provides a script to convert EDGE-pro output to the format used by the DESeq software, a stand-alone tool for calculating differential expression. The script can be called on any number of EDGE-pro files:
    PATH/edgeToDeseq.perl <..>
where each input is an EDGE-pro output file of rpkm values, which has the form prefix.rpkm_x. The script assumes that all input files have the same number and order of genes, which will be the case in the EDGE-pro output files for the same chromosome/plasmid.
If multiple chromosomes/plasmids exist, it is recommended that the script edgeToDeseq.perl, and subsequently DESeq, be run on each chromosome/plasmid separately. The DESeq program is available for download from
http://www-huber.embl.de/users/anders/DESeq/

CITATION:
---------
If you use this program, please cite:
Magoc, T., Wood, D., and Salzberg, S., "EDGE-pro: Estimated Degree of Gene Expression in Prokaryotic Genomes", Evolutionary Bioinformatics, vol. 9, pp. 127-136, 2013.

COMMENTS/QUESTIONS/REQUESTS:
----------------------------
Send an e-mail to edge.comments@gmail.com

EXAMPLE
-------
The 'example' directory, which is included in the tar file, contains five files:
    Cjejuni.fa: genome file
    Cjejuni.ptt: ptt file
    Cjejuni.rnt: rnt file
    wild1.fastq: reads from one sample
    wild2.fastq: reads from second sample

You can enter the 'example' directory by typing
    cd example
and then see the content of the directory by typing
    ls

To run EDGE-pro, type:
    ../edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild1.fastq -s ..
In this example, the source code is in the directory one level higher, so we call EDGE-pro by specifying ".." in front of edge.pl, and we specify the path to the source code by "..". If, for example, the code was downloaded into the directory /home/packages/edge, we would run EDGE-pro by
    /home/packages/edge/edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild1.fastq -s /home/packages/edge

If your computer allows the usage of multiple threads, you may run EDGE-pro by
    OMP_NUM_THREADS=8 ../edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild1.fastq -s .. -t 16
Here, we allow the count.cpp module (coverage per base count) to run on 8 threads, and the alignment by Bowtie2 to run on 16 threads.

The most important output file will be out.rpkm_0. Since C.jejuni has only one chromosome/plasmid, there is only one rpkm output file, denoted by "_0" at the end. If there were multiple chromosomes/plasmids, the output files would be out.rpkm_0, out.rpkm_1, out.rpkm_2, etc.

If you want to change the prefix of the output files, run
    ../edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild1.fastq -s .. -o wild1
The rpkm output file will be wild1.rpkm_0.

If you have paired-end reads, run
    ../edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild1.fastq -s .. -v file2.fastq
file2.fastq is not included in this example package. It is the fastq file of mates. For paired-end data, the minimum and maximum insert sizes used by Bowtie2 can be specified by setting the parameters -m min and -M max.

EDGE-pro does not perform differential expression analysis between multiple samples, but the EDGE-pro package provides a script to convert EDGE-pro output files into the format used by DESeq, a stand-alone differential expression tool. To do this, run EDGE-pro on each sample:
    ../edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild1.fastq -s .. -o wild1
    ../edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u wild2.fastq -s .. -o wild2
The rpkm outputs will be wild1.rpkm_0 and wild2.rpkm_0. To convert these outputs into the input used by DESeq, run
    ../edgeToDeseq.perl wild1.rpkm_0 wild2.rpkm_0
(If the source code is elsewhere, replace ".." by the path to the source code.) The output of edgeToDeseq will be in the 'deseqFile' file.

If more than two files are to be analyzed by DESeq, all the files can be combined by one call to edgeToDeseq.perl:
    PATH/edgeToDeseq.perl <..>
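
The per-sample steps of the example can be sketched as a small shell loop. This is only a sketch under stated assumptions: the variable EDGE_DIR and the function edge_cmd are hypothetical names, not part of the EDGE-pro package, and the loop only prints the commands so that they can be inspected before anything is run.

```shell
# Hypothetical sketch: print one EDGE-pro command per sample of the example
# data, followed by the edgeToDeseq.perl call. Nothing is executed here.
EDGE_DIR=${EDGE_DIR:-..}   # assumed location of edge.pl and the other scripts

# edge_cmd SAMPLE: print the EDGE-pro command for SAMPLE.fastq with prefix SAMPLE
edge_cmd() {
    printf '%s/edge.pl -g Cjejuni.fa -p Cjejuni.ptt -r Cjejuni.rnt -u %s.fastq -s %s -o %s\n' \
        "$EDGE_DIR" "$1" "$EDGE_DIR" "$1"
}

rpkm_files=""
for sample in wild1 wild2; do
    edge_cmd "$sample"
    # C. jejuni has a single chromosome, so each run yields SAMPLE.rpkm_0
    rpkm_files="$rpkm_files $sample.rpkm_0"
done
echo "$EDGE_DIR/edgeToDeseq.perl$rpkm_files"
```

If the printed commands look right, they can be pasted into a terminal from inside the 'example' directory. For a genome with several chromosomes/plasmids, the rpkm file list would need one edgeToDeseq.perl call per "_x" suffix, as recommended in the DIFFERENTIAL EXPRESSION section.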