Method Article
Alternative splicing (AS) and alternative polyadenylation (APA) expand the diversity of transcript isoforms and their products. Here, we describe bioinformatic protocols to analyze bulk RNA-seq and 3' end sequencing assays to detect and visualize AS and APA varying across experimental conditions.
As well as the typical analysis of RNA-Seq to measure differential gene expression (DGE) across experimental/biological conditions, RNA-seq data can also be utilized to explore other complex regulatory mechanisms at the exon level. Alternative splicing and polyadenylation play a crucial role in the functional diversity of a gene by generating different isoforms to regulate gene expression at the post-transcriptional level, and limiting analyses to the whole gene level can miss this important regulatory layer. Here, we demonstrate detailed step by step analyses for identification and visualization of differential exon and polyadenylation site usage across conditions, using Bioconductor and other packages and functions, including DEXSeq, diffSplice from the Limma package, and rMATS.
RNA-seq has been widely used over the years typically for estimating differential gene expression and gene discovery1. In addition, it can also be utilized to estimate varying exon level usage due to gene expressing different isoforms, hence contributing to a better understanding of gene regulation at the post-transcriptional level. The majority of eukaryotic genes generate different isoforms by alternative splicing (AS) to increase the diversity of mRNA expression. AS events can be divided into different patterns: skipping of complete exons (SE) where a ("cassette") exon is completely removed out of the transcript along with its flanking introns; alternative (donor) 5' splice site selection (A5SS) and alternative 3' (acceptor) splice site selection (A3SS) when two or more splice sites are present on either end of an exon; retention of introns (RI) when an intron is retained within the mature mRNA transcript and mutual exclusion of exon usage (MXE) where only one of the two available exons can be retained at a time2,3. Alternative polyadenylation (APA) also plays an important role in regulating gene expression using alternative poly (A) sites to generate multiple mRNA isoforms from a single transcript4. Most polyadenylation sites (pAs) are located in the 3' untranslated region (3' UTRs), generating mRNA isoforms with diverse 3' UTR lengths. As the 3' UTR is the central hub for recognizing regulatory elements, different 3' UTR lengths can affect mRNA localization, stability and translation5. There are a class of 3' end sequencing assays optimized to detect APA that differ in the details of the protocol6. The pipeline described here is designed for PolyA-seq, but can be adapted for other protocols as described.
In this study, we present a pipeline of differential exon analysis methods7,8 (Figure 1), which can be divided into two broad categories: exon-based (DEXSeq9, diffSplice10) and event-based (replicate Multivariate Analysis of Transcript Splicing (rMATS)11). The exon-based methods compare the fold change across conditions of individual exons, against a measure of overall gene fold change to call differentially expressed exon usage, and from that compute a gene-level measure of AS activity. Event-based methods use exon-intron-spanning junction reads to detect and classify specific splicing events such as exon skipping or retention of introns, and distinguish these AS types in the output3. Thus, these methods provide complementary views for a complete analysis of AS12,13. We selected DEXSeq (based on the DESeq214 DGE package) and diffSplice (based on the Limma10 DGE package) for the study as they are amongst the most widely used packages for differential splicing analysis. rMATS was chosen as a popular method for event-based analysis. Another popular event-based method is MISO (Mixture of Isoforms)1. For APA we adapt the exon-based approach.
Figure 1. Analysis pipeline. Flowchart of the steps used in the analysis. Steps include: obtaining the data, performing quality checks and read alignment followed by counting reads using annotations for known exons, introns and pA sites, filtering to remove low counts and normalization. PolyA-seq data was analysed for alternative pA sites using diffSplice/DEXSeq methods, bulk RNA-Seq was analysed for alternative splicing at the exon level with diffSplice/DEXseq methods, and AS events analysed with rMATS. Please click here to view a larger version of this figure.
The RNA-seq data used in this survey was acquired from Gene Expression Omnibus (GEO) (GSE138691)15. We used mouse RNA-seq data from this study with two condition groups: wild-type (WT) and Muscleblind-like type 1 knockout (Mbnl1 KO) with three replicates each. To demonstrate differential polyadenylation site usage analysis, we obtained mouse embryo fibroblasts (MEFs) PolyA-seq data (GEO Accession GSE60487)16. The data has four condition groups: Wild-type (WT), Muscleblind-like type1/type 2 double knockout (Mbnl1/2 DKO), Mbnl 1/2 DKO with Mbnl3 knockdown (KD) and Mbnl1/2 DKO with Mbnl3 control (Ctrl). Each condition group consists of two replicates.
GEO Accession | SRA Run number | Sample name | Condition | Replicate | Tissue | Sequencing | Read length | |
RNA-Seq | GSM4116218 | SRR10261601 | Mbnl1KO_Thymus_1 | Mbnl1 knockout | Rep 1 | Thymus | Paired-end | 100 bp |
GSM4116219 | SRR10261602 | Mbnl1KO_Thymus_2 | Mbnl1 knockout | Rep 2 | Thymus | Paired-end | 100 bp | |
GSM4116220 | SRR10261603 | Mbnl1KO_Thymus_3 | Mbnl1 knockout | Rep 3 | Thymus | Paired-end | 100 bp | |
GSM4116221 | SRR10261604 | WT_Thymus_1 | Wild type | Rep 1 | Thymus | Paired-end | 100 bp | |
GSM4116222 | SRR10261605 | WT_Thymus_2 | Wild type | Rep 2 | Thymus | Paired-end | 100 bp | |
GSM4116223 | SRR10261606 | WT_Thymus_3 | Wild type | Rep 3 | Thymus | Paired-end | 100 bp | |
3P-Seq | GSM1480973 | SRR1553129 | WT_1 | Wild type (WT) | Rep 1 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp |
GSM1480974 | SRR1553130 | WT_2 | Wild type (WT) | Rep 2 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp | |
GSM1480975 | SRR1553131 | DKO_1 | Mbnl 1/2 double knockout (DKO) | Rep 1 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp | |
GSM1480976 | SRR1553132 | DKO_2 | Mbnl 1/2 double knockout (DKO) | Rep 2 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp | |
GSM1480977 | SRR1553133 | DKOsiRNA_1 | Mbnl 1/2 double knockout with Mbnl 3 siRNA (KD) | Rep 1 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp | |
GSM1480978 | SRR1553134 | DKOsiRNA_2 | Mbnl 1/2 double knockout with Mbnl 3 siRNA (KD) | Rep 2 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 36 bp | |
GSM1480979 | SRR1553135 | DKONTsiRNA_1 | Mbnl 1/2 double knockout with non-targeting siRNA (Ctrl) | Rep 1 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp | |
GSM1480980 | SRR1553136 | DKONTsiRNA_2 | Mbnl 1/2 double knockout with non-targeting siRNA (Ctrl) | Rep 2 | Mouse embryonic Fibroblasts (MEFs) | Single-end | 40 bp |
Table 1. Summary of RNA-Seq and PolyA-seq datasets used for the analysis.
1. Installation of tools and R packages used in the analysis
conda install -c daler sratoolkit
conda install -c conda-forge parallel
conda install -c bioconda star bowtie fastqc rmats rmats2sashimiplot samtools fasterq-dump cutadapt bedtools deeptools
bioc_packages<- c("DEXSeq", "Rsubread", "EnhancedVolcano", "edgeR", "limma", "maser","GenomicRanges")
packages<- c("magrittr", "rtracklayer", "tidyverse", "openxlsx", "BiocManager")
#Install if not already installed
installed_packages<-packages%in% rownames(installed.packages())
installed_bioc_packages<-bioc_packages%in% rownames(installed.packages())
if(any(installed_packages==FALSE)) {
install.packages(packages[!installed_packages],dependencies=TRUE)
BiocManager::install(packages[!installed_bioc_packages], dependencies=TRUE)
}
2. Alternative splicing (AS) analysis using RNA-seq
seq 10261601 10261606 | parallel prefetch SRR{}
parallel -j 3 fastq-dump --gzip --skip-technical --read-filter pass --dumpbase --split-e --clip --origfmt {} ::: <name of sra files together>
wget -nv -O annotation.gtf.gz http://ftp.ensembl.org/pub/release-103/gtf/mus_musculus/Mus_musculus.GRCm39.103.gtf.gz \ && gunzip -f annotation.gtf.gz
wget -nv -O genome.fa.gz http://ftp.ensembl.org/pub/release-103/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.primary_assembly.fa.gz \ && gunzip -f genome.fa.gz
GTF=$(readlink -f annotation.gtf)
GENOME=$(readlink -f genome.fa)
mkdir fastqc_out
parallel "fastqc {} -o fastqc_out" ::: $RAW_DATA/*.fastq.gz
#Build STAR index
GDIR=STAR_indices
mkdir $GDIR
STAR --runMode genomeGenerate --genomeFastaFiles $GENOME --sjdbGTFfile $GTF --runThreadN 8 --genomeDir $GDIR
ODIR=results/mapping
mkdir -p $ODIR
#Align reads to the genome
for fq1 in $RAW_DATA/*R1.fastq.gz;
do
fq2=$(echo $fq1| sed 's/1.fastq.gz/2.fastq.gz/g');
OUTPUT=$(basename ${fq1}| sed 's/R1.fastq.gz//g');
STAR --genomeDir $GDIR \
--runThreadN 12 \
--readFilesCommand zcat \
--readFilesIn ${fq1}${fq2}\
--outFileNamePrefix $ODIR\/${OUTPUT} \
--outSAMtype BAM SortedByCoordinate \
--outSAMunmapped Within \
--outSAMattributes Standard
Done
Rscript prepare_mm_exon_annotation.R annotation.gtf
packages<- c("Rsubread","tidyverse", "magrittr", "EnhancedVolcano", "edgeR","openxlsx")
invisible(lapply(packages, library, character.only=TRUE))
load("mm_exon_anno.RData")
countData <- dir("bams", pattern=".bam$", full.names=T) %>%
featureCounts(annot.ext=anno,
isGTFAnnotationFile=FALSE,
minMQS=0,useMetaFeatures=FALSE,
allowMultiOverlap=TRUE,
largestOverlap=TRUE,
countMultiMappingReads=FALSE,
primaryOnly=TRUE,
isPairedEnd=TRUE,
nthreads=12)
# Non-specific filtering: Remove the exons with low counts
isexpr<- rownames(countData$counts)[rowSums(cpm(countData$counts)>1) >=3]
countData$counts<-countData$counts[rownames(countData$counts) %in%isexpr, ]
anno<-anno%>% filter(GeneID%in% rownames(countData$counts))
# Remove genes with only 1 site and NA in geneIDs
dn<-anno%>%group_by(GeneID)%>%summarise(nsites=n())%>% filter(nsites>1&!is.na(GeneID))
anno<-anno%>% filter(GeneID%in%dn$GeneID)
countData$counts<-countData$counts[rownames(countData$counts) %in%anno$GeneID, ]
library(DEXSeq)
sampleTable<-data.frame(row.names= c("Mbnl1KO_Thymus_1", "Mbnl1KO_Thymus_2", "Mbnl1KO_Thymus_3", "WT_Thymus_1", "WT_Thymus_2", "WT_Thymus_3"), condition= rep(c("Mbnl1_KO", "WT"),c(3,3)), libType= rep(c("paired-end")))
exoninfo<-anno[anno$GeneID%in% rownames(countData$counts),]
exoninfo<-GRanges(seqnames=anno$Chr,
ranges=IRanges(start=anno$Start, end=anno$End, width=anno$Width),strand=Rle(anno$Strand))
mcols(exoninfo)$TranscriptIDs<-anno$TranscriptIDs
mcols(exoninfo)$Ticker<-anno$Ticker
mcols(exoninfo)$ExonID<-anno$ExonID
mcols(exoninfo)$n<-anno$n
mcols(exoninfo)$GeneID<-anno$GeneID
transcripts_l= strsplit(exoninfo$TranscriptIDs, "\\,")
save(countData, sampleTable, exoninfo, transcripts_l, file="AS_countdata.RData")
dxd<-DEXSeqDataSet(countData$counts,sampleData=sampleTable, design=~sample+exon+condition:exon,featureID=exoninfo$ExonID,groupID=exoninfo$GeneID,featureRanges=exoninfo, transcripts=transcripts_l)
dxd %<>% estimateSizeFactors %>% estimateDispersions %T>% plotDispEsts
dxd%<>%testForDEU%>%estimateExonFoldChanges(fitExpToVar=
"condition")#Estimate fold changes
dxr=DEXSeqResults(dxd)
plotDEXSeq(dxr,"Wnk1", displayTranscripts=TRUE, splicing=TRUE,legend
=TRUE,cex.axis=1.2,cex=1.3,lwd=2)
library(limma)
library(edgeR)
mycounts=countData$counts
#Change the rownames of the countdata to exon Ids instead of genes for unique rownames.
rownames(mycounts) = exoninfo$ExonID
dge<-DGEList(counts=mycounts)
#Filtering
isexpr<- rowSums(cpm(dge)>1) >=3
dge<-dge[isexpr,,keep.lib.sizes=FALSE]
#Extract the exon annotations for only transcripts meeting non-specific filter
exoninfo=anno%>% filter(ExonID%in% rownames(dge$counts))
#Convert the exoninfo into GRanges object
exoninfo1<-GRanges(seqnames=exoninfo$Chr,
ranges=IRanges(start=exoninfo$Start, end=exoninfo$End, width=exoninfo$Width),strand=Rle(exoninfo$Strand))
mcols(exoninfo1)$TranscriptIDs<-exoninfo$TranscriptIDs
mcols(exoninfo1)$Ticker<-exoninfo$Ticker
mcols(exoninfo1)$ExonID<-exoninfo$ExonID
mcols(exoninfo1)$n<-exoninfo$n
mcols(exoninfo1)$GeneID<-exoninfo$GeneID
transcripts_l= strsplit(exoninfo1$TranscriptIDs, "\\,")
dge<-calcNormFactors(dge)
Treat<- factor(sampleTable$condition)
design<- model.matrix(~0+Treat)
colnames(design) <- levels(Treat)
v<-voom(dge,design,plot=FALSE)
fit<-lmFit(v,design)
fit<-eBayes(fit)
colnames(fit)
cont.matrix<-makeContrasts(
Mbnl1_KO_WT=Mbnl1_KO-WT,
levels=design)
fit2<-contrasts.fit(fit,cont.matrix)
ex<-diffSplice(fit2,geneid=exoninfo$GeneID,exonid=exoninfo$ExonID)
ts<-topSplice(ex,n=Inf,FDR=0.1, test="t", sort.by="logFC")
tg<-topSplice(ex,n=Inf,FDR=0.1, test="simes")
plotSplice(ex,geneid="Wnk1", FDR=0.1)
#Volcano plot
EnhancedVolcano(ts,lab=ts$ExonID,selectLab= head((ts$ExonID),2000), xlab= bquote(~Log[2]~'fold change'), x='logFC', y='P.Value', title='Volcano Plot', subtitle='Mbnl1_KO vs WT (Limma_diffSplice)', FCcutoff=2, labSize=4,legendPosition="right", caption= bquote(~Log[2]~"Fold change cutoff, 2; FDR 10%"))
mkdir rMATS_analysis
cd bams/
ls -pd "$PWD"/*| grep "WT"| tr '\n'','> Wt.txt
ls -pd "$PWD"/*| grep "Mb"| tr '\n'','> KO.txt
mv *.txt ../rMATS_analysis
python rmats-turbo/rmats.py --b1 KO.txt --b2 Wt.txt --gtf annotation.gtf -t paired --readLength 50 --nthread 8 --od rmats_out/ --tmp rmats_tmp --task pos
library(maser)
mbnl1<-maser("/rmats_out/", c("WT","Mbnl1_KO"), ftype="JCEC")
#Filtering out events by coverage
mbnl1_filt<-filterByCoverage(mbnl1,avg_reads=5)
#Top splicing events at 10% FDR
mbnl1_top<-topEvents(mbnl1_filt,fdr=0.1, deltaPSI=0.1)
mbnl1_top
#Check the gene events for a particular gene
mbnl1_wnk1<-geneEvents(mbnl1_filt,geneS="Wnk1", fdr=0.1, deltaPSI=0.1)
maser::display(mbnl1_wnk1,"SE")
plotGenePSI(mbnl1_wnk1,type="SE", show_replicates
=TRUE)
volcano(mbnl1_filt,fdr=0.1, deltaPSI=0.1,type="SE")
+xlab("deltaPSI")+ylab("Log10 Adj. Pvalue")+ggtitle("Volcano Plot of exon skipping events")
python ./src/rmats2sashimiplot/rmats2sashimiplot.py --b1 ../bams/WT_Thymus_1.bam,../bams/WT_Thymus_2.bam,../bams/WT_Thymus_3.bam --b2 ../bams/Mbnl1KO_Thymus_1.bam,../bams/Mbnl1KO_Thymus_2.bam,../bams/Mbnl1KO_Thymus_3.bam -t SE -e ../rMATS_analysis/rmats_out/SE.MATS.JC.txt --l1 WT --l2 Mbnl1_KO --exon_s 1 --intron_s 5 -o ../rMATS_analysis/rmats2shasmi_output
3. Alternative polyadenylation (APA) analysis using 3' end sequencing
anno<- read.table(file= "flanking60added.pA_annotation.bed",
stringsAsFactors=FALSE, check.names=FALSE, header=FALSE, sep="")
colnames(anno) <- c("chrom", "chromStart", "chromEnd", "name", "score", "strand", "rep", "annotation", "gene_name", "gene_id")
anno<- dplyr::select(anno,name,chrom, chromStart,chromEnd, strand,gene_id,gene_name,rep)
colnames(anno) <- c("GeneID", "Chr", "Start", "End", "Strand", "Ensembl", "Symbol", "repID")
countData<- dir("bamfiles", pattern="sorted.bam$", full.names=TRUE) %>%
# Read all bam files as input for featureCounts
featureCounts(annot.ext=anno, isGTFAnnotationFile= FALSE,minMQS=0,useMetaFeatures= TRUE,allowMultiOverlap=TRUE, largestOverlap= TRUE,strandSpecific=1, countMultiMappingReads =TRUE,primaryOnly= TRUE,isPairedEnd= FALSE,nthreads=12)%T>%
save(file="APA_countData.Rdata")
load(file= "APA_countData.Rdata")# Skip this step if already loaded
# Non-specific filtering: Remove the pA sites not differentially expressed in the samples
countData<-countData$counts%>%as.data.frame%>% .[rowSums(edgeR::cpm(.)>1) >=2, ]
anno%<>% .[.$GeneID%in% rownames(countData), ]
# Remove genes with only 1 site and NA in geneIDs
dnsites<-anno%>%group_by(Symbol)%>%summarise(nsites=n())%>% filter(nsites>1&!is.na(Symbol))
anno<-anno%>% filter(Symbol%in%dnsites$Symbol)
countData<-countData[rownames(countData) %in%anno$GeneID, ]
c("DEXSeq", "GenomicRanges") %>% lapply(library, character.only=TRUE) %>%invisible
sampleTable1<- data.frame(row.names= c("WT_1","WT_2","DKO_1","DKO_2"),
condition= c(rep("WT", 2), rep("DKO", 2)),
libType= rep("single-end", 4))
# Prepare the GRanges object for DEXSeqDataSet object construction
PASinfo <- GRanges(seqnames = anno$Chr,
ranges = IRanges(start = anno$Start, end = anno$End),strand = Rle(anno$Strand))
mcols(PASinfo)$PASID<-anno$repID
mcols(PASinfo)$GeneEns<-anno$Ensembl
mcols(PASinfo)$GeneID<-anno$Symbol
# Prepare the new feature IDs, replace the strand information with letters to match the current pA site clusterID
new.featureID <- anno$Strand %>% as.character %>% replace(. %in% "+", "F") %>% replace(. %in% "-", "R") %>% paste0(as.character(anno$repID), .)
# Select the read counts of the condition WT and DKO
countData1<- dplyr::select(countData, SRR1553129.sorted.bam, SRR1553130.sorted.bam, SRR1553131.sorted.bam, SRR1553132.sorted.bam)
# Rename the columns of countData using sample names in sampleTable
colnames(countData1) <- rownames(sampleTable1)
dxd1<-DEXSeqDataSet(countData=countData1,
sampleData=sampleTable1,
design=~sample+exon+condition:exon,
featureID=new.featureID,
groupID=anno$Symbol,
featureRanges=PASinfo)
dxd1$condition<- factor(dxd1$condition, levels= c("WT", "DKO"))
# The contrast pair will be "DKO - WT"
dxd1 %<>% estimateSizeFactors %>% estimateDispersions %T>% plotDispEsts
dxd1 %<>% testForDEU %>% estimateExonFoldChanges(fitExpToVar = "condition")
dxr1 <- DEXSeqResults(dxd1)
dxr1
mcols(dxr1)$description
table(dxr1$padj<0.1) # Check the number of differential pA sites (FDR < 0.1)
table(tapply(dxr1$padj<0.1, dxr1$groupID, any)) # Check the number of gene overlapped with differential pA site
# Select the top 100 significant differential pA sites ranked by FDR
topdiff.PAS<- dxr1%>%as.data.frame%>%rownames_to_column%>%arrange(padj)%$%groupID[1:100]
# Apply plotDEXSeq for the visualization of differential polyA usage
plotDEXSeq(dxr1,"S100a7a", legend=TRUE, expression=FALSE,splicing=TRUE, cex.axis=1.2, cex=1.3,lwd=2)
# Apply perGeneQValue to check the top genes with differential polyA site usage
dxr1%<>% .[!is.na(.$padj), ]
dgene<- data.frame(perGeneQValue= perGeneQValue (dxr1)) %>%rownames_to_column("groupID")
dePAS_sig1<-dxr1%>% data.frame() %>%
dplyr::select(-matches("dispersion|stat|countData|genomicData"))%>%
inner_join(dgene)%>%arrange(perGeneQValue)%>%distinct()%>%
filter(padj<0.1)
# Apply EnhancedVolcano package to visualise differential polyA site usage
"EnhancedVolcano"%>% lapply(library, character.only=TRUE) %>%invisible
EnhancedVolcano(dePAS_sig1, lab=dePAS_sig1$groupID, x='log2fold_DKO_WT',
y='pvalue',title='Volcano Plot',subtitle='DKO vs WT',
FCcutoff=1,labSize=4, legendPosition="right",
caption= bquote(~Log[2]~"Fold change cutoff, 1; FDR 10%"))
contrast.matrix<-makeContrasts(DKO_vs_WT=DKO-
WT,Ctrl_vs_DKO=Ctrl-DKO,
KD_vs_Ctrl=KD-Ctrl,KD_vs_DKO=KD-DKO,levels=design)
fit2<-fit%>%contrasts.fit(contrast.matrix)%>%eBayes
summary(decideTests(fit2))
ex<-diffSplice(fit2,geneid=anno$Symbol,exonid=new.featureID)
topSplice(ex) #Check the top significant results with topSplice
sig1<-topSplice(ex,n=Inf,FDR=0.1,coef=1, test="t", sort.by="logFC")
sig1.genes<-topSplice(ex,n=Inf,FDR=0.1,coef=1, test="simes")
plotSplice(ex, coef=1,geneid="S100a7a", FDR = 0.1)
plotSplice(ex,coef=1,geneid="Tpm1", FDR = 0.1)
plotSplice(ex,coef=1,geneid="Smc6", FDR = 0.1)
EnhancedVolcano(sig1, lab=sig1$GeneID,xlab= bquote(~Log[2]~'fold change'),
x='logFC', y='P.Value', title='Volcano Plot', subtitle='DKO vs WT',
FCcutoff=1, labSize=6, legendPosition="right")
After running the above step-by-step workflow, the AS and APA analysis outputs and representative results are in the form of tables and data plots, generated as follows.
AS:
The main output of the AS analysis (Supplementary Table 1 for diffSplice; Table 2 for DEXSeq) is a list of exons showing differential usage across conditions, and a list of genes showing significant overall splicing activity of one or more of its constituent exons, ranked by statistical significance. Supplementary Table 1, tab 2 shows significant exons, with columns showing differential FC of exon versus rest, per-exon unadjusted p-value, and adjusted p-value (Benjamini-Hockberg correction). Thresholding on the adjusted p-values will give a set of exons with defined FDR. Supplementary Table 1, tab 3 shows a ranked list of genes showing significance of overall splicing activity, with a column showing gene-level adjusted p-value computed using Simes method. Similar data are shown in Table 2 for DEXSeq. Supplementary Figure 1 and Supplementary Figure 2 show differential splicing pattern in Mbnl1, Tcf7 and Lef1 genes which have been experimentally validated in the published article presented with the data15. The authors have shown experimental validation of five genes- Mbnl1, Mbnl2, Lef1, Tcf7 and Ncor2. Our approach detected differential splicing pattens in all these genes. Here we present the FDR levels for each gene using DEXSeq, diffSplice and rMATS respectively as obtained in Supplementary Tables 1-3: Mbnl1 (0, 6.6E-61 ,0), Mbnl2 (0,0.18,0), Lef1 (1.4E-10, 1.3E-04, 0), Tcf7 (0, 1.1E-6, 0) and Ncor2 (9.2E-11, 1.6E-06, 0).
Figure 2 displays a comparison between outputs obtained from three different tools and illustrates alternative splicing patterns in Wnk1 gene. The volcano plots are shown in Figure 2A (diffSplice) and Figure 2B (DEXSeq). An additional three highly ranked genes are shown in Supplementary Figure 1 (diffSplice) and Supplementary Figure 2 (DEXSeq). The y-axis shows statistical significance (-log10 P-values) and the x-axis shows effect size (fold change). Genes located at the top left or right quadrants indicate substantial FC and strong statistical evidence of genuine differences.
Figure 2. Comparison of alternative splicing results obtained from diffSplice, DEXSeq and rMATS. |
(A) Volcano plot (left) of RNA-Seq from Limma diffSplice analysis: The x-axis shows log exon fold change; the y-axis shows -log10 p-value. Each point corresponds to an exon. Horizontal dashed line at p-value=1E-5; vertical dashed lines at two fold change (FC). Red exons show substantial FC and statistical significance. Differential splicing plot (right): Splicing patterns are exhibited for an example gene Wnk1 where the x-axis shows exon id per transcript; the y-axis shows exon relative log fold change (the difference between the exon's logFC and the overall logFC for all other exons). Exons highlighted in red show statistically significant differential expression (FDR < 0.1).
(B) Volcano plot (left) and Differential exon usage (right) of RNA-Seq obtained from DEXSeq analysis. Wnk1 gene shows significant differential usage of exons between WT and Mbnl1 knock-out highlighted in pink, which correspond to the same differential exons in (A).
(C) Volcano plot (left) and Sashimi plot (right) for Wnk1 obtained from rMATS analysis. Volcano plot depicting significant skipped (cassette) exon (SE) event in wild-type compared to knockout at 10% FDR with change in percent spliced in (PSI or ΔΨ) values > 0.1. The x-axis shows change in PSI values across conditions, and the y-axis shows log P-Value. The Sashimi plot shows a skipped exon event in the Wnk1 gene, corresponding to a significant differential exon in (A) and (B). Each row represents an RNA-Seq sample: three replicates of wild-type and Mbnl1 knock-out. The height shows read coverage in RPKM and the connecting arcs depict junction reads across exons. Annotated gene model alternative isoforms are shown at the bottom of the plot. The bottom panel of C illustrates junction reads used to compute the PSI statistic.
(D) Venn diagram comparing the number of significant differential exons obtained by the different methods. Please click here to view a larger version of this figure.
Figure 2 A (right-panel) shows a diagrammatic display of exon differences of one of the top ranked genes, showing logFC on the y-axis and exon number on the x-axis. This example shows a cassette exon varying between conditions for gene Wnk1. The differential exon usage plot from DEXSeq shows evidence of differential splicing at five exon sites near Wnk1.6.45. The highlighted exons in pink are likely to be spliced out in Mbnl1 KO samples compared to WT. These exons are complementary to the results obtained by diffSplice which shows a similar pattern at the specific genomic position. More examples are shown in Supplementary Figure 1 and Supplementary Figure 2. A more detailed view to confirm interesting results can be given by comparing coverage (wiggle) tracks in RPM (Reads per million) units in the UCSC (University of Santa Cruz) or IGV (Integrative Genomics Viewer) genome browsers (not shown), along with visual correlation with other tracks of interest, such as known gene models, conservation, and other genome-wide assays.
rMATS output table lists significant alternative splicing events categorized by type (Supplementary Table 3). Figure 2C shows a volcano plot of genes that are alternative spliced, with the effect size measured by the differential "percent spliced in" (PSI or ΔΨ) statistic of11.
PSI refers to the percentage of reads consistent with the inclusion of a cassette exon (i.e., reads mapping to the cassette exon itself or junction reads overlapping the exon) compared with reads consistent with exon exclusion i.e. junction reads across adjacent upstream and downstream exons (The bottom panel of Figure 2C). The right panel of Figure 2C shows sashimi plot of Wnk1 gene with differential splicing event overlaid on coverage tracks for the gene, with a skipped exon in Mbnl1 KO. Arcs joining exons show the number of junction reads (reads crossing a spliced-out intron). Different tabs of Supplementary Table 3 show significant reads of each type of event that spans junctions with exon boundaries (junction counts and exon counts (JCEC)). Figure 2D compares the significant differentially spliced exons detected by the three tools.
Figure 3. Alternative splicing events acquired by rMATS analysis. A) Types of AS events. This figure is adapted from rMATS documentation11 explaining the splicing event with constitutive and alternatively spliced exons. Labeled with the number of each event at FDR 10%. B) Sashimi plot of Add3 gene exhibiting skipped exon (SE). C) Sashimi plot of Baz2b gene exhibiting alternative 5' splice site (A5SS). D) Sashimi plot of Lsm14b gene exhibiting alternative 3' splice site (A3SS). E) Sashimi plot of Mta1 gene exhibiting mutually exclusive exons (MXE). F) Sashimi plot of Arpp21 gene exhibiting retained intron (RI). Red rows represent three replicates of wild-type and orange rows represent Mbnl1 knock-out replicates. The x-axis corresponds to genomic coordinates and strand information, y-axis shows coverage in RPKM. Please click here to view a larger version of this figure.
Figure 3 illustrates types of splicing events SE, A5SS, A3SS, MXE and RI with the help of Sashimi plots of the top significant genes of those events. On comparing the three replicates of both WT and Mbnl1_KO, a total of 1272 SE events, 130 A5SS, 116 A3SS, 215 MXE and 313 RI events were detected at FDR 10%. Sashimi plot illustrates the type of event using top genes as an example.
APA:
The output from the APA analysis is similar to the exon-level AS analysis. A table of top genes ranked by differential APA activity in the 3'UTR is provided (Supplementary Table 4 and Supplementary Table 5). Figure 4A shows the volcano plots of genes by differential APA activity in 3'UTRs generated using diffSplice and DEXSeq separately. Figure 4B displays the Venn plot comparing the significantly differential pA site usage results acquired from different pipelines. Figure 4C and 4D show the diagrammatic representation of differential pA site usage in the 3'UTR of gene Fosl2 (Figure 4C) and Papola (Figure 4D) generated using both diffSplice and DEXSeq, which are experimentally validated to show significant distal to proximal shift (Fosl2) and significant proximal to distal shift (Papola) of pA site usage in DKO, respectively. More examples are shown in Supplementary Figure 3 and Supplementary Figure 4.
Figure 4. Alternative polyadenylation plots by diffSplice and DEXSeq. A) Volcano plots of PolyA-seq data generated using diffSplice and DEXSeq. X-axis shows log pA site fold change; y-axis shows -log10 p-value. Each point corresponds to a pA site. Horizontal dashed line at p-value=1E-5; vertical dashed lines at 2-fold FC. Red exons show substantial FC and statistical significance. B) Venn plot comparing the significant differential pA site usage results acquired from different pipelines. C-D) Differential APA plots generated using diffSplice and DEXSeq showing the proximal, internal and distal pA sites for the Fosl2 and Papola gene. Figures are generated by same function as Figure 2 (B) but with pA sites replacing exons. Please click here to view a larger version of this figure.
Figure 5 is a diagnostic plot to confirm the expected read distribution around annotated pA cleavage sites for the PolyA-seq assay used. It shows the mean coverage in flanking regions anchored at known pA cleavage sites on the genome-wide level. In this case, the expected pileup of reads upstream of the sites is visualized. The read distributions anchored at pA sites for all PolyA-seq samples are shown in Supplementary Figure 5.
Figure 5. Mean coverage plot around pA cleavage sites. The cleavage site for PolyA-seq data is shown by vertical dashed line. X-axis shows base position relative to pA cleavage sites, up to 100 nucleotides upstream and downstream; y-axis shows the mean read coverage over all pA cleavage sites, normalized by library size in CPM. Please click here to view a larger version of this figure.
The differential APA results of different contrasts generated by the same pipeline can be compared and verified by visualizing the read coverage of representative significantly differential pA sites in the genome browser. Figure 6A is the Venn plot comparing the significantly differential pA site usage of different contrasts acquired from diffSplice. Figure 6B-D are the IGV snapshots of the read coverage at pA sites for different genes, which show the patterns consistent with those discovered in the APA analysis using diffSplice. Figure 6B validates the significant proximal to distal shift of pA site usage for gene Paip2, which is uniquely detected in the contrast DKO vs WT, but not in other two contrasts KD vs WT, and Ctr vs WT. Figure 6C validates the significant distal to proximal shift of pA site usage for gene Ccl2 detected uniquely in the contrast KD vs WT, while Figure 6D validates the significant differential pA usage of all the contrasts for gene Cacna2d1.
Figure 6. Contrast comparison and verification of diffSplice results. A) Venn diagram comparing significant differential pA site usage results of different contrasts acquired from diffSplice. B-D) IGV snapshot visualizing pA peaks coverage of genes Paip2, Ccl2 and Cacna2d1 across conditions. Each track represents the read coverage in a specific condition. Please click here to view a larger version of this figure.
Supplementary Figure 1. RNA-Seq analysis of differential splicing with Limma diffSplice. (A) Volcano plot of RNA-Seq from Limma diffSplice analysis: The x-axis shows log exon fold change; the y-axis shows -log10 p-value. Each point corresponds to an exon. Horizontal dashed line at p-value=1E-5; vertical dashed lines at two-fold change (FC). Red exons show substantial FC and statistical significance. (B-D) Differential splicing plots: Splicing patterns are exhibited for example genes Mbnl1, Tcf7, and Lef1, respectively, where the x-axis shows exon id per transcript; the y-axis shows exon relative log fold change (the difference between the exon's logFC and the overall logFC for all other exons). Exons highlighted in red show statistically significant differential expression (FDR < 0.1). Please click here to download this File.
Supplementary Figure 2. RNA-Seq analysis of differential exon usage with DEXSeq. (A) Volcano plot. (B-D) Differential exon usage of RNA-Seq obtained from DEXSeq analysis. Genes Mbnl1, Tcf7, and Lef1, respectively, show significant differential usage of exons between WT and Mbnl1 knock-out highlighted in pink, which correspond to the same differential exons in Supplementary figure 1. Please click here to download this File.
Supplementary Figure 3. Alternative polyadenylation plots by diffSplice. A) Volcano plots of PolyA-seq data generated using diffSplice in three contrast pairs obtained from the mouse PolyA-seq data, including double knockout (DKO) vs wild-type (WT), knock-down (KD) vs WT, and control (Ctrl) vs WT. X-axis shows log pA site fold change; y-axis shows -log10 p-value. Each point corresponds to a pA site. Horizontal dashed line at p-value=1E-5; vertical dashed lines at 2-fold FC. Red pA sites show substantial FC and statistical significance. B) Differential APA plots generated using diffSplice showing the proximal, internal and distal pA sites for the highly ranked genes S100a7a, Tpm1, and Smc6. Please click here to download this File.
Supplementary Figure 4. Differential pA usage analysis by DEXSeq pipeline. A) Volcano plots of PolyA-seq data generated using DEXSeq in three pairs obtained from the mouse PolyA-seq data, including double knockout (DKO) vs wild-type (WT), knock-down (KD) vs WT, and control (Ctrl) vs WT. X-axis shows log pA site fold change; y-axis shows -log10 p-value. Each point corresponds to a pA site. Horizontal dashed line at p-value=1E-5; vertical dashed lines at 2-fold FC. Red pA sites show substantial FC and statistical significance. B) Differential APA plots generated using DEXSeq showing the proximal, internal and distal pA sites for the highly ranked genes S100a7a, Tpm1, and Smc6. Please click here to download this File.
Supplementary Figure 5. Mean coverage plot and heatmaps around pA cleavage sites. Coverage shown for four conditions, with genes on forward and reverse strands shown separately. X-axis shows base position relative to pA cleavage sites, up to 100 nucleotides upstream and downstream; y-axis refers to the mean coverage at corresponding relative base positions across all pA cleavage sites. Heatmaps provide an alternative view, with each pA cleavage site shown as a separate row on the x-axis, ordered by coverage. Intensity shows read coverage (see legend). Please click here to download this File.
Supplementary Table 1. diffSplice output of the AS analysis. First tab defines the column names for the original outputs presented in second (exon-level) and third (gene-level) tabs. Please click here to download this Table.
Supplementary Table 2. DEXSeq output of the AS analysis. First tab defines the column names for the original output presented in second (exon-level) and third (gene-level) tabs. Please click here to download this Table.
Supplementary Table 3. rMATS output of the AS analysis. First tab defines the column names for the summary file (tab 2) and the JCEC files for each event (tab 3-7). Please click here to download this Table.
Supplementary Table 4. diffSplice output of the APA analysis. First tab defines the column names for the original output presented in second (pA site-level) and third (gene-level) tabs. Please click here to download this Table.
Supplementary Table 5. DEXSeq output of the APA analysis. First tab defines the column names for the original output presented in second (pA site-level) and third (gene-level) tabs. Please click here to download this Table.
Supplementary Table 6. A summary of number of significantly changed exons for AS and pA sites for APA. Please click here to download this Table.
Supplementary Table 7. A summary of tools and packages used in the AS/APA analysis. Please click here to download this Table.
In this study, we evaluated exon-based and event-based approaches to detect AS and APA in bulk RNA-Seq and 3' end sequencing data. The exon-based AS approaches produce both a list of differentially expressed exons and a gene-level ranking ordered by the statistical significance of overall gene-level differential splicing activity (Tables 1-2, 4-5). For the diffSplice package, differential usage is determined by fitting weighted linear models at an exon-level to estimate the differential log fold-change of an exon against the average log fold-change of the other exons within the same gene (called per exon FC). The gene-level statistical significance is computed by combining individual exon-level significance tests into a gene-wise test by the Simes method10. This ranking by gene-level differential splicing activity can subsequently be used to perform a gene set enrichment analysis (GSEA) of key pathways involved10. DEXSeq uses a similar strategy, by fitting a generalized linear model to measure differential exon usage, though differing in certain steps such as filtering, normalization and dispersion estimation. On comparing the top 500 ranked exons showing AS activity and APA using DEXSeq and DiffSplice, we found an overlap of 310 exons and 300 pA sites, respectively, demonstrating the concordance of the two exon-based approaches, which was also demonstrated in a previous study 29. It is recommended to use a combination of both an exon-based (either DEXSeq or diffSplice) and an event-based approach for comprehensive detection and classification of AS. For APA, users can choose either DEXSeq or diffSplice: both methods have been shown to perform well across a wide range of transcriptomics experiments29.
In preparing the RNA-seq library for an AS analysis, it is important to use a strand-specific bulk RNA-seq protocol8, as a large fraction of genes in vertebrate genomes overlap on different strands, and a non-strand-specific protocol is unable to distinguish these overlapping regions, confounding final exon detection. Another consideration is read depth, with splicing analyses requiring deeper sequencing than DGE, e.g. 30-60 million reads per sample, versus 5-25 million reads per sample for DGE (https://sapac.support.illumina.com/bulletins/2017/04/considerations-for-rna-seq-read-length-and-coverage-.html). All the tools demonstrated in the protocol support both single-end and paired-end sequencing data. If only known gene annotations are used to detect junction reads then single-ended shorter reads (≥ 50 bp) can be used, though de novo detection of novel splice junctions benefits from paired-end and longer (≥ 100bp) reads30,31. The choice of RNA extraction protocol - either polyA selection or rRNA depletion-- depends on the quality of RNA and the experimental question- see31 for a discussion. Depending on the details of the library construction, modifications will be required to the example scripts given here for the parameters of read alignment, feature counting and rMATS. In computing the initial exon level read counts using featureCounts, or similar methods, care must be taken to configure the function options correctly for counts and strandedness: in featureCounts, we set the "strandSpecific" argument appropriately for the strand-specific RNA-seq protocol used; and for exon-level quantification it is expected that a read will map over adjacent exons, and so we set the allowMultiOverlap parameter to TRUE. For APA, there are different 3' end sequencing protocols6 which vary in the precise location of peaks relative to the pA site. For our example data we determine the peak is 60 bp upstream of the pA site as shown by Figure 5, and this analysis will need to be adapted for other 3' end sequencing protocols.
In this protocol we limit the scope to the discussion of differential analyses at the level of individual exons, and splicing events consisting of adjacent exon-intron combinations. We do not discuss the class of analyses based on isoform de novo reconstruction such as Cufflinks, Cuffdiff32, RSEM33, Kallisto34 which aim to detect and quantify the absolute and relative expression of entire alternative isoforms. The exon and event-based methods are more sensitive for detecting individual splicing events30 and in many cases provide all the information needed for further analysis, without a need for isoform-level quantification.
The latest version of the source files in this protocol are available at https://github.com/jiayuwen/AS_APA_JoVE
The authors have nothing to disclose.
This study was supported by an Australian Research Council (ARC) Future Fellowship (FT16010043) and ANU Futures Scheme.
Name | Company | Catalog Number | Comments |
Not relevent for computational study |
Request permission to reuse the text or figures of this JoVE article
Request PermissionThis article has been published
Video Coming Soon
Copyright © 2025 MyJoVE Corporation. All rights reserved
We use cookies to enhance your experience on our website.
By continuing to use our website or clicking “Continue”, you are agreeing to accept our cookies.