A Bioinformatics Pipeline for Investigating Molecular Evolution and Gene Expression using RNA-seq

Aide Macias-Muñoz; Ali Mortazavi

doi:10.3791/61633

Summary

The purpose of this protocol is to investigate the evolution and expression of candidate genes using RNA sequencing data.

Abstract

Distilling and reporting large datasets, such as whole genome or transcriptome data, is often a daunting task. One way to break down results is to focus on one or more gene families that are significant to the organism and study. In this protocol, we outline bioinformatic steps to generate a phylogeny and to quantify the expression of genes of interest. Phylogenetic trees can give insight into how genes are evolving within and between species as well as reveal orthology. These results can be enhanced using RNA-seq data to compare the expression of these genes in different individuals or tissues. Studies of molecular evolution and expression can reveal modes of evolution and conservation of gene function between species. The characterization of a gene family can serve as a springboard for future studies and can highlight an important gene family in a new genome or transcriptome paper.

Introduction

Advances in sequencing technologies have facilitated the sequencing of genomes and transcriptomes of non-model organisms. In addition to the increased feasibility of sequencing DNA and RNA from many organisms, an abundance of data is publicly available to study genes of interest. The purpose of this protocol is to provide bioinformatic steps to investigate the molecular evolution and expression of genes that may play an important role in the organism of interest.

Investigating the evolution of a gene or gene family can provide insight into the evolution of biological systems. Members of a gene family are typically determined by identifying conserved motifs or homologous gene sequences. Gene family evolution was previously investigated using genomes from distantly related model organisms¹. A limitation to this approach is that it is not clear how these gene families evolve in closely related species and the role of different environmental selective pressures. In this protocol, we include a search for homologs in closely related species. By generating a phylogeny at a phylum level, we can note trends in gene family evolution such as that of conserved genes or lineage-specific duplications. At this level, we can also investigate whether genes are orthologs or paralogs. While many homologs likely function similarly to each other, that is not necessarily the case². Incorporating phylogenetic trees in these studies is important to resolve whether these homologous genes are orthologs or not. In eukaryotes, many orthologs retain similar functions within the cell as evidenced by the ability of mammalian proteins to restore the function of yeast orthologs³. However, there are instances where a non-orthologous gene carries out a characterized function⁴.

Phylogenetic trees begin to delineate relationships between genes and species, yet function cannot be assigned solely based on genetic relationships. Gene expression studies combined with functional annotations and enrichment analysis provide strong support for gene function. Cases where gene expression can be quantified and compared across individuals or tissue types can be more telling of potential function. The following protocol follows methods used in investigating opsin genes in Hydra vulgaris⁷, but they can be applied to any species and any gene family. The results of such studies provide a foundation for further investigation into gene function and gene networks in non-model organisms. As an example, the investigation of the phylogeny of opsins, which are proteins that initiate the phototransduction cascade, gives context to the evolution of eyes and light detection⁸^,⁹^,¹⁰^,¹¹. In this case, non-model organisms especially basal animal species such as cnidarians or ctenophores can elucidate conservation or changes in the phototransduction cascade and vision across clades¹²^,¹³^,¹⁴. Similarly, determining the phylogeny, expression, and networks of other gene families will inform us about the molecular mechanisms underlying adaptations.

Protocol

This protocol follows UC Irvine animal care guidelines.

1. RNA-seq library preparation

Isolate RNA using the following methods.
1. Collect samples. If RNA is to be extracted at a later time, flash freeze the sample or place in RNA storage solution¹⁵ (Table of Materials).
2. Euthanize and dissect the organism to separate tissues of interest.
3. Extract total RNA using an extraction kit and purify the RNA using an RNA purification kit (Table of Materials)
  NOTE: There are protocols and kits that may work better for different species and tissue types¹⁶^,¹⁷. We have extracted RNA from different body tissues of a butterfly¹⁸ and a gelatinous Hydra¹⁹ (see discussion).
4. Measure the concentration and quality of the RNA of each sample (Table of Materials). Use samples with RNA integrity numbers (RIN) higher than 8, ideally closer to 9²⁰ to construct cDNA libraries.
Construct cDNA library and sequence as follows.
1. Build cDNA libraries according to the library prep instruction manual (see discussion).
2. Determine cDNA concentration and quality (Table of Materials).
3. Multiplex the libraries and sequence them.

2. Access a computer cluster

NOTE: RNA-seq analysis requires manipulation of large files and is best done on a computer cluster (Table of Materials).

Login to the computer cluster account using the command ssh username@clusterlocation on a terminal (Mac) or PuTTY (Windows) application window.

3. Obtain RNA-seq reads

Obtain RNA-seq reads from the sequencing facility or, for data generated in a publication, from the data repository where it was deposited (3.2 or 3.3).
To download data from repositories such as ArrayExpress do the following:
1. Search the site using the accession number.
2. Find the link to download the data, then left-click and select Copy Link.
3. On the terminal window, type wget and select Paste link to copy the data to the directory for analysis.
To download NCBI Short Read Archive (SRA) data follow these alternative steps:
1. On the terminal download SRA Toolkit v. 2.8.1 using wget.
  NOTE: Downloading and installing programs to the computer cluster may require root access, contact your computer cluster administrator if installation fails.
2. Finish installing the program by typing tar -xvf $TARGZFILE.
3. Search NCBI for the SRA accession number for the samples you want to download, it should have the format SRRXXXXXX.
4. Obtain the RNA-seq data by typing [sratoolkit location]/bin/prefetch SRRXXXXXX in the terminal window.
5. For paired-end files type [sratoolkit location]/bin/fastq-dump --split-files SRRXXXXXX to get two fastq files (SRRXXXXXX_1.FASTQ and SRRXXXXXX_2.FASTQ).
  NOTE: To do a Trinity de novo assembly use the command [sratoolkit location]/bin/fastq-dump --defline-seq '@$sn[_$rn]/$ri' --split-files SRRXXXXXX

4. Trim adapters and low-quality reads (optional)

Install or load Trimmomatic²¹ v. 0.35 on the computing cluster.
In the directory where the RNA-seq data files are located, type a command that includes the location of the trimmomatic jar file, the input FASTQ files, output FASTQ files, and optional parameters such as read length and quality.
NOTE: The command will vary by the raw and desired quality and length of the reads. For Illumina 43 bp reads with Nextera primers, we used: java -jar /data/apps/trimmomatic/0.35/trimmomatic-0.35.jar PE $READ1.FASTQ $READ2.FASTQ paired_READ1.FASTQ unpaired_READ1.FASTQ paired_READ2.FASTQ unpaired_READ2.FASTQ ILLUMINACLIP:adapters.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:17 MINLEN:30.

5. Obtain reference assembly

Search google, EnsemblGenomes, and NCBI Genomes and Nucleotide TSA (Transcriptome Shotgun Assembly) for a reference genome or assembled transcriptome for the species of interest (Figure 1).
NOTE: If a reference genome or transcriptome are not available or low-quality, proceed to STEP 6 to generate a de novo assembly.
If a reference genome or assembled transcriptome exists, download it as a fasta file to where the analysis will be performed following the steps below.
1. Find the link to download the genome, left-click and Copy Link.
2. On the terminal window type wget and paste the link address. If available, also copy the GTF file and protein FASTA file for the reference genome.

6. Generate a de novo assembly (Alternative to Step 5)

Combine the RNA-seq READ1 and READ2 fastq files for all samples by typing cat *READ1.FASTQ > $all_READ1.FASTQ and cat *READ2.FASTQ > all_READ2.FASTQ on the terminal window.
Install or load Trinity²² v.2.8.5 on the computing cluster.
Generate and assembly by typing on the terminal: Trinity --seqType fq --max_memory 20G --left $all_READ1.FASTQ --right $all_READ2.FASTQ.

7. Map reads to the genome (7.1) or de novo transcriptome (7.2)

Map reads to the reference genome using STAR²³ v. 2.6.0c and RSEM²⁴ v. 1.3.0.
1. Install or load STAR v. 2.6.0c. and RSEM v. 1.3.0 to the computing cluster.
2. Index the genome by typing rsem-prepare-reference --gtf $GENOME.GTF --star -p 16 $GENOME.FASTA $OUTPUT.
3. Map reads and calculate expression for each sample by typing rsem-calculate-expression -p 16 --star --paired-end $READ1.FASTQ $READ2.FASTQ $INDEX $OUTPUT.
4. Rename the results file to something descriptive using mv RSEM.genes.results $sample.genes.results.
5. Generate a matrix of all counts by typing rsem-generate-data-matrix *[genes/isoforms.results] > $OUTPUT.
Map RNA-seq to the Trinity de novo assembly using RSEM and bowtie.
1. Install or load Trinity²² v.2.8.5, Bowtie²⁵ v. 1.0.0, and RSEM v. 1.3.0.
2. Map reads and calculate expression for each sample by typing [trinity_location]/align_and_estimate_abundance.pl --prep-reference --transcripts $TRINITY.FASTA --seqType fq --left $READ1.FASTQ --right $READ2.FASTQ --est_method RSEM --aln_method bowtie --trinity_mode --output_dir $OUTPUT.
3. Rename the results file to something descriptive using mv RSEM.genes.results $sample.genes.results.
4. Generate a matrix of all counts by typing [trinity_location]/abundance_estimates_to_matrix.pl --est_method RSEM *[genes/isoforms].results

8. Identify genes of interest

NOTE: The following steps can be done with nucleotide or protein FASTA files but work best and are more straightforward with protein sequences. BLAST searches using protein to protein is more likely to give results when searching between different species.

For a reference genome, use the protein FASTA file from STEP 5.2.2 or see Supplemental Materials to generate a custom gene feature GTF.
For a de novo transcriptome, generate a protein FASTA using TransDecoder.
1. Install or load TransDecoder v. 5.5.0 on the computer cluser.
2. Find the longest open reading frame and predicted peptide sequence by typing [Transdecoder location]/TransDecoder.LongOrfs -t $TRINITY.FASTA.
Search NCBI Genbank for homologs in closely related species.
1. Open an internet browser window and go to https://www.ncbi.nlm.nih.gov/genbank/.
2. On the search bar type the name of the gene of interest and the name of closely related species which have been sequenced or genus or phylum. On the left of the search bar select protein then click search.
3. Extract sequences by clicking Send to and then select File. Under Format, select FASTA then click Create File.
4. Move FASTA file of homologs to the computer cluster by typing scp $FASTA username@clusterlocation:/$DIR on a local terminal window or use FileZilla to transfer files to and from computer and cluster.
Search for candidate genes using BLAST+²⁶.
1. Install or load BLAST+ v. 2.8.1 on the computer cluster.
2. On the computer cluster, make a BLAST database from the genome or transcriptome translated protein FASTA by typing [BLAST+ location]/makeblastdb -in $PEP.FASTA -dbtype prot -out $OUTPUT
3. BLAST the homologous gene sequences from NCBI to the database of the species of interest by typing [BLAST+ location]/blastp -db $DATABASE -query $FASTA -evalue 1e-10 -outfmt 6 -max_target_seqs 1 -out $OUTPUT.
4. View the output file using the command more. Copy unique gene IDs from the species of interest to a new text file.
5. Extract the sequences of candidate genes by typing perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' $gene_id.txt $PEP.FASTA > $OUTPUT.
Confirm gene annotation using reciprocal BLAST.
1. On the internet browser go to https://blast.ncbi.nlm.nih.gov/Blast.cgi.
2. Select tblastn, then paste the candidate sequences, select the Non-redundant protein sequence database and click BLAST.
Identify additional genes by annotating all genes in the genome or transcriptome with gene ontology (GO) terms (see discussion).
1. Transfer the protein FASTA to the local computer.
2. Download and install Blast2GO²⁷^,²⁸^,²⁹ v. 5.2 to the local computer.
3. Open Blast2GO, click File, go to Load, go to Load Sequences, click Load Fasta File (fasta). Select the FASTA file and click Load.
4. Click on Blast, choose NCBI Blast, and click Next. Edit parameters or click Next, edit parameters and click Run to find the most similar gene description.
5. Click mapping then click Run to search Gene Ontology annotations for similar proteins.
6. Next click interpro, select EMBL-EBI InterPro, and click Next. Edit parameters or click Next, and click Run to search for signatures of known gene families and domains.
7. Export the annotations by clicking File, select Export, click Export Table. Click Browse, name the file, click Save, click Export.
8. Search the annotation table for GO terms of interest to identify additional candidate genes. Extract the sequences from the FASTA file (STEP 8.4.5)

9. Phylogenetic trees

Download and install MEGA³⁰ v. 7.0.26 to your local computer.
Open MEGA, click on Align, click Edit/Build Alignment, select Create a new alignment click OK, select Protein.
When the alignment window opens, click on Edit, click Insert sequences from file and select the FASTA with protein sequences of candidate genes and probable homologs.
Select all sequences. Find the arm symbol and hover over it. It should say Align sequences using MUSCLE³¹ algorithm. Click on the arm symbol and then click Align Protein to align the sequences. Edit parameters or click OK to align using default parameters.
Visually inspect and make any manual changes then Save and close the alignment window.
In the main MEGA window, click on Models, click Find Best DNA/Protein models (ML), select the alignment file and select corresponding parameters such as: Analysis: Model Selection (ML), Tree to use: Automatic (neighbor-joining tree), Statistical Method: Maximum Likelihood, Substitution Type: Amino Acid, Gap/missing data treatment: Use all sites, Branch site filter: None.
Once the best model for the data is determined, go to the main MEGA window. Click Phylogeny and click Contruct/Test Maximum Likelihood Tree and then select the alignment, if necessary. Select the appropriate parameters for the tree: Statistical method: Maximum Likelihood, Test of Phylogeny: Bootstrap method with 100 replicates, substitution type: amino acid, model: LG with Freqs. (+F), rates among sites: gamma distributed (G) with 5 discrete gamma categories, gap/missing data treatment: use all sites, ML heuristic method: Nearest-Neighbor-Interchange (NNI).

10. Visualize gene expression using TPM

For Trinity, on the computer cluster go to the directory where abundance_estimates_to_matrix.pl was run and one of the outputs should be matrix.TPM.not_cross_norm. Transfer this file to your local computer.
NOTE: See Supplemental Materials for cross sample normalization.
For TPMs from a genome analysis follow the steps below.
1. On the computer cluster, go to the RSEM installation location. Copy rsem-generate-data-matrix by typing scp rsem-generate-data-matrix rsem-generate-TPM-matrix. Use nano to edit the new file and change “my $offsite = 4” from 4 to 5 for TPM, it should now read “my $offsite = 5”.
Go to the directory where the RSEM output files .genes.results are and now use rsem-generate-TPM-matrix *[genes/isoforms.results] > $OUTPUT to generate a TPM matrix. Transfer results to a local computer.
Visualize the results in ggplot2.
1. Download R v. 4.0.0 and RStudio v. 1.2.1335 to a local computer.
2. Open RStudio on the right of the screen go to the Packages tab and click Install. Type ggplot2 and click install.
3. On the R script window read in the TPM table by typing data<-read.table("$tpm.txt",header = T)
4. For bar graphs similar to Figure 4 type something similar to: p<- ggplot() + geom_bar(aes(y=TPM, x=Symbol, fill=Tissue), data=data, stat="identity")
  fill<-c("#d7191c","#fdae61", "#ffffbf", "#abd9e9", "#2c7bb6")
  p<-p+scale_fill_manual(values=fill)
  p + theme(axis.text.x = element_text(angle = 90))

Results

The methods above are summarized in Figure 1 and were applied to a data set of Hydra vulgaris tissues. H. vulgaris is a fresh-water invertebrate that belongs to the phylum Cnidaria which also includes corals, jellyfish, and sea anemones. H. vulgaris can reproduce asexually by budding and they can regenerate their head and foot when bisected. In this study, we aimed to investigate the evolution and expression of opsin genes in Hydra⁷. While Hydra lack eyes, they exhibit light-dependent behavior³². Opsin genes encode proteins that are important in vision to detect different wavelengths of light and begin the phototransduction cascade. Investigating the molecular evolution and expression of this gene family in a basal species can provide insight into the evolution of eyes and light detection in animals.

We generated a guided assembly using the Hydra 2.0³³ reference genome and publicly available RNA-seq data (GEO accession GSE127279) Figure 1. This step took approximately 3 days. Although we did not generate a de novo transcriptome in this case, a Trinity assembly can take up to 1 week to generate and each library can take a few hours for read mapping depending on the mapper. The merged Hydra assembly (~50,000 transcripts) was annotated using Blast2GO which took about 1-week Figure 1. Sequences for opsin-related genes were extracted into a fasta file. Sequences for opsin genes from other species were also extracted from NCBI GenBank. We used opsins from cnidarians Podocoryna carnea, Cladonema radiatum, Tripedelia cystophora, and Nematostella vectensis, and we also included outgroups Mnemiopsis leidyi,Trichoplax adhaerens, Drosophila melanogaster and Homo sapiens. Opsin genes were aligned in MEGA7 Figure 2. By viewing the alignment, we were able to identify Hydra opsins that were missing a conserved lysine amino acid necessary to bind a light sensitive molecule. After visual inspection, we determined the best model by doing a model selection analysis. We generated a maximum-likelihood tree using the model LG + G + F with bootstrap value of 100 Figure 3. For 149 opsin genes, the tree was finished in approximately 3 days. The phylogeny suggests opsin genes are evolving by lineage-specific duplications in cnidarians and potentially by tandem duplication in H. vulgaris⁷.

We performed a differential expression analysis in edgeR and looked at absolute expression of opsin genes. We hypothesized that one or more opsins would be upregulated in the head (hypostome) and performed pair-wise comparisons of hypostome versus the body column, budding zone, foot and tentacles. As an example of a pair-wise comparison, 1,774 transcripts were differentially expressed between the hypostome and body column. We determined the genes that were upregulated across multiple comparisons and did a functional enrichment in Blast2GO Table 1. Grouping of G-protein coupled receptor activity included opsin genes. Finally, we looked at the absolute expression of opsin genes in different tissues, during budding and during regeneration by plotting their TPM values using ggplot Figure 4. Using the methods outlined here, we identified 2 opsin genes that did not group with the other opsins in the phylogeny, found one opsin that was expressed almost 200 times more than others, and we found a few opsin genes co-expressed with phototransduction genes that may be used for light detection.

figure-results-3999
Figure 1: Workflow schematic. Programs used to analyze data on the computer cluster are in blue, in magenta are those that we used on a local computer and in orange is a web-based program. (1) Trim RNA-seq reads using trimmomatic v. 0.35. If a genome is available but gene models are missing, generate a guided assembly using STAR v. 2.6.0c and StringTie v. 1.3.4d. (Optional see Supplemental Materials) (2) Without a reference genome, use trimmed reads to make a de novo assembly using Trinity v 2.8.5. (3) To quantify gene expression using a reference genome, map reads using STAR and quantify using RSEM v. 1.3.1. Extract TPMs using RSEM and visualize them in RStudio. (4) Bowtie and RSEM can be used to map and quantify reads mapped to a trinity transcriptome. A Trinity script can be used to generate a TPM matrix to visualize counts in RStudio. (5) Use web-based NCBI BLAST and command-line BLAST+ to search for homologous sequences and confirm using reciprocal BLAST. Annotate genes further using Blast2GO. Use MEGA to align genes and generate a phylogenetic tree using the best fit model. Please click here to view a larger version of this figure.

figure-results-5431
Figure 2: Example of aligned genes. Snapshot shows a portion of Hydra opsin genes aligned using MUSCLE. The arrow indicates the location of a retinal-binding conserved lysine. Please click here to view a larger version of this figure.

figure-results-5942
Figure 3: Cnidarian opsin phylogenetic tree. Maximum-likelihood tree generated in MEGA7 using opsin sequences from Hydra vulgaris, Podocoryna carnea, Cladonema radiatum, Tripedelia cystophora, Nematostella vectensis, Mnemiopsis leidyi, Trichoplax adhaerens, Drosophila melanogaster and Homo sapiens. Please click here to view a larger version of this figure.

figure-results-6640
Figure 4: Expression of Opsin genes in Hydra vulgaris. (A) Expression in transcripts per million (TPM) of Hydra vulgaris opsin genes in the body column, budding zone, foot, hypostome and tentacles. (B) Expression of opsin genes during different stages of Hydra budding. (C) Expression of opsin genes of the Hydra hypostome during different time points of regeneration. Please click here to view a larger version of this figure.

GO ID	GO Name	GO Category	FDR
GO:0004930	G-protein coupled receptor activity	MOLECULAR FUNCTION	0.0000000000704
GO:0007186	G-protein coupled receptor signaling pathway	BIOLOGICAL PROCESS	0.00000000103
GO:0016055	Wnt signaling pathway	BIOLOGICAL PROCESS	0.0000358
GO:0051260	protein homooligomerization	BIOLOGICAL PROCESS	0.000376
GO:0004222	metalloendopeptidase activity	MOLECULAR FUNCTION	0.000467
GO:0008076	voltage-gated potassium channel complex	CELLULAR COMPONENT	0.000642
GO:0005249	voltage-gated potassium channel activity	MOLECULAR FUNCTION	0.00213495
GO:0007275	multicellular organism development	BIOLOGICAL PROCESS	0.00565048
GO:0006813	potassium ion transport	BIOLOGICAL PROCESS	0.01228182
GO:0018108	peptidyl-tyrosine phosphorylation	BIOLOGICAL PROCESS	0.02679662

Table 1: Functional enrichment of genes upregulated in the hypostome

Supplemental Materials. Please click here to download these materials.

Discussion

The purpose of this protocol is to provide an outline of the steps for characterizing a gene family using RNA-seq data. These methods have been proven to work for a variety of species and datasets⁴^,³⁴^,³⁵. The pipeline established here has been simplified and should be easy enough to be followed by a novice in bioinformatics. The significance of the protocol is that it outlines all the steps and necessary programs to complete a publishable analysis. A crucial step in the protocol is having properly assembled full length transcripts, this comes from high quality genomes or transcriptomes. To obtain proper transcripts, one needs high quality RNA and/or DNA and good annotations discussed below.

For RNA-seq library preparation, we include list kits that worked for small body parts of Hydra¹⁹ and butterflies¹⁸ (Table of Materials). We note that for low input RNA we used a modified protocol approach³⁶. Methods for RNA extraction have been compared in multiple sample types including yeast cells¹⁷, neuroblastoma³⁷, plants³⁸, and insect larvae¹⁶ to name a few. We recommend the reader acquire a protocol that works for their species of interest, if any exist, or troubleshoot using commonly commercially available kits to start. For proper gene quantification, we recommend treating the RNA sample with DNase. The presence of DNA will affect proper gene quantification. We also recommend using a cDNA library prep kit that includes a polyA tail selection to select for mature mRNA. While rRNA depletion results in more read depth, the percentage of exon coverage is much lower than the exon coverage of RNA using polyA+ selection³⁹. Finally, when possible it is best to use paired-end and stranded⁴⁰^,⁴¹. In the protocol above the read mapping commands will have to be modified when using single end reads.

As mentioned above it is important to be able to identify genes of interest and also to differentiate between recent gene duplications, alternative splicing, and haplotypes in sequencing. In some instances, having a reference genome can help by determining where genes and exons are located relative to each other. One thing to note is that if a transcriptome is obtained from a public database and is not high quality, it may be best to generate using Trinity⁴² and combining RNA-seq libraries from tissues of interest. Likewise, if a reference genome does not have good gene models, RNA-seq libraries can be used to generate new GTFs using StringTie⁴³(see Supplemental Materials). In addition, in cases where genes are incomplete and there is access to a genome, genes can be manually edited using homolog sequences then aligned to the genome using tblastn. The BLAST output can be used to determine the actual sequence, which may be different from the correction done using homologs. If there is no match, leave the sequence as was originally. When checking output pay attention to the genome coordinates to make sure the missing exon is indeed part of the gene.

Although we focus on software and programs that we used, modifications to this protocol exist due to many programs available which might work better for different datasets. As an example, we show commands for mapping reads to the transcriptome using bowtie and RSEM, but Trinity now has the option for much faster aligners such as kallisto⁴⁴ and salmon⁴⁵. Similarly, we describe annotations using Blast2GO (now OmicsBox) but there are other mapper tools that can be found free and online. Some that we have tried include: GO FEAT⁴⁶, eggNOG-mapper⁴⁷^,⁴⁸, and a very fast aligner PANNZER2⁴⁹. To use these web-based annotation tools simply upload the peptide FASTA and submit. Standalone versions of PANNZER and eggNOG-mapper are also available to be downloaded to the computer cluster. Another modification is that we used MEGA and R on a local computer and used the online NCBI BLAST tool to do reciprocal BLASTs however all of these programs can be used on the computer cluster by downloading the necessary programs and databases. Likewise, aligners kallisto and salmon can be used on a local computer as long as a user has enough RAM and storage. However, FASTQ and FASTA files tend to be very large and we highly recommend using a computer cluster for ease and speed. In addition, while we provide instructions and links to download programs from their developers many of them can be installed from bioconda: https://anaconda.org/bioconda.

A common problem faced when doing bioinformatic analyses is shell scripts failing. This can be due to a variety of reasons. If an error file is created, these error file should be checked before troubleshooting. A few common reasons for an error are typos, missing key parameters, and compatibility issues between software versions. In this protocol, we include parameters for the data, but software manuals can provide more detailed guidelines for individual parameters. In general, it is best to use the most up to date versions of software and to consult the manual corresponding to that version.

Enhancements to this protocol include doing a transcriptome-wide differential expression analysis and functional enrichment analysis. We recommend edgeR⁵⁰ for differential expression analysis a package available in Bioconductor. For functional enrichment analysis, we have used Blast2GO²⁹ and web-based DAVID⁵¹^,⁵². We also recommend further editing the phylogeny by extracting it as a newick file and using web-based iTOL⁵³. Furthermore, while this protocol will investigate the molecular evolution and expression patterns of genes, additional experiments can be used to validate gene or protein locations and functions. mRNA expression can be confirmed by RT-qPCR or in situ hybridization. Proteins can be localized using immunohistochemistry. Depending on the species, knockout experiments can be used to confirm gene function. This protocol can be used for a variety of objectives including, as shown above, to explore a gene family typically associated with photoreception in a basal species⁷. Another application of these methods is to identify changes in a conserved pathway under different selective pressures. As an example, these methods were used to discover variation in the expression of vision transient receptor potential channels between diurnal butterflies and nocturnal moths³⁴.

Disclosures

The authors have nothing to disclose.

Acknowledgements

We thank Adriana Briscoe, Gil Smith, Rabi Murad and Aline G. Rangel for advice and guidance in incorporating some of these steps into our workflow. We are also grateful to Katherine Williams, Elisabeth Rebboah, and Natasha Picciani for comments on the manuscript. This work was supported in part by a George E. Hewitt Foundation for Medical research fellowship to A.M.M.

Materials

Name	Company	Catalog Number	Comments
Bioanalyzer-DNA kit	Agilent	5067-4626	wet lab materials
Bioanalyzer-RNA kit	Agilent	5067-1513	wet lab materials
BLAST+ v. 2.8.1			On computer cluster* https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
Blast2GO (on your PC)			On local computer https://www.blast2go.com/b2g-register-basic
boost v. 1.57.0			On computer cluster
Bowtie v. 1.0.0			On computer cluster https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.3.0/
Computing cluster (highly recommended)			NOTE: Analyses of genomic data are best done on a high-performance computing cluster because files are very large.
Cufflinks v. 2.2.1			On computer cluster
edgeR v. 3.26.8 (in R)			In Rstudio https://bioconductor.org/packages/release/bioc/html/edgeR.html
gcc v. 6.4.0			On computer cluster
Java v. 11.0.2			On computer cluster
MEGA7 (on your PC)			On local computer https://www.megasoftware.net
MEGAX v. 0.1			On local computer https://www.megasoftware.net
NucleoSpin RNA II kit	Macherey-Nagel	740955.5	wet lab materials
perl 5.30.3			On computer cluster
python			On computer cluster
Qubit 2.0 Fluorometer	ThermoFisher	Q32866	wet lab materials
R v.4.0.0			On computer cluster https://cran.r-project.org/src/base/R-4/
RNAlater	ThermoFisher	AM7021	wet lab materials
RNeasy kit	Qiagen	74104	wet lab materials
RSEM v. 1.3.0			Computer software https://deweylab.github.io/RSEM/
RStudio v. 1.2.1335			On local computer https://rstudio.com/products/rstudio/download/#download
Samtools v. 1.3			Computer software
SRA Toolkit v. 2.8.1			On computer cluster https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
STAR v. 2.6.0c			On computer cluster https://github.com/alexdobin/STAR
StringTie v. 1.3.4d			On computer cluster https://ccb.jhu.edu/software/stringtie/
Transdecoder v. 5.5.0			On computer cluster https://github.com/TransDecoder/TransDecoder/releases
Trimmomatic v. 0.35			On computer cluster http://www.usadellab.org/cms/?page=trimmomatic
Trinity v.2.8.5			On computer cluster https://github.com/trinityrnaseq/trinityrnaseq/releases
TRIzol	ThermoFisher	15596018	wet lab materials
TruSeq RNA Library Prep Kit v2	Illumina	RS-122-2001	wet lab materials
TURBO DNA-free Kit	ThermoFisher	AM1907	wet lab materials

*Downloads and installation on the computer cluster may require root access. Contact your network administrator.

References

Lespinet, O., Wolf, Y. I., Koonin, E. V., Aravind, L. The role of lineage-specific gene family expansion in the evolution of eukaryotes. Genome Research. 12 (7), 1048-1059 (2002).
Gabaldón, T., Koonin, E. V. Functional and evolutionary implications of gene orthology. Nature Reviews Genetics. 14 (5), 360-366 (2013).
Dolinski, K., Botstein, D. Orthology and Functional Conservation in Eukaryotes. Annual Review of Genetics. 41 (1), (2007).
Macias-Muñoz, A., McCulloch, K. J., Briscoe, A. D. Copy number variation and expression analysis reveals a non-orthologous pinta gene family member involved in butterfly vision. Genome Biology and Evolution. 9 (12), 3398-3412 (2017).
Cannon, S. B., Mitra, A., Baumgarten, A., Young, N. D., May, G. The roles of segmental and tandem gene duplication in the evolution of large gene families in Arabidopsis thaliana. BMC plant biology. 4, 10 (2004).
Eastman, S. D., Chen, T. H. P., Falk, M. M., Mendelson, T. C., Iovine, M. K. Phylogenetic analysis of three complete gap junction gene families reveals lineage-specific duplications and highly supported gene classes. Genomics. 87 (2), 265-274 (2006).
Macias-Munõz, A., Murad, R., Mortazavi, A. Molecular evolution and expression of opsin genes in Hydra vulgaris. BMC Genomics. 20 (1), 1-19 (2019).
Hisatomi, O., Tokunaga, F. Molecular evolution of proteins involved in vertebrate phototransduction. Comparative Biochemistry and Physiology - B Biochemistry and Molecular Biology. 133 (4), 509-522 (2002).
Arendt, D. Evolution of eyes and photoreceptor cell types. International Journal of Developmental Biology. 47, 563-571 (2003).
Shichida, Y., Matsuyama, T. Evolution of opsins and phototransduction. Philosophical Transactions of the Royal Society B: Biological Sciences. 364 (1531), 2881-2895 (2009).
Porter, M. L., et al. Shedding new light on opsin evolution. Proceedings of the Royal Society B: Biological Sciences. 279 (1726), 3-14 (2012).
Plachetzki, D. C., Degnan, B. M., Oakley, T. H. The origins of novel protein interactions during animal opsin evolution. PLoS ONE. 2 (10), 1054 (2007).
Ramirez, M. D., et al. The last common ancestor of most bilaterian animals possessed at least nine opsins. Genome Biology and Evolution. 8 (12), 3640-3652 (2016).
Schnitzler, C. E., et al. Genomic organization, evolution, and expression of photoprotein and opsin genes in Mnemiopsis leidyi: a new view of ctenophore photocytes. BMC Biology. 10, 107 (2012).
Pedersen, K. B., Williams, A., Watt, J., Ronis, M. J. Improved method for isolating high-quality RNA from mouse bone with RNAlater at room temperature. Bone Reports. 11, 100211 (2019).
Ridgeway, J. A., Timm, A. E., Fallon, A. Comparison of RNA isolation methods from insect larvae. Journal of Insect Science. 14 (1), 4-8 (2014).
Scholes, A. N., Lewis, J. A. Comparison of RNA isolation methods on RNA-Seq: Implications for differential expression and meta-Analyses. BMC Genomics. 21 (1), 1-9 (2020).
Briscoe, A. D., et al. Female behaviour drives expression and evolution of gustatory receptors in butterflies. PLoS genetics. 9 (7), 1003620 (2013).
Murad, R., Macias-Muñoz, A., Wong, A., Ma, X., Mortazavi, A. Integrative analysis of Hydra head regeneration reveals activation of distal enhancer-like elements. bioRxiv. , 544049 (2019).
Gallego Romero, I., Pai, A. A., Tung, J., Gilad, Y. Impact of RNA degradation on measurements of gene expression. BMC Biology. 12, 42 (2014).
Bolger, A. M., Lohse, M., Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 30 (15), 2114-2120 (2014).
Trinity. . RNA-Seq De novo Assembly Using Trinity. , 1-7 (2014).
Dobin, A., et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 29, 15-21 (2013).
Li, B., Dewey, C. N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC bioinformatics. 12, 323 (2011).
Langmead, B., Trapnell, C., Pop, M., Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 10, 25 (2009).
Camacho, C., et al. BLAST+: architecture and applications. BMC Bioinformatics. 10, 421 (2009).
Conesa, A., Götz, S. Blast2GO: A comprehensive suite for functional analysis in plant genomics. International Journal of Plant Genomics. 619832, (2008).
Conesa, A., et al. Blast2GO: A universal tool for annotation, visualization and analysis in functional genomics research. Bioinformatics. 21 (18), 3674-3676 (2005).
Götz, S., et al. High-throughput functional annotation and data mining with the Blast2GO suite. Nucleic Acids Research. 36 (10), 3420-3435 (2008).
Kumar, S., Stecher, G., Tamura, K. MEGA7: Molecular Evolutionary Genetics Analysis version 7.0 for bigger datasets. Molecular biology and evolution. 33 (7), 1870-1874 (2016).
Edgar, R. C. MUSCLE: Multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 32 (5), 1792-1797 (2004).
Taddei-Ferretti, C., Musio, C., Santillo, S., Cotugno, A. The photobiology of Hydra's periodic activity. Hydrobiologia. 530, 129-134 (2004).
Chapman, J. A., et al. The dynamic genome of Hydra. Nature. 464 (7288), 592-596 (2010).
Macias-Muñoz, A., Rangel Olguin, A. G., Briscoe, A. D. Evolution of phototransduction genes in Lepidoptera. Genome Biology and Evolution. 11 (8), 2107-2124 (2019).
Macias-Munõz, A., Murad, R., Mortazavi, A. Molecular evolution and expression of opsin genes in Hydra vulgaris. BMC Genomics. 20 (1), (2019).
Picelli, S., et al. Full-length RNA-seq from single cells using Smart-seq2. Nature Protocols. 9 (1), 171-181 (2014).
Tavares, L., Alves, P. M., Ferreira, R. B., Santos, C. N. Comparison of different methods for DNA-free RNA isolation from SK-N-MC neuroblastoma. BMC research notes. 4, 3 (2011).
Johnson, M. T. J., et al. Evaluating Methods for Isolating Total RNA and Predicting the Success of Sequencing Phylogenetically Diverse Plant Transcriptomes. PLoS ONE. 7 (11), (2012).
Zhao, S., Zhang, Y., Gamini, R., Zhang, B., Von Schack, D. Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: PolyA+ selection versus rRNA depletion. Scientific Reports. 8 (1), 1-12 (2018).
Zhao, S., et al. Comparison of stranded and non-stranded RNA-seq transcriptome profiling and investigation of gene overlap. BMC Genomics. 16 (1), 1-14 (2015).
Corley, S. M., MacKenzie, K. L., Beverdam, A., Roddam, L. F., Wilkins, M. R. Differentially expressed genes from RNA-Seq and functional enrichment results are affected by the choice of single-end versus paired-end reads and stranded versus non-stranded protocols. BMC Genomics. 18 (1), 1-13 (2017).
Haas, B. J., et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature Protocols. 8 (8), 1494-1512 (2013).
Pertea, M., et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nature biotechnology. 33 (3), 290-295 (2015).
Bray, N. L., Pimentel, H., Melsted, P., Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology. 34 (5), 525-527 (2016).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods. 14 (4), 417-419 (2017).
Araujo, F. A., Barh, D., Silva, A., Guimarães, L., Thiago, R. . OPEN GO FEAT a rapid web-based functional annotation tool for genomic and transcriptomic data. , 8-11 (2018).
Huerta-Cepas, J., et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Molecular Biology and Evolution. 34 (8), 2115-2122 (2017).
Huerta-Cepas, J., et al. EggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research. 47, 309-314 (2019).
Törönen, P., Medlar, A., Holm, L. PANNZER2: A rapid functional annotation web server. Nucleic Acids Research. 46, 84-88 (2018).
Robinson, M., Mccarthy, D., Chen, Y., Smyth, G. K. . edgeR differential expression analysis of digital gene expression data User's Guide. , (2013).
Huang, D. W., Sherman, B. T., Lempicki, R. A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 4 (1), 44-57 (2009).
Huang, D. W., Sherman, B. T., Lempicki, R. A. Bioinformatics enrichment tools: Paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Research. 37 (1), 1-13 (2009).
Letunic, I., Bork, P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic acids research. 44, 242-245 (2016).

Reprints and Permissions

Request permission to reuse the text or figures of this JoVE article

Request Permission

Explore More Articles

Bioinformatics Pipeline Molecular Evolution Gene Expression RNA seq Shell Scripts SRA Toolkit FASTQ Files Reference Genome GTF File Protein FASTA File Gene Annotation BLAST Database Homologous Genes NCBI GenBank Reciprocal BLAST MEGA Software

This article has been published

Video Coming Soon

Keep me updated: