Method Article
The purpose of this protocol is to investigate the evolution and expression of candidate genes using RNA sequencing data.
Distilling and reporting large datasets, such as whole genome or transcriptome data, is often a daunting task. One way to break down results is to focus on one or more gene families that are significant to the organism and study. In this protocol, we outline bioinformatic steps to generate a phylogeny and to quantify the expression of genes of interest. Phylogenetic trees can give insight into how genes are evolving within and between species as well as reveal orthology. These results can be enhanced using RNA-seq data to compare the expression of these genes in different individuals or tissues. Studies of molecular evolution and expression can reveal modes of evolution and conservation of gene function between species. The characterization of a gene family can serve as a springboard for future studies and can highlight an important gene family in a new genome or transcriptome paper.
Advances in sequencing technologies have facilitated the sequencing of genomes and transcriptomes of non-model organisms. In addition to the increased feasibility of sequencing DNA and RNA from many organisms, an abundance of data is publicly available to study genes of interest. The purpose of this protocol is to provide bioinformatic steps to investigate the molecular evolution and expression of genes that may play an important role in the organism of interest.
Investigating the evolution of a gene or gene family can provide insight into the evolution of biological systems. Members of a gene family are typically determined by identifying conserved motifs or homologous gene sequences. Gene family evolution was previously investigated using genomes from distantly related model organisms1. A limitation to this approach is that it is not clear how these gene families evolve in closely related species and the role of different environmental selective pressures. In this protocol, we include a search for homologs in closely related species. By generating a phylogeny at a phylum level, we can note trends in gene family evolution such as that of conserved genes or lineage-specific duplications. At this level, we can also investigate whether genes are orthologs or paralogs. While many homologs likely function similarly to each other, that is not necessarily the case2. Incorporating phylogenetic trees in these studies is important to resolve whether these homologous genes are orthologs or not. In eukaryotes, many orthologs retain similar functions within the cell as evidenced by the ability of mammalian proteins to restore the function of yeast orthologs3. However, there are instances where a non-orthologous gene carries out a characterized function4.
Phylogenetic trees begin to delineate relationships between genes and species, yet function cannot be assigned solely based on genetic relationships. Gene expression studies combined with functional annotations and enrichment analysis provide strong support for gene function. Cases where gene expression can be quantified and compared across individuals or tissue types can be more telling of potential function. The following protocol follows methods used in investigating opsin genes in Hydra vulgaris7, but they can be applied to any species and any gene family. The results of such studies provide a foundation for further investigation into gene function and gene networks in non-model organisms. As an example, the investigation of the phylogeny of opsins, which are proteins that initiate the phototransduction cascade, gives context to the evolution of eyes and light detection8,9,10,11. In this case, non-model organisms especially basal animal species such as cnidarians or ctenophores can elucidate conservation or changes in the phototransduction cascade and vision across clades12,13,14. Similarly, determining the phylogeny, expression, and networks of other gene families will inform us about the molecular mechanisms underlying adaptations.
This protocol follows UC Irvine animal care guidelines.
1. RNA-seq library preparation
2. Access a computer cluster
NOTE: RNA-seq analysis requires manipulation of large files and is best done on a computer cluster (Table of Materials).
3. Obtain RNA-seq reads
4. Trim adapters and low-quality reads (optional)
5. Obtain reference assembly
6. Generate a de novo assembly (Alternative to Step 5)
7. Map reads to the genome (7.1) or de novo transcriptome (7.2)
8. Identify genes of interest
NOTE: The following steps can be done with nucleotide or protein FASTA files but work best and are more straightforward with protein sequences. BLAST searches using protein to protein is more likely to give results when searching between different species.
9. Phylogenetic trees
10. Visualize gene expression using TPM
The methods above are summarized in Figure 1 and were applied to a data set of Hydra vulgaris tissues. H. vulgaris is a fresh-water invertebrate that belongs to the phylum Cnidaria which also includes corals, jellyfish, and sea anemones. H. vulgaris can reproduce asexually by budding and they can regenerate their head and foot when bisected. In this study, we aimed to investigate the evolution and expression of opsin genes in Hydra7. While Hydra lack eyes, they exhibit light-dependent behavior32. Opsin genes encode proteins that are important in vision to detect different wavelengths of light and begin the phototransduction cascade. Investigating the molecular evolution and expression of this gene family in a basal species can provide insight into the evolution of eyes and light detection in animals.
We generated a guided assembly using the Hydra 2.033 reference genome and publicly available RNA-seq data (GEO accession GSE127279) Figure 1. This step took approximately 3 days. Although we did not generate a de novo transcriptome in this case, a Trinity assembly can take up to 1 week to generate and each library can take a few hours for read mapping depending on the mapper. The merged Hydra assembly (~50,000 transcripts) was annotated using Blast2GO which took about 1-week Figure 1. Sequences for opsin-related genes were extracted into a fasta file. Sequences for opsin genes from other species were also extracted from NCBI GenBank. We used opsins from cnidarians Podocoryna carnea, Cladonema radiatum, Tripedelia cystophora, and Nematostella vectensis, and we also included outgroups Mnemiopsis leidyi,Trichoplax adhaerens, Drosophila melanogaster and Homo sapiens. Opsin genes were aligned in MEGA7 Figure 2. By viewing the alignment, we were able to identify Hydra opsins that were missing a conserved lysine amino acid necessary to bind a light sensitive molecule. After visual inspection, we determined the best model by doing a model selection analysis. We generated a maximum-likelihood tree using the model LG + G + F with bootstrap value of 100 Figure 3. For 149 opsin genes, the tree was finished in approximately 3 days. The phylogeny suggests opsin genes are evolving by lineage-specific duplications in cnidarians and potentially by tandem duplication in H. vulgaris7.
We performed a differential expression analysis in edgeR and looked at absolute expression of opsin genes. We hypothesized that one or more opsins would be upregulated in the head (hypostome) and performed pair-wise comparisons of hypostome versus the body column, budding zone, foot and tentacles. As an example of a pair-wise comparison, 1,774 transcripts were differentially expressed between the hypostome and body column. We determined the genes that were upregulated across multiple comparisons and did a functional enrichment in Blast2GO Table 1. Grouping of G-protein coupled receptor activity included opsin genes. Finally, we looked at the absolute expression of opsin genes in different tissues, during budding and during regeneration by plotting their TPM values using ggplot Figure 4. Using the methods outlined here, we identified 2 opsin genes that did not group with the other opsins in the phylogeny, found one opsin that was expressed almost 200 times more than others, and we found a few opsin genes co-expressed with phototransduction genes that may be used for light detection.
Figure 1: Workflow schematic. Programs used to analyze data on the computer cluster are in blue, in magenta are those that we used on a local computer and in orange is a web-based program. (1) Trim RNA-seq reads using trimmomatic v. 0.35. If a genome is available but gene models are missing, generate a guided assembly using STAR v. 2.6.0c and StringTie v. 1.3.4d. (Optional see Supplemental Materials) (2) Without a reference genome, use trimmed reads to make a de novo assembly using Trinity v 2.8.5. (3) To quantify gene expression using a reference genome, map reads using STAR and quantify using RSEM v. 1.3.1. Extract TPMs using RSEM and visualize them in RStudio. (4) Bowtie and RSEM can be used to map and quantify reads mapped to a trinity transcriptome. A Trinity script can be used to generate a TPM matrix to visualize counts in RStudio. (5) Use web-based NCBI BLAST and command-line BLAST+ to search for homologous sequences and confirm using reciprocal BLAST. Annotate genes further using Blast2GO. Use MEGA to align genes and generate a phylogenetic tree using the best fit model. Please click here to view a larger version of this figure.
Figure 2: Example of aligned genes. Snapshot shows a portion of Hydra opsin genes aligned using MUSCLE. The arrow indicates the location of a retinal-binding conserved lysine. Please click here to view a larger version of this figure.
Figure 3: Cnidarian opsin phylogenetic tree. Maximum-likelihood tree generated in MEGA7 using opsin sequences from Hydra vulgaris, Podocoryna carnea, Cladonema radiatum, Tripedelia cystophora, Nematostella vectensis, Mnemiopsis leidyi, Trichoplax adhaerens, Drosophila melanogaster and Homo sapiens. Please click here to view a larger version of this figure.
Figure 4: Expression of Opsin genes in Hydra vulgaris. (A) Expression in transcripts per million (TPM) of Hydra vulgaris opsin genes in the body column, budding zone, foot, hypostome and tentacles. (B) Expression of opsin genes during different stages of Hydra budding. (C) Expression of opsin genes of the Hydra hypostome during different time points of regeneration. Please click here to view a larger version of this figure.
GO ID | GO Name | GO Category | FDR |
GO:0004930 | G-protein coupled receptor activity | MOLECULAR FUNCTION | 0.0000000000704 |
GO:0007186 | G-protein coupled receptor signaling pathway | BIOLOGICAL PROCESS | 0.00000000103 |
GO:0016055 | Wnt signaling pathway | BIOLOGICAL PROCESS | 0.0000358 |
GO:0051260 | protein homooligomerization | BIOLOGICAL PROCESS | 0.000376 |
GO:0004222 | metalloendopeptidase activity | MOLECULAR FUNCTION | 0.000467 |
GO:0008076 | voltage-gated potassium channel complex | CELLULAR COMPONENT | 0.000642 |
GO:0005249 | voltage-gated potassium channel activity | MOLECULAR FUNCTION | 0.00213495 |
GO:0007275 | multicellular organism development | BIOLOGICAL PROCESS | 0.00565048 |
GO:0006813 | potassium ion transport | BIOLOGICAL PROCESS | 0.01228182 |
GO:0018108 | peptidyl-tyrosine phosphorylation | BIOLOGICAL PROCESS | 0.02679662 |
Table 1: Functional enrichment of genes upregulated in the hypostome
Supplemental Materials. Please click here to download these materials.
The purpose of this protocol is to provide an outline of the steps for characterizing a gene family using RNA-seq data. These methods have been proven to work for a variety of species and datasets4,34,35. The pipeline established here has been simplified and should be easy enough to be followed by a novice in bioinformatics. The significance of the protocol is that it outlines all the steps and necessary programs to complete a publishable analysis. A crucial step in the protocol is having properly assembled full length transcripts, this comes from high quality genomes or transcriptomes. To obtain proper transcripts, one needs high quality RNA and/or DNA and good annotations discussed below.
For RNA-seq library preparation, we include list kits that worked for small body parts of Hydra19 and butterflies18 (Table of Materials). We note that for low input RNA we used a modified protocol approach36. Methods for RNA extraction have been compared in multiple sample types including yeast cells17, neuroblastoma37, plants38, and insect larvae16 to name a few. We recommend the reader acquire a protocol that works for their species of interest, if any exist, or troubleshoot using commonly commercially available kits to start. For proper gene quantification, we recommend treating the RNA sample with DNase. The presence of DNA will affect proper gene quantification. We also recommend using a cDNA library prep kit that includes a polyA tail selection to select for mature mRNA. While rRNA depletion results in more read depth, the percentage of exon coverage is much lower than the exon coverage of RNA using polyA+ selection39. Finally, when possible it is best to use paired-end and stranded40,41. In the protocol above the read mapping commands will have to be modified when using single end reads.
As mentioned above it is important to be able to identify genes of interest and also to differentiate between recent gene duplications, alternative splicing, and haplotypes in sequencing. In some instances, having a reference genome can help by determining where genes and exons are located relative to each other. One thing to note is that if a transcriptome is obtained from a public database and is not high quality, it may be best to generate using Trinity42 and combining RNA-seq libraries from tissues of interest. Likewise, if a reference genome does not have good gene models, RNA-seq libraries can be used to generate new GTFs using StringTie43 (see Supplemental Materials). In addition, in cases where genes are incomplete and there is access to a genome, genes can be manually edited using homolog sequences then aligned to the genome using tblastn. The BLAST output can be used to determine the actual sequence, which may be different from the correction done using homologs. If there is no match, leave the sequence as was originally. When checking output pay attention to the genome coordinates to make sure the missing exon is indeed part of the gene.
Although we focus on software and programs that we used, modifications to this protocol exist due to many programs available which might work better for different datasets. As an example, we show commands for mapping reads to the transcriptome using bowtie and RSEM, but Trinity now has the option for much faster aligners such as kallisto44 and salmon45. Similarly, we describe annotations using Blast2GO (now OmicsBox) but there are other mapper tools that can be found free and online. Some that we have tried include: GO FEAT46, eggNOG-mapper47,48, and a very fast aligner PANNZER249. To use these web-based annotation tools simply upload the peptide FASTA and submit. Standalone versions of PANNZER and eggNOG-mapper are also available to be downloaded to the computer cluster. Another modification is that we used MEGA and R on a local computer and used the online NCBI BLAST tool to do reciprocal BLASTs however all of these programs can be used on the computer cluster by downloading the necessary programs and databases. Likewise, aligners kallisto and salmon can be used on a local computer as long as a user has enough RAM and storage. However, FASTQ and FASTA files tend to be very large and we highly recommend using a computer cluster for ease and speed. In addition, while we provide instructions and links to download programs from their developers many of them can be installed from bioconda: https://anaconda.org/bioconda.
A common problem faced when doing bioinformatic analyses is shell scripts failing. This can be due to a variety of reasons. If an error file is created, these error file should be checked before troubleshooting. A few common reasons for an error are typos, missing key parameters, and compatibility issues between software versions. In this protocol, we include parameters for the data, but software manuals can provide more detailed guidelines for individual parameters. In general, it is best to use the most up to date versions of software and to consult the manual corresponding to that version.
Enhancements to this protocol include doing a transcriptome-wide differential expression analysis and functional enrichment analysis. We recommend edgeR50 for differential expression analysis a package available in Bioconductor. For functional enrichment analysis, we have used Blast2GO29 and web-based DAVID51,52. We also recommend further editing the phylogeny by extracting it as a newick file and using web-based iTOL53. Furthermore, while this protocol will investigate the molecular evolution and expression patterns of genes, additional experiments can be used to validate gene or protein locations and functions. mRNA expression can be confirmed by RT-qPCR or in situ hybridization. Proteins can be localized using immunohistochemistry. Depending on the species, knockout experiments can be used to confirm gene function. This protocol can be used for a variety of objectives including, as shown above, to explore a gene family typically associated with photoreception in a basal species7. Another application of these methods is to identify changes in a conserved pathway under different selective pressures. As an example, these methods were used to discover variation in the expression of vision transient receptor potential channels between diurnal butterflies and nocturnal moths34.
The authors have nothing to disclose.
We thank Adriana Briscoe, Gil Smith, Rabi Murad and Aline G. Rangel for advice and guidance in incorporating some of these steps into our workflow. We are also grateful to Katherine Williams, Elisabeth Rebboah, and Natasha Picciani for comments on the manuscript. This work was supported in part by a George E. Hewitt Foundation for Medical research fellowship to A.M.M.
Name | Company | Catalog Number | Comments |
Bioanalyzer-DNA kit | Agilent | 5067-4626 | wet lab materials |
Bioanalyzer-RNA kit | Agilent | 5067-1513 | wet lab materials |
BLAST+ v. 2.8.1 | On computer cluster* https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ | ||
Blast2GO (on your PC) | On local computer https://www.blast2go.com/b2g-register-basic | ||
boost v. 1.57.0 | On computer cluster | ||
Bowtie v. 1.0.0 | On computer cluster https://sourceforge.net/projects/bowtie-bio/files/bowtie/1.3.0/ | ||
Computing cluster (highly recommended) | NOTE: Analyses of genomic data are best done on a high-performance computing cluster because files are very large. | ||
Cufflinks v. 2.2.1 | On computer cluster | ||
edgeR v. 3.26.8 (in R) | In Rstudio https://bioconductor.org/packages/release/bioc/html/edgeR.html | ||
gcc v. 6.4.0 | On computer cluster | ||
Java v. 11.0.2 | On computer cluster | ||
MEGA7 (on your PC) | On local computer https://www.megasoftware.net | ||
MEGAX v. 0.1 | On local computer https://www.megasoftware.net | ||
NucleoSpin RNA II kit | Macherey-Nagel | 740955.5 | wet lab materials |
perl 5.30.3 | On computer cluster | ||
python | On computer cluster | ||
Qubit 2.0 Fluorometer | ThermoFisher | Q32866 | wet lab materials |
R v.4.0.0 | On computer cluster https://cran.r-project.org/src/base/R-4/ | ||
RNAlater | ThermoFisher | AM7021 | wet lab materials |
RNeasy kit | Qiagen | 74104 | wet lab materials |
RSEM v. 1.3.0 | Computer software https://deweylab.github.io/RSEM/ | ||
RStudio v. 1.2.1335 | On local computer https://rstudio.com/products/rstudio/download/#download | ||
Samtools v. 1.3 | Computer software | ||
SRA Toolkit v. 2.8.1 | On computer cluster https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit | ||
STAR v. 2.6.0c | On computer cluster https://github.com/alexdobin/STAR | ||
StringTie v. 1.3.4d | On computer cluster https://ccb.jhu.edu/software/stringtie/ | ||
Transdecoder v. 5.5.0 | On computer cluster https://github.com/TransDecoder/TransDecoder/releases | ||
Trimmomatic v. 0.35 | On computer cluster http://www.usadellab.org/cms/?page=trimmomatic | ||
Trinity v.2.8.5 | On computer cluster https://github.com/trinityrnaseq/trinityrnaseq/releases | ||
TRIzol | ThermoFisher | 15596018 | wet lab materials |
TruSeq RNA Library Prep Kit v2 | Illumina | RS-122-2001 | wet lab materials |
TURBO DNA-free Kit | ThermoFisher | AM1907 | wet lab materials |
*Downloads and installation on the computer cluster may require root access. Contact your network administrator. |
Request permission to reuse the text or figures of this JoVE article
Request PermissionThis article has been published
Video Coming Soon
Copyright © 2025 MyJoVE Corporation. All rights reserved