Method Article
This protocol describes a pipeline for identifying significant molecular changes in cancer, which can support the development of new diagnostic and therapeutic approaches for esophageal squamous cell carcinoma.
Esophageal cancer (EC) ranks as the 8th most aggressive malignancy, and its treatment remains challenging due to the lack of biomarkers facilitating early detection. EC manifests in two major histological forms - adenocarcinoma (EAD) and squamous cell carcinoma (ESCC) - both exhibiting variations in incidence across geographically distinct populations. High-throughput technologies are transforming the understanding of diseases, including cancer. A significant challenge for the scientific community is dealing with scattered data in the literature. To address this, a simple pipeline is proposed for the analysis of publicly available microarray datasets and the collection of differentially regulated molecules between cancer and normal conditions. The pipeline can serve as a standard approach for differential gene expression analysis, identifying genes differentially expressed between cancer and normal tissues or among different cancer subtypes. The pipeline involves several steps, including Data preprocessing (involving quality control and normalization of raw gene expression data to remove technical variations between samples), Differential expression analysis (identifying genes differentially expressed between two or more groups of samples using statistical tests such as t-tests, ANOVA, or linear models), Functional analysis (using bioinformatics tools to identify enriched biological pathways and functions in differentially expressed genes), and Validation (involving validation using independent datasets or experimental methods such as qPCR or immunohistochemistry). Using this pipeline, a collection of differentially expressed molecules (DEMs) can be generated for any type of cancer, including esophageal cancer. This compendium can be utilized to identify potential biomarkers and drug targets for cancer and enhance understanding of the molecular mechanisms underlying the disease. 
Additionally, population-specific screening of esophageal cancer using this pipeline will help identify specific drug targets for distinct populations, leading to personalized treatments for the disease.
It is alarming that EC is the eighth most common cancer worldwide and the sixth leading cause of death worldwide. China, India, and Iran have alarmingly high incidence and mortality rates. There are two main types of EC: esophageal adenocarcinoma (EAC or EAD), and esophageal squamous cell carcinoma (ESCC)1. EAC is more common in the Western world, whereas ESCC is more common in Eastern countries, especially China and Iran2. Several risk factors are associated with EC, including tobacco and alcohol use, obesity, and gastroesophageal reflux disease (GERD). Additionally, dietary factors such as lack of fruits and vegetables and consumption of hot drinks and foods are associated with ESCC risk in high-risk areas. Early diagnosis and treatment are important for improving the outcomes of patients with EC3,4. Therefore, it is important to raise awareness of the risk factors, signs, and symptoms of EC, and to encourage regular screening of high-risk individuals. Furthermore, efforts to address modifiable risk factors, such as tobacco and alcohol use and unhealthy dietary habits, may help reduce the incidence of EC. EAD occurs in the cells of mucus-producing glands in the lower part of the esophagus, near the stomach. It is often associated with GERD, in which stomach acid and contents return into the esophagus. In contrast, ESCC arises from flat, thin cells that line the upper part of the esophagus5. It is more common in areas where tobacco and alcohol use are widespread, such as China and Iran.
Among various conditions related to the esophagus, Barrett's esophagus (BE), a condition in which the lining of the esophagus is replaced by glandular cells, is a known precursor of EAC6. It is worth noting that BE can develop without GERD, but the presence of GERD increases the risk of developing BE by 3- to 5-fold. Additionally, the presence of BE increases the risk of developing EAC by 50- to 100-fold7. Furthermore, hot or spicy foods and liquids have been linked to ESCC, but not to EAC. Understanding the risk factors for EC is important for its prevention and early detection. Efforts to address modifiable risk factors, such as tobacco use, alcohol consumption, obesity, and unhealthy dietary habits, may help reduce the incidence of EC. Furthermore, routine screening and surveillance of high-risk individuals, such as those with dysphagia or BE, may improve outcomes by enabling early detection and treatment.
Omics-driven studies, including genomics, transcriptomics, proteomics, methylomics, miRNAomics, and metabolomics, have contributed greatly to our understanding of ECs, especially ESCC8,9,10,11,12,13. These studies have allowed the identification of novel biomarkers, potential therapeutic targets, and new pathways involved in the development and progression of ESCC. However, the data generated from these studies are scattered throughout the literature, making it difficult for the scientific community to access and use this information. Therefore, it is important to create a repository or database that compiles data obtained from high- or low-throughput studies on specific cancers. Building such a compendium can be streamlined by following some basic guidelines: selecting relevant studies, extracting and organizing data from these studies, and ensuring data quality and consistency. In addition, the compendium should be updated regularly to include new studies and data as they become available. By creating a compendium or database that combines data from different studies, researchers gain a single platform for retrieving and analyzing data on a specific cancer. This will help accelerate research efforts and ultimately lead to more effective treatments and better outcomes for cancer patients.
The development of the cancer compendium incorporates data from both low-throughput and high-throughput studies. This compendium will be a valuable resource for researchers looking to identify potential diagnostic or therapeutic targets for cancer. One way to build this collection is by reviewing microarray studies available in publicly accessible repositories such as Gene Expression Omnibus (GEO). Microarray studies can provide information about gene expression levels in cancer cells, and these data can be used to identify differentially expressed genes (DEGs) that may play a role in cancer development and progression.
However, it should be noted that different studies might have used different methods to analyze their data, which may have led to the identification of different DEGs. Therefore, it is important to carefully review each study and consider any potential bias or limitations when pooling data for the compendium. Once the data is gathered at a common platform, researchers can use it to identify potential molecular targets for further study. These include examining the expression of a particular gene in clinical samples or conducting mechanistic studies to understand how a particular gene or protein is involved in cancer development and progression. Overall, the creation of a cancer data set will be a valuable resource for cancer researchers and help identify new targets for diagnosis and therapeutic interventions.
1. Manual curation of the differentially regulated molecules in ESCC
2. Finding relevant studies using PubMed
3. Finding relevant studies using Gene Expression Omnibus (GEO)
NOTE: Gene Expression Omnibus (GEO) is a freely available repository for storing DNA microarray data. The wealth of data available in GEO is a good resource for data mining to identify differentially regulated molecules between cancer or other diseases and normal conditions.
4. Microarray analysis using GEO2R
NOTE: The first step is to find relevant studies using Boolean operators (AND, OR, NOT) in combination with the keywords 'esophageal squamous cell carcinoma', 'ESCC', or 'oesophageal squamous cell carcinoma'. GEO2R (see Table of Materials) is a freely available R-based tool that is integrated with GEO, enabling users to analyze data from microarray studies in a user-friendly manner. It works with GEO entry IDs and provides an interface for performing complex R-based analysis to identify DEGs, using Bioconductor R packages on the back end. The tool not only transforms the GEO data but also presents its output in the form of .txt tables, which can be further modified according to the users' needs16. GEO2R presents genes in order of statistical significance based on p-value, but the list can also be sorted by log2 fold change. Additionally, users can view gene expression profiles as GEO profile images. Unlike other analysis tools, GEO2R does not depend on curated dataset records and can interrogate the original data submitted by the investigators directly. More than 90% of GEO studies can be analyzed using this method17. The GEO2R workflow, with the steps involved in analyzing microarray data, is shown in Figure 1.
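The Boolean search described in the note can be sketched in a few lines of Python. This is only an illustration of how the three ESCC keywords combine with AND, OR, and NOT; the function name `build_query` and the extra terms "microarray" and "review" are assumptions made for the example and are not part of the protocol.

```python
def build_query(synonyms, required=(), excluded=()):
    """Join synonyms with OR, then append AND / NOT clauses."""
    def quote(term):
        return f'"{term}"'

    query = "(" + " OR ".join(quote(term) for term in synonyms) + ")"
    for term in required:
        query += f" AND {quote(term)}"
    for term in excluded:
        query += f" NOT {quote(term)}"
    return query


# Keywords named in the protocol
escc_terms = [
    "esophageal squamous cell carcinoma",
    "ESCC",
    "oesophageal squamous cell carcinoma",
]

# "microarray" and "review" are hypothetical refinements for illustration
query = build_query(escc_terms, required=["microarray"], excluded=["review"])
print(query)
```

The resulting string can be pasted into the search box of PubMed or GEO and adjusted as needed.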
5. Finding alias for a gene/protein
6. Finding official gene symbol for the DEGs
7. Finding gene locus of the DEGs
8. Finding information about DEGs on the OMIM page
9. Finding protein localization, domain, and motif, and secretory nature of the protein encoded by the gene
10. Cherry picking for the protein for validation, and further assessment for diagnosis or prognosis of the malignancy of interest
NOTE: Once unique molecules are identified, the biggest challenge is validating them. Microarray studies usually report expression at the mRNA level, but for disease diagnosis or prognosis, a readout of protein levels is crucial. To this end, patient samples, patient-derived samples, or cell lines of the same cancer must be screened to determine whether the molecule is actually expressed and whether it can discriminate between cancer and normal tissue, between good and bad prognosis, or between early and late stages of the disease. Western blot, enzyme-linked immunosorbent assay (ELISA), immunoprecipitation, immunohistochemistry, immunocytochemistry, and similar assays are useful techniques for validating a candidate molecule18,19,20. All of these assays require antibodies to detect the antigen present in the samples. Antibodies are costly, so it is always better to select them based on the following points:
As an example, GEO accession GSE161533 was used to study differentially expressed genes in ESCC. Representative results of the analysis are shown in Figure 3. GEO2R generates a volcano plot, which is useful for identifying changes that differ significantly between two groups of experimental subjects. The volcano plot presents the overall gene distribution, with -log10-transformed significance (p-value) on the y-axis and log2-transformed fold change on the x-axis (Figure 3A). Highlighted genes are significantly differentially expressed at the default adj. p-value cutoff of 0.05 (blue = downregulated, red = upregulated).
A mean difference (MD) plot displays log2 fold change against average log2 expression values and is useful for visualizing differentially expressed genes. In the MD plot, log2-transformed fold change is on the y-axis and average log2 expression is on the x-axis (Figure 3B). The highlighted genes are significantly differentially expressed at the default adj. p-value cutoff of 0.05 (blue = downregulated, red = upregulated). Volcano plots share a limitation with MA plots: they display information from only two treatments at once21.
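The axes of the two plots can be made concrete with a short Python sketch. It computes the coordinates described above and the highlight color at the adj. p-value cutoff of 0.05. The gene names and values are hypothetical, and the code illustrates the transformations only; it is not GEO2R's own plotting code.

```python
import math

ADJ_P_CUTOFF = 0.05  # default adj. p-value cutoff quoted in the text


def volcano_point(log2_fc, p_value):
    """(x, y) for a volcano plot: log2 fold change vs. -log10(p-value)."""
    return log2_fc, -math.log10(p_value)


def md_point(log2_fc, avg_log2_expr):
    """(x, y) for an MD plot: average log2 expression vs. log2 fold change."""
    return avg_log2_expr, log2_fc


def highlight(log2_fc, adj_p):
    """red = upregulated, blue = downregulated, none = not highlighted."""
    if adj_p < ADJ_P_CUTOFF and log2_fc > 0:
        return "red"
    if adj_p < ADJ_P_CUTOFF and log2_fc < 0:
        return "blue"
    return "none"


# Hypothetical genes: (name, log2 FC, p-value, adj. p-value, average log2 expression)
genes = [
    ("gene_A", 3.0, 1e-6, 1e-4, 9.5),
    ("gene_B", -2.0, 1e-4, 0.01, 7.0),
    ("gene_C", 0.2, 0.5, 0.8, 8.0),
]
for name, lfc, p, adj_p, avg in genes:
    print(name, volcano_point(lfc, p), md_point(lfc, avg), highlight(lfc, adj_p))
```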
Further, Uniform Manifold Approximation and Projection (UMAP)22 was used to assess the relatedness between ESCC and normal samples (Figure 3C). Although most samples clustered within their respective categories, two ESCC samples grouped with the normal samples.
GEO2R also presents an interactive 2D expression density plot (Figure 3D), which shows the distribution of expression values in the dataset. This plot is useful for determining whether normalization is necessary before DEG analysis. In this plot, the y-axis denotes density and the x-axis denotes intensity for both ESCC (green) and normal (violet) samples.
The distribution of values across the ESCC and normal samples is shown in the box plot. These distributions indicate whether the samples are suitable for differential expression analysis. The median-centered values indicate that the data are normalized and cross-comparable (Figure 3E).
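The box plot check can be approximated numerically by comparing per-sample medians. The sketch below is a rough illustration with hypothetical log2 expression values; the tolerance of 0.5 is an arbitrary assumption, not a threshold taken from GEO2R or from this protocol.

```python
from statistics import median


def sample_medians(expression):
    """Median log2 expression value for each sample."""
    return {sample: median(values) for sample, values in expression.items()}


def medians_aligned(expression, tolerance=0.5):
    """True if all sample medians lie within `tolerance` of each other.

    The tolerance is an illustrative assumption, not a GEO2R setting.
    """
    medians = list(sample_medians(expression).values())
    return max(medians) - min(medians) <= tolerance


# Hypothetical log2 expression values for three samples
expression = {
    "ESCC_1": [6.1, 7.9, 8.0, 9.2, 11.0],
    "ESCC_2": [5.8, 7.5, 8.1, 9.0, 10.7],
    "normal_1": [6.0, 7.7, 7.9, 9.1, 10.9],
}
print(sample_medians(expression))
print(medians_aligned(expression))
```

Samples whose medians drift far from the rest would suggest that the series needs normalization before differential expression analysis.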
The identified genes were filtered based on p < 0.05 and fold-change criteria. Unchanged genes (fold change between 0.5 and 2.0) were removed from the analysis. When compared with a previously published study, only 514 genes were common, whereas 1,193 genes were unique. This is important because identifying unique genes using GEO2R can not only decrease redundancy but also enrich the compendium.
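The comparison with a previously published gene list amounts to a set intersection and a set difference. A minimal sketch follows, using placeholder gene names rather than the actual lists; the function name `compare_gene_lists` is hypothetical.

```python
def compare_gene_lists(current, previous):
    """Split the current DEG list into genes shared with, and absent from, a previous list."""
    current_set, previous_set = set(current), set(previous)
    return {
        "common": current_set & previous_set,
        "unique_to_current": current_set - previous_set,
    }


# Placeholder gene names; the real lists come from the GEO2R output
# and from the previously published study
current_degs = ["gene_A", "gene_B", "gene_C", "gene_D"]
previous_degs = ["gene_B", "gene_D", "gene_E"]

result = compare_gene_lists(current_degs, previous_degs)
print("common:", sorted(result["common"]))
print("unique:", sorted(result["unique_to_current"]))
```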
A partial list of DEGs is given in Table 1, and the complete list is provided in Supplementary File 1. Some of the upregulated genes belong to the extracellular matrix, such as MMP18,23,24, MMP1223,25, SPP18,26, POSTN9, and VCAN8,27. Among the other genes listed in Table 1, CMPK2, AURKA28,29, CHEK127, and CDK130 are upregulated, and EMP127, PTK631,32, GPX327, DPT33, FHL134,35, and CRNN8,36 are downregulated in ESCC compared with normal epithelia. POSTN (periostin) is upregulated in ESCC, and its upregulation has also been reported in esophageal adenocarcinoma. A previous study on ESCC reported that POSTN protein expression was observed not only in the stromal region but also in the tumor cells, suggesting an interaction between the tumor and its microenvironment9. Periostin is a protein that is primarily secreted by mesenchymal cells and plays a crucial role in the regulation, adhesion, and differentiation of osteoblasts, as well as in wound repair. In addition, periostin has been implicated in tumor progression and metastasis in various cancers, including ESCC. Studies have shown that periostin is involved in epithelial-to-mesenchymal transition (EMT) and tumor angiogenesis, promoting cell migration, motility, adhesion, and metastatic growth of tumors. In Barrett's esophagus, a precancerous condition of the esophagus, POSTN, the gene that codes for periostin, is significantly upregulated compared to normal esophageal tissue37. In eosinophilic esophagitis, an inflammatory disease of the esophagus, both periostin mRNA and protein expression levels are upregulated compared to normal esophageal epithelium. Similarly, in ESCC, POSTN was found to be 11-fold upregulated in gene expression analysis9. These findings suggest that POSTN may serve as a potential biomarker for ESCC and other cancers.
Furthermore, increased serum POSTN levels have been reported in breast cancer patients diagnosed with bone metastases, suggesting that POSTN could be further investigated as a potential metastatic biomarker in the sera of ESCC patients. Overall, POSTN appears to play an important role in tumor progression and may have clinical implications for cancer diagnosis, prognosis, and treatment.
The chromosomal distribution of DEGs shows that the largest numbers of genes were from chromosomes 1-6 and X (Figure 4). The ShinyGO-based pathway analysis of the DEGs highlighted a number of crucial pathways in ESCC, including the IL-17 signaling pathway, protein digestion and absorption, ECM-receptor interaction, TNF signaling pathway, Toll-like receptor signaling pathway, chemokine signaling pathway, cytokine-cytokine receptor interaction, alcoholic liver disease, microRNAs in cancer, transcriptional dysregulation in cancer, cell cycle, and NOD-like receptor signaling pathway. Further, enrichment of GO terms in the DEGs was assessed using g:Profiler. Different GO terms for molecular function (GO:MF), cellular component (GO:CC), and biological process (GO:BP) were enriched (Figure 5). The list of these GO terms is provided in Table 2.
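The per-chromosome tally can be illustrated with a simple count over a gene-to-chromosome mapping. The sketch below uses six genes named in this article; the chromosome assignments are given for illustration and should be confirmed against NCBI Gene before use. This is a plain count, not the ShinyGO analysis itself.

```python
from collections import Counter


def count_by_chromosome(gene_to_chromosome):
    """Number of DEGs located on each chromosome."""
    return Counter(gene_to_chromosome.values())


# Illustrative subset of DEGs; verify each locus against NCBI Gene
gene_to_chromosome = {
    "CRNN": "1",
    "SPP1": "4",
    "VCAN": "5",
    "MMP1": "11",
    "MMP12": "11",
    "POSTN": "13",
}

counts = count_by_chromosome(gene_to_chromosome)
for chromosome, n_genes in counts.most_common():
    print(f"chr{chromosome}: {n_genes}")
```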
Figure 1: Schematic representation of the processing of esophageal squamous cell carcinoma studies available in Gene Expression Omnibus using GEO2R. The schema shows the steps involved in identifying differentially regulated genes (DEGs) or differentially regulated molecules (DEMs), including the selection criteria for DEGs: fold change >2.0 and p < 0.05 for upregulated genes, and fold change <0.5 and p < 0.05 for downregulated genes.
Figure 2: Schematic representation of finding additional information on differentially regulated genes in esophageal squamous cell carcinoma using other publicly available resources. Additional information on DEGs is crucial in deciding which DEGs should be selected for further validation and assessment in the clinical setting. Information such as aliases, official gene symbol, chromosome location/gene locus, OMIM entry, domain/motif, secretory nature of the protein, and availability of a suitable antibody for validation at the protein level can be obtained from different online resources.
Figure 3: Analysis of the study with GEO accession GSE161533 using GEO2R for identification of DEGs between ESCC and normal samples. GEO2R was used with default parameters to generate (A) a volcano plot representing the gene distribution with -log10-transformed significance (p-value) on the y-axis and log2-transformed fold change on the x-axis, (B) an MD plot displaying log2 fold change vs. average log2 expression values for visualizing differentially expressed genes, (C) a UMAP (Uniform Manifold Approximation and Projection) plot showing the segregation of samples based on their types, (D) an expression density plot for checking the normalization of data before differential expression analysis, and (E) a box plot showing median-centered values across the samples, indicating that the data are normalized and cross-comparable.
Figure 4: Distribution of DEGs on different chromosomal loci using the ShinyGO enrichment tool. (A) Unique genes were identified by generating a Venn diagram to compare the current vs. previously published studies. (B) The distribution of DEGs on the different chromosomes in the genome. (C) Pathway enrichment for DEGs using ShinyGO-based enrichment analysis.
Figure 5: Manhattan plot illustrating GO term enrichment of target genes using g:Profiler. The differentially expressed genes were analyzed with g:Profiler, and the enrichment in GO terms (MF: molecular function; BP: biological process; CC: cellular component), KEGG pathways, Reactome pathways (REAC), WikiPathways (WP), transcription factors (TF), and microRNA targets (MIRNA) is depicted in a Manhattan plot, where the x-axis shows the functional terms colored by category. Each colored dot represents a term. The y-axis shows the adjusted p-values on a -log10 scale. The terms that are statistically significant for ESCC are shown on the x-axis. HP: Human Phenotype.
Table 1: Partial list of differentially expressed genes in ESCC.
Table 2: Enrichment of GO terms in ESCC using g:Profiler.
Supplementary File 1: Complete list of differentially expressed genes in ESCC.
Since the adoption of high-throughput omics techniques in cancer biology, the rate of data generation has increased significantly. This poses a challenge for researchers, especially those without computational expertise. To overcome this, bioinformaticians have over the years developed databases that provide data in an organized manner. This has generated a positive response from researchers, especially those with little interest in the underlying technology. Furthermore, omics data scattered throughout the literature are of little use. Therefore, to make proper use of these data, there has always been a need for a common platform where researchers with specialized interests can access them. A number of databases exist for different cancers, including ONCOMINE38, ESCC ATLAS39, the pancreatic cancer database (PCD)40, and DDEC41.
The concept of differentially expressed genes (DEGs) arises from the analysis of RNA sequencing data, where genes with significant changes in expression levels across two or more conditions (such as cancer vs. normal, or treatment vs. control) are identified. Several tools have been developed to determine DEGs; they perform statistical tests on gene expression quantified computationally from either raw RNA-seq reads or the intensity ratios generated between the probe and the target sequence in the cancer vs. normal groups. These tools provide information on the expression level and the pairwise magnitude of difference for each gene. Differential gene expression (DGE) analyses are useful for understanding the genetic mechanisms that contribute to phenotypic differences in organisms. DGE analyses have been applied to study a variety of biological questions, including tumor origin detection and microbiome analysis. By identifying DEGs, these analyses can provide insight into the underlying genetic factors that contribute to the biological processes involved in ESCC tumorigenesis21.
GEO2R, which is publicly available, was the preferred method here. Most studies available in the literature have been analyzed using different algorithms, which leads to large differences in the results; to avoid these differences, this platform was used because it is free and user-friendly. It allows comparisons between conditions such as 'Cancer vs. Normal' or 'Treatment vs. No Treatment'.
In this case, ESCC was chosen because it is an emerging cancer of the gastrointestinal (GI) tract in India and China. We chose GEO accession GSE161533 for analysis with GEO2R to identify DEGs between ESCC and normal samples. The study was chosen because it did not include ESCC patients who had previously received chemotherapy or radiotherapy. Paired samples (ESCC and adjacent normal tissue from the same patient) are preferred for any analysis, if available. This is because ESCC and normal tissues from the same patient share the same genetic background and the same environment, so their genomes are expected to be very similar. Using paired samples helps to avoid the bias that might be introduced by comparing ESCC and normal tissues from different patients with different genetic backgrounds. It also allows more accurate identification of the differences in gene expression between ESCC and normal tissues within the same patient, which can help improve the specificity of the results. This approach is frequently used in gene expression studies to control for individual variability and enhance analytical power.
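Why pairing increases analytical power can be seen from a paired t statistic, which is computed on the within-patient differences so that patient-to-patient variability cancels out. The sketch below uses hypothetical log2 expression values for one gene in three patients. It is a textbook paired t statistic shown for illustration only, not the statistical model that GEO2R applies.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(tumor, normal):
    """t statistic for paired samples, based on per-patient differences."""
    if len(tumor) != len(normal):
        raise ValueError("paired samples must have the same length")
    differences = [t - n for t, n in zip(tumor, normal)]
    standard_error = stdev(differences) / math.sqrt(len(differences))
    return mean(differences) / standard_error


# Hypothetical log2 expression of one gene in three patients
tumor_log2 = [8.0, 9.0, 10.0]    # ESCC tissue
normal_log2 = [6.0, 7.5, 8.5]    # adjacent normal tissue, same patients

t_stat = paired_t_statistic(tumor_log2, normal_log2)
print(f"paired t statistic: {t_stat:.2f}")
```

In this toy example the baseline expression differs considerably between patients, yet the within-patient differences are consistent, which is what the paired design exploits.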
We took all the sample data from the subjects involved in the study and used the GEO2R platform to analyze the gene expression data. First, we assigned the cancer samples to one group, followed by the normal samples to another. After the samples were assigned, the default parameters available in GEO2R were used to compare the cancer (or treatment) samples with the normal (or control) samples. To identify genes differentially expressed between cancer and normal samples, an adjusted p-value (adj. P Val) threshold of <0.05 and a fold-change threshold of >2.0 were set for upregulated genes, and an adjusted p-value threshold of <0.05 and a fold-change threshold of <0.5 for downregulated genes. These thresholds are commonly used in gene expression studies to identify genes differentially expressed between cancer and normal tissues. It is important to note that the choice of significance thresholds can affect the number and identity of genes identified as differentially expressed. Additionally, it is important to carefully evaluate the biological relevance of the identified genes and to perform further validation studies to confirm the results.
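The selection rule can be written as a small classification function. This is a minimal sketch that applies the thresholds stated above to fold-change values (not log2 values); the function name `classify` and the label "not selected" are assumptions made for the example.

```python
ADJ_P_CUTOFF = 0.05      # adjusted p-value threshold from the text
UP_FOLD_CHANGE = 2.0     # fold change above this value: upregulated
DOWN_FOLD_CHANGE = 0.5   # fold change below this value: downregulated


def classify(fold_change, adj_p):
    """Label a gene as 'up', 'down', or 'not selected'."""
    if adj_p < ADJ_P_CUTOFF:
        if fold_change > UP_FOLD_CHANGE:
            return "up"
        if fold_change < DOWN_FOLD_CHANGE:
            return "down"
    return "not selected"


# Hypothetical genes: (name, fold change, adj. p-value)
for name, fc, adj_p in [("gene_A", 8.0, 0.001), ("gene_B", 0.25, 0.01),
                        ("gene_C", 1.4, 0.001), ("gene_D", 5.0, 0.2)]:
    print(name, classify(fc, adj_p))
```

Changing the three constants shows directly how the choice of thresholds alters the number and identity of the selected genes.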
In the literature, there has been a trend of reporting only genes with at least a 2-fold change for upregulation and a <0.5-fold change for downregulation, especially in microarray and proteomics studies42. In earlier studies, a fold change of >1.5 was considered upregulated and <0.67 downregulated43,44, but trends over the last decade clearly show that a higher fold-change cutoff is preferred, largely because validation experiments on candidates with low fold-change values show weak or no correlation between mRNA and protein levels45. The downside of choosing a higher fold-change cutoff is that some molecules that are biologically relevant in the disease or cancer may be missed, omitted only because of the cutoff used to compile the list of DEGs/DEMs. Furthermore, the literature is biased toward reporting upregulated or overexpressed molecules rather than underexpressed ones. Molecules that show the same pattern of upregulation or overexpression in multiple studies, whether or not those studies concern the same cancer or disease, are also favored by scientists. Similarly, if the same pattern of overexpression is observed in multiple diseases and reported in the literature, it is widely accepted in the scientific community.
Moreover, the similarity of diseases is contingent on whether microarray data or literature is employed for the comparison. Lastly, loosely defined descriptions of differential expression magnitudes in the literature exhibit only a limited correlation with microarray fold-change data46.
Further, a compendium can provide additional information from databases such as NCBI Entrez Gene47, HGNC48, OMIM49, HPRD50,51, Ensembl52, KEGG53, WikiPathways54, GO55, miRBase56, and DGV57. While using GEO2R, an assessment of the UMAP plot shows how samples are related. In the current analysis, two ESCC samples grouped with the normal samples, suggesting either a sampling error or that the ESCC samples are heterogeneous enough to appear among the normal samples.
GEO2R is user-friendly and easily accessible, but it has some limitations. GEO2R lacks the ability to generate PCA plots and heat maps or to filter samples after quality control. It only provides a single Venn diagram for sample comparisons within the same series. GEO2R is limited to Series Matrix files, preventing cross-series comparisons. Additionally, GEO2R only analyzes microarray data and does not have quality controls for sample normality or cross-comparability. GEO2R does not allow an unlimited number of search results and only displays the top 250 genes for any given pairwise comparison within a dataset. It will also analyze datasets with too few sample replicates for a robust statistical analysis. GEO2R reports log fold change, which must be converted into fold change using either R or a spreadsheet. Also, to represent up- and downregulated genes, another software package or online tool is needed to make a heat map58,59,60.
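The conversion from log fold change to fold change is a single exponentiation, fold change = 2^(log2 fold change). The sketch below applies it to a small tab-delimited table in Python, as an alternative to R or a spreadsheet. The column names `Gene.symbol`, `adj.P.Val`, and `logFC` are assumed here and should be checked against the header of the downloaded GEO2R .txt file; the rows are hypothetical.

```python
import csv
import io


def log2fc_to_fold_change(log2_fc):
    """Convert a log2 fold change into a linear fold change."""
    return 2.0 ** log2_fc


def add_fold_change(tsv_text, log_column="logFC"):
    """Parse tab-delimited text and add a 'fold_change' field to each row."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        row["fold_change"] = log2fc_to_fold_change(float(row[log_column]))
        rows.append(row)
    return rows


# Hypothetical excerpt of an exported table (column names are assumptions)
example = (
    "Gene.symbol\tadj.P.Val\tlogFC\n"
    "gene_A\t0.001\t3.0\n"
    "gene_B\t0.02\t-1.0\n"
)

rows = add_fold_change(example)
for row in rows:
    print(row["Gene.symbol"], row["fold_change"])
```

A log2 fold change of 1.0 corresponds to the 2.0-fold cutoff and -1.0 to the 0.5-fold cutoff used in this protocol.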
In summary, this article provides a simple pipeline that can be used, with minor modifications, to build a compendium for any kind of malignancy. Such compendia are much needed to support biomedical scientists, especially in biomarker discovery, by providing candidate molecules for validation in the clinical setting for use in either prognosis or diagnosis.
The authors have nothing to disclose.
MKK is the recipient of the TARE fellowship (Grant # TAR/2018/001054) and an extramural grant (Grant # 5/13/55/2020/NCD-III) from the Science and Engineering Research Board (SERB), Department of Science and Technology, and the Indian Council of Medical Research (ICMR), Government of India, New Delhi, respectively.
Name | Company | Catalog Number | Comments |
NCBI-PUBMED | NCBI | https://ncbi.nlm.nih.gov/pubmed | Referring to section 1. required for searching the literature |
A laptop/MacBook or personal computer with internet access and a web browser | | | |
g:Profiler | ELIXIR infrastructure | https://biit.cs.ut.ee/gprofiler/gost | Referring to section 4.10. required for enrichment of GO:MF, GO:BP, and GO:CC |
Gene expression omnibus | NCBI | https://www.ncbi.nlm.nih.gov/geo/ | Referring to section 3.1. required for searching the microarray study database |
GEO2R | NCBI | https://www.ncbi.nlm.nih.gov/geo/geo2r/ | Referring to section 3.2. required for analyzing the data using GEO2R tool |
Google | | https://www.google.com | Referring to section 1.1. required for searching the literature |
HGNC | HGNC is a committee of the Human Genome Organisation (HUGO) | https://www.genenames.org | Referring to section 6.1 required to know the official gene symbol of the DEGs |
HPRD | Institute of Bioinformatics, Bengaluru | http://hprd.org | Referring to section 5.1 required for information about protein architecture |
OMIM | Johns Hopkins University, Baltimore | http://www.omim.org/entry | Referring to section 8.1 required to know the OMIM ID of a particular gene / DEG |
Pangloss Program | Developed by Chris Seidel | http://www.pangloss.com/seidel/Protocols/venn.cgi | Referring to section 4.9. required for generating the Venn diagram |
PANTHER | Thomas lab at the University of Southern California | http://www.pantherdb.org/geneListAnalysis.do | Referring to section 4.10. required for enrichment of GO:MF, GO:BP, and GO:CC |
ShinyGO | South Dakota State University | http://bioinformatics.sdstate.edu/go | Referring to section 4.10. required for allocation of DEGs on the chromosomes |
Copyright © 2023 MyJoVE Corporation. All rights reserved