easyGEO extracts, processes, re-analyzes and visualizes the most recent gene expression data from the NCBI GEO database. The resulting DE table can be imported directly into easyGSEA for functional enrichment analysis or into easyVizR for multiple comparisons.
easyGEO extracts the data of an NCBI GEO series, specified by its GEO accession number, which begins with “GSE”. Alternatively, easyGEO provides an option to analyse user-supplied datasets, either one’s own or those extracted from other sources.
Input the unique GSE number of the study you are interested in and click Search. Currently we support analysis of single-channel microarray studies and most RNA-seq studies.
The example dataset was from the first published study on SARS-CoV-2-infected transcriptomes (Blanco-Melo et al., 2020).
After initial retrieval, data are stratified by the microarray or sequencing platforms used by the study, such as the GPL18573 platform (Illumina NextSeq 500, Homo sapiens) and the GPL28369 platform (Illumina NextSeq 500, Mustela putorius furo).
Hover over each choice for more information about the platform, e.g. platform ID, organism and experimental strategy. Select the one that you are interested in and click Select to proceed.
The right panel shows more information about the data series, such as the study title, the experimental strategy, the author names and their contact information.
Proceed to 2. Data matrix.
The data matrix should be comma- or tab-delimited (csv, tsv, tab, txt). The first row of the matrix should contain sample names, which must match the sample names in the design matrix (see below "1.2. Upload a matrix describing experimental designs (design matrix)"). The first column of the matrix should contain gene names; no duplicates are allowed.
For example,
| | Series1_NHBE_MOCK_1 | Series1_NHBE_MOCK_2 | Series1_NHBE_MOCK_3 |
|---|---|---|---|
| DDX11L1 | 0 | 0 | 0 |
| WASH7P | 29 | 24 | 23 |
| FAM138A | 0 | 0 | 0 |
| OR4F5 | 0 | 0 | 0 |
| aagr-4 | 0.78 | 6.31 | 7.23 |
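
The layout rules for the data matrix can be checked before uploading. The following is a minimal sketch, not easyGEO's own validation code: the function name `read_data_matrix` and the `gene` label in the top-left cell are illustrative assumptions, and only the Python standard library is used.

```python
import csv
import io

def read_data_matrix(text, delimiter=","):
    """Parse a delimited expression matrix and check the layout described above.

    Hypothetical helper for illustration; easyGEO's internal checks may differ.
    """
    # Skip completely empty lines so a trailing newline does not break parsing.
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    header, body = rows[0], rows[1:]
    samples = header[1:]            # first row: sample names
    genes = [r[0] for r in body]    # first column: gene names
    duplicates = sorted({g for g in genes if genes.count(g) > 1})
    if duplicates:
        raise ValueError(f"duplicate gene names: {duplicates}")
    # Every data row must have one value per sample.
    for r in body:
        if len(r) - 1 != len(samples):
            raise ValueError(f"row {r[0]!r} has {len(r) - 1} values, expected {len(samples)}")
    values = {r[0]: [float(v) for v in r[1:]] for r in body}
    return samples, values

example = """gene,Series1_NHBE_MOCK_1,Series1_NHBE_MOCK_2,Series1_NHBE_MOCK_3
DDX11L1,0,0,0
WASH7P,29,24,23
FAM138A,0,0,0
"""
samples, values = read_data_matrix(example)
print(samples)
print(values["WASH7P"])
```

For a tab-delimited file, pass `delimiter="\t"`.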
The design matrix should be comma- or tab-delimited (csv, tsv, tab, txt). The first row of the matrix should contain sample attributes (e.g. strain names, experimental conditions, patient groups); no duplicates are allowed. The first column of the matrix should contain sample names, which must match the sample names in the data matrix.
For example,
| | cell.line | cell.type | strain | subject.status | time.after.treatment |
|---|---|---|---|---|---|
| Series1_NHBE_Mock_1 | NHBE | primary human bronchial epithelial cells | N/A | N/A | N/A |
| Series1_NHBE_Mock_2 | NHBE | primary human bronchial epithelial cells | N/A | N/A | N/A |
| Series1_NHBE_Mock_3 | NHBE | primary human bronchial epithelial cells | N/A | N/A | N/A |
| Series1_NHBE_SARS-CoV-2_1 | NHBE | primary human bronchial epithelial cells | USA-WA1/2020 | N/A | N/A |
| Series1_NHBE_SARS-CoV-2_2 | NHBE | primary human bronchial epithelial cells | USA-WA1/2020 | N/A | N/A |
| Series1_NHBE_SARS-CoV-2_3 | NHBE | primary human bronchial epithelial cells | USA-WA1/2020 | N/A | N/A |
Proceed to 2. Data matrix.
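
Because sample names must agree between the two matrices, it can help to compare them before uploading. The sketch below assumes an exact, case-sensitive string comparison (whether easyGEO compares names this strictly is an assumption), and `compare_sample_names` is a hypothetical helper name.

```python
def compare_sample_names(data_samples, design_samples):
    """Return the names found in only one of the two matrices.

    Assumes exact, case-sensitive matching; this is an illustrative check,
    not easyGEO's own implementation.
    """
    data_set, design_set = set(data_samples), set(design_samples)
    only_in_data = sorted(data_set - design_set)
    only_in_design = sorted(design_set - data_set)
    return only_in_data, only_in_design

# Column names of the data matrix (first row) and row names of the design matrix (first column).
data_samples = ["Series1_NHBE_Mock_1", "Series1_NHBE_Mock_2", "Series1_NHBE_MOCK_3"]
design_samples = ["Series1_NHBE_Mock_1", "Series1_NHBE_Mock_2", "Series1_NHBE_Mock_3"]

only_in_data, only_in_design = compare_sample_names(data_samples, design_samples)
print("only in data matrix:", only_in_data)
print("only in design matrix:", only_in_design)
```

Under this strict comparison, a difference in letter case alone (`MOCK` versus `Mock`) is reported as a mismatch, so both lists should be empty before proceeding.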
This step allows you to review and/or load the gene expression data provided by the authors.
Once you have downloaded a supplementary file, please decompress it and check for the following:
The preloaded or uploaded data matrix is displayed on the right of the screen.
In an auto-retrieval session, the "Show column names as" options let you toggle between “GEO accession” (the unique ID of each sample as stored in the GEO database, which begins with “GSM”, e.g. GSM4432378) and “Sample name” (a more descriptive name provided by the series’ authors, e.g. Series1_NHBE_Mock_1).
The filtering options let you review either the full data matrix or the matrix filtered by the samples selected in “3. Filter/review design matrix”. The number in parentheses denotes the number of samples.
Click the top right button to download the data matrix for record, if needed.
This step allows you to review the design matrix submitted by the authors and filter the samples of interest.
Filter samples by selecting experimental factors of interest.
Selected samples are tallied and displayed in the dropdown at the top right of the screen.
Samples are displayed as rows and experimental variables as columns. Review them carefully to make sure the correct samples are selected for the differential expression analysis later. In an auto-retrieval session, the "Show row names as" options let you toggle between “GEO accession” (the unique ID of each sample as stored in the GEO database, which begins with “GSM”, e.g. GSM4432378) and “Sample name” (a more descriptive name provided by the series’ authors, e.g. Series1_NHBE_Mock_1).
Click the top right button to download the design matrix for record, if needed.
A summary of the experimental designs for the retrieved data series is also provided. The number in parentheses denotes the number of samples included for a particular experimental variable.
Now you are ready to perform differential expression (DE) analysis, then download and visualize the results!
If the data matrix is complete, you will see a green "Data matrix ok!" message. Then, select the type of count data provided by the authors:
Please select the samples in the control group and those in the experimental group to perform DE analysis. There are two ways to select samples:
Manually assign samples into the control and the experimental groups by clicking the provided selection buttons.
Once the data matrix and the contrast selection are ready, an orange run button appears in the top right corner of the screen. Click it to perform DE analysis.
Once the analysis is complete, you may review the resulting DE table. Download the DE table, and proceed directly to easyGSEA for functional profiling and/or easyVizR for multiple comparisons.
Explore different plots and diagrams to have a better understanding of the dataset. You can also easily search for and visualize the expression level changes of genes of interest.
In volcano plots, the log2-transformed fold change (logFC) is plotted along the x-axis and the -log10-transformed adjusted P-value (padj) is plotted along the y-axis. Horizontal and vertical dashed lines indicate the user-defined thresholds of padj and |logFC|, respectively.
You can adjust the parameters on the right panel to customize the plot and visualize your genes of interest.
First, adjust thresholds of adj.P.Val (default < 0.05) and |logFC| (default >= 1), if needed.
Second, in the Static mode, there are three ways to visualize your genes of interest:
In the Interactive mode, genes that meet the user-defined padj and |logFC| thresholds are highlighted in red. Hover labels show details about each gene’s name, logFC, and padj.
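
The threshold logic behind the volcano plot can be sketched in a few lines. This is an illustration under stated assumptions rather than easyGEO's plotting code: the gene names and values are made up, `classify` is a hypothetical helper, and the cutoffs follow the defaults quoted above (adjusted P-value < 0.05, |logFC| >= 1).

```python
import math

def classify(genes, padj_cutoff=0.05, logfc_cutoff=1.0):
    """Compute volcano-plot coordinates and flag genes passing both thresholds.

    genes: iterable of (name, logFC, padj) tuples; padj must be > 0,
    because -log10(0) is undefined.
    """
    points = []
    for name, logfc, padj in genes:
        significant = padj < padj_cutoff and abs(logfc) >= logfc_cutoff
        points.append({
            "gene": name,
            "x": logfc,               # x-axis: log2 fold change
            "y": -math.log10(padj),   # y-axis: -log10 adjusted P-value
            "significant": significant,
        })
    return points

# Made-up values for illustration only.
genes = [
    ("geneA", 2.5, 0.001),   # passes both thresholds
    ("geneB", 0.4, 0.0001),  # small fold change
    ("geneC", -1.0, 0.04),   # passes both (|logFC| is exactly 1)
    ("geneD", -3.0, 0.2),    # not significant
]
points = classify(genes)
for p in points:
    print(p["gene"], round(p["y"], 2), p["significant"])
```

The dashed lines in the plot then sit at y = -log10(padj cutoff) and at x = ±(|logFC| cutoff).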
easyGEO also provides an interactive heatmap where you can visualize and explore the expression level changes of genes in each biological repeat.
Genes are listed along the y-axis and colored in each sample according to their expression values. As with the volcano plots, genes can be highlighted in three modes, and hover labels over each data point show detailed statistics.
If “Log2 transformation” is selected, each expression value (counts per million (CPM) if raw counts; the original values in the gene expression matrix otherwise) is incremented by 1 and log2 transformed; if “Z-score transformation” is selected, z-scores are computed per gene from its expression values across samples.
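
The two transformations can be written out as a small worked example. This is a sketch under assumptions, not easyGEO's implementation: CPM is computed here against the column sum of the supplied matrix as the library size, and the z-score uses the sample standard deviation (n - 1 denominator); the tool may differ on both points. All function names are hypothetical.

```python
import math
import statistics

def cpm(counts):
    """Counts per million. counts: dict of gene -> list of raw counts per sample.

    Assumes library size = column sum of this matrix (must be non-zero).
    """
    n = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n)]
    return {g: [row[j] / lib_sizes[j] * 1e6 for j in range(n)]
            for g, row in counts.items()}

def log2_plus1(matrix):
    """log2(x + 1) for every expression value."""
    return {g: [math.log2(v + 1) for v in row] for g, row in matrix.items()}

def zscore_per_gene(matrix):
    """Z-score each gene across samples; constant genes are set to 0."""
    out = {}
    for g, row in matrix.items():
        mean = statistics.mean(row)
        sd = statistics.stdev(row)  # sample s.d. (assumption)
        out[g] = [0.0] * len(row) if sd == 0 else [(v - mean) / sd for v in row]
    return out

# Made-up raw counts for two genes in three samples.
raw = {"geneA": [10, 20, 30], "geneB": [90, 80, 70]}
log_cpm = log2_plus1(cpm(raw))
z = zscore_per_gene(raw)
print(log_cpm["geneA"])
print(z["geneA"])
```

Adding 1 before the log keeps zero values defined, and the per-gene z-score puts genes with very different expression levels on a comparable color scale.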
Search for a gene and visualize its expression level changes with violin and box plots.
Distributions of expression values for a selected gene of interest are plotted along the y-axis. In violin plots, standard deviations (s.d.) are indicated with grey perpendicular lines.
If needed, use the right panel to log2 transform the expression values (enabled by default) and/or, for violin plots, to adjust the number of s.d. to display.
Feel free to reach us at evitta@cmmt.ubc.ca if you have any questions.