easyGSEA leverages up-to-date and species-specific biological knowledge to discover and interpret trends in the large lists of genes or proteins generated by many functional genomics techniques, such as gene expression microarrays, RNA-seq, ChIP-seq, GWAS, and methylation arrays.
Currently supported modes of analysis: pre-ranked Gene Set Enrichment Analysis (GSEA) & OverRepresentation Analysis (ORA).
There are two predominantly used enrichment methods to discover the molecular mechanisms that underlie gene expression datasets: (i) overrepresentation analysis (ORA), testing whether a gene set contains disproportionately many genes of significant expression change, and (ii) pre-ranked gene set enrichment analysis (GSEA), testing whether genes of a gene set accumulate at the top or bottom of the full gene vector ordered by direction and magnitude of expression change (Geistlinger et al, 2021).
Both methods are powered by a priori established gene sets, groups of genes that share common biological function, chromosomal location, or regulation. Successful gene sets can help identify underlying genetic abnormalities or signal transduction networks driving disease pathologies and help effectively bridge gene expression data with biological significance. (Subramanian et al, 2005; Bild & Febbo, 2005)
In ORA, differential expression (DE) scores are calculated, the full expression matrix is reduced to a list of genes passing a threshold (e.g. |fold change| ≥ 1, FDR < 0.05), known as differentially expressed genes (DEGs), and hypergeometric testing is performed to identify the gene sets that contain more DEGs than expected by chance. In GSEA, DE scores are also computed, but users do not need to define a threshold. Instead, the genes are ordered in a ranked list according to their DE between the classes, and the members of a gene set are assessed for their rank positions in the list. Next, an enrichment score (ES) is calculated to reflect the degree to which a gene set is overrepresented at the extremes (top or bottom) of the entire ranked list. Then, the statistical significance (nominal P value) of the ES is computed using an empirical phenotype-based permutation test procedure that preserves the complex correlation structure of the gene expression data. Finally, when an entire database of gene sets is evaluated, the estimated significance level is adjusted to account for multiple hypothesis testing.
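The hypergeometric test used in ORA can be sketched as follows. This is a minimal illustration rather than easyGSEA's own code: the function name `hypergeom_pvalue` and its parameter names are hypothetical, and the gene universe size is whatever background the analyst assumes.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, n_degs, universe):
    """Upper-tail probability P(X >= overlap).

    X is the number of gene set members found among n_degs genes drawn
    without replacement from a universe that contains set_size members.
    """
    upper = min(set_size, n_degs)
    total = comb(universe, n_degs)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, n_degs - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20,000 background genes, a gene set of 100 genes,
# 500 DEGs, 10 of which fall inside the gene set.
print(hypergeom_pvalue(10, 100, 500, 20000))
```

A small P-value indicates that the gene set contains more DEGs than random sampling would be expected to produce.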
ORA has known analytical challenges. For example, no individual gene may meet the threshold for statistical significance after correcting for multiple hypothesis testing because the relevant biological differences are modest relative to the noise inherent to the quantification technology; alternatively, one may be left with a long list of statistically significant genes without any unifying biological theme; and when different groups study the same biological system, the lists of statistically significant genes from the two studies may show distressingly little overlap. GSEA was developed to address these challenges (Subramanian et al, 2005).
easyGSEA provides both GSEA and ORA analysis modules.
We currently support 11 species with a comprehensive list of gene set (GS) libraries that are updated monthly, including unique and species-specific FlyBase, WormCat, WormBase and SGD ontologies. Each GS is named as “UUU_XXXXXXX%DDD”, where UUU = the unique identifier of the GS’s originating database, XXXXXXX = the GS description, and DDD = the unique ID (if any) of the GS.
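For downstream scripting, a GS name in this convention can be split into its three parts. The sketch below assumes that the database identifier ends at the first "_" and that the optional ID follows the last "%"; the function name `parse_gs_name` and the example name are hypothetical.

```python
def parse_gs_name(name):
    """Split 'UUU_XXXXXXX%DDD' into (database, description, gs_id).

    gs_id is None when the name carries no '%DDD' suffix.
    """
    database, _, rest = name.partition("_")
    if "%" in rest:
        description, _, gs_id = rest.rpartition("%")
    else:
        description, gs_id = rest, None
    return database, description, gs_id


# Hypothetical gene set name used purely for illustration.
print(parse_gs_name("KEGG_Citrate cycle (TCA cycle)%hsa00020"))
```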
We also support custom analysis with user-supplied gene set databases in GMT format.
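In a GMT file, each row is tab-delimited: the first field is the gene set name, the second a description, and the remaining fields are the member genes. A minimal reader, with the hypothetical name `parse_gmt`, might look like this:

```python
def parse_gmt(text):
    """Return {gene_set_name: [genes]} from GMT-formatted text."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank rows and rows without any genes
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Two made-up gene sets for illustration.
example = "SET_A\tfirst set\tgst-5\tdod-24\nSET_B\tsecond set\tdhs-8\n"
print(parse_gmt(example))
```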
GSEA mode accepts two file formats: a ranked gene list (RNK) file and a differential expression (DE) analysis file.
The RNK file contains a single, rank ordered gene list (not gene set) in a simple newline-delimited text format. It has two columns: the first column contains genes while the second contains rank scores. It is used when you have a pre-ordered ranked list that you want to analyze with GSEA. The list need not be sorted.
We support both comma- and tab-delimited text files.
GeneName | Rank |
---|---|
gst-5 | 5.425 |
C29F3.7 | 4.409 |
dod-24 | 4.264 |
dhs-8 | 4.003 |
T21F4.1 | 3.927 |
It is also possible to generate an RNK file from differential expression analysis results produced by tools such as DESeq2, edgeR, and limma. The results should be saved in a comma- (.csv) or tab-delimited (.txt/.tab) text file; easyGSEA will automatically detect the format. You will need to specify three columns: genes, logFC and p-value. Our app will generate the RNK for you.
name | logFC | logCPM | F | PValue | FDR |
---|---|---|---|---|---|
2L52.1 | -0.26 | 1.52 | 0.25 | 0.63 | 1 |
aagr-1 | -0.36 | 5.96 | 1.55 | 0.25 | 1 |
aagr-2 | -0.21 | 7.02 | 0.88 | 0.37 | 1 |
aagr-3 | -0.21 | 7.22 | 0.80 | 0.40 | 1 |
aagr-4 | 0.78 | 6.31 | 7.23 | 0.03 | 1 |
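The exact formula easyGSEA uses to derive rank scores from a DE table is not stated here. One commonly used convention, shown purely as an assumption, is sign(logFC) × -log10(P-value). The function name `de_to_rnk` and the `p_floor` guard against P-values of zero are hypothetical.

```python
import math


def de_to_rnk(rows, p_floor=1e-300):
    """Convert (gene, logFC, pvalue) rows into a ranked list.

    Assumed score: sign(logFC) * -log10(pvalue). Returns [(gene, score)]
    sorted from the most upregulated to the most downregulated gene.
    """
    ranked = []
    for gene, logfc, pval in rows:
        sign = (logfc > 0) - (logfc < 0)
        score = sign * -math.log10(max(pval, p_floor))
        ranked.append((gene, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# The gene, logFC and PValue columns of the example DE table above.
de_rows = [
    ("2L52.1", -0.26, 0.63),
    ("aagr-1", -0.36, 0.25),
    ("aagr-2", -0.21, 0.37),
    ("aagr-3", -0.21, 0.40),
    ("aagr-4", 0.78, 0.03),
]
for gene, score in de_to_rnk(de_rows):
    print(gene, round(score, 3))
```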
ORA mode accepts a newline-, tab-, or space-delimited gene list. You may copy and paste your genes of interest into the text box provided on the page.
easyGSEA functionally characterizes transcriptomes with integrative gene annotation (gene set, GS) databases. All essential analysis steps are performed in tab 1. Run Analysis.
To start, select either Pre-ranked GSEA or Overrepresentation analysis in the side bar.
easyGSEA supports 11 different species; search for and select the one that matches your input query.
Alternatively, you can click on Custom analysis and upload your own gene set (GS) library files (in GMT format, see above) to use the easyGSEA workflow for customized analysis.
When you upload your own GS libraries (*.gmt), you will be prompted to name each library with an identifier. Each uploaded file must be named with a unique identifier. The identifiers will be used to stratify your results by file later on.
By default, the uploaded files are labelled “GMT*”, where * is a number such as 1; the number automatically increments when you upload more.
If you specify that your .gmt files already contain an identifier, the string before the first occurrence of a “_” in the first GS is extracted to denote the GS library.
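The labelling rule described above can be sketched as follows. The function name `library_identifier` is hypothetical, and falling back to the default "GMT*" label when the first GS name contains no "_" is an assumption.

```python
def library_identifier(first_gs_name, upload_index, has_identifier):
    """Pick a label for an uploaded GMT library.

    If the file is declared to carry its own identifier, use the text
    before the first '_' of the first gene set name; otherwise (or if no
    '_' is present, which is an assumed fallback) use 'GMT<n>'.
    """
    if has_identifier and "_" in first_gs_name:
        return first_gs_name.split("_", 1)[0]
    return f"GMT{upload_index}"


print(library_identifier("KEGG_Glycolysis", 1, True))
print(library_identifier("Glycolysis", 2, False))
```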
Click Advanced database options …. By default, easyGSEA has pre-selected combinatory biological pathway (KEGG, Reactome Pathway, Wikipathways) and process (Biological Process) for functional profiling. You may adjust the selections to suit your study purpose in the pop-up window as shown on the right. Once you have confirmed your database selections, click Select to continue!
Click Confirm to confirm your GS database selection or uploads and proceed.
easyGSEA maintains gene set databases according to HUGO Gene Nomenclature, where a gene "symbol" is a unique abbreviation for the gene name. If your query genes are symbols, select SYMBOLS. Otherwise, select Other/Mixed.
GSEA mode: upload either an RNK or a DE file (see above INPUT FILES/FORMATS). Once your file is uploaded, you will be prompted to specify the columns in your file: if RNK, specify the Gene and the Rank columns; if DE, specify the Gene, the logFC, and the P-value columns. By default, your query is named using your query file's name. You can re-name it by modifying the text. All results/figures you're going to download will be named according to your input here.
You will also need to specify the gene identifier type for fully numeric IDs. The default is NCBI Entrez gene IDs, but if you are analyzing array data, make sure to select the corresponding platform. Your query genes will then be automatically converted into symbols, and you will be able to download the converted gene/rank/DE tables after successful ID conversion.
Click Confirm and continue! to upload your file. Gene identifier conversion, if needed, is automatically done.
ORA mode: copy and paste your genes or proteins of interest, delimited by newline (“\n”), tab (“\t”) or space (“ “), into the text box provided in the interface. Select the identifier for numeric IDs (if any) as above. Enter a name for your list. Click Confirm. Your input genes/proteins will be loaded, and automatically converted to symbols, if applicable.
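If you prepare your gene list in a script before pasting it, splitting on the accepted delimiters is straightforward. In this sketch the function name `parse_gene_list` is hypothetical, and dropping duplicate entries is an assumption rather than documented easyGSEA behaviour.

```python
def parse_gene_list(text):
    """Split pasted text on newlines, tabs or spaces, keeping first occurrences."""
    seen = set()
    genes = []
    for token in text.split():  # str.split() with no argument splits on any whitespace
        if token not in seen:
            seen.add(token)
            genes.append(token)
    return genes


print(parse_gene_list("gst-5\ndod-24\tdhs-8 gst-5"))
```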
You may adjust the parameters for functional profiling analysis by clicking the sky blue gear button at the top right of the orange run button.
By default, a minimum size of 15 and a maximum size of 200 are used to filter GSs. In the GSEA module, a default of 1,000 permutations (maximum 10,000) is used.
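The size filter can be illustrated as below. Whether size is counted before or after restricting each GS to the genes present in your query is not stated here; this sketch assumes the latter, and the function name `filter_by_size` is hypothetical.

```python
def filter_by_size(gene_sets, query_genes, min_size=15, max_size=200):
    """Keep gene sets whose overlap with the query falls within [min_size, max_size].

    gene_sets: {name: iterable of genes}. Returns {name: [genes present in query]}.
    """
    query = set(query_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = [g for g in genes if g in query]
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


# Toy example with small bounds so the effect is visible.
toy_sets = {"A": ["g1", "g2", "g3"], "B": ["g1"], "C": ["g1", "g2", "g3", "g4", "g5"]}
print(filter_by_size(toy_sets, ["g1", "g2", "g3", "g4", "g5"], min_size=2, max_size=4))
```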
By default, a P-value threshold of 0.005 is applied to filter significantly enriched GSs.
In addition, by default, a dynamically adjusted threshold on adjusted P-value (padj) is applied for generating visualizations.
Click the switch button to disable this dynamic threshold adjustment.
easyGSEA will provide you with an ID conversion table if your input genes are not symbols. In the GSEA module, if your uploaded file is DE and your genes are not symbols, easyGSEA will generate a converted DE table for you to pass to easyVizR for multiple comparisons.
A summary pop-up window will appear after a successful run, summarizing GS filters and number of significantly enriched GSs.
Click Navigate to Enrichment Results for details.
easyGSEA provides a variety of interactive, customizable and publication-ready figures to visualize the enrichment results. The enrichment plot and statistics for each gene set can be easily retrieved by clicking the plots or by keyword search. We also provide a variety of visualizations for you to explore the trends of your data as compared to the whole genome background: density, box and violin plots. You can also zoom into biological pathways (KEGG, Reactome, Wikipathways) to inspect the gene-level changes.
Click the top left buttons on this page to explore different visualizations of your data. Bar plot, Bubble plot, Manhattan plot and Volcano plot display gene sets that are significantly enriched in your datasets. Keywords does simple text mining and describes the most frequent words in your enrichment results. All visualizations are customizable as explained below.
Color options: Red, Salmon, Blue, Cyan, Orange, Green, Purple, Grey
Top enriched GSs are plotted along the y-axis. In GSEA runs, enrichment scores (ESs) are plotted along the x-axis; in ORA runs, -log10 transformed P-values (pval) (-log10(pval)) are plotted along the x-axis. The color intensity reflects the -log10(pval/padj)*sign(ES) (GSEA module) or -log10(pval) (ORA module), as annotated by the color bar. Hover labels show the statistical information of each GS including its name, pval, padj, and leading-edge (GSEA module) or overlapping (ORA module) genes.
By default, results from all analyzed databases are displayed. The default P-value (pval) threshold is < 0.005. The default adjusted P-value (padj) threshold is dynamically adjusted depending on the user-supplied dataset to i) capture the most significantly enriched categories, and ii) minimize false positives arising from overreliance on a single gene set library: 0.25 if 5 or more gene sets have a padj < 0.25; 0.05 if 20 or more have a padj < 0.05; and 0.01 if 20 or more have a padj < 0.01. If you have multiple datasets to analyze and would like a consistent padj threshold, click the gear button at the top right of the RUN button and switch dynamic adjustment to FALSE. You may choose to color the plot by pval or padj. The y-axis labels are abbreviated to 40 characters by default for aesthetic purposes. You may choose to display the labels in full or adjust the number of characters to abbreviate.
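The dynamic padj rule can be sketched as a tiered check. The order in which the tiers are evaluated is not spelled out here; this sketch assumes the most stringent tier that is satisfied wins, and returns `None` when no tier is met, which is also an assumption. The function name `dynamic_padj_threshold` is hypothetical.

```python
def dynamic_padj_threshold(padj_values):
    """Return the padj cutoff chosen by the tiered rule, or None if no tier applies.

    Tiers are (cutoff, minimum number of gene sets below that cutoff),
    checked from most to least stringent (assumed order).
    """
    tiers = [(0.01, 20), (0.05, 20), (0.25, 5)]
    for cutoff, min_hits in tiers:
        hits = sum(1 for p in padj_values if p < cutoff)
        if hits >= min_hits:
            return cutoff
    return None


print(dynamic_padj_threshold([0.03] * 20 + [0.5] * 100))
```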
Alternatively, you can manually search and select the gene sets of interest to visualize by clicking the search icon. Click the palette icon to customize bar colors. Click the TV icon to remove/add database identifiers and/or GS IDs (if any).
The bar plot, like every individual visualization in eVITTA, can be customized with its own plotting parameters.
The x- and y-axes and the plotting parameters are mostly identical to those of the above bar plot. Additional parameters are the adjustable minimum and maximum bubble sizes, for aesthetic purposes. Bubble size reflects the number of leading-edge (GSEA module) or overlapping (ORA module) genes.
Simple text mining is done to extract the top frequently appearing words in the enriched gene sets. Hover over each word for the underlying gene sets. pval, padj and colors are adjustable as in the bar and the bubble plots. Adjust the number of top appearing words to display by clicking the gear button.
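A word-frequency count of this kind can be sketched as follows. The tokenization, the stop-word list and the choice to count each word once per gene set are all assumptions made for illustration; `top_keywords` and `STOP_WORDS` are hypothetical names.

```python
import re
from collections import Counter

# Small, illustrative stop-word list (an assumption, not easyGSEA's list).
STOP_WORDS = {"of", "the", "and", "in", "to", "by", "a", "an", "for", "or"}


def top_keywords(gs_descriptions, n=10):
    """Return the n most frequent words as (word, number of gene sets containing it)."""
    counts = Counter()
    for description in gs_descriptions:
        words = set(re.findall(r"[a-z0-9]+", description.lower()))
        counts.update(words - STOP_WORDS)
    return counts.most_common(n)


print(top_keywords(["Regulation of lipid metabolism", "Lipid transport", "Response to lipid"], n=3))
```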
Manhattan plot displays enrichment results on all tested gene sets. Each GS library is uniquely colored and plotted along the x-axis. -log10(pval/padj) for each GS is plotted along the y-axis. Significantly enriched gene sets, as defined by the chosen pval or padj threshold, are highlighted with stronger colors. Horizontal dashed line reflects the threshold of pval or padj. Hover labels show the statistical information of each GS.
ESs are plotted along the x-axis. -log10(pval/padj) for each GS is plotted along the y-axis. You may adjust the pval or padj threshold using the gear dropdown on bottom left of the plot. In the Continuous mode, the color intensity reflects strengths of regulation (-log10(pval/padj)*sign(ES)). In the Discrete mode, GSs that meet pval or padj threshold are highlighted in red. Both continuous and discrete volcanos are interactive, where hover labels show details about each gene’s name, logFC, and padj. In the Static mode, GSs that meet the pval/padj threshold are highlighted and labelled with their names.
Click the interactive plots, or manually search for the GS you are interested in, to zoom into its detailed statistics. easyGSEA provides options to visualize the distribution of ES scores of a selected GS in the genome background.
The primary result of GSEA is the enrichment score (ES), which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. The enrichment plot provides a graphical view of the ES for a gene set. The score at the peak of the plot (the score furthest from 0.0) is the ES for the gene set. Gene sets with a distinct peak at the beginning (such as the one shown here) or end of the ranked list are generally the most interesting. The leading edge genes are the subset of members that contribute most to the ES. For a positive ES, the leading edge genes are the set of members that appear in the ranked list prior to the peak score. For a negative ES, they are the set of members that appear subsequent to the peak score.
The detailed enrichment statistics about a GS are provided in a copyable and downloadable table.
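The running-sum statistic behind the enrichment plot can be sketched as below, following the weighted scheme described by Subramanian et al (2005). This is an illustration, not easyGSEA's implementation: the function name `enrichment_score` and the default `weight=1.0` are assumptions, and no permutation testing is included.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Compute the ES and leading-edge genes for one gene set.

    ranked: [(gene, score)] sorted by score in descending order.
    The running sum increases at gene set members (weighted by |score|)
    and decreases at non-members; the ES is its maximum deviation from zero.
    """
    members = set(gene_set)
    hit_total = sum(abs(s) ** weight for g, s in ranked if g in members)
    n_miss = sum(1 for g, _ in ranked if g not in members)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene set must partly, but not fully, cover the ranked list")

    running, es, peak_index = 0.0, 0.0, -1
    for i, (gene, score) in enumerate(ranked):
        if gene in members:
            running += abs(score) ** weight / hit_total
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(es):
            es, peak_index = running, i

    hits = [(i, g) for i, (g, _) in enumerate(ranked) if g in members]
    if es >= 0:
        leading_edge = [g for i, g in hits if i <= peak_index]  # members before the peak
    else:
        leading_edge = [g for i, g in hits if i >= peak_index]  # members after the peak
    return es, leading_edge


# Toy ranked list for illustration.
toy = [("a", 3.0), ("b", 2.0), ("c", 1.0), ("d", -1.0), ("e", -2.0), ("f", -3.0)]
print(enrichment_score(toy, {"a", "b"}))
print(enrichment_score(toy, {"e", "f"}))
```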
The rank scores in bins of all genes (blue) and genes in a selected GS (orange) are plotted along the x-axis. The density (frequency) of each rank score bin is plotted along the y-axis. The leading-edge genes are plotted as dots (green) based on their rank scores.
Distributions of rank scores for the genome background (blue) and a selected GS (orange) are plotted along the y-axis. In violin plots, standard deviations (s.d.) are indicated with grey perpendicular lines. You may adjust the number of s.d. to display using the gear dropdown on the bottom left of the violin plot.
Similar to the above box plot, distributions of rank scores for the genome background (blue) and a selected GS (orange) are plotted along the y-axis. Standard deviations (s.d.) are indicated with grey perpendicular lines. You may adjust the number of s.d. to display using the gear dropdown on the bottom left of the violin plot.
You can visualize biological pathways (KEGG, Reactome, Wikipathways) by clicking or selecting the corresponding gene set and scrolling down to the bottom of the page. Genes are colored according to their ranks in your dataset.
In GSEA runs, upregulated genes are colored in red, while downregulated genes are colored in blue. Color intensity reflects strength of regulation. In ORA runs, overlap genes in both the user-supplied gene list and the pathway are highlighted in green.
In GSEA runs, leading edge genes are highlighted in red. In ORA runs, overlap genes are highlighted in red. All Reactome diagrams are interactive. Click on the nodes (subset of reactions, genes, enzymes, substrates, products, etc.) and edges (interactions, genetic or physical) to learn more.
In GSEA runs, upregulated genes are colored in red, while downregulated genes are colored in blue. Color intensity reflects strength of regulation. In ORA runs, overlap genes in both the user-supplied gene list and the pathway are highlighted in green. All Wikipathways diagrams are interactive. Click on each node (gene) to learn more about its definition and functions.
Each plot can be customized as you need. For example, you can easily select/deselect databases, adjust P and P.adj thresholds, adjust the number of top regulations to display, and modify artistic appearance of the figures (e.g. string length, bubble sizes).
Download the customized plots to your local drive for long-term storage purposes.
Gene sets (GSs) sometimes can be redundant. A network view is helpful to group GSs that probably describe similar biology, and/or examine relations between GSs.
Each node denotes a GS. Node size reflects the number of leading-edge genes (GSEA module), overlapping genes (ORA module), or genes in the original database. Node color intensity reflects strength of regulation (-log10(pval/padj)*sign(ES)). Each edge reflects significant gene overlap between GSs as defined by the Jaccard (the default), overlap, or combined coefficient. Hover over each node and edge for detailed statistical information about each GS and its relationship with other GSs.
Color options: Red, Salmon, Blue, Cyan, Orange, Green, Purple, Grey
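The three similarity measures can be illustrated on plain Python sets. The Jaccard and overlap coefficients follow their standard definitions; the combined coefficient is shown here as an equal-weight mix of the two (k = 0.5), which is an assumption about the weighting rather than a documented easyGSEA setting. The gene sets are assumed to be non-empty.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|"""
    return len(a & b) / len(a | b)


def overlap(a, b):
    """|A ∩ B| / min(|A|, |B|)"""
    return len(a & b) / min(len(a), len(b))


def combined(a, b, k=0.5):
    """Weighted mix of overlap and Jaccard coefficients (k is an assumed weight)."""
    return k * overlap(a, b) + (1 - k) * jaccard(a, b)


gs1 = {"g1", "g2", "g3", "g4"}
gs2 = {"g3", "g4", "g5"}
print(jaccard(gs1, gs2), overlap(gs1, gs2), combined(gs1, gs2))
```

The overlap coefficient is more lenient towards a small GS nested inside a large one, whereas the Jaccard coefficient penalizes differences in size.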
If you find the enriched GSs are too connected or too disconnected with each other, or if there are too many/few GSs in the defined thresholds, or if you’d like to adjust the method of plotting edges, click the gear button on top right and adjust the plotting parameters to fine tune the network view.
We provide means to cluster GSs according to their similarities, and offer different visualizations to help you interpret the enriched GSs.
Similarity scores (as defined by Jaccard, Overlap or Combined coefficients) between gene sets are converted into a distance (dissimilarity) matrix. Hierarchical clustering using the complete method is performed and a similarity threshold (default 0.25) is used to group gene sets that probably describe similar biology. The most significantly enriched gene set (lowest padj) is chosen to annotate the cluster it occurs in.
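The clustering step can be sketched in plain Python. This sketch assumes that distance is defined as 1 minus the Jaccard similarity and that the dendrogram is cut at a height of 1 minus the similarity threshold; both are assumptions about details not given here, and `cluster_gene_sets` is a hypothetical name. Gene sets are assumed to be non-empty.

```python
def cluster_gene_sets(gene_sets, padj, similarity_threshold=0.25):
    """Complete-linkage agglomerative clustering of gene sets.

    gene_sets: {name: set of genes}; padj: {name: adjusted P-value}.
    Returns {representative name: sorted member names}, where the
    representative is the member with the lowest padj.
    """
    def distance(x, y):
        a, b = gene_sets[x], gene_sets[y]
        return 1.0 - len(a & b) / len(a | b)

    def complete_linkage(c1, c2):
        # Distance between clusters = largest pairwise distance.
        return max(distance(x, y) for x in c1 for y in c2)

    cutoff = 1.0 - similarity_threshold
    clusters = [[name] for name in gene_sets]
    while len(clusters) > 1:
        dist, i, j = min(
            (
                (complete_linkage(clusters[i], clusters[j]), i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda t: t[0],
        )
        if dist > cutoff:
            break  # remaining clusters are too dissimilar to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {min(c, key=lambda n: padj[n]): sorted(c) for c in clusters}


toy_sets = {"A": {"g1", "g2", "g3"}, "B": {"g1", "g2", "g3", "g4"}, "C": {"g7", "g8"}}
toy_padj = {"A": 0.01, "B": 0.001, "C": 0.02}
print(cluster_gene_sets(toy_sets, toy_padj))
```

Stopping the merges once the smallest complete-linkage distance exceeds the cutoff gives the same groups as building the full tree and cutting it at that height.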
Each node denotes a gene set, the height denotes the dissimilarity score, and the dashed line denotes the similarity threshold. Hover over each node for the gene set name. The similarity threshold, label text size, and minimum cluster size for labels are all customizable so that you can best summarize your data.
You could make some adjustments to the dendrogram using the dendrogram dropdown menu.
The most significant gene set in each cluster is plotted as a bar. The bars can be sorted either by cluster size (1 to the largest), or by ES in GSEA and pval/padj in ORA.
As in the above bar plot, the most significant gene set in each cluster is plotted, here as a bubble. The bubbles can be sorted either by cluster size (1 to the largest), or by ES in GSEA and pval/padj in ORA.
The table showing the clustering statistics can be downloaded for record and detailed examination.
Click the download button and save the plot you need.
Download the enrichment table for multiple comparisons in easyVizR. All gene set libraries in easyGSEA are also available for download for custom analysis and/or tool development.
You may download the enrichment table and proceed to easyVizR for multiple comparisons. You may also explore details about the enrichment results on your local drive.
You may use our up-to-date gene set libraries to help with your analysis (if any) and/or for further tool development.
Feel free to reach us at evitta@cmmt.ubc.ca if you have any questions.