easyVizR (read: easy-vise-R) provides various visualization modules designed for comparing and making inferences from multiple sets of expression data.
Multiple comparisons play a crucial role in research into complex regulatory networks. Oftentimes, multiple genotypes are profiled in parallel, and identifying the overlaps and disjoints among differential expression profiles is important to uncover functional dysregulations following a genetic perturbation. Even when multi-factored, multi-leveled study designs are not involved, it is increasingly standard for researchers to compare new results to published data.
To streamline the exploration of multiple datasets, easyVizR integrates list filtering, intersection selection, and visualization in the same workflow. At any time, one can change the filters or the selected intersection, and the graphs will be updated dynamically to reflect the new parameters. This flexible workflow allows for rapid discovery of patterns that underlie common and divergent regulations in any number of datasets.
easyVizR's workflow consists of four steps:
easyVizR accepts comma-delimited data tables (*.csv) as input. Each input data table must contain four essential columns:
In easyVizR, these four columns are referred to as “Name”, “Value”, “PValue” and “FDR”, respectively.
Supported input types include, but are not limited to, the following:
Note that it is strongly recommended to use the whole dataset without any cutoffs or filters. You can apply filters in the visualization interface, which will be reflected dynamically in the visualizations. Take advantage of this reactivity to interrogate the data in greater depth.
As specified above, the input dataframe should have at least four columns: Name, Stat (the main statistic, e.g. logFC), PValue, FDR. A minimal example looks like this:
Name | logFC | PValue | FDR |
---|---|---|---|
2L52.1 | -0.26 | 0.63 | 1 |
aagr-1 | -0.36 | 0.25 | 1 |
aagr-2 | -0.21 | 0.37 | 1 |
aagr-3 | -0.21 | 0.40 | 1 |
aagr-4 | 0.78 | 0.03 | 1 |
Currently, both PValue and FDR columns are required. If your data only has one of these columns, you can bypass this requirement by manually duplicating the available column and not using the duplicated column in visualizations (NOT RECOMMENDED).
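Before uploading, it can help to confirm that a file actually carries the four required columns. The following is a minimal sketch using only the Python standard library. It assumes the header names match the minimal example above (`Name`, `logFC`, `PValue`, `FDR`); the helper name `missing_columns` is hypothetical and is not part of easyVizR.

```python
import csv
import io

# Assumed header names, taken from the minimal example table in this document.
REQUIRED_COLUMNS = ("Name", "logFC", "PValue", "FDR")

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [cell.strip() for cell in next(reader, [])]
    return [name for name in required if name not in header]

example = "Name,logFC,PValue,FDR\n2L52.1,-0.26,0.63,1\naagr-1,-0.36,0.25,1\n"
print(missing_columns(example))                          # []
print(missing_columns("Name,logFC,PValue\na,1,0.5\n"))   # ['FDR']
```

If your statistic column has a different name (e.g. ES), pass your own tuple as `required`.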
You can upload a single .txt or .csv file (comma separated).
For each upload, you will be asked to specify the following things:
Select a folder containing multiple .txt or .csv files (comma separated).
Large dataframes can eat up resources on the server side and lengthen the processing time. To prevent this, user uploads are limited to the following:
Here you can select and delete datasets that are no longer needed.
Select two or more datasets for comparison. The datasets selected must share at least one identifier.
After datasets are selected, their expression tables will be horizontally merged into a master data-frame, with column names in the format “XXX_NNNNN” (where XXX denotes the column name, e.g. Value, and NNNNN denotes the name of the dataset).
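The column naming scheme of the merged table can be illustrated with a small sketch. This is not the app's code: the function `merge_datasets` is hypothetical, and keeping identifiers that are missing from some datasets (filled with `None`) is an assumption made here for illustration.

```python
def merge_datasets(datasets):
    """Horizontally merge datasets keyed by identifier.

    `datasets` maps a dataset name to {identifier: {column: value}}.
    Each column is renamed to "<column>_<dataset name>". Identifiers absent
    from a dataset get None in that dataset's columns (an assumption).
    """
    all_ids = sorted(set().union(*(rows.keys() for rows in datasets.values())))
    merged = {}
    for identifier in all_ids:
        row = {}
        for ds_name, rows in datasets.items():
            # Take the column names from the first row of this dataset.
            columns = list(next(iter(rows.values()), {}))
            for col in columns:
                row[f"{col}_{ds_name}"] = rows.get(identifier, {}).get(col)
        merged[identifier] = row
    return merged

datasets = {
    "A": {"aagr-1": {"Value": -0.36, "PValue": 0.25},
          "aagr-4": {"Value": 0.78, "PValue": 0.03}},
    "B": {"aagr-4": {"Value": 1.20, "PValue": 0.01}},
}
merged = merge_datasets(datasets)
print(sorted(merged["aagr-4"]))  # ['PValue_A', 'PValue_B', 'Value_A', 'Value_B']
```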
After selecting datasets, the user must apply filters to each dataset to generate gene lists. Conventionally, the gene list contains genes that are “significantly changed” according to a set of cutoffs. These gene lists are then used for intersection analysis.
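As a rough illustration of how cutoffs turn a dataset into a gene list, the sketch below applies a p-value cutoff and an optional magnitude cutoff. The function `gene_list` and its parameter names are hypothetical, and the default cutoffs are only examples; the data are the rows of the minimal input table shown earlier.

```python
def gene_list(rows, p_cutoff=0.05, min_abs_value=0.0):
    """Return the names with PValue < p_cutoff and |Value| >= min_abs_value."""
    return {
        name
        for name, r in rows.items()
        if r["PValue"] < p_cutoff and abs(r["Value"]) >= min_abs_value
    }

rows = {
    "2L52.1": {"Value": -0.26, "PValue": 0.63},
    "aagr-1": {"Value": -0.36, "PValue": 0.25},
    "aagr-2": {"Value": -0.21, "PValue": 0.37},
    "aagr-3": {"Value": -0.21, "PValue": 0.40},
    "aagr-4": {"Value": 0.78, "PValue": 0.03},
}
print(gene_list(rows))                     # {'aagr-4'}
print(gene_list(rows, min_abs_value=1.0))  # set()
```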
There are two methods to apply desired filters:
Enter your filters in the filter inputs.
Some commonly used filtering strategies are provided as presets. Hover over each button to see the effects.
If you have unsaved changes, the filter inputs will glow red, and a text reminder will pop up on the bottom left.
Click the Save Filters button to save them. This will also update the Preview gene list table into the Saved gene list table.
For a small number of gene lists (n < 6), the Venn Diagram is useful for visualizing intersections between gene lists.
Users can toggle between two versions of the diagram:
The number of terms not included in any gene list is shown on the bottom right.
By default, the intersection currently selected in Intersection of Interest is highlighted.
For a small number of gene lists (n < 6), an UpSet plot (rendered with UpSetR) is also provided.
In the UpSet plot, set sizes are presented in bar format. Thus, it is a useful alternative to Venn and Euler diagrams when they become uninformative.
By default, the intersection currently selected in Intersection of Interest is highlighted.
Here you can select an intersection of interest to interrogate in further detail.
In the interface, “True”, “False” and “Ignore” options are provided for each filtered list to specify set relationships. The effects of each of these options are specified below:
Option | Contained in the filtered gene list? | Satisfy the filters? | Represented in Venn Diagram |
---|---|---|---|
TRUE | YES | YES | included in the circle |
FALSE | NO | NO | excluded from the circle |
Ignore | may or may not | may or may not | may be included or excluded |
In terms of set operations, for filtered list A, with filter p<0.05:
The set operations are linked by “AND”. We do not currently support “OR”.
Combining these options for multiple filtered lists allows one to build a set operation that points to a particular intersection.
For instance, for three filtered lists A, B, C, the following selections are possible:
There are two ways you can verify that you have selected the correct intersection: 1) by comparing the number of rows in the table to the Venn counts, and 2) by checking the Active filters description.
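The True/False/Ignore logic described above can be sketched in a few lines. This is an illustration under stated assumptions, not the app's implementation: the function `select_intersection` is hypothetical, `None` stands in for “Ignore”, and all conditions are combined with AND as stated above.

```python
def select_intersection(universe, gene_lists, selection):
    """Return the terms that satisfy every True/False choice.

    `selection` maps a list name to True (term must be in the list),
    False (term must not be in the list) or None (Ignore).
    """
    chosen = set()
    for term in universe:
        keep = True
        for name, want in selection.items():
            if want is None:
                continue  # Ignore: membership in this list does not matter
            if (term in gene_lists[name]) != want:
                keep = False
                break
        if keep:
            chosen.add(term)
    return chosen

universe = {"g1", "g2", "g3", "g4", "g5"}
lists = {"A": {"g1", "g2", "g3"}, "B": {"g2", "g3", "g4"}, "C": {"g3"}}

# In A AND in B AND NOT in C
print(sorted(select_intersection(universe, lists, {"A": True, "B": True, "C": False})))  # ['g2']
```

The size of the returned set is the number you would compare against the corresponding Venn count.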
Terms in the selected intersection are presented in the intersection table below.
There are three available views:
The intersection table provides a text enrichment word cloud module. This module shows enriched words in the identifier column, which is especially useful for identifying recurring words in a set of functional categories.
NOTE: this widget should not be used for rigorous analysis. It has lots of flaws, e.g.:
Data and filter options are available as widgets at the top right of the page.
There are four buttons:
If you have specific genes or annotations in mind, open the dropdown and enter them here.
Select View genes to view the specified genes or annotations. Click Reset to revert to the original view.
If the selected genes or annotations do not fulfill the current filters, they will not be shown. Remove or change the filters to view all selected genes or annotations.
This dropdown is a minified version of the filter selection controls in Tab 2.
Use this to adjust the filters any time in the visualization tabs.
This dropdown shows the filtered gene lists that are currently active.
This dropdown displays additional options for easyGSEA output.
Typical easyGSEA outputs have a leading identifier specifying the database (e.g. “KEGG_”) and a trailing identifier specifying the gene set ID (if applicable).
By default, most of the visualizations are based on the currently selected intersection.
At any point, you can change the filters and select a different intersection, and the plots will be refreshed dynamically. This flexibility allows for rapid discovery of patterns that underlie common and divergent regulations in multiple datasets.
The interactive heatmap (rendered with plotly) visualizes how terms in the selected intersection are differentially regulated across datasets.
For two selected datasets (denoted X and Y), easyVizR provides an interactive two-dimensional scatter plot (rendered with plotly).
By default, the differential expression metric (“Value”) is plotted.
To specify which set of terms to plot, users can choose among five options:
To better situate a set of terms in the full correlation profile, excluded terms can be optionally shown in the background in light grey.
For the plotted data points, three coloring options are provided:
Users can hover over any data point in the plot to see its metric, p-value and FDR in datasets X and Y.
To prevent infinity errors in the log transformation, PValue or FDR values exactly equal to 0 are set to 1e-5 (0.00001) when plotting. Terms with NA in either dimension are excluded.
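The zero-handling described above can be sketched as follows. This is a minimal illustration rather than the app's code: the function name `neg_log10` is hypothetical, and the floor uses the decimal value 0.00001 quoted in the text.

```python
import math

FLOOR = 0.00001  # substitute for p-values that are exactly 0

def neg_log10(p, floor=FLOOR):
    """Return -log10(p), replacing an exact 0 with `floor` to avoid infinity."""
    if p == 0:
        p = floor
    return -math.log10(p)

values = [neg_log10(p) for p in (0.05, 0.001, 0)]
```

With this floor, a p-value of exactly 0 is plotted at about 5 on the -log10 scale, so any real p-value smaller than 0.00001 will appear above it.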
Color summary: in the “color summary” dropdown, the displayed colors of data points are tallied into a table, which is available for download as a comma-delimited file.
Correlation: the squared correlation coefficient (r^2) is displayed on the left, along with the equation for the correlation line.
Note: all points in the scatter plot are used to calculate the correlation, regardless of color, size or position (foreground/ background).
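For readers who want to reproduce these numbers outside the app, here is a hedged sketch. It assumes the correlation line is an ordinary least-squares fit of Y on X, which the text does not state explicitly; the function `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus r^2, over all points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

slope, intercept, r_squared = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(f"y = {slope:.2f}x + {intercept:.2f}, r^2 = {r_squared:.2f}")
# y = 2.00x + 0.00, r^2 = 1.00
```

Pass every plotted point, foreground and background alike, to match the note above.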
Exclusion Report: Due to the nature of the scatter plot, terms missing values on one of the dimensions cannot be plotted. These are shown in the exclusion report below the scatter plot.
Term exclusions happen when identifiers are not shared across all your datasets, AND you have selected one of the following:
For three selected datasets (denoted X, Y and Z), an interactive three-dimensional scatter plot is rendered with plotly.
By default, the differential expression metric (“Value”) is plotted. Sizes are defined as (multiplier-1). Terms with NA in any dimension are excluded (see exclusion report below).
The volcano plot shows -log10(PValue) against the main differential expression metric (e.g. logFC). Based on defined thresholds, certain terms are highlighted.
By default, only the selected intersection is included in the graph. Additionally, excluded terms can be shown in the background in light grey.
To prevent infinity errors in the log transformation, PValue values exactly equal to 0 are set to 1e-5 (0.00001) when plotting.
Bar plot can be used to plot any shared numeric column (“Value” is plotted by default). Bar plot is only available if the number of genes in selected intersection <= 15.
Colors are displayed as -log10(PValue(X)). To prevent infinity errors in the log transformation, PValue values exactly equal to 0 are set to 1e-5 (0.00001) when plotting.
The Network module is only available for GSEA datasets with a valid column for leading-edge genes delimited by a specific separator.
Rank-rank hypergeometric overlap (RRHO) is a two-dimensional visualization algorithm that represents correspondence between two differential expression profiles using a ranked-list approach (Plaisier et al, 2010).
To run RRHO, users are prompted to select two datasets to compare. From the p-value and differential expression metric (e.g. logFC, ES), rank scores are automatically computed with log10 transformation, and the results are ordered into ranked lists. The algorithm then steps through the two ranked lists, and the statistical significance of the number of overlapping genes above the sliding threshold is computed in succession. The resulting p-values are assembled into a hypergeometric matrix, which is used to plot the RRHO level plot.
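The stepping procedure can be sketched in plain Python. This is a simplified illustration, not the implementation used by easyVizR or by the original RRHO software. Assumptions: the rank score is -log10(p) carrying the sign of the statistic, only over-enrichment (the upper hypergeometric tail) is computed, and all function names are hypothetical.

```python
import math

def rank_score(p_value, stat):
    """Signed rank score: -log10(p) with the sign of the statistic (assumed)."""
    sign = (stat > 0) - (stat < 0)
    return -math.log10(p_value) * sign

def hypergeom_upper_tail(k, n_total, n_a, n_b):
    """P(overlap >= k) when n_b of n_total items are drawn and n_a are marked."""
    total = math.comb(n_total, n_b)
    hits = sum(
        math.comb(n_a, i) * math.comb(n_total - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return hits / total

def rrho_matrix(scores_x, scores_y, step):
    """-log10 hypergeometric p-values for the top-i of X versus the top-j of Y."""
    genes = sorted(set(scores_x) & set(scores_y))  # shared identifiers only
    n = len(genes)
    ranked_x = sorted(genes, key=lambda g: scores_x[g], reverse=True)
    ranked_y = sorted(genes, key=lambda g: scores_y[g], reverse=True)
    cuts = list(range(step, n + 1, step))  # sliding thresholds
    matrix = []
    for i in cuts:
        top_x = set(ranked_x[:i])
        row = []
        for j in cuts:
            overlap = len(top_x & set(ranked_y[:j]))
            p = hypergeom_upper_tail(overlap, n, i, j)
            row.append(-math.log10(p))
        matrix.append(row)
    return matrix

scores_x = {"g1": 4.0, "g2": 3.0, "g3": -1.0, "g4": -2.0}
scores_y = {"g1": 5.0, "g2": 2.5, "g3": -0.5, "g4": -3.0}
matrix = rrho_matrix(scores_x, scores_y, step=2)
```

Each cell of `matrix` corresponds to one pair of thresholds; larger values mark stronger overlap between the tops of the two ranked lists.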
An in-depth explanation of the algorithm and examples for result interpretation are found in the original paper.
Two complementary visualizations are included in this module: level plot and rank-rank scatter.
The RRHO level plot is generated from log10-transformed hypergeometric P-values.
Color scale indicates log10-transformed hypergeometric P-values; under-enrichment is indicated by negative values. Normally, no white cells should occur, but if they do, they may correspond to hypergeometric P-values of zero.
Step number and step size dictate the resolution of the graph. Smaller step sizes generate more detailed graphs, but take up a lot of computational resources. By default, for ranked lists of n<1000, step number is set as sqrt(n); for large lists of n>=1000, step number is capped at sqrt(1000).
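The default rule stated above amounts to a one-line formula. The sketch below returns the unrounded value, since the text does not say how the app rounds it; the function name `default_step_number` is hypothetical.

```python
import math

def default_step_number(n, cap=1000):
    """sqrt(n) for n < cap, otherwise sqrt(cap) (unrounded)."""
    return math.sqrt(min(n, cap))

print(default_step_number(100))  # 10.0
```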
“Hotspots” in the plot correspond to places where the two datasets are most similar. Strong correlation is indicated by high values along the diagonal; see examples here.
Besides the level plot, an additional rank-rank scatter plot is provided to visualize the spread of the data. Spearman’s correlation coefficient (rho) is also provided.
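Spearman's rho is the Pearson correlation of the ranks of the two variables. A minimal sketch, assuming tied values receive the average of their ranks; both function names are hypothetical, and the input lists must not be constant.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
```

Because only ranks are used, any strictly increasing relationship gives a rho of 1, even if it is far from linear.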
For examples of weak, medium and strong correlation, see here.
Feel free to reach us at evitta@cmmt.ubc.ca if you have any questions.
Q: I'm seeing a lot of numbers in the header selection, why is that?
A: Check if your files have a header row. By default, the app uses the first row as the header.
Q: I'm uploading differential expression data. My files always give me “duplicate name” warnings, although there shouldn’t be any?
A: A possibility is that you saved your .csv files using Excel. Excel automatically converts some gene names to dates, which can cause this error. As the link suggests, take extra precautions when working with Excel.
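One way to spot this problem before uploading is to scan the identifier column for date-like strings. The sketch below is a heuristic only: the patterns (e.g. “1-Mar”, “Sep-02”) are assumptions about how converted names commonly look, and the function `date_like_names` is hypothetical.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches "1-Mar" style or "Sep-02" style strings (assumed patterns).
DATE_LIKE = re.compile(
    rf"^(?:\d{{1,2}}-(?:{MONTHS})|(?:{MONTHS})-\d{{1,2}})$", re.IGNORECASE
)

def date_like_names(names):
    """Return the identifiers that look like dates produced by a spreadsheet."""
    return [name for name in names if DATE_LIKE.match(name.strip())]

print(date_like_names(["aagr-1", "1-Mar", "Sep-02", "2L52.1"]))  # ['1-Mar', 'Sep-02']
```

Any hits should be checked against the original gene names, since two different genes can collapse into the same date string and trigger the duplicate warning.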
Q: I'm uploading a folder of files, and somehow it errors out.
A: Please check the following: 1) your folder only contains the files you wish to upload, 2) all the files share the same set of column names, and 3) there are no formatting problems (e.g. blank lines) inside the csv files.
Q: I chose to filter by gene list, but now nothing shows up in the visualizations and the table and I get no "gene not found" message.
A: Most likely your genes don't fulfill the active filters and are excluded. Try removing all the filters.
Q: The values I see in Venn aren’t matching up with the number of points in the scatter plot!
A: Some data points might have been excluded. The number of excluded items plus the number of plotted items *should* equal the corresponding Venn count.
Q: How do I see all the genes/ gene sets, without any filters?
A: Select “Ignore” in the intersection selection panel for all datasets. Filters should not matter.
Q: How do I see all the genes/ gene sets that have values in all datasets?
A: Select “No filters” in the filtering panel, and select “True” for all datasets in the intersection selection panel.
Q: I want to see “genes that are significant (p<0.05) in dataset X but NOT in dataset Y”.
A: Select p<0.05 for both X and Y, then select True for X, and False for Y.
Q: I ran into problems with plotly graphs in Safari.
A: Plots rendered with plotly may not display properly in Safari. Refer to “known issues” below.
Q: Parts of the interface randomly froze or became blank?
A: Make sure no visualization algorithm is running in the background. If this is not the case, this is likely a bug. Navigate to another tab and come back to refresh the UI; this should fix it in most cases. Drop us a bug report as well.
Q: I'm using this to plot GSEA data. I included the leadingEdge at upload, but now it won't show up in the heatmap, help?
A: Check if all your selected datasets have the leadingEdge column, and if the column has the same name. The heatmap only shows columns that are shared among all datasets.
Safari users: