6. Developmental Guide
6.1. Module APIs
- gseapy.gsea()[source]
Run Gene Set Enrichment Analysis.
- Parameters:
data – Gene expression data table, Pandas DataFrame, gct file.
gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input with GSEA.
cls – A list or a .cls file format required for GSEA.
outdir (str) – Results output directory. If None, nothing will write to disk.
permutation_num (int) – Number of permutations. Default: 1000. Minimial possible nominal p-value is about 1/nperm.
permutation_type (str) – Type of permutation reshuffling, choose from {“phenotype”: ‘sample.labels’ , “gene_set” : gene.labels}.
min_size (int) – Minimum allowed number of genes from gene set also the data set. Default: 15.
max_size (int) – Maximum allowed number of genes from gene set also the data set. Default: 500.
weight (float) – Refer to
algorithm.enrichment_score()
. Default:1.method –
The method used to calculate a correlation or ranking. Default: ‘signal_to_noise’. Others methods are:
’signal_to_noise’
You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
’t_test’
Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
’ratio_of_classes’ (also referred to as fold change).
Uses the ratio of class means to calculate fold change for natural scale data.
’diff_of_classes’
Uses the difference of class means to calculate fold change for nature scale data
’log2_ratio_of_classes’
Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for log scale data.
ascending (bool) – Sorting order of rankings. Default: False.
threads (int) – Number of threads you are going to use. Default: 4.
figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].
format (str) – Matplotlib figure format. Default: ‘pdf’.
graph_num (int) – Plot graphs for top sets of each phenotype.
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
seed – Random seed. expect an integer. Default:None.
verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.
- Returns:
Return a GSEA obj. All results store to a dictionary, obj.results, where contains:
| { | term: gene set name, | es: enrichment score, | nes: normalized enrichment score, | pval: Nominal p-value (from the null distribution of the gene set, | fdr: FDR qvalue (adjusted False Discory Rate), | fwerp: Family wise error rate p-values, | tag %: Percent of gene set before running enrichment peak (ES), | gene %: Percent of gene list before running enrichment peak (ES), | lead_genes: leading edge genes (gene hits before running enrichment peak), | matched genes: genes matched to the data, | }
- gseapy.prerank()[source]
Run Gene Set Enrichment Analysis with pre-ranked correlation defined by user.
- Parameters:
rnk – pre-ranked correlation table or pandas DataFrame. Same input with
GSEA
.rnk file.gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input with GSEA.
outdir – results output directory. If None, nothing will write to disk.
permutation_num (int) – Number of permutations. Default: 1000. Minimial possible nominal p-value is about 1/nperm.
min_size (int) – Minimum allowed number of genes from gene set also the data set. Default: 15.
max_size (int) – Maximum allowed number of genes from gene set also the data set. Defaults: 500.
weight (str) – Refer to
algorithm.enrichment_score()
. Default:1.ascending (bool) – Sorting order of rankings. Default: False.
threads (int) – Number of threads you are going to use. Default: 4.
figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].
format (str) – Matplotlib figure format. Default: ‘pdf’.
graph_num (int) – Plot graphs for top sets of each phenotype.
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
seed – Random seed. expect an integer. Default:None.
verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.
- Returns:
Return a Prerank obj. All results store to a dictionary, obj.results, where contains:
| { | term: gene set name, | es: enrichment score, | nes: normalized enrichment score, | pval: Nominal p-value (from the null distribution of the gene set, | fdr: FDR qvalue (adjusted False Discory Rate), | fwerp: Family wise error rate p-values, | tag %: Percent of gene set before running enrichment peak (ES), | gene %: Percent of gene list before running enrichment peak (ES), | lead_genes: leading edge genes (gene hits before running enrichment peak), | matched genes: genes matched to the data, | }
- gseapy.ssgsea()[source]
Run Gene Set Enrichment Analysis with single sample GSEA tool
- Parameters:
data – Expression table, pd.Series, pd.DataFrame, GCT file, or .rnk file format.
gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input with GSEA.
outdir – Results output directory. If None, nothing will write to disk.
sample_norm_method (str) –
Sample normalization method. Choose from {‘rank’, ‘log’, ‘log_rank’, None}. Default: rank. this argument will be used for ordering genes.
’rank’: Rank your expression data, and transform by 10000*rank_dat/gene_numbers
’log’ : Do not rank, but transform data by log(data + exp(1)), while data = data[data<1] =1.
’log_rank’: Rank your expression data, and transform by log(10000*rank_dat/gene_numbers+ exp(1))
None or ‘custom’: Do nothing, and use your own rank value to calculate enrichment score.
see here: https://github.com/GSEA-MSigDB/ssGSEAProjection-gpmodule/blob/master/src/ssGSEAProjection.Library.R, line 86
- Parameters:
correl_norm_type (str) –
correlation normalization type. Choose from {‘rank’, ‘symrank’, ‘zscore’, None}. Default: rank. After ordering genes by sample_norm_method, further data transformed could be applied to get enrichment score.
when weight == 0, sample_norm_method and correl_norm_type do not matter; when weight > 0, the combination of sample_norm_method and correl_norm_type dictate how the gene expression values in input data are transformed to obtain the score – use this setting with care (the transformations can skew scores towards +ve or -ve values)
sample_norm_method will first transformed and rank original data. the data is named correl_vector for each sample. then correl_vector is transformed again by
correl_norm_type is None or ‘rank’ : do nothing, genes are weighted by actual correl_vector.
correl_norm_type ==’symrank’: symmetric ranking.
correl_norm_type ==’zscore’: standardizes the correl_vector before using them to calculate scores.
min_size (int) – Minimum allowed number of genes from gene set also the data set. Default: 15.
max_size (int) – Maximum allowed number of genes from gene set also the data set. Default: 2000.
permutation_num (int) – For ssGSEA, default is 0. However, if you try to use ssgsea method to get pval and fdr, set to an interger.
weight (str) – Refer to
algorithm.enrichment_score()
. Default:0.25.ascending (bool) – Sorting order of rankings. Default: False.
threads (int) – Number of threads you are going to use. Default: 4.
figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [7,6].
format (str) – Matplotlib figure format. Default: ‘pdf’.
graph_num (int) – Plot graphs for top sets of each phenotype.
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
seed – Random seed. expect an integer. Default:None.
verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.
- Returns:
Return a ssGSEA obj. All results store to a dictionary, access enrichment score by obj.resultsOnSamples, and normalized enrichment score by obj.res2d. if permutation_num > 0, additional results contain:
| { | term: gene set name, | es: enrichment score, | nes: normalized enrichment score, | pval: Nominal p-value (from the null distribution of the gene set (if permutation_num > 0), | fdr: FDR qvalue (adjusted FDR) (if permutation_num > 0), | fwerp: Family wise error rate p-values (if permutation_num > 0), | tag %: Percent of gene set before running enrichment peak (ES), | gene %: Percent of gene list before running enrichment peak (ES), | lead_genes: leading edge genes (gene hits before running enrichment peak), | matched genes: genes matched to the data, | }
- gseapy.enrichr()[source]
Enrichr API.
- Parameters:
gene_list – str, list, tuple, series, dataframe. Also support input txt file with one gene id per row. The input identifier should be the same type to gene_sets.
gene_sets –
str, list, tuple of Enrichr Library name(s). or custom defined gene_sets (dict, or gmt file).
Examples:
- Input Enrichr Libraries (https://maayanlab.cloud/Enrichr/#stats):
str: ‘KEGG_2016’ list: [‘KEGG_2016’,’KEGG_2013’] Use comma to separate each other, e.g. “KEGG_2016,huMAP,GO_Biological_Process_2018”
- Input custom files:
- dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],
’B’:[‘gene2’, ‘gene4’,…], …}
gmt: “genes.gmt”
see also the online docs: https://gseapy.readthedocs.io/en/latest/gseapy_example.html#2.-Enrichr-Example
organism –
Enrichr supported organism. Select from (human, mouse, yeast, fly, fish, worm). This argument only affects the Enrichr library names you’ve chosen. No any affects to gmt or dict input of gene_sets.
see here for more details: https://maayanlab.cloud/modEnrichr/.
outdir – Output file directory
background –
int, list, str. Background genes. This argument works only if gene_sets has a type Dict or gmt file. If your input are just Enrichr library names, this argument will be ignored.
However, this argument is not straightforward when gene_sets is given a custom input (a gmt file or dict).
By default, all genes listed in the gene_sets input will be used as background.
There are 3 ways to tune this argument:
(Recommended) Input a list of background genes: [‘gene1’, ‘gene2’,…] The background gene list is defined by your experment. e.g. the expressed genes in your RNA-seq. The gene identifer in gmt/dict should be the same type to the backgound genes.
Specify a number: e.g. 20000. (the number of total expressed genes). This works, but not recommend. It assumes that all your genes could be found in background. If genes exist in gmt but not included in background provided, they will affect the significance of the statistical test.
Set a Biomart dataset name: e.g. “hsapiens_gene_ensembl” The background will be all annotated genes from the BioMart datasets you’ve choosen. The program will try to retrieve the background information automatically.
- Enrichr module use the code below to get the background genes:
>>> from gseapy.parser import Biomart >>> bm = Biomart() >>> df = bm.query(dataset=background, # e.g. 'hsapiens_gene_ensembl' attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id'], filename=f'~/.cache/gseapy/{background}.background.genes.txt') >>> df.dropna(subset=["entrezgene_id"], inplace=True)
So only genes with entrezid above will be the background genes if not input specify by user.
cutoff – Show enriched terms which Adjusted P-value < cutoff. Only affects the output figure, not the final output file. Default: 0.05
format – Output figure format supported by matplotlib,(‘pdf’,’png’,’eps’…). Default: ‘pdf’.
figsize – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.
- Returns:
An Enrichr object, which obj.res2d stores your last query, obj.results stores your all queries.
- gseapy.enrich()[source]
Perform over-representation analysis (hypergeometric test).
- Parameters:
gene_list – str, list, tuple, series, dataframe. Also support input txt file with one gene id per row. The input identifier should be the same type to gene_sets.
gene_sets –
str, list, tuple of Enrichr Library name(s). or custom defined gene_sets (dict, or gmt file).
Examples:
- dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],
’B’:[‘gene2’, ‘gene4’,…], …}
gmt: “genes.gmt”
outdir – Output file directory
background –
None | int | list | str. Background genes. This argument works only if gene_sets has a type Dict or gmt file.
However, this argument is not straightforward when gene_sets is given a custom input (a gmt file or dict).
By default, all genes listed in the gene_sets input will be used as background.
There are 3 ways to tune this argument:
(Recommended) Input a list of background genes: [‘gene1’, ‘gene2’,…] The background gene list is defined by your experment. e.g. the expressed genes in your RNA-seq. The gene identifer in gmt/dict should be the same type to the backgound genes.
Specify a number: e.g. 20000. (the number of total expressed genes). This works, but not recommend. It assumes that all your genes could be found in background. If genes exist in gmt but not included in background provided, they will affect the significance of the statistical test.
Set a Biomart dataset name: e.g. “hsapiens_gene_ensembl” The background will be all annotated genes from the BioMart datasets you’ve choosen. The program will try to retrieve the background information automatically.
- Enrichr module use the code below to get the background genes:
>>> from gseapy.parser import Biomart >>> bm = Biomart() >>> df = bm.query(dataset=background, # e.g. 'hsapiens_gene_ensembl' attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id'], filename=f'~/.cache/gseapy/{background}.background.genes.txt') >>> df.dropna(subset=["entrezgene_id"], inplace=True)
So only genes with entrezid above will be the background genes if not input specify by user.
cutoff – Show enriched terms which Adjusted P-value < cutoff. Only affects the output figure, not the final output file. Default: 0.05
format – Output figure format supported by matplotlib,(‘pdf’,’png’,’eps’…). Default: ‘pdf’.
figsize – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.
- Returns:
An Enrichr object, which obj.res2d stores your last query, obj.results stores your all queries.
- gseapy.replot()[source]
The main function to reproduce GSEA desktop outputs.
- Parameters:
indir – GSEA desktop results directory. In the sub folder, you must contain edb file folder.
outdir – Output directory.
weight (float) – weighted score type. choose from {0,1,1.5,2}. Default: 1.
figsize (list) – Matplotlib output figure figsize. Default: [6.5,6].
format (str) – Matplotlib output figure format. Default: ‘pdf’.
min_size (int) – Min size of input genes presented in Gene Sets. Default: 3.
max_size (int) – Max size of input genes presented in Gene Sets. Default: 5000. You are not encouraged to use min_size, or max_size argument in
replot()
function. Because gmt file has already been filtered.verbose – Bool, increase output verbosity, print out progress of your job, Default: False.
- Returns:
Generate new figures with selected figure format. Default: ‘pdf’.
6.2. GSEA Statistics
- class gseapy.gsea.GSEA(data: DataFrame | str, gene_sets: List[str] | str | Dict[str, str], classes: List[str] | str | Dict[str, str], outdir: str | None = None, min_size: int = 15, max_size: int = 500, permutation_num: int = 1000, weight: float = 1.0, permutation_type: str = 'phenotype', method: str = 'signal_to_noise', ascending: bool = False, threads: int = 1, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', graph_num: int = 20, no_plot: bool = False, seed: int = 123, verbose: bool = False)[source]
GSEA main tool
- calc_metric(df: DataFrame, method: str, pos: str, neg: str, classes: Dict[str, str], ascending: bool) Tuple[List[int], Series] [source]
The main function to rank an expression table. works for 2d array.
- Parameters:
df – gene_expression DataFrame.
method –
The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. Others methods are:
’signal_to_noise’ (s2n) or ‘abs_signal_to_noise’ (abs_s2n)
You must have at least three samples for each phenotype. The more distinct the gene expression is in each phenotype, the more the gene acts as a “class marker”.
’t_test’
Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the t-test ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
’ratio_of_classes’ (also referred to as fold change).
Uses the ratio of class means to calculate fold change for natural scale data.
’diff_of_classes’
Uses the difference of class means to calculate fold change for natural scale data
’log2_ratio_of_classes’
Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for log scale data.
pos (str) – one of labels of phenotype’s names.
neg (str) – one of labels of phenotype’s names.
classes (dict) – column id to group mapping.
ascending (bool) – bool or list of bool. Sort ascending vs. descending.
- Returns:
returns argsort values of a tuple where 0: argsort positions (indices) 1: pd.Series of correlation value. Gene_name is index, and value is rankings.
visit here for more docs: http://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
- class gseapy.gsea.Prerank(rnk: DataFrame | Series | str, gene_sets: List[str] | str | Dict[str, str], outdir: str | None = None, pheno_pos='Pos', pheno_neg='Neg', min_size: int = 15, max_size: int = 500, permutation_num: int = 1000, weight: float = 1.0, ascending: bool = False, threads: int = 1, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', graph_num: int = 20, no_plot: bool = False, seed: int = 123, verbose: bool = False)[source]
GSEA prerank tool
- class gseapy.gsea.Replot(indir: str, outdir: str = 'GSEApy_Replot', weight: float = 1.0, min_size: int = 3, max_size: int = 1000, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', verbose: bool = False)[source]
To reproduce GSEA desktop output results.
- class gseapy.base.GMT(mapping: Dict[str, str] | None = None, description: str | None = None)[source]
- class gseapy.base.GSEAbase(outdir: str | None = None, gene_sets: List[str] | str | Dict[str, str] = 'KEGG_2016', module: str = 'base', threads: int = 1, enrichr_url: str = 'http://maayanlab.cloud', verbose: bool = False)[source]
base class of GSEA.
- enrichment_score(gene_list: Iterable[str], correl_vector: Iterable[float], gene_set: Dict[str, List[str]], weight: float = 1.0, nperm: int = 1000, seed: int = 123, single: bool = False, scale: bool = False)[source]
This is the most important function of GSEApy. It has the same algorithm with GSEA and ssGSEA.
- Parameters:
gene_list – The ordered gene list gene_name_list, rank_metric.index.values
gene_set – gene_sets in gmt file, please use gmt_parser to get gene_set.
weight – It’s the same with gsea’s weighted_score method. Weighting by the correlation is a very reasonable choice that allows significant gene sets with less than perfect coherence. options: 0(classic),1,1.5,2. default:1. if one is interested in penalizing sets for lack of coherence or to discover sets with any type of nonrandom distribution of tags, a value p < 1 might be appropriate. On the other hand, if one uses sets with large number of genes and only a small subset of those is expected to be coherent, then one could consider using p > 1. Our recommendation is to use p = 1 and use other settings only if you are very experienced with the method and its behavior.
correl_vector – A vector with the correlations (e.g. signal to noise scores) corresponding to the genes in the gene list. Or rankings, rank_metric.values
nperm – Only use this parameter when computing esnull for statistical testing. Set the esnull value equal to the permutation number.
seed – Random state for initializing gene list shuffling. Default: seed=None
- Returns:
ES: Enrichment score (real number between -1 and +1)
ESNULL: Enrichment score calculated from random permutations.
Hits_Indices: Index of a gene in gene_list, if gene is included in gene_set.
RES: Numerical vector containing the running enrichment score for all locations in the gene list .
- load_gmt(gene_list: Iterable[str], gmt: List[str] | str | Dict[str, str]) Dict[str, List[str]] [source]
load gene set dict
- load_gmt_only(gmt: List[str] | str | Dict[str, str]) Dict[str, List[str]] [source]
parse gene_sets. gmt: List, Dict, Strings
However,this function will merge different gene sets into one big dict to save computation time for later.
- make_unique(rank_metric: DataFrame, col_idx: int) DataFrame [source]
make gene id column unique by adding a digit, similar to R’s make.unique
- plot(terms: str | List[str], colors: str | List[str] | None = None, legend_kws: Dict[str, Any] | None = None, figsize: Tuple[float, float] = (4, 5), show_ranking: bool = True, ofname: str | None = None)[source]
terms: str, list. terms/pathways to show colors: str, list. list of colors for each term/pathway legend_kws: kwargs to pass to ax.legend. e.g. loc, bbox_to_achor. ofname: savefig
- property results
compatible to old style
- to_df(gsea_summary: List[Dict], gmt: Dict[str, List[str]], rank_metric: Series | DataFrame, indices: List | None = None)[source]
Convernt GSEASummary to DataFrame
- rank_metric: if a Series, then it must be sorted in descending order already
if a DataFrame, indices must not None.
indices: Only works for DataFrame input. Stores the indices of sorted array
6.3. Over-representation Statistics
- gseapy.stats.calc_pvalues(query, gene_sets, background=20000, **kwargs)[source]
calculate pvalues for all categories in the graph
- Parameters:
query (set) – set of identifiers for which the p value is calculated
gene_sets (dict) – gmt file dict after background was set
background (set) – total number of genes in your annotated database.
- Returns:
pvalues x: overlapped gene number n: length of gene_set which belongs to each terms hits: overlapped gene names.
6. For 2*2 contingency table:
in query | not in query | row total=> in gene_set | a | b | a+b => not in gene_set | c | d | c+d
column total | a+b+c+d = anno database
- Then, in R
- x=a the number of white balls drawn without replacement
from an urn which contains both black and white balls.
m=a+b the number of white balls in the urn n=c+d the number of black balls in the urn k=a+c the number of balls drawn from the urn
In Scipy: for args in scipy.hypergeom.sf(k, M, n, N, loc=0):
M: the total number of objects, n: the total number of Type I objects. k: the random variate represents the number of Type I objects in N drawn
without replacement from the total population.
Therefore, these two functions are the same when using parameters from 2*2 table: R: > phyper(x-1, m, n, k, lower.tail=FALSE) Scipy: >>> hypergeom.sf(x-1, m+n, m, k)
For Odds ratio in Enrichr (see https://maayanlab.cloud/Enrichr/help#background&q=4)
oddsRatio = (1.0 * x * d) / Math.max(1.0 * b * c, 1)
where:
x are the overlapping genes, b (m-x) are the genes in the annotated set - overlapping genes, c (k-x) are the genes in the input set - overlapping genes, d (bg-m-k+x) are the 20,000 genes (or total genes in the background) - genes in the annotated set - genes in the input set + overlapping genes
- gseapy.stats.fdrcorrection(pvals, alpha=0.05)[source]
benjamini hocheberg fdr correction. inspired by statsmodels
- gseapy.stats.multiple_testing_correction(ps, alpha=0.05, method='benjamini-hochberg', **kwargs)[source]
correct pvalues for multiple testing and add corrected q value
- Parameters:
ps – list of pvalues
alpha – significance level default : 0.05
method – multiple testing correction method [bonferroni|benjamini-hochberg]
- Returns (q, rej):
two lists of q-values and rejected nodes
6.4. Enrichr API
- class gseapy.enrichr.Enrichr(gene_list: Iterable[str], gene_sets: List[str] | str | Dict[str, str], organism: str = 'human', outdir: str | None = 'Enrichr', background: List[str] | int | str = 'hsapiens_gene_ensembl', cutoff: float = 0.05, format: str = 'pdf', figsize: Tuple[float, float] = (6.5, 6), top_term: int = 10, no_plot: bool = False, verbose: bool = False)[source]
Enrichr API
- check_genes(gene_list: List[str], usr_list_id: str)[source]
Compare the genes sent and received to get successfully recognized genes
- enrich(gmt: Dict[str, List[str]])[source]
use local mode
p = p-value computed using the Fisher exact test (Hypergeometric test) z = z-score (Odds Ratio) combine score = - log(p)·z
see here: http://amp.pharm.mssm.edu/Enrichr/help#background&q=4
columns contain:
Term Overlap P-value Odds Ratio Combinde Score Adjusted_P-value Genes
- filter_gmt(gmt, background)[source]
the gmt values should be filtered only for genes that exist in background this substantially affect the significance of the test, the hypergeometric distribution.
- Parameters:
gmt – a dict of gene sets.
background – list, set, or tuple. A list of custom backgound genes.
6.5. BioMart API
- class gseapy.biomart.Biomart(host: str = 'www.ensembl.org', verbose: bool = False)[source]
query from BioMart
- get_attributes(dataset: str = 'hsapiens_gene_ensembl')[source]
Get available attritbutes from dataset you’ve selected
- get_datasets(mart: str = 'ENSEMBL_MART_ENSEMBL')[source]
Get available datasets from mart you’ve selected
- get_filters(dataset: str = 'hsapiens_gene_ensembl')[source]
Get available filters from dataset you’ve selected
- query(dataset: str = 'hsapiens_gene_ensembl', attributes: List[str] | None = [], filters: Dict[str, Iterable[str]] | None = {}, filename: str | None = None)[source]
mapping ids using BioMart.
- Parameters:
dataset – str, default: ‘hsapiens_gene_ensembl’
attributes – str, list, tuple
filters – dict, {‘filter name’: list(filter value)}
host – www.ensembl.org, asia.ensembl.org, useast.ensembl.org
- Returns:
a dataframe contains all attributes you selected.
Example:
>>> queries = {'ensembl_gene_id': ['ENSG00000125285','ENSG00000182968'] } # need to be a python dict >>> results = bm.query(dataset='hsapiens_gene_ensembl', attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id', 'go_id'], filters=queries)
- query_simple(dataset: str = 'hsapiens_gene_ensembl', attributes: List[str] = [], filters: Dict[str, Iterable[str]] = {}, filename: str | None = None)[source]
This function is a simple version of BioMart REST API. same parameter to query().
However, you could get cross page of mapping. such as Mouse 2 human gene names
Note: it will take a couple of minutes to get the results. A xml template for querying biomart. (see https://gist.github.com/keithshep/7776579)
- Example::
>>> from gseapy import Biomart >>> bm = Biomart() >>> results = bm.query_simple(dataset='mmusculus_gene_ensembl', attributes=['ensembl_gene_id', 'external_gene_name', 'hsapiens_homolog_associated_gene_name', 'hsapiens_homolog_ensembl_gene'])
6.6. Parser
- gseapy.parser.download_library(name: str, organism: str = 'human') Dict[str, List[str]] [source]
download enrichr libraries.
- Parameters:
name (str) – the enrichr library name. see gseapy.get_library_name().
organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }
- Return dict:
gene_sets of the enrichr library from selected organism
- gseapy.parser.get_library(name: str, organism: str = 'Human', min_size: int = 0, max_size: int = 2000, gene_list: List[str] | None = None) Dict[str, List[str]] [source]
Parse gene_sets.gmt(gene set database) file or download from enrichr server.
- Parameters:
name (str) – the gene_sets.gmt file or an enrichr library name. checkout full enrichr library name here: https://maayanlab.cloud/Enrichr/#libraries
organism (str) – choose one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }. This arugment has not effect if input is a .gmt file.
min_size – Minimum allowed number of genes for each gene set. Default: 0.
max_size – Maximum allowed number of genes for each gene set. Default: 2000.
gene_list – if input a gene list, min and max overlapped genes between gene set and gene_list are kept.
- Return dict:
Return a filtered gene set database dictionary.
Note: DO NOT filter gene sets, when use
replot()
. BecauseGSEA
Desktop have already done this for you.
- gseapy.parser.get_library_name(organism: str = 'Human') List[str] [source]
return enrichr active enrichr library name. see also: https://maayanlab.cloud/modEnrichr/
- Parameters:
organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }
- Returns:
a list of enrichr libraries from selected database
- gseapy.parser.gsea_cls_parser(cls: str) Tuple[str] [source]
Extract class(phenotype) name from .cls file.
- Parameters:
cls – the a class list instance or .cls file which is identical to GSEA input .
- Returns:
phenotype name and a list of class vector.
6.7. Visualization
- gseapy.plot.barplot(df: DataFrame, column: str = 'Adjusted P-value', group: str | None = None, title: str = '', cutoff: float = 0.05, top_term: int = 10, figsize: Tuple[float, float] = (4, 6), color: str | List[str] | Dict[str, str] = 'salmon', ofname: str | None = None, **kwargs)[source]
Visualize GSEApy Results. When multiple datasets exist in the input dataframe, the group argument is your friend.
- Parameters:
df – GSEApy DataFrame results.
column – column name in df to map the x-axis data. Default: Adjusted P-value
group – group by the variable in df that will produce bars with different colors.
title – figure title.
cutoff – terms with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)
top_term – number of top enriched terms grouped by hue are shown.
figsize – tuple, matplotlib figsize.
color – color or list or dict of matplotlib.colors. Must be reconigzed by matplotlib. if dict input, dict keys must be found in the group
ofname – output file name. If None, don’t save figure
- Returns:
matplotlib.Axes. return None if given ofname. Only terms with column <= cut-off are plotted.
- gseapy.plot.dotplot(df: DataFrame, column: str = 'Adjusted P-value', x: str | None = None, y: str = 'Term', x_order: List[str] | bool = False, y_order: List[str] | bool = False, title: str = '', cutoff: float = 0.05, top_term: int = 10, size: float = 5, figsize: Tuple[float, float] = (4, 6), cmap: str = 'viridis_r', ofname: str | None = None, xticklabels_rot: float | None = None, yticklabels_rot: float | None = None, marker: str = 'o', show_ring: bool = False, **kwargs)[source]
Visualize GSEApy Results with categorical scatterplot When multiple datasets exist in the input dataframe, the x argument is your friend.
- Parameters:
df – GSEApy DataFrame results.
column – column name in df that map the dot colors. Default: Adjusted P-value.
x – Categorical variable in df that map the x-axis data. Default: None.
y – Categorical variable in df that map the y-axis data. Default: Term.
x_order – bool, array-like list. Default: False. If True, peformed hierarchical_clustering on X-axis. or input a array-like list of x categorical levels.
x_order – bool, array-like list. Default: False. If True, peformed hierarchical_clustering on Y-axis. or input a array-like list of y categorical levels.
title – Figure title.
cutoff – Terms with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)
top_term – Number of enriched terms to show (based on values in the column (colormap)).
size – float, scale the dot size to get proper visualization.
figsize – tuple, matplotlib figure size.
cmap – Matplotlib colormap for mapping the column semantic.
ofname – Output file name. If None, don’t save figure
marker – The matplotlib.markers. See https://matplotlib.org/stable/api/markers_api.html
bool (show_ring) – Whether to draw outer ring.
- Returns:
matplotlib.Axes. return None if given ofname. Only terms with column <= cut-off are plotted.
- gseapy.plot.enrichment_map(df: DataFrame, column: str = 'Adjusted P-value', cutoff: float = 0.05, top_term: int = 10, **kwargs) Tuple[DataFrame, DataFrame] [source]
Visualize GSEApy Results. Node size corresponds to the percentage of gene overlap in a certain term of interest. Colour of the node corresponds to the significance of the enriched terms. Edge size corresponds to the number of genes that overlap between the two connected nodes. Gray edges correspond to both nodes when it is the only colour edge. When there are two different edge colours, red corresponds to positve nodes and blue corresponds to negative nodes.
- Parameters:
df – GSEApy DataFrame results.
column – column name in df to map the node colors. Default: Adjusted P-value or FDR q-val. choose from (“Adjusted P-value”, “P-value”, “FDR q-val”, “NOM p-val”).
group – group by the variable in df that will produce bars with different colors.
title – figure title.
cutoff – nodes with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)
top_term – number of top enriched terms are selected as nodes.
- Returns:
tuple of dataframe (nodes, edges)
- gseapy.plot.gseaplot(term: str, hits: Sequence[int], nes: float, pval: float, fdr: float, RES: Sequence[float], rank_metric: Sequence[float] | None = None, pheno_pos: str = '', pheno_neg: str = '', color: str = '#88C544', figsize: Tuple[float, float] = (6, 5.5), cmap: str = 'seismic', ofname: str | None = None, **kwargs) List[Axes] | None [source]
This is the main function for generating the gsea plot.
- Parameters:
term – gene_set name
hits – hits indices of rank_metric.index presented in gene set S.
nes – Normalized enrichment scores.
pval – nominal p-value.
fdr – false discovery rate.
RES – running enrichment scores.
rank_metric – pd.Series for rankings, rank_metric.values.
pheno_pos – phenotype label, positive correlated.
pheno_neg – phenotype label, negative correlated.
color – color for RES and hits.
figsize – matplotlib figsize.
ofname – output file name. If None, don’t save figure
return matplotlib.Figure.
- gseapy.plot.gseaplot2(terms: List[str], hits: List[Sequence[int]], RESs: List[Sequence[float]], rank_metric: Sequence[float] | None = None, colors: str | List[str] | None = None, figsize: Tuple[float, float] = (6, 4), legend_kws: Dict[str, Any] | None = None, ofname: str | None = None, **kwargs) List[Axes] | None [source]
Trace plot for combining multiple terms/pathways into one plot :param terms: list of terms to show in trace plot :param hits: list of hits indices correspond to each term. :param RESs: list of running enrichment scores correspond to each term. :param rank_metric: Optional, rankings. :param figsize: matplotlib figsize. :legend_kws: Optional, contol the location of lengends :param ofname: output file name. If None, don’t save figure
return matplotlib.Figure.
- gseapy.plot.heatmap(df: DataFrame, z_score: int | None = None, title: str = '', figsize: Tuple[float, float] = (5, 5), cmap: str | None = None, xticklabels: bool = True, yticklabels: bool = True, ofname: str | None = None, **kwargs)[source]
Visualize the dataframe.
- Parameters:
df – DataFrame from expression table.
z_score – 0, 1, or None. z_score axis{0, 1}. If None, not scale.
title – figure title.
figsize – heatmap figsize.
cmap – matplotlib colormap. e.g. “RdBu_r”.
xticklabels – bool, whether to show xticklabels.
xticklabels – bool, whether to show xticklabels.
ofname – output file name. If None, don’t save figure
- gseapy.plot.ringplot(df: DataFrame, column: str = 'Adjusted P-value', x: str | None = None, title: str = '', cutoff: float = 0.05, top_term: int = 10, size: float = 5, figsize: Tuple[float, float] = (4, 6), cmap: str = 'viridis_r', ofname: str | None = None, xticklabels_rot: float | None = None, yticklabels_rot: float | None = None, marker='o', show_ring: bool = True, **kwargs)[source]
ringplot is deprecated, use dotplot instead
- Parameters:
df – GSEApy DataFrame results.
x – Group by the variable in df that will produce categorical scatterplot.
column – column name in df to map the dot colors. Default: Adjusted P-value
title – figure title
cutoff – terms with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)
top_term – number of enriched terms to show.
size – float, scale the dot size to get proper visualization.
figsize – tuple, matplotlib figure size.
cmap – matplotlib colormap for mapping the column semantic.
ofname – output file name. If None, don’t save figure
marker – the matplotlib.markers. See https://matplotlib.org/stable/api/markers_api.html
bool (show_ring) – whether to show outer ring.
- Returns:
matplotlib.Axes. return None if given ofname. Only terms with column <= cut-off are plotted.
- gseapy.plot.zscore(data2d: DataFrame, axis: int | None = 0)[source]
Standardize the mean and variance of the data axis Parameters.
- Parameters:
data2d – DataFrame to normalize.
axis – int, Which axis to normalize across. If 0, normalize across rows, if 1, normalize across columns. If None, don’t change data
- Returns:
Normalized DataFrame. Normalized data with a mean of 0 and variance of 1 across the specified axis.