6. Developmental Guide

6.1. Module APIs

gseapy.gsea()[source]

Run Gene Set Enrichment Analysis.

Parameters:
  • data – Gene expression data table, Pandas DataFrame, gct file.

  • gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input with GSEA.

  • cls – A list or a .cls file format required for GSEA.

  • outdir (str) – Results output directory. If None, nothing will write to disk.

  • permutation_num (int) – Number of permutations. Default: 1000. Minimial possible nominal p-value is about 1/nperm.

  • permutation_type (str) – Type of permutation reshuffling, choose from {“phenotype”: ‘sample.labels’ , “gene_set” : gene.labels}.

  • min_size (int) – Minimum allowed number of genes from gene set also the data set. Default: 15.

  • max_size (int) – Maximum allowed number of genes from gene set also the data set. Default: 500.

  • weight (float) – Refer to algorithm.enrichment_score(). Default:1.

  • method

    The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. Others methods are:

    1. ’signal_to_noise’

      You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    2. ’t_test’

      Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    3. ’ratio_of_classes’ (also referred to as fold change).

      Uses the ratio of class means to calculate fold change for natural scale data.

    4. ’diff_of_classes’

      Uses the difference of class means to calculate fold change for nature scale data

    5. ’log2_ratio_of_classes’

      Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for log scale data.

  • ascending (bool) – Sorting order of rankings. Default: False.

  • threads (int) – Number of threads you are going to use. Default: 4.

  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].

  • format (str) – Matplotlib figure format. Default: ‘pdf’.

  • graph_num (int) – Plot graphs for top sets of each phenotype.

  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.

  • seed – Random seed. expect an integer. Default:None.

  • verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.

Returns:

Return a GSEA obj. All results store to a dictionary, obj.results, where contains:

| {
|  term: gene set name,
|  es: enrichment score,
|  nes: normalized enrichment score,
|  pval:  Nominal p-value (from the null distribution of the gene set,
|  fdr: FDR qvalue (adjusted False Discory Rate),
|  fwerp: Family wise error rate p-values,
|  tag %: Percent of gene set before running enrichment peak (ES),
|  gene %: Percent of gene list before running enrichment peak (ES),
|  lead_genes: leading edge genes (gene hits before running enrichment peak),
|  matched genes: genes matched to the data,
| }

gseapy.prerank()[source]

Run Gene Set Enrichment Analysis with pre-ranked correlation defined by user.

Parameters:
  • rnk – pre-ranked correlation table or pandas DataFrame. Same input with GSEA .rnk file.

  • gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input with GSEA.

  • outdir – results output directory. If None, nothing will write to disk.

  • permutation_num (int) – Number of permutations. Default: 1000. Minimial possible nominal p-value is about 1/nperm.

  • min_size (int) – Minimum allowed number of genes from gene set also the data set. Default: 15.

  • max_size (int) – Maximum allowed number of genes from gene set also the data set. Defaults: 500.

  • weight (str) – Refer to algorithm.enrichment_score(). Default:1.

  • ascending (bool) – Sorting order of rankings. Default: False.

  • threads (int) – Number of threads you are going to use. Default: 4.

  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].

  • format (str) – Matplotlib figure format. Default: ‘pdf’.

  • graph_num (int) – Plot graphs for top sets of each phenotype.

  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.

  • seed – Random seed. expect an integer. Default:None.

  • verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.

Returns:

Return a Prerank obj. All results store to a dictionary, obj.results, where contains:

| {
|  term: gene set name,
|  es: enrichment score,
|  nes: normalized enrichment score,
|  pval:  Nominal p-value (from the null distribution of the gene set,
|  fdr: FDR qvalue (adjusted False Discory Rate),
|  fwerp: Family wise error rate p-values,
|  tag %: Percent of gene set before running enrichment peak (ES),
|  gene %: Percent of gene list before running enrichment peak (ES),
|  lead_genes: leading edge genes (gene hits before running enrichment peak),
|  matched genes: genes matched to the data,
| }

gseapy.ssgsea()[source]

Run Gene Set Enrichment Analysis with single sample GSEA tool

Parameters:
  • data – Expression table, pd.Series, pd.DataFrame, GCT file, or .rnk file format.

  • gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input with GSEA.

  • outdir – Results output directory. If None, nothing will write to disk.

  • sample_norm_method (str) –

    Sample normalization method. Choose from {‘rank’, ‘log’, ‘log_rank’, None}. Default: rank. this argument will be used for ordering genes.

    1. ’rank’: Rank your expression data, and transform by 10000*rank_dat/gene_numbers

    2. ’log’ : Do not rank, but transform data by log(data + exp(1)), while data = data[data<1] =1.

    3. ’log_rank’: Rank your expression data, and transform by log(10000*rank_dat/gene_numbers+ exp(1))

    4. None or ‘custom’: Do nothing, and use your own rank value to calculate enrichment score.

see here: https://github.com/GSEA-MSigDB/ssGSEAProjection-gpmodule/blob/master/src/ssGSEAProjection.Library.R, line 86

Parameters:
  • correl_norm_type (str) –

    correlation normalization type. Choose from {‘rank’, ‘symrank’, ‘zscore’, None}. Default: rank. After ordering genes by sample_norm_method, further data transformed could be applied to get enrichment score.

    when weight == 0, sample_norm_method and correl_norm_type do not matter; when weight > 0, the combination of sample_norm_method and correl_norm_type dictate how the gene expression values in input data are transformed to obtain the score – use this setting with care (the transformations can skew scores towards +ve or -ve values)

    sample_norm_method will first transformed and rank original data. the data is named correl_vector for each sample. then correl_vector is transformed again by

    1. correl_norm_type is None or ‘rank’ : do nothing, genes are weighted by actual correl_vector.

    2. correl_norm_type ==’symrank’: symmetric ranking.

    3. correl_norm_type ==’zscore’: standardizes the correl_vector before using them to calculate scores.

  • min_size (int) – Minimum allowed number of genes from gene set also the data set. Default: 15.

  • max_size (int) – Maximum allowed number of genes from gene set also the data set. Default: 2000.

  • permutation_num (int) – For ssGSEA, default is 0. However, if you try to use ssgsea method to get pval and fdr, set to an interger.

  • weight (str) – Refer to algorithm.enrichment_score(). Default:0.25.

  • ascending (bool) – Sorting order of rankings. Default: False.

  • threads (int) – Number of threads you are going to use. Default: 4.

  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [7,6].

  • format (str) – Matplotlib figure format. Default: ‘pdf’.

  • graph_num (int) – Plot graphs for top sets of each phenotype.

  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.

  • seed – Random seed. expect an integer. Default:None.

  • verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.

Returns:

Return a ssGSEA obj. All results store to a dictionary, access enrichment score by obj.resultsOnSamples, and normalized enrichment score by obj.res2d. if permutation_num > 0, additional results contain:

| {
|  term: gene set name,
|  es: enrichment score,
|  nes: normalized enrichment score,
|  pval:  Nominal p-value (from the null distribution of the gene set (if permutation_num > 0),
|  fdr: FDR qvalue (adjusted FDR) (if permutation_num > 0),
|  fwerp: Family wise error rate p-values (if permutation_num > 0),
|  tag %: Percent of gene set before running enrichment peak (ES),
|  gene %: Percent of gene list before running enrichment peak (ES),
|  lead_genes: leading edge genes (gene hits before running enrichment peak),
|  matched genes: genes matched to the data,
| }

gseapy.enrichr()[source]

Enrichr API.

Parameters:
  • gene_list – str, list, tuple, series, dataframe. Also support input txt file with one gene id per row. The input identifier should be the same type to gene_sets.

  • gene_sets

    str, list, tuple of Enrichr Library name(s). or custom defined gene_sets (dict, or gmt file).

    Examples:

    Input Enrichr Libraries (https://maayanlab.cloud/Enrichr/#stats):

    str: ‘KEGG_2016’ list: [‘KEGG_2016’,’KEGG_2013’] Use comma to separate each other, e.g. “KEGG_2016,huMAP,GO_Biological_Process_2018”

    Input custom files:
    dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],

    ’B’:[‘gene2’, ‘gene4’,…], …}

    gmt: “genes.gmt”

    see also the online docs: https://gseapy.readthedocs.io/en/latest/gseapy_example.html#2.-Enrichr-Example

  • organism

    Enrichr supported organism. Select from (human, mouse, yeast, fly, fish, worm). This argument only affects the Enrichr library names you’ve chosen. No any affects to gmt or dict input of gene_sets.

    see here for more details: https://maayanlab.cloud/modEnrichr/.

  • outdir – Output file directory

  • background

    int, list, str. Background genes. This argument works only if gene_sets has a type Dict or gmt file. If your input are just Enrichr library names, this argument will be ignored.

    However, this argument is not straightforward when gene_sets is given a custom input (a gmt file or dict).

    By default, all genes listed in the gene_sets input will be used as background.

    There are 3 ways to tune this argument:

    1. (Recommended) Input a list of background genes: [‘gene1’, ‘gene2’,…] The background gene list is defined by your experment. e.g. the expressed genes in your RNA-seq. The gene identifer in gmt/dict should be the same type to the backgound genes.

    2. Specify a number: e.g. 20000. (the number of total expressed genes). This works, but not recommend. It assumes that all your genes could be found in background. If genes exist in gmt but not included in background provided, they will affect the significance of the statistical test.

    3. Set a Biomart dataset name: e.g. “hsapiens_gene_ensembl” The background will be all annotated genes from the BioMart datasets you’ve choosen. The program will try to retrieve the background information automatically.

      Enrichr module use the code below to get the background genes:
      >>> from gseapy.parser import Biomart
      >>> bm = Biomart()
      >>> df = bm.query(dataset=background, #  e.g. 'hsapiens_gene_ensembl'
                   attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id'],
                   filename=f'~/.cache/gseapy/{background}.background.genes.txt')
      >>> df.dropna(subset=["entrezgene_id"], inplace=True)
      

      So only genes with entrezid above will be the background genes if not input specify by user.

  • cutoff – Show enriched terms which Adjusted P-value < cutoff. Only affects the output figure, not the final output file. Default: 0.05

  • format – Output figure format supported by matplotlib,(‘pdf’,’png’,’eps’…). Default: ‘pdf’.

  • figsize – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).

  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.

  • verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.

Returns:

An Enrichr object, which obj.res2d stores your last query, obj.results stores your all queries.

gseapy.enrich()[source]

Perform over-representation analysis (hypergeometric test).

Parameters:
  • gene_list – str, list, tuple, series, dataframe. Also support input txt file with one gene id per row. The input identifier should be the same type to gene_sets.

  • gene_sets

    str, list, tuple of Enrichr Library name(s). or custom defined gene_sets (dict, or gmt file).

    Examples:

    dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],

    ’B’:[‘gene2’, ‘gene4’,…], …}

    gmt: “genes.gmt”

  • outdir – Output file directory

  • background

    None | int | list | str. Background genes. This argument works only if gene_sets has a type Dict or gmt file.

    However, this argument is not straightforward when gene_sets is given a custom input (a gmt file or dict).

    By default, all genes listed in the gene_sets input will be used as background.

    There are 3 ways to tune this argument:

    1. (Recommended) Input a list of background genes: [‘gene1’, ‘gene2’,…] The background gene list is defined by your experment. e.g. the expressed genes in your RNA-seq. The gene identifer in gmt/dict should be the same type to the backgound genes.

    2. Specify a number: e.g. 20000. (the number of total expressed genes). This works, but not recommend. It assumes that all your genes could be found in background. If genes exist in gmt but not included in background provided, they will affect the significance of the statistical test.

    3. Set a Biomart dataset name: e.g. “hsapiens_gene_ensembl” The background will be all annotated genes from the BioMart datasets you’ve choosen. The program will try to retrieve the background information automatically.

      Enrichr module use the code below to get the background genes:
      >>> from gseapy.parser import Biomart
      >>> bm = Biomart()
      >>> df = bm.query(dataset=background, #  e.g. 'hsapiens_gene_ensembl'
                   attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id'],
                   filename=f'~/.cache/gseapy/{background}.background.genes.txt')
      >>> df.dropna(subset=["entrezgene_id"], inplace=True)
      

      So only genes with entrezid above will be the background genes if not input specify by user.

  • cutoff – Show enriched terms which Adjusted P-value < cutoff. Only affects the output figure, not the final output file. Default: 0.05

  • format – Output figure format supported by matplotlib,(‘pdf’,’png’,’eps’…). Default: ‘pdf’.

  • figsize – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).

  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.

  • verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.

Returns:

An Enrichr object, which obj.res2d stores your last query, obj.results stores your all queries.

gseapy.replot()[source]

The main function to reproduce GSEA desktop outputs.

Parameters:
  • indir – GSEA desktop results directory. In the sub folder, you must contain edb file folder.

  • outdir – Output directory.

  • weight (float) – weighted score type. choose from {0,1,1.5,2}. Default: 1.

  • figsize (list) – Matplotlib output figure figsize. Default: [6.5,6].

  • format (str) – Matplotlib output figure format. Default: ‘pdf’.

  • min_size (int) – Min size of input genes presented in Gene Sets. Default: 3.

  • max_size (int) – Max size of input genes presented in Gene Sets. Default: 5000. You are not encouraged to use min_size, or max_size argument in replot() function. Because gmt file has already been filtered.

  • verbose – Bool, increase output verbosity, print out progress of your job, Default: False.

Returns:

Generate new figures with selected figure format. Default: ‘pdf’.

6.2. GSEA Statistics

class gseapy.gsea.GSEA(data: DataFrame | str, gene_sets: List[str] | str | Dict[str, str], classes: List[str] | str | Dict[str, str], outdir: str | None = None, min_size: int = 15, max_size: int = 500, permutation_num: int = 1000, weight: float = 1.0, permutation_type: str = 'phenotype', method: str = 'signal_to_noise', ascending: bool = False, threads: int = 1, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', graph_num: int = 20, no_plot: bool = False, seed: int = 123, verbose: bool = False)[source]

GSEA main tool

calc_metric(df: DataFrame, method: str, pos: str, neg: str, classes: Dict[str, str], ascending: bool) Tuple[List[int], Series][source]

The main function to rank an expression table. works for 2d array.

Parameters:
  • df – gene_expression DataFrame.

  • method

    The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. Others methods are:

    1. ’signal_to_noise’ (s2n) or ‘abs_signal_to_noise’ (abs_s2n)

      You must have at least three samples for each phenotype. The more distinct the gene expression is in each phenotype, the more the gene acts as a “class marker”.

    2. ’t_test’

      Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the t-test ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    3. ’ratio_of_classes’ (also referred to as fold change).

      Uses the ratio of class means to calculate fold change for natural scale data.

    4. ’diff_of_classes’

      Uses the difference of class means to calculate fold change for natural scale data

    5. ’log2_ratio_of_classes’

      Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for log scale data.

  • pos (str) – one of labels of phenotype’s names.

  • neg (str) – one of labels of phenotype’s names.

  • classes (dict) – column id to group mapping.

  • ascending (bool) – bool or list of bool. Sort ascending vs. descending.

Returns:

returns argsort values of a tuple where 0: argsort positions (indices) 1: pd.Series of correlation value. Gene_name is index, and value is rankings.

visit here for more docs: http://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html

load_classes(classes: str | List[str] | Dict[str, Any])[source]

Parse group (classes)

load_data(groups: List[str] | Dict[str, Any]) Tuple[DataFrame, Dict][source]

pre-processed the data frame.new filtering methods will be implement here.

run()[source]

GSEA main procedure

class gseapy.gsea.Prerank(rnk: DataFrame | Series | str, gene_sets: List[str] | str | Dict[str, str], outdir: str | None = None, pheno_pos='Pos', pheno_neg='Neg', min_size: int = 15, max_size: int = 500, permutation_num: int = 1000, weight: float = 1.0, ascending: bool = False, threads: int = 1, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', graph_num: int = 20, no_plot: bool = False, seed: int = 123, verbose: bool = False)[source]

GSEA prerank tool

run()[source]

GSEA prerank workflow

class gseapy.gsea.Replot(indir: str, outdir: str = 'GSEApy_Replot', weight: float = 1.0, min_size: int = 3, max_size: int = 1000, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', verbose: bool = False)[source]

To reproduce GSEA desktop output results.

gsea_edb_parser(results_path)[source]

Parse results.edb file stored under edb file folder.

Parameters:

results_path – the path of results.edb file.

Returns:

a dict contains { enrichment_term: [es, nes, pval, fdr, fwer, hit_ind]}

run()[source]

main replot function

class gseapy.base.GMT(mapping: Dict[str, str] | None = None, description: str | None = None)[source]
apply(func)[source]

apply function in place

write(ofname: str)[source]

write gmt file to disk

class gseapy.base.GSEAbase(outdir: str | None = None, gene_sets: List[str] | str | Dict[str, str] = 'KEGG_2016', module: str = 'base', threads: int = 1, enrichr_url: str = 'http://maayanlab.cloud', verbose: bool = False)[source]

base class of GSEA.

enrichment_score(gene_list: Iterable[str], correl_vector: Iterable[float], gene_set: Dict[str, List[str]], weight: float = 1.0, nperm: int = 1000, seed: int = 123, single: bool = False, scale: bool = False)[source]

This is the most important function of GSEApy. It has the same algorithm with GSEA and ssGSEA.

Parameters:
  • gene_list – The ordered gene list gene_name_list, rank_metric.index.values

  • gene_set – gene_sets in gmt file, please use gmt_parser to get gene_set.

  • weight – It’s the same with gsea’s weighted_score method. Weighting by the correlation is a very reasonable choice that allows significant gene sets with less than perfect coherence. options: 0(classic),1,1.5,2. default:1. if one is interested in penalizing sets for lack of coherence or to discover sets with any type of nonrandom distribution of tags, a value p < 1 might be appropriate. On the other hand, if one uses sets with large number of genes and only a small subset of those is expected to be coherent, then one could consider using p > 1. Our recommendation is to use p = 1 and use other settings only if you are very experienced with the method and its behavior.

  • correl_vector – A vector with the correlations (e.g. signal to noise scores) corresponding to the genes in the gene list. Or rankings, rank_metric.values

  • nperm – Only use this parameter when computing esnull for statistical testing. Set the esnull value equal to the permutation number.

  • seed – Random state for initializing gene list shuffling. Default: seed=None

Returns:

ES: Enrichment score (real number between -1 and +1)

ESNULL: Enrichment score calculated from random permutations.

Hits_Indices: Index of a gene in gene_list, if gene is included in gene_set.

RES: Numerical vector containing the running enrichment score for all locations in the gene list .

get_libraries() List[str][source]

return active enrichr library name.Offical API

load_gmt(gene_list: Iterable[str], gmt: List[str] | str | Dict[str, str]) Dict[str, List[str]][source]

load gene set dict

load_gmt_only(gmt: List[str] | str | Dict[str, str]) Dict[str, List[str]][source]

parse gene_sets. gmt: List, Dict, Strings

However,this function will merge different gene sets into one big dict to save computation time for later.

parse_gmt(gmt: str) Dict[str, List[str]][source]

gmt parser when input is a string

plot(terms: str | List[str], colors: str | List[str] | None = None, legend_kws: Dict[str, Any] | None = None, figsize: Tuple[float, float] = (4, 5), show_ranking: bool = True, ofname: str | None = None)[source]

terms: str, list. terms/pathways to show colors: str, list. list of colors for each term/pathway legend_kws: kwargs to pass to ax.legend. e.g. loc, bbox_to_achor. ofname: savefig

prepare_outdir()[source]

create temp directory.

property results

compatible to old style

to_df(gsea_summary: List[Dict], gmt: Dict[str, List[str]], rank_metric: Series | DataFrame, indices: List | None = None)[source]

Convernt GSEASummary to DataFrame

rank_metric: if a Series, then it must be sorted in descending order already

if a DataFrame, indices must not None.

indices: Only works for DataFrame input. Stores the indices of sorted array

6.3. Over-representation Statistics

gseapy.stats.calc_pvalues(query, gene_sets, background=20000, **kwargs)[source]

calculate pvalues for all categories in the graph

Parameters:
  • query (set) – set of identifiers for which the p value is calculated

  • gene_sets (dict) – gmt file dict after background was set

  • background (set) – total number of genes in your annotated database.

Returns:

pvalues x: overlapped gene number n: length of gene_set which belongs to each terms hits: overlapped gene names.

6. For 2*2 contingency table:

in query | not in query | row total

=> in gene_set | a | b | a+b => not in gene_set | c | d | c+d

column total | a+b+c+d = anno database

Then, in R
x=a the number of white balls drawn without replacement

from an urn which contains both black and white balls.

m=a+b the number of white balls in the urn n=c+d the number of black balls in the urn k=a+c the number of balls drawn from the urn

In Scipy: for args in scipy.hypergeom.sf(k, M, n, N, loc=0):

M: the total number of objects, n: the total number of Type I objects. k: the random variate represents the number of Type I objects in N drawn

without replacement from the total population.

Therefore, these two functions are the same when using parameters from 2*2 table: R: > phyper(x-1, m, n, k, lower.tail=FALSE) Scipy: >>> hypergeom.sf(x-1, m+n, m, k)

For Odds ratio in Enrichr (see https://maayanlab.cloud/Enrichr/help#background&q=4)

oddsRatio = (1.0 * x * d) / Math.max(1.0 * b * c, 1)

where:

x are the overlapping genes, b (m-x) are the genes in the annotated set - overlapping genes, c (k-x) are the genes in the input set - overlapping genes, d (bg-m-k+x) are the 20,000 genes (or total genes in the background) - genes in the annotated set - genes in the input set + overlapping genes

gseapy.stats.fdrcorrection(pvals, alpha=0.05)[source]

benjamini hocheberg fdr correction. inspired by statsmodels

gseapy.stats.multiple_testing_correction(ps, alpha=0.05, method='benjamini-hochberg', **kwargs)[source]

correct pvalues for multiple testing and add corrected q value

Parameters:
  • ps – list of pvalues

  • alpha – significance level default : 0.05

  • method – multiple testing correction method [bonferroni|benjamini-hochberg]

Returns (q, rej):

two lists of q-values and rejected nodes

6.4. Enrichr API

class gseapy.enrichr.Enrichr(gene_list: Iterable[str], gene_sets: List[str] | str | Dict[str, str], organism: str = 'human', outdir: str | None = 'Enrichr', background: List[str] | int | str = 'hsapiens_gene_ensembl', cutoff: float = 0.05, format: str = 'pdf', figsize: Tuple[float, float] = (6.5, 6), top_term: int = 10, no_plot: bool = False, verbose: bool = False)[source]

Enrichr API

check_genes(gene_list: List[str], usr_list_id: str)[source]

Compare the genes sent and received to get successfully recognized genes

enrich(gmt: Dict[str, List[str]])[source]

use local mode

p = p-value computed using the Fisher exact test (Hypergeometric test) z = z-score (Odds Ratio) combine score = - log(p)·z

see here: http://amp.pharm.mssm.edu/Enrichr/help#background&q=4

columns contain:

Term Overlap P-value Odds Ratio Combinde Score Adjusted_P-value Genes

filter_gmt(gmt, background)[source]

the gmt values should be filtered only for genes that exist in background this substantially affect the significance of the test, the hypergeometric distribution.

Parameters:
  • gmt – a dict of gene sets.

  • background – list, set, or tuple. A list of custom backgound genes.

get_background() Set[str][source]

get background gene

get_libraries() List[str][source]

return active enrichr library name. Official API

get_results(gene_list: List[str]) Tuple[AnyStr, DataFrame][source]

Enrichr API

parse_background(gmt: Dict[str, List[str]] | None = None)[source]

set background genes

parse_genelists() str[source]

parse gene list

parse_genesets(gene_sets=None)[source]

parse gene_sets input file type

prepare_outdir()[source]

create temp directory.

run()[source]

run enrichr for one sample gene list but multi-libraries

send_genes(payload, url) Dict[source]

send gene list to enrichr server

set_organism()[source]

Select Enrichr organism from below:

Human & Mouse, H. sapiens & M. musculus Fly, D. melanogaster Yeast, S. cerevisiae Worm, C. elegans Fish, D. rerio

6.5. BioMart API

class gseapy.biomart.Biomart(host: str = 'www.ensembl.org', verbose: bool = False)[source]

query from BioMart

add_filter(name: str, value: Iterable[str])[source]

key: filter names value: Iterable[str]

get_attributes(dataset: str = 'hsapiens_gene_ensembl')[source]

Get available attritbutes from dataset you’ve selected

get_datasets(mart: str = 'ENSEMBL_MART_ENSEMBL')[source]

Get available datasets from mart you’ve selected

get_filters(dataset: str = 'hsapiens_gene_ensembl')[source]

Get available filters from dataset you’ve selected

get_marts()[source]

Get available marts and their names.

query(dataset: str = 'hsapiens_gene_ensembl', attributes: List[str] | None = [], filters: Dict[str, Iterable[str]] | None = {}, filename: str | None = None)[source]

mapping ids using BioMart.

Parameters:
  • dataset – str, default: ‘hsapiens_gene_ensembl’

  • attributes – str, list, tuple

  • filters – dict, {‘filter name’: list(filter value)}

  • host – www.ensembl.org, asia.ensembl.org, useast.ensembl.org

Returns:

a dataframe contains all attributes you selected.

Example:

>>> queries = {'ensembl_gene_id': ['ENSG00000125285','ENSG00000182968'] } # need to be a python dict
>>> results = bm.query(dataset='hsapiens_gene_ensembl',
                       attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id', 'go_id'],
                       filters=queries)
query_simple(dataset: str = 'hsapiens_gene_ensembl', attributes: List[str] = [], filters: Dict[str, Iterable[str]] = {}, filename: str | None = None)[source]

This function is a simple version of BioMart REST API. same parameter to query().

However, you could get cross page of mapping. such as Mouse 2 human gene names

Note: it will take a couple of minutes to get the results. A xml template for querying biomart. (see https://gist.github.com/keithshep/7776579)

Example::
>>> from gseapy import Biomart
>>> bm = Biomart()
>>> results = bm.query_simple(dataset='mmusculus_gene_ensembl',
                              attributes=['ensembl_gene_id',
                                          'external_gene_name',
                                          'hsapiens_homolog_associated_gene_name',
                                          'hsapiens_homolog_ensembl_gene'])

6.6. Parser

gseapy.parser.download_library(name: str, organism: str = 'human') Dict[str, List[str]][source]

download enrichr libraries.

Parameters:
  • name (str) – the enrichr library name. see gseapy.get_library_name().

  • organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }

Return dict:

gene_sets of the enrichr library from selected organism

gseapy.parser.get_library(name: str, organism: str = 'Human', min_size: int = 0, max_size: int = 2000, gene_list: List[str] | None = None) Dict[str, List[str]][source]

Parse gene_sets.gmt(gene set database) file or download from enrichr server.

Parameters:
  • name (str) – the gene_sets.gmt file or an enrichr library name. checkout full enrichr library name here: https://maayanlab.cloud/Enrichr/#libraries

  • organism (str) – choose one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }. This arugment has not effect if input is a .gmt file.

  • min_size – Minimum allowed number of genes for each gene set. Default: 0.

  • max_size – Maximum allowed number of genes for each gene set. Default: 2000.

  • gene_list – if input a gene list, min and max overlapped genes between gene set and gene_list are kept.

Return dict:

Return a filtered gene set database dictionary.

Note: DO NOT filter gene sets, when use replot(). Because GSEA Desktop have already done this for you.

gseapy.parser.get_library_name(organism: str = 'Human') List[str][source]

return enrichr active enrichr library name. see also: https://maayanlab.cloud/modEnrichr/

Parameters:

organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }

Returns:

a list of enrichr libraries from selected database

gseapy.parser.gsea_cls_parser(cls: str) Tuple[str][source]

Extract class(phenotype) name from .cls file.

Parameters:

cls – the a class list instance or .cls file which is identical to GSEA input .

Returns:

phenotype name and a list of class vector.

gseapy.parser.gsea_edb_parser(results_path: str) Dict[str, List[str]][source]

Parse results.edb file stored under edb file folder.

Parameters:

results_path – the path of results.edb file.

Returns:

a dict contains { enrichment_term: [es, nes, pval, fdr, fwer, hit_ind]}

gseapy.parser.read_gmt(path: str) Dict[str, List[str]][source]

Read GMT file

Parameters:

path (str) – the path to a gmt file.

Returns:

a dict object

6.7. Visualization

class gseapy.plot.MidpointNormalize(vmin=None, vmax=None, vcenter=None, clip=False)[source]
gseapy.plot.barplot(df: DataFrame, column: str = 'Adjusted P-value', group: str | None = None, title: str = '', cutoff: float = 0.05, top_term: int = 10, figsize: Tuple[float, float] = (4, 6), color: str | List[str] | Dict[str, str] = 'salmon', ofname: str | None = None, **kwargs)[source]

Visualize GSEApy Results. When multiple datasets exist in the input dataframe, the group argument is your friend.

Parameters:
  • df – GSEApy DataFrame results.

  • column – column name in df to map the x-axis data. Default: Adjusted P-value

  • group – group by the variable in df that will produce bars with different colors.

  • title – figure title.

  • cutoff – terms with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)

  • top_term – number of top enriched terms grouped by hue are shown.

  • figsize – tuple, matplotlib figsize.

  • color – color or list or dict of matplotlib.colors. Must be reconigzed by matplotlib. if dict input, dict keys must be found in the group

  • ofname – output file name. If None, don’t save figure

Returns:

matplotlib.Axes. return None if given ofname. Only terms with column <= cut-off are plotted.

gseapy.plot.dotplot(df: DataFrame, column: str = 'Adjusted P-value', x: str | None = None, y: str = 'Term', x_order: List[str] | bool = False, y_order: List[str] | bool = False, title: str = '', cutoff: float = 0.05, top_term: int = 10, size: float = 5, figsize: Tuple[float, float] = (4, 6), cmap: str = 'viridis_r', ofname: str | None = None, xticklabels_rot: float | None = None, yticklabels_rot: float | None = None, marker: str = 'o', show_ring: bool = False, **kwargs)[source]

Visualize GSEApy Results with categorical scatterplot When multiple datasets exist in the input dataframe, the x argument is your friend.

Parameters:
  • df – GSEApy DataFrame results.

  • column – column name in df that map the dot colors. Default: Adjusted P-value.

  • x – Categorical variable in df that map the x-axis data. Default: None.

  • y – Categorical variable in df that map the y-axis data. Default: Term.

  • x_order – bool, array-like list. Default: False. If True, peformed hierarchical_clustering on X-axis. or input a array-like list of x categorical levels.

  • x_order – bool, array-like list. Default: False. If True, peformed hierarchical_clustering on Y-axis. or input a array-like list of y categorical levels.

  • title – Figure title.

  • cutoff – Terms with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)

  • top_term – Number of enriched terms to show.

  • size – float, scale the dot size to get proper visualization.

  • figsize – tuple, matplotlib figure size.

  • cmap – Matplotlib colormap for mapping the column semantic.

  • ofname – Output file name. If None, don’t save figure

  • marker – The matplotlib.markers. See https://matplotlib.org/stable/api/markers_api.html

  • bool (show_ring) – Whether to draw outer ring.

Returns:

matplotlib.Axes. return None if given ofname. Only terms with column <= cut-off are plotted.

gseapy.plot.enrichment_map(df: DataFrame, column: str = 'Adjusted P-value', cutoff: float = 0.05, top_term: int = 10, **kwargs) Tuple[DataFrame, DataFrame][source]

Visualize GSEApy Results. Node size corresponds to the percentage of gene overlap in a certain term of interest. Colour of the node corresponds to the significance of the enriched terms. Edge size corresponds to the number of genes that overlap between the two connected nodes. Gray edges correspond to both nodes when it is the only colour edge. When there are two different edge colours, red corresponds to positve nodes and blue corresponds to negative nodes.

Parameters:
  • df – GSEApy DataFrame results.

  • column – column name in df to map the node colors. Default: Adjusted P-value or FDR q-val. choose from (“Adjusted P-value”, “P-value”, “FDR q-val”, “NOM p-val”).

  • group – group by the variable in df that will produce bars with different colors.

  • title – figure title.

  • cutoff – nodes with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)

  • top_term – number of top enriched terms are selected as nodes.

Returns:

tuple of dataframe (nodes, edges)

gseapy.plot.gseaplot(term: str, hits: Sequence[int], nes: float, pval: float, fdr: float, RES: Sequence[float], rank_metric: Sequence[float] | None = None, pheno_pos: str = '', pheno_neg: str = '', color: str = '#88C544', figsize: Tuple[float, float] = (6, 5.5), cmap: str = 'seismic', ofname: str | None = None, **kwargs) List[Axes] | None[source]

This is the main function for generating the gsea plot.

Parameters:
  • term – gene_set name

  • hits – hits indices of rank_metric.index presented in gene set S.

  • nes – Normalized enrichment scores.

  • pval – nominal p-value.

  • fdr – false discovery rate.

  • RES – running enrichment scores.

  • rank_metric – pd.Series for rankings, rank_metric.values.

  • pheno_pos – phenotype label, positive correlated.

  • pheno_neg – phenotype label, negative correlated.

  • color – color for RES and hits.

  • figsize – matplotlib figsize.

  • ofname – output file name. If None, don’t save figure

return matplotlib.Figure.

gseapy.plot.gseaplot2(terms: List[str], hits: List[Sequence[int]], RESs: List[Sequence[float]], rank_metric: Sequence[float] | None = None, colors: str | List[str] | None = None, figsize: Tuple[float, float] = (6, 4), legend_kws: Dict[str, Any] | None = None, ofname: str | None = None, **kwargs) List[Axes] | None[source]

Trace plot for combining multiple terms/pathways into one plot :param terms: list of terms to show in trace plot :param hits: list of hits indices correspond to each term. :param RESs: list of running enrichment scores correspond to each term. :param rank_metric: Optional, rankings. :param figsize: matplotlib figsize. :legend_kws: Optional, contol the location of lengends :param ofname: output file name. If None, don’t save figure

return matplotlib.Figure.

gseapy.plot.heatmap(df: DataFrame, z_score: int | None = None, title: str = '', figsize: Tuple[float, float] = (5, 5), cmap: str | None = None, xticklabels: bool = True, yticklabels: bool = True, ofname: str | None = None, **kwargs)[source]

Visualize the dataframe.

Parameters:
  • df – DataFrame from expression table.

  • z_score – 0, 1, or None. z_score axis{0, 1}. If None, not scale.

  • title – figure title.

  • figsize – heatmap figsize.

  • cmap – matplotlib colormap. e.g. “RdBu_r”.

  • xticklabels – bool, whether to show xticklabels.

  • xticklabels – bool, whether to show xticklabels.

  • ofname – output file name. If None, don’t save figure

gseapy.plot.ringplot(df: DataFrame, column: str = 'Adjusted P-value', x: str | None = None, title: str = '', cutoff: float = 0.05, top_term: int = 10, size: float = 5, figsize: Tuple[float, float] = (4, 6), cmap: str = 'viridis_r', ofname: str | None = None, xticklabels_rot: float | None = None, yticklabels_rot: float | None = None, marker='o', show_ring: bool = True, **kwargs)[source]

ringplot is deprecated, use dotplot instead

Parameters:
  • df – GSEApy DataFrame results.

  • x – Group by the variable in df that will produce categorical scatterplot.

  • column – column name in df to map the dot colors. Default: Adjusted P-value

  • title – figure title

  • cutoff – terms with column value < cut-off are shown. Work only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”)

  • top_term – number of enriched terms to show.

  • size – float, scale the dot size to get proper visualization.

  • figsize – tuple, matplotlib figure size.

  • cmap – matplotlib colormap for mapping the column semantic.

  • ofname – output file name. If None, don’t save figure

  • marker – the matplotlib.markers. See https://matplotlib.org/stable/api/markers_api.html

  • bool (show_ring) – whether to show outer ring.

Returns:

matplotlib.Axes. return None if given ofname. Only terms with column <= cut-off are plotted.

gseapy.plot.zscore(data2d: DataFrame, axis: int | None = 0)[source]

Standardize the mean and variance of the data axis Parameters.

Parameters:
  • data2d – DataFrame to normalize.

  • axis – int, Which axis to normalize across. If 0, normalize across rows, if 1, normalize across columns. If None, don’t change data

Returns:

Normalized DataFrame. Normalized data with a mean of 0 and variance of 1 across the specified axis.

6.8. Scientific Journal and Sci- themed Color Palettes

6.9. Utils