5. Developmental Guide

5.1. Module APIs

gseapy.gsea()

Run Gene Set Enrichment Analysis.

Parameters:

data – Gene expression data table: a pandas DataFrame or a .gct file.
gene_sets – Enrichr library name, .gmt gene sets file, or dict of gene sets. Same input as GSEA. NOTE: If multiple gene sets are provided, the FDR null distribution is based on the combined gene sets. This may lead to slight differences in FDR values compared to running GSEA separately for each gene set. See the GitHub issue for more details: https://github.com/zqfang/GSEApy/issues/323
cls – A list of class labels or a .cls file, in the format required by GSEA.
organism (str) – Organism for Enrichr library names (human, mouse, yeast, fly, fish, worm); does not affect custom gene sets (gmt or dict).
outdir (str) – Results output directory. If None, nothing is written to disk.
permutation_num (int) – Number of permutations. Default: 1000. The minimal possible nominal p-value is about 1/permutation_num.
permutation_type (str) – Type of permutation reshuffling. Choose from ‘phenotype’ (permutes sample labels) or ‘gene_set’ (permutes gene labels).
min_size (int) – Minimum allowed number of genes shared by a gene set and the data set. Default: 15.
max_size (int) – Maximum allowed number of genes shared by a gene set and the data set. Default: 500.
weight (float) – Refer to algorithm.enrichment_score(). Default: 1.
method –
The method used to calculate a correlation or ranking. Default: ‘signal_to_noise’. Available methods are:
1. ‘signal_to_noise’
  
  You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
2. ’t_test’
  
  Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the t-test ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
3. ’ratio_of_classes’ (also referred to as fold change).
  
  Uses the ratio of class means to calculate fold change for natural scale data.
4. ’diff_of_classes’
  
  Uses the difference of class means to calculate fold change for log scale data.
5. ’log2_ratio_of_classes’
  
  Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for natural scale data.
ascending (bool) – Sorting order of rankings. Default: False.
threads (int) – Number of threads you are going to use. Default: 4.
figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].
format (str) – Matplotlib figure format. Default: ‘pdf’.
graph_num (int) – Plot graphs for top sets of each phenotype.
no_plot (bool) – If True, no figure will be drawn. Default: False.
seed – Random seed; expects an integer. Default: None.
verbose (bool) – Increase output verbosity and print job progress. Default: False.

Returns:

Returns a GSEA object. All results are stored in a dictionary, obj.results, which contains:

| {
|  term: gene set name,
|  es: enrichment score,
|  nes: normalized enrichment score,
|  pval: nominal p-value (from the null distribution of the gene set),
|  fdr: FDR q-value (adjusted False Discovery Rate),
|  fwerp: family-wise error rate p-value,
|  tag %: percent of the gene set before the running enrichment peak (ES),
|  gene %: percent of the gene list before the running enrichment peak (ES),
|  lead_genes: leading-edge genes (gene hits before the running enrichment peak),
|  matched genes: genes matched to the data,
| }
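
To illustrate why the nominal p-value cannot be resolved much below 1/permutation_num, here is a minimal sketch of an empirical p-value computed from a list of null enrichment scores. It is not GSEApy's implementation: the function name `nominal_pvalue`, the same-sign convention, and the toy numbers are assumptions made for illustration.

```python
def nominal_pvalue(es, null_es):
    """Share of same-sign null scores that are at least as extreme as `es`."""
    if es >= 0:
        same_sign = [v for v in null_es if v >= 0]
        extreme = [v for v in same_sign if v >= es]
    else:
        same_sign = [v for v in null_es if v < 0]
        extreme = [v for v in same_sign if v <= es]
    if not same_sign:
        return 1.0  # nothing to compare against: report no evidence
    return len(extreme) / len(same_sign)


# Toy null distribution: 11 scores from -0.5 to 0.5 (not real data).
null_scores = [i / 10 for i in range(-5, 6)]
p_pos = nominal_pvalue(0.45, null_scores)   # one of six non-negative scores is >= 0.45
p_neg = nominal_pvalue(-0.45, null_scores)  # one of five negative scores is <= -0.45
print(p_pos, p_neg)
```

Because the estimate is a ratio of counts, it moves in steps of one over the number of comparable null scores, which is why more permutations are needed to report smaller p-values.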

gseapy.prerank()[source]

Run Gene Set Enrichment Analysis with pre-ranked correlation defined by user.

Parameters:

rnk – Pre-ranked correlation table or pandas DataFrame. Same input as a GSEA .rnk file.
gene_sets – Enrichr library name, .gmt gene sets file, or dict of gene sets. Same input as GSEA. NOTE: If multiple gene sets are provided, the FDR null distribution is based on the combined gene sets. This may lead to slight differences in FDR values compared to running GSEA separately for each gene set. See the GitHub issue for more details: https://github.com/zqfang/GSEApy/issues/323
organism (str) – Organism for Enrichr library names (human, mouse, yeast, fly, fish, worm); does not affect custom gene sets (gmt or dict).
outdir – Results output directory. If None, nothing is written to disk.
permutation_num (int) – Number of permutations. Default: 1000. The minimal possible nominal p-value is about 1/permutation_num.
min_size (int) – Minimum allowed number of genes shared by a gene set and the data set. Default: 15.
max_size (int) – Maximum allowed number of genes shared by a gene set and the data set. Default: 500.
weight (float) – Refer to algorithm.enrichment_score(). Default: 1.
ascending (bool) – Sorting order of rankings. Default: False for descending. If None, do not sort the ranking.
threads (int) – Number of threads you are going to use. Default: 4.
figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].
format (str) – Matplotlib figure format. Default: ‘pdf’.
graph_num (int) – Plot graphs for top sets of each phenotype.
no_plot (bool) – If True, no figure will be drawn. Default: False.
seed – Random seed; expects an integer. Default: None.
verbose (bool) – Increase output verbosity and print job progress. Default: False.
method (str) –
P-value / significance estimation procedure. Default: ‘permutation’. Choose from:
1. ’permutation’
  
  Classic gene-set permutation: the null distribution of ES is built by permuting gene-set membership permutation_num times. NES, nominal p-value and FDR are derived from this null. Supports both a single preranked list and a multi-column ranking DataFrame.
2. ’multilevel’
  
  fgsea multilevel procedure (a faithful port of the fgsea C++ core). Estimates arbitrarily small p-values via adaptive multilevel sampling instead of plain permutation, so it can resolve significance well below 1 / permutation_num. NES is computed from fgsea’s random-gene-set null (NES = ES / mean(same-sign null ES)), which differs by design from the classic permutation NES. Supports a single preranked list only (a multi-column DataFrame raises NotImplementedError).
sample_size (int) – Only used when method='multilevel'. Sample size for the multilevel split step of the fgsea algorithm; larger values give more accurate (but slower) tail p-value estimates. Default: 101.
eps (float) – Only used when method='multilevel'. Lower boundary for the estimated p-value; p-values smaller than eps are reported as eps. Set to 0 to estimate p-values as small as machine precision allows. Default: 1e-50.
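
The NES definition quoted for ‘multilevel’ and the role of eps can be sketched in a few lines. This is an illustrative sketch only, not the fgsea port used by GSEApy: the helper names `normalized_es` and `clamp_pvalue` are hypothetical, and dividing by the mean absolute value of the same-sign null scores (so that NES keeps the sign of ES) is an assumption.

```python
def normalized_es(es, null_es):
    """NES = ES / mean(|same-sign null ES|); NaN when no same-sign null exists."""
    same_sign = [v for v in null_es if (v >= 0) == (es >= 0)]
    if not same_sign:
        return float("nan")
    mean_abs = sum(abs(v) for v in same_sign) / len(same_sign)
    return es / mean_abs if mean_abs > 0 else float("nan")


def clamp_pvalue(pvalue, eps=1e-50):
    """Report p-values smaller than eps as eps (eps=0 disables the floor)."""
    return max(pvalue, eps)


toy_null = [0.2, 0.4, -0.3]           # toy null scores, not real data
print(normalized_es(0.6, toy_null))   # uses only the positive null scores
print(clamp_pvalue(1e-80))            # floored at the default eps
```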

Returns:

Returns a Prerank object. All results are stored in a dictionary, obj.results, which contains:

| {
|  term: gene set name,
|  es: enrichment score,
|  nes: normalized enrichment score,
|  pval: nominal p-value (from the null distribution of the gene set),
|  fdr: FDR q-value (adjusted False Discovery Rate),
|  fwerp: family-wise error rate p-value,
|  tag %: percent of the gene set before the running enrichment peak (ES),
|  gene %: percent of the gene list before the running enrichment peak (ES),
|  lead_genes: leading-edge genes (gene hits before the running enrichment peak),
|  matched genes: genes matched to the data,
| }

gseapy.ssgsea()[source]

Run single-sample Gene Set Enrichment Analysis (ssGSEA).

Parameters:

data – Expression table, pd.Series, pd.DataFrame, GCT file, or .rnk file format.
gene_sets – Enrichr library name, .gmt gene sets file, or dict of gene sets. Same input as GSEA.
organism (str) – Organism for Enrichr library names (human, mouse, yeast, fly, fish, worm); does not affect custom gene sets (gmt or dict).
outdir – Results output directory. If None, nothing is written to disk.
sample_norm_method (str) –
Sample normalization method. Choose from {‘rank’, ‘log’, ‘log_rank’, None}. Default: ‘rank’. This argument determines how genes are ordered.
1. ‘rank’: Rank your expression data, and transform by 10000*rank_dat/gene_numbers.
2. ‘log’: Do not rank, but set values below 1 to 1 (data[data<1] = 1) and transform by log(data + exp(1)).
3. ‘log_rank’: Rank your expression data, and transform by log(10000*rank_dat/gene_numbers + exp(1)).
4. None or ‘custom’: Do nothing, and use your own rank values to calculate the enrichment score.

See: https://github.com/GSEA-MSigDB/ssGSEAProjection-gpmodule/blob/master/src/ssGSEAProjection.Library.R, line 86

correl_norm_type (str) –
Correlation normalization type. Choose from {‘rank’, ‘symrank’, ‘zscore’, None}. Default: ‘rank’. After genes are ordered by sample_norm_method, a further transformation can be applied before the enrichment score is calculated.

When weight == 0, sample_norm_method and correl_norm_type do not matter. When weight > 0, the combination of sample_norm_method and correl_norm_type dictates how the gene expression values in the input data are transformed to obtain the score. Use this setting with care, since the transformations can skew scores towards positive or negative values.

sample_norm_method first transforms and ranks the original data; the result is named correl_vector for each sample. correl_vector is then transformed again according to correl_norm_type:
1. correl_norm_type is None or ‘rank’: do nothing; genes are weighted by the actual correl_vector.
2. correl_norm_type == ‘symrank’: symmetric ranking.
3. correl_norm_type == ‘zscore’: standardizes the correl_vector before it is used to calculate scores.
min_size (int) – Minimum allowed number of genes shared by a gene set and the data set. Default: 15.
max_size (int) – Maximum allowed number of genes shared by a gene set and the data set. Default: 2000.
permutation_num (int) – For ssGSEA, the default is 0. To obtain p-values and FDR with the ssGSEA method, set this to a positive integer.
weight (float) – Refer to algorithm.enrichment_score(). Default: 0.25.
ascending (bool) – Sorting order of rankings. Default: False.
threads (int) – Number of threads you are going to use. Default: 4.
figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [7,6].
format (str) – Matplotlib figure format. Default: ‘pdf’.
graph_num (int) – Plot graphs for top sets of each phenotype.
no_plot (bool) – If True, no figure will be drawn. Default: False.
seed – Random seed; expects an integer. Default: None.
verbose (bool) – Increase output verbosity and print job progress. Default: False.
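
The sample_norm_method options listed above can be sketched for a single sample as follows. This is a minimal illustration under stated assumptions, not the GSEApy code: the helper names `average_ranks` and `normalize_sample` are hypothetical, and ties are assumed to receive average ranks.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def normalize_sample(values, method="rank"):
    """Apply one of the documented sample normalizations to one sample."""
    n = len(values)
    if method == "rank":
        return [10000 * r / n for r in average_ranks(values)]
    if method == "log":
        # values below 1 are first set to 1, then log(data + e) is taken
        return [math.log(max(v, 1) + math.e) for v in values]
    if method == "log_rank":
        return [math.log(10000 * r / n + math.e) for r in average_ranks(values)]
    if method in (None, "custom"):
        return list(values)  # use the caller's own values unchanged
    raise ValueError(f"unknown method: {method!r}")


sample = [5.0, 1.0, 3.0, 3.0]  # toy expression values for four genes
print(normalize_sample(sample, "rank"))
```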

Returns:

Returns an ssGSEA object. All results are stored in a dictionary; access the enrichment score or normalized enrichment score via obj.res2d or obj.results. If permutation_num > 0, additional results contain:

| {
|  term: gene set name,
|  es: enrichment score,
|  nes: normalized enrichment score,
|  pval: nominal p-value (from the null distribution of the gene set) (if permutation_num > 0),
|  fdr: FDR q-value (adjusted False Discovery Rate) (if permutation_num > 0),
|  fwerp: family-wise error rate p-value (if permutation_num > 0),
|  tag %: percent of the gene set before the running enrichment peak (ES),
|  gene %: percent of the gene list before the running enrichment peak (ES),
|  lead_genes: leading-edge genes (gene hits before the running enrichment peak),
|  matched genes: genes matched to the data,
| }

gseapy.enrichr()[source]

Enrichr API.

Parameters:

gene_list – str, list, tuple, pd.Series, pd.DataFrame. Also supports input txt file path with one gene id per row. The gene identifiers in gene_list should match the type used in gene_sets.
gene_sets –
str, list of Enrichr library name(s), or custom defined gene_sets (dict, or gmt file).

Examples:

Input Enrichr Libraries (https://maayanlab.cloud/Enrichr/#stats):
str: ‘KEGG_2016’
list: [‘KEGG_2016’, ‘KEGG_2013’]
Or use commas to separate library names in a single string, e.g. “KEGG_2016,huMAP,GO_Biological_Process_2018”

Input custom files:

dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],
’B’:[‘gene2’, ‘gene4’,…], …}

gmt: “genes.gmt”

see also the online docs: https://gseapy.readthedocs.io/en/latest/gseapy_example.html#2.-Enrichr-Example
organism –
Organism for Enrichr library names (human, mouse, yeast, fly, fish, worm); does not affect custom gene sets (gmt or dict).
outdir – Output file directory
background –
Background gene set for statistical testing. Type: None | int | list | str.

When is this used?
- Only applies to custom gene sets (gmt file or dict).
- Ignored when using Enrichr library names (e.g., ‘KEGG_2016’).

Default behavior:
- If None: all genes in your gene_sets are used as background.

Recommended usage (3 options):

Option 1: Gene list (RECOMMENDED)
Provide your experiment-specific background genes:

background=[‘gene1’, ‘gene2’, ‘gene3’, …]

Example: All expressed genes from your RNA-seq experiment.
Note: Gene identifiers must match those in your gene_sets.

Option 2: Gene count (simple but less accurate)
Specify total number of genes tested:

background=20000 # total genes in your experiment

Warning: Assumes all genes could be detected. May affect statistical accuracy if gene sets contain genes not in your actual background.

Option 3: BioMart dataset (automatic download)
Use a BioMart database name for genome-wide background:

background=‘hsapiens_gene_ensembl’ # human genes
background=‘mmusculus_gene_ensembl’ # mouse genes

The program downloads all annotated genes with Entrez IDs. First download may take a few minutes; results are cached.

Cached location: ~/.cache/gseapy/{dataset}.background.genes.txt

Why does background matter? Background genes define the “universe” for hypergeometric testing. Using the correct background (e.g., detected genes in your experiment) improves statistical accuracy compared to using all possible genes in a genome.
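
As a rough illustration of why the background matters, the sketch below evaluates the same overlap against two different universe sizes using a pure-Python hypergeometric tail. It is not GSEApy code: the helper name `hypergeom_tail` and all numbers are made up for the example.

```python
from math import comb


def hypergeom_tail(overlap, universe, set_size, draws):
    """P(X >= overlap) when `draws` genes are sampled without replacement
    from `universe` genes, of which `set_size` belong to the gene set."""
    upper = min(set_size, draws)
    total = comb(universe, draws)
    favourable = sum(
        comb(set_size, i) * comb(universe - set_size, draws - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


# Same query (200 genes), same gene set (100 genes), same overlap (5 genes).
p_genome = hypergeom_tail(5, 20000, 100, 200)    # background=20000
p_detected = hypergeom_tail(5, 2000, 100, 200)   # background = 2000 detected genes
print(p_genome, p_detected)
```

With the genome-wide universe the overlap looks unusual, while against the smaller set of detected genes the same overlap is unremarkable, so the choice of background can change which terms appear significant.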
cutoff – Show enriched terms whose adjusted p-value is below cutoff. Only affects the output figure, not the final output file. Default: 0.05
format – Output figure format supported by matplotlib (‘pdf’, ‘png’, ‘eps’, …). Default: ‘pdf’.
figsize – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.

Returns:

An Enrichr object: obj.res2d stores the last query and obj.results stores all queries.

gseapy.enrich()[source]

Perform over-representation analysis (hypergeometric test).

Parameters:

gene_list – str, list, tuple, series, dataframe. Also supports an input txt file with one gene id per row. The input identifiers should be of the same type as those in gene_sets.
gene_sets –
custom defined gene_sets (dict, or gmt file).

Examples:

dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],
’B’:[‘gene2’, ‘gene4’,…], …}

gmt: “genes.gmt”
outdir – Output file directory
background –
Background gene set for statistical testing. Type: None | int | list | str.

When is this used?
- Only applies to custom gene sets (gmt file or dict).
- Ignored when using Enrichr library names (e.g., ‘KEGG_2016’).

Default behavior:
- If None: all genes in your gene_sets are used as background.

Recommended usage (3 options):

Option 1: Gene list (RECOMMENDED)
Provide your experiment-specific background genes:

background=[‘gene1’, ‘gene2’, ‘gene3’, …]

Example: All expressed genes from your RNA-seq experiment.
Note: Gene identifiers must match those in your gene_sets.

Option 2: Gene count (simple but less accurate)
Specify total number of genes tested:

background=20000 # total genes in your experiment

Warning: Assumes all genes could be detected. May affect statistical accuracy if gene sets contain genes not in your actual background.

Option 3: BioMart dataset (automatic download)
Use a BioMart database name for genome-wide background:

background=‘hsapiens_gene_ensembl’ # human genes
background=‘mmusculus_gene_ensembl’ # mouse genes

The program downloads all annotated genes with Entrez IDs. First download may take a few minutes; results are cached.

Cached location: ~/.cache/gseapy/{dataset}.background.genes.txt

Why does background matter? Background genes define the “universe” for hypergeometric testing. Using the correct background (e.g., detected genes in your experiment) improves statistical accuracy compared to using all possible genes in a genome.
cutoff – Show enriched terms whose adjusted p-value is below cutoff. Only affects the output figure, not the final output file. Default: 0.05
format – Output figure format supported by matplotlib (‘pdf’, ‘png’, ‘eps’, …). Default: ‘pdf’.
figsize – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).
no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.

Returns:

An Enrichr object: obj.res2d stores the last query and obj.results stores all queries.

gseapy.replot()[source]

The main function to reproduce GSEA desktop outputs.

Parameters:

indir – GSEA desktop results directory. It must contain the edb subfolder.
outdir – Output directory.
weight (float) – weighted score type. choose from {0,1,1.5,2}. Default: 1.
figsize (list) – Matplotlib output figure figsize. Default: [6.5,6].
format (str) – Matplotlib output figure format. Default: ‘pdf’.
min_size (int) – Min size of input genes presented in Gene Sets. Default: 3.
max_size (int) – Max size of input genes presented in Gene Sets. Default: 5000. Using the min_size or max_size arguments in replot() is discouraged, because the gmt file has already been filtered.
verbose – Increase output verbosity and print job progress. Default: False.

Returns:

Generate new figures with selected figure format. Default: ‘pdf’.

5.2. GSEA Statistics

class gseapy.gsea.GSEA(data: DataFrame | str, gene_sets: List[str] | str | Dict[str, str], classes: List[str] | str | Dict[str, str], organism: str = 'human', outdir: str | None = None, min_size: int = 15, max_size: int = 500, permutation_num: int = 1000, weight: float = 1.0, permutation_type: str = 'phenotype', method: str = 'signal_to_noise', ascending: bool = False, threads: int = 1, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', graph_num: int = 20, no_plot: bool = False, seed: int = 123, verbose: bool = False)[source]

GSEA main tool

calc_metric(df: DataFrame, method: str, pos: str, neg: str, classes: Dict[str, str], ascending: bool) → Tuple[List[int], Series][source]

The main function to rank an expression table. Works for 2D arrays.

Parameters:

df – gene_expression DataFrame.
method –
The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. Available methods are:
1. ’signal_to_noise’ (s2n) or ‘abs_signal_to_noise’ (abs_s2n)
  
  You must have at least three samples for each phenotype. The more distinct the gene expression is in each phenotype, the more the gene acts as a “class marker”.
2. ’t_test’
  
  Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the t-test ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
3. ’ratio_of_classes’ (also referred to as fold change).
  
  Uses the ratio of class means to calculate fold change for natural scale data.
4. ’diff_of_classes’
  
  Uses the difference of class means to calculate fold change for log scale data.
5. ’log2_ratio_of_classes’
  
  Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for natural scale data.
pos (str) – One of the phenotype label names.
neg (str) – One of the phenotype label names.
classes (dict) – column id to group mapping.
ascending (bool) – bool or list of bool. Sort ascending vs. descending.

Returns:

A tuple where element 0 holds the argsort positions (indices) and element 1 is a pd.Series of correlation values, with gene names as the index and rankings as the values.

See the GSEA user guide for more details: http://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
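
The ranking metrics listed above can be sketched for a single gene as follows. This is a simplified illustration rather than the calc_metric implementation: the function name `rank_metric` is hypothetical, sample standard deviations are assumed, and any minimum-variance adjustment applied by GSEA is left out.

```python
import math
from statistics import mean, stdev


def rank_metric(pos, neg, method="signal_to_noise"):
    """Score one gene from its expression values in two phenotypes."""
    mu_pos, mu_neg = mean(pos), mean(neg)
    if method == "signal_to_noise":
        return (mu_pos - mu_neg) / (stdev(pos) + stdev(neg))
    if method == "t_test":
        pooled = stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg)
        return (mu_pos - mu_neg) / math.sqrt(pooled)
    if method == "ratio_of_classes":
        return mu_pos / mu_neg
    if method == "diff_of_classes":
        return mu_pos - mu_neg
    if method == "log2_ratio_of_classes":
        return math.log2(mu_pos / mu_neg)
    raise ValueError(f"unknown method: {method!r}")


# Toy data: gene -> (values in the positive class, values in the negative class).
expression = {
    "GENE_A": ([8.0, 10.0, 12.0], [2.0, 4.0, 6.0]),
    "GENE_B": ([3.0, 4.0, 5.0], [3.0, 5.0, 4.0]),
    "GENE_C": ([1.0, 2.0, 3.0], [7.0, 8.0, 9.0]),
}
scores = {gene: rank_metric(p, n) for gene, (p, n) in expression.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # ascending=False
print(ranking)
```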

load_classes(classes: str | List[str] | Dict[str, Any])[source]: Parse group (classes)

load_data() → Tuple[DataFrame, Dict][source]: Pre-process the data frame. New filtering methods will be implemented here.

run()[source]: GSEA main procedure

to_cls(outdir: str)[source]: Save group information to cls file

class gseapy.gsea.Prerank(rnk: DataFrame | Series | str, gene_sets: List[str] | str | Dict[str, str], organism: str = 'human', outdir: str | None = None, pheno_pos='Pos', pheno_neg='Neg', min_size: int = 15, max_size: int = 500, permutation_num: int = 1000, weight: float = 1.0, ascending: bool | None = False, threads: int = 1, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', graph_num: int = 20, no_plot: bool = False, seed: int = 123, verbose: bool = False, method: str = 'permutation', sample_size: int = 101, eps: float = 1e-50)[source]

GSEA prerank tool

load_ranking()[source]: parse rnk input

run()[source]: GSEA prerank workflow

class gseapy.gsea.Replot(indir: str, outdir: str = 'GSEApy_Replot', weight: float = 1.0, min_size: int = 3, max_size: int = 1000, figsize: Tuple[float, float] = (6.5, 6), format: str = 'pdf', verbose: bool = False)[source]

To reproduce GSEA desktop output results.

gsea_edb_parser(results_path)[source]

Parse results.edb file stored under edb file folder.

Parameters:

results_path – The path of the results.edb file.

Returns:

A dict containing {enrichment_term: [es, nes, pval, fdr, fwer, hit_ind]}.

run()[source]: main replot function

class gseapy.base.GMT(mapping: Dict[str, List[str]] | None = None, description: str | None = None, source: str | None = None, name: str | None = 'default')[source]

A collection of gene set dictionaries with metadata.

Attributes:

_collections: Dict[str, Dict[str, Any]] - Stores gene set collections

key: collection name
value: {
‘genes’: Dict[str, List[str]] - Gene set mappings
‘description’: str - Collection description
‘source’: str - Source of the gene sets
}

add(mapping: Dict[str, List[str]], description: str | None = None, source: str | None = None, name: str | None = 'default')[source]

Add a gene set collection with metadata.

Args:

mapping: Gene set dictionary to add
description: Description of the gene sets
source: Source of the gene sets
name: Name for this collection

filter(min_size: int | None = None, max_size: int | None = None, gene_list: List[str] | None = None, collections: List[str] | None = None) → GMT[source]

Filter gene sets based on size and gene membership.

Args:

min_size: Minimum number of genes in a set
max_size: Maximum number of genes in a set
gene_list: Only keep genes present in this list
collections: Only keep these named collections

Returns:

A new filtered GMT object
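
The filtering described here can be sketched on a plain dict of gene sets. This is an illustration only, not the GMT.filter implementation: the function name `filter_gene_sets` is hypothetical, and both the inclusive size bounds and the choice to restrict to gene_list before checking sizes are assumptions.

```python
def filter_gene_sets(gene_sets, min_size=None, max_size=None, gene_list=None):
    """Return a new dict with genes restricted to gene_list and sets kept by size."""
    allowed = set(gene_list) if gene_list is not None else None
    filtered = {}
    for term, genes in gene_sets.items():
        if allowed is not None:
            genes = [g for g in genes if g in allowed]
        if min_size is not None and len(genes) < min_size:
            continue
        if max_size is not None and len(genes) > max_size:
            continue
        filtered[term] = list(genes)
    return filtered


toy_sets = {"A": ["g1", "g2", "g3"], "B": ["g2"], "C": ["g4", "g5"]}
kept = filter_gene_sets(toy_sets, min_size=2, gene_list=["g1", "g2", "g4", "g5"])
print(kept)
```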

get(name: str = 'default') → Dict[str, List[str]][source]: Get gene sets by collection name.

get_metadata(name: str = 'default') → Dict[str, Any][source]: Get metadata for a collection.

items()[source]: Iterate over (name, gene_sets) pairs.

classmethod read(paths: str, source: str | None = None) → GMT[source]

Read GMT files into a collection.

Args:

paths: Comma-separated list of GMT file paths
source: Source annotation for the files

write(ofname: str)[source]: Write GMT file to disk.

class gseapy.base.GSEAbase(outdir: str | None = None, gene_sets: List[str] | str | Dict[str, str] = 'KEGG_2016', module: str = 'base', threads: int = 1, organism: str = 'human', verbose: bool = False)[source]

base class of GSEA.

check_uppercase(gene_list: List[str | int]) → bool[source]

Check whether a list of gene names are mostly in uppercase.

Parameters:

gene_list (list of str or int) – A list of gene names or Entrez IDs.

Returns:

bool – Whether the gene names in the list are mostly uppercase.

enrichment_score(gene_list: Iterable[str], correl_vector: Iterable[float], gene_set: Dict[str, List[str]], weight: float = 1.0, nperm: int = 1000, seed: int = 123, single: bool = False, scale: bool = False)[source]

This is the most important function of GSEApy. It implements the same algorithm as GSEA and ssGSEA.

Parameters:

gene_list – The ordered gene list (gene_name_list), e.g. rank_metric.index.values.
gene_set – Gene sets from a gmt file; use gmt_parser to obtain gene_set.
weight – Same as GSEA’s weighted_score method. Weighting by the correlation is a very reasonable choice that allows significant gene sets with less than perfect coherence. Options: 0 (classic), 1, 1.5, 2. Default: 1. If one is interested in penalizing sets for lack of coherence or in discovering sets with any type of nonrandom distribution of tags, a weight p < 1 might be appropriate. On the other hand, if one uses sets with a large number of genes and only a small subset of those is expected to be coherent, then one could consider using p > 1. Our recommendation is to use p = 1 and to use other settings only if you are very experienced with the method and its behavior.
correl_vector – A vector with the correlations (e.g. signal-to-noise scores) corresponding to the genes in the gene list, or rankings, e.g. rank_metric.values.
nperm – Only use this parameter when computing esnull for statistical testing. Set it to the number of permutations.
seed – Random state for initializing gene list shuffling. Default: 123.

Returns:

ES: Enrichment score (real number between -1 and +1)

ESNULL: Enrichment score calculated from random permutations.

Hits_Indices: Index of a gene in gene_list, if gene is included in gene_set.

RES: Numerical vector containing the running enrichment score for all locations in the gene list.
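
The returned ES, Hits_Indices and RES can be illustrated with a minimal weighted running-sum sketch. It follows the general GSEA scheme (hits step up in proportion to |correlation|**weight, misses step down uniformly) and is not the GSEApy implementation; the function name `running_enrichment` is hypothetical and the null distribution (ESNULL) is omitted.

```python
def running_enrichment(gene_list, correl_vector, gene_set, weight=1.0):
    """Return (ES, hit indices, running enrichment scores) for one gene set."""
    members = set(gene_set)
    tags = [gene in members for gene in gene_list]
    n_miss = len(gene_list) - sum(tags)
    hit_weights = [abs(c) ** weight if t else 0.0
                   for c, t in zip(correl_vector, tags)]
    norm_hit = sum(hit_weights)
    if norm_hit == 0 or n_miss == 0:
        raise ValueError("need weighted hits and at least one gene outside the set")
    res, running = [], 0.0
    for w, is_hit in zip(hit_weights, tags):
        running += w / norm_hit if is_hit else -1.0 / n_miss
        res.append(running)
    es = max(res, key=abs)  # maximum deviation from zero, sign preserved
    hit_indices = [i for i, is_hit in enumerate(tags) if is_hit]
    return es, hit_indices, res


# Toy ranked list, already sorted in descending order of correlation.
genes = ["a", "b", "c", "d", "e"]
correl = [3.0, 2.0, 1.0, -1.0, -2.0]
es_top, hits_top, res_top = running_enrichment(genes, correl, ["a", "b"])
es_bottom, hits_bottom, _ = running_enrichment(genes, correl, ["d", "e"])
print(es_top, hits_top, es_bottom, hits_bottom)
```

A set concentrated at the top of the list yields a positive ES, and one concentrated at the bottom yields a negative ES; with weight=0 every hit contributes equally (the classic statistic).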

get_libraries() → List[str][source]: Return active Enrichr library names (official API).

load_gmt(gene_list: Iterable[str], gmt: List[str] | str | Dict[str, str]) → Dict[str, List[str]][source]: load gene set dict

load_gmt_only(gmt: List[str] | str | Dict[str, str]) → Dict[str, List[str]][source]

Parse gene_sets. gmt: list, dict, or string.

Note that this function merges different gene sets into one big dict to save computation time later.

make_unique(rank_metric: DataFrame, col_idx: int) → DataFrame[source]: make gene id column unique by adding a digit, similar to R’s make.unique
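
The idea of making duplicated gene ids unique can be sketched on a plain list. This is an illustration in the spirit of R’s make.unique, not the GSEApy method, which operates on a DataFrame column; the "." separator and the function name `make_unique_names` are assumptions.

```python
def make_unique_names(names, sep="."):
    """Append a running digit to repeated names so every entry is distinct."""
    used = set()
    counters = {}
    unique = []
    for name in names:
        candidate = name
        while candidate in used:
            counters[name] = counters.get(name, 0) + 1
            candidate = f"{name}{sep}{counters[name]}"
        used.add(candidate)
        unique.append(candidate)
    return unique


deduplicated = make_unique_names(["TP53", "TP53", "BRCA1", "TP53"])
print(deduplicated)
```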

parse_gmt(gmt: str) → Dict[str, List[str]][source]: gmt parser when input is a string
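
For orientation, the tab-separated GMT layout (term name, description, then genes) can be parsed as sketched below. This is a simplified stand-in, not the parse_gmt method: the function name `parse_gmt_text` is hypothetical, and it takes the file contents as a string rather than a path.

```python
def parse_gmt_text(text):
    """Parse GMT-formatted text into {term: [genes]}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank lines and lines without any gene
        term, _description, *genes = fields
        gene_sets[term] = [g.strip() for g in genes if g.strip()]
    return gene_sets


gmt_text = "SET_A\tdemo set\tTP53\tBRCA1\nSET_B\tna\tEGFR\t\n\n"
parsed = parse_gmt_text(gmt_text)
print(parsed)
```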

plot(terms: str | List[str], colors: str | List[str] | None = None, legend_kws: Dict[str, Any] | None = None, figsize: Tuple[float, float] = (4, 5), show_ranking: bool = True, ofname: str | None = None)[source]

terms: str, list. Terms/pathways to show.
colors: str, list. List of colors for each term/pathway.
legend_kws: kwargs to pass to ax.legend, e.g. loc, bbox_to_anchor.
ofname: savefig

prepare_outdir()[source]: create temp directory.

property results: compatible with the old style

to_df(gsea_summary: List[Dict], gmt: Dict[str, List[str]], rank_metric: Series | DataFrame, indices: List | None = None)[source]

Convert GSEASummary to DataFrame

rank_metric: if a Series, it must already be sorted in descending order; if a DataFrame, indices must not be None.

indices: Only works for DataFrame input. Stores the indices of the sorted array.

5.3. Over-representation Statistics

gseapy.stats.calc_pvalues(query, gene_sets, background=20000, **kwargs)[source]

calculate pvalues for all categories in the graph

Parameters:

query (set) – set of identifiers for which the p value is calculated
gene_sets (dict) – gmt file dict after background was set
background (set) – total number of genes in your annotated database.

Returns:

pvalues
x: overlapped gene number
n: length of gene_set which belongs to each term
hits: overlapped gene names.

For the 2*2 contingency table:

                | in query | not in query | row total
in gene_set     | a        | b            | a+b
not in gene_set | c        | d            | c+d
column total    | a+c      | b+d          | a+b+c+d = anno database

Then, in R:

x = a: the number of white balls drawn without replacement from an urn which contains both black and white balls.
m = a+b: the number of white balls in the urn.
n = c+d: the number of black balls in the urn.
k = a+c: the number of balls drawn from the urn.

In SciPy, for the arguments of scipy.stats.hypergeom.sf(k, M, n, N, loc=0):

M: the total number of objects.
n: the total number of Type I objects.
k: the random variate, representing the number of Type I objects in N draws without replacement from the total population.

Therefore, these two functions are the same when using parameters from the 2*2 table:

R: > phyper(x-1, m, n, k, lower.tail=FALSE)
SciPy: >>> hypergeom.sf(x-1, m+n, m, k)

For Odds ratio in Enrichr (see https://maayanlab.cloud/Enrichr/help#background&q=4)

oddsRatio = (1.0 * x * d) / Math.max(1.0 * b * c, 1)

where:

x are the overlapping genes,
b (m-x) are the genes in the annotated set minus the overlapping genes,
c (k-x) are the genes in the input set minus the overlapping genes,
d (bg-m-k+x) are the 20,000 genes (or total genes in the background) minus the genes in the annotated set, minus the genes in the input set, plus the overlapping genes.
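
The quantities above can be tied together in a short pure-Python sketch that computes the overlap, the upper-tail hypergeometric p-value (the analogue of hypergeom.sf(x-1, m+n, m, k)) and the odds ratio. It is an illustration, not calc_pvalues itself; the function name `enrichment_stats` and the toy inputs are assumptions.

```python
from math import comb


def enrichment_stats(query, gene_set, background_size=20000):
    """Return (overlap, p-value, odds ratio) for one gene set."""
    query, gene_set = set(query), set(gene_set)
    x = len(query & gene_set)      # a: overlapping genes
    m = len(gene_set)              # a + b: genes in the annotated set
    k = len(query)                 # a + c: genes in the input set
    n = background_size - m        # c + d: background genes outside the set
    # Upper tail P(X >= x), summed term by term.
    tail = sum(comb(m, i) * comb(n, k - i) for i in range(x, min(m, k) + 1))
    pvalue = tail / comb(m + n, k)
    b, c = m - x, k - x
    d = background_size - m - k + x
    odds_ratio = (1.0 * x * d) / max(1.0 * b * c, 1)
    return x, pvalue, odds_ratio


overlap, pvalue, odds = enrichment_stats(
    query=["g1", "g2", "g3", "g4"],
    gene_set=["g1", "g2", "g5"],
    background_size=10,            # tiny toy background
)
print(overlap, pvalue, odds)
```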

gseapy.stats.fdrcorrection(pvals, alpha=0.05)[source]: Benjamini-Hochberg FDR correction, inspired by statsmodels.

gseapy.stats.multiple_testing_correction(ps, alpha=0.05, method='benjamini-hochberg', **kwargs)[source]

correct pvalues for multiple testing and add corrected q value

Parameters:

ps – list of pvalues
alpha – significance level default : 0.05
method – multiple testing correction method [bonferroni|benjamini-hochberg]

Returns (q, rej):

two lists of q-values and rejected nodes
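
The two correction methods named above can be sketched in pure Python. This is a textbook-style illustration, not the gseapy.stats code; the function names `benjamini_hochberg` and `bonferroni` are hypothetical, and both return (q-values, rejected flags) in the input order.

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up FDR adjustment: q_i = min over ranks >= rank(i) of p * n / rank."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    qvals = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * n / rank)
        qvals[idx] = running_min
    return qvals, [q <= alpha for q in qvals]


def bonferroni(pvals, alpha=0.05):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvals)
    qvals = [min(p * n, 1.0) for p in pvals]
    return qvals, [q <= alpha for q in qvals]


toy_pvals = [0.01, 0.04, 0.03, 0.005]
bh_q, bh_rejected = benjamini_hochberg(toy_pvals)
bonf_q, bonf_rejected = bonferroni(toy_pvals)
print(bh_q, bh_rejected)
print(bonf_q, bonf_rejected)
```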

5.4. Enrichr API

class gseapy.enrichr.Enrichr(gene_list: Iterable[str], gene_sets: List[str] | str | Dict[str, str], organism: str = 'human', outdir: str | None = 'Enrichr', background: List[str] | int | str = None, cutoff: float = 0.05, format: str = 'pdf', figsize: Tuple[float, float] = (6.5, 6), top_term: int = 10, no_plot: bool = False, verbose: bool = False)[source]

Enrichr API

check_genes(gene_list: List[str], usr_list_id: str) → None[source]: Compare the genes sent and received to get successfully recognized genes.

check_uppercase(gene_list: List[str]) → bool[source]

Check whether a list of gene names are mostly in uppercase.

Parameters:

gene_list (list) – A list of gene names.

Returns:

bool – Whether the gene names in the list are mostly uppercase.

close()[source]: Explicitly close logger handlers.

enrich_local(gmt: Dict[str, List[str]]) → DataFrame | None[source]

Perform local enrichment analysis using hypergeometric test.

p-value: computed using the Fisher exact test (hypergeometric test)
z-score: odds ratio
combined score: -log(p) * z

See: http://amp.pharm.mssm.edu/Enrichr/help#background&q=4

Columns: Term, Overlap, P-value, Odds Ratio, Combined Score, Adjusted_P-value, Genes
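
The combined score formula quoted above is simple enough to sketch directly. The natural logarithm is assumed here, the function name `combined_score` is hypothetical, and the rows are made-up numbers rather than real results.

```python
import math


def combined_score(pvalue, odds_ratio):
    """combined score = -log(p) * z, with the odds ratio playing the role of z."""
    return -math.log(pvalue) * odds_ratio


# Toy rows: (term, p-value, odds ratio).
rows = [("TERM_1", 0.001, 4.0), ("TERM_2", 0.02, 9.0), ("TERM_3", 0.5, 1.2)]
ranked_terms = sorted(rows, key=lambda r: combined_score(r[1], r[2]), reverse=True)
print([term for term, _, _ in ranked_terms])
```

Because the score multiplies significance by effect size, a term with a larger odds ratio can outrank one with a smaller p-value.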

enrich_online(genes_list: str, geneset_libraries: List[str]) → DataFrame | None[source]: Perform online enrichment analysis using Enrichr API.

filter_gmt(gmt: Dict[str, List[str]], background: Set[str]) → Dict[str, List[str]][source]

Filter GMT to only include genes that exist in background.

This substantially affects the significance of the hypergeometric test.

Parameters:

gmt (dict) – A dict of gene sets.
background (set) – A set of custom background genes.

Returns:

dict – Filtered gene sets.
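
A minimal sketch of restricting gene sets to a background is shown below. It is not the filter_gmt implementation: the function name `restrict_to_background` is hypothetical, and dropping sets that become empty is an assumption.

```python
def restrict_to_background(gmt, background):
    """Keep only genes present in the background; drop sets left empty."""
    background = set(background)
    filtered = {}
    for term, genes in gmt.items():
        kept = [g for g in genes if g in background]
        if kept:
            filtered[term] = kept
    return filtered


toy_gmt = {"A": ["g1", "g2", "g9"], "B": ["g8"]}
toy_background = {"g1", "g2", "g3"}
restricted = restrict_to_background(toy_gmt, toy_background)
print(restricted)
```

Shrinking a set from three genes to two changes the set size entering the hypergeometric test, which is why this step affects significance.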

get_background() → Set[str][source]: Get background genes from file or BioMart.

get_results(gene_list: str, gene_set_libraries: List[str]) → Tuple[str, DataFrame][source]: Get enrichment results from Enrichr API.

get_results_with_background(gene_list: str, background: List[str], gene_set_libraries: List[str]) → Tuple[str, DataFrame][source]: Get enrichment results with custom background.

parse_background(gmt: Dict[str, List[str]] | None = None) → Set[str] | int[source]: Parse and set background genes.

parse_genelists() → str[source]: Parse gene list with single-pass processing.

parse_genesets(gene_sets=None) → List[Dict[str, List[str]] | str][source]: Parse the gene_sets input and determine its type.

prepare_outdir()[source]: Create a temporary directory.

run() → None[source]: Run enrichr for one sample gene list against multiple libraries.

send_genes(payload, url) → Dict[str, int | str][source]: Send gene list to enrichr server.

set_organism() → None[source]: Initialize EnrichrAPI base with the selected organism, setting base_url.

class gseapy.enrichr.EnrichrAPI(organism: str = 'human')[source]

A Python client for the modEnrichr suite and Speedrichr REST APIs.

add_background(background_genes: List[str]) → Dict[str, Any][source]: Uploads a background gene set to Speedrichr.

add_list(genes: List[str], description: str = 'Gene list') → Dict[str, Any][source]: Analyzes a gene set and returns a userListId.

add_list_speedrichr(genes: List[str], description: str = 'Gene list with background') → Dict[str, Any][source]: Uploads a gene set to Speedrichr for background analysis.

download_libraries(libname: str) → Dict[str, List[str]][source]: Download enrichr libraries

find_terms_by_gene(gene: str, include_json: bool = True, include_setup: bool = True) → Dict[str, Any][source]: Finds terms that contain a given gene across Enrichr libraries.

get_background_enrichment(user_list_id: int, background_id: str, gene_set_library: str | List[str]) → Dict[str, Any][source]: Gets enrichment results calculated against a custom background for one or more libraries.

get_enrichment(user_list_id: int, gene_set_library: str | List[str]) → Dict[str, Any][source]: Gets enrichment results for one or more gene set libraries.

get_libraries() → List[str][source]: Fetches a list of all available gene set library names for the current organism.

get_results_dataframe(user_list_id: int, gene_set_library: str | List[str]) → DataFrame[source]: Fetches enrichment analysis results and returns a combined pandas DataFrame.

view_list(user_list_id: int) → Dict[str, Any][source]: Views an added gene set using its userListId.

exception gseapy.enrichr.EnrichrAPIError[source]: Raised when Enrichr API requests fail.

exception gseapy.enrichr.EnrichrError[source]: Base exception for Enrichr-related errors.

exception gseapy.enrichr.EnrichrNetworkError[source]: Raised when network connectivity issues occur.

exception gseapy.enrichr.EnrichrParseError[source]: Raised when parsing responses or files fails.

exception gseapy.enrichr.EnrichrValidationError[source]: Raised when input validation fails.

5.5. BioMart API

class gseapy.biomart.Biomart(host: str = 'www.ensembl.org', verbose: bool = False)[source]

Query BioMart.

add_filter(name: str, value: Iterable[str])[source]: Add a filter. name: the filter name; value: an iterable of filter values (Iterable[str]).

get_attributes(dataset: str = 'hsapiens_gene_ensembl')[source]: Get available attributes from the dataset you’ve selected.

get_datasets(mart: str = 'ENSEMBL_MART_ENSEMBL')[source]: Get available datasets from the mart you’ve selected.

get_filters(dataset: str = 'hsapiens_gene_ensembl')[source]: Get available filters from the dataset you’ve selected.

get_marts()[source]: Get available marts and their names.

get_xml_body()[source]

Return only the XML body without the URL prefix.

This is suitable for POST requests where the XML is sent in the body as the ‘query’ form field to the biomart endpoint.
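To make the idea of an "XML body" concrete, the sketch below assembles a BioMart-style query document with the standard library. The function name `build_query_xml` is hypothetical, and the attribute values on the Query element (formatter, header, datasetConfigVersion, and so on) follow the commonly used BioMart XML query layout; they are assumptions here, not values read from GSEApy.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List


def build_query_xml(dataset: str, attributes: List[str],
                    filters: Dict[str, Iterable[str]]) -> str:
    """Hypothetical helper: build a BioMart-style XML query string."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "0",
        "count": "",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # One Filter element per filter name; multiple values are comma-joined.
    for name, values in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": ",".join(values)})
    # One Attribute element per requested output column.
    for attr in attributes:
        ET.SubElement(ds, "Attribute", {"name": attr})
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


xml_body = build_query_xml(
    "hsapiens_gene_ensembl",
    ["ensembl_gene_id", "external_gene_name"],
    {"ensembl_gene_id": ["ENSG00000125285", "ENSG00000182968"]},
)
print(xml_body)
```

A string like this is what would be sent as the ‘query’ form field in a POST request.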

query(dataset: str = 'hsapiens_gene_ensembl', attributes: List[str] | None = None, filters: Dict[str, Iterable[str]] | None = None, filename: str | None = None)[source]

Map IDs using BioMart.

Parameters:

dataset – str, default: ‘hsapiens_gene_ensembl’
attributes – str, list, tuple
filters – dict, {‘filter name’: list(filter value)}
host – www.ensembl.org, asia.ensembl.org, useast.ensembl.org (set through the Biomart constructor, not passed to query())

Returns:

a dataframe containing all the attributes you selected.

Example:

>>> from gseapy import Biomart
>>> bm = Biomart()
>>> queries = {'ensembl_gene_id': ['ENSG00000125285', 'ENSG00000182968']}  # must be a Python dict
>>> results = bm.query(dataset='hsapiens_gene_ensembl',
...                    attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id', 'go_id'],
...                    filters=queries)

query_simple(dataset: str = 'hsapiens_gene_ensembl', attributes: List[str] | None = None, filters: Dict[str, Iterable[str]] | None = None, filename: str | None = None)[source]

A simple version of the BioMart REST API query. It takes the same parameters as query().

However, it can retrieve cross-page mappings, such as mouse-to-human gene names.

Note: it can take a couple of minutes to get the results. The request uses an XML template for querying BioMart (see https://gist.github.com/keithshep/7776579).

Example:

>>> from gseapy import Biomart
>>> bm = Biomart()
>>> results = bm.query_simple(dataset='mmusculus_gene_ensembl',
...                           attributes=['ensembl_gene_id',
...                                       'external_gene_name',
...                                       'hsapiens_homolog_associated_gene_name'])

5.6. Parser

gseapy.parser.download_library(name: str, organism: str = 'human', filename: str = None) → Dict[str, List[str]][source]

Download Enrichr libraries.

Parameters:

name (str) – the Enrichr library name. See gseapy.get_library_name().
organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }
filename (str) – the file name to save to, if not None.

Return dict:

gene sets of the Enrichr library for the selected organism.

gseapy.parser.get_library(name: str, organism: str = 'Human', min_size: int = 0, max_size: int = 2000, save: str | None = None, gene_list: List[str] | None = None) → Dict[str, List[str]][source]

Parse gene_sets.gmt(gene set database) file or download from enrichr server.

Parameters:

name (str) – the gene_sets.gmt file or an Enrichr library name. See the full list of Enrichr library names here: https://maayanlab.cloud/Enrichr/#libraries
organism (str) – choose one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }. This argument has no effect if the input is a .gmt file.
min_size – Minimum allowed number of genes for each gene set. Default: 0.
max_size – Maximum allowed number of genes for each gene set. Default: 2000.
save (str) – the path to save the filtered gene set database.
gene_list – if a gene list is given, min_size and max_size apply to the number of genes shared between each gene set and gene_list.

Return dict:

Return a filtered gene set database dictionary.

Note: Do NOT filter gene sets when using replot(), because GSEA Desktop has already done this for you.
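The size filter described by min_size, max_size and gene_list can be sketched as below. `filter_by_size` is a hypothetical name; treating both bounds as inclusive and returning each kept gene set unmodified are assumptions made for this illustration.

```python
from typing import Dict, List, Optional


def filter_by_size(gene_sets: Dict[str, List[str]], min_size: int = 0,
                   max_size: int = 2000,
                   gene_list: Optional[List[str]] = None) -> Dict[str, List[str]]:
    """Hypothetical helper: keep gene sets whose (overlap) size is within bounds."""
    universe = set(gene_list) if gene_list is not None else None
    kept: Dict[str, List[str]] = {}
    for term, genes in gene_sets.items():
        # With a gene_list, count only the genes shared with it; otherwise count all genes.
        n = len(universe.intersection(genes)) if universe is not None else len(genes)
        if min_size <= n <= max_size:  # assumption: inclusive bounds
            kept[term] = genes
    return kept


sets = {"SMALL": ["A"], "MEDIUM": ["A", "B", "C"], "LARGE": ["A", "B", "C", "D", "E"]}
print(filter_by_size(sets, min_size=2, max_size=4))
```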

gseapy.parser.get_library_name(organism: str = 'Human') → List[str][source]

Return the names of the active Enrichr libraries. See also: https://maayanlab.cloud/modEnrichr/

Parameters: organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }
Returns: a list of Enrichr libraries from the selected database

gseapy.parser.gsea_cls_parser(cls: str) → Tuple[str][source]

Extract class (phenotype) names from a .cls file.

Parameters: cls – a class list instance or a .cls file identical to the GSEA input.
Returns: the phenotype names and the class vector as a list.
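For readers unfamiliar with the .cls layout, the sketch below parses the common categorical two-class form: a header line with sample and class counts, a line starting with `#` that names the classes, and a line of per-sample labels. The function name `parse_cls_text` and the shape of its return value are assumptions for this example and may differ from what gsea_cls_parser() returns.

```python
from typing import List, Tuple


def parse_cls_text(text: str) -> Tuple[str, str, List[str]]:
    """Hypothetical helper: parse a categorical two-class .cls document."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Line 1: "<n_samples> <n_classes> 1" (not needed for this sketch).
    # Line 2: "# <first class> <second class>".
    names = lines[1].lstrip("#").split()
    # Line 3: one label per sample, in sample order.
    labels = lines[2].split()
    return names[0], names[1], labels


example = "6 2 1\n# MUT WT\nMUT MUT MUT WT WT WT\n"
print(parse_cls_text(example))
```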

gseapy.parser.gsea_edb_parser(results_path: str) → Dict[str, List[str]][source]

Parse the results.edb file stored under the edb folder.

Parameters: results_path – the path of the results.edb file.
Returns: a dict containing { enrichment_term: [es, nes, pval, fdr, fwer, hit_ind]}

gseapy.parser.read_gmt(path: str) → Dict[str, List[str]][source]

Read a GMT file.

Parameters: path (str) – the path to a gmt file.
Returns: a dict mapping each gene set name to its list of genes.
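A GMT file is tab-separated, with one gene set per line: the set name, a description, then the member genes. The reader below is a minimal sketch of that format using a hypothetical name, `read_gmt_sketch`; skipping malformed lines and empty fields is a choice made for the example, not documented read_gmt() behavior.

```python
from typing import Dict, List


def read_gmt_sketch(path: str) -> Dict[str, List[str]]:
    """Hypothetical reader: map each gene set name to its list of genes."""
    gene_sets: Dict[str, List[str]] = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # assumption: skip lines without name, description and a gene
            term, _description, *genes = fields
            gene_sets[term] = [gene for gene in genes if gene]
    return gene_sets
```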

gseapy.parser.write_gmt(gene_sets: Dict[str, List[str]], filename: str, name: str | None = None) → None[source]

Write gene sets to a gmt file.

Parameters:

gene_sets (dict) – a dict object containing gene sets.
filename (str) – the path to save the gmt file.
name (str) – the name of the gene set database.

5.7. Visualization

class gseapy.plot.MidpointNormalize(vmin=None, vmax=None, vcenter=None, clip=False)[source]

inverse(value)[source]

Maps the normalized value (i.e., index in the colormap) back to image data value.

Parameters:

value – Normalized value.
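The relationship between a midpoint normalization and its inverse can be sketched as below. This assumes a piecewise-linear mapping in which vmin, vcenter and vmax are sent to 0, 0.5 and 1; the function names are hypothetical, and the code is an illustration of the idea rather than the MidpointNormalize implementation.

```python
def normalize(value: float, vmin: float, vcenter: float, vmax: float) -> float:
    """Map data values to [0, 1] with vcenter pinned to 0.5 (piecewise linear)."""
    if value <= vcenter:
        return 0.5 * (value - vmin) / (vcenter - vmin)
    return 0.5 + 0.5 * (value - vcenter) / (vmax - vcenter)


def inverse(norm_value: float, vmin: float, vcenter: float, vmax: float) -> float:
    """Map a normalized value in [0, 1] back to the data scale."""
    if norm_value <= 0.5:
        return vmin + (norm_value / 0.5) * (vcenter - vmin)
    return vcenter + ((norm_value - 0.5) / 0.5) * (vmax - vcenter)


# Asymmetric range: the centre (0) is not halfway between -2 and 4.
n = normalize(2.0, vmin=-2.0, vcenter=0.0, vmax=4.0)
print(n, inverse(n, vmin=-2.0, vcenter=0.0, vmax=4.0))
```

Pinning the centre this way keeps a diverging colormap's neutral colour at a meaningful value, such as zero, even when the data range is lopsided.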

gseapy.plot.barplot(df: DataFrame, column: str = 'Adjusted P-value', group: str | None = None, title: str = '', cutoff: float = 0.05, top_term: int = 10, ax: Axes | None = None, figsize: Tuple[float, float] = (4, 6), color: str | List[str] | Dict[str, str] = 'salmon', ofname: str | None = None, wrap_width: int | None = None, **kwargs)[source]

Visualize GSEApy Results. When multiple datasets exist in the input dataframe, the group argument is your friend.

Parameters:

df – GSEApy DataFrame results.
column – column name in df to map the x-axis data. Default: Adjusted P-value
group – group by the variable in df that will produce bars with different colors.
title – figure title.
cutoff – terms with column value < cut-off are shown. Works only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”).
top_term – number of top enriched terms to show per group (hue).
ax – Matplotlib axes. If None, create a new figure.
figsize – tuple, matplotlib figsize. Only used when ax is None.
color – color, list, or dict of matplotlib colors. Must be recognized by matplotlib. If a dict is given, its keys must be found in the group column.
ofname – output file name. If None, don’t save figure
wrap_width – int, optional. Maximum characters per line for y-axis labels. Long gene set names are wrapped to fit within the figure. Default: None (no wrapping).

Returns:

matplotlib.Axes. Returns None if ofname is given. Only terms with column <= cut-off are plotted.

gseapy.plot.dotplot(df: DataFrame, column: str = 'Adjusted P-value', x: str | None = None, y: str = 'Term', x_order: List[str] | bool = False, y_order: List[str] | bool = False, title: str = '', cutoff: float = 0.05, top_term: int = 10, size: float = 5, ax: Axes | None = None, figsize: Tuple[float, float] = (4, 6), cmap: str = 'viridis_r', ofname: str | None = None, xticklabels_rot: float | None = None, yticklabels_rot: float | None = None, marker: str = 'o', show_ring: bool = False, wrap_width: int | None = None, **kwargs)[source]

Visualize GSEApy Results with a categorical scatterplot. When multiple datasets exist in the input dataframe, the x argument is your friend.

Parameters:

df – GSEApy DataFrame results.
column – column name in df that maps to the dot colors. Default: Adjusted P-value.
x – Categorical variable in df that maps to the x-axis data. Default: None.
y – Categorical variable in df that maps to the y-axis data. Default: Term.
x_order – bool, array-like list. Default: False. If True, perform hierarchical clustering on the X-axis. Alternatively, pass an array-like list of x categorical levels.
y_order – bool, array-like list. Default: False. If True, perform hierarchical clustering on the Y-axis. Alternatively, pass an array-like list of y categorical levels.
title – Figure title.
cutoff – Terms with column value < cut-off are shown. Works only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”).
top_term – Number of enriched terms to show (based on values in the column (colormap)).
size – float, scale the dot size to get proper visualization.
ax – Matplotlib axes.
figsize – tuple, matplotlib figure size, only used when ax is None.
cmap – Matplotlib colormap for mapping the column semantic.
ofname – Output file name. If None, don’t save figure
marker – The matplotlib.markers. See https://matplotlib.org/stable/api/markers_api.html
show_ring (bool) – Whether to draw the outer ring.
wrap_width – int, optional. Maximum characters per line for y-axis labels. Long gene set names are wrapped to fit within the figure. Default: None (no wrapping).

Returns:

matplotlib.Axes if ofname is None. Only terms with column <= cut-off are plotted.

gseapy.plot.enrichment_map(df: DataFrame, column: str = 'Adjusted P-value', cutoff: float = 0.05, top_term: int = 10, **kwargs) → Tuple[DataFrame, DataFrame][source]

Visualize GSEApy Results. Node size corresponds to the percentage of gene overlap in a certain term of interest. Colour of the node corresponds to the significance of the enriched terms. Edge size corresponds to the number of genes that overlap between the two connected nodes. Gray edges correspond to both nodes when it is the only colour edge. When there are two different edge colours, red corresponds to positive nodes and blue corresponds to negative nodes.

Parameters:

df – GSEApy DataFrame results.
column – column name in df to map the node colors. Default: Adjusted P-value or FDR q-val. choose from (“Adjusted P-value”, “P-value”, “FDR q-val”, “NOM p-val”).
group – group by the variable in df that will produce bars with different colors.
title – figure title.
cutoff – nodes with column value < cut-off are shown. Works only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”).
top_term – number of top enriched terms are selected as nodes.

Returns:

tuple of dataframe (nodes, edges)

gseapy.plot.gseaplot(term: str, hits: Sequence[int], nes: float, pval: float, fdr: float, RES: Sequence[float], rank_metric: Sequence[float] | None = None, pheno_pos: str = '', pheno_neg: str = '', color: str = '#88C544', figsize: Tuple[float, float] = (6, 5.5), cmap: str = 'seismic', ofname: str | None = None, **kwargs) → List[Axes] | None[source]

This is the main function for generating the gsea plot.

Parameters:

term – gene_set name
hits – hits indices of rank_metric.index presented in gene set S.
nes – Normalized enrichment scores.
pval – nominal p-value.
fdr – false discovery rate.
RES – running enrichment scores.
rank_metric – pd.Series for rankings, rank_metric.values.
pheno_pos – phenotype label, positive correlated.
pheno_neg – phenotype label, negative correlated.
color – color for RES and hits.
figsize – matplotlib figsize.
ofname – output file name. If None, don’t save figure

Returns: a list of matplotlib.Axes if ofname is None, otherwise None.

gseapy.plot.gseaplot2(terms: List[str], hits: List[Sequence[int]], RESs: List[Sequence[float]], rank_metric: Sequence[float] | None = None, colors: str | List[str] | None = None, figsize: Tuple[float, float] = (6, 4), legend_kws: Dict[str, Any] | None = None, ofname: str | None = None, **kwargs) → List[Axes] | None[source]

Trace plot for combining multiple terms/pathways into one plot.

Parameters:

terms – list of terms to show in the trace plot.
hits – list of hit indices corresponding to each term.
RESs – list of running enrichment scores corresponding to each term.
rank_metric – optional, rankings.
figsize – matplotlib figsize.
legend_kws – optional, controls the location of the legends.
ofname – output file name. If None, don’t save figure

Returns: a list of matplotlib.Axes if ofname is None, otherwise None.

gseapy.plot.heatmap(df: DataFrame, z_score: int | None = None, title: str = '', figsize: Tuple[float, float] = (5, 5), cmap: str | None = None, xticklabels: bool = True, yticklabels: bool = True, ofname: str | None = None, ax: Axes | None = None, **kwargs)[source]

Visualize the dataframe.

Parameters:

df – DataFrame from expression table.
z_score – 0, 1, or None. The axis (0 or 1) along which to compute z-scores. If None, the data is not scaled.
title – figure title.
figsize – heatmap figsize.
cmap – matplotlib colormap. e.g. “RdBu_r”.
xticklabels – bool, whether to show xticklabels.
yticklabels – bool, whether to show yticklabels.
ofname – output file name. If None, don’t save figure.
ax – matplotlib axes. Default: None.

Returns:

ax if ofname is None.

gseapy.plot.ringplot(df: DataFrame, column: str = 'Adjusted P-value', x: str | None = None, title: str = '', cutoff: float = 0.05, top_term: int = 10, size: float = 5, figsize: Tuple[float, float] = (4, 6), cmap: str = 'viridis_r', ofname: str | None = None, xticklabels_rot: float | None = None, yticklabels_rot: float | None = None, marker='o', show_ring: bool = True, **kwargs)[source]

ringplot is deprecated, use dotplot instead

Parameters:

df – GSEApy DataFrame results.
x – Group by the variable in df that will produce categorical scatterplot.
column – column name in df to map the dot colors. Default: Adjusted P-value
title – figure title
cutoff – terms with column value < cut-off are shown. Works only for (“Adjusted P-value”, “P-value”, “NOM p-val”, “FDR q-val”).
top_term – number of enriched terms to show.
size – float, scale the dot size to get proper visualization.
figsize – tuple, matplotlib figure size.
cmap – matplotlib colormap for mapping the column semantic.
ofname – output file name. If None, don’t save figure
marker – the matplotlib.markers. See https://matplotlib.org/stable/api/markers_api.html
show_ring (bool) – whether to show the outer ring.

Returns:

matplotlib.Axes. Returns None if ofname is given. Only terms with column <= cut-off are plotted.

gseapy.plot.zscore(data2d: DataFrame, axis: int | None = 0)[source]

Standardize the mean and variance of the data along the given axis.

Parameters:

data2d – DataFrame to normalize.
axis – int, Which axis to normalize across. If 0, normalize across rows, if 1, normalize across columns. If None, don’t change data

Returns:

Normalized DataFrame. Normalized data with a mean of 0 and variance of 1 across the specified axis.
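For a single row or column, the standardization amounts to subtracting the mean and dividing by the standard deviation. The sketch below uses a hypothetical helper, `zscore_values`, on a plain list; using the sample standard deviation (n - 1 in the denominator) is an assumption made for this example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def zscore_values(values: Sequence[float]) -> List[float]:
    """Hypothetical helper: centre on the mean and scale to unit sample variance."""
    mu = mean(values)
    sd = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(v - mu) / sd for v in values]


print(zscore_values([1, 2, 3]))
```

Applying this to every row (or every column) of an expression table puts genes on a comparable scale before plotting them in a heatmap.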
