Developmental Guide

Module APIs

gseapy.gsea()[source]

Run Gene Set Enrichment Analysis.

Parameters:
  • data – Gene expression data table, Pandas DataFrame, gct file.
  • gene_sets – Enrichr library name, .gmt gene set file, or dict of gene sets. Same input format as GSEA.
  • cls – A list of class labels or a .cls file in the format required by GSEA.
  • outdir (str) – Results output directory.
  • permutation_num (int) – Number of permutations for significance computation. Default: 1000.
  • permutation_type (str) – Permutation type, “phenotype” for phenotypes, “gene_set” for genes.
  • min_size (int) – Minimum number of genes from a gene set that must also be present in the data set. Default: 15.
  • max_size (int) – Maximum number of genes from a gene set that may also be present in the data set. Default: 500.
  • weighted_score_type (float) – Refer to algorithm.enrichment_score(). Default: 1.
  • method

    The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. Available methods:

    1. ’signal_to_noise’

      You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    2. ’t_test’

      Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    3. ’ratio_of_classes’ (also referred to as fold change).

      Uses the ratio of class means to calculate fold change for natural scale data.

    4. ’diff_of_classes’

      Uses the difference of class means to calculate fold change for natural scale data.

    5. ’log2_ratio_of_classes’

      Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for natural scale data.

  • ascending (bool) – Sorting order of rankings. Default: False.
  • processes (int) – Number of processes to use. Default: 1.
  • figsize (list) – Matplotlib figsize; accepts a tuple or list, e.g. [width,height]. Default: [6.5,6].
  • format (str) – Matplotlib figure format. Default: ‘pdf’.
  • graph_num (int) – Plot graphs for top sets of each phenotype.
  • no_plot (bool) – If True, no figure will be drawn. Default: False.
  • seed – Random seed; expects an integer. Default: None.
  • verbose (bool) – Increase output verbosity and print the progress of your job. Default: False.
Returns:

Returns a GSEA object. All results are stored in a dictionary, obj.results, which contains:

| {es: enrichment score,
|  nes: normalized enrichment score,
|  p: P-value,
|  fdr: FDR,
|  size: gene set size,
|  matched_size: genes matched to the data,
|  genes: gene names from the data set,
|  ledge_genes: leading edge genes}
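
As a rough illustration of how a results dictionary of this shape might be consumed, the sketch below keeps terms under an FDR cutoff and sorts them by absolute NES. The outer keys (term names), the sample values, the 0.25 cutoff, and the helper name `significant_terms` are assumptions made for the example; only the inner field names come from the listing above.

```python
# Hypothetical results dictionary: term name -> per-term statistics.
# All values are made up for illustration only.
results = {
    "TERM_A": {"es": 0.62, "nes": 2.10, "p": 0.001, "fdr": 0.01},
    "TERM_B": {"es": -0.45, "nes": -1.60, "p": 0.020, "fdr": 0.08},
    "TERM_C": {"es": 0.20, "nes": 0.90, "p": 0.400, "fdr": 0.55},
}


def significant_terms(results, fdr_cutoff=0.25):
    """Return (term, nes, fdr) tuples with fdr below the cutoff, largest |nes| first."""
    hits = [
        (term, stats["nes"], stats["fdr"])
        for term, stats in results.items()
        if stats["fdr"] < fdr_cutoff
    ]
    return sorted(hits, key=lambda item: abs(item[1]), reverse=True)


for term, nes, fdr in significant_terms(results):
    print(f"{term}\tNES={nes:+.2f}\tFDR={fdr:.2f}")
```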

gseapy.prerank()[source]

Run Gene Set Enrichment Analysis with a user-defined pre-ranked correlation.

Parameters:
  • rnk – Pre-ranked correlation table or pandas DataFrame. Same input format as the GSEA .rnk file.
  • gene_sets – Enrichr library name, .gmt gene set file, or dict of gene sets. Same input format as GSEA.
  • outdir – Results output directory.
  • permutation_num (int) – Number of permutations for significance computation. Default: 1000.
  • min_size (int) – Minimum number of genes from a gene set that must also be present in the data set. Default: 15.
  • max_size (int) – Maximum number of genes from a gene set that may also be present in the data set. Default: 500.
  • weighted_score_type (float) – Refer to algorithm.enrichment_score(). Default: 1.
  • ascending (bool) – Sorting order of rankings. Default: False.
  • processes (int) – Number of processes to use. Default: 1.
  • figsize (list) – Matplotlib figsize; accepts a tuple or list, e.g. [width,height]. Default: [6.5,6].
  • format (str) – Matplotlib figure format. Default: ‘pdf’.
  • graph_num (int) – Plot graphs for top sets of each phenotype.
  • no_plot (bool) – If True, no figure will be drawn. Default: False.
  • seed – Random seed; expects an integer. Default: None.
  • verbose (bool) – Increase output verbosity and print the progress of your job. Default: False.
Returns:

Returns a Prerank object. All results are stored in a dictionary, obj.results, which contains:

| {es: enrichment score,
|  nes: normalized enrichment score,
|  p: P-value,
|  fdr: FDR,
|  size: gene set size,
|  matched_size: genes matched to the data,
|  genes: gene names from the data set,
|  ledge_genes: leading edge genes}

gseapy.ssgsea()[source]

Run Gene Set Enrichment Analysis with the single-sample GSEA (ssGSEA) tool.

Parameters:
  • data – Expression table, pd.Series, pd.DataFrame, GCT file, or .rnk file format.
  • gene_sets – Enrichr library name, .gmt gene set file, or dict of gene sets. Same input format as GSEA.
  • outdir – Results output directory.
  • sample_norm_method (str) –

    Sample normalization method. Choose from {‘rank’, ‘log’, ‘log_rank’, ‘custom’}. Default: ‘rank’.

    1. ‘rank’: Rank your expression data, and transform by 10000*rank_dat/gene_numbers.
    2. ‘log’: Do not rank, but set values below 1 to 1 (data[data<1] = 1) and transform by log(data + exp(1)).
    3. ‘log_rank’: Rank your expression data, and transform by log(10000*rank_dat/gene_numbers + exp(1)).
    4. ‘custom’: Do nothing, and use your own rank values to calculate the enrichment score.

    See: https://github.com/GSEA-MSigDB/ssGSEAProjection-gpmodule/blob/master/src/ssGSEAProjection.Library.R, line 86

  • min_size (int) – Minimum number of genes from a gene set that must also be present in the data set. Default: 15.
  • max_size (int) – Maximum number of genes from a gene set that may also be present in the data set. Default: 2000.
  • permutation_num (int) – Number of permutations for significance computation. Default: 0.
  • weighted_score_type (float) – Refer to algorithm.enrichment_score(). Default: 0.25.
  • scale (bool) – If True, normalize the scores by the number of genes in the gene sets.
  • ascending (bool) – Sorting order of rankings. Default: False.
  • processes (int) – Number of processes to use. Default: 1.
  • figsize (list) – Matplotlib figsize; accepts a tuple or list, e.g. [width,height]. Default: [7,6].
  • format (str) – Matplotlib figure format. Default: ‘pdf’.
  • graph_num (int) – Plot graphs for top sets of each phenotype.
  • no_plot (bool) – If True, no figure will be drawn. Default: False.
  • seed – Random seed; expects an integer. Default: None.
  • verbose (bool) – Increase output verbosity and print the progress of your job. Default: False.
Returns:

Returns an ssGSEA object. All results are stored in a dictionary. Access the enrichment scores via obj.resultsOnSamples and the normalized enrichment scores via obj.res2d. If permutation_num > 0, the results additionally contain:

| {es: enrichment score,
|  nes: normalized enrichment score,
|  p: P-value,
|  fdr: FDR,
|  size: gene set size,
|  matched_size: genes matched to the data,
|  genes: gene names from the data set,
|  ledge_genes: leading edge genes, if permutation_num > 0}
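
The sample_norm_method options listed for ssgsea() can be sketched in plain Python as follows. This is a minimal illustration of the stated formulas for a single sample, not GSEApy's implementation; the function names `ordinal_ranks` and `normalize_sample` are hypothetical, and ranking ties by position is an assumption.

```python
import math


def ordinal_ranks(values):
    """Rank values from 1 (smallest) to n (largest); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def normalize_sample(values, method="rank"):
    """Apply one of the four sample normalization options to one sample."""
    n = len(values)
    if method == "rank":
        return [10000 * r / n for r in ordinal_ranks(values)]
    if method == "log":
        # Values below 1 are set to 1 before the log transform.
        return [math.log(max(v, 1.0) + math.e) for v in values]
    if method == "log_rank":
        return [math.log(10000 * r / n + math.e) for r in ordinal_ranks(values)]
    if method == "custom":
        return list(values)  # use the supplied values unchanged
    raise ValueError(f"unknown sample_norm_method: {method!r}")


sample = [5.0, 1.0, 3.0]
print(normalize_sample(sample, "rank"))
print(normalize_sample(sample, "log_rank"))
```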

gseapy.enrichr()[source]

Enrichr API.

Parameters:
  • gene_list – str, list, tuple, series, dataframe. Also supports an input txt file with one gene id per row. The input identifiers should be of the same type as those in gene_sets.
  • gene_sets

    str, list, or tuple of Enrichr library name(s), or custom-defined gene_sets (dict or gmt file).

    Examples:

    Input Enrichr Libraries (https://maayanlab.cloud/Enrichr/#stats):
    str: ‘KEGG_2016’
    list: [‘KEGG_2016’, ‘KEGG_2013’]
    Use commas to separate multiple libraries in a str, e.g. “KEGG_2016,huMAP,GO_Biological_Process_2018”
    Input custom files:
    dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],
    ‘B’:[‘gene2’, ‘gene4’,…], …}

    gmt: “genes.gmt”

    see also the online docs: https://gseapy.readthedocs.io/en/latest/gseapy_example.html#2.-Enrichr-Example

  • organism

    Enrichr supported organism. Select from (human, mouse, yeast, fly, fish, worm). This argument only affects the Enrichr library names you’ve chosen. It has no effect on gmt or dict input of gene_sets.

    see here for more details: https://maayanlab.cloud/modEnrichr/.

  • description – Optional. Name of the job.
  • outdir – Output file directory.
  • background

    int, list, str. Ignore this argument if your input consists only of Enrichr library names.

    However, this argument is not straightforward when gene_sets is given a custom input (a gmt file or dict). There are 3 ways to set this argument:

    1. (Recommended) Input a list of background genes. The background gene list is defined by your experiment, e.g. the expressed genes in your RNA-seq. The gene identifiers in the gmt/dict should be of the same type as the background genes.
    2. Specify a number, e.g. the number of total expressed genes. This works, but is not recommended. It assumes that all your genes can be found in the background. If genes exist in the gmt but are not included in the background, they will affect the significance of the statistical test.
    3. (Default) Set a Biomart dataset name. The background will be all annotated genes from the BioMart dataset you’ve chosen. The program will try to retrieve the background information automatically.
      Use the example code below to choose the correct dataset name:
      >>> from gseapy.parser import Biomart 
      >>> bm = Biomart()
      >>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')
      
  • cutoff – Show enriched terms whose Adjusted P-value < cutoff. Only affects the output figure, not the final output file. Default: 0.05.
  • format – Output figure format supported by matplotlib (‘pdf’, ‘png’, ‘eps’, …). Default: ‘pdf’.
  • figsize – Matplotlib figsize; accepts a tuple or list, e.g. (width,height). Default: (6.5,6).
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.
Returns:

An Enrichr object. obj.res2d stores your last query; obj.results stores all your queries.

gseapy.replot()[source]

The main function to reproduce GSEA desktop outputs.

Parameters:
  • indir – GSEA desktop results directory. It must contain the edb subfolder.
  • outdir – Output directory.
  • weighted_score_type (float) – Weighted score type. Choose from {0,1,1.5,2}. Default: 1.
  • figsize (list) – Matplotlib output figure figsize. Default: [6.5,6].
  • format (str) – Matplotlib output figure format. Default: ‘pdf’.
  • min_size (int) – Minimum number of input genes present in a gene set. Default: 3.
  • max_size (int) – Maximum number of input genes present in a gene set. Default: 5000. Using the min_size or max_size arguments in replot() is discouraged, because the gmt file has already been filtered.
  • verbose (bool) – Increase output verbosity and print the progress of your job. Default: False.
Returns:

Generate new figures with selected figure format. Default: ‘pdf’.

Algorithm

gseapy.algorithm.enrichment_score(gene_list, correl_vector, gene_set, weighted_score_type=1, nperm=1000, rs=None, single=False, scale=False)[source]

This is the most important function of GSEApy. It implements the same algorithm as GSEA and ssGSEA.

Parameters:
  • gene_list – The ordered gene list (gene_name_list), e.g. rank_metric.index.values.
  • gene_set – A gene set from a gmt file. Use gsea_gmt_parser to obtain it.
  • weighted_score_type – Same as GSEA’s weighted score option. Options: 0 (classic), 1, 1.5, 2. Default: 1. Weighting by the correlation is a reasonable choice that allows significant gene sets with less than perfect coherence. If one is interested in penalizing sets for lack of coherence, or in discovering sets with any type of nonrandom distribution of tags, a value p < 1 might be appropriate. On the other hand, if one uses sets with a large number of genes and only a small subset of those is expected to be coherent, then one could consider using p > 1. Our recommendation is to use p = 1 and to use other settings only if you are very experienced with the method and its behavior.
  • correl_vector – A vector with the correlations (e.g. signal to noise scores) corresponding to the genes in the gene list, or the rankings, e.g. rank_metric.values.
  • nperm – Number of permutations. Only used when computing esnull for statistical testing; set it equal to the permutation number.
  • rs – Random state for initializing gene list shuffling. Default: None.
Returns:

ES: Enrichment score (real number between -1 and +1)

ESNULL: Enrichment score calculated from random permutations.

Hits_Indices: Index of a gene in gene_list, if gene is included in gene_set.

RES: Numerical vector containing the running enrichment score for all locations in the gene list.
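
To make the return values concrete, here is a minimal pure-Python sketch of a weighted running-sum enrichment score in the style of the GSEA statistic. It is an illustration under stated assumptions, not GSEApy's implementation: it omits permutations (ESNULL), the ssGSEA variant, and scaling, and the name `running_enrichment_score` is hypothetical.

```python
def running_enrichment_score(gene_list, correl_vector, gene_set, weighted_score_type=1.0):
    """Return (ES, hit_indices, RES) for one gene set over a ranked gene list.

    gene_list and correl_vector must be aligned and already sorted by ranking.
    """
    members = set(gene_set)
    tags = [gene in members for gene in gene_list]
    n_hit = sum(tags)
    n_miss = len(gene_list) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must overlap gene_list without covering it entirely")

    # Hits are weighted by |correlation| ** p; p = 0 gives the classic statistic.
    weights = [abs(c) ** weighted_score_type for c in correl_vector]
    sum_hit = sum(w for w, is_hit in zip(weights, tags) if is_hit)
    if sum_hit == 0:
        raise ValueError("all hit weights are zero")

    running, res = 0.0, []
    for w, is_hit in zip(weights, tags):
        # Step up at a hit (weighted), step down by a constant at a miss.
        running += w / sum_hit if is_hit else -1.0 / n_miss
        res.append(running)

    es = max(res, key=abs)  # maximum deviation from zero, sign preserved
    hit_indices = [i for i, is_hit in enumerate(tags) if is_hit]
    return es, hit_indices, res


es, hits, res = running_enrichment_score(
    ["A", "B", "C", "D"], [4.0, 3.0, -1.0, -2.0], {"A", "B"}
)
print(es, hits, res)
```

With the default weight, the running sum ends at approximately zero, and the enrichment score is its largest deviation from zero along the ranked list.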

gseapy.algorithm.enrichment_score_tensor(gene_mat, cor_mat, gene_sets, weighted_score_type, nperm=1000, rs=None, single=False, scale=False)[source]

Next generation algorithm of GSEA and ssGSEA. Works for 3d arrays.

Parameters:
  • gene_mat – the ordered gene list (vector) with or without gene indices matrix.
  • cor_mat – correlation vector or matrix (e.g. signal to noise scores) corresponding to the genes in the gene list or matrix.
  • gene_sets (dict) – gmt file dict.
  • weighted_score_type (float) – weighting by the correlation. Options: 0 (classic), 1, 1.5, 2. Default: 1 for GSEA and 0.25 for ssGSEA.
  • nperm (int) – number of permutations.
  • scale (bool) – If True, normalize the scores by the number of genes in gene_mat.
  • single (bool) – If True, use the ssGSEA algorithm, otherwise use GSEA.
  • rs – Random state for initializing gene list shuffling. Default: None.
Returns:

A tuple containing:

| ES: Enrichment score (real number between -1 and +1). For ssGSEA, set scale to True.
| ESNULL: Enrichment score calculated from random permutations.
| Hits_Indices: Indices of genes if genes are included in gene_set.
| RES: The running enrichment score for all locations in the gene list.

gseapy.algorithm.gsea_compute(data, gmt, n, weighted_score_type, permutation_type, method, pheno_pos, pheno_neg, classes, ascending, processes=1, seed=None, single=False, scale=False)[source]

Compute enrichment scores and enrichment nulls.

Parameters:
  • data – preprocessed expression dataframe or a pre-ranked file if prerank=True.
  • gmt (dict) – all gene sets in .gmt file. need to call load_gmt() to get results.
  • n (int) – permutation number. default: 1000.
  • method (str) – ranking_metric method. see above.
  • pheno_pos (str) – one of labels of phenotype’s names.
  • pheno_neg (str) – one of labels of phenotype’s names.
  • classes (list) – a list of phenotype labels, to specify which column of dataframe belongs to what category of phenotype.
  • weighted_score_type (float) – default:1
  • ascending (bool) – sorting order of rankings. Default: False.
  • seed – random seed. Default: np.random.RandomState()
  • scale (bool) – if true, scale es by gene number.
Returns:

A tuple containing:

| zipped results of es, nes, pval, fdr.
| nested list of hit indices of input gene_list.
| nested list of ranked enrichment score of each input gene_sets.
| list of enriched terms

gseapy.algorithm.gsea_compute_tensor(data, gmt, n, weighted_score_type, permutation_type, method, pheno_pos, pheno_neg, classes, ascending, processes=1, seed=None, single=False, scale=False)[source]

Compute enrichment scores and enrichment nulls. This function splits large arrays into smaller pieces to avoid memory overflow.

Parameters:
  • data – preprocessed expression dataframe or a pre-ranked file if prerank=True.
  • gmt (dict) – all gene sets in .gmt file. need to call load_gmt() to get results.
  • n (int) – permutation number. default: 1000.
  • method (str) – ranking_metric method. See ranking_metric().
  • pheno_pos (str) – one of labels of phenotype’s names.
  • pheno_neg (str) – one of labels of phenotype’s names.
  • classes (list) – a list of phenotype labels, to specify which column of dataframe belongs to what category of phenotype.
  • weighted_score_type (float) – default: 1
  • ascending (bool) – sorting order of rankings. Default: False.
  • seed – random seed. Default: np.random.RandomState()
  • scale (bool) – if true, scale es by gene number.
Returns:

A tuple containing:

| zipped results of es, nes, pval, fdr.
| nested list of hit indices of input gene_list.
| nested list of ranked enrichment score of each input gene_sets.
| list of enriched terms

gseapy.algorithm.gsea_fdr(nEnrichmentScores, nEnrichmentNulls)[source]

Create a histogram of all NES(S,pi) over all S and pi. Use this null distribution to compute an FDR q value.

Parameters:
  • nEnrichmentScores – normalized ES
  • nEnrichmentNulls – normalized ESnulls
Returns:

FDR

gseapy.algorithm.gsea_pval(es, esnull)[source]

Compute nominal p-value.

From article (PNAS): estimate nominal p-value for S from esnull by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
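
The sign-matched estimate described above can be sketched as follows. This is a minimal illustration of the stated rule rather than GSEApy's code; the function name `nominal_pvalue` and the choice to return NaN when no null score shares the sign of the observed ES are assumptions.

```python
def nominal_pvalue(es, esnull):
    """Estimate a nominal p-value from the same-signed portion of the null scores."""
    if es >= 0:
        same_sign = [x for x in esnull if x >= 0]
        extreme = [x for x in same_sign if x >= es]
    else:
        same_sign = [x for x in esnull if x < 0]
        extreme = [x for x in same_sign if x <= es]
    if not same_sign:
        return float("nan")  # assumption: undefined without same-signed nulls
    return len(extreme) / len(same_sign)


null_scores = [0.6, 0.4, 0.2, -0.3, -0.7]
print(nominal_pvalue(0.5, null_scores))   # uses only the three non-negative nulls
print(nominal_pvalue(-0.5, null_scores))  # uses only the two negative nulls
```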

gseapy.algorithm.gsea_significance(enrichment_scores, enrichment_nulls)[source]

Compute nominal pvals, normalized ES, and FDR q value.

For a given NES(S) = NES* >= 0, the FDR is the ratio of the percentage of all (S,pi) with NES(S,pi) >= 0 whose NES(S,pi) >= NES*, divided by the percentage of observed S with NES(S) >= 0 whose NES(S) >= NES*, and similarly if NES(S) = NES* <= 0.
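
The ratio above can be written out directly. The sketch below pools all null NES(S,pi) values into one flat list and computes one q value per observed NES; the names `fdr_q_values` and `_fraction`, and capping the result at 1.0, are assumptions of this illustration rather than details taken from GSEApy.

```python
def _fraction(values, predicate):
    """Fraction of values satisfying predicate; 0.0 for an empty list."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)


def fdr_q_values(nes_observed, nes_null):
    """Compute an FDR q value for each observed NES.

    nes_null is a flat list of NES(S, pi) pooled over all sets S and permutations pi.
    """
    null_pos = [x for x in nes_null if x >= 0]
    null_neg = [x for x in nes_null if x < 0]
    obs_pos = [x for x in nes_observed if x >= 0]
    obs_neg = [x for x in nes_observed if x < 0]

    q_values = []
    for nes in nes_observed:
        if nes >= 0:
            null_frac = _fraction(null_pos, lambda x: x >= nes)
            obs_frac = _fraction(obs_pos, lambda x: x >= nes)
        else:
            null_frac = _fraction(null_neg, lambda x: x <= nes)
            obs_frac = _fraction(obs_neg, lambda x: x <= nes)
        # obs_frac is never zero: the observed NES always counts itself.
        q_values.append(min(null_frac / obs_frac, 1.0))
    return q_values


print(fdr_q_values([2.0, 1.0, -1.5], [2.5, 1.5, 1.2, 0.2, -1.0, -2.0]))
```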

gseapy.algorithm.normalize(es, esnull)[source]

Normalize the ES(S,pi) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S,pi).

Returns: NES, NESnull
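
A minimal sketch of this sign-aware rescaling for a single gene set is shown below. It assumes that positive scores are divided by the mean of the non-negative null scores and negative scores by the absolute mean of the negative null scores, so that signs are preserved; the name `normalize_es` is hypothetical, and a zero mean is not handled.

```python
def normalize_es(es, esnull):
    """Return (NES, NESnull) for one gene set, rescaling each sign separately."""
    pos = [x for x in esnull if x >= 0]
    neg = [x for x in esnull if x < 0]
    mean_pos = sum(pos) / len(pos) if pos else float("nan")
    mean_neg_abs = abs(sum(neg) / len(neg)) if neg else float("nan")

    def scale(x):
        # Divide by the mean of the same-signed portion of the null distribution.
        return x / mean_pos if x >= 0 else x / mean_neg_abs

    return scale(es), [scale(x) for x in esnull]


nes, nesnull = normalize_es(0.6, [0.2, 0.4, -0.1, -0.3])
print(nes, nesnull)
```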

gseapy.algorithm.ranking_metric(df, method, pos, neg, classes, ascending)[source]

The main function to rank an expression table. Works for 2d arrays.

Parameters:
  • df – gene_expression DataFrame.
  • method

    The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. Available methods:

    1. ’signal_to_noise’ (s2n) or ‘abs_signal_to_noise’ (abs_s2n)

      You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    2. ’t_test’

      Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    3. ’ratio_of_classes’ (also referred to as fold change).

      Uses the ratio of class means to calculate fold change for natural scale data.

    4. ’diff_of_classes’

      Uses the difference of class means to calculate fold change for natural scale data.

    5. ’log2_ratio_of_classes’

      Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for natural scale data.

  • pos (str) – one of the phenotype labels.
  • neg (str) – one of the phenotype labels.
  • classes (dict) – column id to group mapping.
  • ascending (bool) – bool or list of bool. Sort ascending vs. descending.
Returns:

A pd.Series of the correlation to class of each variable. The gene name is the index, and the value is the ranking.

Visit here for more documentation: http://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
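
The five metrics can be illustrated for a single gene with the standard library alone. The formulas below follow the descriptions in the list above (sample standard deviation is assumed, and any adjustments a real implementation may apply to very small standard deviations are omitted); `ranking_metric_value` and `rank_genes` are hypothetical names, and the plain dict input stands in for a DataFrame.

```python
import math
from statistics import mean, stdev


def ranking_metric_value(pos, neg, method="log2_ratio_of_classes"):
    """Score one gene from its expression values in the two phenotypes."""
    m_pos, m_neg = mean(pos), mean(neg)
    if method == "signal_to_noise":
        return (m_pos - m_neg) / (stdev(pos) + stdev(neg))
    if method == "t_test":
        return (m_pos - m_neg) / math.sqrt(
            stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg)
        )
    if method == "ratio_of_classes":
        return m_pos / m_neg
    if method == "diff_of_classes":
        return m_pos - m_neg
    if method == "log2_ratio_of_classes":
        return math.log2(m_pos / m_neg)
    raise ValueError(f"unknown method: {method!r}")


def rank_genes(expr, classes, pos, neg, method, ascending=False):
    """expr maps gene -> list of values; classes lists one label per column."""
    scores = {}
    for gene, values in expr.items():
        p = [v for v, c in zip(values, classes) if c == pos]
        n = [v for v, c in zip(values, classes) if c == neg]
        scores[gene] = ranking_metric_value(p, n, method)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=not ascending)


expr = {"G1": [2, 4, 6, 1, 2, 3], "G2": [1, 2, 3, 2, 4, 6]}
classes = ["A", "A", "A", "B", "B", "B"]
print(rank_genes(expr, classes, "A", "B", "signal_to_noise"))
```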

gseapy.algorithm.ranking_metric_tensor(exprs, method, permutation_num, pos, neg, classes, ascending, seed=None, skip_last=False)[source]

Build a shuffled ranking matrix when permutation_type is ‘phenotype’. Works for 3d arrays.

Parameters:
  • exprs – gene_expression DataFrame, gene_name indexed.
  • method (str) – method used to calculate a correlation or ranking. Methods include: 1. ‘signal_to_noise’ (s2n) or ‘abs_signal_to_noise’ (abs_s2n). 2. ‘t_test’. 3. ‘ratio_of_classes’ (also referred to as fold change). 4. ‘diff_of_classes’. 5. ‘log2_ratio_of_classes’.
  • permutation_num (int) – number of times the classes are shuffled.
  • pos (str) – one of the phenotype labels.
  • neg (str) – one of the phenotype labels.
  • classes (list) – a list of phenotype labels, specifying which column of the dataframe belongs to which phenotype class.
  • ascending (bool) – Sort ascending vs. descending.
  • seed – random_state seed.
  • skip_last (bool) – (internal use only) whether to skip the permutation of the last rankings.
Returns:

Two 2d ndarrays with shape (nperm, gene_num):

cor_mat_indices: the indices of the sorted and permuted (excluding the last row) ranking matrix.
cor_mat: the sorted and permuted (excluding the last row) ranking matrix.
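
The idea of phenotype permutation can be sketched without NumPy: shuffle the class labels, recompute a per-gene score, and sort each row. In this illustration the last row is computed from the unshuffled labels, which is one reading of "exclude last row" above and should be treated as an assumption; the function name `permuted_rankings`, the use of diff_of_classes as the metric, and the dict input are likewise choices made for the example.

```python
import random


def permuted_rankings(expr, classes, pos, neg, permutation_num, seed=None):
    """Return (indices, scores): one sorted row per shuffle plus the observed row last."""
    rng = random.Random(seed)
    genes = list(expr)

    def score_row(labels):
        # diff_of_classes per gene for the given label assignment
        row = []
        for gene in genes:
            p = [v for v, c in zip(expr[gene], labels) if c == pos]
            n = [v for v, c in zip(expr[gene], labels) if c == neg]
            row.append(sum(p) / len(p) - sum(n) / len(n))
        return row

    label_sets = []
    for _ in range(permutation_num):
        labels = list(classes)
        rng.shuffle(labels)  # class sizes are preserved, only assignment changes
        label_sets.append(labels)
    label_sets.append(list(classes))  # observed labels go in the last row

    indices, scores = [], []
    for labels in label_sets:
        row = score_row(labels)
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        indices.append(order)
        scores.append([row[i] for i in order])
    return indices, scores


expr = {"G1": [5, 6, 1, 2], "G2": [1, 2, 5, 6], "G3": [3, 3, 3, 3]}
idx, cor = permuted_rankings(expr, ["A", "A", "B", "B"], "A", "B", permutation_num=5, seed=0)
print(idx[-1], cor[-1])
```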

Enrichr

class gseapy.enrichr.Enrichr(gene_list: Iterable[str], gene_sets: Union[List[str], str, Dict[str, str]], organism: str = 'human', descriptions: Optional[str] = '', outdir: Optional[str] = 'Enrichr', cutoff: float = 0.05, background: Union[List[str], int, str] = 'hsapiens_gene_ensembl', format: str = 'pdf', figsize: Tuple[float, float] = (6.5, 6), top_term: int = 10, no_plot: bool = False, verbose: bool = False)[source]

Enrichr API

check_genes(gene_list, usr_list_id)[source]

Compare the genes sent and received to get successfully recognized genes

enrich(gmt)[source]

Run the enrichment in local mode.

p = p-value computed using the Fisher exact test (Hypergeometric test)

Not implemented here:

combined score = log(p)·z

see here: http://amp.pharm.mssm.edu/Enrichr/help#background&q=4

columns contain:

Term Overlap P-value Adjusted_P-value Genes

filter_gmt(gmt, background)[source]

The gmt values should be filtered to keep only genes that exist in the background. This substantially affects the significance of the test, which is based on the hypergeometric distribution.

Parameters:
  • gmt – a dict of gene sets.
  • background – list, set, or tuple. A list of custom background genes.
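
To show why filtering against the background matters, the following sketch computes a one-sided hypergeometric (Fisher exact) p-value after restricting both the gene set and the query list to the background. It is a minimal illustration, not the library's code; the function names are hypothetical, and restricting the query list to the background is an assumption of this sketch.

```python
from math import comb


def filter_gene_set(gene_set, background):
    """Keep only the genes of gene_set that exist in the background."""
    bg = set(background)
    return [g for g in gene_set if g in bg]


def hypergeom_pvalue(query, gene_set, background):
    """P(overlap >= observed) when drawing len(query) genes from the background."""
    bg = set(background)
    drawn = set(query) & bg
    members = set(filter_gene_set(gene_set, bg))
    N, K, n = len(bg), len(members), len(drawn)
    k = len(drawn & members)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)


background = [f"g{i}" for i in range(1, 11)]  # 10 hypothetical background genes
gene_set = ["g1", "g2", "g3", "g4", "x1"]     # x1 is not in the background
query = ["g1", "g2", "g5"]
print(filter_gene_set(gene_set, background))
print(hypergeom_pvalue(query, gene_set, background))
```

Because x1 is absent from the background, it is dropped before the test; leaving it in would inflate the gene set size and change the p-value.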

get_background()[source]

Get the background genes.

get_libraries()[source]

Return the active Enrichr library names, using the official API.

get_results(gene_list)[source]

Enrichr API

parse_genelists()[source]

parse gene list

parse_genesets()[source]

parse gene_sets input file type

prepare_outdir()[source]

create temp directory.

run()[source]

Run Enrichr for one sample gene list against multiple libraries.

send_genes(gene_list, url)[source]

send gene list to enrichr server

set_organism()[source]

Select Enrichr organism from below:

| Human & Mouse, H. sapiens & M. musculus
| Fly, D. melanogaster
| Yeast, S. cerevisiae
| Worm, C. elegans
| Fish, D. rerio

gseapy.enrichr.enrichr(gene_list, gene_sets, organism='human', description='', outdir='Enrichr', background='hsapiens_gene_ensembl', cutoff=0.05, format='pdf', figsize=(8, 6), top_term=10, no_plot=False, verbose=False)[source]

Enrichr API.

Parameters:
  • gene_list – str, list, tuple, series, dataframe. Also supports an input txt file with one gene id per row. The input identifiers should be of the same type as those in gene_sets.
  • gene_sets

    str, list, or tuple of Enrichr library name(s), or custom-defined gene_sets (dict or gmt file).

    Examples:

    Input Enrichr Libraries (https://maayanlab.cloud/Enrichr/#stats):
    str: ‘KEGG_2016’
    list: [‘KEGG_2016’, ‘KEGG_2013’]
    Use commas to separate multiple libraries in a str, e.g. “KEGG_2016,huMAP,GO_Biological_Process_2018”
    Input custom files:
    dict: gene_sets={‘A’:[‘gene1’, ‘gene2’,…],
    ‘B’:[‘gene2’, ‘gene4’,…], …}

    gmt: “genes.gmt”

    see also the online docs: https://gseapy.readthedocs.io/en/latest/gseapy_example.html#2.-Enrichr-Example

  • organism

    Enrichr supported organism. Select from (human, mouse, yeast, fly, fish, worm). This argument only affects the Enrichr library names you’ve chosen. It has no effect on gmt or dict input of gene_sets.

    see here for more details: https://maayanlab.cloud/modEnrichr/.

  • description – Optional. Name of the job.
  • outdir – Output file directory.
  • background

    int, list, str. Ignore this argument if your input consists only of Enrichr library names.

    However, this argument is not straightforward when gene_sets is given a custom input (a gmt file or dict). There are 3 ways to set this argument:

    1. (Recommended) Input a list of background genes. The background gene list is defined by your experiment, e.g. the expressed genes in your RNA-seq. The gene identifiers in the gmt/dict should be of the same type as the background genes.
    2. Specify a number, e.g. the number of total expressed genes. This works, but is not recommended. It assumes that all your genes can be found in the background. If genes exist in the gmt but are not included in the background, they will affect the significance of the statistical test.
    3. (Default) Set a Biomart dataset name. The background will be all annotated genes from the BioMart dataset you’ve chosen. The program will try to retrieve the background information automatically.
      Use the example code below to choose the correct dataset name:
      >>> from gseapy.parser import Biomart 
      >>> bm = Biomart()
      >>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')
      
  • cutoff – Show enriched terms whose Adjusted P-value < cutoff. Only affects the output figure, not the final output file. Default: 0.05.
  • format – Output figure format supported by matplotlib (‘pdf’, ‘png’, ‘eps’, …). Default: ‘pdf’.
  • figsize – Matplotlib figsize; accepts a tuple or list, e.g. (width,height). Default: (6.5,6).
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.
Returns:

An Enrichr object. obj.res2d stores your last query; obj.results stores all your queries.

Parser

class gseapy.parser.Biomart(host='www.ensembl.org', verbose=False)[source]

query from BioMart

get_attributes(dataset)[source]

Get available attributes from the dataset you’ve selected

get_datasets(mart='ENSEMBL_MART_ENSEMBL')[source]

Get available datasets from mart you’ve selected

get_filters(dataset)[source]

Get available filters from dataset you’ve selected

get_marts()[source]

Get available marts and their names.

query(dataset='hsapiens_gene_ensembl', attributes=[], filters={}, filename=None)[source]

mapping ids using BioMart.

Parameters:
  • dataset – str, default: ‘hsapiens_gene_ensembl’
  • attributes – str, list, tuple
  • filters – dict, {‘filter name’: list(filter value)}
  • host – www.ensembl.org, asia.ensembl.org, useast.ensembl.org
Returns:

A dataframe containing all the attributes you selected.

Note: it will take a couple of minutes to get the results.

An XML template for querying BioMart (see https://gist.github.com/keithshep/7776579).

Example:

>>> import requests
>>> exampleTaxonomy = "mmusculus_gene_ensembl"
>>> exampleGene = "ENSMUSG00000086981,ENSMUSG00000086982,ENSMUSG00000086983"
>>> urlTemplate = '''http://ensembl.org/biomart/martservice?query=''' \
...     '''<?xml version="1.0" encoding="UTF-8"?>''' \
...     '''<!DOCTYPE Query>''' \
...     '''<Query virtualSchemaName="default" formatter="CSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">''' \
...     '''<Dataset name="%s" interface="default"><Filter name="ensembl_gene_id" value="%s"/>''' \
...     '''<Attribute name="ensembl_gene_id"/><Attribute name="ensembl_transcript_id"/>''' \
...     '''<Attribute name="transcript_start"/><Attribute name="transcript_end"/>''' \
...     '''<Attribute name="exon_chrom_start"/><Attribute name="exon_chrom_end"/>''' \
...     '''</Dataset>''' \
...     '''</Query>'''
>>> exampleURL = urlTemplate % (exampleTaxonomy, exampleGene)
>>> req = requests.get(exampleURL, stream=True)

gseapy.parser.get_library_name(organism='Human')[source]

Return the active Enrichr library names. See also: https://maayanlab.cloud/modEnrichr/

Parameters: organism (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }
Returns: a list of Enrichr libraries from the selected database

gseapy.parser.gsea_cls_parser(cls)[source]

Extract class(phenotype) name from .cls file.

Parameters: cls – a class list instance or a .cls file identical to the GSEA input.
Returns: the phenotype names and a list of class labels (the class vector).
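
For orientation, a .cls file in the GSEA categorical format has three lines: a header with the sample and class counts, a line starting with `#` that names the classes, and a line with one label per sample. The sketch below parses that layout from a string; it is a simplified illustration assuming exactly two classes, and the name `parse_cls` is hypothetical.

```python
def parse_cls(text):
    """Parse three-line .cls content into (pos, neg, classes)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if len(lines) != 3:
        raise ValueError("expected exactly three non-empty lines")
    names = lines[1].lstrip("#").split()  # e.g. "# ALL AML" -> ["ALL", "AML"]
    if len(names) != 2:
        raise ValueError("this sketch supports exactly two classes")
    classes = lines[2].split()            # one label per sample column
    return names[0], names[1], classes


cls_text = "6 2 1\n# ALL AML\nALL ALL ALL AML AML AML\n"
print(parse_cls(cls_text))
```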

gseapy.parser.gsea_edb_parser(results_path)[source]

Parse results.edb file stored under edb file folder.

Parameters: results_path – the .results file located inside the edb folder.
Returns: a dict containing enrichment_term, hit_index, nes, pval, fdr.

gseapy.parser.gsea_gmt_parser(gmt, organism='Human', min_size=3, max_size=1000, gene_list=None)[source]

Parse gene_sets.gmt(gene set database) file or download from enrichr server.

Parameters:
  • gmt (str) – the gene_sets.gmt file or an Enrichr library name. See the full list of Enrichr library names here: https://maayanlab.cloud/Enrichr/#libraries
  • organism (str) – choose one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }. This argument has no effect if the input is a .gmt file.
  • min_size – Minimum number of genes from a gene set that must also be present in the data set. Default: 3.
  • max_size – Maximum number of genes from a gene set that may also be present in the data set. Default: 1000.
  • gene_list – Used for filtering gene sets. Only use this argument for the gsea() method.
Returns:

Returns a new filtered gene set database dictionary.

Do NOT filter gene sets when using replot(), because GSEA Desktop has already done this for you.
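
The filtering behaviour described above can be sketched for local .gmt content, where each line holds a term name, a description, and then the genes, separated by tabs. This is a simplified illustration that parses a string and does not download Enrichr libraries; the name `parse_gmt` is hypothetical, and counting only genes present in gene_list toward the size limits is an assumption based on the parameter descriptions.

```python
def parse_gmt(text, min_size=3, max_size=1000, gene_list=None):
    """Parse tab-separated .gmt content into {term: [genes]} with size filtering."""
    keep = set(gene_list) if gene_list is not None else None
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        term = fields[0]
        genes = [g for g in fields[2:] if g]
        # Count only genes that are also present in the data set, if one is given.
        matched = [g for g in genes if keep is None or g in keep]
        if min_size <= len(matched) <= max_size:
            gene_sets[term] = genes
    return gene_sets


gmt_text = "SET_A\tdesc\tg1\tg2\tg3\nSET_B\tdesc\tg1\tg2\n"
print(parse_gmt(gmt_text, min_size=3))
print(parse_gmt(gmt_text, min_size=2, gene_list=["g1", "g2"]))
```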

Graph

class gseapy.plot.MidpointNormalize(vmin=None, vmax=None, midpoint=None, clip=False)[source]

gseapy.plot.adjust_spines(ax, spines)[source]

function for removing spines and ticks.

Parameters:
  • ax – axes object
  • spines – a list of spine names to keep, e.g. [left, right, top, bottom]. If spines = [], remove all spines and ticks.

gseapy.plot.barplot(df, column='Adjusted P-value', title='', cutoff=0.05, top_term=10, figsize=(6.5, 6), color='salmon', ofname=None, **kwargs)[source]

Visualize enrichr results.

Parameters:
  • df – GSEApy DataFrame results.
  • column – which column of DataFrame to show. Default: Adjusted P-value
  • title – figure title.
  • cutoff – terms with ‘column’ value < cut-off are shown.
  • top_term – number of top enriched terms to show.
  • figsize – tuple, matplotlib figsize.
  • color – color for bars.
  • ofname – output file name. If None, don’t save figure

gseapy.plot.dotplot(df, column='Adjusted P-value', title='', cutoff=0.05, top_term=10, sizes=None, norm=None, legend=True, figsize=(6, 5.5), cmap='RdBu_r', ofname=None, **kwargs)[source]

Visualize enrichr results.

Parameters:
  • df – GSEApy DataFrame results.
  • column – which column of DataFrame to show. Default: Adjusted P-value
  • title – figure title
  • cutoff – terms with ‘column’ value < cut-off are shown.
  • top_term – number of enriched terms to show.
  • ascending – bool, the order of y axis.
  • sizes – tuple, (min, max) scatter size. Not functional for now
  • norm – matplotlib.colors.Normalize object.
  • legend – bool, whether to show legend.
  • figsize – tuple, figure size.
  • cmap – matplotlib colormap
  • ofname – output file name. If None, don’t save figure

gseapy.plot.gseaplot(rank_metric, term, hit_indices, nes, pval, fdr, RES, pheno_pos='', pheno_neg='', figsize=(6, 5.5), cmap='seismic', ofname=None, **kwargs)[source]

This is the main function for reproducing the gsea plot.

Parameters:
  • rank_metric – pd.Series for rankings, rank_metric.values.
  • term – gene_set name
  • hit_indices – hits indices of rank_metric.index presented in gene set S.
  • nes – Normalized enrichment scores.
  • pval – nominal p-value.
  • fdr – false discovery rate.
  • RES – running enrichment scores.
  • pheno_pos – phenotype label, positive correlated.
  • pheno_neg – phenotype label, negative correlated.
  • figsize – matplotlib figsize.
  • ofname – output file name. If None, don’t save figure

gseapy.plot.heatmap(df, z_score=None, title='', figsize=(5, 5), cmap='RdBu_r', xticklabels=True, yticklabels=True, ofname=None, **kwargs)[source]

Visualize the dataframe.

Parameters:
  • df – DataFrame from expression table.
  • z_score – z_score axis{0, 1}. If None, don’t normalize data.
  • title – gene set name.
  • outdir – path to save heatmap.
  • figsize – heatmap figsize.
  • cmap – matplotlib colormap.
  • ofname – output file name. If None, don’t save figure

gseapy.plot.zscore(data2d, axis=0)[source]

Standardize the mean and variance of the data along the given axis.

Parameters:
  • data2d – DataFrame to normalize.
  • axis – int. Which axis to normalize across. If 0, normalize across rows; if 1, normalize across columns. If None, don’t change the data.
Returns:

Normalized DataFrame. Normalized data with a mean of 0 and variance of 1 across the specified axis.
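
A plain-Python sketch of this standardization on a list of rows is shown below. It reads axis=0 as "standardize each row" and axis=1 as "standardize each column", and it uses the sample standard deviation; both are assumptions of this illustration, as are the function names. Rows or columns with zero variance are not handled.

```python
from statistics import mean, stdev


def _zscore_rows(rows):
    """Standardize each row to mean 0 and (sample) variance 1."""
    out = []
    for row in rows:
        m, s = mean(row), stdev(row)
        out.append([(v - m) / s for v in row])
    return out


def zscore_2d(data2d, axis=0):
    """Standardize a list of rows along the chosen axis; axis=None returns a copy."""
    if axis is None:
        return [list(row) for row in data2d]
    if axis == 0:
        return _zscore_rows(data2d)
    if axis == 1:
        columns = [list(col) for col in zip(*data2d)]
        # Standardize the columns, then transpose back to the original layout.
        return [list(row) for row in zip(*_zscore_rows(columns))]
    raise ValueError("axis must be 0, 1, or None")


data = [[1, 2, 3], [2, 4, 6]]
print(zscore_2d(data, axis=0))
print(zscore_2d(data, axis=1))
```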