How to Use GSEAPY

For command line usage:

The enrichr module calls the Enrichr API. The only required input is a gene_list file.

# An example to use enrichr api
$ gseapy enrichr -i gene_list.txt -g KEGG_2016 -d pathway_enrichment -o test

The replot module reproduces GSEA desktop version results. Its only input is the location of the GSEA results.

# An example to reproduce figures using replot module.
$ gseapy replot -i ./gsea -o gseapy_out

The gsea module produces GSEAPY results.

The input requires a txt file (FPKM, expected counts, TPM, etc.), a cls file, and a gene_sets file in gmt format.

# an example to compute using gseapy gsea module
$ gseapy gsea -d gsea_data.txt -c test.cls -g gene_sets.gmt

The prerank module expects a pre-ranked gene list with correlation values in .rnk format, and a gene_sets file in gmt format. The prerank module is an API to the GSEA pre-rank tool.

$ gseapy prerank -r gsea_data.rnk -g gene_sets.gmt -o test

The input file formats are identical to those of the GSEA desktop version. See Example for details, or see the GSEA documentation for more information.
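As a rough illustration of the pre-ranked input described above, the sketch below writes a two-column, tab-separated gene/score table. It assumes the usual GSEA .rnk layout (gene identifier, then ranking value, no header). The helper name `write_rnk` and the example scores are hypothetical and not part of gseapy.

```python
import csv
import io


def write_rnk(scores, handle):
    """Write gene/score pairs as a tab-separated, header-less table.

    Hypothetical helper for illustration; not a gseapy function.
    """
    # Order genes from the highest to the lowest score.
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for gene, score in ordered:
        writer.writerow([gene, score])
    return [gene for gene, _ in ordered]


# Made-up scores, for illustration only.
example_scores = {"GENE_A": 2.5, "GENE_B": -1.25, "GENE_C": 0.5}
buffer = io.StringIO()
ranked_genes = write_rnk(example_scores, buffer)
rnk_text = buffer.getvalue()
```

Writing to an open file handle instead of `io.StringIO` would produce a file that could then be passed to the `-r` option shown above.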

Run gseapy inside python:

import gseapy

This will import the following:

  • The enrichr() function performs gene set enrichment analysis by calling the Enrichr online API.
# call enrichr
gseapy.enrichr(gene_list='gene_list.txt', description='pathway_analysis', gene_sets='KEGG_2016', outdir='test')
  • The replot() function to reproduce GSEA desktop results
# An example to reproduce figures using replot module.
gseapy.replot('gsea','gseapy_out')
  • The gsea() function computes es, nes, pval, and fdr, and generates plots de novo.
# An example to calculate es, nes, pval, and fdr, and produce figures using gseapy.
gseapy.gsea(data='gsea_data.txt', gene_sets='gene_sets.gmt', cls='test.cls', outdir='gseapy_out',
            min_size=15, max_size=1000, permutation_num=1000, weighted_score_type=1,
            permutation_type='gene_set', method='log2_ratio_of_classes', ascending=False,
            figsize=(6.5,6), format='png')
  • The prerank() function computes es, nes, pval, and fdr, and generates plots using a pre-ranked gene list.
# An example to calculate es, nes, pval, and fdr, and produce figures using gseapy.
gseapy.prerank(rnk='gsea_data.rnk', gene_sets='gene_sets.gmt', outdir='gseapy_out', min_size=15,
               max_size=1000, permutation_num=1000, weighted_score_type=1, ascending=False,
               figsize=(6.5,6), format='png')

To see help information for GSEAPY

1. gseapy subcommands

$ gseapy --help

 usage: gseapy [-h] [--version] {gsea,prerank,ssgsea,replot,enrichr} ...

 gseapy -- Gene Set Enrichment Analysis in Python

 positional arguments:
   {gsea,prerank,ssgsea,replot,enrichr}
     gsea       Main GSEAPY Function: run GSEAPY instead of GSEA.
     prerank    Using pre-ranked tool to run GSEAPY.
     ssgsea     Run Single Sample GSEA.
     replot     Reproduce GSEA desktop figures.
     enrichr    Peform GSEA using enrichr API.
 optional arguments:
   -h, --help   show this help message and exit
   --version    show program's version number and exit

For command line options of each command, type: gseapy COMMAND -h

2. The subcommands help

$ gseapy replot -h
# or
$ gseapy gsea -h
# or
$ gseapy prerank -h
# or
$ gseapy ssgsea -h
# or
$ gseapy enrichr -h

Module APIs

gseapy.gsea()[source]

Run Gene Set Enrichment Analysis.

Parameters:
  • data – Gene expression data table, Pandas DataFrame, gct file.
  • gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input as GSEA.
  • cls – A list or a .cls file format required for GSEA.
  • outdir (str) – Results output directory.
  • permutation_num (int) – Number of permutations for significance computation. Default: 1000.
  • permutation_type (str) – Permutation type, “phenotype” for phenotypes, “gene_set” for genes.
  • min_size (int) – Minimum allowed number of genes from a gene set that are also in the data set. Default: 15.
  • max_size (int) – Maximum allowed number of genes from a gene set that are also in the data set. Default: 500.
  • weighted_score_type (float) – Refer to algorithm.enrichment_score(). Default:1.
  • method

    The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. The available methods are:

    1. ’signal_to_noise’

      You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    2. ’t_test’

      Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    3. ’ratio_of_classes’ (also referred to as fold change).

      Uses the ratio of class means to calculate fold change for natural scale data.

    4. ’diff_of_classes’

      Uses the difference of class means to calculate fold change for log scale data.

    5. ’log2_ratio_of_classes’

      Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for natural scale data.

  • ascending (bool) – Sorting order of rankings. Default: False.
  • processes (int) – Number of Processes you are going to use. Default: 1.
  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].
  • format (str) – Matplotlib figure format. Default: ‘pdf’.
  • graph_num (int) – Plot graphs for top sets of each phenotype.
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • seed – Random seed. expect an integer. Default:None.
  • verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.
Returns:

Returns a GSEA object. All results are stored in a dictionary, obj.results, which contains:

| {es: enrichment score,
|  nes: normalized enrichment score,
|  p: P-value,
|  fdr: FDR,
|  size: gene set size,
|  matched_size: genes matched to the data,
|  genes: gene names from the data set
|  ledge_genes: leading edge genes}

gseapy.prerank()[source]

Run Gene Set Enrichment Analysis with pre-ranked correlation defined by user.

Parameters:
  • rnk – pre-ranked correlation table or pandas DataFrame. Same input as the GSEA .rnk file.
  • gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input as GSEA.
  • outdir – results output directory.
  • permutation_num (int) – Number of permutations for significance computation. Default: 1000.
  • min_size (int) – Minimum allowed number of genes from a gene set that are also in the data set. Default: 15.
  • max_size (int) – Maximum allowed number of genes from a gene set that are also in the data set. Default: 500.
  • weighted_score_type (float) – Refer to algorithm.enrichment_score(). Default: 1.
  • ascending (bool) – Sorting order of rankings. Default: False.
  • processes (int) – Number of Processes you are going to use. Default: 1.
  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [6.5,6].
  • format (str) – Matplotlib figure format. Default: ‘pdf’.
  • graph_num (int) – Plot graphs for top sets of each phenotype.
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • seed – Random seed. expect an integer. Default:None.
  • verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.
Returns:

Returns a Prerank object. All results are stored in a dictionary, obj.results, which contains:

| {es: enrichment score,
|  nes: normalized enrichment score,
|  p: P-value,
|  fdr: FDR,
|  size: gene set size,
|  matched_size: genes matched to the data,
|  genes: gene names from the data set
|  ledge_genes: leading edge genes}

gseapy.ssgsea()[source]

Run Gene Set Enrichment Analysis with the single sample GSEA tool.

Parameters:
  • data – Expression table, pd.Series, pd.DataFrame, GCT file, or .rnk file format.
  • gene_sets – Enrichr Library name or .gmt gene sets file or dict of gene sets. Same input as GSEA.
  • outdir – Results output directory.
  • sample_norm_method (str) –

    Sample normalization method. Choose from {‘rank’, ‘log’, ‘log_rank’, ‘custom’}. Default: rank.

    1. ’rank’: Rank your expression data, and transform by 10000*rank_dat/gene_numbers
    2. ’log’: Do not rank; set values below 1 to 1, then transform the data by log(data + exp(1)).
    3. ’log_rank’: Rank your expression data, and transform by log(10000*rank_dat/gene_numbers + exp(1))
    4. ’custom’: Do nothing, and use your own rank value to calculate enrichment score.

see here: https://github.com/GSEA-MSigDB/ssGSEAProjection-gpmodule/blob/master/src/ssGSEAProjection.Library.R, line 86
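The four normalization options listed above can be sketched in plain Python as follows. This is an illustrative reading of the descriptions, not gseapy's implementation. It assumes ranks are 1-based and ascending (smallest value gets rank 1) with ties broken by input order. The function names are hypothetical.

```python
import math


def rank_values(values):
    """Return 1-based ascending ranks; ties are broken by input order (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def normalize_sample(values, method="rank"):
    """Transform one sample's expression values before scoring (sketch)."""
    n_genes = len(values)
    if method == "rank":
        return [10000 * r / n_genes for r in rank_values(values)]
    if method == "log":
        # Values below 1 are set to 1 first, then log(data + e) is applied.
        return [math.log(max(v, 1) + math.e) for v in values]
    if method == "log_rank":
        return [math.log(10000 * r / n_genes + math.e) for r in rank_values(values)]
    if method == "custom":
        # Use the caller's values unchanged.
        return list(values)
    raise ValueError("unknown sample_norm_method: %r" % method)


sample = [5.0, 0.2, 30.0, 1.0]
rank_norm = normalize_sample(sample, "rank")
log_norm = normalize_sample(sample, "log")
```

With four genes, the 'rank' transform maps the ranks 1 to 4 onto 2500, 5000, 7500 and 10000, so samples with different gene counts end up on the same scale.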

Parameters:
  • min_size (int) – Minimum allowed number of genes from a gene set that are also in the data set. Default: 15.
  • max_size (int) – Maximum allowed number of genes from a gene set that are also in the data set. Default: 2000.
  • permutation_num (int) – Number of permutations for significance computation. Default: 0.
  • weighted_score_type (float) – Refer to algorithm.enrichment_score(). Default: 0.25.
  • scale (bool) – If True, normalize the scores by number of genes in the gene sets.
  • ascending (bool) – Sorting order of rankings. Default: False.
  • processes (int) – Number of Processes you are going to use. Default: 1.
  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. [width,height]. Default: [7,6].
  • format (str) – Matplotlib figure format. Default: ‘pdf’.
  • graph_num (int) – Plot graphs for top sets of each phenotype.
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • seed – Random seed. expect an integer. Default:None.
  • verbose (bool) – Bool, increase output verbosity, print out progress of your job, Default: False.
Returns:

Returns an ssGSEA object. All results are stored in a dictionary; access the enrichment scores via obj.resultsOnSamples and the normalized enrichment scores via obj.res2d. If permutation_num > 0, additional results contain:

| {es: enrichment score,
|  nes: normalized enrichment score,
|  p: P-value,
|  fdr: FDR,
|  size: gene set size,
|  matched_size: genes matched to the data,
|  genes: gene names from the data set
|  ledge_genes: leading edge genes, if permutation_num >0}

gseapy.enrichr()[source]

Enrichr API.

Parameters:
  • gene_list – Flat file with list of genes, one gene id per row, or a python list object
  • gene_sets – Enrichr Library to query. Required enrichr library name(s). Separate each name by comma.
  • organism – Enrichr supported organism. Select from (human, mouse, yeast, fly, fish, worm). see here for details: https://amp.pharm.mssm.edu/modEnrichr
  • description – name of analysis. optional.
  • outdir – Output file directory
  • cutoff (float) – Adjusted P-value (Benjamini-Hochberg correction) cutoff. Default: 0.05
  • background (int, str) – BioMart dataset name for retrieving background gene information. This argument only works when the gene_sets input is a gmt file or a Python dict. You can also specify a number yourself, e.g. the total number of expressed genes; in that case, retrieving background information from BioMart is skipped.

Use the code below to see valid background dataset names from BioMart:

>>> from gseapy.parser import Biomart
>>> bm = Biomart(verbose=False, host="asia.ensembl.org")
>>> ## view validated marts
>>> marts = bm.get_marts()
>>> ## view validated datasets
>>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')

Parameters:
  • format (str) – Output figure format supported by matplotlib,(‘pdf’,’png’,’eps’…). Default: ‘pdf’.
  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.
Returns:

An Enrichr object: obj.res2d stores your last query, and obj.results stores all your queries.

gseapy.replot()[source]

The main function to reproduce GSEA desktop outputs.

Parameters:
  • indir – GSEA desktop results directory. It must contain the edb subfolder.
  • outdir – Output directory.
  • weighted_score_type (float) – weighted score type. choose from {0,1,1.5,2}. Default: 1.
  • figsize (list) – Matplotlib output figure figsize. Default: [6.5,6].
  • format (str) – Matplotlib output figure format. Default: ‘pdf’.
  • min_size (int) – Minimum number of input genes present in a gene set. Default: 3.
  • max_size (int) – Maximum number of input genes present in a gene set. Default: 5000. Using the min_size or max_size argument in replot() is discouraged, because the gmt file has already been filtered.
  • verbose (bool) – Increase output verbosity and print out the progress of your job. Default: False.
Returns:

Generate new figures with selected figure format. Default: ‘pdf’.

Algorithm

gseapy.algorithm.enrichment_score(gene_list, correl_vector, gene_set, weighted_score_type=1, nperm=1000, rs=<mtrand.RandomState object>, single=False, scale=False)[source]

This is the most important function of GSEApy. It uses the same algorithm as GSEA and ssGSEA.

Parameters:
  • gene_list – The ordered gene list gene_name_list, rank_metric.index.values
  • gene_set – gene_sets in gmt file, please use gsea_gmt_parser to get gene_set.
  • weighted_score_type – The same as GSEA’s weighted_score method. Weighting by the correlation is a very reasonable choice that allows significant gene sets with less than perfect coherence. Options: 0 (classic), 1, 1.5, 2. Default: 1. If one is interested in penalizing sets for lack of coherence or in discovering sets with any type of nonrandom distribution of tags, a value p < 1 might be appropriate. On the other hand, if one uses sets with a large number of genes and only a small subset of those is expected to be coherent, then one could consider using p > 1. Our recommendation is to use p = 1 and use other settings only if you are very experienced with the method and its behavior.
  • correl_vector – A vector with the correlations (e.g. signal to noise scores) corresponding to the genes in the gene list. Or rankings, rank_metric.values
  • nperm – Only use this parameter when computing esnull for statistical testing. Set the esnull value equal to the permutation number.
  • rs – Random state for initializing gene list shuffling. Default: np.random.RandomState(seed=None)
Returns:

ES: Enrichment score (real number between -1 and +1)

ESNULL: Enrichment score calculated from random permutations.

Hits_Indices: Index of a gene in gene_list, if gene is included in gene_set.

RES: Numerical vector containing the running enrichment score for all locations in the gene list.
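The relationship between ES, Hits_Indices and RES can be sketched with a small pure-Python running sum. This is a simplified illustration of the weighted Kolmogorov-Smirnov-like statistic used by GSEA, not the library's code: it omits the permutation null (ESNULL), the ssGSEA mode and scaling. The function name `running_enrichment` is hypothetical.

```python
def running_enrichment(gene_list, correl_vector, gene_set, weighted_score_type=1):
    """Return (ES, hit indices, running enrichment scores) for one gene set.

    Simplified sketch: no permutations, no ssGSEA mode, no scaling.
    """
    members = set(gene_set)
    in_set = [gene in members for gene in gene_list]
    # Each hit is weighted by |correlation| ** p; p = 0 gives the classic statistic.
    weights = [abs(r) ** weighted_score_type for r in correl_vector]
    hit_total = sum(w for w, hit in zip(weights, in_set) if hit)
    n_miss = len(gene_list) - sum(in_set)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("gene_set must partially overlap gene_list with non-zero weights")

    running = 0.0
    res, hits_indices = [], []
    for index, (weight, hit) in enumerate(zip(weights, in_set)):
        if hit:
            running += weight / hit_total   # step up at a gene in the set
            hits_indices.append(index)
        else:
            running -= 1.0 / n_miss         # step down at a gene outside the set
        res.append(running)

    # ES is the running score with the largest deviation from zero.
    es = max(res, key=abs)
    return es, hits_indices, res


genes = ["A", "B", "C", "D", "E"]
correlations = [3.0, 2.0, 1.0, -1.0, -2.0]
es, hits, res = running_enrichment(genes, correlations, {"A", "C"})
```

In this toy example the running sum peaks right after the first hit and returns to zero at the end of the list, which is the expected behaviour of the statistic.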

gseapy.algorithm.enrichment_score_tensor(gene_mat, cor_mat, gene_sets, weighted_score_type, nperm=1000, rs=<mtrand.RandomState object>, single=False, scale=False)[source]

Next generation algorithm of GSEA and ssGSEA.

Parameters:
  • gene_mat – the ordered gene list(vector) with or without gene indices matrix.
  • cor_mat – correlation vector or matrix (e.g. signal to noise scores) corresponding to the genes in the gene list or matrix.
  • gene_sets (dict) – gmt file dict.
  • weighted_score_type (float) – weighting by the correlation. options: 0(classic), 1, 1.5, 2. default:1 for GSEA and 0.25 for ssGSEA.
  • nperm (int) – permutation times.
  • scale (bool) – If True, normalize the scores by number of genes_mat.
  • single (bool) – If True, use ssGSEA algorithm, otherwise use GSEA.
  • rs – Random state for initializing gene list shuffling. Default: np.random.RandomState(seed=None)
Returns:

A tuple containing:

| ES: Enrichment score (real number between -1 and +1). For ssGSEA, set scale to True.
| ESNULL: Enrichment score calculated from random permutation.
| Hits_Indices: Indices of genes if genes are included in gene_set.
| RES: The running enrichment score for all locations in the gene list.

gseapy.algorithm.gsea_compute(data, gmt, n, weighted_score_type, permutation_type, method, pheno_pos, pheno_neg, classes, ascending, processes=1, seed=None, single=False, scale=False)[source]

compute enrichment scores and enrichment nulls.

Parameters:
  • data – preprocessed expression dataframe or a pre-ranked file if prerank=True.
  • gmt (dict) – all gene sets in .gmt file. need to call load_gmt() to get results.
  • n (int) – permutation number. default: 1000.
  • method (str) – ranking_metric method. see above.
  • pheno_pos (str) – one of labels of phenotype’s names.
  • pheno_neg (str) – one of labels of phenotype’s names.
  • classes (list) – a list of phenotype labels, to specify which column of dataframe belongs to what category of phenotype.
  • weighted_score_type (float) – default:1
  • ascending (bool) – sorting order of rankings. Default: False.
  • seed – random seed. Default: np.random.RandomState()
  • scale (bool) – if true, scale es by gene number.
Returns:

A tuple containing:

| zipped results of es, nes, pval, fdr.
| nested list of hit indices of input gene_list.
| nested list of ranked enrichment score of each input gene_sets.
| list of enriched terms

gseapy.algorithm.gsea_compute_tensor(data, gmt, n, weighted_score_type, permutation_type, method, pheno_pos, pheno_neg, classes, ascending, processes=1, seed=None, single=False, scale=False)[source]

compute enrichment scores and enrichment nulls.

Parameters:
  • data – preprocessed expression dataframe or a pre-ranked file if prerank=True.
  • gmt (dict) – all gene sets in .gmt file. need to call load_gmt() to get results.
  • n (int) – permutation number. default: 1000.
  • method (str) – ranking_metric method. see above.
  • pheno_pos (str) – one of labels of phenotype’s names.
  • pheno_neg (str) – one of labels of phenotype’s names.
  • classes (list) – a list of phenotype labels, to specify which column of dataframe belongs to what category of phenotype.
  • weighted_score_type (float) – default:1
  • ascending (bool) – sorting order of rankings. Default: False.
  • seed – random seed. Default: np.random.RandomState()
  • scale (bool) – if true, scale es by gene number.
Returns:

A tuple containing:

| zipped results of es, nes, pval, fdr.
| nested list of hit indices of input gene_list.
| nested list of ranked enrichment score of each input gene_sets.
| list of enriched terms

gseapy.algorithm.gsea_pval(es, esnull)[source]

Compute nominal p-value.

From article (PNAS): estimate nominal p-value for S from esnull by using the positive or negative portion of the distribution corresponding to the sign of the observed ES(S).
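A minimal sketch of this estimate, assuming the null scores are given as a flat list for one gene set: only the null values sharing the sign of the observed ES are considered, and the p-value is the fraction of those that are at least as extreme. The name `nominal_pvalue` is hypothetical, and returning NaN when no null score has the matching sign is an assumption.

```python
def nominal_pvalue(es, esnull):
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    if es >= 0:
        same_sign = [x for x in esnull if x >= 0]
        extreme = [x for x in same_sign if x >= es]
    else:
        same_sign = [x for x in esnull if x < 0]
        extreme = [x for x in same_sign if x <= es]
    if not same_sign:
        # No null score on this side of zero; the estimate is undefined (assumption).
        return float("nan")
    return len(extreme) / len(same_sign)


null_scores = [0.5, 0.3, -0.2, 0.1, -0.4, 0.6, -0.1, 0.2]
p_positive = nominal_pvalue(0.45, null_scores)
p_negative = nominal_pvalue(-0.3, null_scores)
```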

gseapy.algorithm.gsea_significance(enrichment_scores, enrichment_nulls)[source]

Compute nominal pvals, normalized ES, and FDR q value.

For a given NES(S) = NES* >= 0, the FDR is the ratio of the percentage of all (S,pi) with NES(S,pi) >= 0 whose NES(S,pi) >= NES*, divided by the percentage of observed S with NES(S) >= 0 whose NES(S) >= NES*; similarly if NES(S) = NES* <= 0.
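The ratio described above can be written out directly. The sketch below assumes the null NES values are pooled into one flat list over all gene sets and permutations; it does not cap the result at 1, and it treats the negative side as strictly below zero. The name `fdr_q_value` is hypothetical.

```python
def fdr_q_value(nes_star, observed_nes, null_nes):
    """Ratio of the null tail fraction to the observed tail fraction (sketch)."""
    if nes_star >= 0:
        null_side = [x for x in null_nes if x >= 0]
        obs_side = [x for x in observed_nes if x >= 0]
        null_hits = sum(1 for x in null_side if x >= nes_star)
        obs_hits = sum(1 for x in obs_side if x >= nes_star)
    else:
        null_side = [x for x in null_nes if x < 0]
        obs_side = [x for x in observed_nes if x < 0]
        null_hits = sum(1 for x in null_side if x <= nes_star)
        obs_hits = sum(1 for x in obs_side if x <= nes_star)
    if not null_side or obs_hits == 0:
        # Undefined when one side is empty (assumption).
        return float("nan")
    null_fraction = null_hits / len(null_side)
    obs_fraction = obs_hits / len(obs_side)
    return null_fraction / obs_fraction


observed = [2.0, 1.5, 0.8, -1.2]
pooled_null = [1.6, 0.5, 0.9, 1.1, -0.7, 2.1, 0.3, -1.5, 0.4, 1.0]
q_positive = fdr_q_value(1.5, observed, pooled_null)
q_negative = fdr_q_value(-1.2, observed, pooled_null)
```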

gseapy.algorithm.normalize(es, esnull)[source]

Normalize the ES(S,pi) and the observed ES(S), separately rescaling the positive and negative scores by dividing by the mean of the ES(S,pi).

return: NES, NESnull
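As a rough sketch of this rescaling for a single gene set: positive scores are divided by the mean of the positive null scores, and negative scores by the absolute mean of the negative null scores, so signs are preserved. The name `normalize_scores` is hypothetical, and using NaN when one side of the null is empty is an assumption.

```python
from statistics import mean


def normalize_scores(es, esnull):
    """Return (NES, NESnull) for one gene set (illustrative sketch)."""
    positives = [x for x in esnull if x >= 0]
    negatives = [x for x in esnull if x < 0]
    pos_mean = mean(positives) if positives else float("nan")
    neg_mean = abs(mean(negatives)) if negatives else float("nan")

    def rescale(score):
        # Dividing by a positive number keeps the sign of the score.
        return score / pos_mean if score >= 0 else score / neg_mean

    return rescale(es), [rescale(x) for x in esnull]


nes, nes_null = normalize_scores(0.6, [0.2, 0.4, -0.1, -0.3])
```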

gseapy.algorithm.ranking_metric(df, method, pos, neg, classes, ascending)[source]

The main function to rank an expression table.

Parameters:
  • df – gene_expression DataFrame.
  • method

    The method used to calculate a correlation or ranking. Default: ‘log2_ratio_of_classes’. The available methods are:

    1. ’signal_to_noise’

      You must have at least three samples for each phenotype to use this metric. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    2. ’t_test’

      Uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

    3. ’ratio_of_classes’ (also referred to as fold change).

      Uses the ratio of class means to calculate fold change for natural scale data.

    4. ’diff_of_classes’

      Uses the difference of class means to calculate fold change for log scale data.

    5. ’log2_ratio_of_classes’

      Uses the log2 ratio of class means to calculate fold change for natural scale data. This is the recommended statistic for calculating fold change for natural scale data.

  • pos (str) – one of labels of phenotype’s names.
  • neg (str) – one of labels of phenotype’s names.
  • classes (list) – a list of phenotype labels, to specify which column of dataframe belongs to what category of phenotype.
  • ascending (bool) – bool or list of bool. Sort ascending vs. descending.
Returns:

Returns a pd.Series of the correlation to class for each variable. The gene name is the index, and the value is the ranking.

visit here for more docs: http://software.broadinstitute.org/gsea/doc/GSEAUserGuideFrame.html
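The five metrics can be illustrated for a single gene with the standard library. This sketch assumes the positive phenotype is the numerator (or minuend), uses sample standard deviations, and leaves out the minimum-variance adjustment that GSEA applies to the standard deviations. The function name `gene_metric` is hypothetical.

```python
import math
from statistics import mean, stdev


def gene_metric(pos_values, neg_values, method="log2_ratio_of_classes"):
    """Compute one ranking metric for one gene (simplified sketch)."""
    mu_pos, mu_neg = mean(pos_values), mean(neg_values)
    if method == "signal_to_noise":
        return (mu_pos - mu_neg) / (stdev(pos_values) + stdev(neg_values))
    if method == "t_test":
        variance_term = (stdev(pos_values) ** 2 / len(pos_values)
                         + stdev(neg_values) ** 2 / len(neg_values))
        return (mu_pos - mu_neg) / math.sqrt(variance_term)
    if method == "ratio_of_classes":
        return mu_pos / mu_neg
    if method == "diff_of_classes":
        return mu_pos - mu_neg
    if method == "log2_ratio_of_classes":
        return math.log2(mu_pos / mu_neg)
    raise ValueError("unknown method: %r" % method)


# Made-up expression values for one gene in two phenotypes.
treated = [8.0, 10.0, 12.0]
control = [2.0, 2.5, 3.0]
metrics = {
    name: gene_metric(treated, control, name)
    for name in ("signal_to_noise", "t_test", "ratio_of_classes",
                 "diff_of_classes", "log2_ratio_of_classes")
}
```

Applying such a function to every gene and sorting by the result gives the ranked list that the enrichment score is computed on.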

gseapy.algorithm.ranking_metric_tensor(exprs, method, permutation_num, pos, neg, classes, ascending, rs=<mtrand.RandomState object>)[source]

Build the shuffled ranking matrix when permutation_type is ‘phenotype’.

Parameters:
  • exprs – gene_expression DataFrame, gene_name indexed.
  • method (str) – calculate correlation or ranking. methods including: 1. ‘signal_to_noise’. 2. ‘t_test’. 3. ‘ratio_of_classes’ (also referred to as fold change). 4. ‘diff_of_classes’. 5. ‘log2_ratio_of_classes’.
  • permutation_num (int) – number of times the class labels are shuffled.
  • pos (str) – one of labels of phenotype’s names.
  • neg (str) – one of labels of phenotype’s names.
  • classes (list) – a list of phenotype labels, to specify which column of dataframe belongs to what class of phenotype.
  • ascending (bool) – bool. Sort ascending vs. descending.
Returns:

Returns two 2D ndarrays with shape (nperm, gene_num).

cor_mat_indices: the indices of the sorted and permuted (excluding the last row) ranking matrix.
cor_mat: the sorted and permuted (excluding the last row) ranking matrix.

Enrichr

class gseapy.enrichr.Enrichr(gene_list, gene_sets, organism='human', descriptions='', outdir='Enrichr', cutoff=0.05, background='hsapiens_gene_ensembl', format='pdf', figsize=(6.5, 6), top_term=10, no_plot=False, verbose=False)[source]

Enrichr API

check_genes(gene_list, usr_list_id)[source]

Compare the genes sent and received to get successfully recognized genes

enrich(gmt)[source]

use local mode

p = p-value computed using the Fisher exact test (Hypergeometric test)

Not implemented here:

combine score = log(p)·z

see here: http://amp.pharm.mssm.edu/Enrichr/help#background&q=4

columns contain:

Term Overlap P-value Adjusted_P-value Genes
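The P-value and Adjusted_P-value columns can be illustrated with a small standard-library sketch: a one-sided hypergeometric tail probability for the overlap, followed by Benjamini-Hochberg adjustment. This is not the library's code; the function names are hypothetical and the numbers are made up.

```python
from math import comb


def hypergeom_pvalue(overlap, set_size, list_size, background):
    """P(X >= overlap) when list_size genes are drawn from `background` genes,
    of which set_size belong to the gene set."""
    upper = min(set_size, list_size)
    total = comb(background, list_size)
    tail = sum(
        comb(set_size, k) * comb(background - set_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


p_overlap = hypergeom_pvalue(overlap=2, set_size=5, list_size=4, background=20)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```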
get_background()[source]

get background gene

get_libraries()[source]

return active enrichr library name. Official API

get_organism()[source]

Select Enrichr organism from below:

  • Human & Mouse: H. sapiens & M. musculus
  • Fly: D. melanogaster
  • Yeast: S. cerevisiae
  • Worm: C. elegans
  • Fish: D. rerio

get_results(gene_list)[source]

Enrichr API

parse_genelists()[source]

parse gene list

parse_genesets()[source]

parse gene_sets input file type

prepare_outdir()[source]

create temp directory.

run()[source]

run enrichr for one sample gene list but multi-libraries

send_genes(gene_list, url)[source]

send gene list to enrichr server

gseapy.enrichr.enrichr(gene_list, gene_sets, organism='human', description='', outdir='Enrichr', background='hsapiens_gene_ensembl', cutoff=0.05, format='pdf', figsize=(8, 6), top_term=10, no_plot=False, verbose=False)[source]

Enrichr API.

Parameters:
  • gene_list – Flat file with list of genes, one gene id per row, or a python list object
  • gene_sets – Enrichr Library to query. Required enrichr library name(s). Separate each name by comma.
  • organism – Enrichr supported organism. Select from (human, mouse, yeast, fly, fish, worm). see here for details: https://amp.pharm.mssm.edu/modEnrichr
  • description – name of analysis. optional.
  • outdir – Output file directory
  • cutoff (float) – Adjusted P-value (Benjamini-Hochberg correction) cutoff. Default: 0.05
  • background (int, str) – BioMart dataset name for retrieving background gene information. This argument only works when the gene_sets input is a gmt file or a Python dict. You can also specify a number yourself, e.g. the total number of expressed genes; in that case, retrieving background information from BioMart is skipped.

Use the code below to see valid background dataset names from BioMart:

>>> from gseapy.parser import Biomart
>>> bm = Biomart(verbose=False, host="asia.ensembl.org")
>>> ## view validated marts
>>> marts = bm.get_marts()
>>> ## view validated datasets
>>> datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')

Parameters:
  • format (str) – Output figure format supported by matplotlib,(‘pdf’,’png’,’eps’…). Default: ‘pdf’.
  • figsize (list) – Matplotlib figsize, accept a tuple or list, e.g. (width,height). Default: (6.5,6).
  • no_plot (bool) – If equals to True, no figure will be drawn. Default: False.
  • verbose (bool) – Increase output verbosity, print out progress of your job, Default: False.
Returns:

An Enrichr object: obj.res2d stores your last query, and obj.results stores all your queries.

Parser

class gseapy.parser.Biomart(host='www.ensembl.org', verbose=False)[source]

query from BioMart

get_attributes(dataset)[source]

Get available attributes from the dataset you’ve selected

get_datasets(mart='ENSEMBL_MART_ENSEMBL')[source]

Get available datasets from mart you’ve selected

get_filters(dataset)[source]

Get available filters from dataset you’ve selected

get_marts()[source]

Get available marts and their names.

query(dataset='hsapiens_gene_ensembl', attributes=[], filters={}, filename=None)[source]

mapping ids using BioMart.

Parameters:
  • dataset – str, default: ‘hsapiens_gene_ensembl’
  • attributes – str, list, tuple
  • filters – dict, {‘filter name’: list(filter value)}
  • host – www.ensembl.org, asia.ensembl.org, useast.ensembl.org
Returns:

A DataFrame containing all the attributes you selected.

Note: it will take a couple of minutes to get the results. An XML template for querying BioMart (see https://gist.github.com/keithshep/7776579):

import requests

exampleTaxonomy = "mmusculus_gene_ensembl"
exampleGene = "ENSMUSG00000086981,ENSMUSG00000086982,ENSMUSG00000086983"
urlTemplate = (
    '''http://ensembl.org/biomart/martservice?query='''
    '''<?xml version="1.0" encoding="UTF-8"?>'''
    '''<!DOCTYPE Query>'''
    '''<Query virtualSchemaName="default" formatter="CSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">'''
    '''<Dataset name="%s" interface="default"><Filter name="ensembl_gene_id" value="%s"/>'''
    '''<Attribute name="ensembl_gene_id"/><Attribute name="ensembl_transcript_id"/>'''
    '''<Attribute name="transcript_start"/><Attribute name="transcript_end"/>'''
    '''<Attribute name="exon_chrom_start"/><Attribute name="exon_chrom_end"/>'''
    '''</Dataset>'''
    '''</Query>'''
)

exampleURL = urlTemplate % (exampleTaxonomy, exampleGene)
req = requests.get(exampleURL, stream=True)

gseapy.parser.get_library_name(database='Human')[source]

Return the names of active Enrichr libraries.

Parameters:database (str) – Select one from { ‘Human’, ‘Mouse’, ‘Yeast’, ‘Fly’, ‘Fish’, ‘Worm’ }

gseapy.parser.gsea_cls_parser(cls)[source]

Extract class(phenotype) name from .cls file.

Parameters:cls – a class list instance or a .cls file identical to the GSEA input.
Returns:phenotype name and a list of class vector.
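For orientation, a minimal reader for the three-line .cls layout used by GSEA might look like the sketch below. It assumes the categorical format: a header line with the sample and class counts, a line starting with # that names the classes, and one label per sample on the third line. The function name `read_cls_text` is hypothetical and this is not the library's parser.

```python
def read_cls_text(text):
    """Return (class names, per-sample labels) from .cls file content (sketch)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) < 3:
        raise ValueError("a .cls file needs three non-empty lines")
    n_samples, n_classes = (int(field) for field in lines[0].split()[:2])
    class_names = lines[1].lstrip("#").split()
    labels = lines[2].split()
    # Check the header counts against what was actually read.
    if len(class_names) != n_classes or len(labels) != n_samples:
        raise ValueError("header counts do not match the class or label lines")
    return class_names, labels


cls_text = "6 2 1\n# TREATED CONTROL\nTREATED TREATED TREATED CONTROL CONTROL CONTROL\n"
class_names, class_vector = read_cls_text(cls_text)
```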
gseapy.parser.gsea_edb_parser(results_path, index=0)[source]

Parse results.edb file stored under edb file folder.

Parameters:
  • results_path – path to the results.edb file located inside the edb folder.
  • index – gene_set index of gmt database, used for iterating items.
Returns:

enrichment_term, hit_index, nes, pval, fdr.

gseapy.parser.gsea_gmt_parser(gmt, min_size=3, max_size=1000, gene_list=None)[source]

Parse gene_sets.gmt(gene set database) file or download from enrichr server.

Parameters:
  • gmt – the gene_sets.gmt file of GSEA input or an enrichr library name. checkout full enrichr library name here: http://amp.pharm.mssm.edu/Enrichr/#stats
  • min_size – Minimum allowed number of genes from a gene set that are also in the data set. Default: 3.
  • max_size – Maximum allowed number of genes from a gene set that are also in the data set. Default: 1000.
  • gene_list – Used for filtering gene sets. Only use this argument with the call() method.
Returns:

Return a new filtered gene set database dictionary.

DO NOT filter gene sets when using replot(), because GSEA Desktop has already done this for you.
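The size filtering described by min_size, max_size and gene_list can be sketched as follows. The sketch assumes the standard tab-separated .gmt layout (set name, description, then genes) and keeps the full gene list of each retained set; whether the library stores the full or the matched list is not stated here. The function name `filter_gmt_text` is hypothetical.

```python
def filter_gmt_text(text, min_size=3, max_size=1000, gene_list=None):
    """Parse .gmt content into {set name: genes}, dropping sets outside the size range."""
    allowed = set(gene_list) if gene_list is not None else None
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name = fields[0]
        genes = [gene for gene in fields[2:] if gene]
        # Count only genes that are also present in the data set, if one is given.
        matched = [g for g in genes if allowed is None or g in allowed]
        if min_size <= len(matched) <= max_size:
            gene_sets[name] = genes
    return gene_sets


gmt_text = (
    "SET_A\tdesc\tG1\tG2\tG3\n"
    "SET_B\tdesc\tG1\n"
    "SET_C\tdesc\tG4\tG5\tG6\tG7\n"
)
by_size = filter_gmt_text(gmt_text, min_size=2, max_size=3)
by_data = filter_gmt_text(gmt_text, min_size=2, max_size=10, gene_list=["G4", "G5", "G1"])
```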

Graph

gseapy.plot.adjust_spines(ax, spines)[source]

function for removing spines and ticks.

Parameters:
  • ax – axes object
  • spines – a list of spine names to keep, e.g. [left, right, top, bottom]. If spines = [], remove all spines and ticks.
gseapy.plot.barplot(df, column='Adjusted P-value', title='', cutoff=0.05, top_term=10, figsize=(6.5, 6), color='salmon', ofname=None, **kwargs)[source]

Visualize enrichr results.

Parameters:
  • df – GSEApy DataFrame results.
  • column – which column of DataFrame to show. Default: Adjusted P-value
  • title – figure title.
  • cutoff – cut-off of the column you’ve chosen.
  • top_term – number of top enriched terms to show.
  • figsize – tuple, matplotlib figsize.
  • color – color for bars.
  • ofname – output file name. If None, don’t save figure
gseapy.plot.dotplot(df, column='Adjusted P-value', title='', cutoff=0.05, top_term=10, sizes=None, norm=None, legend=True, figsize=(6, 5.5), cmap='RdBu_r', ofname=None, **kwargs)[source]

Visualize enrichr results.

Parameters:
  • df – GSEApy DataFrame results.
  • column – which column of DataFrame to show. Default: Adjusted P-value
  • title – figure title
  • cutoff – p-adjust cut-off.
  • top_term – number of enriched terms to show.
  • ascending – bool, the order of y axis.
  • sizes – tuple, (min, max) scatter size. Not functional for now
  • norm – matplotlib.colors.Normalize object.
  • legend – bool, whether to show legend.
  • figsize – tuple, figure size.
  • cmap – matplotlib colormap
  • ofname – output file name. If None, don’t save figure
gseapy.plot.gseaplot(rank_metric, term, hits_indices, nes, pval, fdr, RES, pheno_pos='', pheno_neg='', figsize=(6, 5.5), cmap='seismic', ofname=None, **kwargs)[source]

This is the main function for reproducing the gsea plot.

Parameters:
  • rank_metric – pd.Series for rankings, rank_metric.values.
  • term – gene_set name
  • hits_indices – hits indices of rank_metric.index presented in gene set S.
  • nes – Normalized enrichment scores.
  • pval – nominal p-value.
  • fdr – false discovery rate.
  • RES – running enrichment scores.
  • pheno_pos – phenotype label, positive correlated.
  • pheno_neg – phenotype label, negative correlated.
  • figsize – matplotlib figsize.
  • ofname – output file name. If None, don’t save figure
gseapy.plot.heatmap(df, z_score=None, title='', figsize=(5, 5), cmap='RdBu_r', xticklabels=True, yticklabels=True, ofname=None, **kwargs)[source]

Visualize the dataframe.

Parameters:
  • df – DataFrame from expression table.
  • z_score – z_score axis{0, 1}. If None, don’t normalize data.
  • title – gene set name.
  • outdir – path to save heatmap.
  • figsize – heatmap figsize.
  • cmap – matplotlib colormap.
  • ofname – output file name. If None, don’t save figure
gseapy.plot.zscore(data2d, axis=0)[source]

Standardize the mean and variance of the data along the given axis.

Parameters:
  • data2d – DataFrame to normalize.
  • axis – int, Which axis to normalize across. If 0, normalize across rows, if 1, normalize across columns. If None, don’t change data
Returns:

Normalized DataFrame. Normalized data with a mean of 0 and variance of 1 across the specified axis.
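A list-based sketch of this standardization is given below. It reads axis=0 as "standardize each row" and axis=1 as "standardize each column", and it uses the sample standard deviation; both are assumptions about the intended behaviour, and the real function works on a DataFrame rather than nested lists. The function name `zscore_rows_or_columns` is hypothetical.

```python
from statistics import mean, stdev


def _standardize(values):
    """Shift to mean 0 and scale by the sample standard deviation."""
    mu, sd = mean(values), stdev(values)
    return [(value - mu) / sd for value in values]


def zscore_rows_or_columns(data2d, axis=0):
    """Z-score a list of rows along the chosen axis (illustrative sketch)."""
    if axis is None:
        return [list(row) for row in data2d]  # leave the data unchanged
    if axis == 0:
        return [_standardize(row) for row in data2d]
    if axis == 1:
        columns = [_standardize(column) for column in zip(*data2d)]
        return [list(row) for row in zip(*columns)]
    raise ValueError("axis must be 0, 1 or None")


table = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
by_row = zscore_rows_or_columns(table, axis=0)
by_column = zscore_rows_or_columns(table, axis=1)
```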