2. GSEApy Tutorial

GSEApy is a Python/Rust toolkit for Gene Set Enrichment Analysis and related methods. This notebook walks through every public entry point with small, runnable demos using the data bundled in the repository’s tests/ folder.

2.1. What this notebook covers

Section

Function

What it does

  1. Setup

imports and version check

  1. Biomart

Biomart

convert gene IDs, map orthologs between species

  1. MSigDB

Msigdb

download MSigDB gene-set collections (.gmt)

  1. Enrichr libraries

get_library_name, get_library

list / fetch Enrichr gene-set libraries

  1. Over-representation (online)

enrichr

Enrichr web service

  1. Over-representation (offline)

enrich

local hypergeometric test

  1. Preranked GSEA

prerank

GSEA on a single ranked list

  1. Standard GSEA

gsea

phenotype-permutation GSEA

  1. ssGSEA

ssgsea

single-sample GSEA

  1. GSVA

gsva

gene set variation analysis

  1. Replot

replot

re-plot results from the Broad GSEA Java app

Tip: run the cells in order. Sections 2–5 call online services (Biomart, MSigDB, Enrichr) and need an internet connection; everything else runs fully offline on the bundled test data.

2.1.1. Setup

Import GSEApy together with pandas and matplotlib. The two %autoreload magics are only useful if you are editing the GSEApy source while you work — they reload changed modules automatically. They are harmless otherwise.

[1]:
# %matplotlib inline
# %config InlineBackend.figure_format = 'retina'  # crisper figures on HiDPI/Mac screens
%load_ext autoreload
%autoreload 2

import pandas as pd
import matplotlib.pyplot as plt
import gseapy as gp

Check the installed GSEApy version (this tutorial targets v1.x).

[2]:
gp.__version__
[2]:
'1.3.0'

2.1.2. Biomart API

Biomart is a thin client over the Ensembl BioMart web service. Use it to convert gene identifiers (Ensembl ID ↔ gene symbol ↔ Entrez ID) and to map orthologs between species.

Note: BioMart support is limited and the public server can be slow or occasionally unavailable. Skip this section if you do not need ID conversion.

2.2. Convert gene identifiers

[3]:
from gseapy import Biomart

bm = Biomart()
[4]:
# The commented helpers below let you discover valid marts/datasets/attributes/filters.
# Run them once to explore, then build your real query.
# marts    = bm.get_marts()                                   # available marts
# datasets = bm.get_datasets(mart='ENSEMBL_MART_ENSEMBL')     # datasets in a mart
# attrs    = bm.get_attributes(dataset='hsapiens_gene_ensembl')  # columns you can request
# filters  = bm.get_filters(dataset='hsapiens_gene_ensembl')     # columns you can filter on

# A query needs: a dataset, the attributes (columns) to return, and optional filters.
# `filters` must be a dict mapping a filter name to the values to keep.
queries = {'ensembl_gene_id': ['ENSG00000125285', 'ENSG00000182968']}

results = bm.query(
    dataset='hsapiens_gene_ensembl',
    attributes=['ensembl_gene_id', 'external_gene_name', 'entrezgene_id', 'go_id'],
    filters=queries,
)
results.tail()
[4]:
ensembl_gene_id external_gene_name entrezgene_id go_id
27 ENSG00000182968 SOX1 6656 GO:0007420
28 ENSG00000182968 SOX1 6656 GO:0030182
29 ENSG00000182968 SOX1 6656 GO:0043565
30 ENSG00000182968 SOX1 6656 GO:0048713
31 ENSG00000182968 SOX1 6656 GO:1904936

Inspect the column types of the returned table.

[5]:
results.dtypes
[5]:
ensembl_gene_id       object
external_gene_name    object
entrezgene_id          Int32
go_id                 object
dtype: object

2.3. Map gene symbols between mouse and human

This is handy when you need to translate gene symbols across species — for example, to run a human gene-set library against mouse data. We pull the full ortholog mapping in both directions (no filter ⇒ all genes), so each query returns the whole genome and may take a little while.

[6]:
from gseapy import Biomart

bm = Biomart()

# Mouse -> Human orthologs. Note the dataset and attribute names differ per direction.
m2h = bm.query(
    dataset='mmusculus_gene_ensembl',
    attributes=['ensembl_gene_id', 'external_gene_name',
                'hsapiens_homolog_ensembl_gene',
                'hsapiens_homolog_associated_gene_name'],
)

# Human -> Mouse orthologs.
h2m = bm.query(
    dataset='hsapiens_gene_ensembl',
    attributes=['ensembl_gene_id', 'external_gene_name',
                'mmusculus_homolog_ensembl_gene',
                'mmusculus_homolog_associated_gene_name'],
)
[7]:
# Peek at a random sample of the human -> mouse mapping.
h2m.sample(5)
[7]:
ensembl_gene_id external_gene_name mmusculus_homolog_ensembl_gene mmusculus_homolog_associated_gene_name
79602 ENSG00000105750 ZNF85 ENSMUSG00000048280 Zfp738
85423 ENSG00000137135 ARHGEF39 ENSMUSG00000051517 Arhgef39
71287 ENSG00000137077 CCL21 ENSMUSG00000094065 Ccl21d
20341 ENSG00000306096 NaN NaN NaN
5195 ENSG00000273092 TAS2R20 NaN NaN

2.4. Translate a .gmt gene-set file to another species

GSEApy gene symbols are case-sensitive when used offline, and most public libraries (Enrichr, MSigDB) ship human symbols. To run them against mouse data, convert the gene members with the ortholog map built above.

Example: convert a human KEGG library to mouse symbols.

[8]:
# Build a {human_symbol: mouse_symbol} lookup, dropping rows with missing values.
h2m_dict = {}
for _, row in h2m[['external_gene_name', 'mmusculus_homolog_associated_gene_name']].iterrows():
    if row.isna().any():
        continue
    h2m_dict[row['external_gene_name']] = row['mmusculus_homolog_associated_gene_name']

# Read a human KEGG .gmt into a {term: [genes]} dict.
kegg = gp.read_gmt(path="tests/extdata/enrichr.KEGG_2016.gmt")
print("Human symbols:", kegg['MAPK signaling pathway Homo sapiens hsa04010'][:10])
Human symbols: ['EGF', 'IL1R1', 'IL1R2', 'HSPA1L', 'CACNA2D2', 'CACNA2D1', 'CACNA2D4', 'CACNA2D3', 'MAPK8IP3', 'MAPK8IP1']
[9]:
# Translate every gene in every term to its mouse ortholog (skip genes with no mapping).
kegg_mouse = {}
for term, genes in kegg.items():
    kegg_mouse[term] = [h2m_dict[g] for g in genes if g in h2m_dict]

print("Mouse symbols:", kegg_mouse['MAPK signaling pathway Homo sapiens hsa04010'][:10])
Mouse symbols: ['Egf', 'Il1r1', 'Il1r2', 'Hspa1l', 'Cacna2d2', 'Cacna2d1', 'Cacna2d4', 'Cacna2d3', 'Mapk8ip3', 'Mapk8ip1']

2.4.1. MSigDB API

Msigdb downloads gene-set collections from the MSigDB (Broad Institute). You pick a category (e.g. hallmark, curated, GO) and a database version (dbver), and get back a {term: [genes]} dict ready to pass to any GSEApy function.

[10]:
from gseapy import Msigdb

msig = Msigdb()
# Mouse hallmark gene sets (category 'mh.all') from the 2023.1.Mm release.
gmt = msig.get_gmt(category='mh.all', dbver="2023.1.Mm")

Two helpers let you discover valid values before downloading:

msig.list_dbver()                  # list available database versions
msig.list_category(dbver="2023.1.Hs")  # list categories for a given version
[11]:
# Inspect one hallmark set.
print(gmt['HALLMARK_WNT_BETA_CATENIN_SIGNALING'])
['Ctnnb1', 'Jag1', 'Myc', 'Notch1', 'Ptch1', 'Trp53', 'Axin1', 'Ncstn', 'Rbpj', 'Psen2', 'Wnt1', 'Axin2', 'Hey2', 'Fzd1', 'Frat1', 'Csnk1e', 'Dvl2', 'Hey1', 'Gnai1', 'Lef1', 'Notch4', 'Ppard', 'Adam17', 'Tcf7', 'Numb', 'Ccnd2', 'Ncor2', 'Kat2a', 'Nkd1', 'Hdac2', 'Dkk1', 'Wnt5b', 'Wnt6', 'Dll1', 'Skp2', 'Hdac5', 'Fzd8', 'Dkk4', 'Cul1', 'Jag2', 'Hdac11', 'Maml1']

2.4.2. Enrichr gene-set libraries

Enrichr hosts hundreds of curated gene-set libraries. GSEApy can list them and download any of them as a {term: [genes]} dict for use with enrichr, enrich, prerank, gsea, ssgsea, or gsva.

List the available library names.

Choose the organism from: 'Human', 'Mouse', 'Yeast', 'Fly', 'Fish', 'Worm'.

[12]:
# Default organism is Human.
names = gp.get_library_name()
names[:10]
[12]:
['ARCHS4_Cell-lines',
 'ARCHS4_IDG_Coexp',
 'ARCHS4_Kinases_Coexp',
 'ARCHS4_TFs_Coexp',
 'ARCHS4_Tissues',
 'Achilles_fitness_decrease',
 'Achilles_fitness_increase',
 'Aging_Perturbations_from_GEO_down',
 'Aging_Perturbations_from_GEO_up',
 'Allen_Brain_Atlas_10x_scRNA_2021']
[13]:
# Library names differ per organism.
yeast = gp.get_library_name(organism='Yeast')
yeast[:10]
[13]:
['Cellular_Component_AutoRIF',
 'Cellular_Component_AutoRIF_Predicted_zscore',
 'GO_Biological_Process_2018',
 'GO_Biological_Process_AutoRIF',
 'GO_Biological_Process_AutoRIF_Predicted_zscore',
 'GO_Cellular_Component_2018',
 'GO_Cellular_Component_AutoRIF',
 'GO_Cellular_Component_AutoRIF_Predicted_zscore',
 'GO_Molecular_Function_2018',
 'GO_Molecular_Function_AutoRIF']

Download a library into a ``{term: [genes]}`` dict.

[14]:
# Fetch one library by name (or pass a local .gmt path to read_gmt instead).
go_mf = gp.get_library(name='GO_Molecular_Function_2018', organism='Yeast')
print(go_mf['ATP binding (GO:0005524)'])
['MLH1', 'ECM10', 'RLI1', 'SSB1', 'SSB2', 'MSH2', 'YTA12', 'CDC6', 'HMI1', 'YNL247W', 'MSH6', 'SSQ1', 'MCM7', 'SRS2', 'HSP104', 'SSA1', 'MCX1', 'SSC1', 'ARP2', 'ARP3', 'SSE1', 'SMC2', 'SSZ1', 'TDA10', 'ORC5', 'VPS4', 'RBK1', 'NEW1', 'SSA4', 'ORC1', 'KAR2', 'SSA2', 'DYN1', 'SSA3', 'PGK1', 'VPS33', 'LHS1', 'CDC123', 'PMS1']

2.4.3. Over-representation analysis (Enrichr web service)

gp.enrichr() sends a gene list to the Enrichr web service and returns enrichment results. The only required input is a list of gene symbols.

Accepted inputs

  • gene_list: a list, pandas.Series, single-column DataFrame, or a .txt file with one gene symbol per line.

  • gene_sets: one or more Enrichr library names (comma-separated string or list), a local .gmt path, or a {term: [genes]} dict.

gene_list = "./data/gene_list.txt"      # or a Python list / Series
gene_sets = "KEGG_2016"                  # one library
gene_sets = "KEGG_2016,KEGG_2013"        # several, comma-separated
gene_sets = ["KEGG_2016", "KEGG_2013"]   # several, as a list

Note: for the online service, gene symbols are not case-sensitive — they are upper-cased automatically.

[15]:
# Load an example gene list (one symbol per line).
gene_list = pd.read_csv("./tests/data/gene_list.txt", header=None, sep="\t")
gene_list.head()
[15]:
0
0 IGKV4-1
1 CD55
2 IGKC
3 PPFIBP1
4 ABHD4
[16]:
# A DataFrame/Series can be flattened to a plain Python list if you prefer.
glist = gene_list.squeeze().str.strip().to_list()
print(glist[:10])
['IGKV4-1', 'CD55', 'IGKC', 'PPFIBP1', 'ABHD4', 'PCSK6', 'PGD', 'ARHGDIB', 'ITGB2', 'CARD6']

2.5. Enrichr without a custom background

Set outdir=None to keep results in memory only (nothing written to disk). The returned object stores the table on both .res2d (last query) and .results (all queries).

[17]:
enr = gp.enrichr(
    gene_list=gene_list,                              # or "./tests/data/gene_list.txt"
    gene_sets=['MSigDB_Hallmark_2020', 'KEGG_2021_Human'],
    organism='human',                                 # set to your organism, e.g. 'Yeast'
    outdir=None,                                       # keep results in memory only
)
[18]:
# All enrichment results across the requested libraries.
enr.results.head(5)
[18]:
Gene_set Term Overlap P-value Adjusted P-value Old P-value Old Adjusted P-value Odds Ratio Combined Score Genes
0 MSigDB_Hallmark_2020 IL-6/JAK/STAT3 Signaling 19/87 1.197225e-09 5.986123e-08 0 0 6.844694 140.612324 IL4R;TGFB1;IL1R1;IFNGR1;IL10RB;ITGB3;IFNGR2;IL...
1 MSigDB_Hallmark_2020 TNF-alpha Signaling via NF-kB 27/200 3.220898e-08 5.368163e-07 0 0 3.841568 66.270963 BTG2;BCL2A1;PLEK;IRS2;LITAF;IFIH1;PANX1;DRAM1;...
2 MSigDB_Hallmark_2020 Complement 27/200 3.220898e-08 5.368163e-07 0 0 3.841568 66.270963 FCN1;LRP1;PLEK;LIPA;CA2;CASP3;LAMP2;S100A12;FY...
3 MSigDB_Hallmark_2020 Inflammatory Response 24/200 1.635890e-06 2.044862e-05 0 0 3.343018 44.540108 LYN;IFITM1;BTG2;IL4R;CD82;IL1R1;IFNGR2;ITGB3;F...
4 MSigDB_Hallmark_2020 heme Metabolism 23/200 5.533816e-06 5.533816e-05 0 0 3.181358 38.509172 SLC22A4;MPP1;BNIP3L;BTG2;ARHGEF12;NEK7;GDE1;FO...

2.6. Enrichr with a custom background

Provide your own background gene list (e.g. all genes expressed in your assay).

Note: when a background is supplied, the output omits the Overlap column.

[19]:
enr_bg = gp.enrichr(
    gene_list=gene_list,
    gene_sets=['MSigDB_Hallmark_2020', 'KEGG_2021_Human'],
    background="tests/data/background.txt",            # the 'organism' arg is ignored when set
    outdir=None,
)
[20]:
enr_bg.results.head()
[20]:
Gene_set Term P-value Adjusted P-value Old P-value Old adjusted P-value Odds Ratio Combined Score Genes
0 MSigDB_Hallmark_2020 IL-6/JAK/STAT3 Signaling 3.559435e-11 1.779718e-09 0 0 8.533251 205.300064 IL4R;TGFB1;IL1R1;IFNGR1;IL10RB;ITGB3;IFNGR2;IL...
1 MSigDB_Hallmark_2020 TNF-alpha Signaling via NF-kB 3.401526e-10 6.356588e-09 0 0 4.824842 105.189414 BTG2;BCL2A1;PLEK;IRS2;LITAF;IFIH1;PANX1;DRAM1;...
2 MSigDB_Hallmark_2020 Complement 3.813953e-10 6.356588e-09 0 0 4.796735 104.027683 FCN1;LRP1;PLEK;LIPA;CA2;CASP3;LAMP2;S100A12;FY...
3 MSigDB_Hallmark_2020 Inflammatory Response 3.380686e-08 4.225857e-07 0 0 4.197067 72.200480 LYN;IFITM1;BTG2;IL4R;CD82;IL1R1;IFNGR2;ITGB3;F...
4 MSigDB_Hallmark_2020 heme Metabolism 8.943634e-08 8.943634e-07 0 0 4.111306 66.725423 SLC22A4;MPP1;BNIP3L;BTG2;ARHGEF12;NEK7;GDE1;FO...

2.6.1. Over-representation analysis (offline hypergeometric test)

gp.enrich() runs the same over-representation test locally — it does not call the Enrichr web service. Use it for custom or private gene sets, or when you have no internet access.

Important

  1. Input gene symbols are case-sensitive here.

  2. The identifier type in your gene_list must match the one used in the gene sets (.gmt/dict).

  3. gene_sets accepts a .gmt path or a {term: [genes]} dict:

gene_sets = "./data/genes.gmt"
gene_sets = {'A': ['gene1', 'gene2', ...], 'B': ['gene2', 'gene4', ...]}
[21]:
# Note: the offline function is `enrich` (no trailing "r"), unlike the online `enrichr`.
# gene_sets here mixes a .gmt file, a name that won't resolve ("unknown"), and a dict (kegg).
enr2 = gp.enrich(
    gene_list="./tests/data/gene_list.txt",           # or a Python list
    gene_sets=["./tests/data/genes.gmt", "unknown", kegg],
    background=None,    # None | int | gene list | text file | a Biomart dataset name
    outdir=None,
    verbose=True,
)
2026-06-24 11:23:41,352 [INFO] User defined gene sets is given: ./tests/data/genes.gmt
2026-06-24 11:23:41,356 [INFO] Input dict object named with gs_ind_2
2026-06-24 11:23:41,735 [WARNING] Input library not found: unknown. Skip
2026-06-24 11:23:41,736 [INFO] Off-line enrichment analysis with library: genes.gmt
2026-06-24 11:23:41,738 [INFO]   Background is not set! Use all 682 genes in genes.gmt.
2026-06-24 11:23:41,743 [INFO] Off-line enrichment analysis with library: gs_ind_2
2026-06-24 11:23:41,768 [INFO]   Background is not set! Use all 7010 genes in gs_ind_2.
2026-06-24 11:23:41,791 [INFO] Done.
[22]:
enr2.results.head()
[22]:
Gene_set Term Overlap P-value Adjusted P-value Odds Ratio Combined Score Genes
0 genes.gmt BvA_UpIN_A 8/139 0.457390 0.568432 1.169168 0.914547 PADI2;MAP3K5;IL1R1;MBOAT2;MSRB2;HAL;PCSK6;IQGAP2
1 genes.gmt BvA_UpIN_B 12/130 0.026744 0.187208 2.275467 8.240477 SUOX;MBNL3;LPAR1;ST3GAL6;DYSF;FAM65B;HEBP1;ARH...
2 genes.gmt CvA_UpIN_A 1/12 0.481190 0.568432 2.334966 1.708011 MBOAT2
3 genes.gmt DvA_UpIN_A 16/284 0.426669 0.568432 1.134623 0.966412 VNN1;NMNAT1;PADI2;MAP3K5;ATP6V1B2;IL1R1;KIF1B;...
4 genes.gmt DvA_UpIN_D 13/236 0.487227 0.568432 1.088533 0.782682 FAM198B;MBNL3;LPAR1;ST3GAL6;DYSF;GNB4;FAM65B;T...

2.7. Choosing a background for the offline test

By default, all genes in ``gene_sets`` are used as the background. A more accurate background is usually one of the following:

  1. (Recommended) A list of background genes, ['gene1', 'gene2', ...] — typically the genes detectable in your experiment (e.g. expressed genes from your RNA-seq). The identifier type must match the gene sets.

  2. A total gene count, e.g. 20000. Simple, but it assumes every gene set gene is present in the background; genes in the sets but missing from the background still distort the test.

  3. A BioMart dataset name, e.g. "hsapiens_gene_ensembl". GSEApy fetches all annotated genes for that dataset and uses them as the background.

2.8. Plotting enrichment results

dotplot and barplot summarize an enrichment table. Below we show the top 5 terms per library, ranked by adjusted p-value.

[23]:
from gseapy import barplot, dotplot
[24]:
# Dot plot: dot size encodes gene overlap, color encodes significance.
# Setting x='Gene_set' faces the plot by library so you can compare them side by side.
ax = dotplot(
    enr.results,
    column="Adjusted P-value",
    x='Gene_set',          # split the x-axis by library (multi-library comparison)
    size=5,
    top_term=5,
    figsize=(3, 5),
    title="KEGG",
    xticklabels_rot=45,    # rotate x tick labels for readability
    show_ring=True,        # set False to remove the outer ring around dots
    marker='o',
)
_images/gseapy_example_42_0.png
[25]:
# Bar plot grouped by library, with an explicit color per group.
ax = barplot(
    enr.results,
    column="Adjusted P-value",
    group='Gene_set',      # group bars by library (multi-library comparison)
    size=5,
    top_term=5,
    figsize=(3, 5),
    color={'KEGG_2021_Human': 'salmon', 'MSigDB_Hallmark_2020': 'darkblue'},
)
_images/gseapy_example_43_0.png
[26]:
# Single-library views use .res2d (the most recent query).
# Pass ofname='figure.pdf' to save the figure to disk.
ax = dotplot(enr.res2d, title='KEGG_2021_Human', cmap='viridis', size=5, figsize=(3, 5))
_images/gseapy_example_44_0.png
[27]:
ax = barplot(enr.res2d, title='KEGG_2021_Human', figsize=(4, 5), color='darkred')
_images/gseapy_example_45_0.png

2.9. Command-line equivalent

The same analysis is available from the gseapy enrichr CLI. The -v flag prints progress.

[28]:
# !gseapy enrichr -i ./tests/data/gene_list.txt \
#                 -g GO_Biological_Process_2017 \
#                 -v -o test/enrichr_BP

2.9.1. Preranked GSEA (prerank)

gp.prerank() runs GSEA on a single, pre-ranked gene list — for example, genes ranked by a differential-expression statistic.

Accepted ``rnk`` inputs

  • a 2-column DataFrame (gene, score), or a 1-column DataFrame indexed by gene

  • a pandas.Series indexed by gene

  • a .rnk text file (no header; anything after a # is ignored)

Accepted ``gene_sets`` inputs — same as the other functions: Enrichr library name(s), a .gmt path, or a {term: [genes]} dict.

Note on symbols: Enrichr libraries use UPPERCASE human symbols. Convert your gene identifiers to match (see Section 2.3 for cross-species conversion).

[29]:
# A .rnk file: column 0 = gene, column 1 = ranking metric.
rnk = pd.read_csv("./tests/data/temp.rnk", header=None, index_col=0, sep="\t")
rnk.head()
[29]:
1
0
ATXN1 16.456753
UBQLN4 13.989493
CALM1 13.745533
DLG4 12.796588
MRE11A 12.787631
[30]:
rnk.shape
[30]:
(22922, 1)

2.10. Run prerank

permutation_num is reduced to 1000 here to keep the demo fast. outdir=None keeps everything in memory. We use method='multilevel' (the fgsea multilevel p-value, explained just below); the default is method='permutation'.

[31]:
pre_res2 = gp.prerank(
    rnk="./tests/data/temp.rnk",   # or rnk=rnk
    gene_sets='KEGG_2016',          # Enrichr library name (downloaded automatically)
    threads=4,
    min_size=5,
    max_size=1000,
    permutation_num=1000,           # raise for publication-quality p-values
    outdir=None,
    seed=6,                         # set a seed for reproducibility
    method='multilevel',            # fgsea multilevel p-values (see below)
    verbose=True,
)
pre_res2.res2d.head(5)
2026-06-24 11:23:42,569 [WARNING] Duplicated values found in preranked stats: 4.97% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2026-06-24 11:23:42,569 [INFO] Parsing data files for GSEA.............................
2026-06-24 11:23:42,583 [INFO] Enrichr library gene sets already downloaded in: /Users/fangzq/.cache/gseapy, use local file
2026-06-24 11:23:42,590 [INFO] 0001 gene_sets have been filtered out when max_size=1000 and min_size=5
2026-06-24 11:23:42,591 [INFO] 0292 gene_sets used for further statistical testing.....
2026-06-24 11:23:42,591 [INFO] Start to run GSEA...Might take a while..................
2026-06-24 11:23:57,327 [INFO] Congratulations. GSEApy runs successfully................

[31]:
Name Term ES NES NOM p-val FDR q-val Tag % Gene % Lead_genes log2err
0 prerank Adherens junction Homo sapiens hsa04520 0.784625 1.906344 9.365577e-17 1.012870e-15 47/74 10.37% CTNNB1;EGFR;RAC1;TGFBR1;SMAD4;MET;EP300;CDC42;... 1.039888
1 prerank Estrogen signaling pathway Homo sapiens hsa04915 0.766347 1.898225 2.680224e-19 5.217502e-18 74/99 16.57% CALM1;PRKACA;GRB2;SP1;EGFR;KRAS;HRAS;HSP90AB1;... 1.120089
2 prerank Glioma Homo sapiens hsa05214 0.784678 1.894231 2.712993e-15 2.263411e-14 52/65 16.29% CALM1;GRB2;EGFR;PRKCA;KRAS;HRAS;TP53;MAPK1;PRK... 0.991232
3 prerank Thyroid hormone signaling pathway Homo sapiens... 0.757700 1.893138 1.934581e-22 8.069965e-21 84/118 16.29% CTNNB1;PRKACA;PRKCA;KRAS;NOTCH1;EP300;CREBBP;H... 1.210593
4 prerank ErbB signaling pathway Homo sapiens hsa04012 0.765574 1.880505 1.112913e-17 1.477140e-16 65/87 16.29% GRB2;EGFR;PRKCA;KRAS;HRAS;MAPK1;PRKCB;SRC;NRAS... 1.069466

For the comparison later in this section, also run the default permutation method on the same ranking.

[32]:
pre_res = gp.prerank(
    rnk="./tests/data/temp.rnk",
    gene_sets='KEGG_2016',
    threads=4,
    min_size=5,
    max_size=1000,
    permutation_num=1000,
    outdir=None,
    seed=6,
    # method='permutation' is the default
    verbose=True,
)
pre_res.res2d.head(5)
2026-06-24 11:23:57,360 [WARNING] Duplicated values found in preranked stats: 4.97% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
2026-06-24 11:23:57,361 [INFO] Parsing data files for GSEA.............................
2026-06-24 11:23:57,374 [INFO] Enrichr library gene sets already downloaded in: /Users/fangzq/.cache/gseapy, use local file
2026-06-24 11:23:57,382 [INFO] 0001 gene_sets have been filtered out when max_size=1000 and min_size=5
2026-06-24 11:23:57,382 [INFO] 0292 gene_sets used for further statistical testing.....
2026-06-24 11:23:57,382 [INFO] Start to run GSEA...Might take a while..................
2026-06-24 11:24:02,252 [INFO] Congratulations. GSEApy runs successfully................

[32]:
Name Term ES NES NOM p-val FDR q-val FWER p-val Tag % Gene % Lead_genes
0 prerank Adherens junction Homo sapiens hsa04520 0.784625 1.917092 0.0 0.0 0.0 47/74 10.37% CTNNB1;EGFR;RAC1;TGFBR1;SMAD4;MET;EP300;CDC42;...
1 prerank Estrogen signaling pathway Homo sapiens hsa04915 0.766347 1.896928 0.0 0.0 0.0 74/99 16.57% CALM1;PRKACA;GRB2;SP1;EGFR;KRAS;HRAS;HSP90AB1;...
2 prerank Glioma Homo sapiens hsa05214 0.784678 1.895572 0.0 0.0 0.0 52/65 16.29% CALM1;GRB2;EGFR;PRKCA;KRAS;HRAS;TP53;MAPK1;PRK...
3 prerank Thyroid hormone signaling pathway Homo sapiens... 0.757700 1.890973 0.0 0.0 0.0 84/118 16.29% CTNNB1;PRKACA;PRKCA;KRAS;NOTCH1;EP300;CREBBP;H...
4 prerank Long-term potentiation Homo sapiens hsa04720 0.778249 1.890416 0.0 0.0 0.0 42/66 9.01% CALM1;PRKACA;PRKCA;KRAS;EP300;CREBBP;HRAS;PRKA...

2.11. The fgsea multilevel method

With the default gene-permutation method, the smallest reportable nominal p-value is bounded by 1 / permutation_num. Passing ``method=’multilevel’`` switches to GSEApy’s faithful Rust port of the fgsea multilevel algorithm, which resolves p-values far below 1/nperm (down to eps) via adaptive split-sampling.

Things to keep in mind:

  • It runs on a single ranked list, like classic preranked GSEA.

  • The NES is fgsea-style: NES = ES / mean(|null ES| of the same sign), where the null comes from random gene sets. This differs from the gene-permutation NES, so NES values are not directly comparable between the two methods.

  • An extra ``log2err`` column reports the estimated log2 standard error of the multilevel p-value.

[33]:
pre_mult = gp.prerank(
    rnk="./tests/data/temp.rnk",    # single ranked list
    gene_sets="KEGG_2016",
    method="multilevel",            # use fgsea multilevel p-values
    sample_size=101,                # multilevel sampling depth (default 101)
    eps=1e-50,                      # smallest p-value resolved; 0 => machine precision
    threads=4,
    min_size=5,
    max_size=1000,
    seed=6,
    outdir=None,
)
pre_mult.res2d.head()
2026-06-24 11:24:02,297 [WARNING] Duplicated values found in preranked stats: 4.97% of genes
The order of those genes will be arbitrary, which may produce unexpected results.
[33]:
Name Term ES NES NOM p-val FDR q-val Tag % Gene % Lead_genes log2err
0 prerank Adherens junction Homo sapiens hsa04520 0.784625 1.906344 9.365577e-17 1.012870e-15 47/74 10.37% CTNNB1;EGFR;RAC1;TGFBR1;SMAD4;MET;EP300;CDC42;... 1.039888
1 prerank Estrogen signaling pathway Homo sapiens hsa04915 0.766347 1.898225 2.680224e-19 5.217502e-18 74/99 16.57% CALM1;PRKACA;GRB2;SP1;EGFR;KRAS;HRAS;HSP90AB1;... 1.120089
2 prerank Glioma Homo sapiens hsa05214 0.784678 1.894231 2.712993e-15 2.263411e-14 52/65 16.29% CALM1;GRB2;EGFR;PRKCA;KRAS;HRAS;TP53;MAPK1;PRK... 0.991232
3 prerank Thyroid hormone signaling pathway Homo sapiens... 0.757700 1.893138 1.934581e-22 8.069965e-21 84/118 16.29% CTNNB1;PRKACA;PRKCA;KRAS;NOTCH1;EP300;CREBBP;H... 1.210593
4 prerank ErbB signaling pathway Homo sapiens hsa04012 0.765574 1.880505 1.112913e-17 1.477140e-16 65/87 16.29% GRB2;EGFR;PRKCA;KRAS;HRAS;MAPK1;PRKCB;SRC;NRAS... 1.069466

pre_mult.res2d carries the usual preranked columns plus log2err, and drops the FWER p-val column (that statistic is specific to the gene-permutation null):

Column

Meaning

ES / NES

enrichment score and fgsea-style normalized ES

NOM p-val

multilevel nominal p-value (can be far below 1/nperm)

FDR q-val

Benjamini–Hochberg adjusted p-value

log2err

estimated log2 standard error of the multilevel p-value

Tag % / Gene %

leading-edge tag and gene fractions

Lead_genes

leading-edge gene members

[34]:
# Compare the permutation run (pre_res) against the multilevel run (pre_mult) on
# the same ranking. The NES columns are on different scales between methods, but
# the multilevel NOM p-val resolves far below 1/permutation_num (here 1e-3),
# exposing tails the permutation null cannot reach.
cols = ["Term", "NES", "NOM p-val", "FDR q-val"]
cmp = pre_res.res2d[cols].merge(
    pre_mult.res2d[cols], on="Term", suffixes=("_perm", "_multilevel")
)
cmp.sort_values("FDR q-val_multilevel").head(10)
[34]:
Term NES_perm NOM p-val_perm FDR q-val_perm NES_multilevel NOM p-val_multilevel FDR q-val_multilevel
60 Pathways in cancer Homo sapiens hsa05200 1.718575 0.0 0.000066 1.717224 5.820484e-39 1.699581e-36
23 Epstein-Barr virus infection Homo sapiens hsa0... 1.805814 0.0 0.000000 1.800147 4.800341e-28 7.008498e-26
27 Proteoglycans in cancer Homo sapiens hsa05205 1.798016 0.0 0.000036 1.796120 2.580557e-27 2.511742e-25
28 cAMP signaling pathway Homo sapiens hsa04024 1.796236 0.0 0.000035 1.783906 2.453085e-26 1.790752e-24
43 Viral carcinogenesis Homo sapiens hsa05203 1.752638 0.0 0.000023 1.746884 1.766868e-23 1.031851e-21
49 Focal adhesion Homo sapiens hsa04510 1.734311 0.0 0.000020 1.730402 9.352231e-23 4.551419e-21
3 Thyroid hormone signaling pathway Homo sapiens... 1.890973 0.0 0.000000 1.893138 1.934581e-22 8.069965e-21
30 MicroRNAs in cancer Homo sapiens hsa05206 1.791565 0.0 0.000032 1.793027 9.644428e-22 3.520216e-20
116 PI3K-Akt signaling pathway Homo sapiens hsa04151 1.565976 0.0 0.001157 1.561759 3.286497e-21 1.066286e-19
94 HTLV-I infection Homo sapiens hsa05166 1.631478 0.0 0.000465 1.626543 1.115876e-20 3.258357e-19
  • ``sample_size`` (default 101) — samples drawn per multilevel level. Larger values reduce the variance of very small p-values (smaller log2err) at the cost of runtime.

  • ``eps`` (default 1e-50) — lower bound on resolvable p-values; anything below eps is reported as eps. Set ``eps=0`` to push to machine precision (slowest, most accurate tails).

The multilevel method is exposed by the gseapy prerank CLI via --method, --sample-size, and --eps.

[35]:
# !gseapy prerank -r ./tests/data/temp.rnk -g KEGG_2016 \
#         --method multilevel --sample-size 101 --eps 1e-50 \
#         --min-size 5 --max-size 1000 --threads 4 \
#         -o prerank_multilevel_report

2.12. GSEA enrichment plots

Visualize the running enrichment score with obj.plot() (or the standalone gseaplot / gseaplot2 helpers). Pass ofname='figure.pdf' to save to disk.

[36]:
pre_res.res2d.head(5)
[36]:
Name Term ES NES NOM p-val FDR q-val FWER p-val Tag % Gene % Lead_genes
0 prerank Adherens junction Homo sapiens hsa04520 0.784625 1.917092 0.0 0.0 0.0 47/74 10.37% CTNNB1;EGFR;RAC1;TGFBR1;SMAD4;MET;EP300;CDC42;...
1 prerank Estrogen signaling pathway Homo sapiens hsa04915 0.766347 1.896928 0.0 0.0 0.0 74/99 16.57% CALM1;PRKACA;GRB2;SP1;EGFR;KRAS;HRAS;HSP90AB1;...
2 prerank Glioma Homo sapiens hsa05214 0.784678 1.895572 0.0 0.0 0.0 52/65 16.29% CALM1;GRB2;EGFR;PRKCA;KRAS;HRAS;TP53;MAPK1;PRK...
3 prerank Thyroid hormone signaling pathway Homo sapiens... 0.757700 1.890973 0.0 0.0 0.0 84/118 16.29% CTNNB1;PRKACA;PRKCA;KRAS;NOTCH1;EP300;CREBBP;H...
4 prerank Long-term potentiation Homo sapiens hsa04720 0.778249 1.890416 0.0 0.0 0.0 42/66 9.01% CALM1;PRKACA;PRKCA;KRAS;EP300;CREBBP;HRAS;PRKA...
[37]:
# Plot a single term by name. obj.plot() is the convenient wrapper (v1.0.5+).
terms = pre_res.res2d.Term
axs = pre_res.plot(terms=terms[1])

# For finer control, use the standalone helper:
# from gseapy import gseaplot
# gseaplot(rank_metric=pre_res.ranking, term=terms[0],
#          ofname='your.plot.pdf', **pre_res.results[terms[0]])
_images/gseapy_example_64_0.png

Plot several pathways together in one figure.

[38]:
axs = pre_res.plot(
    terms=terms[1:5],
    show_ranking=True,     # show the ranking metric on a secondary y-axis
    figsize=(3, 4),
    # legend_kws={'loc': (1.2, 0)},  # reposition the legend if needed
)

# Equivalent low-level call:
# from gseapy import gseaplot2
# terms = pre_res.res2d.Term[1:5]
# hits  = [pre_res.results[t]['hits'] for t in terms]
# runes = [pre_res.results[t]['RES'] for t in terms]
# fig = gseaplot2(terms=terms, ress=runes, hits=hits,
#                 rank_metric=pre_res.ranking, figsize=(4, 5))
_images/gseapy_example_66_0.png

dotplot also works on GSEA results.

[39]:
from gseapy import dotplot

ax = dotplot(
    pre_res.res2d,
    column="FDR q-val",
    title='KEGG_2016',
    cmap=plt.cm.viridis,
    size=6,                # dot size scaling
    figsize=(4, 5),
    cutoff=0.25,           # only show terms with FDR q-val <= cutoff
    show_ring=False,
)
_images/gseapy_example_68_0.png

2.13. Network visualization (enrichment map)

enrichment_map builds a network where nodes are enriched terms and edges connect terms that share genes. It returns nodes and edges DataFrames that you can also export for Cytoscape.

[40]:
from gseapy import enrichment_map

# Returns two DataFrames: node table and edge table.
nodes, edges = enrichment_map(pre_res.res2d)
[41]:
import networkx as nx
[42]:
# Build a graph from the edge table; carry similarity coefficients as edge attributes.
G = nx.from_pandas_edgelist(
    edges,
    source='src_idx',
    target='targ_idx',
    edge_attr=['jaccard_coef', 'overlap_coef', 'overlap_genes'],
)
[43]:
fig, ax = plt.subplots(figsize=(8, 8))

# Lay out the nodes.
pos = nx.layout.spiral_layout(G)

# Nodes: color by NES, size by the fraction of genes hit.
nx.draw_networkx_nodes(
    G, pos=pos,
    cmap=plt.cm.RdYlBu,
    node_color=list(nodes.NES),
    node_size=list(nodes.Hits_ratio * 1000),
)
# Node labels (term names).
nx.draw_networkx_labels(G, pos=pos, labels=nodes.Term.to_dict())

# Edges: width proportional to Jaccard similarity.
edge_weight = nx.get_edge_attributes(G, 'jaccard_coef').values()
nx.draw_networkx_edges(
    G, pos=pos,
    width=list(map(lambda x: x * 10, edge_weight)),
    edge_color='#CDDBD4',
)
plt.show()
_images/gseapy_example_73_0.png
[44]:
# !gseapy prerank -r temp.rnk -g temp.gmt -o prerank_report_temp

2.13.1. Standard GSEA (gsea)

gp.gsea() runs the classic GSEA on an expression matrix with phenotype labels, computing the ranking metric and a phenotype-permutation null internally.

Accepted inputs

  • data: a DataFrame (genes × samples), a .gct file, or a tab-delimited text file.

  • cls: a .cls file, or a Python list of class labels (one per sample).

  • gene_sets: an Enrichr library name, a .gmt path, or a {term: [genes]} dict.

Note on symbols: as with prerank, match your gene identifiers to the library (Enrichr libraries use UPPERCASE human symbols).

[45]:
# A .cls file labels each sample with its phenotype/class.
phenoA, phenoB, class_vector = gp.parser.gsea_cls_parser("./tests/extdata/Leukemia.cls")
[46]:
# class_vector holds the per-sample class label.
print(class_vector)
['ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'ALL', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML', 'AML']
[47]:
# Expression matrix: genes in rows, samples in columns.
gene_exp = pd.read_csv("./tests/extdata/Leukemia_hgu95av2.trim.txt", sep="\t")
gene_exp.head()
[47]:
Gene NAME ALL_1 ALL_2 ALL_3 ALL_4 ALL_5 ALL_6 ALL_7 ALL_8 ... AML_15 AML_16 AML_17 AML_18 AML_19 AML_20 AML_21 AML_22 AML_23 AML_24
0 MAPK3 1000_at 1633.6 2455.0 866.0 1000.0 3159.0 1998.0 1580.0 1955.0 ... 1826.0 2849.0 2980.0 1442.0 3672.0 294.0 2188.0 1245.0 1934.0 13154.0
1 TIE1 1001_at 284.4 159.0 173.0 216.0 1187.0 647.0 352.0 1224.0 ... 1556.0 893.0 1278.0 301.0 797.0 248.0 167.0 941.0 1398.0 -502.0
2 CYP2C19 1002_f_at 285.8 114.0 429.0 -43.0 18.0 366.0 119.0 -88.0 ... -177.0 64.0 -359.0 68.0 2.0 -464.0 -127.0 -279.0 301.0 509.0
3 CXCR5 1003_s_at -126.6 -388.0 143.0 -915.0 -439.0 -371.0 -448.0 -862.0 ... 237.0 -834.0 -1940.0 -684.0 -1236.0 -1561.0 -895.0 -1016.0 -2238.0 -1362.0
4 CXCR5 1004_at -83.3 33.0 195.0 85.0 54.0 -6.0 55.0 101.0 ... 86.0 -5.0 487.0 102.0 33.0 -153.0 -50.0 257.0 439.0 386.0

5 rows × 50 columns

[48]:
print("Positively correlated phenotype:", phenoA)
print("Negatively correlated phenotype:", phenoB)
Positively correlated phenotype: ALL
Negatively correlated phenotype: AML
[49]:
gs_res = gp.gsea(
    data=gene_exp,                                       # or a .gct / .txt path
    gene_sets='./tests/extdata/h.all.v7.0.symbols.gmt',  # or an Enrichr library name
    cls="./tests/extdata/Leukemia.cls",                  # or cls=class_vector
    permutation_type='phenotype',  # use 'phenotype' when each group has >= 15 samples
    permutation_num=1000,          # reduced for the demo
    outdir=None,
    method='signal_to_noise',      # ranking metric
    threads=4,
    seed=7,
)
2026-06-24 11:24:17,956 [WARNING] Found duplicated gene names, values averaged by gene names!

You can also use the GSEA class directly and set the phenotype labels manually via pheno_pos / pheno_neg.

[50]:
from gseapy import GSEA

gs = GSEA(
    data=gene_exp,
    gene_sets='KEGG_2016',
    classes=class_vector,          # or cls=class_vector
    permutation_type='phenotype',
    permutation_num=1000,
    outdir=None,
    method='signal_to_noise',
    threads=4,
    seed=8,
)
gs.pheno_pos = "AML"               # name of the positive phenotype
gs.pheno_neg = "ALL"               # name of the negative phenotype
gs.run()
2026-06-24 11:24:19,582 [WARNING] Found duplicated gene names, values averaged by gene names!

2.14. GSEA plots

The gsea module can auto-generate per-set heatmaps in the background. To plot yourself, use the snippets below.

[51]:
# Running enrichment score for the top 5 terms.
terms = gs_res.res2d.Term
axs = gs_res.plot(terms[:5], show_ranking=False, legend_kws={'loc': (1.05, 0)})
_images/gseapy_example_85_0.png
[52]:
# Low-level equivalent with gseaplot2:
# from gseapy import gseaplot2
# terms = gs_res.res2d.Term[:5]
# hits  = [gs_res.results[t]['hits'] for t in terms]
# runes = [gs_res.results[t]['RES'] for t in terms]
# fig = gseaplot2(terms=terms, ress=runes, hits=hits,
#                 rank_metric=gs_res.ranking, figsize=(4, 5))
[53]:
from gseapy import heatmap

# Heatmap of the leading-edge genes for one enriched term.
i = 2
genes = gs_res.res2d.Lead_genes[i].split(";")
# z_score=0 standardizes across rows (genes). Pass ofname='figure.pdf' to save.
ax = heatmap(df=gs_res.heatmat.loc[genes], z_score=0, title=terms[i], figsize=(14, 4))
_images/gseapy_example_87_0.png
[54]:
# The underlying expression values behind that heatmap.
gs_res.heatmat.loc[genes]
[54]:
ALL_1 ALL_2 ALL_3 ALL_4 ALL_5 ALL_6 ALL_7 ALL_8 ALL_9 ALL_10 ... AML_15 AML_16 AML_17 AML_18 AML_19 AML_20 AML_21 AML_22 AML_23 AML_24
Gene
LEF1 8544.10 12552.0 2869.0 15265.0 7446.0 6991.0 15520.0 13114.0 22604.0 21795.0 ... 682.0 152.0 -348.0 30.0 210.0 350.0 -242.0 -47.0 176.0 14.0
SKP2 23.80 -45.0 -95.0 -71.0 -65.0 -547.0 -24.0 230.0 159.0 -162.0 ... -865.0 -642.0 -1005.0 -413.0 -733.0 -812.0 -464.0 -490.0 -333.0 7.0
CUL1 1712.75 3309.0 1273.5 1726.5 947.5 2160.0 2065.0 2524.5 1882.5 2684.5 ... 851.5 614.5 1560.0 523.0 952.0 935.0 646.0 1214.5 770.0 2088.5
HDAC2 4542.90 6030.0 1195.0 9368.0 2281.0 3407.0 3175.0 3962.0 2616.0 6848.0 ... 1072.0 1918.0 1545.0 1653.0 2328.0 1061.0 1571.0 1749.0 2942.0 1174.0
GNAI1 588.50 163.0 364.0 882.0 1317.0 17.0 518.0 89.0 1136.0 816.0 ... 470.0 313.0 -163.0 210.0 684.0 -331.0 115.0 55.0 -80.0 -94.0
MAML1 871.40 1871.0 578.0 1589.0 1448.0 1364.0 2494.0 1989.0 1538.0 1946.0 ... 390.0 233.0 1075.0 962.0 997.0 1316.0 48.0 609.0 2090.0 1056.0
WNT1 -872.50 -875.0 -1012.0 -535.0 -654.0 -694.0 -421.0 -827.0 -770.0 -1001.0 ... -2506.0 -2791.0 -2249.0 -1201.0 -1819.0 -2599.0 -995.0 -1861.0 -1835.0 -714.0
HDAC5 2137.20 2374.0 1651.0 2012.0 3132.0 2279.0 2314.0 2349.0 2376.0 5455.0 ... 1215.0 1024.0 760.0 1368.0 1923.0 1927.0 2872.0 848.0 1629.0 2763.0
AXIN1 -433.50 -722.0 -808.0 -623.0 -1167.0 -326.0 448.0 -661.0 -1315.0 -1991.0 ... -2590.0 -2417.0 -1321.0 -466.0 -1628.0 -1910.0 93.0 -2951.0 -1666.0 471.0

9 rows × 48 columns

[55]:
from gseapy import dotplot

ax = dotplot(
    gs_res.res2d,
    column="FDR q-val",
    title='Hallmark',
    cmap=plt.cm.viridis,
    size=5,
    figsize=(4, 5),
    cutoff=1,              # show all terms (no significance filter)
)
_images/gseapy_example_89_0.png

2.15. Command-line equivalent

[56]:
# !gseapy gsea -d ./tests/extdata/Leukemia_hgu95av2.trim.txt \
#              -g KEGG_2016 -c ./tests/extdata/Leukemia.cls \
#              -o test/gsea_report \
#              -v --no-plot \
#              -t phenotype

2.15.1. Single-sample GSEA (ssgsea)

gp.ssgsea() scores each sample independently, producing an enrichment score per gene set per sample — useful when you do not have two phenotype groups to contrast.

Not sure whether to use prerank or ssgsea? See the FAQ.

Accepted ``data`` inputs: a .txt/.gct file, a DataFrame (genes × samples), or a Series indexed by gene. gene_sets accepts the usual name / .gmt / dict.

[57]:
ss = gp.ssgsea(
    data='./tests/extdata/Leukemia_hgu95av2.trim.txt',
    gene_sets='./tests/extdata/h.all.v7.0.symbols.gmt',
    outdir=None,
    sample_norm_method='rank',   # 'rank' ranks genes per sample; 'custom' uses raw values
    no_plot=True,
)
2026-06-24 11:24:23,205 [WARNING] Found duplicated gene names, values averaged by gene names!
[58]:
ss.res2d.head()
[58]:
Name Term ES NES
0 ALL_2 HALLMARK_MYC_TARGETS_V1 3393.823575 0.707975
1 ALL_12 HALLMARK_MYC_TARGETS_V1 3385.626111 0.706265
2 AML_11 HALLMARK_MYC_TARGETS_V1 3359.186716 0.700749
3 ALL_14 HALLMARK_MYC_TARGETS_V1 3348.938881 0.698611
4 ALL_17 HALLMARK_MYC_TARGETS_V1 3335.065348 0.695717
[59]:
# ssGSEA also accepts a DataFrame or Series (gene symbols as the index).
ssdf = pd.read_csv("./tests/data/temp.rnk", header=None, index_col=0, sep="\t")
ssdf.head()
[59]:
1
0
ATXN1 16.456753
UBQLN4 13.989493
CALM1 13.745533
DLG4 12.796588
MRE11A 12.787631
[60]:
# Collapse the single-column DataFrame to a Series.
ssdf2 = ssdf.squeeze()
[61]:
# Series / DataFrame input is supported directly.
temp = gp.ssgsea(data=ssdf2, gene_sets="./tests/data/temp.gmt")

2.16. Access the enrichment scores (ES / NES)

Results live on obj.res2d. Pivot to a term × sample matrix for downstream use.

[62]:
ss.res2d.sort_values('Name').head()
[62]:
Name Term ES NES
601 ALL_1 HALLMARK_PANCREAS_BETA_CELLS -1280.654659 -0.267153
934 ALL_1 HALLMARK_APOPTOSIS 970.818772 0.202519
1774 ALL_1 HALLMARK_HEDGEHOG_SIGNALING 431.446694 0.090003
279 ALL_1 HALLMARK_INTERFERON_ALPHA_RESPONSE 1721.458034 0.359108
1778 ALL_1 HALLMARK_BILE_ACID_METABOLISM -429.127871 -0.089519
[63]:
# Reshape to NES values, terms in rows and samples in columns.
nes = ss.res2d.pivot(index='Term', columns='Name', values='NES')
nes.head()
[63]:
Name ALL_1 ALL_10 ALL_11 ALL_12 ALL_13 ALL_14 ALL_15 ALL_16 ALL_17 ALL_18 ... AML_22 AML_23 AML_24 AML_3 AML_4 AML_5 AML_6 AML_7 AML_8 AML_9
Term
HALLMARK_ADIPOGENESIS 0.287384 0.274548 0.290059 0.285388 0.322757 0.305239 0.275686 0.266209 0.315803 0.282617 ... 0.277755 0.261477 0.200083 0.312948 0.342963 0.253282 0.298924 0.410395 0.387433 0.343606
HALLMARK_ALLOGRAFT_REJECTION 0.061770 0.028062 0.096589 0.080713 0.082701 0.102735 0.125250 0.147262 0.124621 0.091077 ... 0.185738 0.157852 0.055585 0.218827 0.172395 0.199077 0.158945 0.138350 0.110787 0.121643
HALLMARK_ANDROGEN_RESPONSE 0.133453 0.113911 0.193074 0.201531 0.151001 0.129670 0.173563 0.144836 0.180214 0.180801 ... 0.180443 0.188891 0.197979 0.174892 0.142850 0.184843 0.157449 0.162843 0.180475 0.181878
HALLMARK_ANGIOGENESIS -0.113481 -0.182411 -0.195637 -0.094817 -0.163717 -0.139243 -0.119084 -0.154526 -0.068290 -0.121156 ... 0.054883 -0.023782 0.119022 -0.067741 0.048430 0.012808 0.032505 -0.024058 -0.039492 -0.043769
HALLMARK_APICAL_JUNCTION 0.051372 0.063763 0.054601 0.014385 0.049019 0.052690 0.064787 0.052192 0.056070 0.064936 ... 0.109270 0.090065 0.155801 0.091556 0.110045 0.101659 0.128808 0.095511 0.080076 0.098644

5 rows × 48 columns

Warning: if you set permutation_num > 0, ssgsea behaves like prerank but with the ssGSEA statistic. Do not do this unless you know exactly why you need it.

ss_permut = gp.ssgsea(
    data="./tests/extdata/Leukemia_hgu95av2.trim.txt",
    gene_sets="./tests/extdata/h.all.v7.0.symbols.gmt",
    outdir=None,
    sample_norm_method='rank',
    permutation_num=20,   # > 0 makes it behave like prerank
    no_plot=True,
    threads=4, seed=9,
)
ss_permut.res2d.head(5)

2.17. Command-line equivalent

[64]:
# !gseapy ssgsea -d ./tests/extdata/Leukemia_hgu95av2.trim.txt \
#                -g ./tests/data/temp.gmt \
#                -o test/ssgsea_report \
#                -p 4 --no-plot

2.17.1. Gene Set Variation Analysis (gsva)

gp.gsva() is GSEApy’s port of GSVA — an unsupervised method that turns a gene × sample expression matrix into a gene-set × sample matrix of enrichment scores, with no phenotype labels required.

[65]:
es = gp.gsva(
    data='./tests/extdata/Leukemia_hgu95av2.trim.txt',
    gene_sets='./tests/extdata/h.all.v7.0.symbols.gmt',
    outdir=None,
)
2026-06-24 11:24:23,575 [WARNING] Found duplicated gene names, values averaged by gene names!
[66]:
# Reshape to a gene-set x sample matrix of GSVA enrichment scores.
es.res2d.pivot(index='Term', columns='Name', values='ES').head()
[66]:
Name ALL_1 ALL_10 ALL_11 ALL_12 ALL_13 ALL_14 ALL_15 ALL_16 ALL_17 ALL_18 ... AML_22 AML_23 AML_24 AML_3 AML_4 AML_5 AML_6 AML_7 AML_8 AML_9
Term
HALLMARK_ADIPOGENESIS -0.213310 -0.080960 0.003289 -0.017909 0.207841 0.023294 -0.085392 -0.221273 0.161470 -0.018250 ... 0.033440 -0.190436 -0.098500 0.105208 0.196799 -0.296305 -0.084042 0.450832 0.226921 0.209835
HALLMARK_ALLOGRAFT_REJECTION -0.210468 -0.373787 -0.086016 -0.169623 -0.158775 -0.016488 -0.050703 0.104430 -0.075816 -0.193654 ... 0.023653 0.032892 -0.113577 0.307030 0.134581 0.188905 0.132169 0.024078 -0.092054 -0.195987
HALLMARK_ANDROGEN_RESPONSE -0.136330 -0.308572 0.008126 0.048490 -0.061181 -0.203036 0.070416 -0.125240 0.080075 0.022248 ... 0.031898 0.064394 0.070232 0.199349 -0.079399 -0.016658 -0.127327 0.018847 0.121426 0.163149
HALLMARK_ANGIOGENESIS 0.035895 -0.287645 -0.214951 -0.291145 -0.311917 -0.236717 -0.345662 -0.250202 -0.233296 -0.318353 ... 0.244374 -0.076852 -0.010928 -0.210787 0.387912 0.269447 0.348230 0.157249 0.075479 -0.064515
HALLMARK_APICAL_JUNCTION -0.088652 -0.128757 -0.050282 -0.248682 -0.145164 0.001997 -0.082962 -0.091691 -0.168941 -0.139766 ... 0.005859 -0.067385 0.062719 -0.022434 0.076593 0.138664 0.240647 0.039307 0.016764 0.057512

5 rows × 48 columns

2.18. Command-line equivalent

[67]:
# !gseapy gsva -d ./tests/data/expr.gsva.csv \
#              -g ./tests/data/geneset.gsva.gmt \
#              -o test/gsva_report

2.18.1. Replot (replot)

gp.replot() re-creates GSEA plots from the output of the Broad Institute’s Java GSEA application. It needs the edb/ folder produced by that tool, with this layout:

data/
└── edb/
    ├── C1OE.cls
    ├── gene_sets.gmt
    ├── gsea_data.gsea_data.rnk
    └── results.edb
[68]:
rep = gp.replot(indir="./tests/data", outdir="tests/replot_test")

2.19. Command-line equivalent

[69]:
# !gseapy replot -i data -o test/replot_test