5. A Protocol to Prepare files for GSEApy
As a biological researcher, I like protocols.
Here is a short tutorial for you to walk you through gseapy.
For file format explanation, please see here
In order to run gseapy successfully, install gseapy use pip.
pip install gseapy
# if you have conda
conda install -c bioconda gseapy
5.1. Use gsea
command, or gsea()
Follow the steps blow.
One thing you should know is that the gseapy input files are the same as
GSEA
desktop required. You can use these files below to run GSEA
desktop, too.
5.1.1. Prepare an tabular text file of gene expression like this:
RNA-seq,ChIP-seq, Microarry data are all supported.
Here is to see what the structure of expression table looks like
import pandas as pd
df = pd.read_table('./test/gsea_data.txt')
df.head()
#or assign dataframe to the parameter 'data'
Gene_Symbol | ACD2 | BCD2 | CCD2 | APYD2 | BPYD2 | CPYD2 | |
---|---|---|---|---|---|---|---|
0 | SCARA3 | 6.287 | 6.821 | 6.005 | 2.525 | 1.911 | 1.500 |
1 | POU5F1 | 11.168 | 11.983 | 10.469 | 7.795 | 7.911 | 6.174 |
2 | CTLA2B | 4.362 | 5.708 | 4.633 | 1.493 | 0.000 | 1.369 |
3 | CRYAB | 11.339 | 11.662 | 11.714 | 7.698 | 7.928 | 7.779 |
4 | PMP22 | 7.259 | 7.548 | 6.803 | 4.418 | 2.239 | 3.071 |
5.1.2. An cls file is also expected.
This file is used to specify column attributes in step 1, just like GSEA
asked.
An example of cls file looks like below.
with open('gsea/edb/C1OE.cls') as cls:
print(cls.read())
# or assign a list object to parameter 'cls' like this
# cls=['C1OE', 'C1OE', 'C1OE', 'Vector', 'Vector', 'Vector']
6 2 1
# C1OE Vector
C1OE C1OE C1OE Vector Vector Vector
So you could prepare the cls file in python like this
groups = ['C1OE', 'C1OE', 'C1OE', 'Vector', 'Vector', 'Vector']
with open('gsea/edb/C1OE.cls', "w") as cl:
line = f"{len(groups)} 2 1\n# C10E Vector\n"
cl.write(line)
cl.write(" ".join(groups) + "\n")
5.1.3. Gene_sets file in gmt format.
All you need to do is to download gene set database file from GSEA
or Enrichr
website.
Or you could use enrichr library. In this case, just provide library name to parameter ‘gene_sets’
If you would like to use you own gene_sets.gmts files, build such a file use excel:
An example of gmt file looks like below:
with open('gsea/edb/gene_sets.gmt') as gmt:
print(gmt.read())
ES-SPECIFIC Arid3a_used ACTA1 CALML4 CORO1A DHX58 DPYS EGR1 ESRRB GLI2 GPX2 HCK INHBB
HDAC-UNIQUE Arid3a_used 1700017B05RIK 8430427H17RIK ABCA3 ANKRD44 ARL4A BNC2 CLDN3
XEN-SPECIFIC Arid3a_used 1110036O03RIK A130022J15RIK B2M B3GALNT1 CBX4 CITED1 CLU CTSH CYP26A1
GATA-SPECIFIC Arid3a_used 1200009I06RIK 5430407P10RIK BAIAP2L1 BMP8B CITED1 CLDN3 COBLL1 CORO1A CRYAB CTDSPL DKKL1
TS-SPECIFIC Arid3a_used 5430407P10RIK AFAP1L1 AHNAK ANXA2 ANXA3 ANXA5 B2M BIK BMP8B CAMK1D CBX4 CLDN3 CSRP1 DKKL1 DSC2
5.2. Use enrichr
command, or enrichr()
The only thing you need to prepare is a gene list file.
Note: Enrichr uses a list of Entrez gene symbols as input.
For enrichr
, you could assign a list object
# assign a list object to enrichr
l = ['SCARA3', 'LOC100044683', 'CMBL', 'CLIC6', 'IL13RA1', 'TACSTD2', 'DKKL1', 'CSF1',
'SYNPO2L', 'TINAGL1', 'PTX3', 'BGN', 'HERC1', 'EFNA1', 'CIB2', 'PMP22', 'TMEM173']
gseapy.enrichr(gene_list=l, gene_sets='KEGG_2016', outfile='test')
or a gene list file in txt format(one gene id per row)
gseapy.enrichr(gene_list='gene_list.txt', gene_sets='KEGG_2016', outfile='test')
Let’s see what the txt file looks like.
with open('data/gene_list.txt') as genes:
print(genes.read())
CTLA2B
SCARA3
LOC100044683
CMBL
CLIC6
IL13RA1
TACSTD2
DKKL1
CSF1
CITED1
SYNPO2L
TINAGL1
PTX3
Select the library you want to do enrichment analysis. To get a list of all available libraries, run
#s get_library_name(), it will print out all library names.
import gseapy
names = gseapy.get_library_name()
print(names)
['Genome_Browser_PWMs',
'TRANSFAC_and_JASPAR_PWMs',
'ChEA_2013',
'Drug_Perturbations_from_GEO_2014',
'ENCODE_TF_ChIP-seq_2014',
'BioCarta_2013',
'Reactome_2013',
'WikiPathways_2013',
'Disease_Signatures_from_GEO_up_2014',
'KEGG_2013',
'TF-LOF_Expression_from_GEO',
'TargetScan_microRNA',
'PPI_Hub_Proteins',
'GO_Molecular_Function_2015',
'GeneSigDB',
'Chromosome_Location',
'Human_Gene_Atlas',
'Mouse_Gene_Atlas',
'GO_Cellular_Component_2015',
'GO_Biological_Process_2015',
'Human_Phenotype_Ontology',
'Epigenomics_Roadmap_HM_ChIP-seq',
'KEA_2013',
'NURSA_Human_Endogenous_Complexome',
'CORUM',
'SILAC_Phosphoproteomics',
'MGI_Mammalian_Phenotype_Level_3',
'MGI_Mammalian_Phenotype_Level_4',
'Old_CMAP_up',
'Old_CMAP_down',
'OMIM_Disease',
'OMIM_Expanded',
'VirusMINT',
'MSigDB_Computational',
'MSigDB_Oncogenic_Signatures',
'Disease_Signatures_from_GEO_down_2014',
'Virus_Perturbations_from_GEO_up',
'Virus_Perturbations_from_GEO_down',
'Cancer_Cell_Line_Encyclopedia',
'NCI-60_Cancer_Cell_Lines',
'Tissue_Protein_Expression_from_ProteomicsDB',
'Tissue_Protein_Expression_from_Human_Proteome_Map',
'HMDB_Metabolites',
'Pfam_InterPro_Domains',
'GO_Biological_Process_2013',
'GO_Cellular_Component_2013',
'GO_Molecular_Function_2013',
'Allen_Brain_Atlas_up',
'ENCODE_TF_ChIP-seq_2015',
'ENCODE_Histone_Modifications_2015',
'Phosphatase_Substrates_from_DEPOD',
'Allen_Brain_Atlas_down',
'ENCODE_Histone_Modifications_2013',
'Achilles_fitness_increase',
'Achilles_fitness_decrease',
'MGI_Mammalian_Phenotype_2013',
'BioCarta_2015',
'HumanCyc_2015',
'KEGG_2015',
'NCI-Nature_2015',
'Panther_2015',
'WikiPathways_2015',
'Reactome_2015',
'ESCAPE',
'HomoloGene',
'Disease_Perturbations_from_GEO_down',
'Disease_Perturbations_from_GEO_up',
'Drug_Perturbations_from_GEO_down',
'Genes_Associated_with_NIH_Grants',
'Drug_Perturbations_from_GEO_up',
'KEA_2015',
'Single_Gene_Perturbations_from_GEO_up',
'Single_Gene_Perturbations_from_GEO_down',
'ChEA_2015',
'dbGaP',
'LINCS_L1000_Chem_Pert_up',
'LINCS_L1000_Chem_Pert_down',
'GTEx_Tissue_Sample_Gene_Expression_Profiles_down',
'GTEx_Tissue_Sample_Gene_Expression_Profiles_up',
'Ligand_Perturbations_from_GEO_down',
'Aging_Perturbations_from_GEO_down',
'Aging_Perturbations_from_GEO_up',
'Ligand_Perturbations_from_GEO_up',
'MCF7_Perturbations_from_GEO_down',
'MCF7_Perturbations_from_GEO_up',
'Microbe_Perturbations_from_GEO_down',
'Microbe_Perturbations_from_GEO_up',
'LINCS_L1000_Ligand_Perturbations_down',
'LINCS_L1000_Ligand_Perturbations_up',
'LINCS_L1000_Kinase_Perturbations_down',
'LINCS_L1000_Kinase_Perturbations_up',
'Reactome_2016',
'KEGG_2016',
'WikiPathways_2016',
'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X',
'Kinase_Perturbations_from_GEO_down',
'Kinase_Perturbations_from_GEO_up',
'BioCarta_2016',
'Humancyc_2016',
'NCI-Nature_2016',
'Panther_2016']
For more details, please track the official links: http://amp.pharm.mssm.edu/Enrichr/
5.3. Use replot
Command, or replot()
You may also want to use replot()
to reproduce GSEA
desktop plots.
The only input of replot()
is the directory of GSEA
desktop output.
The input directory(e.g. gsea), must contained edb folder, gseapy need 4 data files inside edb folder.The gsea document tree looks like this:
gsea
└─edb
└─test.cls
└─gene_sets.gmt
└─gsea_data.rnk
└─results.edb
After this, you can start to run gseapy.
import gseapy
gseapy.replot(indir ='gsea', outdir = 'gseapy_out')
If you prefer to run in command line, it’s more simple.
gseapy replot -i gsea -o gseapy_out