7. Frequently Asked Questions

7.1. Q: What kind of gene identifiers are supported in GSEApy?

A:

  • If you use an Enrichr library as your input gene_sets (GMT format), gene symbols must be in upper case.

  • If you use your own GMT file, the gene identifiers in the GMT file and in your input gene list must be of the same type.

7.2. Q: Why are gene symbols in Enrichr libraries all upper case for mouse, fly, fish, and worm?

A: GSEApy cannot change the Enrichr databases. Convert your gene symbols to upper case first, then run the analysis you want.
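A minimal sketch of that conversion in plain Python. The helper name to_upper_symbols and the example symbols are illustrative only, and dropping duplicates is an extra assumption rather than something GSEApy requires:

```python
def to_upper_symbols(genes):
    """Return gene symbols stripped, upper-cased, and de-duplicated in order."""
    seen = set()
    result = []
    for gene in genes:
        symbol = str(gene).strip().upper()
        # Skip empty entries and symbols already seen (assumption: duplicates are unwanted).
        if symbol and symbol not in seen:
            seen.add(symbol)
            result.append(symbol)
    return result


# Made-up mouse-style symbols, including stray whitespace and a case variant.
mouse_genes = ["Trp53", "Myc", " Gapdh ", "myc"]
upper_genes = to_upper_symbols(mouse_genes)
print(upper_genes)
```

The resulting list can then be passed as the input gene list when an Enrichr library is used as gene_sets.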

7.3. Q: Why is the P-value or FDR 0 rather than a very small number?

A: The GSEA methodology uses a random permutation procedure (e.g. 1000 permutations) to obtain a null distribution. The observed ES is then compared to the 1000 shuffled ESs to calculate a P-value. When the observed ES is more extreme than every null ES, you get 0. If you do not want 0, you could

  • set the smallest P-value to 1 / (number of permutations)

  • increase the number of permutations (at the cost of a longer running time)
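The first option can be sketched as follows. This is a simplified illustration and not GSEApy's internal code: the function name empirical_pvalue, the floor_zero flag, and the null scores are made up, and it assumes the observed ES is compared only against null scores of the same sign.

```python
def empirical_pvalue(observed_es, null_es, floor_zero=True):
    """Permutation P-value: fraction of same-sign null ESs at least as extreme."""
    n_perm = len(null_es)
    if observed_es >= 0:
        same_sign = [es for es in null_es if es >= 0]
        extreme = sum(1 for es in same_sign if es >= observed_es)
    else:
        same_sign = [es for es in null_es if es < 0]
        extreme = sum(1 for es in same_sign if es <= observed_es)
    pval = extreme / len(same_sign) if same_sign else 0.0
    # Replace an exact 0 with the smallest value the permutation count can resolve.
    if floor_zero and pval == 0.0:
        pval = 1.0 / n_perm
    return pval


# Ten made-up null scores stand in for the permutation results.
null_scores = [-0.3, -0.1, 0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.2, 0.15]
print(empirical_pvalue(0.6, null_scores, floor_zero=False))  # no null ES reaches 0.6
print(empirical_pvalue(0.6, null_scores))                    # floored to 1 / 10
```

With 1000 permutations the floor would be 0.001, which is why increasing the number of permutations is the only way to report smaller values.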

7.4. Q: What do gene % and tag % mean in the output?

\[\text{gene \%} = \text{(position of the gene corresponding to the enrichment score peak)} \div \text{(number of all genes in the ranked list)}\]
\[\text{tag \%} = \text{(number of leading-edge genes)} \div \text{(number of genes in the pathway that overlap with the input ranked list)}\]
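A small worked example of the two ratios, returned as fractions. The function name, the toy genes, and the running enrichment scores are illustrative; counting from the end of the list for a negative peak is an assumption about the convention, not something stated here.

```python
def gene_and_tag_fraction(ranked_genes, gene_set, running_es):
    """Return (gene fraction, tag fraction) for one gene set."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_overlap = sum(hits)  # genes of the set that are present in the ranked list
    # Index of the enrichment score peak (largest deviation from zero).
    peak = max(range(len(running_es)), key=lambda i: abs(running_es[i]))
    if running_es[peak] >= 0:
        position = peak + 1              # counted from the top of the list
        n_leading = sum(hits[:peak + 1])
    else:
        position = len(running_es) - peak  # assumption: counted from the bottom
        n_leading = sum(hits[peak:])
    return position / len(ranked_genes), n_leading / n_overlap


ranked = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
pathway = {"A", "C", "D", "H"}
# Unweighted running sum for this toy case: +1/4 on a hit, -1/6 on a miss.
res = [0.25, 0.0833, 0.3333, 0.5833, 0.4167, 0.25, 0.0833, 0.3333, 0.1667, 0.0]
gene_frac, tag_frac = gene_and_tag_fraction(ranked, pathway, res)
print(gene_frac, tag_frac)
```

Here the peak sits at the 4th of 10 genes, so the gene fraction is 4/10, and 3 of the 4 overlapping pathway genes lie at or before the peak, so the tag fraction is 3/4.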

7.5. Q: Which Enrichr databases are supported?

A: GSEApy supports modEnrichr (https://amp.pharm.mssm.edu/modEnrichr/). Human, Mouse, Fly, Yeast, Worm, and Fish are currently supported.

7.6. Q: How do I use custom-defined gene sets (GMT) as input in Jupyter?

A: The gene_sets argument accepts dict input. This is useful when you define your own gene sets. An example dict looks like this:

gene_sets = {
    "term_1": ["gene_A", "gene_B", ...],
    "term_2": ["gene_B", "gene_C", ...],
    # ...
    "term_100": ["gene_A", "gene_T", ...],
}

APIs that accept dict input: gsea, prerank, ssgsea, enrichr
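If your gene sets already live in a GMT file, a dict of this shape can be built with a few lines of plain Python. This is a sketch that assumes the usual tab-separated GMT layout (term, description, then genes); read_gmt is a hypothetical helper, not a GSEApy function, and the file content is made up.

```python
import os
import tempfile


def read_gmt(path):
    """Parse a tab-separated GMT file into {term: [gene, ...]}."""
    gene_sets = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue  # skip blank or malformed lines
            term, _description, *genes = fields
            gene_sets[term] = [gene for gene in genes if gene]
    return gene_sets


# Write a tiny GMT file with made-up terms and genes, then read it back.
gmt_text = (
    "term_1\tdescription 1\tgene_A\tgene_B\n"
    "term_2\tdescription 2\tgene_B\tgene_C\n"
)
with tempfile.NamedTemporaryFile(
    "w", suffix=".gmt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(gmt_text)
    gmt_path = tmp.name

gene_sets = read_gmt(gmt_path)
os.remove(gmt_path)
print(gene_sets)
```

Building the dict yourself also makes it easy to filter or rename terms before handing it to one of the APIs listed above.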

7.7. Q: How do I use the Yeast database in gseapy.enrichr()?

A: Because some library names are the same across different Enrichr databases, you have to set the additional argument organism when you are not using Human.

import gseapy

# list the library names available for Yeast
gss = gseapy.get_library_name(organism='Yeast')
enr = gseapy.enrichr(gene_list=...,
                     gene_sets=gss,
                     organism='Yeast',  # don't forget to set organism='Yeast'
                     )

7.8. Q: How do I use the Yeast database in gseapy.prerank()?

A: There is no organism argument in prerank, gsea, or ssgsea, but you can pass these Enrichr libraries as follows:

import gseapy

# get the libraries you'd like to use
gss = gseapy.get_library_name(organism='Yeast')
# get a custom gmt_dict
gmt_dict = gseapy.parser.gsea_gmt_parser('GO_Biological_Process_2018', organism='Yeast')
# run
prn_res = gseapy.prerank(..., gene_sets=gmt_dict, ...)

7.9. Q: How do I save plots made with gseaplot, barplot, dotplot, or heatmap in Jupyter?

A: Pass the ofname argument, e.g. gseaplot(..., ofname='your.plot.pdf'). That is all.

7.10. Q: What does cutoff mean in functions like enrichr(), dotplot, and barplot?

A: This argument controls which terms (e.g. FDR < 0.05) are shown in figures; it does not filter the result table output.
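The distinction can be illustrated with a toy result table. This is only a sketch of the idea, not GSEApy's plotting code: terms_for_figure is a hypothetical helper, the values are made up, and the strict "<" comparison follows the example above.

```python
def terms_for_figure(rows, column="Adjusted P-value", cutoff=0.05):
    """Return only the rows whose value in `column` is below `cutoff`."""
    return [row for row in rows if row[column] < cutoff]


# Made-up result table with three terms.
result_table = [
    {"Term": "term_1", "Adjusted P-value": 0.001},
    {"Term": "term_2", "Adjusted P-value": 0.04},
    {"Term": "term_3", "Adjusted P-value": 0.30},
]

shown = terms_for_figure(result_table)
print([row["Term"] for row in shown])  # terms that would appear in the figure
print(len(result_table))               # the table itself still has every term
```

To restrict the table as well, filter it yourself after the analysis.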

7.11. Q: Why is ssGSEA missing the P-value and FDR?

A: The original ssGSEA algorithm does not produce a P-value or FDR, so please ignore the gseaplot generated by ssgsea. Such a plot is not meaningful and can be misleading, which is why fdr and pval are not shown on it. If you are looking for ssGSEA with P-value output, see https://github.com/broadinstitute/ssGSEA2.0. ssGSEA2.0 uses the same method as GSEApy to calculate the P-value, but not the FDR.

7.12. Q: What is the difference between ssGSEA and Prerank?

A: In short,

  • prerank is used for comparing two groups of samples (e.g. control and treatment), where the gene ranking is defined by your own ranking metric (e.g. t-statistic, signal-to-noise).

  • ssGSEA is used for comparing individual samples to all the others, to find gene signatures that samples share (use ssGSEA when you have many samples).

The statistics used by prerank (GSEA) and ssGSEA are different. Assume that we have calculated the running enrichment score at each position of your ranked input genes, then

  • es for GSEA: max(running enrichment scores) or min(running enrichment scores)

  • es for ssGSEA: sum(running enrichment scores)
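The two summaries can be compared on a toy example. This sketch uses an unweighted running sum (equal steps for every hit and every miss), which is a simplifying assumption; the actual methods weight the steps by the ranking metric or by rank. The function names and genes are illustrative.

```python
def running_enrichment_scores(ranked_genes, gene_set):
    """Unweighted running sum: step up on a hit, step down on a miss."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(hits) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need at least one hit and one miss")
    running, total = [], 0.0
    for is_hit in hits:
        total += 1.0 / n_hit if is_hit else -1.0 / n_miss
        running.append(total)
    return running


def es_gsea(running):
    """Maximum deviation from zero: the max or the min, whichever is more extreme."""
    return max(running, key=abs)


def es_ssgsea(running):
    """Sum of the running enrichment scores."""
    return sum(running)


ranked = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
pathway = {"A", "C", "D", "H"}
res = running_enrichment_scores(ranked, pathway)
print(round(es_gsea(res), 4))    # peak of the running sum
print(round(es_ssgsea(res), 4))  # sum over all positions
```

For this toy ranking the peak is reached at gene D, while the ssGSEA-style score accumulates the whole curve, so the two numbers are not on the same scale.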