Data Repository: PubChem

What is PubChem?

PubChem (https://pubchem.ncbi.nlm.nih.gov) [KCC+23] is a widely used chemical information resource that serves biomedical research communities in areas such as cheminformatics, chemical biology, medicinal chemistry, and drug discovery. PubChem’s content, collected from hundreds of data sources, is organized into multiple data collections, including Substance, Compound, BioAssay, Gene, Protein, Pathway, Cell Line, Taxonomy, and Patent [KCH+22].

Substance archives the chemical data submitted by individual data sources, and Compound stores the unique chemical structures extracted from Substance through chemical structure standardization. BioAssay contains biological assay descriptions and test results deposited by assay data providers. The record identifiers (IDs) used in Substance, Compound, and BioAssay are called Substance ID (SID), Compound ID (CID), and Assay ID (AID), respectively. The other data collections (i.e., Gene, Protein, Pathway, Cell Line, Taxonomy, and Patent) provide alternative views of PubChem data, each centered on a specific gene, protein, pathway, cell line, taxon, or patent document, respectively. Each record in a data collection has a dedicated web page (called a Summary page), which presents the information available in PubChem for that record, together with relevant annotations that PubChem collects from authoritative data sources. Here are some example Summary pages for PubChem records.
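Because each record has its own Summary page, the page address can be derived from the record identifier. The sketch below is a minimal illustration, not an official PubChem utility: the function name `summary_page_url` and the path segments (`compound`, `substance`, `bioassay`) are assumptions based on the public layout of the PubChem website and are not stated in the text above. It covers only the three collections whose identifiers (CID, SID, AID) are named in this section.

```python
# Minimal sketch: build Summary page URLs from PubChem record identifiers.
# The URL layout below is an assumption based on the public PubChem website.
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov"

# Identifier type -> path segment assumed for the corresponding collection.
COLLECTION_PATHS = {
    "CID": "compound",   # Compound ID
    "SID": "substance",  # Substance ID
    "AID": "bioassay",   # Assay ID
}


def summary_page_url(id_type: str, record_id: int) -> str:
    """Return the assumed Summary page URL for a CID, SID, or AID."""
    key = id_type.upper()
    if key not in COLLECTION_PATHS:
        raise ValueError(f"Unsupported identifier type: {id_type!r}")
    if record_id <= 0:
        raise ValueError("Record identifiers must be positive integers")
    return f"{BASE_URL}/{COLLECTION_PATHS[key]}/{record_id}"


# CID 2244 is the Compound record for aspirin.
print(summary_page_url("CID", 2244))
```

The other collections (Gene, Protein, Pathway, Cell Line, Taxonomy, and Patent) use identifiers that are not plain integers, so they are left out of this sketch.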

PubChem Tutorials

For novice users, an interactive online PubChem tutorial is available at the following webpage:

https://www.nlm.nih.gov/oet/ed/pubchem/tutorial/index.html

In addition, the following paper [Kim21] provides step-by-step instructions on how to explore data contained in PubChem, along with examples of commonly requested tasks.

Kim S. Exploring Chemical Information in PubChem. Curr. Protoc.; 2021 Aug 9; 1(8):e217. doi: https://doi.org/10.1002/cpz1.217.
[PubMed PMID: 34370395] [PubMed Central PMCID: PMC8363119] [Free Full Text]

This paper includes several protocols designed to help users become familiar with PubChem’s data and tools.

Finally, Dr. Sunghwan Kim, one of the developers at PubChem, has written tutorials on PubChem's programmatic interface, the Power User Gateway - Representational State Transfer (PUG-REST) service.
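To give a sense of what a PUG-REST request looks like, here is a minimal sketch that uses only the Python standard library. It is an illustration under stated assumptions rather than a replacement for the tutorials: the helper names `compound_property_url` and `fetch_compound_properties` are hypothetical, and the URL pattern (`/rest/pug/compound/cid/<CID>/property/<list>/JSON`) and the `PropertyTable` response layout should be checked against the current PUG-REST documentation before use.

```python
import json
from typing import List, Sequence
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base address of the PUG-REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid: int, properties: Sequence[str]) -> str:
    """Build a PUG-REST URL requesting properties for one CID as JSON."""
    if cid <= 0:
        raise ValueError("CID must be a positive integer")
    if not properties:
        raise ValueError("At least one property name is required")
    # Property names are joined with commas in a single path segment.
    prop_list = ",".join(quote(name) for name in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid}/property/{prop_list}/JSON"


def fetch_compound_properties(
    cid: int, properties: Sequence[str], timeout: float = 30.0
) -> List[dict]:
    """Send the request and return the list of property records.

    Requires network access. Assumes the response has the shape
    {"PropertyTable": {"Properties": [...]}}.
    """
    url = compound_property_url(cid, properties)
    with urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return payload["PropertyTable"]["Properties"]


# Building the URL needs no network access.
print(compound_property_url(2244, ["MolecularFormula", "MolecularWeight"]))
```

Calling `fetch_compound_properties(2244, ["MolecularFormula", "MolecularWeight"])` would send the request itself. When issuing many requests in a loop, pause between calls to stay within PubChem's usage policy.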

References

Kim21

Sunghwan Kim. "Exploring Chemical Information in PubChem". Current Protocols, 1(8):e217, 2021. URL: https://doi.org/10.1002/cpz1.217.

KCC+23

Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A. Shoemaker, Paul A. Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang, and Evan E. Bolton. "PubChem 2023 update". Nucleic Acids Research, 51(D1):D1373–D1380, 2023. URL: https://doi.org/10.1093/nar/gkac956.

KCH+22

Sunghwan Kim, Tiejun Cheng, Siqian He, Paul A. Thiessen, Qingliang Li, Asta Gindulyte, and Evan E. Bolton. "PubChem Protein, Gene, Pathway, and Taxonomy data collections: bridging biology and chemistry through target-centric views of PubChem data". Journal of Molecular Biology, 434(11):167514, 2022. URL: https://doi.org/10.1016/j.jmb.2022.167514.