Data Repository: PubChem
About this recipe
Author: Sunghwan Kim
Reviewer: Sam Munday
Topics: The PubChem database, PubChem data types, PubChem tutorial
Format: Markdown file
Scenarios: Retrieve chemical data from an online database
Skills: You should be familiar with
Learning outcomes: After completing this recipe you should understand:
The kinds of data that PubChem makes available
What a PubChem Summary page is and how to access it
Available PubChem tools (via tutorial)
Citation: ‘Data Repository: PubChem’, Sunghwan Kim, The IUPAC FAIR Chemistry Cookbook, Contributed: 2023-02-28 https://w3id.org/ifcc/IFCC004.
Reuse: This notebook is made available under a CC-BY-4.0 license.
What is PubChem?
PubChem (https://pubchem.ncbi.nlm.nih.gov) [KCC+23] is a very popular chemistry information resource for biomedical research communities in many areas, including cheminformatics, chemical biology, medicinal chemistry, and drug discovery. PubChem’s information content, collected from hundreds of data sources, is organized into multiple data collections, including Substance, Compound, BioAssay, Gene, Protein, Pathway, Cell Line, Taxonomy, and Patent [KCH+22].
Substance archives the chemical data submitted by individual data sources and Compound stores the unique chemical structures extracted from Substance through chemical structure standardization. BioAssay contains biological assay descriptions and test results deposited by assay data providers. The record identifiers (IDs) used in Substance, Compound, and BioAssay are called Substance ID (SID), Compound ID (CID), and Assay ID (AID), respectively. The other data collections (i.e., Gene, Protein, Pathway, Cell Line, Taxonomy, and Patent) provide alternative views of PubChem data, related to a specific gene, protein, pathway, cell line, taxon, and patent document, respectively. Each record in the data collections has a dedicated web page (called a Summary page), which presents information available in PubChem for that record. This page also presents relevant annotations collected by PubChem from authoritative data sources. Here are some example Summary pages for PubChem records.
Compound (CID 2244, aspirin):
https://pubchem.ncbi.nlm.nih.gov/compound/2244
Substance (SID 829042, depositor-provided structure of aspirin):
https://pubchem.ncbi.nlm.nih.gov/substance/829042
Assay (AID 463075, high-throughput assay to identify inhibitors of TNF-alpha-induced cell death):
https://pubchem.ncbi.nlm.nih.gov/bioassay/463075
Gene (human tumor necrosis factor (TNF); NCBI GeneID 7124):
https://pubchem.ncbi.nlm.nih.gov/gene/7124
Protein (mouse Cytochrome P450 1A1 (CYP1A1); NCBI accession P00184):
https://pubchem.ncbi.nlm.nih.gov/protein/P00184
Pathway (Glycolysis in human; Reactome ID R-HSA-70171):
https://pubchem.ncbi.nlm.nih.gov/pathway/Reactome:R-HSA-70171
Cell Line (Michigan Cancer Foundation-7 (MCF-7) breast cancer cell line):
https://pubchem.ncbi.nlm.nih.gov/cell/mcf-7
Taxonomy (Saccharomyces cerevisiae (baker’s yeast); NCBI Taxonomy ID 4932):
https://pubchem.ncbi.nlm.nih.gov/taxonomy/4932
Patent (US Patent US-2021379090-A1):
https://pubchem.ncbi.nlm.nih.gov/patent/US-2021379090-A1
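The example URLs share one pattern: the PubChem base address, a path segment for the data collection, and the record identifier. The following Python sketch reproduces that pattern. The names `summary_url` and `SUMMARY_PATHS` are hypothetical helpers introduced here for illustration, and the path segments are inferred only from the examples listed in this section, so treat it as an illustration rather than an official URL specification.

```python
BASE_URL = "https://pubchem.ncbi.nlm.nih.gov"

# Path segment for each data collection, inferred from the example URLs
# in this recipe (an assumption, not an official specification).
SUMMARY_PATHS = {
    "Compound": "compound",
    "Substance": "substance",
    "BioAssay": "bioassay",
    "Gene": "gene",
    "Protein": "protein",
    "Pathway": "pathway",
    "Cell Line": "cell",
    "Taxonomy": "taxonomy",
    "Patent": "patent",
}


def summary_url(collection, identifier):
    """Return the Summary page URL for a record in the given collection."""
    try:
        path = SUMMARY_PATHS[collection]
    except KeyError:
        raise ValueError(f"Unknown PubChem collection: {collection!r}") from None
    return f"{BASE_URL}/{path}/{identifier}"


# Build the Summary page URLs for aspirin (CID 2244) and a Reactome pathway.
print(summary_url("Compound", 2244))
print(summary_url("Pathway", "Reactome:R-HSA-70171"))
```

Keeping the collection-to-path mapping in one dictionary means a mistyped collection name fails with a clear error instead of silently producing a broken link.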
PubChem Tutorials
For novice users, an interactive online PubChem tutorial is available at the following webpage:
https://www.nlm.nih.gov/oet/ed/pubchem/tutorial/index.html
In addition, the following paper [Kim21] provides step-by-step instructions on how to explore data contained in PubChem, along with examples of commonly requested tasks.
Kim S. Exploring Chemical Information in PubChem. Curr. Protoc.; 2021 Aug 9; 1(8):e217. doi: https://doi.org/10.1002/cpz1.217.
[PubMed PMID: 34370395] [PubMed Central PMCID: PMC8363119] [Free Full Text]
This paper includes several protocols designed to help users become familiar with PubChem’s data and tools:
Basic Protocol 1: Finding genes and proteins that interact with a given compound
Basic Protocol 2: Finding drug-like compounds similar to a query compound through a two-dimensional (2-D) similarity search
Basic Protocol 3: Finding compounds similar to a query compound through a three-dimensional (3-D) similarity search
Support Protocol: Computing similarity scores between compounds
Basic Protocol 4: Getting the bioactivity data for the hit compounds from substructure search
Basic Protocol 5: Finding drugs that target a particular gene
Basic Protocol 6: Getting bioactivity data of all chemicals tested against a protein
Basic Protocol 7: Finding compounds annotated with classifications or ontological terms
Basic Protocol 8: Finding stereoisomers and isotopomers of a compound through identity search
Finally, one of the developers at PubChem, Dr. Sunghwan Kim, has developed some tutorials about the PubChem API, the Power User Gateway - Representational State Transfer (PUG-REST) service.
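To give a flavor of what a PUG-REST request looks like, the Python sketch below builds a request URL for computed properties of a compound and, optionally, retrieves the result. The function names `property_url` and `fetch_properties` are hypothetical helpers. The URL layout (`/rest/pug/compound/cid/<CID>/property/<properties>/<output>`) and the property names `MolecularFormula` and `MolecularWeight` are assumed to follow the PUG-REST documentation and should be checked against it before use.

```python
import json
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties, output="JSON"):
    """Build a PUG-REST URL requesting properties for one compound (by CID)."""
    prop_list = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid}/property/{prop_list}/{output}"


def fetch_properties(cid, properties):
    """Send the request and return the parsed JSON (requires internet access)."""
    with urlopen(property_url(cid, properties), timeout=30) as response:
        return json.load(response)


# Build the request URL for aspirin (CID 2244) without contacting the server.
url = property_url(2244, ["MolecularFormula", "MolecularWeight"])
print(url)

# Uncomment to query PubChem over the network:
# print(fetch_properties(2244, ["MolecularFormula", "MolecularWeight"]))
```

Separating URL construction from the network call lets the request be inspected, or pasted into a browser, before any data is downloaded.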
References
- Kim21
Sunghwan Kim. "Exploring chemical information in PubChem". Current Protocols, 1(8):e217, 2021. URL: https://doi.org/10.1002/cpz1.217.
- KCC+23
Sunghwan Kim, Jie Chen, Tiejun Cheng, Asta Gindulyte, Jia He, Siqian He, Qingliang Li, Benjamin A. Shoemaker, Paul A. Thiessen, Bo Yu, Leonid Zaslavsky, Jian Zhang, and Evan E. Bolton. "PubChem 2023 update". Nucleic Acids Research, 51(D1):D1373–D1380, 2023. URL: https://doi.org/10.1093/nar/gkac956.
- KCH+22
Sunghwan Kim, Tiejun Cheng, Siqian He, Paul A. Thiessen, Qingliang Li, Asta Gindulyte, and Evan E. Bolton. "PubChem Protein, Gene, Pathway, and Taxonomy data collections: bridging biology and chemistry through target-centric views of PubChem data". Journal of Molecular Biology, 434(11):167514, 2022. URL: https://doi.org/10.1016/j.jmb.2022.167514.