Accessing PubChem through PUG-REST: Part III
About this interactive recipe
Author(s): Sunghwan Kim
Reviewer: Samuel Munday
Topic(s): How to retrieve chemical data using chemical identifiers.
Format: Interactive Jupyter Notebook (Python)
Scenario: You need to access and potentially download chemical data.
Skills: You should be familiar with:
Learning outcomes:
How to access PubChem chemical data using chemical identifiers
How to search PubChem using 2-D and 3-D molecular similarity
How to search PubChem using substructures and superstructures
Citation: ‘Accessing PubChem through PUG-REST - Part III’, Sunghwan Kim, The IUPAC FAIR Chemistry Cookbook, Contributed: 2023-02-28 https://w3id.org/ifcc/IFCC008.
Reuse: This notebook is made available under a CC-BY-4.0 license.
import requests
import time
import io
import csv
from IPython.display import Image, display
1. Using a SMILES or InChI string as an input query
smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/" + smiles + "/cids/txt").text.strip())
2244
Some SMILES strings contain characters that are not compatible with the PUG-REST request URL syntax. For example, isomeric SMILES uses the “/” character (forward slash) to represent the E/Z or cis/trans stereochemistry of a molecule. However, because the “/” character is also used in the request URL to separate the segments of the URL path, using such a SMILES string as an input structure in the URL path will result in an error.
smiles = "CC(C)C1=NC(=NC(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)N(C)S(=O)(=O)C"
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/" + smiles + "/cids/txt").text.strip())
Status: 400
Code: PUGREST.BadRequest
Message: Unable to standardize the given structure - perhaps some special characters need to be escaped or data packed in a MIME form?
Detail: error:
Detail: status: 400
Detail: output: Caught ncbi::CException: Standardization failed
Detail: Output Log:
Detail: Record 1: Warning: Cactvs Ensemble cannot be created from input string
Detail: Record 1: Error: Unable to convert input into a compound object
Detail:
Detail:
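The error text above follows a simple `Key: value` layout, so a script can recognize it instead of mistaking it for a list of CIDs. The following is a minimal sketch, not part of the original recipe: `parse_pug_error` is a hypothetical helper, and it assumes that plain-text PUG-REST errors always begin with `Status:` and `Code:` lines as in the output shown here.

```python
def parse_pug_error(text):
    """Return the Status/Code/Message fields if text looks like a PUG-REST error, else None.

    Assumption: error responses in plain text start with 'Status:' and 'Code:' lines,
    as in the example output above.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # no colon on this line, so it cannot be a 'Key: value' field
        key = key.strip()
        if key in ("Status", "Code", "Message") and key not in fields:
            fields[key] = value.strip()
    if "Status" in fields and "Code" in fields:
        return fields
    return None


# Text copied from the error output above (shortened to the first few lines).
sample = """Status: 400
Code: PUGREST.BadRequest
Message: Unable to standardize the given structure - perhaps some special characters need to be escaped or data packed in a MIME form?
Detail: error:
Detail: status: 400"""

error = parse_pug_error(sample)
print(error["Status"], error["Code"])
print(parse_pug_error("2244"))  # a normal CID response is not treated as an error
```

When you use the `requests` library, checking the `status_code` attribute of the response object is usually the simpler test; the parser above is only useful when you have nothing but the response text.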
To circumvent this issue, the SMILES input should be provided in one of the following two ways:
as a URL parameter
in the HTTP request body (using the HTTP POST method).
smiles = "CC(C)C1=NC(=NC(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)N(C)S(=O)(=O)C"
# As a URL parameter
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/txt" + "?smiles=" + smiles).text.strip())
# In the HTTP request body (using HTTP POST)
print(requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/smiles/cids/txt", data={'smiles':smiles}).text.strip())
446157
446157
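Besides “/”, SMILES strings can contain other characters with a special meaning in URLs, such as “#” (triple bond), which starts a URL fragment, and “+” (positive charge), which is often decoded as a space in query strings. A hedged sketch of percent-encoding the SMILES before appending it as a URL parameter is given below. The helper name `smiles_query_url` is hypothetical, and the sketch assumes that PubChem decodes percent-encoded parameter values in the standard way.

```python
from urllib.parse import quote, unquote

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"


def smiles_query_url(smiles):
    """Build a request URL that carries the SMILES as a percent-encoded URL parameter."""
    # safe="" makes quote() encode "/" as well, which it leaves untouched by default.
    encoded = quote(smiles, safe="")
    return f"{BASE}/smiles/cids/txt?smiles={encoded}"


smiles = "CC(C)C1=NC(=NC(=C1/C=C/[C@H](C[C@H](CC(=O)O)O)O)C2=CC=C(C=C2)F)N(C)S(=O)(=O)C"
url = smiles_query_url(smiles)
print(url)

# Decoding the parameter value gives back the original SMILES string.
print(unquote(url.split("?smiles=", 1)[1]) == smiles)
```

With `requests`, passing `params={'smiles': smiles}` to `requests.get()` lets the library do this encoding for you.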
InChI encodes the chemical structure information into multiple layers and sublayers, separated by the “/” character. For this reason, InChI strings should also be provided as a URL parameter or in the HTTP request body (using HTTP POST).
inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
# With the request URL : WILL NOT WORK
#print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/" + inchi + "/cids/txt").text.strip())
# As a URL parameter
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/cids/txt" + "?inchi=" + inchi).text.strip())
# In the HTTP request body (using HTTP POST)
print(requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchi/cids/txt", data={'inchi':inchi}).text.strip())
2244
2244
2. Performing identity search
smiles = "CC(C)/C=C/I"
# Compounds with the same stereochemistry and isotopism (default)
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/14571425/cids/txt").text.strip())
print(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/cid/14571425/cids/txt?identity_type=same_stereo_isotope").text.strip())
14571425
14571425
# Compounds with the same isotopism (stereochemistry can be different)
cids1 = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/smiles/cids/txt?identity_type=same_isotope", data={'smiles':smiles}).text.strip().split()
print(cids1)
for mycid in cids1:
    display(Image(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/record/PNG?image_size=200x200").content))
    print("CID " + mycid, ":", requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/property/IsomericSMILES/TXT").text)
    time.sleep(0.2)
['14571425', '14571426', '71380237']
CID 14571425 : CC(C)/C=C/I
CID 14571426 : CC(C)/C=C\I
CID 71380237 : CC(C)C=CI
# Compounds with the same stereochemistry (isotopism can be different)
cids2 = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/smiles/cids/txt?identity_type=same_stereo", data={'smiles':smiles}).text.strip().split()
print(cids2)
for mycid in cids2:
    display(Image(requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/record/PNG?image_size=200x200").content))
    print("CID " + mycid, ":", requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/" + mycid + "/property/IsomericSMILES/TXT").text)
    time.sleep(0.2)
['14571425', '118122558']
CID 14571425 : CC(C)/C=C/I
CID 118122558 : [2H]C[C@@H](C)/C=C/I
# Compounds with the same connectivity (stereochemistry and isotopism can be different)
cids3 = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastidentity/smiles/cids/txt?identity_type=same_connectivity", data={'smiles':smiles}).text.strip().split()
print(cids3) # All compounds in cids1 and cids2 are returned.
['14571425', '14571426', '71380237', '118122558', '123616558']
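The relationship between the three result lists can be checked with Python sets. The sketch below makes no new requests; it hard-codes the CID lists printed above, so the values only reflect PubChem's contents at the time this notebook was run.

```python
# CID lists copied from the outputs above.
cids1 = ['14571425', '14571426', '71380237']                              # same_isotope
cids2 = ['14571425', '118122558']                                         # same_stereo
cids3 = ['14571425', '14571426', '71380237', '118122558', '123616558']    # same_connectivity

same_isotope = set(cids1)
same_stereo = set(cids2)
same_connectivity = set(cids3)

# Compounds that match the query in both stereochemistry and isotopism.
print(sorted(same_isotope & same_stereo))

# Every hit from the two stricter searches also appears in the connectivity search.
print((same_isotope | same_stereo) <= same_connectivity)

# Hits found only by the connectivity search.
print(sorted(same_connectivity - (same_isotope | same_stereo)))
```

The single CID found only by the connectivity search appears in neither stricter list, which suggests it differs from the query in both stereochemistry and isotopism.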
3. Performing 2-D and 3-D similarity search
smiles = "CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))
print(cids)
609
['155903259', '162396372', '162396442', '162396452', '162396453', '162396458', '162396459', '162685338', '162712460', '162712462', '162712482', '162712489', '162712498', '163283236', '163283243', '163283284', '163361997', '163362001', '166149071', '166157194', '167203580', '167203608', '167331698', '167518234', '163421961', '162396450', '162396460', '162396461', '168290284', '162712471', '166157023', '166157057', '166157065', '166157205', '167331610', '167331612', '168870655', '157010397', '168301133', '168291677', '171347857', '171355745', '162478807', '162479130', '163283238', '163283239', '163283322', '163283330', '163283343', '163283370', '163283371', '163283387', '163283390', '163283407', '163341926', '163361849', '163361998', '163362005', '163362008', '163362009', '163362014', '163362016', '163362026', '163362029', '163362031', '163941851', '164040179', '164158215', '164159966', '164701733', '164701736', '164701738', '164701741', '164701742', '164701743', '164701748', '164701749', '164701757', '164701758', '164701760', '164701766', '164701772', '164701773', '164701775', '164701776', '164701777', '164701789', '164701794', '164701800', '164701805', '164701807', '164701812', '164701813', '164701815', '164701816', '164701817', '164701818', '164701819', '164701821', '164766036', '164766041', '164785132', '164850071', '164971185', '165179472', '165179485', '165179486', '165179488', '166149062', '166149074', '166149077', '166149089', '166149094', '166149137', '166156664', '166156828', '166156836', '166157017', '166157043', '166157215', '166157269', '166494662', '166494843', '166537627', '166840326', '166840392', '166840457', '166840466', '166840731', '166891533', '166891547', '166891596', '166891665', '166893182', '166893274', '166893275', '166893276', '166893348', '166893351', '167065080', '167065082', '167065154', '167065160', '167094315', '167094316', '167094334', '167094341', '167203581', '167206597', '167206598', '167206599', '167206807', '167207061', 
'167229685', '167251496', '167278770', '167331630', '167331639', '167338845', '167338850', '167338914', '167338917', '167338943', '167480975', '167480997', '167481005', '167481033', '167481034', '167481037', '167481072', '167481103', '167481130', '167481158', '167481199', '167481201', '167481202', '167481210', '167481232', '167481234', '167481250', '167481257', '167481266', '167481309', '167481324', '167481327', '167481349', '167481356', '167481364', '167481374', '167481379', '167481386', '167481398', '167481400', '167481430', '167481462', '167481474', '167481480', '167481535', '167481542', '167481547', '167481557', '167481565', '167481566', '167481575', '167481608', '167481637', '167481640', '167481663', '167481675', '167481747', '167481808', '167481821', '167481860', '167481876', '167481880', '167481900', '167481908', '167481917', '167481924', '167481927', '167481929', '167481944', '167481950', '167481970', '167481972', '167481999', '167482005', '167482039', '167482043', '167482074', '167482083', '167482106', '167482167', '167482178', '167482205', '167482212', '167482213', '167482218', '167482232', '167482240', '167482261', '167482283', '167482292', '167482295', '167482314', '167482317', '167518190', '167521394', '167579537', '168142141', '168142152', '168142166', '168142167', '168142213', '168750579', '168750586', '168750590', '168750593', '168878798', '168941969', '168942044', '169084224', '169122295', '169193197', '169193261', '169193357', '169193547', '169193552', '169193733', '169193751', '169193754', '169193839', '169193963', '169193968', '169194030', '169194114', '169194124', '169194140', '169194186', '169194204', '169194236', '169194258', '169194427', '169240696', '169240697', '169240701', '169283836', '169283867', '169283881', '169291133', '169436363', '169595627', '169595871', '169595876', '169595900', '169595902', '169595937', '169595991', '169595994', '169597577', '169655654', '169655658', '169655661', '169655677', '169655713', '169655746', 
'169655747', '169655787', '169655789', '169676887', '169686050', '169686051', '169707956', '169707957', '169732235', '169732236', '169734644', '169735016', '169737661', '169816596', '169860851', '169860855', '169860863', '169860866', '169860882', '169861043', '169861121', '169861122', '169861123', '169861126', '169861127', '169861128', '169861209', '169861410', '169861544', '169861545', '169861547', '169861552', '169861641', '169861643', '169861899', '169861924', '169862040', '169878303', '169878305', '169878307', '169878308', '169878350', '169878352', '169878353', '169878366', '169878439', '169878441', '169878442', '169878445', '169878446', '169878447', '169878448', '169878449', '169878451', '169878452', '169878453', '169878454', '169878455', '169878511', '169878512', '169878513', '169878514', '169878517', '169878631', '169878890', '169907886', '169907978', '169914647', '169914666', '169914913', '169966328', '169979396', '169979540', '169989970', '169994700', '169994701', '169994702', '169994703', '169994705', '169994706', '169994745', '169994895', '169994956', '169995108', '169995153', '169995154', '169995401', '169995678', '169995679', '169995904', '170005512', '170036051', '170036248', '170036541', '170036799', '170036842', '170036848', '170036858', '170036928', '170036943', '170036979', '170037210', '170037247', '170038496', '170052972', '170052975', '170053129', '170053614', '170053627', '170053669', '170054633', '170054800', '170060745', '170060748', '170151866', '170164590', '170164688', '170164691', '170164767', '170165331', '170165489', '170165491', '170165562', '170165698', '170165798', '170166112', '170166207', '170166212', '170166479', '170166597', '170166774', '170166889', '170166912', '170166968', '170167024', '170167108', '170167453', '170167521', '170167675', '170167936', '170168021', '170203998', '170404012', '170552067', '170552068', '170631558', '170631565', '170631680', '170631745', '170701828', '170701834', '170701836', '170701844', 
'170701848', '170701854', '170701858', '171108715', '171108717', '171530265', '171534254', '171534270', '171843250', '171843260', '172118140', '172118146', '172118147', '172118153', '172118162', '172118182', '172118188', '172118204', '172118214', '172118221', '172118226', '172118235', '172118239', '172118244', '172118277', '172118278', '172118286', '172118291', '172118304', '172118309', '172118317', '172118328', '172118346', '172118347', '172118353', '172118354', '172118355', '172118358', '172118360', '163285815', '166149063', '166156829', '167481104', '167482299', '167482300', '168010016', '168310836', '168310837', '168476190', '168878799', '169240702', '169436361', '171379688', '171394959', '171530266', '162712468', '162712490', '162712506', '163283283', '163362019', '164701798', '164701802', '164850073', '164850290', '166157157', '166964883', '166964886', '166964888', '166964890', '166964899', '166964903', '166964904', '166964908', '166964913', '166986000', '166986065', '166986067', '166986077', '167203321', '167481489', '167482130', '167537951', '167574477', '167574478', '167674037', '168093174', '168154285', '168993735', '169084219', '169084221', '169084228', '169095061', '169193396', '169193437', '169193895', '169193967', '169193984', '169194217', '169194293', '169194319', '169194330', '169194348', '169595592', '169595872', '169595897', '169595998', '169655830', '169728040', '169728041', '169728431', '169728432', '169736642', '169736643', '169862121', '169878349', '169914451', '169915447', '169915455', '169915574', '169995311', '170014641', '170036241', '170036499', '170036602', '170052965', '170053029', '170053921', '170080981', '170164555', '170165348', '170165794', '170165926', '170166075', '170167073', '170167420', '170167651', '170167680', '170168510', '170168543', '170200729', '170201013', '170291885', '172579449', '172579509', '172579567', '172579762', '172579767', '166642632', '168993728', '172463305']
You can adjust the similarity threshold using the optional parameter “Threshold”. The following request performs a 2-D similarity search with a tighter similarity threshold (99):
smiles = "CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/cids/txt?Threshold=99", data={'smiles':smiles}).text.strip().split()
print(len(cids))
print(cids)
44
['155903259', '162396459', '163362001', '166157194', '167331698', '168290284', '168301133', '163341926', '163361849', '163362005', '163362031', '164040179', '164701758', '164701760', '164701812', '164701813', '164701818', '167065082', '167094315', '167094316', '169240696', '169595994', '169686051', '169707956', '169707957', '169860851', '169860863', '169861121', '169861123', '169861544', '169861545', '169878303', '169878307', '169914913', '170060748', '170404012', '170701834', '171530265', '163285815', '168476190', '169240702', '171394959', '166642632', '172463305']
Note that using a higher threshold (99) than the default (90) returns fewer structures (44 instead of 609).
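If you want to compare several thresholds, it can help to build the request URL in one place. The sketch below is illustrative only: `similarity_2d_url` is a hypothetical helper, and the 0 to 100 range check is an assumption based on the threshold being expressed as a percentage. It only builds the URLs; the SMILES string would still be sent with `requests.post()` as in the cells above.

```python
from urllib.parse import urlencode

BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"


def similarity_2d_url(threshold=90):
    """Return the fastsimilarity_2d request URL for a given Threshold value.

    The default of 90 is the default threshold mentioned above.
    """
    if not 0 <= threshold <= 100:  # assumed valid range for a percentage
        raise ValueError("threshold must be between 0 and 100")
    query = urlencode({"Threshold": threshold})
    return f"{BASE}/fastsimilarity_2d/smiles/cids/txt?{query}"


for threshold in (90, 95, 99):
    print(similarity_2d_url(threshold))
```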
It is also possible to get line notations and molecular properties for the compounds returned from chemical structure search.
data = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_2d/smiles/property/HeavyAtomCount,MolecularFormula,IsomericSMILES/csv?Threshold=99", data={'smiles':smiles}).text.strip()
print(data)
"CID","HeavyAtomCount","MolecularFormula","IsomericSMILES"
155903259,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
162396459,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@H](C[C@@H]3CCNC3=O)C#N)C"
163362001,35,"C23H32F3N5O4","CC(C)C[C@@H](C(=O)N1C[C@H]2[C@@H]([C@H]1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)C(F)(F)F"
166157194,34,"C23H33F2N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
167331698,34,"C23H33F2N5O4","CC(C)C[C@@H](C(=O)N1C[C@H]2[C@@H]([C@H]1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)C(F)F"
168290284,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@H]3CCNC3=O)C#N)C"
168301133,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1C(N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
163341926,35,"C23H32F3N5O4","CC1(C2C1C(N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)NC(CC3CCNC3=O)C#N)C"
163361849,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](CC3CCNC3=O)C#N)C"
163362005,35,"C23H32F3N5O4","CC(C)[C@@H](C(=O)N1C[C@H]2[C@@H]([C@H]1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)CC(F)(F)F"
163362031,36,"C24H34F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](CC(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
164040179,36,"C24H36F3N5O4","C.CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
164701758,35,"C23H32F3N5O4","CC(C)[C@@H](C(=O)N1CC2C(C1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)CC(F)(F)F"
164701760,36,"C24H34F3N5O4","CC1(C2C1C(N(C2)C(=O)[C@H](CC(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
164701812,35,"C23H32F3N5O4","CC1(C2C1C(N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
164701813,35,"C23H32F3N5O4","CC(C)C[C@@H](C(=O)N1CC2C(C1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)C(F)(F)F"
164701818,35,"C23H32F3N5O4","CC1(C2C1C(N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@H](C[C@@H]3CCNC3=O)C#N)C"
167065082,35,"C23H32F3N5O4","CC1(C2C1[C@H](N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
167094315,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
167094316,35,"C23H32F3N5O4","CC1([C@H]2C1CN([C@@H]2C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C"
169240696,35,"C23H32F3N5O4","CC1(C2C1C(N(C2)C(=O)[C@H](C(C)(C)C)N([C@@H](C[C@@H]3CCNC3=O)C#N)C(=O)C(F)(F)F)C(=O)N)C"
169595994,35,"C24H35F2N5O4","CC(C)C[C@@H](C(=O)N1C[C@]2([C@@H]([C@H]1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)C)NC(=O)C(F)F"
169686051,35,"C23H32F3N5O4","CC1([C@H]2C1CN([C@@H]2C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C"
169707956,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)NC(C[C@@H]3CCNC3=O)C#N)C"
169707957,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)C(C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](CC3CCNC3=O)C#N)C"
169860851,38,"C25H37F3N6O4","CC1(C2C1C(N(C2)C(=O)C(CCCCN(C)C)NC(=O)C(F)(F)F)C(=O)NC(CC3CCNC3=O)C#N)C"
169860863,37,"C24H35F3N6O4","CC1(C2C1C(N(C2)C(=O)C(CCCN(C)C)NC(=O)C(F)(F)F)C(=O)NC(CC3CCNC3=O)C#N)C"
169861121,38,"C25H37F3N6O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)C(CCCCN(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](CC3CCNC3=O)C#N)C"
169861123,37,"C24H35F3N6O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)C(CCCN(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](CC3CCNC3=O)C#N)C"
169861544,38,"C25H37F3N6O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](CCCCN(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
169861545,37,"C24H35F3N6O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](CCCN(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
169878303,36,"C24H34F3N5O4","CC1CNC(=O)[C@@H]1C[C@@H](C#N)NC(=O)[C@@H]2[C@@H]3[C@@H](C3(C)C)CN2C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F"
169878307,35,"C23H32F3N5O4","CC1CNC(=O)[C@@H]1C[C@@H](C#N)NC(=O)[C@@H]2[C@@H]3[C@@H](C3(C)C)CN2C(=O)[C@H](C(C)C)NC(=O)C(F)(F)F"
169914913,36,"C24H34F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)CC(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
170060748,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)NC(C[C@@H]3CCNC3=O)C#N)C"
170404012,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1C(N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
170701834,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)NCC[C@H]3C[C@H](NC3=O)C#N)C"
171530265,35,"C23H34F3N5O4","CC(C)C.CC1(C2C1C(N(C2)C(=O)CNC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
163285815,35,"C23H32F3N5O4","CC1([C@@H]2[C@@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
168476190,35,"C23H32F3N5O4","[2H]C1(C[C@H](C(=O)N1)C[C@@H](C#N)NC(=O)[C@@H]2[C@@H]3[C@@H](C3(C)C)CN2C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)[2H]"
169240702,35,"C23H32F3N5O4","CC1(C2C1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)NC(C[C@@H]3CCNC3=O)C#N)C"
171394959,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@H](C[C@H]3CCNC3=O)C#N)C"
166642632,35,"C23H32F3N5O4","[2H]C([2H])([2H])C([C@@H](C(=O)N1C[C@H]2[C@@H]([C@H]1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)C(F)(F)F)(C([2H])([2H])[2H])C([2H])([2H])[2H]"
172463305,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
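The CSV text returned by such a request can be turned into records keyed by column name, which makes simple summaries easy. The following sketch uses three rows copied from the output above rather than a live request; the variable names are illustrative.

```python
import csv
import io
from collections import Counter

# Header and three data rows copied from the CSV output above.
excerpt = '''"CID","HeavyAtomCount","MolecularFormula","IsomericSMILES"
155903259,35,"C23H32F3N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
166157194,34,"C23H33F2N5O4","CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
167331698,34,"C23H33F2N5O4","CC(C)C[C@@H](C(=O)N1C[C@H]2[C@@H]([C@H]1C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C2(C)C)NC(=O)C(F)F"
'''

# DictReader uses the header line as the keys of each row.
rows = list(csv.DictReader(io.StringIO(excerpt)))

# Count how many hits share each molecular formula.
formula_counts = Counter(row["MolecularFormula"] for row in rows)
print(formula_counts)

# CSV values are strings, so convert numeric columns explicitly.
heavy_atoms = [int(row["HeavyAtomCount"]) for row in rows]
print(heavy_atoms)
```

With the full response, replace `excerpt` with the `data` string from the request above.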
smiles = "CC1([C@@H]2[C@H]1[C@H](N(C2)C(=O)[C@H](C(C)(C)C)NC(=O)C(F)(F)F)C(=O)N[C@@H](C[C@@H]3CCNC3=O)C#N)C"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsimilarity_3d/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))
print(cids)
64
['168476190', '168301133', '155903259', '162396448', '162396443', '167331612', '162712482', '162396442', '162712489', '169595910', '166157170', '166157194', '168290284', '162712462', '166149071', '171347857', '169491069', '162396447', '170200908', '170200725', '169727506', '162712498', '166949501', '171502451', '171381582', '166949500', '165368436', '171362423', '167481592', '164622832', '167203580', '171350187', '171362422', '171502467', '170774498', '58799705', '58766735', '171843228', '167213710', '165368435', '167430402', '170774504', '164701814', '163321803', '170774499', '44227150', '166156672', '59115764', '58908752', '58605465', '59115469', '58908973', '58604790', '168941919', '169907726', '166156886', '169291122', '167203609', '167203582', '166157313', '169595980', '162396450', '166157277', '166157052']
Currently, unlike in the 2-D similarity search, the similarity threshold used for the 3-D similarity search is not adjustable.
5. Performing substructure/superstructure search
smiles = "C2CN=C(C1=C(C=CC=C1)N2)C3=CC=CC=C3"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsubstructure/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))
41497
smiles = "C2CN=C(C1=C(C=CC=C1)N2)C3=CC=CC=C3"
cids = requests.post("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastsuperstructure/smiles/cids/txt", data={'smiles':smiles}).text.strip().split()
print(len(cids))
6894
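Searches like these can return tens of thousands of CIDs, which is too many to place in a single follow-up request URL. One common approach, sketched below under stated assumptions, is to split the CID list into batches and request properties one batch at a time. The helper names `chunked` and `property_url` and the batch size are hypothetical choices, and the demonstration uses a few CIDs from earlier in this notebook instead of a live search result.

```python
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"


def chunked(items, size):
    """Yield successive slices of items, each with at most size elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def property_url(cid_batch, properties="MolecularFormula"):
    """Build a property request URL for a comma-separated batch of CIDs."""
    return f"{BASE}/cid/{','.join(cid_batch)}/property/{properties}/csv"


# A short CID list from the identity search above, used here only for illustration.
cids = ['14571425', '14571426', '71380237', '118122558', '123616558']

# A batch size of 2 keeps the demonstration short; a real run would use a larger value.
for batch in chunked(cids, 2):
    print(property_url(batch))
```

When the URLs are actually requested in a loop, pause between requests, as the earlier cells do with `time.sleep(0.2)`.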
7. Molecular Formula search
formula = "C6H12O6"
cids = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/cids/txt").text.strip().split()
print(len(cids))
1581
You can download the structural information for the compounds returned from the molecular formula search.
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/property/MolecularFormula,IsomericSMILES/CSV").text.strip()
cid_props = {}
reader = csv.reader(io.StringIO(data))
print(next(reader)) # Print the first line (column header)
for row in reader:
    key = row[0]
    cid_props[key] = row[1:]
count = 0
for item in cid_props:
    count += 1
    print(item, "\t", cid_props[item][0], "\t", cid_props[item][1])
    if count == 10 : # For simplicity, print only the first 10 items.
        break
['CID', 'MolecularFormula', 'IsomericSMILES']
892 C6H12O6 C1(C(C(C(C(C1O)O)O)O)O)O
5793 C6H12O6 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
2723872 C6H12O6 C1[C@H]([C@H]([C@@H](C(O1)(CO)O)O)O)O
6036 C6H12O6 C([C@@H]1[C@@H]([C@@H]([C@H](C(O1)O)O)O)O)O
107526 C6H12O6 C([C@H]([C@H]([C@@H]([C@H](C=O)O)O)O)O)O
64689 C6H12O6 C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
439312 C6H12O6 C1[C@H]([C@@H]([C@@H](C(O1)(CO)O)O)O)O
24310 C6H12O6 C1[C@H]([C@H]([C@@H]([C@](O1)(CO)O)O)O)O
439353 C6H12O6 C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
439709 C6H12O6 C([C@@H]1[C@H]([C@@H]([C@](O1)(CO)O)O)O)O
cids = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/cids/txt?AllowOtherElements=True").text.strip().split()
print(len(cids))
3338
data = requests.get("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/fastformula/" + formula +"/property/MolecularFormula,IsomericSMILES/CSV?AllowOtherElements=True").text.strip()
cid_props = {}
reader = csv.reader(io.StringIO(data))
print(next(reader)) # Print the first line (column header)
for row in reader:
    key = row[0]
    cid_props[key] = row[1:]
count = 0
for item in cid_props:
    count += 1
    print(item, "\t", cid_props[item][0], "\t", cid_props[item][1])
    if count == 10 : # For simplicity, print only the first 10 items.
        break
['CID', 'MolecularFormula', 'IsomericSMILES']
892 C6H12O6 C1(C(C(C(C(C1O)O)O)O)O)O
5793 C6H12O6 C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O
2723872 C6H12O6 C1[C@H]([C@H]([C@@H](C(O1)(CO)O)O)O)O
6036 C6H12O6 C([C@@H]1[C@@H]([C@@H]([C@H](C(O1)O)O)O)O)O
107526 C6H12O6 C([C@H]([C@H]([C@@H]([C@H](C=O)O)O)O)O)O
64689 C6H12O6 C([C@@H]1[C@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
439312 C6H12O6 C1[C@H]([C@@H]([C@@H](C(O1)(CO)O)O)O)O
24310 C6H12O6 C1[C@H]([C@H]([C@@H]([C@](O1)(CO)O)O)O)O
439353 C6H12O6 C([C@@H]1[C@@H]([C@@H]([C@H]([C@@H](O1)O)O)O)O)O
439709 C6H12O6 C([C@@H]1[C@H]([C@@H]([C@](O1)(CO)O)O)O)O
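The counter-and-break pattern used above to print the first few items can also be written with `itertools.islice`. The sketch below shows this alternative on a small CSV sample; the sample is rebuilt from the first rows of the printed output above (the CSV quoting is an assumption about the raw response), so no request is made.

```python
import csv
import io
from itertools import islice

# CSV sample reconstructed from the first rows of the output above.
sample = '''"CID","MolecularFormula","IsomericSMILES"
892,"C6H12O6","C1(C(C(C(C(C1O)O)O)O)O)O"
5793,"C6H12O6","C([C@@H]1[C@H]([C@@H]([C@H](C(O1)O)O)O)O)O"
2723872,"C6H12O6","C1[C@H]([C@H]([C@@H](C(O1)(CO)O)O)O)O"
6036,"C6H12O6","C([C@@H]1[C@@H]([C@@H]([C@H](C(O1)O)O)O)O)O"
'''

reader = csv.reader(io.StringIO(sample))
header = next(reader)                          # column names
cid_props = {row[0]: row[1:] for row in reader}

# islice stops after the requested number of items, so no counter is needed.
for cid, (formula, smiles) in islice(cid_props.items(), 3):
    print(cid, formula, smiles, sep="\t")
```

This relies on dictionaries preserving insertion order, which holds in Python 3.7 and later.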