IUPAC WorldFAIR

Interactive Demonstration

WorldFAIR Chemistry: Protocol Services

This notebook is intended as an interactive demonstration of the services being proposed by the IUPAC WorldFAIR Chemistry D3.3 project team. A complete description of the project is available at:

https://iupac.github.io/WFChemProtocols/intro.html

This notebook itself is available at:

IUPAC/WFChemProtocols

Resolver Summary

The documentation linked above gives more detail. In short, this notebook describes a web service called a “resolver” that performs two main functions:

  1. Check for the presence of a chemical record in the hosting organization’s database.

  2. Validate the machine-readable chemical structure according to the hosting organization’s rules.

Resolver Base URL

The service being proposed in this project is a regular HTTP web service, using standard CGI URL syntax, and a well-defined data model for the information returned. This demonstration uses a prototype service hosted by PubChem, using JSON as the response format (although in principle it could be XML or any other structured data format).

One key point of this proposal is that the base URL for the resolver CGI would vary from one institution to another, but the inputs (CGI arguments) and outputs (JSON data) would be standard, the same for any organization implementing the service. So simply by switching the base URL, one can run the same query on multiple different sites, without otherwise needing to change any code.

In Python, using the “requests” library, it might look like this:

# do not change or remove this line, the examples below depend on it
import requests
RESOLVER_BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi"
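The idea of switching only the base URL can be sketched as follows. This is a minimal illustration, not part of the proposal itself: only the PubChem URL comes from this notebook, while the second entry (`resolver.example.org`) and the helper name `build_query_urls` are hypothetical placeholders. The sketch only builds the query URLs and does not contact any server.

```python
from urllib.parse import urlencode

# Only the PubChem URL is real; the second entry is a made-up placeholder
# standing in for some other organization's resolver.
RESOLVER_BASE_URLS = {
    "PubChem": "https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi",
    "ExampleOrg (hypothetical)": "https://resolver.example.org/resolver.cgi",
}


def build_query_urls(params, base_urls):
    """Return {site name: full query URL} for the same arguments on every site."""
    query = urlencode(params)  # identical query string for all sites
    return {name: f"{base}?{query}" for name, base in base_urls.items()}


urls = build_query_urls({"smiles": "CCCC"}, RESOLVER_BASE_URLS)
for name, url in urls.items():
    print(f"{name}: {url}")
```

Each resulting URL could then be fetched with `requests.get(url)`, exactly as in the cells of this notebook, without changing any other code.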

When called without any arguments, the resolver will return some information about what inputs and outputs it can handle.

# this does the actual HTTP call to the resolver CGI
result = requests.get(RESOLVER_BASE_URL)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi


{
  "Result": {
    "ServiceDetails": [
      {
        "Resource": "PubChem",
        "ResourceURL": "https://pubchem.ncbi.nlm.nih.gov",
        "ResolverURL": "https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi",
        "AvailableInputs": {
          "SDF": true,
          "SMILES": true,
          "InChI": true,
          "InChIKey": true,
          "PNG": false,
          "Name": true
        },
        "AvailableOutputs": {
          "IUPACName": true,
          "SMILES": true,
          "InChI": true,
          "InChIKey": true,
          "ResourceIdentifier": true,
          "RecordURL": true,
          "ImageURL": true
        }
      }
    ]
  }
}
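A client might use this service description to decide which input formats to offer before sending any query. The following is a small sketch under that assumption; the helper name `supported` is hypothetical, and the JSON is an abbreviated copy of the response shown above rather than a live call.

```python
import json

# Abbreviated copy of the service description returned by the resolver above.
SERVICE_JSON = """
{
  "Result": {
    "ServiceDetails": [
      {
        "Resource": "PubChem",
        "AvailableInputs": {
          "SDF": true, "SMILES": true, "InChI": true,
          "InChIKey": true, "PNG": false, "Name": true
        },
        "AvailableOutputs": {
          "IUPACName": true, "SMILES": true, "InChI": true,
          "InChIKey": true, "ResourceIdentifier": true,
          "RecordURL": true, "ImageURL": true
        }
      }
    ]
  }
}
"""


def supported(service_info, section):
    """Map each resource name to the sorted names flagged true in `section`."""
    summary = {}
    for detail in service_info.get("Result", {}).get("ServiceDetails", []):
        flags = detail.get(section, {})
        enabled = sorted(name for name, is_on in flags.items() if is_on)
        summary[detail.get("Resource", "unknown")] = enabled
    return summary


info = json.loads(SERVICE_JSON)
inputs = supported(info, "AvailableInputs")
print(inputs)
```

Here PNG is left out of the list because the service reports it as `false`.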

Chemical Lookup

The resolver service can check to see whether a given chemical is present in the host organization’s database. Examples are below, but note that in the interactive Jupyter notebook, one can edit the inputs to query whatever chemical is desired.

First, to look up by SMILES string:

payload = { "smiles": "CCCC" }
result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=CCCC


{
  "Result": {
    "Match": [
      {
        "Resource": "PubChem",
        "ResourceURL": "https://pubchem.ncbi.nlm.nih.gov",
        "ResourceIdentifier": "7843",
        "ResourceIdentifierType": "CID",
        "RecordURL": "https://pubchem.ncbi.nlm.nih.gov/compound/7843",
        "ImageURL": "https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&cid=7843",
        "IUPACName": "butane",
        "SMILES": "CCCC",
        "InChI": "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3",
        "InChIKey": "IJDNQMDRQITEOD-UHFFFAOYSA-N"
      }
    ]
  }
}

In this example, the requests module constructs the full URL from the payload argument. The response indicates that there is indeed a matching record in the host’s database, and it provides record fields that let the user get more information directly from the hosting site. The resolver is not intended for full record retrieval; it returns a simplified response that says whether the chemical was found and where to go for more detail. In this case the user can follow the link to the full PubChem record:

https://pubchem.ncbi.nlm.nih.gov/compound/7843

Or see an image of the chemical structure (although not terribly interesting in this case!):

https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&cid=7843

[Image: butane]

If the chemical is not in the database, the response looks something like the following, where an empty result means nothing was found. (This could also be indicated by an HTTP 404 response, but this sample implementation does not do so.)

payload = { "smiles": "CCCC(Br)CC(F)(Cl)CCC" }
result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=CCCC%28Br%29CC%28F%29%28Cl%29CCC


{
  "Result": {
  }
}
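A client can treat the found and not-found cases uniformly by looking for the "Match" list inside "Result". The sketch below assumes the response shapes shown in the two lookups above; the helper name `find_matches` is hypothetical, and the dictionaries are trimmed copies of those responses rather than live results.

```python
def find_matches(response_json):
    """Return the list of match records, or an empty list if nothing was found."""
    return response_json.get("Result", {}).get("Match", [])


# Trimmed copies of the two responses shown above.
found = {
    "Result": {
        "Match": [
            {
                "Resource": "PubChem",
                "ResourceIdentifier": "7843",
                "ResourceIdentifierType": "CID",
                "RecordURL": "https://pubchem.ncbi.nlm.nih.gov/compound/7843",
            }
        ]
    }
}
not_found = {"Result": {}}

for smiles, data in (("CCCC", found), ("CCCC(Br)CC(F)(Cl)CCC", not_found)):
    matches = find_matches(data)
    if not matches:
        print(f"{smiles}: no record found")
    for match in matches:
        id_type = match["ResourceIdentifierType"]
        print(f"{smiles}: {id_type} {match['ResourceIdentifier']} at {match['RecordURL']}")
```

With a live call, `result.json()` from the requests library would supply the dictionary passed to `find_matches`.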

The resolver can handle multiple input formats for the chemical structure, as listed in the previous section. So all of these would return the same result, which can be verified by (un)commenting various payload lines below:

payload = { "inchi": "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3" }
#payload = { "smiles": "CCCC" }
#payload = { "inchikey": "IJDNQMDRQITEOD-UHFFFAOYSA-N" }
#payload = { "name": "butane" }

result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?inchi=InChI%3D1S%2FC4H10%2Fc1-3-4-2%2Fh3-4H2%2C1-2H3


{
  "Result": {
    "Match": [
      {
        "Resource": "PubChem",
        "ResourceURL": "https://pubchem.ncbi.nlm.nih.gov",
        "ResourceIdentifier": "7843",
        "ResourceIdentifierType": "CID",
        "RecordURL": "https://pubchem.ncbi.nlm.nih.gov/compound/7843",
        "ImageURL": "https://pubchem.ncbi.nlm.nih.gov/image/imgsrv.fcgi?t=l&cid=7843",
        "IUPACName": "butane",
        "SMILES": "CCCC",
        "InChI": "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3",
        "InChIKey": "IJDNQMDRQITEOD-UHFFFAOYSA-N"
      }
    ]
  }
}

Note that the full InChI string needs to be URL-encoded in order to be passed as an argument to the CGI, as do some SMILES strings that contain special characters. Again, this is handled automatically by the requests library in this example.
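For readers building URLs without the requests library, the encoding step can be sketched with Python's standard `urllib.parse` module. This is an illustration of URL-encoding in general, not a description of what requests does internally; for the two inputs used in this notebook it yields the same query strings as the URLs printed above.

```python
from urllib.parse import parse_qs, urlencode

inchi = "InChI=1S/C4H10/c1-3-4-2/h3-4H2,1-2H3"
smiles = "CCCC(Br)CC(F)(Cl)CCC"

# '=', '/' and ',' in the InChI, and the parentheses in the SMILES,
# are replaced by percent-escapes.
inchi_query = urlencode({"inchi": inchi})
smiles_query = urlencode({"smiles": smiles})
print(inchi_query)
print(smiles_query)

# Decoding the query string recovers the original value unchanged.
decoded = parse_qs(inchi_query)["inchi"][0]
print(decoded == inchi)
```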

Chemical Structure Validation

The second major function of the resolver is to check the validity of chemical structures. That is, when a user inputs a SMILES string or an SDF file (for example, as exported from a chemical drawing package or ELN), does the host organization confirm that the structure is valid? Does it have the right number of defined stereocenters, isotopes, etc.? Sometimes chemists draw complex structures in a way where stereochemistry is implied by the drawing, but may not be interpreted as such by a machine. This tool allows the chemist to verify that the structure is perceived by the chemical software in the same way as by the chemist themselves.

When called with the special argument action=validate_structure, the resolver returns some basic statistics about what it sees in the structure. Note that this may vary somewhat from organization to organization, especially for edge cases where different chemical software packages produce slightly different results. This is expected, and part of the idea here is to ask “What does PubChem think of this structure?” vs. “What does EPA think of this structure?”

payload = { "smiles": "CCCC", "action": "validate_structure" }

result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=CCCC&action=validate_structure


{
  "Result": {
    "Message": "Structure is valid",
    "Statistics": [
      {
        "Type": "DefinedAtomStereo",
        "Value": "0"
      },
      {
        "Type": "UndefinedAtomStereo",
        "Value": "0"
      },
      {
        "Type": "DefinedBondStereo",
        "Value": "0"
      },
      {
        "Type": "UndefinedBondStereo",
        "Value": "0"
      },
      {
        "Type": "HeavyAtoms",
        "Value": "4"
      },
      {
        "Type": "IsotopeAtoms",
        "Value": "0"
      },
      {
        "Type": "CovalentUnits",
        "Value": "1"
      }
    ]
  }
}
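The statistics arrive as a list of Type/Value pairs with string values, so a client will probably want to convert them into a lookup table with integer counts. A minimal sketch, assuming the response shape shown above (the helper name `statistics_to_dict` is hypothetical, and the data is copied from the butane response rather than fetched):

```python
def statistics_to_dict(response_json):
    """Convert the Statistics list into a {Type: integer Value} dictionary."""
    stats = response_json.get("Result", {}).get("Statistics", [])
    return {item["Type"]: int(item["Value"]) for item in stats}


# Copy of the validation response for butane shown above.
butane_response = {
    "Result": {
        "Message": "Structure is valid",
        "Statistics": [
            {"Type": "DefinedAtomStereo", "Value": "0"},
            {"Type": "UndefinedAtomStereo", "Value": "0"},
            {"Type": "DefinedBondStereo", "Value": "0"},
            {"Type": "UndefinedBondStereo", "Value": "0"},
            {"Type": "HeavyAtoms", "Value": "4"},
            {"Type": "IsotopeAtoms", "Value": "0"},
            {"Type": "CovalentUnits", "Value": "1"},
        ],
    }
}

stats = statistics_to_dict(butane_response)
print(stats["HeavyAtoms"], stats["CovalentUnits"])
```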

If there is a problem with the input structure, there should be some human-readable message that indicates what the error is. Again, this will vary by organization, and the message itself is not part of this standard, but basic things like valence checks on organic structures will presumably be handled similarly.

payload = { "smiles": "CC(C)(C)(C)C", "action": "validate_structure" }

result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=CC%28C%29%28C%29%28C%29C&action=validate_structure


{
  "Fault": {
    "Code": "Invalid",
    "Message": "Structure is not valid",
    "Details": [
      "Record 0: Warning: \"pcData/pubchem_valence.cpp\", line 290: Detected illegal valence for element \"C\": 5 sigma bonds, 0 pi bonds, 0 charge",
      "Exception: Valence validation failed"
    ]
  }
}
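Since a validation response contains either a "Result" or a "Fault" at the top level, client code can branch on which key is present. The sketch below assumes only the two response shapes shown in this notebook; the helper name `summarize_validation` is hypothetical, and because the detail messages are organization-specific they are passed through as plain text.

```python
def summarize_validation(response_json):
    """Return (is_valid, message, details) for a validate_structure response."""
    if "Result" in response_json:
        return True, response_json["Result"].get("Message", ""), []
    fault = response_json.get("Fault", {})
    return False, fault.get("Message", ""), fault.get("Details", [])


# Copy of the fault returned above for the five-valent carbon.
fault_response = {
    "Fault": {
        "Code": "Invalid",
        "Message": "Structure is not valid",
        "Details": [
            'Record 0: Warning: "pcData/pubchem_valence.cpp", line 290: '
            'Detected illegal valence for element "C": '
            "5 sigma bonds, 0 pi bonds, 0 charge",
            "Exception: Valence validation failed",
        ],
    }
}

is_valid, message, details = summarize_validation(fault_response)
print(message)
for line in details:
    print("  -", line)
```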

Here is an example where the organization’s specific rules come into play. PubChem, which is designed mainly for drug-like chemicals, rejects isotopes with a half-life of less than 1 millisecond. This may not be the case for other databases with different purposes and goals. So even though 5H exists (at least in a laboratory), it is not considered valid in PubChem.

payload = { "smiles": "C[5H]", "action": "validate_structure" }

result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=C%5B5H%5D&action=validate_structure


{
  "Fault": {
    "Code": "Invalid",
    "Message": "Structure is not valid",
    "Details": [
      "Record 0: Info: \"OpenEye/pubchem_compound.cpp\", line 3121: Atom ID \"2\" has illegal isotope (5) for atomic number 1 (\"H\")",
      "Exception: Element validation failed"
    ]
  }
}

Here is a more complex example, a larger structure (Prostaglandin D2) with multiple stereocenters, both sp3 and sp2. Note the response data indicates how many defined vs. undefined stereocenters are present, which may assist the user in matching their expectations to the machine result.

# raw string (r"...") so the backslash in the SMILES is not treated as an escape character
payload = {
    "smiles": r"CCCCC[C@@H](/C=C/[C@@H]1[C@H]([C@H](CC1=O)O)C/C=C\CCCC(=O)O)O",
    "action": "validate_structure"
}

result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)
print('\n')
print(result.text)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=CCCCC%5BC%40%40H%5D%28%2FC%3DC%2F%5BC%40%40H%5D1%5BC%40H%5D%28%5BC%40H%5D%28CC1%3DO%29O%29C%2FC%3DC%5CCCCC%28%3DO%29O%29O&action=validate_structure


{
  "Result": {
    "Message": "Structure is valid",
    "Statistics": [
      {
        "Type": "DefinedAtomStereo",
        "Value": "4"
      },
      {
        "Type": "UndefinedAtomStereo",
        "Value": "0"
      },
      {
        "Type": "DefinedBondStereo",
        "Value": "2"
      },
      {
        "Type": "UndefinedBondStereo",
        "Value": "0"
      },
      {
        "Type": "HeavyAtoms",
        "Value": "25"
      },
      {
        "Type": "IsotopeAtoms",
        "Value": "0"
      },
      {
        "Type": "CovalentUnits",
        "Value": "1"
      }
    ]
  }
}
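One way to match expectations against the machine result is to compare the stereocenter counts the chemist intended with the counts reported by the resolver. The following is only a sketch of that comparison: the helper name `stereo_problems` and the idea of passing expected counts are assumptions for illustration, and the statistics are copied from the Prostaglandin D2 response above.

```python
def stereo_problems(statistics, expected_atom_stereo, expected_bond_stereo):
    """Return a list of human-readable mismatches; empty if all is as expected."""
    counts = {item["Type"]: int(item["Value"]) for item in statistics}
    problems = []
    checks = (
        ("DefinedAtomStereo", expected_atom_stereo),
        ("DefinedBondStereo", expected_bond_stereo),
    )
    for key, expected in checks:
        seen = counts.get(key, 0)
        if seen != expected:
            problems.append(f"{key}: expected {expected}, resolver reports {seen}")
    # Any undefined center means the machine could not assign a configuration.
    for key in ("UndefinedAtomStereo", "UndefinedBondStereo"):
        if counts.get(key, 0) > 0:
            problems.append(f"{key}: {counts[key]} center(s) left undefined")
    return problems


# Statistics copied from the Prostaglandin D2 response shown above.
pgd2_statistics = [
    {"Type": "DefinedAtomStereo", "Value": "4"},
    {"Type": "UndefinedAtomStereo", "Value": "0"},
    {"Type": "DefinedBondStereo", "Value": "2"},
    {"Type": "UndefinedBondStereo", "Value": "0"},
    {"Type": "HeavyAtoms", "Value": "25"},
    {"Type": "IsotopeAtoms", "Value": "0"},
    {"Type": "CovalentUnits", "Value": "1"},
]

problems = stereo_problems(pgd2_statistics, expected_atom_stereo=4, expected_bond_stereo=2)
print(problems if problems else "Stereochemistry matches expectations")
```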

Finally, chemists are trained to interpret chemical structures visually, so it may help them to see a computer-generated image of their input and check that it matches what they expect. The resolver can therefore also return an image file when the appropriate output format is requested. Note that, in order to show the image here in the notebook, we must use the resolver URL directly in the image tag, rather than going through Python.

# raw string (r"...") so the backslash in the SMILES is not treated as an escape character
payload = {
    "smiles": r"CCCCC[C@@H](/C=C/[C@@H]1[C@H]([C@H](CC1=O)O)C/C=C\CCCC(=O)O)O",
    "action": "validate_structure",
    "format": "png"
}

result = requests.get(RESOLVER_BASE_URL, payload)

print(result.url)

https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi?smiles=CCCCC%5BC%40%40H%5D%28%2FC%3DC%2F%5BC%40%40H%5D1%5BC%40H%5D%28%5BC%40H%5D%28CC1%3DO%29O%29C%2FC%3DC%5CCCCC%28%3DO%29O%29O&action=validate_structure&format=png

[Image: Resolver image]
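Using the resolver URL directly in an image tag can be sketched as below. The helper name `resolver_image_tag` is hypothetical, and the sketch only builds the HTML string without contacting the server. Note that the ampersands separating the query arguments are HTML-escaped inside the attribute value.

```python
from html import escape
from urllib.parse import urlencode

RESOLVER_BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/resolver/resolver.cgi"


def resolver_image_tag(smiles, alt="Resolver image"):
    """Build an HTML <img> tag whose source is the resolver's PNG output."""
    query = urlencode(
        {"smiles": smiles, "action": "validate_structure", "format": "png"}
    )
    url = f"{RESOLVER_BASE_URL}?{query}"
    # escape() turns '&' into '&amp;' so the URL is valid inside an attribute
    return f'<img src="{escape(url, quote=True)}" alt="{escape(alt, quote=True)}">'


tag = resolver_image_tag("CCCC")
print(tag)
```

In a Jupyter notebook, passing the resulting string to `IPython.display.HTML` is one way to render it, assuming IPython is available.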

Conclusion

It is our hope that this notebook provides a clear overview of the expected functionality of the resolver being proposed by this IUPAC project. These working examples should give users a chance to see how to submit these web service requests without having to know any programming, and to change the inputs to their own SMILES strings etc. in order to see how the resolver responds to their unique cases.

We would be happy to get feedback; please see the following page for details. Thank you!

https://iupac.github.io/WFChemProtocols/demo.html