Notes by Bob Hanson regarding the processing of 10.14469/hpc/10386

For the most part, the crawling was a straightforward task: looking for relatedIdentifier elements that had either a URL or DOI relatedIdentifierType. URL types were digital items; DOI types with relationType="HasPart" were followed to "child" records. So, for instance, we have:

10386 (the main DOI):

    10.14469/hpc/11652
    10.14469/hpc/11349
    10.14469/hpc/11405

and 11652 (a compound DOI):

    https://data.hpc.imperial.ac.uk/resolve/?doi=11652&file=1
    https://data.hpc.imperial.ac.uk/resolve/?doi=11652&file=2
    https://data.hpc.imperial.ac.uk/resolve/?doi=11652&file=3

Since the DataCite metadata has no more than this information about the URL parts, we decided to pull the headers of the files using the HTTP HEAD method. These headers provided the mediaType, length, and local filename.

Determination of the spectroscopy type was not definitive.

Some of the DOI entries contain subjects with subjectScheme="inchi" and subjectScheme="inchikey", and we can back-translate an InChI to a SMILES and then to a structure within Jmol. For example:

    InChI=1S/C8H4NO.Cl.N.H/c1-6-4-2-3-5-7(6)8(9)10;;;/h2-5H;;;
    WTOACRDOVRAYJI-UHFFFAOYSA-N

This *would* be perfect. As it turns out, though, this is not the InChI of 4-chlorophthalazin-1(2H)-one, as shown in Jmol:

    print "InChI=1S/C8H4NO.Cl.N.H/c1-6-4-2-3-5-7(6)8(9)10;;;/h2-5H;;;".smiles()
    [C+1]=C1C2=CC=C[CH+1]1.C2(=[N-1])[O-1]

Yikes! The correct InChI, from PubChem, gives us a valid SMILES string from Jmol, and Jmol turns that SMILES back into 4-chlorophthalazin-1(2H)-one:

    x = "InChI=1S/C8H5ClN2O/c9-7-5-3-1-2-4-6(5)8(12)11-10-7/h1-4H,(H,11,12)".smiles()
    print x
    c1ccc2c3c1.c2(O)[n][n]c3Cl
    load @{"$" + x}
    print {*}.find("MF")
    H 5 C 8 N 2 O 1 Cl 1
    show inchi
    InChI=1S/C8H5ClN2O/c9-7-5-3-1-2-4-6(5)8(12)11-10-7/h1-4H,(H,11,12)

So, in this case, the crawler would not know the structure of the compound from the metadata alone.
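
One way a crawler might at least flag an InChI like the faulty one above is to look at its formula layer, the text between the first and second "/". The following Python sketch is only a heuristic, not something these notes say was implemented; the function names are made up for illustration. It splits the formula layer on "." to count disconnected components and sums the atoms across them.

```python
import re
from collections import Counter

def formula_layer(inchi):
    """Return the formula layer: the text between the first and second '/'."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError("not an InChI string")
    return parts[1]

def components(inchi):
    """Return one Counter of atom counts per disconnected component."""
    result = []
    for comp in formula_layer(inchi).split("."):
        # A leading integer (as in "2ClH") means that many identical components.
        m = re.match(r"(\d*)(.*)", comp)
        copies = int(m.group(1)) if m.group(1) else 1
        counts = Counter()
        for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", m.group(2)):
            counts[element] += int(n) if n else 1
        result.extend([counts] * copies)
    return result

def total_atoms(inchi):
    """Sum the atom counts over all components."""
    return sum(components(inchi), Counter())

# The two InChI strings quoted in the notes.
faulty = "InChI=1S/C8H4NO.Cl.N.H/c1-6-4-2-3-5-7(6)8(9)10;;;/h2-5H;;;"
pubchem = "InChI=1S/C8H5ClN2O/c9-7-5-3-1-2-4-6(5)8(12)11-10-7/h1-4H,(H,11,12)"

print(len(components(faulty)), dict(total_atoms(faulty)))
print(len(components(pubchem)), dict(total_atoms(pubchem)))
```

For the two strings in these notes, the faulty InChI splits into four components while the PubChem one has a single component, even though both sum to the same atom counts (C8 H5 Cl N2 O). Multi-component InChIs are legitimate for salts and mixtures, so a component count above one is only a reason to look more closely, not proof of an error.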
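
The crawl rule described at the start of these notes (URL entries are digital items; DOI entries with relationType="HasPart" are child records to follow) can be sketched in Python as below. This is a minimal illustration, not the crawler's actual code. It assumes the related identifiers have already been parsed into dictionaries keyed by the DataCite property names; the function name and the "IsPartOf" entry in the sample data are hypothetical.

```python
def classify_related(related_identifiers):
    """Split related identifiers into (file URLs, child DOIs to follow)."""
    files, children = [], []
    for item in related_identifiers:
        kind = item.get("relatedIdentifierType")
        value = item.get("relatedIdentifier")
        if kind == "URL":
            # URL types are the digital items themselves.
            files.append(value)
        elif kind == "DOI" and item.get("relationType") == "HasPart":
            # DOI types with HasPart point to child records.
            children.append(value)
        # Anything else is ignored in this sketch.
    return files, children

# Main record 10386: three child DOIs, as listed in the notes.
main_record = [
    {"relatedIdentifier": doi, "relatedIdentifierType": "DOI",
     "relationType": "HasPart"}
    for doi in ("10.14469/hpc/11652", "10.14469/hpc/11349", "10.14469/hpc/11405")
]

# Compound record 11652: three file URLs, as listed in the notes.
# The relationType of the URL entries is not stated there, so it is omitted.
compound_record = [
    {"relatedIdentifier":
         f"https://data.hpc.imperial.ac.uk/resolve/?doi=11652&file={n}",
     "relatedIdentifierType": "URL"}
    for n in (1, 2, 3)
]
# Hypothetical back-reference to the parent, included only to show it is skipped.
compound_record.append({"relatedIdentifier": "10.14469/hpc/10386",
                        "relatedIdentifierType": "DOI",
                        "relationType": "IsPartOf"})

main_files, main_children = classify_related(main_record)
compound_files, compound_children = classify_related(compound_record)
print(main_children)
print(compound_files)
```

Each returned file URL would then be the target of a HEAD request, and each child DOI would be resolved and classified the same way.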