# Finding a Findable Dataset

```{dropdown} About this interactive ![icons](../static/img/rocket.png) recipe
- Author(s): [Stuart Chalk](https://orcid.org/0000-0002-0703-7776)
- Topic(s): How and where to find a 'findable' chemical dataset
- Format(s): Interactive Jupyter Notebook (Python)
- Scenario(s): You are looking for research data to complement or compare with your own data
- Skill(s): You should be familiar with
    - [Application Programming Interfaces (APIs)](https://www.ibm.com/topics/api)
    - [Working with a data model](https://doi.org/10.1515/pac-2021-3013)
    - [Introductory JSON](https://www.youtube.com/watch?v=iiADhChRriM)
- Learning outcomes: After completing this example you should understand:
    - How to make a request to a website using the Python 'requests' package
    - How to retrieve data in JSON format and parse it (knowing the data model)
    - How to store confidential data in a remote file
    - How you can programmatically authenticate to an API (one of many ways)
- Citation: 'Finding a Findable Dataset', Stuart Chalk, The IUPAC FAIR Chemistry Cookbook, Contributed: 2024-02-14 [https://w3id.org/ifcc/IFCC013](https://w3id.org/ifcc/IFCC013).
- Reuse: This notebook is made available under a [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
```

## Scenario
Our group has a set of thermophysical data on over 8000 chemical substances.  We want to integrate into this dataset another physical property dataset so that we can do an analysis of the correlations of the thermophysical data with the chosen physical property of the substances (that are common to both sets).
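Restricting the analysis to the substances common to both sets amounts to a join on a shared identifier. The following is a minimal sketch, assuming both datasets are keyed by InChIKey (the scenario does not say which identifier is used). The field names and all numeric values are placeholders chosen for illustration.

```python
# Hypothetical example: both datasets keyed by InChIKey; all numbers are placeholders
thermo = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": {"name": "water", "thermo_value": 1.0},
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {"name": "ethanol", "thermo_value": 2.0},
}
physical = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {"phys_value": 20.0},
    "UHOVQNZJYSORNB-UHFFFAOYSA-N": {"phys_value": 30.0},
}


def merge_common(left, right):
    """Return merged records for the keys present in both dictionaries."""
    common = sorted(left.keys() & right.keys())
    return {key: {**left[key], **right[key]} for key in common}


merged = merge_common(thermo, physical)
print(merged)
```

Only ethanol appears in both placeholder sets, so it is the only record that survives the merge.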

Criteria for picking the physical property dataset: high quality, trusted, large, and available with an open license, so that we can publish the results and make the derived dataset open.
- **High quality means**: unambiguous identification of each chemical substance, enough contextual information (metadata) to make the values scientifically useful, i.e., at least the composition of the solvent, the temperature and for volatile substances the pressure.
- **Trusted means**: the provenance chain is reported with the data, and it shows that the data comes from a reputable source(s) and any aggregation and/or processing is documented in enough detail that the community can understand how the dataset has been created/provided.
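The "high quality" criterion can be turned into a simple automated check once a candidate dataset has been loaded. This is a sketch only: the field names (`identifier`, `solvent_composition`, `temperature`, `pressure`) are assumptions made for illustration and would need to be mapped onto the data model of the dataset being assessed.

```python
# Hypothetical field names; adapt them to the data model of the dataset you are assessing
REQUIRED_FIELDS = ("identifier", "solvent_composition", "temperature")


def missing_metadata(record, volatile=False):
    """List the required metadata fields that are absent or empty in a record."""
    required = list(REQUIRED_FIELDS)
    if volatile:
        # pressure is only required for volatile substances
        required.append("pressure")
    return [field for field in required if record.get(field) in (None, "")]


# a record that reports an identifier and a temperature but nothing else
record = {"identifier": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "temperature": 298.15}
print(missing_metadata(record, volatile=True))
```

An empty list means the record carries the minimum context described above. The "trusted" criterion concerns provenance documentation and is harder to check automatically.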

## Step 1 - Searching PubChem for datasets
PubChem houses a lot of data about chemical substances, compounds and bioassays.  Over time, external organizations have worked with PubChem to include their data in one of two ways:
- data that has been integrated into PubChem pages (e.g., [CCDC](https://pubchem.ncbi.nlm.nih.gov/source/941) -> [example](https://pubchem.ncbi.nlm.nih.gov/compound/241))
- data that is not available in a PubChem page but is available via the data sources section of the site as 'annotations' (e.g. [RCSB PDB](https://pubchem.ncbi.nlm.nih.gov/source/15751) -> [Example](https://pubchem.ncbi.nlm.nih.gov/source/15751#data=Annotations))

The data available may not be structured and/or clearly described; however, if the source has a website with an API, you are likely to get better-quality metadata from the linked site.

### 1.1 - Load the Python functions

In [9]:
# 'json' is part of the Python standard library; 'requests' is a third-party package
# that must be installed first (e.g., with 'pip install requests')
import requests
import json

### 1.2 - Search for sources that have 'curation efforts'

In [10]:
# This URL is the metadata about the data sources in PubChem
url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/sourcetable/all/JSON/?response_type=display'
response = requests.get(url)
srcs = response.json()
results = []
search = 'Curation Efforts'  # i.e., a repository, or other type of data source (this is in index 8 of the data list for each source)
rows = srcs['Table']['Row']
for row in rows:
    # Cell[8] holds the source category, Cell[0] the name and Cell[9] the URL
    if search in row['Cell'][8]:
        results.append({'name': row['Cell'][0], 'url': row['Cell'][9]})
# when printed this is a scrollable list of many entries
print(json.dumps(results, indent=4))

[
    {
        "name": "Agency for Toxic Substances and Disease Registry (ATSDR)",
        "url": "https://www.atsdr.cdc.gov/"
    },
    {
        "name": "Alliance of Genome Resources",
        "url": "https://www.alliancegenome.org/"
    },
    {
        "name": "Athena Minerals",
        "url": "https://athena.unige.ch/athena/mineral/mineral.html"
    },
    {
        "name": "Barrie Walker, BARK Information Services",
        "url": "https://uk.linkedin.com/in/barrie-walker-85b4a510"
    },
    {
        "name": "BindingDB",
        "url": "https://www.bindingdb.org/rwd/bind/"
    },
    {
        "name": "BioCyc",
        "url": "https://biocyc.org/"
    },
    {
        "name": "BioGRID",
        "url": "https://thebiogrid.org/"
    },
    {
        "name": "CAMEO Chemicals",
        "url": "https://cameochemicals.noaa.gov/"
    },
    {
        "name": "Catalogue of Life (COL)",
        "url": "https://www.catalogueoflife.org/"
    },
    {
        "name": "CCSbase",
        "

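The filtering step in 1.2 can be wrapped in a reusable function so that other source categories can be searched without editing the loop. The sketch below assumes the response keeps the `Table` / `Row` / `Cell` shape used above; the function name, the case-insensitive matching, and the two-row sample table are illustrative additions, not part of the PubChem API.

```python
def filter_sources(table, search, category_idx=8, name_idx=0, url_idx=9):
    """Return name/URL pairs for sources whose category cell contains `search`."""
    hits = []
    for row in table.get('Table', {}).get('Row', []):
        cells = row.get('Cell', [])
        # skip rows that are too short to hold all three cells
        if len(cells) <= max(category_idx, name_idx, url_idx):
            continue
        # case-insensitive match (a choice made for this sketch)
        if search.lower() in str(cells[category_idx]).lower():
            hits.append({'name': cells[name_idx], 'url': cells[url_idx]})
    return hits


# made-up two-row table with the same shape as the PubChem response
sample = {'Table': {'Row': [
    {'Cell': ['Source A', '', '', '', '', '', '', '', 'Curation Efforts', 'https://example.org/a']},
    {'Cell': ['Source B', '', '', '', '', '', '', '', 'Other Category', 'https://example.org/b']},
]}}
print(filter_sources(sample, 'curation efforts'))
```

With the live data, the call would be `filter_sources(srcs, 'Curation Efforts')`, where `srcs` is the parsed response from the cell above.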
## Step 2 - Searching FAIRsharing for datasets
FAIRsharing is a database of FAIR resources and not, per se, a database of datasets; however, you might find a repository here that
has the kind of data you are looking for. The code below accesses the FAIRsharing API to search for 'chemistry' (or other term) related resources.

*Note: To use the code below please go to [https://fairsharing.org/accounts/signup](https://fairsharing.org/accounts/signup?ref=iupacfaircookbook), create an account and then enter your username and password in the quotes for 'fs_user' and 'fs_pass' below.*

### 2.1 - Authentication to the FAIRsharing API

In [11]:
# see https://fairsharing.org/API_doc for instructions on how to search the API
# user login
fs_user = "ChemCookbook"
fs_pass = "ydt_wdh_MRD*qut5xvq"
url = 'https://api.fairsharing.org/users/sign_in'
loghdrs = {'Accept': 'application/json','Content-Type': 'application/json'}
login = {'user': {'login': fs_user, 'password': fs_pass}}
response = requests.request("POST", url, headers=loghdrs, data=json.dumps(login))
data = response.json()
print(data)

{'success': True, 'jwt': 'eyJhbGciOiJIUzI1NiJ9.eyJqdGkiOiI0YjVkNzQyMi1hOTg3LTRlZWYtYjQyYi1hN2U3Yjc2MTM0ZTEiLCJzdWIiOiI4NTA5Iiwic2NwIjoidXNlciIsImF1ZCI6bnVsbCwiaWF0IjoxNjk5NDUzNDMwLCJleHAiOjE2OTk1Mzk4MzB9.d0r8PUqy9nqaLS2-qPG8-20EHN1LnPunKmrPIvh9p7U', 'username': 'ChemCookbook', 'id': 8509, 'role': 'user', 'profile_type': 'none', 'watched_records': [], 'is_curator': False, 'is_super_curator': False, 'third_party': False, 'expiry': 1699539830, 'message': 'Authentication successful'}
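Typing a password directly into a notebook makes it easy to share by accident. One common alternative is to read the credentials from environment variables. This is a sketch under assumptions: the variable names `FS_USER` and `FS_PASS` and the helper `build_login_payload` are invented here, and only the payload structure is taken from the login cell above.

```python
import json
import os


def build_login_payload(env=os.environ):
    """Build the sign-in body from environment variables instead of literals."""
    # FS_USER and FS_PASS are variable names chosen for this sketch
    user = env.get('FS_USER')
    password = env.get('FS_PASS')
    if not user or not password:
        raise RuntimeError('Set FS_USER and FS_PASS before running the notebook')
    return json.dumps({'user': {'login': user, 'password': password}})


# demonstration with a stand-in mapping rather than real credentials
print(build_login_payload({'FS_USER': 'example_user', 'FS_PASS': 'example_pass'}))
```

The returned string can be passed as the `data` argument of the POST request in place of `json.dumps(login)`. Note that the printed response also contains the token itself, so the output of the login cell is best cleared before a notebook is shared.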


### 2.2 - Make the API request

In [12]:
# to authenticate when making an API request, the 'jwt' token returned above must
# be included in the HTTP request headers (see https://en.wikipedia.org/wiki/List_of_HTTP_header_fields)
jwt = data['jwt']
srchdrs = {'Accept': 'application/json', 'Content-Type': 'application/json', 'Authorization': "Bearer {0}".format(jwt)}
searchterm = 'chemistry'
searchurl = 'https://api.fairsharing.org/search/fairsharing_records?q=' + searchterm
search = requests.request("POST", searchurl, headers=srchdrs)
hits = json.loads(search.content)
# this prints out the raw JSON for the first entry (the 'data' entry is a JSON list)
# that is returned from the API request (formatted nicely, which means it's on many lines)
print(json.dumps(hits['data'][0], indent=4))

{
    "id": "1443",
    "type": "fairsharing_records",
    "attributes": {
        "created_at": "2021-06-29T14:47:39.000Z",
        "updated_at": "2023-03-15T08:08:11.434Z",
        "metadata": {
            "doi": "10.25504/FAIRsharing.cb1adb",
            "name": "Portable reduced-precision binary format for trajectories produced by GROMACS package.",
            "status": "ready",
            "contacts": [
                {
                    "contact_name": "Adam Hospital",
                    "contact_email": "adam.hospital@irbbarcelona.org",
                    "contact_orcid": "0000-0002-8291-8071"
                }
            ],
            "homepage": "https://manual.gromacs.org/documentation/2021/reference-manual/file-formats.html#xtc",
            "citations": [],
            "identifier": 1443,
            "description": "The XTC format is a portable binary format for trajectories produced by GROMACS package. It uses the External Data Representation (xdr) routines for wr

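Knowing the data model makes it possible to pull out only the fields of interest. The sketch below reads a few keys that are visible in the record printed above (`id`, and `doi`, `name`, `homepage` under `attributes` / `metadata`). The helper name `summarize_record` is an invention for this example, and `dict.get` is used so that a record lacking a key yields `None` rather than raising an error.

```python
def summarize_record(record):
    """Pull a few fields out of one FAIRsharing record, tolerating missing keys."""
    metadata = record.get('attributes', {}).get('metadata', {})
    return {
        'id': record.get('id'),
        'name': metadata.get('name'),
        'doi': metadata.get('doi'),
        'homepage': metadata.get('homepage'),
    }


# trimmed copy of the record printed above (most keys omitted)
sample = {
    'id': '1443',
    'type': 'fairsharing_records',
    'attributes': {'metadata': {'doi': '10.25504/FAIRsharing.cb1adb', 'status': 'ready'}},
}
print(summarize_record(sample))
```

With the live response, `summarize_record(hits['data'][0])` would summarize the first hit.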
### 2.3 - Output the data in a presentable format

In [13]:
# here we loop over the data that has been returned and print it out, one per line
for hit in hits['data']:
    print(hit['attributes']['name'] + ": " + hit['attributes']['url'])

FAIRsharing record for: Portable reduced-precision binary format for trajectories produced by GROMACS package.: https://fairsharing.org/10.25504/FAIRsharing.cb1adb
FAIRsharing record for: Chemistry: https://fairsharing.org/fairsharing_records/3524
FAIRsharing record for: Chemistry vocabulary: https://fairsharing.org/10.25504/FAIRsharing.TrcBD2
FAIRsharing record for: EMODnet Chemistry: https://fairsharing.org/10.25504/FAIRsharing.KOiDmy
FAIRsharing record for: ioChem-BD: https://fairsharing.org/10.25504/FAIRsharing.lwW6a1
FAIRsharing record for: Royal Society of Chemistry - Data policy: https://fairsharing.org/10.25504/FAIRsharing.egbgwm
FAIRsharing record for: MINAS - A Database of Metal Ions in Nucleic AcidS: https://fairsharing.org/10.25504/FAIRsharing.wqtfkv
FAIRsharing record for: Beilstein Journal of Organic Chemistry: https://fairsharing.org/10.25504/FAIRsharing.7GA79k
FAIRsharing record for: CAS Registry Number: https://fairsharing.org/10.25504/FAIRsharing.r7Kwy7
FAIRsharing re
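To keep the list of candidate resources for later review, the same loop can write the names and URLs as CSV instead of printing them. This is a sketch that assumes each hit has the `attributes` / `name` / `url` structure used in the loop above. The function name `hits_to_csv` and the one-entry sample are illustrative.

```python
import csv
import io


def hits_to_csv(hits):
    """Serialize the name and URL of each hit as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator='\n')
    writer.writerow(['name', 'url'])
    for hit in hits.get('data', []):
        attrs = hit.get('attributes', {})
        writer.writerow([attrs.get('name', ''), attrs.get('url', '')])
    return buffer.getvalue()


# one entry copied from the output above, in the shape returned by the API
sample_hits = {'data': [
    {'attributes': {'name': 'FAIRsharing record for: ioChem-BD',
                    'url': 'https://fairsharing.org/10.25504/FAIRsharing.lwW6a1'}},
]}
print(hits_to_csv(sample_hits))
```

The returned text can be written to a file with `open('hits.csv', 'w').write(...)`, where the file name is again only an example.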