---
title: "4. The Serotype Database API: Retrieving Reference Lists"
author: "William Lane MD, PhD, A(ACHI)"
format: 
  html:
    code-fold: false
    toc: true
    theme: cosmo
execute:
  warning: false
  message: false
---

## Getting Data Element Lists

While lesson 3 showed you how to get partial or full data dumps, sometimes it is helpful to get just the list of values for one specific data element. The Serotype Database API provides several endpoints to retrieve comprehensive lists of available data. Let's explore each one individually.

These lists can be particularly useful for:

- Validating input data against known valid values
- Building dropdown menus or autocomplete features in applications
- Understanding the complete scope of available data
- Cross-referencing between different naming conventions (e.g., serotypes vs alleles)

You can query each endpoint individually if you only need specific lists, or query all of them at once.

## 4.1 Setup Packages

```{r}
#| label: setup-packages

# Clear everything
rm(list = ls())

# Install required packages if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!requireNamespace("httr", quietly = TRUE)) install.packages("httr")
if (!requireNamespace("jsonlite", quietly = TRUE)) install.packages("jsonlite")
if (!requireNamespace("dotenv", quietly = TRUE)) install.packages("dotenv")

# Load packages
library(httr)
library(jsonlite)
library(dotenv)

# Load environment variables
load_dot_env()
```

## 4.2 Set API Key

To query the Serotype Database API, you will need an API key, which is available for free by signing up for an account at <https://www.serotype.org/user>.

If you are familiar with how to set R environment variables, you can save the key to your .env file as `SEROTYPE_API_KEY=YOUR_API_KEY`, replacing `YOUR_API_KEY` with your actual API key. This is considered best practice because it keeps your private API key separate from your code, enhancing security and making your code easier to share or collaborate on without exposing sensitive information.

Alternatively, you can set the value directly in the code block below by assigning your API key to `apiKeyOverride`. However, be cautious: if you choose to embed your API key in the code, ensure you remove it before sharing the file, as each user must use their own unique API key for security and proper functionality.

```{r}
# Check for the API key in environment variables
apiKey <- Sys.getenv("SEROTYPE_API_KEY", unset = NA)

# Allow manual override of the API key by user here
apiKeyOverride <- ""  # Set this to your API key manually if not using environment variables

# Use the override if provided, otherwise use the environment variable value
if (!is.null(apiKeyOverride) && nzchar(apiKeyOverride)) {
  apiKey <- apiKeyOverride
}
```
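
If neither the environment variable nor the override is set, `apiKey` stays `NA` and the requests below would go out without a usable key. A minimal guard, assuming you would rather stop early with a clear message, might look like this (the helper name `check_api_key` is hypothetical and not part of the API or any package):

```r
# Hypothetical helper: stop early when no usable API key is available.
check_api_key <- function(key) {
  # Reject NULL, vectors of the wrong length, NA, and empty strings
  if (is.null(key) || length(key) != 1 || is.na(key) || !nzchar(key)) {
    stop(
      "No API key found. Set SEROTYPE_API_KEY in your .env file ",
      "or assign your key to apiKeyOverride.",
      call. = FALSE
    )
  }
  # Return the key invisibly so the call can be used inline
  invisible(key)
}

# Intended use, right after the chunk above:
# check_api_key(apiKey)
```

Running the check once, before the first request, keeps a missing key from surfacing later as an empty or confusing response.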

## 4.3 Getting All HLA Loci

First, let's retrieve all available HLA loci:

```{r}
#| label: get-loci

# Define the API endpoint
url <- "https://serotype.org/api/graphql"

# Define the GraphQL query for loci
query_string <- '
query {
  getDataElementLists(
    fieldName: loci
  )
}
'

# Make the POST request
response <- POST(
  url,
  body = list(query = query_string),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)

# Parse the JSON response
loci_data <- fromJSON(content(response, "text"), flatten = TRUE)

# Display the loci
loci <- loci_data$data$getDataElementLists
print(loci)
```
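
Every query in this lesson follows the same pattern as the chunk above, so it can be wrapped in a small helper. The following is a sketch rather than part of the API: `build_list_query()` and `fetch_list()` are hypothetical names, and the optional `return_format` and `loci` arguments map onto the `returnFormat` and `loci` query arguments used in sections 4.4 to 4.7.

```r
# Hypothetical helper: build the GraphQL text for getDataElementLists.
build_list_query <- function(field_name, return_format = NULL, loci = NULL) {
  args <- paste0("fieldName: ", field_name)
  if (!is.null(loci)) {
    # Render an R character vector as a GraphQL list of strings, e.g. ["A", "B"]
    quoted <- paste0('"', loci, '"', collapse = ", ")
    args <- c(args, paste0("loci: [", quoted, "]"))
  }
  if (!is.null(return_format)) {
    args <- c(args, paste0("returnFormat: ", return_format))
  }
  paste0(
    "query {\n  getDataElementLists(\n    ",
    paste(args, collapse = "\n    "),
    "\n  )\n}"
  )
}

# Hypothetical helper: send the query and return the parsed list.
# Requires the httr and jsonlite packages loaded in section 4.1.
fetch_list <- function(field_name, api_key, ...,
                       url = "https://serotype.org/api/graphql") {
  response <- httr::POST(
    url,
    body = list(query = build_list_query(field_name, ...)),
    encode = "json",
    httr::add_headers(`x-api-key` = api_key)
  )
  # Turn HTTP-level failures (4xx/5xx) into R errors
  httr::stop_for_status(response)
  parsed <- jsonlite::fromJSON(
    httr::content(response, "text", encoding = "UTF-8"),
    flatten = TRUE
  )
  parsed$data$getDataElementLists
}

# Inspect the generated query text without contacting the API
cat(build_list_query("loci"))
```

With these in place, a call such as `fetch_list("loci", apiKey)` would stand in for the request-and-parse steps shown above.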

## 4.4 Getting All Serotypes

Next, let's get all available serotypes, in each of the two return formats.

First, get all serotypes organized by locus. This is the default return format, but it can also be specified explicitly by adding `returnFormat: structured` to the query.

```{r}
#| label: get-serotypes-structured

# Define the GraphQL query for serotypes
query_string <- '
query {
  getDataElementLists(
    fieldName: serotypes
    returnFormat: structured
  )
}
'

# Make the POST request
response <- POST(
  url,
  body = list(query = query_string),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)

# Parse the JSON response
serotypes_data <- fromJSON(content(response, "text"), flatten = TRUE)

# Display the first few serotypes
serotypes <- serotypes_data$data$getDataElementLists
head(serotypes)
```

Second, get all serotypes as a flat list. This can be specified by adding `returnFormat: flat` to the query.

```{r}
#| label: get-serotypes-flat

# Define the GraphQL query for serotypes
query_string <- '
query {
  getDataElementLists(
    fieldName: serotypes
    returnFormat: flat
  )
}
'

# Make the POST request
response <- POST(
  url,
  body = list(query = query_string),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)

# Parse the JSON response
serotypes_data <- fromJSON(content(response, "text"), flatten = TRUE)

# Display the first few serotypes
serotypes <- serotypes_data$data$getDataElementLists
head(serotypes)
```
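
If you already hold the structured result, you can derive the flat form locally instead of making a second request. The sketch below assumes the structured response parses into a named list with one character vector per locus; that shape is an assumption, and the sample values are made up for illustration rather than taken from the API.

```r
# Illustrative stand-in for a structured result (assumed shape: locus -> values)
structured <- list(
  A = c("A1", "A2"),
  B = c("B7", "B8")
)

# Flatten to a single character vector, dropping the locus names
flat <- unlist(structured, use.names = FALSE)
print(flat)

# Or keep the locus next to each value in a two-column data frame
by_locus <- stack(structured)
names(by_locus) <- c("serotype", "locus")
print(by_locus)

# Count how many values each locus contributes
print(lengths(structured))
```

Check the actual shape with `str(serotypes)` before relying on this, since the parsed structure depends on how the API nests its output.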

## 4.5 Getting All Antigens

Here's how to retrieve all antigens:

```{r}
#| label: get-antigens

# Define the GraphQL query for antigens
query_string <- '
query {
  getDataElementLists(
    fieldName: antigens
    returnFormat: flat
  )
}
'

# Make the POST request
response <- POST(
  url,
  body = list(query = query_string),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)

# Parse the JSON response
antigens_data <- fromJSON(content(response, "text"), flatten = TRUE)

# Display the first few antigens
antigens <- antigens_data$data$getDataElementLists
head(antigens)
```
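
The chunks in this lesson read `data$getDataElementLists` directly, which quietly yields `NULL` when a request fails. The GraphQL specification has servers report problems in a top-level `errors` list whose entries carry a `message`; the sketch below assumes this API follows that convention. `extract_list()` is a hypothetical helper name, and the sample responses are invented for illustration.

```r
# Hypothetical helper: return the list, or stop with the server's error text.
extract_list <- function(parsed) {
  errs <- parsed$errors
  if (!is.null(errs)) {
    # jsonlite may simplify errors to a data frame; otherwise it is a list
    msgs <- if (is.data.frame(errs)) {
      as.character(errs$message)
    } else {
      vapply(
        errs,
        function(e) if (is.null(e$message)) "unknown error" else as.character(e$message),
        character(1)
      )
    }
    stop("GraphQL error: ", paste(msgs, collapse = "; "), call. = FALSE)
  }
  result <- parsed$data$getDataElementLists
  if (is.null(result)) {
    stop("Response contained no getDataElementLists data.", call. = FALSE)
  }
  result
}

# Invented example of a successful parsed response
ok_response <- list(data = list(getDataElementLists = c("A", "B")))
print(extract_list(ok_response))
```

In the chunks above, `antigens <- extract_list(antigens_data)` would replace the direct `$data$getDataElementLists` access.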

## 4.6 Getting Two-Field Alleles for One Locus

This query returns alleles at two-field resolution for a single locus.
The `loci` filter keeps the response under the API's 5,000-row cap;
drop it and you'll get back a structured `RESPONSE_TOO_LARGE` error
pointing at the matching pre-built download instead.

```{r}
#| label: get-two-field

# Define the GraphQL query for two-field alleles for HLA-A
query_string <- '
query {
  getDataElementLists(
    fieldName: allelesTwoField
    loci: ["A"]
    returnFormat: flat
  )
}
'

# Make the POST request
response <- POST(
  url,
  body = list(query = query_string),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)

# Parse the JSON response
two_field_data <- fromJSON(content(response, "text"), flatten = TRUE)

# Display the first few two-field alleles
two_field_alleles <- two_field_data$data$getDataElementLists
head(two_field_alleles)
```
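
The exact shape of the `RESPONSE_TOO_LARGE` error is not shown in this lesson, so the sketch below makes a single assumption: that the string `RESPONSE_TOO_LARGE` appears somewhere inside the response's top-level `errors` field. The helper name `is_too_large()` and the sample responses are hypothetical.

```r
# Hypothetical helper: TRUE when the parsed response reports RESPONSE_TOO_LARGE.
is_too_large <- function(parsed) {
  errs <- parsed$errors
  if (is.null(errs)) {
    return(FALSE)
  }
  # Flatten whatever structure the errors have into one searchable string
  error_text <- paste(unlist(errs), collapse = " ")
  grepl("RESPONSE_TOO_LARGE", error_text, fixed = TRUE)
}

# Invented examples: one normal response and one oversized-request error
normal <- list(data = list(getDataElementLists = c("A*01:01", "A*01:02")))
oversized <- list(
  errors = list(list(
    message = "Too many rows requested",
    extensions = list(code = "RESPONSE_TOO_LARGE")
  ))
)

print(is_too_large(normal))
print(is_too_large(oversized))
```

A check like `if (is_too_large(two_field_data))` gives you a place to add the `loci` filter back or switch to the pre-built download.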

> To pull every locus's two-field alleles in one file, fetch the
> `variable_lists.json.gz` entry from the manifest at
> `meta$manifestUrl` — see lesson 3 for the auto-discovery pattern.

## 4.7 Getting Full-Field Alleles for One Locus

Same idea at full-field resolution for one locus:

```{r}
#| label: get-full-field

# Define the GraphQL query for full-field alleles for HLA-A
query_string <- '
query {
  getDataElementLists(
    fieldName: allelesFullField
    loci: ["A"]
    returnFormat: flat
  )
}
'

# Make the POST request
response <- POST(
  url,
  body = list(query = query_string),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)

# Parse the JSON response
full_field_data <- fromJSON(content(response, "text"), flatten = TRUE)

# Display the first few full-field alleles
full_field_alleles <- full_field_data$data$getDataElementLists
head(full_field_alleles)
```
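
HLA allele names are built from colon-separated fields, so a full-field name can be cut back to its first two fields to cross-reference the lists from 4.6 and 4.7. The sketch below is an illustration under assumptions: `to_two_field()` is a hypothetical helper, the sample names are typed in by hand rather than returned by the API, and expression suffixes on longer names (such as a trailing `N`) are simply dropped along with the extra fields.

```r
# Hypothetical helper: truncate allele names to their first two fields.
to_two_field <- function(alleles) {
  # Keep the text up to the end of the second colon-delimited field;
  # names with fewer than two fields are returned unchanged
  sub("^([^:]+:[^:]+).*$", "\\1", alleles)
}

# Hand-written sample names for illustration
full_field_sample <- c("A*01:01:01:01", "A*01:01:01:02", "A*02:01")

collapsed <- to_two_field(full_field_sample)
print(collapsed)

# Distinct two-field names represented in the sample
print(unique(collapsed))
```

Comparing `unique(to_two_field(full_field_alleles))` with `two_field_alleles` via `setdiff()` is one way to spot names that appear in only one of the two lists.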

> Same caveat as 4.6: drop `loci` only if you really want every locus,
> in which case use the manifest-driven pattern from lesson 3.

Each endpoint returns a comprehensive list of values that can be used in other API queries or for data validation purposes.
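
As a sketch of the validation use case, the snippet below checks user-supplied values against a reference list. The helper name `validate_values()` is hypothetical, and `known_loci` is a hand-written stand-in for a list you would retrieve from the API (for example, the `loci` result from 4.3).

```r
# Hypothetical helper: flag which input values appear in a reference list.
validate_values <- function(values, reference) {
  # Trim stray whitespace before comparing
  cleaned <- trimws(values)
  data.frame(
    value = cleaned,
    valid = cleaned %in% reference,
    stringsAsFactors = FALSE
  )
}

# Hand-written stand-in for a reference list fetched from the API
known_loci <- c("A", "B", "C", "DRB1")

result <- validate_values(c("A", " DRB1", "XYZ"), known_loci)
print(result)

# Values that failed validation
print(result$value[!result$valid])
```

The comparison here is exact and case-sensitive, which suits controlled vocabularies; normalize case first if your input is free text.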