---
title: "5. The Serotype Database API: Exploring Antigen Relationships"
author: "William Lane MD, PhD, A(ACHI)"
format: 
  html:
    code-fold: false
    toc: true
    theme: cosmo
execute:
  warning: false
  message: false
---

# Exploring HLA Antigen Relationships

In this document, we'll explore the relationships between broad antigens, antigens, and serotypes using the Serotype Database API.

## 5.1 Setup Packages

```{r}
#| label: setup-packages

# Clear everything
rm(list = ls())

# Install required packages if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!requireNamespace("httr", quietly = TRUE)) install.packages("httr")
if (!requireNamespace("jsonlite", quietly = TRUE)) install.packages("jsonlite")
if (!requireNamespace("conflicted", quietly = TRUE)) install.packages("conflicted")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("knitr", quietly = TRUE)) install.packages("knitr")
if (!requireNamespace("kableExtra", quietly = TRUE)) install.packages("kableExtra")
if (!requireNamespace("ggalluvial", quietly = TRUE)) install.packages("ggalluvial")
if (!requireNamespace("dotenv", quietly = TRUE)) install.packages("dotenv")

# Load packages
library(httr)
library(jsonlite)
library(conflicted)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(ggalluvial)
library(dotenv)

# Load environment variables
load_dot_env()

# Resolve conflicts
conflict_prefer("filter", "dplyr")
```

## 5.2 Set API Key

To query the Serotype Database API, you will need an API key. Keys are available for free by signing up for an account at <https://www.serotype.org/user>.

If you are familiar with how to set R environment variables, you can save the key to your .env file as `SEROTYPE_API_KEY=YOUR_API_KEY`, replacing `YOUR_API_KEY` with your actual API key. This is considered best practice because it keeps your private API key separate from your code, enhancing security and making your code easier to share or collaborate on without exposing sensitive information.

Alternatively, you can set the value directly in the code block below by assigning your API key to `apiKeyOverride`. However, be cautious: if you choose to embed your API key in the code, ensure you remove it before sharing the file, as each user must use their own unique API key for security and proper functionality.

```{r}
# Check for the API key in environment variables
apiKey <- Sys.getenv("SEROTYPE_API_KEY", unset = NA)

# Allow manual override of the API key by user here
apiKeyOverride <- ""  # Set this to your API key manually if not using environment variables

# Use the override if provided, otherwise use the environment variable value
if (!is.null(apiKeyOverride) && nzchar(apiKeyOverride)) {
  apiKey <- apiKeyOverride
}
```
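If neither the environment variable nor the override is set, the request later in this document is sent without a usable key. A minimal sketch of an early check is shown below; `resolve_api_key()` is a hypothetical helper name introduced here for illustration and is not part of the original analysis.

```r
# Hypothetical helper: pick the API key from an override or an environment
# variable, and stop with a clear message if neither is set.
resolve_api_key <- function(override = "", env_var = "SEROTYPE_API_KEY") {
  key <- if (nzchar(override)) override else Sys.getenv(env_var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found: set ", env_var,
         " in your .env file or pass an override.", call. = FALSE)
  }
  key
}

# Usage with a placeholder value standing in for a real key
example_key <- resolve_api_key(override = "example-key-not-real")
```

Failing at this point tends to give a clearer message than an authentication failure from the server later on.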

## 5.3 Query HLA-A Antigen Data

We'll query the API to get information about HLA-A antigens and their relationships:

```{r}
#| label: query-antigens
url <- "http://serotype.org/api/graphql"

query_antigens <- '
query {
  antigenToSerotype(
    loci: ["A"]
  ) {
    locus
    serotype
    antigen
    broadAntigen
  }
}
'

resp_antigens <- POST(
  url,
  body = list(query = query_antigens),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)
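stop_for_status(resp_antigens)  # stop early if the server returned an HTTP error
df_antigens <- fromJSON(
  content(resp_antigens, "text", encoding = "UTF-8"),
  flatten = TRUE
)$data$antigenToSerotype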

cat("Number of HLA-A antigen records:", nrow(df_antigens), "\n")
kable(df_antigens, "html", caption = "HLA-A Antigen Relationships") %>% 
  kable_styling() %>% 
  scroll_box(height = "300px")
```
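A GraphQL server can return HTTP 200 and still report a problem in an `errors` array in the response body. The sketch below shows one way to check for that before using the data. `extract_graphql_field()` is a hypothetical helper, and the response body is made up (the serotype values `S1` and `S2` are placeholders, not real database values) so that the example runs offline.

```r
library(jsonlite)

# Hypothetical helper: parse a GraphQL response body (JSON text) and return
# one field of `data`, stopping with the server's messages if `errors` is set.
extract_graphql_field <- function(json_text, field) {
  parsed <- fromJSON(json_text, flatten = TRUE)
  if (!is.null(parsed$errors)) {
    stop("GraphQL error: ",
         paste(parsed$errors$message, collapse = "; "), call. = FALSE)
  }
  result <- parsed$data[[field]]
  if (is.null(result)) {
    stop("Field '", field, "' missing from response data.", call. = FALSE)
  }
  result
}

# Made-up response body, used only to exercise the helper
toy_body <- '{"data": {"antigenToSerotype": [
  {"locus": "A", "serotype": "S1", "antigen": "A23", "broadAntigen": "A9"},
  {"locus": "A", "serotype": "S2", "antigen": "A24", "broadAntigen": "A9"}
]}}'
toy_df <- extract_graphql_field(toy_body, "antigenToSerotype")
nrow(toy_df)
```

With a live response, the text passed in would be `content(resp_antigens, "text", encoding = "UTF-8")`.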

## 5.4 Analyze Antigen Distribution

Let's look at the distribution of antigens across broad antigen groups:

```{r}
#| label: antigen-distribution
antigen_counts <- df_antigens %>%
  group_by(broadAntigen) %>%
  summarize(
    num_antigens = n_distinct(antigen),
    antigens = paste(sort(unique(antigen)), collapse = ", ")
  ) %>%
  arrange(desc(num_antigens))

kable(antigen_counts, "html", 
      caption = "Distribution of Antigens by Broad Antigen Group") %>%
  kable_styling() %>%
  scroll_box(height = "300px")
```
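To make the summary above concrete, here is the same pipeline on a small made-up data frame. The rows are illustrative only. In particular, representing an antigen without a broad parent as `NA` is an assumption about the data, not something confirmed by the API.

```r
library(dplyr)

# Made-up rows for illustration; real values come from the API response
toy <- data.frame(
  broadAntigen = c("A9", "A9", "A9", NA),
  antigen      = c("A23", "A24", "A24", "A1"),
  serotype     = c("S1", "S2", "S3", "S4")
)

toy_counts <- toy %>%
  group_by(broadAntigen) %>%
  summarize(
    # n_distinct() counts A24 once even though it appears in two rows
    num_antigens = n_distinct(antigen),
    antigens = paste(sort(unique(antigen)), collapse = ", ")
  ) %>%
  arrange(desc(num_antigens))

toy_counts
```

Note that `group_by()` keeps `NA` as its own group, so records without a broad antigen would appear as a separate row in the table rather than being dropped.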

## 5.5 Visualize Antigen Relationships

Create an alluvial diagram to show the flow between broad antigens, antigens, and serotypes:

```{r}
#| label: visualize-relationships
#| fig.height: 12
#| fig.width: 16

# Prepare data for alluvial diagram
alluvial_data <- df_antigens %>%
  select(broadAntigen, antigen, serotype) %>%
  distinct() %>%
  # After distinct(), each combination occurs exactly once, so n is 1 for
  # every row and all flows are drawn with equal width
  group_by(broadAntigen, antigen, serotype) %>%
  summarise(n = n(), .groups = 'drop')

# Create alluvial plot
ggplot(alluvial_data,
       aes(y = n, axis1 = broadAntigen, axis2 = antigen, axis3 = serotype)) +
  geom_alluvium(aes(fill = broadAntigen), width = 1/3) +
  geom_stratum(width = 1/3, fill = "white", color = "black") +
  geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_continuous(breaks = 1:3, labels = c("Broad Antigen", "Antigen", "Serotype")) +
  labs(title = "HLA-A Antigen Relationships",
       subtitle = "Flow diagram showing relationships between broad antigens, antigens, and serotypes",
       y = "Count",
       fill = "Broad Antigen") +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 12),
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    legend.position = "right",
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank()
  )
```
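The `y = n` aesthetic controls flow width, so it helps to see how `distinct()` affects `n`. The sketch below uses made-up rows (with placeholder serotype values) to compare counting with and without the deduplication step. Whether the real API response contains duplicate rows is not known here.

```r
library(dplyr)

# Made-up rows; the last two are exact duplicates
toy <- data.frame(
  broadAntigen = c("A9", "A9", "A9"),
  antigen      = c("A23", "A24", "A24"),
  serotype     = c("S1", "S2", "S2")
)

# Deduplicate first: every remaining combination is counted once
with_distinct <- toy %>%
  distinct() %>%
  count(broadAntigen, antigen, serotype, name = "n")

# Count directly: n reflects how often each combination occurs
without_distinct <- toy %>%
  count(broadAntigen, antigen, serotype, name = "n")

with_distinct$n
without_distinct$n
```

Keeping `distinct()` gives every flow the same width, which suits a diagram of structure. Dropping it would weight flows by record count instead.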

```{r}
#| label: antigen-counts
#| fig.height: 8
#| fig.width: 12

# Calculate counts at each level
level_counts <- bind_rows(
  df_antigens %>% count(broadAntigen, name = "count") %>% mutate(level = "Broad Antigen"),
  df_antigens %>% count(antigen, name = "count") %>% mutate(level = "Antigen"),
  df_antigens %>% count(serotype, name = "count") %>% mutate(level = "Serotype")
)

# Plot the counts
ggplot(level_counts, aes(x = level, y = count)) +
  geom_boxplot() +
  geom_jitter(aes(color = level), width = 0.2, alpha = 0.6) +
  labs(title = "Distribution of Counts at Each Level",
       y = "Count",
       x = "Level") +
  theme_minimal()
```

## 5.6 Analysis Summary

The alluvial diagram above visualizes the hierarchical relationships in HLA-A antigens by showing:

1. The flow from broad antigens (left) to specific antigens (middle) to serotypes (right)
2. The relative size of each group through the width of the flows
3. The splitting and merging patterns that represent the complexity of HLA relationships

Key observations:

- Most broad antigens split into multiple specific antigens
- A few antigens maintain their designation as serotypes, but most subdivide into more specific serotypes
- The distribution shows varying levels of complexity across different broad antigen groups

This visualization helps in understanding:

- The hierarchical nature of HLA nomenclature
- The relationships between different levels of antigen classification
- The relative frequency of different antigen groups
- Patterns of serological cross-reactivity