---
title: "3. The Serotype Database API: Working with Large Datasets"
author: "William Lane MD, PhD, A(ACHI)"
format:
  html:
    code-fold: false
    toc: true
    theme: cosmo
execute:
  warning: false
  message: false
---

# Three Patterns for Bulk Data

The API caps single responses at **5,000 rows** to keep latency
predictable and to spare the database from accidental full-table
scans. There are three good ways to deal with this when your work
actually needs more than 5,000 rows:

1. **Narrow the query** so it stays under the cap (add a `loci`
   filter, an antigen filter, etc.). Fast and simple when your
   analysis is scoped to a single locus or antigen.
2. **Let the API tell you where to go.** Issue the big query you
   want; if the cap rejects it, parse the structured error response
   to get a `downloadUrl`, then fetch from that. One code path
   covers both small and large queries.
3. **Read the manifest directly.** Every HATS version ships a
   `manifest.json` listing all pre-built per-locus and bulk files.
   Fetch it, pick what you need, fetch the file. Most explicit and
   fully version-agnostic — your code keeps working when upstream
   ships a new HATS release.

We'll show all three below. None of this notebook hard-codes a
version string — the auto-discovery pattern in section 3.2 makes
everything else work for whatever's current.

## 3.1 Setup Packages

```{r}
#| label: setup-packages

# Clear everything
rm(list = ls())

# Install required packages if not already installed
options(repos = c(CRAN = "https://cloud.r-project.org"))
if (!requireNamespace("httr", quietly = TRUE)) install.packages("httr")
if (!requireNamespace("jsonlite", quietly = TRUE)) install.packages("jsonlite")
if (!requireNamespace("conflicted", quietly = TRUE)) install.packages("conflicted")
if (!requireNamespace("dplyr", quietly = TRUE)) install.packages("dplyr")
if (!requireNamespace("tidyr", quietly = TRUE)) install.packages("tidyr")
if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
if (!requireNamespace("knitr", quietly = TRUE)) install.packages("knitr")
if (!requireNamespace("kableExtra", quietly = TRUE)) install.packages("kableExtra")
if (!requireNamespace("dotenv", quietly = TRUE)) install.packages("dotenv")

# Load packages
library(httr)
library(jsonlite)
library(conflicted)
library(dplyr)
library(tidyr)
library(ggplot2)
library(knitr)
library(kableExtra)
library(dotenv)

# Load environment variables (only needed for approaches 1 and 2 which
# use the API and therefore need a key; approach 3 is fully anonymous).
load_dot_env()

conflict_prefer("filter", "dplyr")

# API key — needed only for approaches 1 and 2. Same convention as the
# other lessons: read from .env (SEROTYPE_API_KEY=...) or set
# apiKeyOverride below.
apiKey <- Sys.getenv("SEROTYPE_API_KEY", unset = NA)
apiKeyOverride <- ""
if (!is.null(apiKeyOverride) && nzchar(apiKeyOverride)) {
  apiKey <- apiKeyOverride
}

graphql_url <- "https://www.serotype.org/api/graphql"
```

## 3.2 Auto-Discover the Current Version and Manifest

`/api/latest-version` is a public, no-auth REST endpoint that
returns the current HATS version plus the URLs to the matching
pre-built export bundle. From there we can read the manifest, which
lists every available file with its scope (per-locus, bulk,
serotype_definitions, variable_lists), locus, resolution, format,
and row count.

This is the only place the URL `https://www.serotype.org` appears
in the notebook. Everything else flows from `meta` and `manifest`.

```{r}
#| label: discover

meta <- jsonlite::fromJSON("https://www.serotype.org/api/latest-version")
cat("Current HATS version:", meta$version, "\n")
cat("Imported into the database:", meta$lastImportedAt, "\n")
cat("Upstream commit:", meta$upstreamCommitSha, "\n")

manifest <- jsonlite::fromJSON(meta$manifestUrl)
cat("Manifest lists", nrow(manifest$files),
    "files for version", manifest$version, "\n\n")

# Peek at the manifest's shape
head(manifest$files[, c("path", "scope", "locus", "resolution",
                        "format", "rows", "size")])
```

## 3.3 Approach 1 — Narrow the Query

If you only need one locus, ask for one locus. The response stays
well under the cap, and the API pipeline is the right tool — no
download juggling needed.

```{r}
#| label: approach-1-narrow
query_one_locus <- '
query {
  alleleToSerotype(loci: ["A"], resolution: two_field) {
    locus
    allele
    score
    serotype
    antigen
    broadAntigen
    bw4_bw6
  }
}'

resp <- POST(
  graphql_url,
  body = list(query = query_one_locus),
  encode = "json",
  add_headers(`x-api-key` = apiKey)
)
df_A <- fromJSON(content(resp, "text"), flatten = TRUE)$data$alleleToSerotype
cat("Rows from one-locus query (HLA-A, two_field):", nrow(df_A), "\n")
head(df_A, 5)
```

Same idea with serotype, antigen, or specific allele filters — any
filter that takes the response below 5,000 rows works.

## 3.4 Approach 2 — Let the API Tell You Where the File Lives

Sometimes you don't know up front whether your query is small or
big. Or you want the same code to work for both. Make the call; if
the API rejects with `RESPONSE_TOO_LARGE`, the error includes a
`downloadUrl` pointing at the matching pre-built file. Catch it,
fetch from there, return the data. The caller doesn't have to know.

```{r}
#| label: approach-2-cap-helper

# Helper: run a GraphQL query and return its data field. On cap
# rejection, transparently switch to fetching the matching gzipped
# JSON file from object storage. The returned data frame has the
# same column set in either branch (the export drops a few internal
# fields like `version` that the API includes — see the note below).
#
# We parse the response twice: once with simplifyVector=FALSE so the
# error structure is a predictable nested list (jsonlite would
# otherwise collapse `errors` into a data frame and the $extensions
# walk below would break), and once with the default simplification
# so the data path returns a proper data frame.
fetch_or_download <- function(query, apiKey, op_name = "alleleToSerotype") {
  resp <- POST(graphql_url, body = list(query = query), encode = "json",
               add_headers(`x-api-key` = apiKey))
  raw <- content(resp, "text")

  parsed_raw <- fromJSON(raw, simplifyVector = FALSE)
  errs <- parsed_raw$errors
  if (!is.null(errs) && length(errs) > 0) {
    first_ext <- errs[[1]]$extensions
    if (!is.null(first_ext) &&
        identical(first_ext$code, "RESPONSE_TOO_LARGE")) {
      dl <- first_ext$downloadUrl
      message("Cap hit; switching to download: ", dl)
      return(fromJSON(gzcon(url(dl))))
    }
    stop("GraphQL error: ", errs[[1]]$message)
  }

  parsed_df <- fromJSON(raw, flatten = TRUE)
  parsed_df$data[[op_name]]
}

# Same query that fails in section 3.5 of the old version of this
# lesson — two loci at two-field, ~11,000 rows.
big_query <- '
query {
  alleleToSerotype(loci: ["A","B"], resolution: two_field) {
    locus
    allele
    score
    serotype
    antigen
    broadAntigen
    bw4_bw6
  }
}'

df_AB <- fetch_or_download(big_query, apiKey)
cat("Rows after auto-switch to download:", nrow(df_AB), "\n")
```

> **Note on shape**: the API returns rows with a `version` column;
> the export file omits it (the version is in the manifest, not on
> every row). If you rely on `version` being present, add it back
> after fetching (e.g., `df$version <- meta$version`).

## 3.5 Approach 3 — Read the Manifest, Pick the File

When you already know you want bulk data, skip the API entirely.
Filter the manifest's `files` table for the slice you want, build
the URL from `downloadBaseUrl` and the entry's `path`, and fetch.

```{r}
#| label: approach-3-manifest

# Bulk full-field, JSON format
bulk_full <- subset(manifest$files,
                    scope == "bulk" & resolution == "full_field" & format == "json")
stopifnot(nrow(bulk_full) == 1)

file_url <- file.path(meta$downloadBaseUrl, bulk_full$path)
cat("Fetching:", file_url, "\n")
df_all_full <- jsonlite::fromJSON(gzcon(url(file_url)))
cat("Rows:", nrow(df_all_full), "\n")

# Same thing for bulk two-field
bulk_two <- subset(manifest$files,
                   scope == "bulk" & resolution == "two_field" & format == "json")
file_url_two <- file.path(meta$downloadBaseUrl, bulk_two$path)
df_all_two <- jsonlite::fromJSON(gzcon(url(file_url_two)))
cat("Two-field rows:", nrow(df_all_two), "\n")
```

CSV works the same way — pick `format == "csv"` from the manifest
and use `readr::read_csv(file_url)` which handles the `.gz`
extension natively. Same for `format == "tsv"` with `read_tsv()`.

## 3.6 Quick Analysis Example

With both bulk datasets in hand, compare row counts per locus
across resolutions:

```{r}
#| label: quick-analysis
count_two_field <- df_all_two %>%
  group_by(locus) %>%
  tally() %>%
  mutate(resolution = "two_field")

count_full_field <- df_all_full %>%
  group_by(locus) %>%
  tally() %>%
  mutate(resolution = "full_field")

df_counts <- bind_rows(count_two_field, count_full_field)

ggplot(df_counts, aes(x = locus, y = n, fill = resolution)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = paste0("Record Counts by Locus and Resolution (HATS ", meta$version, ")"),
    x = "Locus",
    y = "Count"
  ) +
  theme_minimal()
```

## 3.7 Which Approach Should I Use?

| Situation | Use |
|---|---|
| Working with a specific locus or filter | Approach 1 — narrow query, no extra plumbing. |
| Library or tool that needs to handle both small and large inputs uniformly | Approach 2 — the `fetch_or_download` helper. |
| Pipeline that always wants bulk data, or you want to skip the API entirely | Approach 3 — read the manifest, fetch the file. |

The browser-friendly view of what's available is at
[`/downloads`](https://www.serotype.org/downloads) — same files,
just rendered as a table.
