Statistical models for source attribution of zoonotic diseases

10 December 2019

New Zealand campylobacteriosis cases

Manawatu sentinel surveillance site

Where are people getting it from?

MLST distribution of human cases

MLSTs are source specific

Assume human types are a mix of source types

Adding statistics

\[ P(\mathsf{Human~cases}) = \prod_h P(\mathsf{st}_h) \]

Adding statistics

\[ P(\mathsf{st}_h) = \sum_j P(\mathsf{st}_h~\mathsf{from~source}_j) P(\mathsf{source}_j) \]

Adding statistics

\[ P(\mathsf{st}_h) = \sum_j \underbrace{P(\mathsf{st}_h~\mathsf{from~source}_j)}_\text{genomic model} P(\mathsf{source}_j) \]

Adding statistics

\[ P(\mathsf{st}_h) = \sum_j \underbrace{P(\mathsf{st}_h~\mathsf{from~source}_j)}_\text{genomic model} \underbrace{P(\mathsf{source}_j)}_\text{attribution to source} \]
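As a minimal sketch of the mixture above, the snippet below evaluates \(P(\mathsf{st}_h)\) and the log of \(P(\mathsf{Human~cases})\) for a toy problem. All numbers, and the names `p_st_given_source`, `p_source` and `human_types`, are made up for illustration; in practice the first would come from a genomic model and the second would be estimated.

```python
import numpy as np

# Hypothetical inputs: 3 sequence types (rows) by 2 sources (columns).
# p_st_given_source[i, j] = P(st_i from source_j), as supplied by a genomic model.
p_st_given_source = np.array([
    [0.70, 0.10],
    [0.20, 0.30],
    [0.10, 0.60],
])

# P(source_j): attribution probabilities, which must sum to 1.
p_source = np.array([0.6, 0.4])

# Sequence type index of each human case (made-up data).
human_types = np.array([0, 0, 2, 1, 2])

# P(st_h) = sum_j P(st_h from source_j) P(source_j), for every type at once.
p_st = p_st_given_source @ p_source

# log P(Human cases) = sum_h log P(st_h); logs avoid underflow in the product.
log_lik = np.log(p_st[human_types]).sum()
print(p_st, log_lik)
```

Working on the log scale is a design choice: the product over human cases underflows quickly once there are more than a few hundred cases.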

Genomic models

Asymmetric Island model

D. Wilson (2009)

Assume that genotypes arise from two or more homogeneous mixing populations where we have

Mutation, where novel alleles are produced.
Recombination, where the allele at a given locus has been observed before, but not in this allelic profile (i.e. the alleles come from at least two different genotypes).
Migration between sources of genotypes and alleles.

Asymmetric Island model

We model \(P(\mathsf{st}_h~\mathsf{from~source}_j)\) via:

\[ P(\mathsf{st}_h \mid j,X) = \sum_{c\in X} \frac{M_{S_cj}}{N_{S_c}} \prod_{l=1}^7 \left\{\begin{array}{ll} \mu_j & \text{if $\mathsf{st}_h^{l}$ is novel,}\\ (1-\mu_j)R_j\sum_{k=1}^J M_{kj}f^l_{\mathsf{st}_h^lk} & \text{if $\mathsf{st}_h^{l}\neq c^l$}\\ (1-\mu_j)\left[1 - R_j(1 - \sum_{k=1}^J M_{kj}f^l_{\mathsf{st}_h^lk})\right] & \text{if $\mathsf{st}_h^{l}=c^l$} \end{array} \right. \]

\(c \in X\) are candidate sequences from which \(\mathsf{st}_h\) evolved (all sequences other than \(\mathsf{st}_h\)).
\(S_c\) is the source where candidate sequence \(c\) was observed.
\(N_{S_c}\) is the number of types observed on source \(S_c\).
\(\mu_j\) is the probability of a novel mutant allele from source \(j\).
\(R_j\) is the probability that a type has undergone recombination in source \(j\).
\(M_{kj}\) is the probability of an allele migrating from source \(k\) to \(j\).
\(f^l_{ak}\) is the frequency with which allele \(a\) has been observed at locus \(l\) in those genotypes sampled from source \(k\).
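The formula can be sketched in code as follows. This is an illustration under stated assumptions rather than the published implementation: an allele is treated as novel when it has not been observed at that locus on any source, \(N_{S_c}\) is taken as the number of candidates on source \(S_c\), the number of loci is taken from the profile length rather than fixed at 7, and the function name, argument layout and toy data are all hypothetical.

```python
import numpy as np

def island_type_prob(st, j, candidates, cand_source, mu, R, M, f):
    """P(st | j, X) for one sequence type under the formula above.

    st          : tuple of allele ids for the type being scored
    j           : index of the source being scored
    candidates  : list of allele-id tuples, the candidate types c in X
    cand_source : source index S_c for each candidate
    mu, R       : per-source mutation and recombination probabilities
    M           : M[k, j] = probability of migration from source k to j
    f           : f[l][k] = dict mapping allele id -> frequency at locus l on source k
    """
    n_sources = len(mu)
    n_per_source = np.bincount(cand_source, minlength=n_sources)  # N_{S_c}
    total = 0.0
    for c, s_c in zip(candidates, cand_source):
        prod = 1.0
        for l, allele in enumerate(st):
            # Migration-weighted frequency of this allele: sum_k M_kj f^l_{ak}.
            freq = sum(M[k, j] * f[l][k].get(allele, 0.0) for k in range(n_sources))
            if all(allele not in f[l][k] for k in range(n_sources)):
                prod *= mu[j]  # assumption: novel = unseen on every source
            elif allele != c[l]:
                prod *= (1 - mu[j]) * R[j] * freq  # differs from candidate
            else:
                prod *= (1 - mu[j]) * (1 - R[j] * (1 - freq))  # matches candidate
        total += M[s_c, j] / n_per_source[s_c] * prod
    return total

# Toy example with 2 sources and 2 loci (all values made up).
mu = np.array([0.01, 0.02])
R = np.array([0.1, 0.2])
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])  # each column sums to 1
candidates = [(1, 1), (1, 2), (2, 2)]
cand_source = np.array([0, 0, 1])
f = [
    [{1: 1.0}, {2: 1.0}],          # locus 0: source 0, source 1
    [{1: 0.5, 2: 0.5}, {2: 1.0}],  # locus 1: source 0, source 1
]
p = island_type_prob((2, 1), 0, candidates, cand_source, mu, R, M, f)
print(p)
```

A useful sanity check on the sketch: a type whose alleles are all novel gets probability \(\mu_j^L\) for \(L\) loci, because the candidate weights \(M_{S_cj}/N_{S_c}\) sum to 1 when the columns of \(M\) do.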

Asymmetric Island model

Use the source isolates to estimate the unknown parameters:

The mutation probabilities \(\mu_j\)
Recombination probabilities \(R_j\)
Migration probabilities \(M_{kj}\)

using a leave-one-out pseudo-likelihood MCMC algorithm.

Once these are estimated, we use all source sequences as candidates and compute \(P(\mathsf{st}_h~\mathsf{from~source}_j)\) for every observed human sequence type, including types never observed on any source.

Dirichlet model

Before we collect data, assume each type is equally likely for each source.

Dirichlet model

Then add the observed counts.

Dirichlet model

And convert to proportions.

Dirichlet model

Get uncertainty directly.

Dirichlet model

S.J. Liao (2019)

The prior and data model are:

\[ \begin{aligned} \mathbf{\pi}_j &\sim \mathsf{Dirichlet}(\mathbf{\alpha}_j)\\ \mathbf{X}_{j} &\sim \mathsf{Multinomial}(n_j, \mathbf{\pi}_j) \end{aligned} \]

so that the posterior is \[ \mathbf{\pi}_{j} \mid \mathbf{X}_j \sim \mathsf{Dirichlet}(\mathbf{X}_j + \mathbf{\alpha}_j) \]

where \(\mathbf{\pi}_j\) is the genotype distribution, \(\mathbf{X}_j\) are the observed counts, and \(\mathbf{\alpha}_j\) is the vector of prior parameters for source \(j\).
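The conjugate update is short enough to sketch directly. The counts below are invented, and the prior of one pseudo-count per type is an assumption standing in for "each type is equally likely"; the point is only that posterior draws give the uncertainty without any MCMC.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical counts X_j of four sequence types on one source.
counts = np.array([12, 5, 0, 3])

# Prior alpha_j: one pseudo-count per type (assumed), so all types are equally likely a priori.
alpha = np.ones_like(counts, dtype=float)

# Posterior is Dirichlet(X_j + alpha_j); its mean is the normalised parameter vector.
post_alpha = counts + alpha
post_mean = post_alpha / post_alpha.sum()

# Posterior draws give the uncertainty directly, e.g. 95% credible intervals per type.
draws = rng.dirichlet(post_alpha, size=5000)
lower, upper = np.percentile(draws, [2.5, 97.5], axis=0)

print(post_mean)
print(lower, upper)
```

Note that the type with zero observed count still gets a small positive posterior probability, which comes entirely from the prior.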

Genotype distributions

Attribution results

Adding covariates

MLSTs differ by rurality

Attribution with covariates

\[ P(\mathsf{st}_h \mid \underbrace{\mathbf{x}_h}_\text{covariates}) = \sum_j \underbrace{P(\mathsf{st}_h~\mathsf{from~source}_j)}_\text{genomic model} \underbrace{P(\mathsf{source}_j \mid \mathbf{x}_h)}_\text{attribution with covariates} \]

Attribution with covariates

For each source \(j\) we have a linear model on the logit scale for attribution: \[ \begin{aligned} \eta_{hj} &= \alpha_j + \beta_{j1} x_{1h} + \cdots + \beta_{jp} x_{ph}\\ P(\mathsf{source}_j \mid \mathbf{x}_h) &= \frac{\exp(\eta_{hj})}{\sum_k \exp(\eta_{hk})} \end{aligned} \]
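A minimal sketch of this attribution step, assuming a single rurality covariate and two sources: the function name `attribution_probs`, the coefficient values, and the choice to pin the first source's coefficients at zero for identifiability are all illustrative assumptions, not the fitted model.

```python
import numpy as np

def attribution_probs(x, alpha, beta):
    """P(source_j | x_h) for each row x_h of x, via a softmax over sources.

    x     : (n_cases, p) covariate matrix
    alpha : (J,) intercepts
    beta  : (J, p) slopes
    """
    eta = alpha + x @ beta.T               # eta[h, j], the linear predictor
    eta -= eta.max(axis=1, keepdims=True)  # stabilise; softmax is shift-invariant
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

# Hypothetical rurality score from -3 (highly urban) to 3 (highly rural).
rurality = np.arange(-3, 4, dtype=float).reshape(-1, 1)

# Column 0: reference source with coefficients fixed at 0; column 1: a second
# source whose attribution rises with rurality (made-up slope).
alpha = np.array([0.0, 0.0])
beta = np.array([[0.0], [0.5]])

probs = attribution_probs(rurality, alpha, beta)
print(probs)
```

Each row of `probs` is the attribution for one covariate pattern, so cases sharing the same covariates share the same attribution probabilities.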

The covariates can then be anything we like for each human case, and there’s a separate attribution probability for each unique covariate pattern in the data.

Covariates: Rurality, Age, Season

We’ll present four different covariate models:

Seasons by urban/rural split.
Rurality as 7 separate categories.
Rurality as a linear trend (on logit scale).
Separate linear trends for cases under 5 years old versus cases older than 5 (2008 onwards).

Results

Seasonality

Urban/Rural scales

Young kids

Summary

The Island model can attribute genotypes that were never observed on any source, for which the Dirichlet model has no information, but few human cases are of this type.
The simple Dirichlet model seems to do just as well as the Island model.
Urban cases tend to be more associated with poultry, and rural cases with ruminants, with a linear relationship between attribution and rurality being adequate.
Kids under 5 in rural areas are more likely to have ruminant-associated campylobacteriosis than over-5s.
Strong evidence for seasonality of attribution, with a higher proportion of poultry-related cases in summer and a strong ruminant association in rural areas in winter and spring.

What if we have more genes?

With higher genomic precision, the counts used for the Dirichlet model will eventually drop to 1 of each as each isolate is likely unique.
Worse, the human isolates will differ from all source isolates, so their source counts will be 0 and we lose all ability to discriminate.
In theory the Island model should still work.
But it… doesn’t :(

Island model attribution with more genes

Why does this happen?

The island likelihood for each type is complicated, but with \(g\) genes it is approximately \[ p(\mathsf{st} \mid m) \approx m^d (1-m)^{g-d} \] where \(m\) is a combination of mutation and recombination probabilities, and \(d\) is the average number of loci that differ between isolate pairs.

This is maximised when \(m = d/g\). Substituting \(d = mg\) gives \[ p(\mathsf{st} \mid m) \approx f(m)^g \] where \[ f(m) = m^m (1-m)^{1-m} \]

Why does this happen?

Suppose we have two sources with \(m_1=0.02\) and \(m_2=0.03\).

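Because \(f(0.02) > f(0.03)\), the factor \(f(m)^g\) increasingly favours source 1 as \(g\) grows, so the island model just ends up assigning everything to the lower-diversity source.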
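To see how quickly this happens, the sketch below evaluates the ratio \(f(m_1)^g / f(m_2)^g\) for a few gene counts. It uses only the approximation above; the gene counts 7, 100 and 1000 and the function names are chosen for illustration.

```python
import numpy as np

def log_f(m):
    """log f(m), with f(m) = m^m (1 - m)^(1 - m)."""
    return m * np.log(m) + (1 - m) * np.log1p(-m)

def likelihood_ratio(g, m1=0.02, m2=0.03):
    """Ratio f(m1)^g / f(m2)^g for a typical type with g genes."""
    return np.exp(g * (log_f(m1) - log_f(m2)))

for g in (7, 100, 1000):
    print(g, likelihood_ratio(g))
```

With 7 genes the ratio is roughly 1.3, so the two sources are nearly on an equal footing. With 100 genes it is roughly 40, and with 1000 genes it exceeds \(10^{15}\), which swamps any information in the attribution probabilities.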

Can we avoid it?

Instead of per-source estimates of recombination and mutation probabilities, make them common across all sources.
This then stabilises the genotype probability distributions to be the same order of magnitude across all sources.
We thus get sensible attribution with increasing gene counts.

Stable attribution as gene count increases

Summary

As we increase the number of genes, the Dirichlet model loses all ability to differentiate between sources.
The Island model still works as it should, but we need to estimate shared mutation and recombination probabilities.
We get a stable attribution as the number of genes increases.
But how should we choose genes?

Acknowledgements

Asymmetric Island model

D. Wilson (2009)

Assume that genotypes arise from two or more homogeneous mixing populations where we have

Mutation, where novel alleles are produced.
Recombination, where the allele at a given locus has been observed before, but not in this allelic profile (i.e. the alleles come from at least two different genotypes).
Migration between sources of genotypes and alleles.

Mutation

ST	aspA	glnA	gltA	glyA	pgm	tkt	uncA
474	2	4	1	2	2	1	5
?	2	4	1	2	29	1	5

We have a novel allele at the pgm locus.
We assume this genotype has arisen through mutation.

Recombination

ST	aspA	glnA	gltA	glyA	pgm	tkt	uncA
474	2	4	1	2	2	1	5
?	2	4	1	2	1	1	5
45	4	7	10	4	1	7	1
3718	2	4	1	4	1	1	5

We have seen this pgm allele before, but haven’t seen this genotype.
We assume it arose through recombination, with the pgm allele coming from either 45 or 3718.

Migration

ST	aspA	glnA	gltA	glyA	pgm	tkt	uncA
474	2	4	1	2	2	1	5
?	2	4	1	2	2	1	5

This is just 474. We’ve seen it before, but possibly not on this source.
We assume it arose through migration.
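The three examples can be reproduced with a simplified classification rule, sketched below. This is a deliberate simplification for illustration: it ignores which source each profile was seen on and is not the probabilistic treatment used in the Island model. The function name `classify` and the `KNOWN` dictionary are hypothetical.

```python
# Allelic profiles of the known sequence types from the examples above,
# one allele id per MLST locus in table order.
KNOWN = {
    474:  (2, 4, 1, 2, 2, 1, 5),
    45:   (4, 7, 10, 4, 1, 7, 1),
    3718: (2, 4, 1, 4, 1, 1, 5),
}

def classify(profile, known=KNOWN):
    """Label a profile as migration, mutation, or recombination (simplified rule)."""
    if profile in known.values():
        return "migration"  # genotype already seen somewhere
    for locus, allele in enumerate(profile):
        seen = {p[locus] for p in known.values()}
        if allele not in seen:
            return "mutation"  # at least one allele never seen at this locus
    return "recombination"  # every allele seen before, but not in this combination

print(classify((2, 4, 1, 2, 29, 1, 5)))
print(classify((2, 4, 1, 2, 1, 1, 5)))
print(classify((2, 4, 1, 2, 2, 1, 5)))
```

In the model itself these are not hard labels: each locus contributes a mutation, recombination, or match term to the likelihood for every candidate sequence.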