10 December 2019
\[ P(\mathsf{Human~cases}) = \prod_h P(\mathsf{st}_h) \]
\[ P(\mathsf{st}_h) = \sum_j P(\mathsf{st}_h~\mathsf{from~source}_j) P(\mathsf{source}_j) \]
\[ P(\mathsf{st}_h) = \sum_j \underbrace{P(\mathsf{st}_h~\mathsf{from~source}_j)}_\text{genomic model} P(\mathsf{source}_j) \]
\[ P(\mathsf{st}) = \sum_j \underbrace{P(\mathsf{st}_h~\mathsf{from~source}_j)}_\text{genomic model} \underbrace{P(\mathsf{source}_j)}_\text{attribution to source} \]
Assume that genotypes arise from two or more homogeneous mixing populations where we have
Mutation, where novel alleles are produced.
Recombination, where the allele at a given locus has been observed before, but not in this allelic profile (i.e. the alleles come from at least two different genotypes).
Migration between sources of genotypes and alleles.
We model \(P(\mathsf{st}_h~\mathsf{from~source}_j)\) via:
\[ P(\mathsf{st}_h \mid j,X) = \sum_{c\in X} \frac{M_{S_cj}}{N_{S_c}} \prod_{l=1}^7 \left\{\begin{array}{ll} \mu_j & \text{if $\mathsf{st}_h^{l}$ is novel,}\\ (1-\mu_j)R_j\sum_{k=1}^J M_{kj}f^l_{\mathsf{st}_h^lk} & \text{if $\mathsf{st}_h^{l}\neq c^l$}\\ (1-\mu_j)\left[1 - R_j(1 - \sum_{k=1}^J M_{kj}f^l_{\mathsf{st}_h^lk})\right] & \text{if $\mathsf{st}_h^{l}=c^l$} \end{array} \right. \]
Use the source isolates to estimate the unknown parameters:
using a leave one out pseudo-likelihood MCMC algorithm.
Once we have these estimated, we can use all source sequences as candidates and estimate \(P(\mathsf{st}_h~\mathsf{from~source}_j)\) for the observed human sequences (even those unobserved on the sources).
Before we collect data, assume each type is equally likely for each source.
Then add the observed counts.
And convert to proportions.
Get uncertainty directly.
The prior and data model are:
\[ \begin{aligned} \mathbf{\pi}_j &\sim \mathsf{Dirichlet}(\mathbf{\alpha}_j)\\ \mathbf{X}_{j} &\sim \mathsf{Multinomial}(n_j, \mathbf{\pi}_j) \end{aligned} \]
so that the posterior is \[ \mathbf{\pi}_{j} \sim \mathsf{Dirichlet}(\mathbf{X}_j + \mathbf{\alpha}_j) \]
where \(\pi_j\) is the genotype distribution, \(\mathbf{X}_j\) are the counts, and \(\mathbf{\alpha}_j\) is the prior for source \(j\).
\[ P(\mathsf{st}_h \mid \underbrace{\mathbf{x}_h}_\text{covariates}) = \sum_j \underbrace{P(\mathsf{st}_h~\mathsf{from~source}_j)}_\text{genomic model} \underbrace{P(\mathsf{source}_j \mid \mathbf{x}_h)}_\text{attribution with covariates} \]
Within each source \(j\) we have a linear model on the logit scale for attribution: \[ \begin{aligned} \eta_{hj} &= \alpha_j + \beta_{j1} x_{1h} + \cdots + \beta_{jp} x_{ph}\\ P(\mathsf{source}_j \mid \mathbf{x}_h) &= \frac{\exp(\eta_{hj})}{\sum_j \exp(\eta_{hj})} \end{aligned} \]
The covariates can then be anything we like for each human case, and there’s a separate attribution probability for each unique covariate pattern in the data.
We’ll present four different covariate models:
The Island model assigns genotypes not observed on sources where the Dirichlet model has no information. But few cases are of this type.
The simple Dirichlet model seems to do just as well as the Island model.
Urban cases tend to be more associated with poultry, and rural cases with ruminants, with a linear relationship between attribution and rurality being adequate.
Kids under 5 in rural areas are more likely to have ruminant associated campylobacteriosis compared to over 5’s.
Strong evidence for seasonality of attribution, with a higher proportion of poultry related cases in summer and strong ruminant association in rural areas in winter and spring.
With higher genomic precision, the counts used for the Dirichlet model will eventually drop to 1 of each as each isolate is likely unique.
Worse than that, the human isolates will differ from all source isolates, so the source counts for them will be 0, so we lose all ability to discriminate.
In theory the Island model should still work.
But it… doesn’t :(
The island likelihood for each type is complicated, but if we have \(g\) genes is approximately \[ p(\mathsf{st} \mid m) \approx m^d (1-m)^{g-d} \] where \(m\) is a combination of mutation and recombination probabilities, and \(d\) is the average number of loci that differ between isolate pairs.
This is maximised when \(m = d/g\) so that \[ p(\mathsf{st} \mid m) \approx f(m)^g \] where \[ f(m) = m^m (1-m)^{1-m} \]
Suppose we have two sources with \(m_1=0.02\) and \(m_2=0.03\).
The island model just ends up assigning everything to the lower diversity source.
Instead of per-source estimates of recombination and mutation probabilities, make them common across all sources.
This then stabilises the genotype probability distributions to be the same order of magnitude across all sources.
We thus get sensible attribution with increasing gene counts
As we increase the number of genes, the Dirichlet model loses all ability to differentiate between sources.
The Island model still works as it should, but we need to estimate shared mutation and recombination probabilities.
We get a stable attribution as the number of genes increases.
But how should we choose genes?
Twitter: @jmarshallnz
Github: jmarshallnz
Assume that genotypes arise from two or more homogeneous mixing populations where we have
Mutation, where novel alleles are produced.
Recombination, where the allele at a given locus has been observed before, but not in this allelic profile (i.e. the alleles come from at least two different genotypes).
Migration between sources of genotypes and alleles.
ST | aspA | glnA | gltA | glyA | pgm | txt | uncA |
---|---|---|---|---|---|---|---|
474 | 2 | 4 | 1 | 2 | 2 | 1 | 5 |
? | 2 | 4 | 1 | 2 | 29 | 1 | 5 |
We have a novel allele at the pgm locus.
We assume this genotype has arisen through mutation.
ST | aspA | glnA | gltA | glyA | pgm | txt | uncA |
---|---|---|---|---|---|---|---|
474 | 2 | 4 | 1 | 2 | 2 | 1 | 5 |
? | 2 | 4 | 1 | 2 | 1 | 1 | 5 |
45 | 4 | 7 | 10 | 4 | 1 | 7 | 1 |
3718 | 2 | 4 | 1 | 4 | 1 | 1 | 5 |
We have seen this pgm allele before, but haven’t seen this genotype.
We assume it arose through recombination, either from 45 or 3718.
ST | aspA | glnA | gltA | glyA | pgm | txt | uncA |
---|---|---|---|---|---|---|---|
474 | 2 | 4 | 1 | 2 | 2 | 1 | 5 |
? | 2 | 4 | 1 | 2 | 2 | 1 | 5 |
This is just 474. We’ve seen it before, but possibly not on this source.
We assume it arose through migration.