Statistical modeling of repertoire overlap in entire sampling spaces


We analyze the distribution of T-cell clonotypes in a compartment such as blood on the basis of samples. In particular, we study how the distribution of clonotype frequencies differs between samples. We treat this as a sampling problem and formulate it as a generalization of the classical statistical problem of comparing samples from an urn. Because the sample size is small compared to the number of distinct clonotypes in the entire sampling space, classical methodology that works directly with the clonotype frequencies in the samples is not suitable. We address this challenge by representing other properties of the sample instead. This re-representation allows sampling models to be fitted and tested easily under natural model conditions. Although we focus here on the application to clonotypes, the new methodology generalizes seamlessly to other applications.
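
To illustrate why small samples from a very large sampling space carry little direct frequency information, the following minimal sketch draws two samples from a hypothetical urn. The urn size, the sample size, the uniform clonotype distribution, and the helper names `draw_sample` and `shared_clonotypes` are all assumptions made for illustration; real repertoires are far from uniform, and this is not the model proposed here.

```python
import random
from collections import Counter


def draw_sample(n_clonotypes, sample_size, rng):
    """Draw cells uniformly with replacement; return counts per clonotype label.

    The uniform urn is an illustrative assumption, not a repertoire model.
    """
    return Counter(rng.randrange(n_clonotypes) for _ in range(sample_size))


def shared_clonotypes(sample_a, sample_b):
    """Number of clonotype labels observed in both samples."""
    return len(sample_a.keys() & sample_b.keys())


rng = random.Random(0)       # fixed seed so the sketch is reproducible
n_clonotypes = 1_000_000     # hypothetical size of the sampling space
sample_size = 1_000          # hypothetical number of cells per sample

sample_a = draw_sample(n_clonotypes, sample_size, rng)
sample_b = draw_sample(n_clonotypes, sample_size, rng)

# Clonotypes seen exactly once in sample A, and clonotypes seen in both samples.
singletons_a = sum(1 for count in sample_a.values() if count == 1)
overlap = shared_clonotypes(sample_a, sample_b)

print(f"singletons in sample A: {singletons_a} of {sample_size} cells")
print(f"clonotypes shared by A and B: {overlap}")
```

Under these assumptions almost every observed clonotype is a singleton and the two samples share very few labels, so per-clonotype frequencies estimated from a single sample are close to uninformative, which is the situation described above.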