ISSN: 2378-315X BBIJ
Biometrics & Biostatistics International Journal
Research Article
Volume 5 Issue 3  2017
A New Statistical Method for Analyzing Elispot Data
JungYi Lin^{1} and Shuguang Huang^{2}*
^{1}Department of Biostatistics, University of Pittsburgh, USA
^{2}Chief Scientific Officer, Stat4ward LLC, USA
Received: January 25, 2017  Published: February 17, 2017
*Corresponding author:
Shuguang Huang, Chief Scientific Officer, Stat4ward LLC, 711 Parkview Dr, Gibsonia PA 15044, USA, Tel: +1 724 602 6644; Email:
Citation:
Lin JY, Huang S (2017) A New Statistical Method for Analyzing Elispot Data. Biom Biostat Int J 5(3): 00131.
DOI:
10.15406/bbij.2017.05.00131
Abstract
The enzyme-linked immunospot assay (ELISpot) is one of the most frequently and widely used methods to detect multiple secretory products from a single cell. There are two main approaches to defining a positivity criterion for quantitative ELISpot assay data: ad hoc (empirical) rules and hypothesis test methods such as the t-test and DFR. Hypothesis tests are indeed important for detecting whether the test samples are significantly different from the controls. However, the consistency of the results is also important for clinical tests. This research proposes a new method (called the LOD method) focusing on the reproducibility of the results. Simulation and real data analysis showed that the t-test and DFR(eq) have higher test power, but the LOD method has the highest level of reproducibility.
Keywords: ELISpot; Hypothesis test; Statistics; Positivity
Introduction
The enzyme-linked immunospot assay (ELISpot), which was first described in 1983, is one of the most frequently and widely used functional assays for single-cell analysis [1-5]. The principles of ELISpot are based on the ELISA technique, with the difference that in ELISpot, analytes secreted by cells plated into 96-well plates are immediately captured by an antibody used to coat the membrane bottom of such plates. After the removal of cells, the captured analyte, either a cytokine, chemokine, immunoglobulin (Ig) or other secreted molecule, is made visible by a biotin-avidin-enzyme-substrate development cascade, which results in colored spots (the color depends on the enzyme and substrate chosen for the assay) on a whitish membrane. The colored spots are usually enumerated using an automated ELISpot reader system. Each spot represents one cell that secreted the analyte of interest, and it reflects the integration of the amount of analyte secreted during the assay duration, as well as its secretion kinetics.
Thus, the ELISpot assay allows the visualization and enumeration of multiple secretory products from single cells. One common application is the count of CD4+ and CD8+ T cells that secrete cytokines in response to an antigenic stimulus at the single-cell level. Multiple peptide pools are typically used for stimulation. The raw form of the data arising from the ELISpot assay is the number of spot-forming cells (SFCs) for each experimental and control (negative and positive) well, each of which is often performed in replicates (it is common practice to run 6 replicates for the negative controls and 3 replicates for the test samples). A test sample is subsequently categorized as positive or negative depending on whether or not the number of SFCs in the experimental wells is significantly greater than in the control wells. The word ‘significantly’ is loaded with a variety of implications; it is the goal of this research to delineate and distinguish its statistical, biological, and analytical aspects.
There are two main approaches to defining a positivity criterion for quantitative ELISpot assay data: ad hoc (empirical) rules and hypothesis tests. The empirical approach uses a threshold and/or a fold difference between the experimental and negative control wells that is determined based on empirical reasons, which is meant to represent “biological significance” according to the researcher’s opinion. For example, [6] declares a positive response for antigens where the average number of SFCs in the peptide pool wells is greater than or equal to 11 SFCs/well and is also at least four times the average of the negative control wells (where each well contains 200,000 cells).
A statistical hypothesis test uses the nonzero difference (in the statistical sense) between the experimental and negative control wells as the criterion for determining positivity. This approach is designed to control the false positive rate and has the ability to detect weak but truly nonzero responses (power). Surveying the literature, the statistical methods considered for ELISpot assays include the t-test, the Wilcoxon rank-sum test, a score test based on binomial distributions, the Severini test that accounts for overdispersion, and bootstrap methods that have gained some popularity in recent years. This research primarily considers the t-test and the distribution-free resampling method (DFR), which is a permutation-based method [7].
The DFR method, as its name indicates a distribution-free method, was proposed to address concerns with the t-test, which requires certain assumptions about the data (e.g. normality, variance homogeneity). The DFR method employs a permutation test whereby only the independent replicates of a single peptide pool and the negative controls from an individual are permuted. The test statistic employed is simply the difference in the sample means of the peptide pool and negative control replicates.
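The permutation scheme just described can be sketched in plain Python. This is a bare-bones illustration of a permutation test with a mean-difference statistic, not the published DFR implementation (which adds further refinements); the function name and the data are invented for illustration:

```python
import itertools

def perm_test_pvalue(tests, controls):
    """One-sided permutation p-value for mean(tests) - mean(controls)."""
    pooled = list(tests) + list(controls)
    n = len(tests)
    observed = sum(tests) / n - sum(controls) / len(controls)
    hits = total = 0
    # Relabel the replicates in every possible way and recompute the statistic.
    for combo in itertools.combinations(range(len(pooled)), n):
        t = [pooled[i] for i in combo]
        c = [pooled[i] for i in range(len(pooled)) if i not in combo]
        stat = sum(t) / len(t) - sum(c) / len(c)
        hits += stat >= observed - 1e-12
        total += 1
    return hits / total

# 3 test wells vs 6 control wells -> C(9,3) = 84 relabelings; here the observed
# labeling is the most extreme, so the p-value is the smallest attainable, 1/84.
p = perm_test_pvalue([30, 32, 31], [10, 11, 9, 10, 12, 11])
```

With 3 + 6 replicates the full enumeration is only 84 relabelings, so an exact permutation distribution is cheap to compute.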
Scaling the difference by the variance is not applied in the DFR method, since it is deemed “undesirable as it gives large test statistics when the variance is small, even when the difference is small also. Such cases should not provide evidence to support a positive call. Also, the variance estimates can be unreliable with a small number of replicates. For these reasons, the test statistic does not involve the variance”. However, a variance filter for extreme outliers is imposed at the lab for quality control. Specifically, an assay is deemed an outlier if
$\frac{Variance}{Median+1}>1$
where ‘Variance’ is the intra-replicate variation, calculated as the sample variance of the replicates; the number one is added to the median to avoid division by a zero median. That is, if the ratio of the variance to the median+1 exceeds 100% for a peptide pool, that assay is rerun. For a Poisson distribution, the ratio of variance to mean (called the dispersion index) follows a ${\chi}^{2}$ distribution with n-1 degrees of freedom (where n is the number of replicates).
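The variance filter above takes only a few lines; the helper below is a sketch (the function name and example counts are invented, not part of the original protocol):

```python
import statistics

def fails_variance_filter(replicates):
    """Flag an assay for rerun if Variance/(Median + 1) > 1, i.e. exceeds 100%."""
    var = statistics.variance(replicates)  # sample variance of the replicates
    med = statistics.median(replicates)    # +1 below avoids division by a zero median
    return var / (med + 1) > 1
```

For example, a tight triplicate such as [10, 11, 9] passes the filter, while a widely dispersed one such as [2, 30, 4] is flagged.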
Both empirical methods and hypothesis test methods have pros and cons. The empirical method is easy to implement, and it incorporates the researcher’s subjective “requirement” (subjectivity may not be a bad thing). The ad hoc approach is not uncommon; for example, an FDA guideline suggests that “LLoQ needs to be at least 4 times bigger than LoB”, where the number 4 is based on empirical reasons. The main drawback of the ad hoc method is that it doesn’t control false positive or false negative rates. Hypothesis test methods, on the other hand, are purely data-driven and objective (objectivity can also be both good and bad). They may detect a statistically significant difference that carries no sufficient biological meaning. Therefore, some ‘hybrid’ methods are used. For instance, DFR (2x) tests whether the mean of the test group is more than two times the mean of the control group; here 2x is clearly an empirical decision. A brief summary of the methods is provided in Table 1 below.
| Method | Pros | Cons |
|---|---|---|
| Empirical rules | Intuitive; ‘significance’ is not driven solely by statistical testing, but rather by biological expectations/opinions | Doesn’t control false positive/negative rates; doesn’t consider variation of the means. Can be adapted to establish CI using Feller’s theorem. |
| Hypothesis test: t-test | Easy to implement | Assumes normality and, quite often, variance homogeneity. Sample size plays a very important role in the resulting p-value |
| Hypothesis test: DFR (eq) | Permutation-based method, ‘distribution-free’ | Sample size requirement; uses an ‘empirical’ rule on variance as a QC criterion (<100%), no division by SD. Doesn’t consider the consistency of the ‘positive’ call; a pure hypothesis test on the mean difference |
| Hypothesis test: DFR (2x) | Similar to the empirical approach; requires a 2-fold difference | Where does the ‘2’ come from? A number of convenience. Clearly a hybrid of a statistical test and an empirical rule. |
Table 1: Summary of the pros and cons for the current methods.
Empirical or statistical, parametric or non-parametric, the goal is to control the false positive and false negative error rates. The fundamental question is then what counts as an ‘error’. Under an empirical rule, a false positive error is committed when a test is claimed to be positive although the difference in mean values is less than 2-fold; for a hypothesis test method, an ‘error’ is determined purely by whether the null hypothesis ${\mu}^{c}={\mu}^{t}$ is true, regardless of how small the difference might be. Hence, a type I error is committed if a sample is called positive but in truth ${\mu}^{c}={\mu}^{t}$, whereas a type II error is committed if the sample is not called positive when in truth ${\mu}^{c}<{\mu}^{t}$.
Why 2-fold, 100% CV, or 2x? There are no theoretical bases for these criteria; they are numbers chosen mostly for convenience. Instead, any number we choose should help us achieve a certain goal – false positive control, false negative control, QC failure rate, clinical and economic impacts, etc. These numbers should be chosen based on the properties of the data distribution, which depend on the data generation mechanism. Therefore, it is critical to understand the underlying mechanism of data generation: such a mechanism determines the model for us, rather than us determining the model for the data.
In this research, the t-test is considered because it is still commonly applied due to its ease of use. The empirical method is not considered, due to its obvious lack of control over false positive and false negative errors. The focus of this paper is to compare the DFR method with the proposed LOD method, which emphasizes the reproducibility of the decision call rather than statistical significance.
Material and Methods
Statistical Description of ELISpot Data
Suppose that we have n replicate wells for cells stimulated with a particular peptide pool and m replicate wells containing cells serving as a negative control, all at the same concentration (often 200,000 cells/well). Let
${Y}_{1}^{C},\dots ,{Y}_{m}^{C}$
be the number of SFCs per well for the m control wells, and
${Y}_{1}^{T},\dots ,{Y}_{n}^{T}$
be the number of SFCs per well for the n test wells treated with the peptide pool. Let
${\overline{Y}}_{T}=\frac{1}{n}{\displaystyle \sum {}_{i}{Y}_{i}^{T}}$
be the perwell average for a particular peptide pool and let
${\overline{Y}}_{C}=\frac{1}{m}{\displaystyle \sum {}_{i}{Y}_{i}^{C}}$
be the corresponding average for the control wells. Assume the number of SFCs per control well follows a distribution with a true (but unknown) mean SFC count of
${\mu}_{C}$
, denoted by
$P\left({\mu}_{C}\right);$
assume the number of SFCs per test well follows a distribution with a true (but unknown) mean SFC count of
${\mu}_{T}$
, denoted by
$F\left({\mu}_{T}\right)$
. The typical goal is to evaluate if
${\mu}_{T}={\mu}_{C}$
based on the observed data. Note that the two distributions,
$F\left({\mu}_{T}\right)$
and
$P\left({\mu}_{C}\right),$
are of the same family (e.g. both are Poisson) but with different parameters (e.g. mean and/or variance can be different).
Since the assay result for each well is a count of ‘events’, Poisson and binomial distributions are typically used to characterize this variable. While both measure the number of certain random events (or "successes") within a certain frame, the binomial distribution describes the process as “given a (fixed) number N of "attempts" (cells), each of which has the same probability of success p”; the Poisson distribution, on the other hand, describes the process as “with the same probability of success p, the number of successes observed from a random number of attempts”. In theory, given a binomial distribution with probability of success p, if p is a small value such that when
$N\to \infty $
, Np → λ, this distribution approaches a Poisson distribution with parameter λ. In practice, whether or not N is considered fixed makes very little difference.
Another way to see the connection between the two distributions is as follows. Consider the general case of enumerating a total of N events, of which K meet a certain criterion (positives). The proportion of positives, p=K/N (0 ≤ p ≤ 1), is the estimate of the probability of the particular event being observed. For a binomial distribution, the variance (Var) of the random variable K is estimated as:
$Var\left(k\right)=Np\left(1-p\right)$
When the event is rare, p is very small, so 1-p is very close to 1. The variance is then approximately
$Var\left(k\right)=Np=N\times \frac{K}{N}=K$
The variance equals the mean, which is the case for a Poisson variable. In other words, when p is small, the Poisson distribution is an approximation of the binomial distribution, and thus both distributions are reasonable for ELISpot assays.
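The approximation is easy to check numerically. The sketch below compares the two pmfs for a rare event; the particular N, p and k values are chosen arbitrarily for illustration:

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with rate lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

# A rare event: p = 0.001 over N = 10,000 cells, so lambda = Np = 10.
n, p = 10_000, 0.001
lam = n * p
for k in (5, 10, 15):
    print(k, binom_pmf(k, n, p), poisson_pmf(k, lam))  # the two columns agree closely
```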
Let N denote the total number of collected target cells in the sample pool from which replicates are drawn, and let K0 denote the true (but unknown) number of positive cells among the N cells. In other words, the proportion (‘event rate’) is λ=K0/N. When replicates (working samples)
${X}_{1},\text{\hspace{0.17em}}\dots \text{\hspace{0.17em}},{X}_{r}$
are drawn from this sample pool, the number of positive cells in replicate i, which contains a total of n target cells, is highly unlikely to be exactly equal to
$n\text{\hspace{0.17em}}\times \text{\hspace{0.17em}}{K}_{0}/N$
(the theoretically expected number). Due to random chance, some replicates will have more positive cells than the theoretical expectation, some will have less. This variation can be modeled as
${X}_{i}\text{\hspace{0.17em}}=\text{\hspace{0.17em}}\mu \text{\hspace{0.17em}}+\text{\hspace{0.17em}}{\delta}_{i}$
Where
$\mu \text{\hspace{0.17em}}=\text{\hspace{0.17em}}n\text{\hspace{0.17em}}\times \text{\hspace{0.17em}}{K}_{0}/N$
is the expected number of positive cells, and
${\delta}_{i}$
is the deviation of replicate
${X}_{i}$
from the theoretical expectation. As discussed earlier,
${\delta}_{i}$
can generally be assumed to follow a binomial or a Poisson distribution. This variation can be termed ‘sampling’ variability.
For each working sample, due to the assay variability, the readout of the assay is also highly unlikely to be exactly the same as the truth
${X}_{i}$
. The observed number of positives can be modeled as
${Y}_{ij}={X}_{i}+{\epsilon}_{ij}=\mu +{\delta}_{i}+{\epsilon}_{ij}$
Where
${\epsilon}_{ij}$
represents the assay noise, which can depend upon factors such as operators, reagents, plate effects, instrument variability, computerized signal processing and data analysis, and so on. This variation can be termed ‘assay’ or ‘analytical’ variability.
In a typical ELISpot experiment, the data is composed of the control group and the test group. The data of the control group can be modeled as
${Y}_{ij}^{C}={\mu}^{C}+{\delta}_{ij}^{C}+{\epsilon}_{ij}^{C}$
And the data of the test group can be modeled as
${Y}_{ij}^{T}={\mu}^{T}+{\delta}_{ij}^{T}+{\epsilon}_{ij}^{T}$
The goal of the analysis is to assess if
${\mu}^{C}$
and
${\mu}^{T}$
are significantly different.
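To make the decomposition concrete, the sketch below generates synthetic well counts with both sampling variability (Poisson) and assay noise; taking the noise as Gaussian, and the function names and noise level, are our own illustrative assumptions, not part of the model above:

```python
import math
import random

def sample_poisson(lam, rng):
    """Poisson sampler (Knuth's method; adequate for the modest rates used here)."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k - 1

def simulate_well(mu, assay_sd, rng):
    """Y = mu + delta (sampling variability) + eps (assay noise), as a nonnegative count."""
    x = sample_poisson(mu, rng)       # true positives in this replicate
    y = x + rng.gauss(0.0, assay_sd)  # add analytical noise
    return max(0, round(y))

rng = random.Random(0)
wells = [simulate_well(20, 3.0, rng) for _ in range(5000)]
mean = sum(wells) / len(wells)  # sits near mu = 20; the spread exceeds pure Poisson
```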
LOB-LOD-LOQ paradigm
A statistical test is indeed important in evaluating whether the test samples are significantly different from the controls. In clinical assays, it is also (maybe more) important that the results are consistent. For example, if a sample is determined to be HIV-positive, we’d hope this result is reproducible so that we have high confidence in the decision. That is, the decision that a sample is HIV-positive needs to be consistent (e.g. 95% concordant) if the assay is repeated multiple times. The chart below describes the fundamental difference between the three approaches.
$Positivity\ Criteria=\left\{\begin{array}{l}ad\ hoc:\quad \frac{{\mu}^{T}}{{\mu}^{C}}\ge 2\ and\ {\mu}^{T}\ge 11\\ Hypothesis\ test:\quad \frac{{\overline{Y}}^{T}-{\overline{Y}}^{C}}{{S}_{p}/\sqrt{n}}\ge {C}_{0}\\ LOD\ method:\quad {\overline{Y}}^{T}\ge LOD\left(distribution\ overlap\le p\%\right)\end{array}\right.$
The test result based on the LOD method is not impacted by the sample size n; it depends only on how much the two distributions overlap. Conversely, hypothesis test methods may miss a big difference due to a small sample size, or pick up a small difference due to a large sample size.
The desire to control the amount of distribution overlap (instead of just the p-value of a hypothesis test) fits into the framework of LOB-LOD-LOQ. The concepts of LOB, LOD, and LOQ are illustrated in the figure below. Simply put, LOB is the limit of ‘blanks’, the upper 95th percentile (typically) of the controls. LOD is the number of events such that, if the sample is repeatedly measured while incorporating different variation factors, its 5th percentile (typically) equals the LOB. LOQ is the quantification limit that incorporates the desired assay goals (typically assay bias and precision); for example, LOQ is the number of events such that the CV of the repeated measurements is no greater than 20%. For more details on the definition and calculation of LOB, LOD, and LOQ, please read CLSI guideline EP17 [8].
Figure 1: Schematic illustration of the Limits of detection.
Given that there is only a very limited number of replicates for each condition, the best estimates of
$m{\mu}^{C}$
and
$n{\mu}^{T}$
are simply the corresponding observed values. With these estimates, the LOD method requires that the overlap of the two distributions be no more than 5%.
Analysis procedure
When two Poisson variables X and Y are independent, with event rates
${\mu}_{1}$
and
${\mu}_{2}$
respectively, the difference of the two variables Z=YX follows a Skellam distribution with probability mass function (discrete variable) given by
$Prob\left(Y-X=k\right)={e}^{-\left({\mu}_{1}+{\mu}_{2}\right)}{\left(\frac{{\mu}_{2}}{{\mu}_{1}}\right)}^{\frac{k}{2}}{I}_{k}\left(2\sqrt{{\mu}_{1}{\mu}_{2}}\right)$
Where
${I}_{k}\left(Z\right)$
is the modified Bessel function of the first kind,
${\mu}_{1}$
and
${\mu}_{2}$
are the event rates for the two Poisson distributions [9].
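The probability mass function above can be evaluated directly; the sketch below implements it with a series expansion of the modified Bessel function (a from-scratch illustration, with the rates 3 and 5 chosen arbitrarily):

```python
import math

def bessel_i(k, x, terms=40):
    """Modified Bessel function of the first kind, I_k(x), for integer k (series expansion)."""
    k = abs(k)  # I_{-k}(x) = I_k(x) for integer k
    term = (x / 2) ** k / math.factorial(k)
    total = term
    for m in range(1, terms):
        term *= (x / 2) ** 2 / (m * (m + k))
        total += term
    return total

def skellam_pmf(k, mu1, mu2):
    """Prob(Y - X = k) for independent X ~ Poisson(mu1) and Y ~ Poisson(mu2)."""
    return (math.exp(-(mu1 + mu2))
            * (mu2 / mu1) ** (k / 2)
            * bessel_i(k, 2 * math.sqrt(mu1 * mu2)))

# Sanity checks: probabilities sum to ~1 and the mean equals mu2 - mu1.
probs = {k: skellam_pmf(k, 3.0, 5.0) for k in range(-30, 40)}
total = sum(probs.values())                   # ~1.0
mean = sum(k * p for k, p in probs.items())   # ~5.0 - 3.0 = 2.0
```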
For the n replicates of the test samples,
${Y}_{i}^{T}\left(i=1,\text{\hspace{0.17em}}\dots \text{\hspace{0.17em}},n\right),$
each of which follows a Poisson distribution with mean
${\mu}^{T}$
, the sum
${S}^{T}={\displaystyle {\sum}_{i=1}^{n}{Y}_{i}^{T}}$
follows a Poisson distribution with mean and variance equal to
$n{\mu}^{T}$
. Similarly, for the m replicates of the controls
${Y}_{i}^{C}\left(i=1,\dots ,m\right),$
the sum
${S}^{C}={\displaystyle {\sum}_{i=1}^{m}{Y}_{i}^{C}}$
follows a Poisson distribution with mean and variance equal to
$m{\mu}^{C}$
. The difference
${S}^{T}-{S}^{C}$
follows a Skellam distribution with parameter
$\left(n{\mu}^{T},m{\mu}^{C}\right)$
; its mean is
$n{\mu}^{T}-m{\mu}^{C}$
and variance is
$n{\mu}^{T}+\text{\hspace{0.17em}}m{\mu}^{C}$
. The maximum likelihood estimate (MLE) for
${\mu}^{T}$
is
${\overline{Y}}^{T}=\frac{1}{n}{\displaystyle {\sum}_{i=1}^{n}{Y}_{i}^{T}}$
; similarly, the MLE for
${\mu}^{C}$
is
${\overline{Y}}^{C}=\frac{1}{m}{\displaystyle {\sum}_{i=1}^{m}{Y}_{i}^{C}}$
.
The 95th percentile of the Skellam distribution under the NULL (the test sample has the same positivity rate) is
$LOB={z}_{\alpha}\sqrt{m{\mu}^{C}+n{\mu}^{C}}$
, where ${z}_{\alpha}$ is the critical value that corresponds to the 95th percentile of the distribution; here we choose
${z}_{\alpha}=\text{\hspace{0.17em}}1.645$
, a value that is coincidentally the same as the 95th-percentile critical value of a standard normal distribution (see Supplementary Materials for the analysis supporting this decision). Note that the LOB increases with the event counts.
LOD is an event rate X such that 95% of the distribution of the difference (a Skellam variable with mean $X-m{\mu}^{C}$ and variance $X+m{\mu}^{C}$) is above the LOB. Putting this into a mathematical equation:
$LOB+{z}_{\beta}\sqrt{X+m{\mu}^{C}}=X-m{\mu}^{C}$
Here $\beta$ is the type II error rate. Setting
${z}_{\beta}={z}_{\alpha}$
and solving the equation for X, we get
$X=\frac{2a\sqrt{m{\mu}^{C}+n{\mu}^{C}}+1}{2{a}^{2}}+\frac{\sqrt{{\left(2a\sqrt{m{\mu}^{C}+n{\mu}^{C}}+1\right)}^{2}-4{a}^{2}m{\mu}^{C}}}{2{a}^{2}}$
Where
$a=\frac{1}{{z}_{\alpha}}$
.
The figure below gives a schematic view of the Skellam distribution. In this example, the mean of the Control is
$m{\mu}^{C}=5$
, the mean of the test sample ranges from
$n{\mu}^{T}=\text{\hspace{0.17em}}5,\text{\hspace{0.17em}}\dots \text{\hspace{0.17em}},\text{\hspace{0.17em}}20$
. When
$n{\mu}^{T}=\text{\hspace{0.17em}}5$
, the Test and the Control have no difference; the difference follows a Skellam distribution with mean 0 and variance
$n{\mu}^{T}+m{\mu}^{C}=\text{\hspace{0.17em}}10$
(the black solid line), so the LOB is simply the upper 95th percentile of this distribution. When
$n{\mu}^{T}=\text{\hspace{0.17em}}19$
, the Skellam distribution has 95% of its mass above the LOB; thus the LOD is
$n{\mu}^{T}=\text{\hspace{0.17em}}19$
.
Figure 2: Illustration of the Identification of LOD.
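The worked example above ($m{\mu}^{C}=5$ giving LOD = 19) can be reproduced with a small numerical search, using the same normal approximation and z = 1.645 as in the text; this is a sketch, not the authors' code:

```python
import math

Z = 1.645  # critical value chosen in the text

def lob(m_mu_c, n_mu_c, z=Z):
    """Upper 95th percentile of the null difference (mean 0, variance m*muC + n*muC)."""
    return z * math.sqrt(m_mu_c + n_mu_c)

def lod(m_mu_c, n_mu_c, z=Z):
    """Smallest test total X whose 5th percentile, X - m*muC - z*sqrt(X + m*muC), clears the LOB."""
    b = lob(m_mu_c, n_mu_c, z)
    x = m_mu_c
    while x - m_mu_c - z * math.sqrt(x + m_mu_c) < b:
        x += 1
    return x

print(lod(5, 5))  # 19, matching the example in the text
```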
Operation Procedure
- Calculate the mean and the sum of the controls. The mean gives the estimate of
${\mu}^{C}$
; the sum gives the estimate of the mean of the Poisson distribution
$m{\mu}^{C}$
.
- Calculate $LOB={z}_{\alpha}\sqrt{n{\mu}^{C}+m{\mu}^{C}}$
, where
${z}_{\alpha}=\text{\hspace{0.17em}}1.645$
.
- LOD is the
$X=n{\mu}^{T}$
such that
$X=\frac{2a\sqrt{m{\mu}^{C}+n{\mu}^{C}}+1}{2{a}^{2}}+\frac{\sqrt{{\left(2a\sqrt{m{\mu}^{C}+n{\mu}^{C}}+1\right)}^{2}4{a}^{2}m{\mu}^{C}}}{2{a}^{2}}$
Where
$a=\frac{1}{{z}_{\alpha}}$
.
- If the observed sum of the Test samples is greater than the calculated LOD, then the Test sample is claimed to be Positive; otherwise it is Negative.
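The steps above can be collected into a single function. This is a sketch under the same normal approximation; the function name and the example well counts are invented for illustration:

```python
import math

def lod_positivity(controls, tests, z=1.645):
    """Apply the LOD operation procedure to raw well counts; returns 'Positive' or 'Negative'."""
    m, n = len(controls), len(tests)
    s_c = sum(controls)                  # estimate of m * mu_C
    mu_c = s_c / m                       # estimate of mu_C
    lob = z * math.sqrt(n * mu_c + s_c)  # LOB = z * sqrt(n*mu_C + m*mu_C)
    x = s_c                              # search numerically for the LOD
    while x - s_c - z * math.sqrt(x + s_c) < lob:
        x += 1
    return "Positive" if sum(tests) > x else "Negative"

# Hypothetical example: 6 control wells averaging 5 SFCs, 3 test wells averaging 25 SFCs.
print(lod_positivity([5, 6, 4, 5, 5, 5], [26, 24, 25]))
```

The closed-form expression for X given earlier and this step-by-step search agree up to the integer rounding of the count scale.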
Results and Discussion
Simulation
Simulations were conducted to compare the performance of the four methods. For each simulated ‘experiment’, the data consisted of 6 replicates from the Control condition and 3 replicates from the Test condition. It was assumed that the data from both conditions followed Poisson distributions. Two event-rate scenarios were considered for the Controls: ${\mu}^{C}=10$ and ${\mu}^{C}=30$. For the Test condition, the difference in event rate ranged from 0% to 100% for the case of ${\mu}^{C}=30$ (i.e. the Test event rate ranges from 30 to 60), and from 0% to 200% for the case of ${\mu}^{C}=10$ (i.e. the Test event rate ranges from 10 to 30). The cases of 0% change provide an assessment of the false positive rate; the other cases provide an assessment of the test power.
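As a sketch of how the 0%-change (false positive) arm of such a simulation can be run for the LOD method, the fragment below draws 6 control and 3 test replicates from the same Poisson rate and records how often a positive is called; the sampler, seed, and run count are our own illustrative choices:

```python
import math
import random

def sample_poisson(lam, rng):
    """Knuth's Poisson sampler (adequate for these small rates)."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k - 1

def lod_positive(controls, tests, z=1.645):
    """LOD call on raw counts (normal approximation, as in the operation procedure)."""
    s_c = sum(controls)
    mu_c = s_c / len(controls)
    lob = z * math.sqrt(len(tests) * mu_c + s_c)
    x = s_c
    while x - s_c - z * math.sqrt(x + s_c) < lob:
        x += 1
    return sum(tests) > x

rng = random.Random(42)
runs = 1000
false_pos = sum(
    lod_positive([sample_poisson(10, rng) for _ in range(6)],
                 [sample_poisson(10, rng) for _ in range(3)])
    for _ in range(runs)
)
rate = false_pos / runs  # stays far below the nominal 5% level
```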
For each case, 1,000 simulation runs were conducted. Figure 3 below shows the simulation results. The most powerful of the four methods is DFR (eq), followed by the t-test and then the LOD method. As expected, DFR (2x) shows the least power. For the case of Control = 10, the LOD method doesn’t show any power until the difference is over 50%, at which point the number of events in the test sample is 15. Since the standard deviation associated with this setting is about 5, the LOD method essentially treats any change below one standard deviation as irreproducible and thus insignificant. For the case of Control = 30, the power starts when the difference is about 30% (so the number of events is about 40).
Figure 3: The performance of the four test methods.
It is true that the t-test and DFR (eq) have higher test power than the LOD method, but notice that the LOD method has the steepest curve. This indicates that the LOD method has the highest level of reproducibility, i.e. the highest consistency of the significance calls. To see this, take Control = 10 as the example and consider the case where the difference is about 50%. Out of the 1,000 simulation runs, DFR (eq) called about 40% of the cases significant and 60% negative, so the reproducibility is 40%. With the LOD method, on the other hand, almost none were called significant. Though the truth is indeed that the Test differs from the Control, the LOD method requires a higher level of reproducibility (and thus a higher level of confidence) to declare a positive case. The steep curve of the LOD method means that the ‘gray zone’ (the uncertainty range) is narrow, which is a good characteristic of diagnostic tests and assay procedures [10].
Application to real data
Data from the three European CIMT Immunoguiding Program (CIP) proficiency panel phases were used to compare the different analysis methods [11-13]. In the referenced studies, groups of 11, 13 and 16 laboratories (phases I, II and III, respectively) quantified the number of CD8 T cells specific for two model antigens within PBMC samples that were centrally prepared and then distributed to the participating laboratories. All participants were allowed to use their preferred ELISPOT protocol. Therefore, the data sets generated in these studies can be considered representative of results generated by a wide range of different protocols commonly applied within Europe. Each participating center was asked to test in triplicate 18 preselected donors (5 in the first phase, 8 in the second phase and 5 in the third phase) with two synthetic peptides (HLA-A*0201-restricted epitopes of CMV and Influenza) as well as PBMCs in medium alone for background determination. In total, nineteen different laboratories participated in at least one of the three phases, and they reported a total of 717 triplicate experiments (including control and experimental wells).
In the CIP study, the donors were selected so that 21 donor/antigen combinations (6 in the first phase, 8 in the second phase and 7 in the third phase) were expected to demonstrate a positive response, with the remaining 15 donor/antigen combinations not expected to demonstrate a positive response. In this research, we specifically selected the cases that are positive in only one scenario (CMV or FLU, not both), so that the data from the two triplicates of the negative conditions could be pooled; for each test, the comparison is then between the triplicate of the positive condition and the 6 replicates of the negative condition. Similarly, for the negative cases, we considered only those negative in both CMV and FLU, so that we can compare either CMV or FLU against the rest, again giving n=3 versus n=6.
The table below shows the donors used in this research. Note that each Negative donor can be used to test both CMV positivity and FLU positivity, so from the analysis point of view we have 4 rather than 2 true negative cases.
| | CMV Positive | FLU Positive | Negative |
|---|---|---|---|
| Phase I | – | Donor 2 and 3 | Donor 4 |
| Phase II | Donor 8 | Donor 1, 5, 6, and 7 | Donor 3 |
| Phase III | Donor 1 | Donor 4 and 5 | – |
Table 2: The known Positive and Negative cases in CIP Study.
Figure 4 plots the variability versus the mean for the triplicate experiments. The black line represents the 45-degree line (X=Y); the red line is a spline fit. We can see that the relationship variance = mean can be entertained, and therefore a Poisson model is reasonable.
Figure 4: Variance versus Mean for CIP data.
Since different labs used different protocols, the number of targeted cells was “standardized” to 500,000 so that the total number of cells is comparable across assays. This means that if a lab used 250,000 targeted cells and found 4 events, the standardized number of events is 8. Note that this standardization is unnecessary if the percent positivity is used, but it is needed when the count of positive cells is used to quantify the response. Accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated to compare the four test methods. Table 3 is the contingency table of predicted and true values. In Table 3, b is the number of type I errors (false positive samples) and c is the number of type II errors (false negative samples). Accuracy
$\left(\frac{a+d}{a+b+c+d}\right)$
is the sum of true positives and true negatives divided by the total number of samples. Sensitivity is
$\frac{a}{a+c}$
and specificity is
$\frac{d}{b+d}$
. Sensitivity decreases if the type II error rate increases; similarly, specificity decreases if the type I error rate becomes larger. PPV
$\left(\frac{a}{a+b}\right)$
is the proportion of true positives among the samples predicted to be positive, while NPV
$\left(\frac{d}{c+d}\right)$
is the proportion of true negatives among the samples predicted to be negative.
| | Truth: Positive | Truth: Negative | Total |
|---|---|---|---|
| Test result: Positive | a | b | a+b |
| Test result: Negative | c | d | c+d |
| Total | a+c | b+d | a+b+c+d |
Table 3: Contingency table of predicted and true values.
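With the cell labels of Table 3, the five performance measures reduce to a few ratios. The helper below is a small sketch written for this summary; the example counts are made up:

```python
def performance(a, b, c, d):
    """Accuracy, sensitivity, specificity, PPV and NPV from the Table 3 cells
    (a = true positives, b = false positives, c = false negatives, d = true negatives)."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

# Made-up example: 16 true positives, 6 false negatives, no false positives.
print(performance(a=16, b=0, c=6, d=14))
```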
Table 4 below shows the performance summary statistics for each test method. Overall, DFR has the best performance, the t-test and LOD have similar performance, and the worst of the four is DFR2x. This result agrees reasonably well with the simulation study.
| | t-test | DFR | DFR2x | LOD |
|---|---|---|---|---|
| Accuracy | 0.791 | 0.821 | 0.694 | 0.786 |
| Sensitivity | 0.723 | 0.764 | 0.595 | 0.730 |
| Specificity | 1.000 | 1.000 | 1.000 | 0.958 |
| PPV | 1.000 | 1.000 | 1.000 | 0.982 |
| NPV | 0.539 | 0.578 | 0.444 | 0.535 |
Table 4: Performance of the Four Methods on CIP Data.
Discussion
The simulations and applications used the Poisson distribution simply as an example. The idea can easily be generalized to distributions of any known or unknown form. The critical factors are simply the mean and the standard deviation of the distribution, which are used to determine the LOB and LOD.
Conceptually, the LOD method is very closely related to Cohen’s D, which is defined as
$D=\frac{\overline{x}-\overline{y}}{s}$
, where
$\overline{x}-\overline{y}$
is the difference of the sample means and s is the pooled standard deviation. Notice that the typical test statistic is
$t=\frac{\overline{x}-\overline{y}}{s/\sqrt{n}}$
. The difference lies in the contribution of the sample size. A typical statistical test can achieve a small p-value by having a large enough sample size (lots of replicates), yet the results of the individual replicates can be highly irreproducible. By contrast, the LOD method controls the overlap of the two distributions, so the sample size has no impact on the decision making (though in reality a larger sample size gives a better estimate of the distribution).
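The contrast between the two statistics can be seen directly. In the equal-sample-size sketch below (toy data chosen arbitrarily), the pooled-SD t statistic is $D/\sqrt{1/n_x+1/n_y}$, so it grows as replicates accumulate while the effect size D stays essentially fixed:

```python
import math

def cohens_d(x, y):
    """Difference in sample means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    s = math.sqrt((ssx + ssy) / (nx + ny - 2))  # pooled SD
    return (mx - my) / s

def t_statistic(x, y):
    """Two-sample t with pooled SD: the same numerator, scaled by the sample sizes."""
    nx, ny = len(x), len(y)
    return cohens_d(x, y) / math.sqrt(1 / nx + 1 / ny)

x, y = [12, 15, 11, 14], [18, 21, 17, 20]
# Essentially the same effect size, but |t| more than doubles when each group
# is replicated 4-fold:
print(cohens_d(x, y), t_statistic(x, y))
print(cohens_d(x * 4, y * 4), t_statistic(x * 4, y * 4))
```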
Figure 5 below shows histograms of two pooled distributions for Cohen’s D values of 1-4. Though in truth the two distributions are different even for D=1, we can see that they are not separable. It is easy to imagine how low our confidence would be if the test determined that Test is different from Control.
Figure 5: Illustration of Cohen’s D.
Conclusion
This research proposed and investigated a statistical method for determining the positivity of the immune response using ELISpot assays. Different from the typical statistical methods that focus on the p-value of a hypothesis test, this new method focuses on the reproducibility of the decision. This is critical in the use of clinical biomarkers for diagnostic, prognostic, and predictive purposes. In these applications, the typical hypothesis test is mainly concerned with the statistical evidence based on the current data, whereas the LOD method is mainly concerned with the analytical variability and focuses on the reproducibility of future data.
Acknowledgement
We would like to thank the European CIMT Immunoguiding Program (CIP) proficiency panel for sharing the real data set.
Supplementary Materials
Decision for choosing
${z}_{\alpha}=1.645$
:
The critical value of 1.645 was determined by empirical observation; it is an approximation to the critical factor for Skellam distributions. The two plots below give two examples: in the first, the Control sample # of events = 1 and the Test sample # of events varies from 1 to 50; in the second, the Control # of events = 10 and the Test sample # of events varies from 10 to 50. The ‘z factor’ on the Y-axis is (mean − 5th percentile)/std, where ‘mean’ is the mean of each Skellam distribution, ‘5th percentile’ is its 5th percentile, and ‘std’ is its standard deviation. It can be seen that 1.645 is a good approximation to all of the z factor values; this quantity is denoted “
${z}_{\alpha}$
” from now on.
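The z factor described above can be computed directly with `scipy.stats.skellam` (a sketch; we assume SciPy is available, the function name is ours, and the Test − Control difference is taken to be Skellam-distributed, as in the text):

```python
from scipy.stats import skellam

def z_factor(mu_test, mu_control):
    """(mean - 5th percentile) / std for the distribution of the
    difference D = Test - Control, D ~ Skellam(mu_test, mu_control)."""
    mean = skellam.mean(mu_test, mu_control)     # mu_test - mu_control
    p5 = skellam.ppf(0.05, mu_test, mu_control)  # 5th percentile
    std = skellam.std(mu_test, mu_control)       # sqrt(mu_test + mu_control)
    return (mean - p5) / std

# First plot's setting: Control # of events = 1, Test # of events 1..50.
zs = [z_factor(mu2, 1) for mu2 in range(1, 51)]
```

Because the Skellam distribution is discrete, the computed z factors jitter around 1.645 rather than matching it exactly.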
Note that the 95% upper limit (UL),
${\mu}_{1}+1.645\sqrt{2{\mu}_{1}}$
, depends on the event rate of the control sample. Here we assume that the target total # of cells is 10,000, and that 100,000 is used to normalize the count of interesting events. In the following three pairs of plots, three different situations are assumed:
${\mu}_{1}=1$
(which closely mimics the IFNg+ event rate in the preclinical model),
${\mu}_{1}=2$
, and
${\mu}_{1}=5$
. Again, note that the upper 95% limit for each of these individual Poisson distributions is roughly
$UL\text{\hspace{0.17em}}=\text{\hspace{0.17em}}{\mu}_{1}+1.645\sqrt{2{\mu}_{1}}$
Therefore the 95% upper limits for the three scenarios are about 3, 5, and 9, respectively; these are the ‘maximum’ # of events likely to be observed in the controls. This information helps determine the likely (expected) event rate for the controls (
${\mu}_{1}$
), which can depend on the assay and the parameter of interest. The methodology developed here is for general purposes.
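The three quoted limits can be checked directly: the exact Poisson 95th percentiles for these rates are 3, 5, and 9, while the closed-form expression is a rough normal-style approximation to them (a sketch assuming SciPy is available):

```python
import math
from scipy.stats import poisson

rates = (1, 2, 5)
# Exact 95th percentiles of the Poisson control-count distributions:
exact = [poisson.ppf(0.95, mu1) for mu1 in rates]
# Rough closed-form upper limit UL = mu1 + 1.645 * sqrt(2 * mu1):
approx = [mu1 + 1.645 * math.sqrt(2 * mu1) for mu1 in rates]
```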
Using the first pair as an example and assuming
${\mu}_{1}=1$
, plotted are a series of ‘probability density plots’ for different
${\mu}_{2}$
values, ranging from 1 to 30. The black solid curve is the NULL distribution, where
${\mu}_{2}={\mu}_{1}=1$
. The vertical black line is the LOB = 0 + 1.645
$\sqrt{2\text{\hspace{0.17em}}\ast \text{\hspace{0.17em}}1}$
= 2.33. The red solid curve corresponds to the test sample (
${\mu}_{2}=\text{\hspace{0.17em}}8$
in this case) such that 95% of the Y−X values are above the LOB. Therefore, the LOD in terms of the difference is
${\mu}_{2}-{\mu}_{1}=\text{\hspace{0.17em}}8-1\text{\hspace{0.17em}}=\text{\hspace{0.17em}}7$
; in terms of the event rate for the test sample, it is
${\mu}_{2}=\text{\hspace{0.17em}}8$
. The right panel plots, for each case, the percentage of the curve that lies above the LOB. The blue dashed line corresponds to 95%, and the black dashed line to 5%. The red dashed line identifies the value of
${\mu}_{2}-{\mu}_{1}$
such that 95% of its curve is above the LOB.
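The walk-through above can be reproduced programmatically: fix the control rate, compute the LOB, and search for the smallest test rate whose Test − Control difference exceeds the LOB with 95% probability (a sketch assuming SciPy; the function name is ours, and because the Skellam distribution is discrete, the integer search may land one count away from the value read off the figure):

```python
import math
from scipy.stats import skellam

mu1 = 1.0
lob = 0 + 1.645 * math.sqrt(2 * mu1)  # LOB = 2.33 for mu1 = 1

def lod_rate(mu1, lob, target=0.95, max_mu2=50):
    """Smallest integer mu2 such that D = Y - X ~ Skellam(mu2, mu1)
    satisfies P(D > LOB) >= target."""
    for mu2 in range(1, max_mu2 + 1):
        # sf(k) = P(D > k); exceeding a non-integer LOB means D >= floor(LOB)+1.
        if skellam.sf(math.floor(lob), mu2, mu1) >= target:
            return mu2
    return None

mu2_lod = lod_rate(mu1, lob)
```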
References
1. Czerkinsky CC, Nilsson LA, Nygren H, Ouchterlony O, Tarkowski A (1983) A solid-phase enzyme-linked immunospot (ELISPOT) assay for enumeration of specific antibody-secreting cells. J Immunol Methods 65(1-2): 109-121.
2. Navarrete MA (2015) ELISpot and DC-ELISpot Assay to Measure Frequency of Antigen-Specific IFN-γ-Secreting Cells. Methods Mol Biol 1318: 79-86.
3. Shrestha R, Gyawali P, Yadav BK, Dahal S, Poudel B, et al. (2011) In-vitro assessment of cell-mediated immunity by demonstrating effector cells for diagnosis of tuberculosis in Nepalese subjects. Nepal Med Coll J 13(4): 275-278.
4. Jin C, Roen DR, Lehmann PV, Kellermann GH (2013) An Enhanced ELISPOT Assay for Sensitive Detection of Antigen-Specific T Cell Responses to Borrelia burgdorferi. Cells 2(3): 607-620.
5. Cox JH, Ferrari G, Janetzki S (2006) Measurement of cytokine release at the single cell level using the ELISPOT assay. Methods 38(4): 274-282.
6. Mogg R, Fan F, Li X, Dubey S, Fu T, et al. (2003) Statistical Cross-validation of Merck's IFN-γ ELISpot Assay Positivity Criterion. AIDS Vaccine, New York, NY.
7. Hudgens MG, Self SG, Chiu YL, Russell ND, Horton H, et al. (2004) Statistical considerations for the design and analysis of the ELISpot assay in HIV-1 vaccine trials. J Immunol Methods 288(1-2): 19-34.
8. CLSI (2012) Evaluation of Detection Capability for Clinical Laboratory Measurement Procedures; Approved Guideline (2nd edn). CLSI document EP17-A2. Wayne, PA: Clinical and Laboratory Standards Institute.
9. Alzaid AA, Omair MA (2010) On the Poisson Difference Distribution Inference and Applications. Bulletin of the Malaysian Mathematical Sciences Society 33(1): 17-45.
10. CLSI (2008) User Protocol for Evaluation of Qualitative Test Performance; Approved Guideline (2nd edn). CLSI document EP12-A2. Wayne, PA: Clinical and Laboratory Standards Institute.
11. Britten CM, Gouttefangeas C, Welters MJ, Pawelec G, Koch S, et al. (2007) The CIMT-monitoring panel: A two-step approach to harmonize the enumeration of antigen-specific CD8+ T lymphocytes by structural and functional assays. Cancer Immunol Immunother 57(3): 289-302.
12. Mander A, Gouttefangeas C, Ottensmeier C, Welters MJP, Low L, et al. (2010) Serum is not required for ex vivo IFN-γ ELISPOT: A collaborative study of different protocols from the European CIMT Immunoguiding Program. Cancer Immunol Immunother 59(4): 619-627.
13. Moodie Z, Huang Y, Gu L, Hural J, Self SG (2006) Statistical positivity criteria for the analysis of ELISpot assay data in HIV-1 vaccine trials. J Immunol Methods 315(1-2): 121-132.

