Statistical Harmonization Methods in Individual Participants Data Meta-Analysis are Highly Needed

Meta-analysis has a long history within the medical sciences and epidemiology [1,2,3]. The main goal of a meta-analysis is to improve the precision of the estimated effect size of a treatment or an exposure on a clinical or disease outcome by pooling or combining multiple studies. It is frequently conducted within a systematic review of the scientific literature to guarantee that studies with appropriate information are not ignored or overlooked and to make sure that the pooled estimate represents an unbiased and precise estimate of the true effect size. Moreover, the pooled effect estimate is evaluated in the context of study heterogeneity: if substantial heterogeneity in the effect sizes across studies is present, the pooled estimate is considered less reliable or even questionable. Thus, not only must the pooled estimate be unbiased; the estimate of the heterogeneity measure must also be correct. Within the area of medical statistics, meta-analysis has evolved into an important and rich research field [4,5].

Meta-analysis can roughly be divided into aggregate data (AD) meta-analysis and individual participant data (IPD) meta-analysis. AD meta-analysis is the traditional form of meta-analysis and focuses entirely on pooling effect sizes formulated at the study level, i.e. it represents an analysis of analyses [6]. AD meta-analysis typically combines reported summary statistics from published articles. IPD meta-analysis considers the full data sets from the included studies and analyzes these data sets either as one large data set (one-stage IPD) or in two steps (two-stage IPD). In the two-stage approach, the same statistical model is typically fitted to each study separately before the estimated summary statistics or effect sizes from these models are combined using methods from AD meta-analysis. IPD meta-analysis is considered more appropriate than AD meta-analysis [7,8,9,10], in particular when observational studies are combined or pooled, since it provides the opportunity to correct for (the same) potential confounders at the individual level.
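The second stage of a two-stage IPD meta-analysis can be sketched in a few lines. The following Python snippet pools study-level estimates with inverse-variance weights and a DerSimonian-Laird random-effects estimate of the between-study variance; the function name and the per-study estimates are illustrative, not taken from any cited study:

```python
import numpy as np

def pool_random_effects(estimates, ses):
    """Stage two of a two-stage IPD meta-analysis: combine per-study
    effect estimates using a DerSimonian-Laird random-effects model."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(ses, dtype=float) ** 2
    w = 1.0 / variances                           # fixed-effect weights
    theta_fe = np.sum(w * estimates) / np.sum(w)  # fixed-effect pooled estimate
    # Cochran's Q and the method-of-moments estimate of tau^2 (between-study variance)
    q = np.sum(w * (estimates - theta_fe) ** 2)
    df = len(estimates) - 1
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - df) / c)
    w_re = 1.0 / (variances + tau2)               # random-effects weights
    theta_re = np.sum(w_re * estimates) / np.sum(w_re)
    se_re = np.sqrt(1.0 / np.sum(w_re))
    return theta_re, se_re, tau2

# Hypothetical per-study effect estimates (e.g. log hazard ratios) and standard errors
theta, se, tau2 = pool_random_effects([0.10, 0.25, 0.40], [0.08, 0.10, 0.12])
```

When tau2 is estimated as zero, the random-effects weights reduce to the fixed-effect weights, so the same routine covers both cases.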

IPD meta-analysis for pooling observational cohort studies, in particular when a one-stage analysis is applied, has surfaced an important issue that was not recognized, or at least not considered relevant, in traditional AD meta-analysis. The issue is that outcome and/or risk variables can be measured with different instruments across studies. For instance, memory can be measured with the Rey Auditory Verbal Learning Test [11] or the Buschke Cued Recall Procedure [12]. Both approaches record a discrete or integer score per individual that indicates how many items have been correctly remembered from a set of tested items, but this does not at all imply that these tests measure exactly the same construct of memory, since one test is administered visually and the other verbally. Other variables that frequently differ between studies are education and income. School systems and teaching content vary considerably across countries and are therefore difficult to compare when the IPD meta-analysis includes international studies; this is true even when just the number of years of education is used. Income can differ due to time differences among studies (i.e. due to inflation), due to different monetary units (e.g. dollars versus euros), and because salaries across occupations are not valued in the same way across countries. Even if variables are identically measured across studies, they may not be recorded in the same way, since variables can be observed either numerically or categorically.

Although substantial emphasis has been given to the way that a meta-analysis should be analyzed and reported [13,14], issues regarding this lack of content equivalence of variables across studies have received little attention in the statistical literature. One reason is that AD meta-analysis and two-stage IPD meta-analysis may simply overlook the lack of content equivalence, since the statistical analysis pools only the study-level summary statistics. In an AD meta-analysis, the selected articles may not have reported the details of the measurement instruments for all included variables. Indeed, smoking (yes/no) tells us neither the time frame of smoking nor what is being smoked. When data from multiple studies are not allowed to be shared with others, a two-stage IPD meta-analysis is executed by the researchers at the location of each study using the same set of statistical code (this is called a federated data meta-analysis). The collected variables, on for instance memory, income, and education, may simply be entered into the statistical analysis of each study without much thought about whether they are identically measured or represent something different across studies. Even when lack of content equivalence is recognized, researchers may argue that these non-identically measured variables are still strongly correlated, and that it is therefore acceptable to use them across studies as if they had been measured with the same instruments. This can, however, never be the sole argument: many variables are (strongly) correlated, but that does not mean we can interchange them in a statistical analysis or for any other purpose.

When the multi-study data are pooled at one location, differences in variables that should have been content equivalent become more easily apparent. For instance, if memory is measured with a 15-item test in one study and a 12-item test in another, this can easily be detected in an exploratory data analysis. Some kind of data manipulation is then conducted to align the scales of the variables before they can be merged into a single memory variable. Rescaling, however, only makes the ranges identical; it does not change the difference in resolution. Additionally, if the instruments for memory are also different, adjusting the scales alone may not suffice, since the tests need not capture the exact same construct. To make variables content equivalent, it would be necessary to convert each value observed with one instrument into an equivalent value on the other instrument for the exact same individual. This implies a precise conversion or calibration model that can interchange values from different instruments per individual. This process is what we call statistical harmonization of variables.
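To illustrate why rescaling aligns the range but not the resolution, consider a minimal Python sketch using the 12- and 15-item memory tests from the text (the helper function is hypothetical):

```python
def rescale(score, old_max, new_max):
    """Linearly rescale an integer test score from [0, old_max] to [0, new_max]."""
    return score * new_max / old_max

# Map all possible 12-item scores (0..12) onto the 15-point range.
attainable = sorted({round(rescale(s, 12, 15), 2) for s in range(13)})
# The endpoints now agree with the 15-item test (0 and 15), but only
# 13 distinct values are attainable: a rescaled 12-item score can never
# land on, e.g., 1, 2, or 4 of the 15-item scale.
```

The rescaled variable thus has the right range but a coarser grid, which is exactly the resolution mismatch described above, and it says nothing about whether the two tests measure the same construct.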

Calibration of test forms (e.g. IQ tests, candidate assessment tests) has long been a core topic for psychometricians, who call this discipline “test equating and linking” [15]. However, many of the suggested statistical approaches to creating equivalent test forms have focused on methods that are independent of subject characteristics, typically referred to as “population invariance”. The basic idea is that psychometric tests are measurement invariant, which means that they capture the true construct within a varying and diverse population [16]. Indeed, differences in scores among subgroups then quantify differences in performance, not differences in interpretation. Under population invariance, researchers map the distribution of test scores from one test form onto the distribution of test scores from another test form using sample data from the population. Typical statistical approaches are location-scale and quantile normalization, but when invariance is violated, test equating and linking no longer offer a real solution. Our view is that lack of invariance suggests the need for subject-specific calibration models, i.e. calibration models that can correct for subject-specific variables [17,18].
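As a sketch of the two population-invariant approaches mentioned above, the following Python functions implement location-scale linking and a simple equipercentile (quantile) link; the function names and the toy score samples are ours, not from the cited literature:

```python
import numpy as np

def location_scale_link(x, scores_a, scores_b):
    """Map a score x from form A onto form B by matching mean and
    standard deviation (classical linear equating under population invariance)."""
    return (x - scores_a.mean()) / scores_a.std() * scores_b.std() + scores_b.mean()

def quantile_link(x, scores_a, scores_b):
    """Map x from form A onto form B by matching quantiles (equipercentile
    equating): find the empirical percentile of x in A, then read off the
    score at the same percentile in B."""
    p = (scores_a <= x).mean()   # empirical percentile of x within form A
    return np.quantile(scores_b, p)

# Toy score samples for two test forms (synthetic, for illustration only)
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
b = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
```

Both links operate purely on the marginal score distributions; neither uses any subject characteristic, which is precisely why they break down when population invariance is violated.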

IPD meta-analysis within the medical sciences has used subject-specific normalization-based methods to make variables content equivalent across studies [19], but little is known about how well these statistical methods harmonize non-equivalent variables across studies for IPD meta-analysis [20]. Approaches that have been considered are algorithmic procedures, subgroup normalization, linear regression, and item response theory (IRT). Algorithmic procedures map the variable into a set of categories using specific thresholds that may depend on covariates like age, gender, and education [21]. For subgroup normalization, a well-defined subgroup or control group that exists in each study is selected, and the values of the variable are normalized per study with the mean and standard deviation of the selected subgroup [22]. For linear regression, the variable is regressed on covariates to eliminate their influence and to generate covariate-independent residuals per study; these residuals are then normalized using simple methods from equating and linking [22]. IRT models typically harmonize the latent ability of individuals by using bridging items, i.e. items that are measured in multiple studies and can connect all the studies together, assuming that bridging items are identically measured across studies [23,24].
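Two of the approaches above, subgroup normalization and regression-residual harmonization, can be sketched in a few lines of Python (function names and toy data are ours, not from the cited work):

```python
import numpy as np

def subgroup_normalize(values, is_control):
    """Normalize a study's variable with the mean and standard deviation
    of a well-defined subgroup (e.g. controls) present in every study."""
    mu = values[is_control].mean()
    sd = values[is_control].std(ddof=1)
    return (values - mu) / sd

def residual_harmonize(y, X):
    """Regress the variable on covariates via ordinary least squares and
    return covariate-independent residuals, to be normalized per study."""
    X1 = np.column_stack([np.ones(len(y)), X])   # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

# Toy data: a single study's variable with a marked control subgroup ...
vals = np.array([1.0, 2.0, 3.0, 4.0])
ctrl = np.array([True, True, False, False])
z = subgroup_normalize(vals, ctrl)

# ... and a variable that depends linearly on one covariate
y = np.array([1.0, 3.0, 5.0, 7.0])
X = np.array([[0.0], [1.0], [2.0], [3.0]])
r = residual_harmonize(y, X)
```

Applied per study, both routines put the variable on a study-internal reference scale; whether the resulting values are truly content equivalent across studies is exactly the open performance question raised in the text.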

To push the research agenda on harmonization of variables in meta-analysis forward, we believe that new and innovative statistical methods for harmonization are needed that extend the existing methods and incorporate participant characteristics. Moreover, these methods should be compared on real data and in simulation studies to judge which approach works best and under what conditions. For instance, methods may need to distinguish between harmonization of outcomes, exposures of interest, and potential confounders. Additionally, such studies should not only investigate whether effect sizes of interest are estimated without bias, but also how measures of study heterogeneity are affected by harmonization. The only form of heterogeneity that should be captured in a meta-analysis is the heterogeneity due to populations, since harmonization should eliminate the inconsistencies in measuring variables. We also believe that it would be highly beneficial if separate sampling studies were performed to quantify calibration models for complex measurements across subgroups of participants. This should lead to accepted and standardized conversion procedures, similar to the work that has been conducted in metrology on physical measurements (1 inch = 2.54 cm and 1 kg ≈ 2.2046 lbs). Finally, we strongly recommend that papers on meta-analysis discuss in detail any lack of content equivalence in their selected set of variables and report the approach used to accommodate the issue in the analysis.