Although many statistical applications brush the question of data quality aside, it is a fundamental concern inherent to external data collection. In this paper, data quality relates to the confidence one can have in the covariate values in a regression framework. More precisely, we study how to integrate the data quality information given by an \((n \times p)\)-matrix, with n the number of individuals and p the number of explanatory variables. To this end, we suggest a latent variable model that drives the generation of the covariate values, and introduce a new algorithm that takes all this information into account for prediction. Our approach provides unbiased estimators of the regression coefficients, and allows predictions to be adapted to a given quality pattern. The usefulness of our procedure is illustrated through simulations and real-life applications.

The authors would like to thank the insurance company and the data provider for the data used in Sect. 6. They are also grateful to the editor and to the two anonymous reviewers for very helpful comments on a previous version of this article.
A Proofs
A.1 Proof of Lemma 1
Thanks to Eq. (1), the expected value remains unchanged. Consider two covariates and their joint quality \((X_j, Q_j)\), \((X_k,Q_k)\) \((j\ne k)\). The qualities \(Q_j\) and \(Q_k\) correspond to the expectations of the quality variables \(\Omega _j\) and \(\Omega _k\), where \(\Omega _j\) and \(\Omega _k\) are Bernoulli random variables. Introduce \(\Omega _{jk} = (\Omega _j, \Omega _k)\) and denote by \(w_m\) the realization of \(\Omega _{jk}\). Clearly, \(w_m \in \{ (0,0),(0,1),(1,0),(1,1)\}\). We can therefore write:
Then, the final expression depends on the assumptions about the correlation structure between the covariates,
Under (X-A2) and (Z-A1), \(Cov(Z_{j},Z_{k}) = 0\). We always have in our framework \(Cov(X_{j}^{real},Z_{k}) = 0\). We thus obtain:
Similarly, under (X-A2) and (Z-A2), the covariance can be written,
Using Kolmogorov’s strong law of large numbers, \(\bar{Q}_j\) and \(\bar{Q}_k\) converge almost surely to \(Q_j\) and \(Q_k\), respectively. \(\square\)
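The attenuation of the covariance described in this proof can be checked numerically. The sketch below is an illustration only, not the authors' code. It assumes that the latent variable model of Eq. (1) takes the form \(X_j = \Omega_j X_j^{real} + (1-\Omega_j) Z_j\), that \(\Omega_j\) and \(\Omega_k\) are independent, and that the covariates are standard normal. The function names and parameter values are hypothetical.

```python
import random


def sample_cov(u, v):
    """Unbiased sample covariance of two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (n - 1)


def simulate_covariances(q_j, q_k, rho, n=200_000, seed=0):
    """Return (Cov(X_j^real, X_k^real), Cov(X_j, X_k)) from n simulated rows."""
    rng = random.Random(seed)
    real_j, real_k, obs_j, obs_k = [], [], [], []
    for _ in range(n):
        # Correlated standard normal "real" covariates, as under (X-A2).
        xr_j = rng.gauss(0.0, 1.0)
        xr_k = rho * xr_j + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        # ASSUMPTION: with probability Q the real value is observed (Omega = 1);
        # otherwise an independent draw Z with the same marginal law replaces it,
        # with independent Z_j and Z_k as under (Z-A1).
        x_j = xr_j if rng.random() < q_j else rng.gauss(0.0, 1.0)
        x_k = xr_k if rng.random() < q_k else rng.gauss(0.0, 1.0)
        real_j.append(xr_j)
        real_k.append(xr_k)
        obs_j.append(x_j)
        obs_k.append(x_k)
    return sample_cov(real_j, real_k), sample_cov(obs_j, obs_k)


if __name__ == "__main__":
    cov_real, cov_obs = simulate_covariances(q_j=0.8, q_k=0.6, rho=0.5)
    print("real covariance:    ", cov_real)
    print("observed covariance:", cov_obs)
    print("Q_j * Q_k * real:   ", 0.8 * 0.6 * cov_real)
```

Under these assumptions, the observed covariance should be close to \(Q_j Q_k\) times the real covariance, up to sampling noise.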
A.2 Proof of Lemma 2
For each covariate \(X_{j}\) (\(j = 1, \ldots , p\)), \(Z_j\) has the same distribution as \(X_j^{real}\). Given the latent variable model (Eq. 1) and under (X-A1) and (Z-A1), it is straightforward to show that:
Then, the covariance matrix \(\Sigma _{jk}\) equals:
Let us now focus on the covariance matrix \(\Sigma _{jk}\) under (X-A2) and (Z-A1). As mentioned before, the variance remains unchanged. To lighten the equations, we write \(Var(X_j^{real}) = Var_j^{real}\). However, we know from Lemma 1 that, under (Z-A1), the covariance changes proportionally to the mean quality:
Recall that the quality \(Q_j\) corresponds to \(\mathbb {E}(\Omega _j)\) for \(j = 1, \ldots , p\), where \(\Omega _j\) is a Bernoulli random variable. We write \(cor(X_k^{real},Y)\) for the Pearson correlation between \(X_k^{real}\) and Y. We assume the non-singularity of the real information matrix, i.e. \(|cor(X_j^{real},X_k^{real})| \ne 1\) (and thus \(|cor(X_j,X_k)| \ne 1\)), and notice that we state
To lighten the notation, we denote \(\rho ^{real}_{jk}\) the Pearson correlation between the two covariates \(X_j^{real}\) and \(X_k^{real}\). To find a relation between \((\Sigma ^{real}_{jk})^{-1}\) and \(\Sigma _{jk}^{-1}\), denote \((\Sigma ^{-1})_{kk}\) the kth diagonal term and \((\Sigma ^{-1})_{jk}\) the element on the jth row and kth column. We can now easily state the following relations:
leading to:
A.3 Proof of Theorem 1
Recall that the covariates are assumed to be centered and that we are under (X-A1). Using our notation, the classical OLS regression coefficient \({\hat{\beta }}\) satisfies:
As the sample size tends to infinity, \(\beta\) is the solution of
When focusing on \(M_2\), we have for \(j = 1, \ldots ,p\):
According to Lemma 2, \(\Sigma = \Sigma ^{real}\), i.e., \(Var(X_j) = Var(X_j^{real})\). Moreover, using Lemma 1,
Recall that the quality \(Q_j\) corresponds to \(\mathbb {E}(\Omega _j)\), where \(\Omega _j\) is a Bernoulli random variable. Finally, the difference easily follows:
If \(X_j^{real}\) is not centered, the intercept changes by \(- \mathbb {E}(X_j) \frac{\beta ^{M_2}_j (1 - Q_j)}{Q_j}\) for each covariate \(X_j\). The other coefficients stay unchanged by centering. Indeed, without loss of generality, consider \(\mathbb {E}(X_1) \ne 0\) and \(X_1^c= X_1 - \mathbb {E}(X_1)\). The shift is easily found
Therefore, centering \(X_1\) shifts only the intercept (by \(\mathbb {E}(X_1)\beta _1\)). We proceed by first centering the variable and uncentering it afterward. We first center \(X_1\) for the model \(M_2\): the intercept shifts by \(\mathbb {E}(X_1)\beta _1^{M_2}\), and the previous results for the centered case apply. Finally, we uncenter \(X_1\) for the real model, with a shift of \(\mathbb {E}(X_1)\beta _1\). The global shift is therefore equal to \(\mathbb {E}(X_1)\beta _1^{M_2} - \mathbb {E}(X_1)\beta _1\). Replacing \(\beta _1\) by \(\frac{\beta _1^{M_2}}{Q_1}\) completes the proof for the intercept.
To end the proof, \(\bar{Q}_j\) and \({\hat{\beta }}_j\) converge almost surely to \(Q_j\) and \(\beta _j\), respectively, by the SLLN and the properties of maximum likelihood estimators. The \({\hat{\beta }}^{M_2}\) calculated in Eq. (23) with the empirical estimators also converges almost surely. Indeed, Kolmogorov’s SLLN ensures the a.s. convergence of each estimator (variance and covariance). The continuous mapping theorem, together with the fact that a product of sequences converging a.s. also converges a.s., completes the proof of the a.s. convergence. Therefore,
Note that no Gaussian property of \({\hat{\beta }}_j^{M_2}\) can be stated, since the errors are not Gaussian. \(\square\)
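The slope and intercept relations used in this proof can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the authors' algorithm: a single covariate, a latent model of the form \(X = \Omega X^{real} + (1-\Omega) Z\) with \(Z\) distributed as \(X^{real}\), and a quality index `q` shared by all rows (so that it plays the role of \(\bar{Q}_j\)). The helper names are hypothetical.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate_and_correct(beta0, beta1, q, mean_x, n=200_000, seed=0):
    """Fit the naive model M2 on the observed covariate, then correct it."""
    rng = random.Random(seed)
    x_obs, y = [], []
    for _ in range(n):
        x_real = rng.gauss(mean_x, 1.0)
        y.append(beta0 + beta1 * x_real + rng.gauss(0.0, 1.0))
        # ASSUMPTION: the real value is kept with probability q; otherwise it
        # is replaced by an independent draw with the same distribution.
        x_obs.append(x_real if rng.random() < q else rng.gauss(mean_x, 1.0))
    a_m2, b_m2 = fit_simple_ols(x_obs, y)
    mean_obs = sum(x_obs) / n
    # Slope correction: beta = beta^{M2} / Q.
    b_corr = b_m2 / q
    # Intercept correction: change by -E(X) * beta^{M2} * (1 - Q) / Q.
    a_corr = a_m2 - mean_obs * b_m2 * (1.0 - q) / q
    return {"naive": (a_m2, b_m2), "corrected": (a_corr, b_corr)}


if __name__ == "__main__":
    result = simulate_and_correct(beta0=0.5, beta1=2.0, q=0.7, mean_x=1.5)
    print("naive (intercept, slope):    ", result["naive"])
    print("corrected (intercept, slope):", result["corrected"])
```

In this setting, the naive slope should be close to \(Q\beta_1\), and the corrected pair should be close to the true \((\beta_0, \beta_1)\), up to sampling noise.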
A.4 Proof of Theorem 2
Recall that the covariates are assumed to be centered. The proof with uncentered covariates would only modify the intercept values (see the previous proof A.3). The classical OLS regression coefficient \({\hat{\beta }}\) satisfies:
As n tends to infinity, the regression coefficient \(\beta\) is the solution of
To lighten the notation, for two correlated covariates \(X_k\), \(X_j\), \(k \ne j\), let us denote \(\rho = \rho _{jk}^{real}\). Then, using Gauss–Jordan elimination for two correlated covariates with \(|\rho | \ne 1\) in linear regression, we can state:
Thanks to Lemma 1, under assumptions (X-A2) and (Z-A1), we can write
Recall that the quality \(Q_j\) corresponds to \(\mathbb {E}(\Omega _j)\) for \(j = 1, \ldots , p\), where \(\Omega _j\) is a Bernoulli random variable. Using Cramer's rule, and since \(D \ne 0\) due to the assumption \(\{i | q_{ij} \ne 0 \} \ne \emptyset\) for \(j = 1, \ldots ,n\), the system can be solved,
where \(\gamma = 1- \rho ^2 Q_j^2 Q_k^2\), \(b_k^{M_2} = \sqrt{\frac{Var_k^{real}}{Var(Y)}} \beta ^{M_2}_k\) and \(D = \begin{vmatrix} Q_k & - Q_j^2 Q_k \rho \\ - Q_k^2 Q_j \rho & Q_j \end{vmatrix}\). By simplifying,
A relation between \(\beta\) and \(\beta ^{M_2}\) immediately follows:
The relation between \(\beta\) and \(\beta ^{M_2}\) depends on the correlation between the two variables. The shift of the intercept is the same as under (X-A1). Replacing the values with the corresponding estimators (the SLLN and the continuous mapping theorem ensure almost sure convergence),
which ends the proof. (The proof under (Z-A2) is done in the same way.) Conversely, for Corollary 14, given the mean quality indexes \((Q_k,Q_j)\), we can find \(\beta ^{M_2}\) from \(\beta\).
Indeed, in the same way as for Eq. (26), with similar notation,
To end the proof, the different values are replaced by their empirical estimators. The SLLN and the continuous mapping theorem ensure the almost sure convergence of the estimator in the last equation. \(\square\)
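The two-covariate relation can be explored numerically in standardized form. The sketch below is an illustration under stated assumptions rather than a transcription of the paper's closed-form expressions. It assumes that variances are unchanged (Lemma 2), that the correlation between the observed covariates is \(Q_j Q_k \rho\) (Lemma 1 under (Z-A1)), and that the correlation of each observed covariate with Y is attenuated by its own quality. The function names are hypothetical.

```python
def observed_std_coefs(b_j, b_k, rho, q_j, q_k):
    """Standardized M2 coefficients implied by the real ones (Cramer's rule)."""
    # Correlations with Y implied by the real standardized coefficients.
    r_j = b_j + rho * b_k
    r_k = b_k + rho * b_j
    # ASSUMPTION: attenuated correlations for the observed covariates.
    rho_obs = q_j * q_k * rho
    r_obs_j, r_obs_k = q_j * r_j, q_k * r_k
    # Determinant of the observed 2x2 correlation matrix, i.e. gamma.
    gamma = 1.0 - rho_obs ** 2
    b_m2_j = (r_obs_j - rho_obs * r_obs_k) / gamma
    b_m2_k = (r_obs_k - rho_obs * r_obs_j) / gamma
    return b_m2_j, b_m2_k


def real_std_coefs(b_m2_j, b_m2_k, rho, q_j, q_k):
    """Invert the map above: recover the real standardized coefficients."""
    rho_obs = q_j * q_k * rho
    # Undo the attenuation of the correlations with Y.
    r_j = (b_m2_j + rho_obs * b_m2_k) / q_j
    r_k = (b_m2_k + rho_obs * b_m2_j) / q_k
    # Solve the real 2x2 normal equations; requires |rho| != 1.
    det = 1.0 - rho ** 2
    return (r_j - rho * r_k) / det, (r_k - rho * r_j) / det


if __name__ == "__main__":
    b_m2 = observed_std_coefs(0.5, 0.3, rho=0.4, q_j=0.9, q_k=0.7)
    print("M2 coefficients:       ", b_m2)
    print("recovered coefficients:", real_std_coefs(*b_m2, rho=0.4, q_j=0.9, q_k=0.7))
```

With \(\rho = 0\) the map reduces to \(b^{M_2}_j = Q_j b_j\), which matches the uncorrelated case of Theorem 1; with \(\rho \ne 0\) the two coefficients interact through \(\gamma\).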
B Multivariate case
Until now, we have studied the case of pairwise correlated covariates. The classical OLS regression coefficient \({\hat{\beta }}\) satisfies:
As n tends to \(\infty\), \(\beta\) is the solution of
If \(\Sigma\) is invertible, several methods, such as Gauss–Jordan elimination, can be used to find a solution. However, one may remark that the relation between \(\beta ^{M_2}_k\) and \(\beta _k\) depends only on the Pearson correlations \(\rho _{jk}^{real}\) and on \(Q_{j}\) and \(Q_{k}\), for all covariates j correlated with covariate k. The different proofs on the OLS coefficient could be extended with this method.
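As a rough illustration of how the multivariate case might be handled numerically, the sketch below solves the population normal equations by Gauss–Jordan elimination. It is not taken from the paper. It assumes that the diagonal of \(\Sigma\) equals that of \(\Sigma^{real}\), that each off-diagonal term is multiplied by \(Q_j Q_k\), and that \(Cov(X_j, Y)\) is multiplied by \(Q_j\). All function names are hypothetical.

```python
def gauss_jordan_solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        piv = m[col][col]
        m[col] = [v / piv for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n] for row in m]


def matvec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attenuate(sigma_real, q):
    """ASSUMPTION: variances unchanged, covariances scaled by Q_j * Q_k."""
    p = len(q)
    return [[sigma_real[j][k] if j == k else q[j] * q[k] * sigma_real[j][k]
             for k in range(p)] for j in range(p)]


def beta_m2_from_beta(sigma_real, q, beta):
    """Population M2 coefficients implied by the real coefficients."""
    cov_xy_real = matvec(sigma_real, beta)
    cov_xy_obs = [qj * c for qj, c in zip(q, cov_xy_real)]
    return gauss_jordan_solve(attenuate(sigma_real, q), cov_xy_obs)


def beta_from_beta_m2(sigma_real, q, beta_m2):
    """Recover the real coefficients from the M2 coefficients."""
    cov_xy_obs = matvec(attenuate(sigma_real, q), beta_m2)
    cov_xy_real = [c / qj for qj, c in zip(q, cov_xy_obs)]
    return gauss_jordan_solve(sigma_real, cov_xy_real)


if __name__ == "__main__":
    sigma = [[1.0, 0.3, 0.2], [0.3, 1.0, 0.4], [0.2, 0.4, 1.0]]
    q = [0.9, 0.7, 0.8]
    b_m2 = beta_m2_from_beta(sigma, q, [1.0, -2.0, 0.5])
    print("beta^{M2}:     ", b_m2)
    print("recovered beta:", beta_from_beta_m2(sigma, q, b_m2))
```

The recovery step requires every \(Q_j\) to be non-zero and \(\Sigma^{real}\) to be invertible; in practice \(\Sigma^{real}\) would itself have to be estimated from the observed covariances and the quality indexes.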
Chatelain, P., Milhaud, X. Estimation and prediction with data quality indexes in linear regressions. Comput Stat 39, 3373–3404 (2024). https://doi.org/10.1007/s00180-023-01441-6
