Computational methods for bias reduction in surveys

Castro Martín, Luis

Supervised by:

María del Mar Rueda García (Supervisor)

Defended at: Universidad de Granada

Defense date: 1 July 2022

Examination committee:

Juan Eloy Ruiz Castro (Chair)
Francisco Javier Alonso Morales (Secretary)
Sergio Martínez Puertas (Member)
Maria Giovanna Ranalli (Member)
María Virtudes Alba Fernández (Member)

Type: Thesis

Teseo: 722553

Abstract

Probability sampling has long been the fundamental framework for conducting surveys from which reliable, properly justified conclusions can be drawn. However, the application of its basic principles is now threatened by the surge of new technologies. Online surveys are becoming standard because they can collect large volumes of data simply, cheaply and efficiently. The methodologies associated with these surveys, however, are usually non-probabilistic. Often, a link to the questionnaire is shared publicly, following a snowball sampling design, which implies the absence of representative design weights and causes substantial self-selection bias. Even when a sampling frame is available, the low response rates associated with the lack of human interaction produce substantial non-response bias. Finally, coverage bias is also common because part of the target population lacks access to some of the required means, whether an internet connection, a smartphone or an account on a specific social network. Despite all these problems, online surveys are in widespread use. Moreover, the decline in the response rates of traditional surveys in recent years has affected the viability of the alternatives. Great effort has therefore been devoted to developing techniques that reduce bias in non-probability surveys. The objective is to propose new methodologies that preserve the credibility of statistical studies while also exploiting the advantages of new technologies. The main proposals for this purpose are Propensity Score Adjustment, which estimates the inclusion probabilities in order to obtain representative sample weights, and Statistical Matching, which is based on predicting and imputing individuals' responses. Both rely on an auxiliary probability sample that shares some covariates with the non-probability sample, which contains the target variable of interest.
We contribute to the development of these techniques by proposing computational methods that significantly improve their efficacy. First, we consider their application with different advanced machine learning models, culminating in state-of-the-art techniques that optimize the results obtained. We also propose a novel method for combining Propensity Score Adjustment and Statistical Matching, which improves on the bias reduction obtained with each method separately. We implement many of these methods, along with other bias reduction alternatives for non-probability surveys, in NonProbEst, an easy-to-use R package. Additionally, we extend their application to further contexts. The Propensity Score Adjustment method, combined with calibration techniques, can be applied to overlapping panel surveys in order to obtain cross-sectional as well as longitudinal estimates over time. This compensates for the bias resulting from non-response in successive measurements. In this way we propose several reliable estimators, which are then applied to diverse parameters of interest in a research project on the evolution of COVID-19. We also consider a scenario in which the auxiliary probability sample includes the target variable as well. An extensive comparative study of the possible strategies is carried out. The results show the benefits of the proposed methodologies.
Note: This thesis is presented as a compendium of six publications related to its contents. The full versions of the papers are included in Appendices A1 to A6.
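
As an illustration of the two approaches named in the abstract, the following is a minimal sketch in Python and not the thesis's implementation (which is provided in the R package NonProbEst). It assumes a single binary covariate, a toy data set invented for this example, equal design weights in the reference sample, a plain logistic regression for the propensities, and the odds weight (1 - p) / p, which is only one of several weightings in use. All function and variable names are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(xs, zs, iterations=25):
    """Fit P(z=1 | x) = sigmoid(b0 + b1*x) by Newton-Raphson (one covariate)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, z in zip(xs, zs):
            p = sigmoid(b0 + b1 * x)
            r = z - p          # gradient (score) contribution
            v = p * (1.0 - p)  # information-matrix contribution
            g0 += r
            g1 += r * x
            h00 += v
            h01 += v * x
            h11 += v * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system H * delta = g
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def psa_estimate(x_np, y_np, x_ref):
    """Propensity Score Adjustment: weight the non-probability sample.

    Assumption: the reference sample has equal design weights.
    """
    xs = x_np + x_ref
    zs = [1] * len(x_np) + [0] * len(x_ref)  # 1 = non-probability sample
    b0, b1 = fit_logistic(xs, zs)
    weights = []
    for x in x_np:
        p = sigmoid(b0 + b1 * x)        # estimated propensity to participate
        weights.append((1.0 - p) / p)   # odds weight (one possible choice)
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, y_np)) / total


def matching_estimate(x_np, y_np, x_ref, d_ref):
    """Statistical Matching: predict y for the reference sample and average."""
    n = len(x_np)
    mx = sum(x_np) / n
    my = sum(y_np) / n
    sxx = sum((x - mx) ** 2 for x in x_np)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_np, y_np))
    slope = sxy / sxx
    intercept = my - slope * mx
    preds = [intercept + slope * x for x in x_ref]  # imputed responses
    return sum(d * p for d, p in zip(d_ref, preds)) / sum(d_ref)


# Toy data (invented): x = 1 is over-represented in the online sample.
x_np = [1] * 8 + [0] * 4
y_np = [1, 1, 1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]
x_ref = [1] * 5 + [0] * 5   # probability sample, covariate only
d_ref = [1.0] * 10          # equal design weights (assumption)

naive = sum(y_np) / len(y_np)
psa = psa_estimate(x_np, y_np, x_ref)
sm = matching_estimate(x_np, y_np, x_ref, d_ref)
print(round(naive, 4), round(psa, 4), round(sm, 4))
```

In this toy setting the unweighted mean of the online sample is about 0.583, while both adjusted estimates come out at approximately 0.5, the value implied by the covariate distribution of the reference sample. The two methods agree here only because a single binary covariate makes both models saturated; with richer covariates or the machine learning models considered in the thesis, they would generally differ.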