Outlier detection methods to improve the quality of citizen science data


ORIGINAL PAPER

Jennifer S. Li 1 & Andreas Hamann 1 & Elisabeth Beaubien 1

Received: 11 November 2019 / Revised: 3 July 2020 / Accepted: 6 July 2020
© ISB 2020

Abstract

Citizen science involves public participation in research, usually through volunteer observation and reporting. Data collected by citizen scientists are a valuable resource in many fields of research that require long-term observations at large geographic scales. However, such data may be perceived as less accurate than those collected by trained professionals. Here, we analyze the quality of data from a plant phenology network, which tracks biological response to climate change. We apply five algorithms designed to detect outlier observations or inconsistent observers. These methods rely on different quantitative approaches, including residuals of linear models, correlations among observers, deviations from multivariate clusters, and percentile-based outlier removal. We evaluated these methods by comparing the resulting cleaned datasets in terms of time series means, spatial data coverage, and spatial autocorrelations after outlier removal. Spatial autocorrelations were used to determine the efficacy of outlier removal, as they are expected to increase if outliers and inconsistent observations are successfully removed. All data cleaning methods resulted in better Moran’s I autocorrelation statistics, with percentile-based outlier removal and the clustering method showing the greatest improvement. Methods based on residual analysis of linear models had the strongest impact on the final bloom time mean estimates, but were among the weakest based on autocorrelation analysis. Removing entire sets of observations from potentially unreliable observers proved least effective. In conclusion, percentile-based outlier removal emerges as a simple and effective method to improve reliability of citizen science phenology observations.

Keywords Citizen science · Data cleaning · Outlier detection · Data management · Plant phenology · Climate change
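The evaluation logic summarized in the abstract (remove suspected outliers, then check whether spatial autocorrelation increases) can be illustrated with a minimal sketch. This is not the authors' implementation. The 5th/95th percentile cutoffs, the binary distance-band weights, the function names, and the toy bloom-day values are all assumptions chosen for illustration; only the general form of Moran's I and the idea of percentile-based trimming come from the text.

```python
from math import dist
from statistics import mean


def percentile(sorted_vals, p):
    """Linearly interpolated p-th percentile (0-100) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def keep_within_percentiles(values, lower=5.0, upper=95.0):
    """Indices of observations inside the [lower, upper] percentile band.

    The cutoffs are illustrative assumptions, not values from the paper.
    """
    s = sorted(values)
    lo_cut, hi_cut = percentile(s, lower), percentile(s, upper)
    return [i for i, v in enumerate(values) if lo_cut <= v <= hi_cut]


def morans_i(values, coords, max_dist):
    """Moran's I with binary weights: w_ij = 1 if sites i and j (i != j)
    lie within max_dist of each other, else 0 (an assumed weighting scheme)."""
    n = len(values)
    m = mean(values)
    dev = [v - m for v in values]
    cross, w_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i != j and dist(coords[i], coords[j]) <= max_dist:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    if w_sum == 0.0 or denom == 0.0:
        return float("nan")
    return (n / w_sum) * (cross / denom)


# Hypothetical data: bloom day increases smoothly along a transect,
# except for one implausibly late report at the fifth site.
coords = [(float(x), 0.0) for x in range(10)]
bloom_day = [100, 102, 104, 106, 170, 110, 112, 114, 116, 118]

kept = keep_within_percentiles(bloom_day)
i_before = morans_i(bloom_day, coords, max_dist=1.5)
i_after = morans_i([bloom_day[i] for i in kept],
                   [coords[i] for i in kept], max_dist=1.5)
print(f"Moran's I before: {i_before:.3f}, after: {i_after:.3f}")
```

In this toy example the two-sided trim removes the late report and also the earliest legitimate observation, which shows the trade-off between reliability and data coverage that the comparison of cleaned datasets is meant to capture.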

* Jennifer S. Li
1 Department of Renewable Resources, Faculty of Agricultural, Life, and Environmental Sciences, University of Alberta, 751 General Services Building, Edmonton, AB T6G 2H1, Canada

Introduction

Citizen science is broadly defined as scientific inquiry that includes volunteers for data collection and/or processing (Silvertown 2009). Citizen science has been documented as early as 3500 years ago with citizens and officials recording locust outbreaks in China (Miller-Rushing et al. 2012). Today, volunteer observers contribute to various research fields, including conservation science, population ecology, environmental risk assessments, pollution detection, and monitoring of the environment to detect change (e.g., Bonney et al. 2009; Silvertown 2009; Dickinson et al. 2012). Citizen scientists enable large-scale scientific data collection that would otherwise not be possible. In general, any type of biological or e