Cost estimation of spatial join in SpatialHadoop
A. Belussi 1 · S. Migliorini 1 · A. Eldawy 2

Received: 11 June 2019 / Revised: 28 February 2020 / Accepted: 18 May 2020
© Springer Science+Business Media, LLC, part of Springer Nature 2020
Abstract

Spatial join is an important operation in geo-spatial applications, since it is frequently used for performing data analysis involving geographical information. Many efforts have been made in the past decades to provide efficient algorithms for spatial join, and this becomes particularly important as the amount of spatial data to be processed increases. In recent years, the MapReduce approach has become a de-facto standard for processing large amounts of data (big data), and some attempts have been made to extend existing frameworks to the processing of spatial data. In this context, several different MapReduce implementations of spatial join have been defined, which mainly differ in the use of a spatial index and in the way this index is built and used. In general, none of these algorithms can be considered better than the others; rather, the choice depends on the characteristics of the involved datasets. The aim of this work is to analyse them in depth and to define a cost model for ranking them based on the characteristics of the datasets at hand (e.g., selectivity or spatial properties). This cost model has been extensively tested on a set of synthetic datasets in order to prove its effectiveness.

Keywords Spatial join · Cost model · SpatialHadoop · MapReduce · Big spatial data analysis
1 Department of Computer Science, University of Verona, Verona, Italy
2 Department of Computer Science and Engineering, University of California Riverside, Riverside, CA, USA

1 Introduction

In the last few years, a large amount of effort has been devoted by researchers to providing MapReduce implementations of several operations that are usually required for performing
big data analysis. In particular, the join operation has attracted much attention, since it is frequently used in data processing; for instance, a join is necessary for linking log data to user records. This effort has produced a set of different MapReduce implementations of the join operation [7, 15], each one applicable to a particular situation. Several works have therefore followed, proposing heuristics that allow the system to decide which implementation to apply, given some parameters that characterize the specific case. More specifically, starting from a set of parameters describing both the operation to perform (target parameters) and the input datasets (data parameters), such heuristics produce an estimate of the cost of executing the operation on a cluster with a given configuration (system parameters). This estimation engine is usually called a cost model. Only a few studies in the literature propose a cost model for MapReduce implementations.
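To make the role of such a cost model concrete, the following Java sketch shows one way the idea could be organized: each candidate join implementation exposes a cost estimate as a function of target, data, and system parameters, and the model simply ranks the candidates by that estimate. This is a minimal illustration under our own assumptions; none of the names below belong to SpatialHadoop's actual API.

```java
// Minimal sketch (hypothetical, not SpatialHadoop's API): a cost model seen as
// a function of target, data, and system parameters, used to rank candidate
// spatial-join implementations.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class JoinCostModelSketch {

    // Target parameters: describe the operation to perform (e.g., the join predicate).
    record TargetParams(String predicate) {}

    // Data parameters: characteristics of one input dataset.
    record DataParams(long cardinality, double avgMbrAreaRatio) {}

    // System parameters: the cluster configuration.
    record SystemParams(int numNodes, long hdfsBlockBytes) {}

    // Each candidate join implementation provides its own cost estimate.
    interface JoinAlgorithm {
        String name();
        double estimateCost(TargetParams t, DataParams d1, DataParams d2, SystemParams s);
    }

    // Rank the candidates by estimated cost, cheapest first.
    static List<JoinAlgorithm> rank(List<JoinAlgorithm> candidates, TargetParams t,
                                    DataParams d1, DataParams d2, SystemParams s) {
        List<JoinAlgorithm> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(a -> a.estimateCost(t, d1, d2, s)));
        return ranked;
    }
}
```

In this reading, the heuristics discussed above correspond to the bodies of the estimateCost methods: each one maps the same parameter sets to a cost prediction for its particular join strategy.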