Predictive intelligence of reliable analytics in distributed computing environments

Yiannis Kathidjiotis¹ · Kostas Kolomvatsos¹ · Christos Anagnostopoulos¹

© The Author(s) 2020

Abstract

Lack of knowledge of the underlying data distribution in distributed large-scale data can be an obstacle when issuing analytics and predictive modelling queries. Analysts have a hard time finding analytics/exploration queries that satisfy their needs. In this paper, we study how exploration query results can be predicted in order to avoid executing 'bad' (non-informative) queries that waste network, storage, and financial resources, as well as time, in a distributed computing environment. The proposed methodology clusters a training set of exploration queries together with the cardinality of the results they retrieved (the score) and then uses query-centroid representatives to make predictions. After the training phase, we propose a novel refinement process that increases the reliability of predicting the score of new, unseen queries based on the refined query representatives. Comprehensive experimentation with real datasets shows that predictions are more reliable after the proposed refinement method, which increases the reliability of the closest centroid and improves predictability under the right circumstances.

Keywords Predictive intelligence · Exploration query prediction · Centroid refinement · Machine learning
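To make the cluster-then-predict idea in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each query can be encoded as a numeric vector (here a hypothetical `(x, y, radius)` range query), that plain k-means is an acceptable stand-in for the clustering step (the abstract does not name the algorithm), and that the score of a new query is read from the closest representative. The names `init_centroids`, `kmeans`, `fit`, and `predict` are illustrative only, and the refinement process is not shown.

```python
import math


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (an assumption of this sketch)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means over lists of floats; returns k centroids."""
    centroids = init_centroids(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            groups[idx].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        new_centroids = [
            [sum(col) / len(g) for col in zip(*g)] if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def fit(queries, scores, k):
    """Cluster joint (query, score) vectors and split each centroid into
    a query representative and the score stored with it."""
    joint = [list(q) + [float(s)] for q, s in zip(queries, scores)]
    return [(c[:-1], c[-1]) for c in kmeans(joint, k)]


def predict(representatives, query):
    """Return the score stored with the representative closest to `query`."""
    _, score = min(representatives, key=lambda r: math.dist(r[0], query))
    return score


# Toy training set: hypothetical (x, y, radius) queries and their result counts.
queries = [(0.1, 0.1, 0.05), (0.2, 0.1, 0.05), (0.1, 0.2, 0.05),
           (0.9, 0.9, 0.05), (0.8, 0.9, 0.05), (0.9, 0.8, 0.05)]
scores = [10, 12, 11, 200, 210, 190]

reps = fit(queries, scores, k=2)
print(predict(reps, (0.15, 0.15, 0.05)))  # 11.0
print(predict(reps, (0.85, 0.85, 0.05)))  # 200.0
```

In this sketch the unscaled score dominates the Euclidean distance in the joint space, so a practical version would probably normalise the features before clustering. A new query is matched against the query part of each representative only, because its score is not known before execution.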

1 Introduction

Large-scale data analytics, predictive modelling, and exploration tasks have rightfully found their place in almost all of today's industries, owing to the importance and relevance of data in distributed computing environments. While having access to huge amounts of data is very beneficial, it has introduced many new challenges. One of them is that the data cannot be accessed directly, as they would be in a traditional data management system; instead, subsets of them can be acquired through exploration querying.

¹ Essence: Pervasive & Distributed Intelligence Research Lab, School of Computing Science, University of Glasgow, Glasgow, G12 8QQ, UK

Although exploration querying acts as a solution for accessing distributed data, in most cases there is a lack of knowledge about the underlying data distributions and their impact on the results. As a result, users/analysts may find it hard to come up with a query to execute, and it can be even harder to find a query that returns a satisfying number of results. The number of results returned by a query, referred to from here on as its score, can vary from too few to be significant to far more than needed. Apart from the frustration that might be involved in finding the correct query, executing the aforementioned queries can lead to the waste of network and storage