psda: A tool for extracting knowledge from symbolic data with an application in Brazilian educational data

METHODOLOGIES AND APPLICATION

Wagner J. F. Silva · Renata M. C. R. Souza · F. J. A. Cysneiros

© Springer-Verlag GmbH Germany, part of Springer Nature 2020

Abstract

Symbolic polygonal data analysis is a new framework for extracting valuable knowledge from a new data structure: regular polygons built from data in classes, big data, and complex data. This paper introduces a toolbox for symbolic polygonal data, named psda, that provides the main descriptive measures for this type of variable (e.g., mean, variance, and correlation) and a polygonal linear regression model (plr). The toolbox is applied to data from the Brazilian Basic Education Assessment System (SAEB), giving county managers a new perspective for carrying out public policy in the Brazilian educational system. A hypothesis test showed that the polygonal linear regression model performed better than some symbolic interval regression models in the SAEB application.

Keywords Polygonal data · psda · Symbolic data analysis · Regression · Descriptive measures · R

Communicated by V. Loia.

1 Introduction

Data analysis is a fundamental framework for extracting knowledge in biology, statistics, computer science, data mining, and other fields. It comprises many techniques developed over the years, e.g., mean, variance, correlation, and graphics. For centuries the object of study of data analysis has been a p-dimensional point in $\mathbb{R}^p$; this framework is known as classical data analysis. With technological advances, the structure of data has evolved every day, but classical data analysis is limited to the study of p-dimensional points. Billard and Diday (2003, 2007) introduced a new type of data that accommodates complex and diverse data structures, e.g., histograms, probability distributions, intervals, and lists of categories. This type of data is called symbolic data, and its study is known as symbolic data analysis (SDA).

The first step in SDA is to build the symbolic dataset, where the rows are subsets of individual entities having a common property, called classes. To capture the variability of the individuals inside each class, these new units are described by variables that can take symbolic values. According to Diday (2016), these classes are new units at a higher level of generalization than individuals, and they reduce the initial huge size of an input data set by summarizing it. Moreover, classes can represent real units of interest to the data analyst. In this context, classes can be seen as a new data framework to be built before extracting knowledge with data science methods and tools. The second step in SDA is to extend machine learning and statistical techniques to symbolic data. The class framework provides some advantages (Diday 2016):

– When the popu