A Data-Driven Multidimensional Indexing Method for Data Mining in Astrophysical Databases



Marco Frailis Dipartimento di Fisica, Università degli Studi di Udine, Via delle Scienze 208, 33100 Udine, Italy Email: [email protected]

Alessandro De Angelis INFN, Sezione di Trieste, Gruppo Collegato di Udine, Via delle Scienze 208, 33100 Udine, Italy Email: de [email protected]

Vito Roberto Dipartimento di Matematica e Informatica, Università degli Studi di Udine, Via delle Scienze 208, 33100 Udine, Italy Email: [email protected]

Received 1 June 2004; Revised 2 March 2005

Large archives and digital sky surveys with dimensions of 10^12 bytes currently exist, while in the near future they will reach sizes of the order of 10^15. Numerical simulations are also producing comparable volumes of information. Data mining tools are needed for information extraction from such large datasets. In this work, we propose a multidimensional indexing method, based on a static R-tree data structure, to efficiently query and mine large astrophysical datasets. We follow a top-down construction method, called VAMSplit, which recursively splits the dataset on a near median element along the dimension with maximum variance. The obtained index partitions the dataset into nonoverlapping bounding boxes, with volumes proportional to the local data density. Finally, we show an application of this method for the detection of point sources from a gamma-ray photon list.

Keywords and phrases: multidimensional indexing, VAMSplit R-tree, nearest-neighbor query, one-class SVM, point sources.

1. INTRODUCTION

At present, several projects for the multiwavelength observation of the universe are underway, for example, SDSS, GALEX, POSS2, DENIS, and so forth [1]. In the coming years, new space missions will be launched (e.g., GLAST, Swift [2, 3]), surveying the whole sky at different wavelengths (gamma-ray, X-ray, optical). In the astroparticle and astrophysical fields, data are mostly characterized by multidimensional arrays. For instance, in X-ray and gamma-ray astronomy, the data gathered by detectors are lists of detected photons whose properties include position (RA, DEC), arrival time, energy, error measures for both the position and the energy estimates (dependent on the instrument response), and quality measures of the events. Source catalogs, produced by the analysis of the raw data, are lists of point and extended sources characterized by coordinates, magnitude, spectral indexes, flux, and so forth. Data mining applied to multidimensional data analyzes the relationships between the attributes of a multidimensional object stored in the database and the attributes of the neighboring ones. Typical queries required by this kind of analysis are the following: (i) point queries, to find all objects overlapping the query point; (ii) range queries, to find all objects having at least one common point with a query window; and (iii) nearest-neighbor queries, to find all objects that have a minimum distance from the query object. Another important