Approximate Clustering of Time Series Using Compact Model-Based Descriptions

Clustering time series is usually limited by the fact that the length of the time series has a significantly negative influence on the runtime. On the other hand, approximative clustering applied to existing compressed representations of time series (e.g.

  • PDF / 687,635 Bytes
  • 16 Pages / 430 x 660 pts Page_size
  • 67 Downloads / 270 Views

DOWNLOAD

REPORT


Abstract. Clustering time series is usually limited by the fact that the length of the time series has a significantly negative influence on the runtime. On the other hand, approximative clustering applied to existing compressed representations of time series (e.g. obtained through dimensionality reduction) usually suffers from low accuracy. We propose a method for the compression of time series based on mathematical models that explore dependencies between different time series. In particular, each time series is represented by a combination of a set of specific reference time series. The cost of this representation depend only on the number of reference time series rather than on the length of the time series. We show that using only a small number of reference time series yields a rather accurate representation while reducing the storage cost and runtime of clustering algorithms significantly. Our experiments illustrate that these representations can be used to produce an approximate clustering with high accuracy and considerably reduced runtime.

1

Introduction

Clustering time series data is a very important data mining task for a wide variety of application fields including stock marketing, astronomy, environmental analysis, molecular biology, and medical analysis. In such application areas the time series have usually an enormous length which has a significantly negative influence on the runtime of the clustering process. As a consequence, a lot of research work has focused on efficient methods for similarity search in and clustering of time series in the past years. Time series are sequences of discrete quantitative data assigned to specific moments in time, i.e. a time series X is a sequence of values X = x1 , . . . , xN , where xi is the value at time slot i. This sequence is often also taken as a N dimensional feature vector, i.e. X ∈ N . The performance of clustering algorithms for time series data is mainly limited by the cost required to compare pairs of time series (i.e. the processing cost of the used distance function). As indicated above, time series are usually very large containing several thousands of values per sequence. Consequently, the

Ê

J.R. Haritsa, R. Kotagiri, and V. Pudi (Eds.): DASFAA 2008, LNCS 4947, pp. 364–379, 2008. c Springer-Verlag Berlin Heidelberg 2008 

Approximate Clustering of Time Series

365

comparison of two time series can be very expensive, particularly when considering the entire sequence of values of the compared objects. The most prominent approaches to measure the similarity of time series are the Euclidean distance and Dynamic Time Warping (DTW). The choice of the distance function mainly depends on the application. In some applications, the Euclidean distance produce better results whereas in other applications, DTW is superior. The big limitation of DTW is its high computational cost of O(N 2 ) while the Euclidean distance between two time series can be computed in O(N ). Since we consider large databases and long time series (i.e. large values of N ) in this paper,

Data Loading...