iMass : an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity

  • PDF / 509,315 Bytes
  • 3 Pages / 612.284 x 802.205 pts Page_size
  • 27 Downloads / 301 Views

DOWNLOAD

REPORT


iMass: an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity Panthadeep BHATTACHARJEE

, Pinaki MITRA

Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati 781039, India c Higher Education Press 2020 

1

Research goals

The primary goal of this work is to propose an approximate incremental solution to MBSCAN [1] known as the iMass clustering algorithm for processing point-wise insertions dynamically (Table 1, Fig. 1). MBSCAN acts as a robust baseline because it replaces distance based density measure of DBSCAN [2] by a mass-based [3, 4] approach.

2

Background

The MBSCAN algorithm comprises of the following components: iForest, mass based dissimilarity matrix(mass-matrix), μ-neighborhood mass, core and non-core points, clusters and outliers. The iForest combines multiple iT rees, each being a binary tree representing a particular hierarchical partitioning model Hi , i = 1, 2, 3, ..., t, where t denotes the maximum numTable 1 Dataset

Libras

Segment

Wine

S1

ber of iT rees. The following Eq. (1) defines the mass of a region containing points a and b: 

Mr (a, b|H; D) =

1(c ∈ r),

where r is any region, D is the dataset. ∀a, b ∈ D, we have Eq. (2) defining the mass of smallest local region [1] containing a and b: R(a, b|H; D) = argminr⊂H

 s.t. {a,b}∈r

1(c ∈ r).

241 270 300 320 360 1541 1600 2000 2200 2310 119 125 155 170 178 601 650 700 800 900

MBSCAN/s 13 14 14 14 16 7 7 7 7 7 14 14 13 14 14 14 14 14 14 14

Received April 3, 2019; accepted December 12, 2019 E-mail: [email protected]; [email protected]

iForest iMass/s 0.00258 0.002525 0.002359 0.00317 0.001829 0.00196 0.00065 0.00067 0.00068 0.00068 0.0015 0.00207 0.00162 0.00150 0.002 0.00218 0.00182 0.00181 0.00199 0.00337

(2)

c∈D

Mass based dissimilarity [1] of a and b (Eq. (3)) is defined as the expected probability of R(a,b | H;D):

Efficiency achieved by building iForest and mass-matrix incrementally due to iMass. Datasets (See UCI Machine Learning Repository) D

(1)

r⊆H s.t. {a,b}∈r c∈D

Reduction/% 99.98 99.98 99.98 99.97 99.98 99.97 99.99 99.99 99.99 99.99 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.97

MBSCAN/s 4.4741 5.1014 6.8980 10.1717 16.5059 444.12 604.434 769.443 948.945 1189.24 0.5356 0.6408 1.3750 1.6561 2.057 49.102 79.1108 105.37 134.963 193.92

Mass-matrix iMass/s 0.4009 0.4953 0.6208 0.7097 0.9130 13.6027 14.5633 22.516 27.3928 30.1611 0.0985 0.1079 0.1701 0.2021 0.2134 2.6358 3.1398 3.69813 5.0534 6.3625

Reduction/% 91.03 90.28 90.99 93.02 94.46 96.93 97.59 97.07 97.11 97.46 81.60 83.15 86.96 87.79 89.62 94.63 96.03 96.49 96.25 96.71

2

Front. Comput. Sci., 2021, 15(2): 152314

Fig. 1 Libras and Segment dataset evaluation. (a) CPU execution time for iMass and MBSCAN (Libras dataset); (b) Percentage of affected nodes from iForest post insertion (Libras dataset); (c) CPU execution time for iMass and MBSCAN (Segment dataset); (d) Percentage of affected nodes from iForest post insertion (Segment dataset)

me (a, b|H; D) = EH(D) [PF