iMass : an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity
- PDF / 509,315 Bytes
- 3 Pages / 612.284 x 802.205 pts Page_size
- 27 Downloads / 301 Views
iMass: an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity Panthadeep BHATTACHARJEE
, Pinaki MITRA
Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati 781039, India c Higher Education Press 2020
1
Research goals
The primary goal of this work is to propose an approximate incremental solution to MBSCAN [1] known as the iMass clustering algorithm for processing point-wise insertions dynamically (Table 1, Fig. 1). MBSCAN acts as a robust baseline because it replaces distance based density measure of DBSCAN [2] by a mass-based [3, 4] approach.
2
Background
The MBSCAN algorithm comprises of the following components: iForest, mass based dissimilarity matrix(mass-matrix), μ-neighborhood mass, core and non-core points, clusters and outliers. The iForest combines multiple iT rees, each being a binary tree representing a particular hierarchical partitioning model Hi , i = 1, 2, 3, ..., t, where t denotes the maximum numTable 1 Dataset
Libras
Segment
Wine
S1
ber of iT rees. The following Eq. (1) defines the mass of a region containing points a and b:
Mr (a, b|H; D) =
1(c ∈ r),
where r is any region, D is the dataset. ∀a, b ∈ D, we have Eq. (2) defining the mass of smallest local region [1] containing a and b: R(a, b|H; D) = argminr⊂H
s.t. {a,b}∈r
1(c ∈ r).
241 270 300 320 360 1541 1600 2000 2200 2310 119 125 155 170 178 601 650 700 800 900
MBSCAN/s 13 14 14 14 16 7 7 7 7 7 14 14 13 14 14 14 14 14 14 14
Received April 3, 2019; accepted December 12, 2019 E-mail: [email protected]; [email protected]
iForest iMass/s 0.00258 0.002525 0.002359 0.00317 0.001829 0.00196 0.00065 0.00067 0.00068 0.00068 0.0015 0.00207 0.00162 0.00150 0.002 0.00218 0.00182 0.00181 0.00199 0.00337
(2)
c∈D
Mass based dissimilarity [1] of a and b (Eq. (3)) is defined as the expected probability of R(a,b | H;D):
Efficiency achieved by building iForest and mass-matrix incrementally due to iMass. Datasets (See UCI Machine Learning Repository) D
(1)
r⊆H s.t. {a,b}∈r c∈D
Reduction/% 99.98 99.98 99.98 99.97 99.98 99.97 99.99 99.99 99.99 99.99 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.97
MBSCAN/s 4.4741 5.1014 6.8980 10.1717 16.5059 444.12 604.434 769.443 948.945 1189.24 0.5356 0.6408 1.3750 1.6561 2.057 49.102 79.1108 105.37 134.963 193.92
Mass-matrix iMass/s 0.4009 0.4953 0.6208 0.7097 0.9130 13.6027 14.5633 22.516 27.3928 30.1611 0.0985 0.1079 0.1701 0.2021 0.2134 2.6358 3.1398 3.69813 5.0534 6.3625
Reduction/% 91.03 90.28 90.99 93.02 94.46 96.93 97.59 97.07 97.11 97.46 81.60 83.15 86.96 87.79 89.62 94.63 96.03 96.49 96.25 96.71
2
Front. Comput. Sci., 2021, 15(2): 152314
Fig. 1 Libras and Segment dataset evaluation. (a) CPU execution time for iMass and MBSCAN (Libras dataset); (b) Percentage of affected nodes from iForest post insertion (Libras dataset); (c) CPU execution time for iMass and MBSCAN (Segment dataset); (d) Percentage of affected nodes from iForest post insertion (Segment dataset)
me (a, b|H; D) = EH(D) [PF
Data Loading...