iMass : an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity
- PDF / 509,315 Bytes
 - 3 Pages / 612.284 x 802.205 pts Page_size
 - 27 Downloads / 357 Views
 
		    iMass: an approximate adaptive clustering algorithm for dynamic data using probability based dissimilarity Panthadeep BHATTACHARJEE
 
 , Pinaki MITRA
 
 Department of Computer Science and Engineering, Indian Institute of Technology, Guwahati 781039, India c Higher Education Press 2020 
 
 1
 
 Research goals
 
 The primary goal of this work is to propose an approximate incremental solution to MBSCAN [1] known as the iMass clustering algorithm for processing point-wise insertions dynamically (Table 1, Fig. 1). MBSCAN acts as a robust baseline because it replaces distance based density measure of DBSCAN [2] by a mass-based [3, 4] approach.
 
 2
 
 Background
 
 The MBSCAN algorithm comprises of the following components: iForest, mass based dissimilarity matrix(mass-matrix), μ-neighborhood mass, core and non-core points, clusters and outliers. The iForest combines multiple iT rees, each being a binary tree representing a particular hierarchical partitioning model Hi , i = 1, 2, 3, ..., t, where t denotes the maximum numTable 1 Dataset
 
 Libras
 
 Segment
 
 Wine
 
 S1
 
 ber of iT rees. The following Eq. (1) defines the mass of a region containing points a and b: 
 
 Mr (a, b|H; D) =
 
 1(c ∈ r),
 
 where r is any region, D is the dataset. ∀a, b ∈ D, we have Eq. (2) defining the mass of smallest local region [1] containing a and b: R(a, b|H; D) = argminr⊂H
 
  s.t. {a,b}∈r
 
 1(c ∈ r).
 
 241 270 300 320 360 1541 1600 2000 2200 2310 119 125 155 170 178 601 650 700 800 900
 
 MBSCAN/s 13 14 14 14 16 7 7 7 7 7 14 14 13 14 14 14 14 14 14 14
 
 Received April 3, 2019; accepted December 12, 2019 E-mail: [email protected]; [email protected]
 
 iForest iMass/s 0.00258 0.002525 0.002359 0.00317 0.001829 0.00196 0.00065 0.00067 0.00068 0.00068 0.0015 0.00207 0.00162 0.00150 0.002 0.00218 0.00182 0.00181 0.00199 0.00337
 
 (2)
 
 c∈D
 
 Mass based dissimilarity [1] of a and b (Eq. (3)) is defined as the expected probability of R(a,b | H;D):
 
 Efficiency achieved by building iForest and mass-matrix incrementally due to iMass. Datasets (See UCI Machine Learning Repository) D
 
 (1)
 
 r⊆H s.t. {a,b}∈r c∈D
 
 Reduction/% 99.98 99.98 99.98 99.97 99.98 99.97 99.99 99.99 99.99 99.99 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.98 99.97
 
 MBSCAN/s 4.4741 5.1014 6.8980 10.1717 16.5059 444.12 604.434 769.443 948.945 1189.24 0.5356 0.6408 1.3750 1.6561 2.057 49.102 79.1108 105.37 134.963 193.92
 
 Mass-matrix iMass/s 0.4009 0.4953 0.6208 0.7097 0.9130 13.6027 14.5633 22.516 27.3928 30.1611 0.0985 0.1079 0.1701 0.2021 0.2134 2.6358 3.1398 3.69813 5.0534 6.3625
 
 Reduction/% 91.03 90.28 90.99 93.02 94.46 96.93 97.59 97.07 97.11 97.46 81.60 83.15 86.96 87.79 89.62 94.63 96.03 96.49 96.25 96.71
 
 2
 
 Front. Comput. Sci., 2021, 15(2): 152314
 
 Fig. 1 Libras and Segment dataset evaluation. (a) CPU execution time for iMass and MBSCAN (Libras dataset); (b) Percentage of affected nodes from iForest post insertion (Libras dataset); (c) CPU execution time for iMass and MBSCAN (Segment dataset); (d) Percentage of affected nodes from iForest post insertion (Segment dataset)
 
 me (a, b|H; D) = EH(D) [PF		
Data Loading...