A Survey of Open Source Data Mining Systems



1 Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055, China
2 Australian Taxation Office, Australia

Abstract. Open source data mining software represents a new trend in data mining research, education and industrial applications, especially in small and medium enterprises (SMEs). With open source software an enterprise can easily initiate a data mining project using the most current technology. Often the software is available at no cost, allowing the enterprise to focus instead on ensuring that its staff can freely learn the data mining techniques and methods. Open source ensures that staff can understand exactly how the algorithms work by examining the source code, if they so desire, and can also fine-tune the algorithms to suit the specific purposes of the enterprise. However, diversity, instability, scalability and poor documentation can be major concerns in using open source data mining systems. In this paper, we survey open source data mining systems currently available on the Internet. We compare 12 open source systems on several aspects, including general characteristics, data source accessibility, data mining functionality, and usability. We discuss the advantages and disadvantages of these open source data mining systems.

Keywords: Open source software, data mining, FLOSS.

1 Introduction

Open source software has a solid foundation with, for example, the GNU project [1] dating from 1984. Open source software (also referred to as free, libre, open source software, or FLOSS) is well known through the GNU software suite upon which GNU/Linux is based, but also through the widely used MySQL, Apache, JBoss, and Eclipse software, to highlight just a few. The open source hosting site sourceforge.net, for example, lists over 100,000 open source projects. Open source for business intelligence (BI) has also been gathering momentum in recent years. The open source database, MySQL [2], is widely used in building data warehouses and data marts to support BI applications. The open

This paper was supported by the National Natural Science Foundation of China (NSFC) under grant No. 60603066. Corresponding author.

T. Washio et al. (Eds.): PAKDD 2007 Workshops, LNAI 4819, pp. 3–14, 2007. © Springer-Verlag Berlin Heidelberg 2007


X. Chen et al.

source data mining platform, Weka [3], has been a popular platform for sharing algorithms amongst researchers. In the past six months, for example, there have been between 16,353 and 29,950 downloads per month.¹ Open source data mining is particularly important and effective for small and medium enterprises (SMEs) wishing to adopt business intelligence solutions for marketing, customer service, e-business, and risk management. Due to the high cost of commercial software and the uncertainty associated with bringing data mining into an enterprise, many SMEs look to adopt a low-cost approach to experimenting with data mining solutions and gaining data mining expertise. With open source software an e