Introducing a New Scalable Data-as-a-Service Cloud Platform for Enriching Traditional Text Mining Techniques by Integrat
1 High-Performance Computing Center Stuttgart, Nobelstr. 19, 70569 Stuttgart, Germany {cheptsov,tenschert}@hlrs.de
2 Institute of the Society for the Promotion of Applied Information Sciences at the Saarland University, Martin-Luther-Str. 14, 66111 Saarbrücken, Germany [email protected]
3 University of Ulm, Institute of Artificial Intelligence, 89069 Ulm, Germany [email protected]
4 Objectivity, Inc., 3099 North First Street, Suite 200, San Jose, CA 95134, USA [email protected]
5 derivo GmbH, James-Franck-Ring, 89081 Ulm, Germany [email protected]
Abstract. A good deal of the digital data produced in academia, commerce, and industry consists of raw, unstructured text, such as Word documents, Excel tables, emails, and web pages, much of it written in natural language. An important analytical task in a number of scientific and technological domains is to retrieve information from text data in order to gain deeper insight into its content and to obtain useful, often not explicitly stated knowledge and facts related to a particular domain of interest. The major challenge is the size, structural complexity, and update frequency of the analysed text sets (i.e., the ‘big data’ aspect), which makes the use of traditional analysis techniques and tools impossible. We introduce an innovative approach to analysing unstructured text data. It improves traditional data mining techniques by adopting algorithms from ontological domain modelling, natural language processing, and machine learning. The technique is designed with parallelism in mind, which allows for high performance on large-scale Cloud computing infrastructures.
Keywords: Data-as-a-Service, Text Mining, Ontology Modelling, Cloud computing.
Z. Huang et al. (Eds.): WISE 2013 Workshops 2013, LNCS 8182, pp. 62–74, 2014. © Springer-Verlag Berlin Heidelberg 2014
1 Introduction
Modern IT technologies are becoming increasingly data-centric, fostered by the broad availability of platforms for data acquisition, collection, and storage. The concepts of linked and open data have enabled a fundamentally new dimension of data analysis, which is no longer limited to internal document collections, i.e., “local data”, but comprises a number of heterogeneous data sources, in particular from the Web, i.e., “global data”. However, existing data processing and analysis technologies are still far from being able to scale to the demands of global data and, in the case of large industrial corporations, even of local data; this is the core of the “big data” problem. The design of current data analysis algorithms therefore needs to be reconsidered so that they can scale to big data demands. The problem has two major aspects: (1) the rigid design of current algorithms makes it impossible to integrate them with other techniques that would help increase the analysis quality, and (2) the sequential design of the algorithms pr