Accessibility, Adaptability, and Extendibility: Dealing with the Small Data Problem

T. Bauer and D. Garcia
Sandia National Laboratories, Albuquerque, NM, USA
[email protected], [email protected]
Abstract

An underserved niche exists for data mining tools in complex analytical environments. We propose three attributes of analytical tool development that facilitate rapid operationalization of new tools into complex, dynamic environments: accessibility, adaptability, and extendibility. Accessibility we define as the ability to load data into an analytical system quickly and seamlessly. Adaptability we define as the ability to apply a tool rapidly to new, unanticipated use cases. Extendibility we define as the ability to create new functionality “in the field” where it is being used and, if needed, harden that new functionality into a new, more permanent user interface. Distributed “big data” systems generally do not optimize for these attributes, creating an underserved niche for new analytical tools. In this paper we will define the problem, examine the three attributes, and describe the architecture of an example system called Citrus that we have built and use that is especially focused on these three attributes.

Keywords: Human factors · Text analysis · Data mining · Analytical tools
1 Introduction

Data mining needs for national security are complex. The industry has seen analytical tool capabilities evolve quickly over the years. A decade ago, the ability to perform modest text analysis over several thousand documents on an individual desktop was considered an accomplishment, and large-scale distributed computing
required highly specialized hardware and staff with engineering degrees. Today, a high-end but still stock laptop can process millions of documents, and a bright high school student can set up a basic Hadoop cluster.

Current conventional wisdom has led many information technology departments serving analytical environments to focus on building large-scale, distributed computational systems. There are advantages to this approach. Consolidating analytical capabilities into a centralized, shared location reduces the need for individual software deployments and data distribution, and it makes the system and data easier to manage. Having a single location to “update everything” also makes it easier for an IT department to deploy new capabilities on a wide scale.

It might seem that this would lead to rapid deployment of new analytical capabilities. After all, if an IT department can put new capabilities in a single place and have them immediately available to every user, this should improve the rapid operationalization of new analytical capabilities. On the one hand, this approach may work for certain computing technologies such as web-based email. In these kinds of situations, the solutions offered are generally “one size fits all” and ther
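To ground the claim that a stock laptop can handle millions of documents, the following is a minimal sketch of single-machine text analysis: a streaming term count over a directory of plain-text files. The corpus path, the one-document-per-file layout, and the counting task itself are illustrative assumptions for this sketch, not features of the Citrus system described later.

```python
# Minimal single-machine text analysis sketch: stream a corpus once and
# keep only the aggregate term counts in memory. The corpus layout (one
# plain-text document per .txt file under corpus/) is assumed for
# illustration only.
import re
from collections import Counter
from pathlib import Path

TOKEN = re.compile(r"[A-Za-z']+")

def term_counts(corpus_dir: str) -> Counter:
    """Return term frequencies across every *.txt file under corpus_dir."""
    counts = Counter()
    for doc in Path(corpus_dir).rglob("*.txt"):
        # Read one document at a time so memory scales with vocabulary
        # size, not with corpus size.
        text = doc.read_text(encoding="utf-8", errors="ignore")
        counts.update(token.lower() for token in TOKEN.findall(text))
    return counts

if __name__ == "__main__":
    for term, n in term_counts("corpus").most_common(10):
        print(f"{term}\t{n}")
```

Because only the running counts stay resident, the working set grows with the vocabulary rather than the document count, which is what makes this kind of workload comfortable on a single machine.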