OPINION PAPER
Using unethical data to build a more ethical world
How CallMiner handles imperfections in speech recognition

Jamie Brandon, CallMiner, Waltham, USA ([email protected])

Received: 27 August 2020 / Accepted: 29 August 2020
© Springer Nature Switzerland AG 2020
Abstract
Data scientists use data to train models. Those models calculate probabilities to capture patterns in the data. It's difficult to build ethical models when the available training data contains racism, sexism, or other stereotypes. Contact center data, including calls, chats, texts, and emails, is no exception. Instead of building a model to automate decision-making processes, we use the unethical findings from our model as insights. We discuss debiasing options for removing racism from the model but find that removing this bias also removes a crucial insight that an analyst deserves to know. By leaving the model with all the biases learned from the training data, we can provide better analytics, and analysts can recommend solutions that start to dismantle the systemic racism present in our society. Debiasing is not always appropriate: censoring the model makes it harder to identify what can be done to prevent racism in our procedures and in society.

Keywords Ethics · NLP · Word embeddings · Debiasing
1 Introduction

When a model performs poorly, it's easy to blame the data. After all, the model simply captures patterns from the training data, like quick restaurant service correlating with a positive review. That model might be used to predict whether a new, unlabeled review is positive or negative. Models can also be used descriptively, showing insights into what might be causing the positive or negative reviews.
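As a rough illustration of the kind of model described above (a sketch only, not CallMiner's pipeline; the review snippets and labels are invented), the following trains a bag-of-words sentiment classifier, uses it predictively on a new review, and then uses it descriptively by listing the terms with the largest learned weights:

```python
# Toy review-sentiment model, used both predictively and descriptively.
# All data below is fabricated for illustration.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = [
    "quick service and friendly staff",
    "the food came out fast and hot",
    "slow service and cold food",
    "waited an hour, rude waiter",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)
model = LogisticRegression().fit(X, labels)

# Predictive use: score a new, unlabeled review.
print(model.predict(vectorizer.transform(["quick service but cold food"])))

# Descriptive use: which terms push a review toward negative or positive?
terms = vectorizer.get_feature_names_out()
order = np.argsort(model.coef_[0])
print("most negative terms:", [terms[i] for i in order[:3]])
print("most positive terms:", [terms[i] for i in order[-3:]])
```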
When poor performance occurs, someone might blame the data. Maybe there's not enough data to differentiate between positive and negative. Maybe there should be a category for neutral sentiment too. Perhaps data instances were labeled incorrectly, skewing the classifier in the wrong direction. There are plenty of ways the data can be incomplete, inconsistent, or inaccurate. Dirty data affects model performance, but model performance should not be the sole indicator of success. A model with a 20% accuracy score should not be put into production. A model that makes racist decisions 80% of the time should not either.
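To make the point concrete that an aggregate score can hide unacceptable behavior, here is a small hypothetical check (the labels, predictions, and group tags are fabricated): a model can report a respectable overall accuracy while failing badly for one group.

```python
# Hypothetical example: overall accuracy can mask group-level disparities.
from collections import defaultdict

y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 1, 0, 0]
group  = ["A"] * 6 + ["B"] * 4

correct, total = defaultdict(int), defaultdict(int)
for truth, pred, g in zip(y_true, y_pred, group):
    total[g] += 1
    correct[g] += int(truth == pred)

overall = sum(correct.values()) / sum(total.values())
print(f"overall accuracy: {overall:.0%}")            # 80% overall
for g in sorted(total):
    print(f"group {g} accuracy: {correct[g] / total[g]:.0%}")  # 100% vs. 50%
```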
That is to say, models and data can be ethically dirty too. A model trained on unethical data may carry harmful notions about race, gender, etc. despite performing well on a test set. When a model captures unethical bias from the training data, it's easy to accidentally perpetuate harmful stereotypes. As practitioners, we can do more to protect marginalized groups. It's not enough to simply blame the data for an unethical model. Data scientists usually carry no intention of building an unethical model; the bias exists in the training data, so the model captures that pattern. For example, when selecting features to build a model, someone might include zip code. They know that a person's residen