Chinese Word Segmentation via BiLSTM+Semi-CRF with Relay Node


Nuo Qun1,2,#, Hang Yan1,2,#, Xi-Peng Qiu1,2,∗, Member, CCF, and Xuan-Jing Huang1,2, Member, CCF

1 School of Computer Science, Fudan University, Shanghai 200433, China
2 Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai 200433, China

E-mail: {14110240023, 11300720199, xpqiu, xjhuang}@fudan.edu.cn

Received March 23, 2019; revised July 4, 2019.

Abstract Semi-Markov conditional random fields (Semi-CRFs) have been successfully applied to many segmentation problems, including Chinese word segmentation (CWS). The advantage of Semi-CRF lies in its inherent ability to exploit properties of segments instead of individual elements of sequences. Despite this theoretical advantage, Semi-CRF is still not the best choice for CWS because its computational complexity is quadratic in the sentence length. In this paper, we propose a simple yet effective framework that helps Semi-CRF achieve performance comparable to that of CRF-based models at similar computational complexity. Specifically, we first apply a bi-directional long short-term memory (BiLSTM) network at the character level to model context information, and then use a simple but effective fusion layer to represent segment information. In addition, to model arbitrarily long segments within linear time complexity, we propose a new model named Semi-CRFRelay. Modeling segments directly makes it easy to combine the model with word features, and CWS performance can be improved merely by adding publicly available pre-trained word embeddings. Experiments on four popular CWS datasets show the effectiveness of our proposed methods. The source code and pre-trained embeddings for this paper are available at https://github.com/fastnlp/fastNLP/.

Keywords Semi-Markov conditional random field (Semi-CRF), Chinese word segmentation, bi-directional long short-term memory, deep learning

1 Introduction

The lack of explicit boundaries between Chinese words makes Chinese word segmentation (CWS) an important preliminary step for Chinese natural language processing (NLP). Currently, a popular framework treats CWS as a sequence labeling problem [1], in which each character is assigned a segmentation tag indicating its relative position within a word. Conditional random fields (CRFs) [2] are therefore widely used to predict the sequence of segmentation tags. Recently, various neural models [3–6] have used neural networks to learn features automatically, which reduces the effort spent on feature engineering. However, it is difficult for these character-based methods to exploit word-level features. Some researchers have made great efforts to incorporate word-level information into CWS [7–11]. Among these approaches, the Semi-Markov conditional random field (Semi-CRF) [12] is a particularly promising model for finding the best segmentation. A Semi-CRF directly scores entire candidate segmentations and can fully exploit both character-level and word-level information. [13] argues that any global feature functions