On semi-supervised learning



A. Cholaquidis¹ · R. Fraiman¹ · M. Sued²

Received: 8 January 2019 / Accepted: 10 November 2019
© Sociedad de Estadística e Investigación Operativa 2019

Abstract
Major efforts have been made, mostly in the machine learning literature, to construct good predictors by combining unlabelled and labelled data. These methods are known as semi-supervised. They address the problem of how to take advantage, when possible, of a huge amount of unlabelled data to perform classification in situations where few labelled data are available. This is not always feasible: it depends on whether the labels can be inferred from the distribution of the unlabelled data. Nevertheless, several algorithms have been proposed recently. In this work, we present a new method that, under almost necessary conditions, asymptotically attains the performance of the best theoretical rule as the size of the unlabelled sample goes to infinity, even if the size of the labelled sample remains fixed. Its performance and computational time are assessed through simulations and on the well-known "Isolet" phoneme data set, where a strong dependence on the choice of the initial training sample is shown. The main focus of this work is to elucidate when and why semi-supervised learning works in the asymptotic regime described above. The set of necessary assumptions, although reasonable, shows that semi-supervised methods only attain consistency for very well-conditioned problems.

Keywords Semi-supervised learning · Small training sample · Consistency

Mathematics Subject Classification 62G08 · 68T05 · 68Q32
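As an illustration of the general semi-supervised idea of combining a few labelled points with many unlabelled ones, the following is a minimal self-training sketch in Python. It is a generic scheme, not the method proposed in this paper; the function name, the base classifier and the confidence threshold are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=20):
    # Start from the (small) labelled sample.
    X, y = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    clf = LogisticRegression()
    for _ in range(max_rounds):
        clf.fit(X, y)
        if len(pool) == 0:
            break
        proba = clf.predict_proba(pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no unlabelled point is predicted confidently enough
        # Adopt the classifier's own confident predictions as pseudo-labels
        # and move those points into the training set.
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, clf.predict(pool[confident])])
        pool = pool[~confident]
    return clf
```

Schemes of this kind succeed only when the labels really are recoverable from the unlabelled distribution, which is exactly the feasibility caveat raised in the abstract: otherwise, confident early mistakes get frozen into the training set.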

A. Cholaquidis (corresponding author)
[email protected]; [email protected]

R. Fraiman
[email protected]

M. Sued
[email protected]

1 Facultad de Ciencias, Universidad de la República, Montevideo, Uruguay

2 Instituto de Cálculo, Facultad de Ciencias Exactas y Naturales, Buenos Aires, Argentina


1 Introduction

Semi-supervised learning (SSL) dates back to the 1960s, starting with the pioneering works of Scudder (1965), Fralick (1967) and Agrawala (1970), among others. Later on, the problem was addressed in the highly influential works of Castelli and Cover (1995, 1996). The first shows that, when the size l of the unlabelled sample is infinite, the classification error converges exponentially fast to the Bayes risk as the size n of the labelled sample converges to infinity. In the second, the density of the covariates is assumed to follow the parametric mixture model

p(x) = π p(x | y = θ) + (1 − π) p(x | y = 1 − θ),

where p(x) is known except for the parameters θ ∈ {0, 1} and π ∈ (0, 1). Under regularity conditions, consistency is shown when the minimum of n and l converges to infinity; a numerical sketch of this setting is given at the end of this section.

Recently, SSL has gained paramount importance due to the huge amount of data coming from diverse sources, such as the internet, genomic research, text classification and many others; see, for instance, Zhu (2008) or Chapelle et al. (2006) for a survey on SSL. This large amount of data is typically unlabelled; th
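To make the Castelli and Cover (1996) setting above concrete, the following is a minimal numerical sketch in Python. It assumes Gaussian components; the parameter values, names and the use of scikit-learn's GaussianMixture are illustrative choices, not taken from the paper. The large unlabelled sample identifies the two mixture components, and a handful of labelled points then resolves the label-assignment parameter θ.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
pi, mu0, mu1 = 0.4, -2.0, 2.0   # illustrative, well-separated mixture

def sample(n):
    # Draw (x, y) from the mixture: P(y = 1) = pi, x | y ~ N(mu_y, 1).
    y = (rng.random(n) < pi).astype(int)
    x = rng.normal(np.where(y == 1, mu1, mu0), 1.0)
    return x.reshape(-1, 1), y

X_unlab, _ = sample(10_000)   # huge unlabelled sample
X_lab, y_lab = sample(5)      # tiny, fixed labelled sample

# Step 1: the unlabelled data alone identify the two components
# (up to swapping, i.e. up to the unknown theta).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_unlab)

# Step 2: the few labels resolve theta, i.e. which fitted component
# carries which class, by majority vote (ties broken arbitrarily).
comp = gmm.predict(X_lab)
label_of_comp = np.array([
    int(np.mean(y_lab[comp == k]) >= 0.5) if np.any(comp == k) else 0
    for k in (0, 1)
])

# Plug-in classification rule and its empirical test error.
X_test, y_test = sample(5_000)
y_hat = label_of_comp[gmm.predict(X_test)]
print("plug-in test error:", np.mean(y_hat != y_test))
```

Because the unlabelled sample essentially pins down π and the component densities, the only remaining uncertainty is the binary parameter θ; this is why a fixed, small labelled sample can suffice in this regime, and why consistency breaks down when the components are not identifiable from the unlabelled distribution alone.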