Sample size determination for biomedical big data with limited labels



(2020) 9:12

ORIGINAL ARTICLE

Aaron N. Richter1 · Taghi M. Khoshgoftaar1

Received: 1 September 2019 / Accepted: 2 January 2020
© Springer-Verlag GmbH Austria, part of Springer Nature 2020

Abstract

The era of big data has produced vast amounts of information that can be used to build machine learning models. In many cases, however, there is a point where adding more data only marginally increases model performance. This is especially important in scenarios with limited labeled data, as annotation can be expensive and time consuming. If the sample size required for accurate model performance can be determined early, then resources can be allocated appropriately to minimize time and cost. In this study, we explore sample size determination methods for four real-world biomedical datasets, spanning genomics, proteomics, electronic health records, and insurance claims data, all with millions of instances each

120 positive instances. Table 4 shows the fit schedules for the best-fitting curves for each dataset (according to MAE). Note that we can only tell which curves fit best because we know the actual test curve. In a real scenario of sample size determination with limited labels, the actual learning curve would be unknown, so we cannot tell which fit schedule produces the correct approximation. Future work is needed to develop methods that can evaluate the accuracy of these curves even when the actual learning curve is not known.
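To make the evaluation concrete, a learning-curve fit such as the inverse power law discussed in this study can be fitted to early points of a learning curve and scored by MAE against later, held-out points. The sketch below is illustrative only: it assumes the common parameterization y = a − b·x⁻ᶜ, uses a simple grid search over c rather than the paper's fit schedules, and the AUC values are entirely hypothetical.

```python
import numpy as np

def fit_inverse_power_law(sizes, scores, c_grid=np.linspace(0.1, 2.0, 191)):
    """Fit y = a - b * x**(-c) by grid-searching c; for each fixed c,
    the model is linear in (a, b) and solved by least squares."""
    best = None
    for c in c_grid:
        X = np.column_stack([np.ones(len(sizes)), -np.power(sizes, -c)])
        coef, *_ = np.linalg.lstsq(X, scores, rcond=None)
        sse = float(np.sum((X @ coef - scores) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], c)
    _, a, b, c = best
    return a, b, c

def predict(x, a, b, c):
    return a - b * np.power(x, -c)

# hypothetical early learning-curve points: (positive-instance count, AUC)
sizes = np.array([100.0, 200, 400, 800, 1600])
aucs = np.array([0.62, 0.68, 0.72, 0.745, 0.76])
a, b, c = fit_inverse_power_law(sizes, aucs)

# judge the extrapolation by mean absolute error on later (held-out) points
test_sizes = np.array([3200.0, 6400])
test_aucs = np.array([0.768, 0.772])
mae = float(np.mean(np.abs(predict(test_sizes, a, b, c) - test_aucs)))
```

As the text notes, this MAE can only be computed because the later points of the curve are known; with truly limited labels, no such held-out points exist.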

5.3 Semi-supervised method

Even if the absolute values of AUC are overly optimistic with the semi-supervised method, we believe that the most important point in a learning curve for sample size determination is the point of convergence, as that is what will be used to decide how much data to label for an experiment. The pointwise slopes as calculated by LRLS for the Splice data are shown in Fig. 5, with a sample convergence point at a slope < 0.0001. This means that each addition of 100 positive (31,808 total) instances increases the AUC of the model by only 0.0001. The actual data hit the convergence point at 1300 instances, while the semi-supervised curve hits it at 1000, and the inverse power law at 700. Therefore, if the point of convergence is the key consideration, the semi-supervised method is more accurate.

Most learning curves exhibit a trend where the initial portion of the curve shows exponentially increasing performance, followed by a period of gradual increase, then a plateau. We visually identified these portions in the actual learning curve and then compared the slopes for each portion (Table 5). The Exp region spans 10 to 100 positive instances, the Increase region spans 100 to 2500, and the Plateau is >2500. For an approximation method, the most important part of the curve is the Increase region, as the end of that region indicates the start of the Plateau, or point of diminishing returns. While the inverse power law method achieves nearly the exact slope
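The convergence criterion used above — a pointwise slope falling below 0.0001 per 100 added positive instances — can be sketched as follows. This is only the thresholding step, not LRLS itself, and the learning curve here is synthetic; the step size and tolerance match the values quoted in the text.

```python
import numpy as np

def convergence_point(sizes, scores, step=100, tol=1e-4):
    """Return the first sample size at which adding `step` more instances
    raises the score by less than `tol` (pointwise-slope convergence)."""
    gains = np.diff(scores) / (np.diff(sizes) / step)  # gain per `step` instances
    for size, gain in zip(sizes[1:], gains):
        if gain < tol:
            return int(size)
    return None  # never converged within the sampled range

# synthetic curve sampled every 100 positive instances, rising toward 0.78 AUC
sizes = np.arange(100, 2001, 100)
scores = 0.78 - 0.5 / sizes
print(convergence_point(sizes, scores))  # -> 800
```

Tightening `tol` pushes the reported convergence point further right, which is why the choice of slope threshold directly drives how much labeling a sample size estimate will recommend.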