Bidirectional long short-term memory for surgical skill classification of temporally segmented tasks



ORIGINAL ARTICLE

Jason D. Kelly1 · Ashley Petersen2 · Thomas S. Lendvay3 · Timothy M. Kowalewski1

Received: 13 March 2020 / Accepted: 23 September 2020 © CARS 2020

Abstract

Purpose: Historical surgical skill research has typically analyzed holistic, task-level summary metrics to produce a skill classification for a performance. Recent advances in machine learning allow time-series classification at the sub-task level, enabling predictions on segments of tasks, which could improve task-level technical skill assessment.

Methods: A bidirectional long short-term memory (LSTM) network was used with 8-s windows of multidimensional time-series data from the Basic Laparoscopic Urologic Skills dataset. The network was trained on experts and novices from four common surgical tasks. Stratified cross-validation with regularization was used to avoid overfitting. Misclassified cases were re-submitted for surgical technical skill assessment to crowds via Amazon Mechanical Turk to re-evaluate them and to analyze the level of agreement with the previous scores.

Results: Performance was best for the suturing task, with 96.88% accuracy (one misclassification) at predicting whether a performance was expert or novice, compared to previously obtained crowd evaluations. When compared with expert surgeon ratings, the LSTM predictions yielded a Spearman coefficient of 0.89 for suturing tasks. When crowds re-evaluated misclassified performances, for all five misclassified cases from the peg-transfer and suturing tasks, the crowds agreed more with our LSTM model than with the previously obtained crowd scores.

Conclusion: The technique presented shows results comparable with labels that would be obtained from crowdsourced labeling of surgical tasks. However, these results raise questions about the reliability of crowdsourced labels for videos of surgical tasks. We, as a research community, should examine crowd labeling with higher scrutiny, systematically look at biases, and quantify label noise.

Keywords: Surgical skill · Crowdsourcing · Bidirectional LSTM · Surgical technical skill · Machine learning
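The Methods describe slicing each multidimensional tool-motion recording into 8-s windows, classifying each window, and arriving at a task-level expert/novice call. The sketch below illustrates only the windowing and aggregation step; the sampling rate (30 Hz), stride, function names, and mean-probability aggregation are illustrative assumptions not stated in the abstract.

```python
import numpy as np

def segment_windows(signal, fs=30, window_s=8, stride_s=4):
    """Slice a multidimensional time series of shape (T, D) into
    fixed-length windows of window_s seconds.

    fs (sampling rate) and stride_s (overlap) are assumptions for
    illustration; the paper's exact values are not in the abstract.
    Returns an array of shape (num_windows, window_s * fs, D).
    """
    win = int(window_s * fs)
    hop = int(stride_s * fs)
    starts = range(0, signal.shape[0] - win + 1, hop)
    return np.stack([signal[s:s + win] for s in starts])

def aggregate_task_label(window_probs, threshold=0.5):
    """Combine per-window 'expert' probabilities (e.g., from a
    bidirectional LSTM) into one task-level label by averaging —
    one simple aggregation choice among several possible."""
    return "expert" if float(np.mean(window_probs)) >= threshold else "novice"

# Example: 60 s of hypothetical 6-channel tool-motion data at 30 Hz
x = np.random.randn(60 * 30, 6)
windows = segment_windows(x)
print(windows.shape)  # (14, 240, 6): 14 overlapping 8-s windows
```

Each window would then be fed to the sequence classifier, and the per-window outputs combined as above to score the whole task performance.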

Corresponding author: Jason D. Kelly, [email protected]

1 Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN, USA
2 Division of Biostatistics, University of Minnesota, Minneapolis, MN, USA
3 Department of Urology, Seattle Children's Hospital, Seattle, WA, USA

Introduction

Computationally assessing the skill of a surgeon in an objective manner using tool motion has proven a complex problem with many challenges. Previous research has relied mostly on summary performance metrics from kinematic data [1–3]. Unfortunately, these metrics typically failed to completely discriminate novices from experts, that is, to never misclassify "obvious" novices vs. "obvious" experts—the so-called minimally acceptable classifier (MAC) criterion [4]. Recent advances in machine learning techniques