Comparison of Image-Based and Text-Based Source Code Classification Using Deep Learning


ORIGINAL RESEARCH

Elife Ozturk Kiyak¹ · Ayse Betul Cengiz¹ · Kokten Ulas Birant² · Derya Birant²

Received: 31 March 2020 / Accepted: 30 July 2020
© Springer Nature Singapore Pte Ltd 2020

Abstract
Source code classification (SCC) is the task of assigning source code to categories according to a criterion such as functionality, programming language, or vulnerability. Many source code archives are organized by programming language, so that desired code fragments can be found easily by searching within the archive. However, having field experts organize source code archives manually is labor-intensive and impractical because the amount of available source code is growing rapidly. Therefore, this study proposes new convolutional neural network (CNN) architectures for building source code classifiers that automatically identify the programming language of source code. This is the first study to compare the performance of deep learning algorithms on programming language identification using both image and text files. The experiments are performed on three source code datasets to identify eight programming languages: C, C++, C#, Go, Python, Ruby, Rust, and Java. The comparative results indicate that, although the text-based and image-based SCC approaches achieve similar and very high accuracies (> 93.5%), text-based classification performs significantly better in terms of execution time.

Keywords  Source code classification · Software engineering · Programming languages · Deep learning · Image classification · Text mining

This article is part of the topical collection “Deep learning approaches for data analysis: A practical perspective” guest edited by D. Jude Hemanth, Lipo Wang and Anastasia Angelopoulou.

* Derya Birant [email protected]
Elife Ozturk Kiyak [email protected]
Ayse Betul Cengiz [email protected]
Kokten Ulas Birant [email protected]

¹ The Graduate School of Natural and Applied Sciences, Dokuz Eylul University, 35390 Izmir, Turkey
² Department of Computer Engineering, Dokuz Eylul University, 35390 Izmir, Turkey

Introduction
Many programming languages, such as C, C++, C#, Java, and Python, have been developed and used in software engineering projects. Source code written in these languages is continuously pushed to active repositories such as GitHub, SourceForge, and Bitbucket. As open-source programming environments have grown in recent years, so has the number of users who benefit from them. Users can add their own code written in different programming languages, or access existing code and modify it. The use of online coding platforms such as CodeForces and Google Colab has also increased significantly. Thus, a substantial volume of source code has become accessible on many online platforms. In addition to th