Software development is an important industry in both Singapore and China. SingStat reported that in 2018 the Information and Communications Services (ICS) industry contributed 20.04 billion SGD in value added, with Computer Programming and Consultancy (CPC) being the largest segment. In China, the Ministry of Industry and Information Technology reported that in 2018 the Chinese ICT industry contributed 896.2 billion RMB in profits and created 270 thousand new jobs. Given the large scale and rapid growth of the ICT industry, there is a continuous challenge to develop and evolve software at a fast pace, with as few reliability and security issues as possible, and at a lower cost, especially in the face of a tight labour market characterized by more job openings than available talent. A recent study shows that Software Engineer is the job role with the largest gap between demand and supply, with several related roles (software architect, full-stack developer, and system engineer) also appearing in the top 10.
To address the demand for more high-quality software amid a shortage of new talent, software analytics research has come to the fore. Research in software analytics seeks to automate software engineering tasks by developing novel, customized data science solutions and applying them to developer activity data. As in other areas, the big data phenomenon is also observed in software development: software practitioners today produce a plethora of data stored in publicly accessible repositories, e.g., millions of repositories on GitHub with billions of individual commits. This passive data (aka “Big Code”) can potentially be analyzed with novel data science methods to build automated tools that help developers in their day-to-day activities. Unfortunately, existing software analytics solutions have primarily been trained and applied on data of relatively small scale, e.g., data from a few or at most dozens of projects. This project will address this limitation and revolutionize existing software analytics work by unlocking the power of big code for automating three common software development tasks: coding, commenting, and debugging.
The researchers from both sides will work together towards the following objectives:
1) Code auto-completion: Code auto-completion assists developers by suggesting APIs to use, effectively improving developer productivity. We will develop a data science-powered code auto-completion system.
2) Code summarization: Commenting a piece of code with a suitable natural language description is a software development best practice. Unfortunately, many projects today are developed and maintained with minimal or no comments. We will build a data science-powered solution to automatically summarize a piece of code in natural language.
3) Just-in-time defect prediction: Software systems today are typically built piece by piece, with developers pushing commits to a version control system. We will build a data science-powered solution to identify defective commits.
The proposed research scope aligns with the priority of improving productivity to meet the growing needs of the Information and Communications Services industry in both Singapore and China.
Project Title: Making Big Code Active: From Billions of Code Tokens to Automation
Host University: Singapore Management University
Principal Investigator: Dr. David Lo, Associate Professor, School of Information Systems, Singapore Management University
Co-Investigator: Dr. Lingxiao Jiang, Associate Professor, School of Information Systems, Singapore Management University
Collaborators: Prof. Xiaohu Yang, Professor, College of Computer Science and Technology, Zhejiang University; Prof. Jianling Sun, Professor, College of Computer Science and Technology, Zhejiang University; Prof. Xinyu Wang, Professor, College of Computer Science and Technology, Zhejiang University