A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond

1Shanghai AI Lab, 2The University of Hong Kong, 3East China Normal University,
4Fudan University, 5NetEase Fuxi AI Lab, 6Google DeepMind, 7A*STAR

Figure 1: The evolution of neural code intelligence, traced through the development of language models for code and delineated by four principal branches, each representing a distinct category of models.

Abstract: Neural Code Intelligence -- leveraging deep learning to understand, generate, and optimize code -- holds immense potential for transformative societal impact. Bridging the gap between Natural Language and Programming Language, this domain has drawn significant attention from researchers in both communities over the past few years. This survey presents a systematic and chronological review of advancements in code intelligence, encompassing over 50 representative models and their variants, more than 20 categories of tasks, and extensive coverage of over 680 related works. We follow the historical progression to trace the paradigm shifts across different research phases (e.g., from modeling code with recurrent neural networks to the era of Large Language Models), and we highlight the major technical transitions in models, tasks, and evaluations across these stages. For applications, we observe a co-evolving shift: from initial efforts on specific scenarios, through the exploration of a diverse array of tasks during a period of rapid expansion, to the current focus on increasingly complex and varied real-world challenges. Building on our examination of these developmental trajectories, we further investigate the emerging synergies between code intelligence and broader machine intelligence, uncovering new cross-domain opportunities and illustrating the substantial influence of code intelligence across various domains. Finally, we delve into the opportunities and challenges of this field, alongside our insights on the most promising research directions.

Overview of Representative Works


Figure 2: A chronological overview of representative works in neural code intelligence over recent years.

Paradigm Shifts


Figure 3: Schematic illustration of different paradigms of applying language models for code to downstream applications.
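To make the paradigms in Figure 3 concrete, below is a minimal sketch of the prompting paradigm, assuming the Hugging Face transformers library and an open Code LLM checkpoint (bigcode/starcoder2-3b is used purely as an illustrative example); it is not tied to any particular model covered in the survey.

# Minimal sketch (assumption: `transformers` is installed; any open Code LLM checkpoint works).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # example checkpoint, not prescribed by the survey
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Prompting paradigm: no task-specific fine-tuning; the model completes the code directly.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

By contrast, earlier code pre-trained models (Table 2) are typically adapted by fine-tuning on labeled data for each downstream task rather than being prompted in this zero-shot fashion.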

Representative Benchmarks


Table 1: Representative benchmarks for different types of code-related downstream tasks, including the number of programming languages they cover and brief descriptions.

Code Pre-trained Models


Table 2: An overview of Code Pre-trained Models' architectures and pre-training strategies, along with whether these models leverage code structure information during the pre-training phase.

Code LLMs


Table 3: An overview of Code LLMs categorized by architecture, along with their parameter sizes, base models (if any), vocabulary sizes, context lengths, training objectives, and training data scale (measured in number of tokens as K/B/T, or in GB of disk size).

BibTeX

@misc{sun2024ncisurvey,
  title         = {A Survey of Neural Code Intelligence: Paradigms, Advances and Beyond},
  author        = {Qiushi Sun and Zhirui Chen and Fangzhi Xu and Kanzhi Cheng and Chang Ma and
                   Zhangyue Yin and Jianing Wang and Chengcheng Han and Renyu Zhu and Shuai Yuan and
                   Qipeng Guo and Xipeng Qiu and Pengcheng Yin and Xiaoli Li and Fei Yuan and
                   Lingpeng Kong and Xiang Li and Zhiyong Wu},
  eprint        = {2403.14734},
  archivePrefix = {arXiv},
  year          = {2024}
}