Open Access Article
Enhancing Python Code Embeddings: Fusion of Code2vec with Large Language Models
by Long H. Ngo and Jonathan Rivalan
Smile France, Asnières-sur-Seine, 92600, France
Journal of Engineering Research and Sciences, Volume 4, Issue 1, Pages 1–7, 2025; DOI: 10.55708/js0401001
Keywords: Machine learning, Neural network, Large Language Model, Distributed representations, Code search
Received: 30 October 2024, Revised: 14 December 2024, Accepted: 15 December 2024, Published Online: 19 January 2025
(This article belongs to the Special Issue on Multidisciplinary Sciences and Advanced Technology 2024)
APA Style
Ngo, L. H., & Rivalan, J. (2025). Enhancing Python code embeddings: Fusion of code2vec with large language models. Journal of Engineering Research and Sciences, 4(1), 1–7. https://doi.org/10.55708/js0401001
Chicago/Turabian Style
Ngo, Long H., and Jonathan Rivalan. “Enhancing Python Code Embeddings: Fusion of Code2vec with Large Language Models.” Journal of Engineering Research and Sciences 4, no. 1 (2025): 1–7. https://doi.org/10.55708/js0401001.
IEEE Style
L. H. Ngo and J. Rivalan, “Enhancing Python code embeddings: Fusion of code2vec with large language models,” Journal of Engineering Research and Sciences, vol. 4, no. 1, pp. 1–7, 2025, doi: 10.55708/js0401001.
Automated code comprehension has recently become integral to software development. Neural networks, widely employed in natural language processing tasks, can capture the semantic meaning of language by representing it in vector form. Although programming code differs from natural language, we hypothesize that neural models can learn both the syntactic and semantic attributes inherent in code. This study presents an innovative approach to improving code representation and understanding for Python, building upon previous work (code2vec extended with ASTminer). The proposed method integrates embeddings from Large Language Models (LLMs) with code2vec vectors, aiming to align semantic and syntactic information in code representations. We explore several fusion techniques, including simple concatenation, a weighted sum, and an attention-based mechanism, to combine LLM embeddings with code2vec vectors. We examine how the semantic information from LLMs complements the structural information from code2vec, and discuss the potential impact of this synergy on software development practices. These findings open new directions for more accurate and adaptable code understanding models, with implications for improving documentation, code search, and overall software development efficiency.
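As a hedged illustration of the three fusion strategies named in the abstract, the sketch below shows simple concatenation, a learned weighted sum, and an attention-based mechanism in PyTorch. It is not the authors' implementation; the embedding dimensions (384 for code2vec, 768 for the LLM), the projection layers, and the mean-pooling step are assumptions made for the example.

```python
import torch
import torch.nn as nn

# Assumed dimensions (not from the paper): code2vec = 384-d, LLM = 768-d.
C2V_DIM, LLM_DIM, OUT_DIM = 384, 768, 512

def concat_fusion(c2v_vec, llm_vec):
    """Simplest strategy: concatenate the two embeddings along the feature axis."""
    return torch.cat([c2v_vec, llm_vec], dim=-1)  # (B, 384 + 768)

class WeightedSumFusion(nn.Module):
    """Project both embeddings to a shared size, then blend with a learned weight."""
    def __init__(self):
        super().__init__()
        self.proj_c2v = nn.Linear(C2V_DIM, OUT_DIM)
        self.proj_llm = nn.Linear(LLM_DIM, OUT_DIM)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learned mixing coefficient

    def forward(self, c2v_vec, llm_vec):
        return self.alpha * self.proj_c2v(c2v_vec) + (1 - self.alpha) * self.proj_llm(llm_vec)

class AttentionFusion(nn.Module):
    """Treat the two projected embeddings as a 2-token sequence and let
    self-attention decide how much each source contributes."""
    def __init__(self):
        super().__init__()
        self.proj_c2v = nn.Linear(C2V_DIM, OUT_DIM)
        self.proj_llm = nn.Linear(LLM_DIM, OUT_DIM)
        self.attn = nn.MultiheadAttention(OUT_DIM, num_heads=4, batch_first=True)

    def forward(self, c2v_vec, llm_vec):
        # Stack the two sources as a length-2 "sequence": (B, 2, D)
        tokens = torch.stack([self.proj_c2v(c2v_vec), self.proj_llm(llm_vec)], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        return fused.mean(dim=1)  # pool the two attended tokens into one vector

# Example: fuse a batch of 8 method-level embeddings.
c2v = torch.randn(8, C2V_DIM)
llm = torch.randn(8, LLM_DIM)
print(concat_fusion(c2v, llm).shape)        # torch.Size([8, 1152])
print(WeightedSumFusion()(c2v, llm).shape)  # torch.Size([8, 512])
print(AttentionFusion()(c2v, llm).shape)    # torch.Size([8, 512])
```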
- L. H. Ngo, V. Sekar, E. Leclercq, J. Rivalan, “Exploring code2vec and ASTminer for Python code embeddings”, “2023 IEEE 3rd International Conference on Software Engineering and Artificial Intelligence (SEAI)”, pp. 53–57, IEEE, 2023, doi:10.1109/SEAI59139.2023.10217505.
- V. Kovalenko, E. Bogomolov, T. Bryksin, A. Bacchelli, “PathMiner: a library for mining of path-based representations of code”, “Proceedings of the 16th International Conference on Mining Software Repositories”, pp. 13–17, IEEE Press, 2019, doi:10.1109/MSR.2019.00015.
- U. Alon, M. Zilberstein, O. Levy, E. Yahav, “code2vec: Learning distributed representations of code”, Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019, doi:10.1145/3290353.
- S. Hu, Y. Zuo, L. Wang, P. Liu, “A review about building hidden layer methods of deep learning”, Journal of Advances in Information Technology, vol. 7, no. 1, 2016, doi:10.12720/jait.7.1.58-63.
- Y. Sakai, Y. Eto, Y. Teranishi, “Structured pruning for deep neural networks with adaptive pruning rate derivation based on connection sensitivity and loss function”, Journal of Advances in Information Technology, vol. 13, no. 1, pp. 1–7, 2022, doi:10.12720/jait.13.1.1-7.
- M. Allamanis, H. Peng, C. Sutton, “A convolutional attention network for extreme summarization of source code”, “International conference on machine learning”, pp. 2091–2100, PMLR, 2016, doi:10.48550/arXiv.1602.03001.
- M. White, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, “Toward deep learning software repositories”, “2015 IEEE/ACM 12th Working Conference on Mining Software Repositories”, pp. 334–345, IEEE, 2015, doi:10.1109/MSR.2015.33.
- M. Allamanis, E. T. Barr, C. Bird, C. Sutton, “Learning natural coding conventions”, “Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering”, pp. 281–293, 2014, doi:10.1145/2635868.2635883.
- M. Allamanis, C. Sutton, “Mining source code repositories at massive scale using language modeling”, “2013 10th working conference on mining software repositories (MSR)”, pp. 207–216, IEEE, 2013, doi:10.1109/MSR.2013.6624029.
- D. Movshovitz-Attias, W. Cohen, “Natural language models for predicting programming comments”, “Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)”, pp. 35–40, 2013, doi:10.18653/v1/P13-2007.
- U. Alon, M. Zilberstein, O. Levy, E. Yahav, “A general path-based representation for predicting program properties”, ACM SIGPLAN Notices, vol. 53, no. 4, pp. 404–419, 2018, doi:10.1145/3192366.3192412.
- P. Bielik, V. Raychev, M. Vechev, “Phog: probabilistic model for code”, “International conference on machine learning”, pp. 2933–2942, PMLR, 2016, doi:10.48550/arXiv.1602.05259.
- V. Raychev, M. Vechev, A. Krause, “Predicting program properties from ‘big code’”, ACM SIGPLAN Notices, vol. 50, no. 1, pp. 111–124, 2015, doi:10.1145/2676726.2677009.
- V. Raychev, P. Bielik, M. Vechev, “Probabilistic model for code with decision trees”, ACM SIGPLAN Notices, vol. 51, no. 10, pp. 731–747, 2016, doi:10.1145/2983990.2984041.
- M. Allamanis, E. T. Barr, C. Bird, C. Sutton, “Suggesting accurate method and class names”, “Proceedings of the 2015 10th joint meeting on foundations of software engineering”, pp. 38–49, 2015, doi:10.1145/2786805.2786849.
- V. Raychev, M. Vechev, E. Yahav, “Code completion with statistical language models”, “Proceedings of the 35th ACM SIGPLAN conference on programming language design and implementation”, pp. 419–428, 2014, doi:10.1145/2594291.2594321.
- A. Mishne, S. Shoham, E. Yahav, “Typestate-based semantic code search over partial programs”, “Proceedings of the ACM international conference on Object oriented programming systems languages and applications”, pp. 997–1016, 2012, doi:10.1145/2384616.2384698.
- M. Amodio, S. Chaudhuri, T. W. Reps, “Neural attribute machines for program generation”, arXiv preprint arXiv:1705.09231, 2017, doi:10.48550/arXiv.1705.09231.
- Y. Lu, S. Chaudhuri, C. Jermaine, D. Melski, “Data-driven program completion”, arXiv preprint arXiv:1705.09042, 2017, doi:10.48550/arXiv.1705.09042.
- C. Maddison, D. Tarlow, “Structured generative models of natural source code”, “International Conference on Machine Learning”, pp. 649–657, PMLR, 2014, doi:10.48550/arXiv.1401.0514.
- M. Allamanis, E. T. Barr, P. Devanbu, C. Sutton, “A survey of machine learning for big code and naturalness”, ACM Computing Surveys (CSUR), vol. 51, no. 4, pp. 1–37, 2018, doi:10.1145/3212695.
- M. Vechev, E. Yahav, et al., “Programming with “big code””, Foundations and Trends® in Programming Languages, vol. 3, no. 4, pp. 231–284, 2016, doi:10.1561/2500000028.
- V. Kovalenko, E. Bogomolov, T. Bryksin, A. Bacchelli, “Building implicit vector representations of individual coding style”, “Proceedings of the IEEE/ACM 42nd International Conference on Software Engineering Workshops”, pp. 117–124, 2020, doi:10.1145/3387940.3391461.
- X. Gu, H. Zhang, S. Kim, “Deep code search”, “Proceedings of the 40th International Conference on Software Engineering”, pp. 933–944, 2018, doi:10.1145/3180155.3180167.
- B. Mitra, N. Craswell, et al., “An introduction to neural information retrieval”, Foundations and Trends® in Information Retrieval, vol. 13, no. 1, pp. 1–126, 2018, doi:10.1561/1500000061.
- H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, M. Brockschmidt, “CodeSearchNet challenge: Evaluating the state of semantic code search”, arXiv preprint arXiv:1909.09436, 2019, doi:10.48550/arXiv.1909.09436.
- I. Sheikh, I. Illina, D. Fohr, G. Linares, “Learning word importance with the neural bag-of-words model”, “Proceedings of the 1st Workshop on Representation Learning for NLP”, pp. 222–229, 2016, doi:10.18653/v1/W16-1626.
- K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, “On the properties of neural machine translation: Encoder-decoder approaches”, “Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation”, 2014, doi:10.3115/v1/W14-4012.
- Y. Kim, “Convolutional neural networks for sentence classification”, arXiv preprint arXiv:1408.5882, 2014, doi:10.48550/arXiv.1408.5882.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, “Attention is all you need”, Advances in neural information processing systems, vol. 30, 2017, doi:10.48550/arXiv.1706.03762.
- B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez, et al., “Code llama: Open foundation models for code”, arXiv preprint arXiv:2308.12950, 2023, doi:10.48550/arXiv.2308.12950.
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Dang, et al., “Qwen2.5-Coder technical report”, arXiv preprint arXiv:2409.12186, 2024, doi:10.48550/arXiv.2409.12186.