Similarity Classification of C# Source Code

DOI:

https://doi.org/10.5281/zenodo.16757264

Keywords:

training, grammar, Siamese neural network, similarity, abstract syntax tree, MSC 68T05, MSC 68T07, MSC 68N19, MSC 68N15, MSC 68Q32

Abstract

Academic fraud is a recurring problem in education. Teachers contend with it constantly, in exams as well as in projects and theoretical assessments. The evaluation system of the Programming course in the Computer Science degree at the Universidad de La Habana depends largely on graded projects completed by students. Computational tools that automatically detect plagiarism in such projects would therefore be highly useful in academic settings. This research addresses the problem of detecting similarities in the source code of C# projects. The process begins by extracting abstract syntax trees (ASTs) to obtain a structural representation of the code. From the ASTs, features describing essential elements of the code are extracted and grouped into vectors that serve as input to a Siamese neural network. The network is trained as a similarity classifier capable of detecting pairs of C# programs that have been cloned.
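
The pipeline described in the abstract (AST features, shared encoder, pairwise similarity score) could be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the node-kind vocabulary `NODE_KINDS`, the `feature_vector` helper, the `SiameseScorer` class, the layer sizes, and the untrained placeholder weights are all assumptions made for the example, and the AST is assumed to be already flattened into a list of node-kind names.

```python
import math
import random
from collections import Counter

# Hypothetical vocabulary of AST node kinds (an assumption for this sketch;
# the feature set actually used in the paper is not specified here).
NODE_KINDS = [
    "MethodDeclaration", "IfStatement", "ForStatement",
    "WhileStatement", "InvocationExpression", "Assignment",
]


def feature_vector(node_kinds):
    """Count how often each known node kind occurs in a flattened AST."""
    counts = Counter(node_kinds)
    return [float(counts[kind]) for kind in NODE_KINDS]


class SiameseScorer:
    """Two inputs pass through the SAME encoder; the score depends on the
    L1 distance between the two encodings. Weights are untrained placeholders."""

    def __init__(self, in_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        # Shared encoder weights: one dense layer followed by ReLU.
        self.w = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                  for _ in range(hidden_dim)]
        self.b = [0.0] * hidden_dim
        # Output layer over the component-wise distances (placeholder values).
        self.out_w = [1.0] * hidden_dim
        self.out_b = 4.0

    def encode(self, x):
        """Apply the shared dense + ReLU layer to one feature vector."""
        return [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + bias)
                for row, bias in zip(self.w, self.b)]

    def similarity(self, a, b):
        """Return a score in (0, 1); larger means the pair looks more alike."""
        ea, eb = self.encode(a), self.encode(b)
        l1 = [abs(p - q) for p, q in zip(ea, eb)]
        z = self.out_b - sum(w * d for w, d in zip(self.out_w, l1))
        return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    original = feature_vector(
        ["MethodDeclaration", "IfStatement", "ForStatement", "InvocationExpression"])
    other = feature_vector(
        ["MethodDeclaration", "WhileStatement", "Assignment", "Assignment", "Assignment"])
    model = SiameseScorer(len(NODE_KINDS), hidden_dim=4)
    print("identical pair:", model.similarity(original, original))
    print("different pair:", model.similarity(original, other))
```

Because both inputs go through the same encoder and the distance is symmetric, the score does not depend on the order of the pair. In a trained model, the encoder and output weights would be learned from labeled clone and non-clone pairs rather than fixed as they are here.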

Published

2025-08-06 (updated 2025-09-23)

How to cite

[1]
Choy Fernández, M. de L. et al. 2025. Clasificación de similitud entre códigos de C#. Ciencias matemáticas. 38, 2 (Sep. 2025), 47–56. DOI: https://doi.org/10.5281/zenodo.16757264.

Section

Original Article