Engineering Trustworthy Software with Large Language Models: A Hybrid Framework for Automated Testing, Repair, and Reliability Assurance

Authors

  • Faris Sattar Hadi, Information Technology Research and Development Center, University of Kufa, Iraq

Keywords

Large Language Models, Trustworthy Software Engineering, Automated Software Testing, Program Repair, Software Reliability

Abstract

Large Language Models (LLMs) have recently become powerful enablers of automation in
software engineering, owing to their strong capabilities in code synthesis, automatic
programming, and program repair. Despite these advances, the probabilistic nature of LLMs
raises substantial concerns about software correctness, reliability, and long-term
trustworthiness, particularly when such models are deployed without engineering supervision
or systematic validation. Existing research mainly investigates individual LLM-assisted tasks
and frequently assumes task-level performance, leaving a gap in reliability-centered,
system-level engineering. This work proposes a hybrid framework for producing trustworthy
software with LLMs by combining LLM-based automation with traditional software testing,
end-to-end program repair, and reliability assurance. To systematically control the use of
LLMs throughout the software development life cycle, the framework provides a deterministic
validation pipeline, reliability-aware evaluation metrics, and feedback-based adaptation
loops. We conducted an extensive empirical evaluation comparing the proposed framework
against: (i) baseline tools from existing work; (ii) traditional automated repair techniques;
and (iii) unconstrained LLM-based approaches. The results indicate that the hybrid approach
significantly improves test coverage and fault discovery, yields a higher proportion of
semantically correct patches with substantially reduced overfitting, and improves
maintainability and fault recurrence, leading to better long-term software reliability.
These findings suggest that purely LLM-driven workflows do not by themselves guarantee
robust AI-assisted software engineering; rather, hybrid approaches that combine LLM
automation with deterministic software engineering practice are required. We support this
observation both theoretically and empirically in the context of Large Language Models, to
the extent possible for robust software engineering at scale.
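The core idea of the abstract's deterministic validation pipeline with feedback-based adaptation can be sketched as a simple gate: an LLM-proposed patch is accepted only if it passes a deterministic test suite, and failures are collected as repair hints for the next candidate. This is a minimal illustrative sketch; every name here (`validate_patch`, `repair_loop`, the toy candidates) is hypothetical and not the paper's actual implementation.

```python
from typing import Callable, List, Tuple

def validate_patch(patch: Callable[[int], int],
                   tests: List[Tuple[int, int]]) -> List[str]:
    """Deterministic validation gate: run the test suite and
    return a human-readable message for every failing case."""
    failures = []
    for arg, expected in tests:
        got = patch(arg)
        if got != expected:
            failures.append(f"f({arg}) = {got}, expected {expected}")
    return failures

def repair_loop(candidates: List[Callable[[int], int]],
                tests: List[Tuple[int, int]]):
    """Feedback-based adaptation loop: try candidate patches in
    order, accepting the first one that survives the gate.
    Collected failures would, in a real system, be fed back into
    the next LLM repair prompt."""
    feedback: List[str] = []
    for patch in candidates:
        failures = validate_patch(patch, tests)
        if not failures:
            return patch, feedback
        feedback.extend(failures)
    return None, feedback

# Toy usage: two candidate "patches" for an absolute-value function.
tests = [(3, 3), (-4, 4), (0, 0)]
buggy = lambda x: x        # fails for negative inputs
fixed = lambda x: abs(x)
accepted, log = repair_loop([buggy, fixed], tests)
```

The point of the sketch is that acceptance is decided by the deterministic gate, not by the model's own confidence, which is the property the framework's evaluation attributes the reduction in patch overfitting to.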


Published

2026-02-20

How to Cite

Faris Sattar Hadi. (2026). Engineering Trustworthy Software with Large Language Models: A Hybrid Framework for Automated Testing, Repair, and Reliability Assurance. CENTRAL ASIAN JOURNAL OF MATHEMATICAL THEORY AND COMPUTER SCIENCES, 7(2), 43–56. Retrieved from https://cajmtcs.casjournal.org/index.php/CAJMTCS/article/view/882
