PFE-SELF-RAG: Balancing self-RAG evaluation metrics via Pareto efficiency

Document Type: Research Paper

Authors

Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran

Abstract

Self-RAG enhances Retrieval-Augmented Generation (RAG) by enabling Large Language Models (LLMs) to dynamically retrieve external knowledge and self-evaluate their outputs. However, the original Self-RAG relies heavily on a manually tuned weighted-sum mechanism for combining critique scores, which makes the system brittle and poorly adaptable to diverse query distributions. To address these limitations, Pareto Front Enhanced Self-RAG (PFE-SELF-RAG) is proposed as a tuning-free Multi-Objective Optimization (MOO) framework. It first applies Maximal Marginal Relevance (MMR) to enrich context diversity, then incorporates two evaluation strategies: Pareto Front-based selection and Geometric Mean (GM) aggregation. The primary significance of this approach lies in eliminating fragile manual weight tuning. By mathematically modeling the trade-off between factual accuracy and relevance, PFE-SELF-RAG enables adaptive candidate selection, allowing the number and quality of outputs to vary dynamically. This represents the first formal application of Pareto optimization to candidate ranking in self-reflective RAG systems, establishing a principled alternative to heuristic aggregation. Evaluations on PopQA, ARC Challenge, PubHealth, and TriviaQA show consistent gains. The Full Pareto Set strategy outperforms the Self-RAG baseline on all four benchmarks, achieving 58.6% on PopQA (+3.7%), 68.0% on ARC Challenge (+1.6%), 73.0% on PubHealth (+0.6%), and 71.3% on TriviaQA (+4.3%). These improvements underscore the practical benefit of replacing brittle heuristics with principled optimization and position PFE-SELF-RAG as a robust and scalable approach for self-reflective RAG systems.
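The two evaluation strategies named in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring functions or the selection procedure, so everything here is an assumption for illustration: each candidate carries two hypothetical scores in [0, 1] (factual accuracy and relevance), the Pareto front is the set of candidates not dominated on both scores, and GM aggregation is the square root of their product. The names `Candidate`, `dominates`, `pareto_front`, and `geometric_mean` are hypothetical and not taken from the paper.

```python
from math import sqrt
from typing import List, Tuple

# Hypothetical candidate: (label, factual accuracy score, relevance score).
# Both scores are assumed to lie in [0, 1]; higher is better.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is at least as good as b on both scores and strictly better on one."""
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates (input order preserved)."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


def geometric_mean(c: Candidate) -> float:
    """Aggregate the two scores without weights; a low score on either one pulls the result down."""
    return sqrt(c[1] * c[2])


# Illustrative scores only, not results from the paper.
candidates = [
    ("A", 0.9, 0.4),
    ("B", 0.6, 0.7),
    ("C", 0.5, 0.5),  # dominated by B on both scores
    ("D", 0.3, 0.9),
]

front = pareto_front(candidates)               # strategy 1: full Pareto set
best_gm = max(candidates, key=geometric_mean)  # strategy 2: single best by GM

print([c[0] for c in front])
print(best_gm[0])
```

Neither strategy needs a weight vector: the Pareto set can contain a varying number of candidates depending on the scores, while GM aggregation returns a single candidate that balances both objectives.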

