This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
There are significant differences among artificial intelligence large language models when answering scientific questions
Citations: 1
Authors: 18
Year: 2025
Abstract
Introduction: This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B. Methods: Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity. Results: Claude 3.5 Sonnet emerged as the highest scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve answers. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require more development or additional prompt engineering to reach comparable accuracy. Reviewers' perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent. Discussion: The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,697 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,602 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,127 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,872 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations
Authors
- Francisco Javier Álvarez‐Martínez
- Luis Miguel Pedrero Esteban
- Lucas Frungillo
- Estefanía Butassi
- Alessandro Zambon
- María Herranz-López
- Mario Aranda
- Federica Pollastro
- Anne Sylvie Tixier
- J.V. García‐Pérez
- David Arraéz-Román
- Andrew C. Ross
- Pedro Mena
- RuAngelie Edrada‐Ebel
- James Lyng
- Vicente Micol
- Fernando Borrás
- Enrique Barrajón‐Catalán
Institutions
- Universitat de Miguel Hernández d'Elx (ES)
- Centro de Investigaciones Energéticas, Medioambientales y Tecnológicas (ES)
- University of Edinburgh (GB)
- National University of Rosario (AR)
- University of Bologna (IT)
- Pontificia Universidad Católica de Chile (CL)
- Università degli Studi del Piemonte Orientale “Amedeo Avogadro” (IT)
- Université d'Avignon et des Pays de Vaucluse (FR)
- Institut National de Recherche pour l'Agriculture, l'Alimentation et l'Environnement (FR)
- Universitat Politècnica de València (ES)
- Universidad de Granada (ES)
- University of Leeds (GB)
- University of Parma (IT)
- University of Strathclyde (GB)
- University College Dublin (IE)