This is an overview page with metadata for this scholarly work. The full article is available from the publisher.
Assessing the Performance of 8 AI Chatbots in Bibliographic Reference Retrieval: Grok and DeepSeek Outperform ChatGPT, but None are Entirely Accurate
1
Citations
2
Authors
2026
Year
Abstract
Purpose: This study evaluates the reliability of eight generative artificial intelligence chatbots (including ChatGPT, Claude, Gemini, and DeepSeek) when functioning as autonomous agents for academic bibliographic generation, specifically assessing their accuracy within a university research framework.

Design/methodology/approach: Using a standardized prompting methodology, 400 references were generated and analyzed across five core knowledge areas: Health, Engineering, Experimental Sciences, Social Sciences, and Humanities. Each agent's output was audited against five formal criteria (authorship, year, title, source, and location) and categorized by error frequency and document type.

Findings: Results indicate a significant reliability gap: only 26.5% of references were entirely accurate, and nearly 40% were flawed or fabricated. While Grok and DeepSeek avoided hallucinations, Copilot, Perplexity, and Claude showed the highest failure rates, particularly when generating journal article citations.

Research limitations: The study focuses on the free versions of these AI agents, so results may vary with paid models or future architectural updates that integrate real-time web browsing more effectively.

Practical implications: These findings underscore the risks of uncritical reliance on AI agents for academic tasks, highlighting an urgent need for enhanced information literacy and the development of specialized critical thinking skills to navigate AI-mediated research.

Originality/value: This original and unpublished research provides a comparative analysis of multiple AI agents as research intermediaries, revealing structural limitations in their generative logic and offering a benchmark for the reliability of AI-driven bibliographic data in higher education.
Similar works
Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI
2019 · 8,707 citations
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
2019 · 8,613 citations
High-performance medicine: the convergence of human and artificial intelligence
2018 · 8,159 citations
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
2019 · 6,875 citations
Proceedings of the 19th International Joint Conference on Artificial Intelligence
2005 · 5,781 citations