OpenAlex · Updated hourly · Last updated: 20 May 2026, 05:50

This is an overview page with metadata for this scholarly work. The full article is available from the publisher.

Assessing the Performance of 8 AI Chatbots in Bibliographic Reference Retrieval: Grok and DeepSeek Outperform ChatGPT, but None are Entirely Accurate

2026 · 1 citation · Journal of Data and Information Science · Open Access
Open full text at the publisher

Citations: 1 · Authors: 2 · Year: 2026

Abstract

Purpose: This study evaluates the reliability of eight generative artificial intelligence chatbots, including ChatGPT, Claude, Gemini, and DeepSeek, when functioning as autonomous agents for academic bibliographic generation, specifically assessing their accuracy within a university research framework.

Design/methodology/approach: Using a standardized prompting methodology, 400 references were generated and analyzed across five core knowledge areas: Health, Engineering, Experimental Sciences, Social Sciences, and Humanities. Each agent's output was rigorously audited against five formal criteria (authorship, year, title, source, and location) and categorized by error frequency and document type.

Findings: Results indicate a significant reliability gap, with only 26.5% of references entirely accurate and nearly 40% flawed or fabricated. While Grok and DeepSeek avoided hallucinations, Copilot, Perplexity, and Claude showed the highest failure rates, particularly when generating journal article citations.

Research limitations: The study focuses on the free versions of these AI agents, so results may vary with paid models or future architectural updates that integrate real-time web browsing more effectively.

Practical implications: These findings underscore the critical risks of uncritical reliance on AI agents for academic tasks, highlighting an urgent need for enhanced information literacy and the development of specialized critical thinking skills to navigate AI-mediated research.

Originality/value: This original and unpublished research provides a pioneering comparative analysis of multiple AI agents as research intermediaries, revealing structural limitations in their generative logic and offering a unique benchmark for the reliability of AI-driven bibliographic data in higher education.

Topics

Artificial Intelligence in Healthcare and Education · AI in Service Interactions · Topic Modeling