This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
Degradation of Multi-Task Prompting Across Six NLP Tasks and LLM Families
Citations: 2
Authors: 2
Year: 2025
Abstract
This study investigates how increasing prompt complexity affects the performance of Large Language Models (LLMs) across multiple Natural Language Processing (NLP) tasks. We introduce an incremental evaluation framework where six tasks—JSON formatting, English-Italian translation, sentiment analysis, emotion classification, topic extraction, and named entity recognition—are progressively combined within a single prompt. Six representative open-source LLMs from different families (Llama 3.1 8B, Gemma 3 4B, Mistral 7B, Qwen3 4B, Granite 3.1 3B, and DeepSeek R1 7B) were systematically evaluated using local inference environments to ensure reproducibility. Results show that performance degradation is highly architecture-dependent: while Qwen3 4B maintained stable performance across all tasks, Gemma 3 4B and Granite 3.1 3B exhibited severe collapses in fine-grained semantic tasks. Interestingly, some models (e.g., Llama 3.1 8B and DeepSeek R1 7B) demonstrated positive transfer effects, improving in certain tasks under multitask conditions. Statistical analyses confirmed significant differences across models for structured and semantic tasks, highlighting the absence of a universal degradation rule. These findings suggest that multitask prompting resilience is shaped more by architectural design than by model size alone, and they motivate adaptive, model-specific strategies for prompt composition in complex NLP applications.
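The incremental framework described in the abstract can be pictured as a sequence of prompts in which step k asks for the first k tasks at once. The following is only a minimal sketch under stated assumptions: the task order simply follows the order in which the abstract lists the tasks, and the instruction wording, the names `TASKS` and `build_incremental_prompts`, and the sample sentence are all hypothetical. It does not reproduce the authors' actual prompts or evaluation code.

```python
# Task names follow the abstract; the instruction wording and the
# ordering of the tasks are assumptions made for illustration only.
TASKS = [
    ("json_formatting", "Return the answer as a single JSON object."),
    ("translation", "Translate the text from English to Italian."),
    ("sentiment", "Classify the sentiment of the text."),
    ("emotion", "Classify the dominant emotion of the text."),
    ("topics", "Extract the main topics of the text."),
    ("ner", "List the named entities in the text."),
]


def build_incremental_prompts(text, tasks=TASKS):
    """Return one prompt per step; step k combines the first k tasks."""
    prompts = []
    for k in range(1, len(tasks) + 1):
        # Number the instructions so each step is a strict extension
        # of the previous one.
        instructions = "\n".join(
            f"{i}. {instruction}"
            for i, (_, instruction) in enumerate(tasks[:k], start=1)
        )
        prompts.append(f"{instructions}\n\nText: {text}")
    return prompts


# Hypothetical input sentence, not taken from the paper's data.
prompts = build_incremental_prompts("The new phone is great.")
```

Each prompt in `prompts` could then be sent to a locally hosted model and scored per task, so that a task's score at step k can be compared with its score when it first appears in the sequence.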