Understanding Chain-of-Thought (CoT) Reasoning in Vision-Language Models for Earth Observation (EO)

Publication details

  • Publisher: Norsk Regnesentral

Chain-of-Thought (CoT) prompting has emerged as a simple yet powerful strategy for eliciting structured reasoning in large language models. This study investigates how CoT prompting influences the reasoning behavior and task performance of large language and vision-language models (LLMs/VLMs) applied to Earth Observation (EO). We compare an EO-specialized model (Falcon) with three general-purpose models (LLaVA, LLaVA-CoT, and o3) across two datasets: RSVQAxBEN, a large open EO benchmark, and a proprietary aerial dataset from Narvik. Experiments contrast baseline and CoT-style prompts to assess both factual accuracy and reasoning quality, complemented by an LLM-as-judge evaluation. Results show that CoT prompting benefits only the large-scale o3 model, while smaller and mid-scale models suffer degraded accuracy, confirming that effective reasoning is an emergent property of scale. CoT adds transparency by revealing how models reason, although its outputs can remain partly opaque due to safety or internal constraints. On Narvik, o3 generalizes well to unseen EO data, but CoT prompting does not improve quantitative accuracy. These findings suggest that CoT currently offers greater value for interpretability than for performance. Future work should explore inference-time perception–reasoning strategies, in which an EO model such as Falcon provides scene-level facts that guide o3's reasoning, to improve both trustworthiness and accuracy without retraining.
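As a rough illustration of the prompting setup and the inference-time perception–reasoning strategy described above, the Python sketch below contrasts a baseline prompt with a CoT-style prompt and shows how scene-level facts from an EO-specialized model such as Falcon could be injected into the reasoning prompt sent to a large model such as o3. The helper functions `describe_scene_with_falcon` and `ask_o3`, the prompt wording, and the file name are hypothetical placeholders (stubbed with dummy returns so the snippet runs), not part of the study's published code or any real API.

```python
# Minimal sketch: baseline vs. CoT prompting, plus an inference-time
# perception-reasoning pipeline for EO visual question answering.
# All model calls are stubs; replace them with real model/API calls.

def describe_scene_with_falcon(image_path: str) -> str:
    """Hypothetical perception step: an EO-specialized model (e.g. Falcon)
    returns scene-level facts for the given image."""
    return "Scene facts: mixed residential area, two water bodies, dense forest to the north."

def ask_o3(prompt: str) -> str:
    """Hypothetical reasoning step: a large general-purpose model (e.g. o3)
    answers the prompt."""
    return "(model answer)"

def build_baseline_prompt(question: str) -> str:
    # Baseline: ask for the answer directly, no reasoning requested.
    return f"Question: {question}\nAnswer with a single word or number."

def build_cot_prompt(question: str, scene_facts: str = "") -> str:
    # CoT-style: request step-by-step reasoning before the final answer,
    # optionally conditioned on perception output from the EO model.
    context = f"Known scene facts: {scene_facts}\n" if scene_facts else ""
    return (
        f"{context}"
        f"Question: {question}\n"
        "Think step by step about the relevant image evidence, "
        "then state the final answer on its own line."
    )

if __name__ == "__main__":
    question = "Are there more than two water bodies in the image?"
    image = "narvik_tile_001.png"  # hypothetical file name

    # Baseline prompting: answer only.
    print(ask_o3(build_baseline_prompt(question)))

    # CoT prompting with injected perception output: the EO model supplies
    # scene-level facts that guide the reasoning model, with no retraining
    # of either model.
    facts = describe_scene_with_falcon(image)
    print(ask_o3(build_cot_prompt(question, scene_facts=facts)))
```

The point of the sketch is the division of labor at inference time: the EO model handles perception, while the general-purpose model handles reasoning over the question together with those perception outputs.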