Transactions on Machine Learning Research (TMLR) · 2026

RobustMAD: Evaluating Real-World Robustness of Multimodal Small Language Models for Deployable Anomaly Detection Assistants

1 Singapore University of Technology and Design; 2 A*STAR, Singapore; 3 Nanyang Technological University; 4 Chongqing University
* Corresponding author
410 Curated images
4,510 Multiple-choice Qs
7,380 Open-ended Qs
4 Robustness categories
560+ Expert person-hours
93.8% Human–LLM agreement
Figure 1: A model correctly identifies frayed copper wires under a neutral query (Query A) but hallucinates a defect-free state under a confirmation-seeking query (Query B).
Figure 1: Fragility of logical grounding in state-of-the-art MSLMs. The same model correctly flags exposed, frayed copper wires under a neutral query (A), yet hallucinates a “defect-free” verdict under a confirmation-seeking rephrasing (B). Minor, everyday variations in query phrasing can flip a safe inspection into a catastrophic false negative.

Abstract

Multimodal industrial anomaly inspection assistants are a critical component of next-generation smart factories, enabling interactive vision–language querying. However, multimodal large language models remain impractical for on-site deployment due to prohibitive computational demands and privacy risks from cloud inference. Compact multimodal small language models (MSLMs) offer a deployable alternative, yet progress is constrained by the lack of comprehensive robustness analyses and meaningfully challenging benchmarks that reflect real-world industrial conditions.

To address this gap, we develop RobustMAD, the first deployment-motivated benchmark designed to comprehensively evaluate model robustness through diverse open-ended queries spanning object understanding, anomaly detection, unanswerable problems, and visual quality degradations. Contrary to conventional assumptions, top-performing MSLMs exhibit promising capabilities—surprisingly outperforming even the larger GPT-5 Nano. However, they still fall short of safety-critical requirements, and RobustMAD reveals critical robustness gaps that pose operational risks. Three recurring failure modes emerge: (i) fragile multimodal grounding under fine-grained distinctions or degraded visual conditions, (ii) insufficiently comprehensive responses, and (iii) weak logical grounding on unanswerable or ill-posed queries, leading to hallucinated outputs. Grounded in these insights, we provide actionable guidance for the design of next-generation multimodal industrial inspection assistants.

Key Contributions

First MSLM robustness study

The first comprehensive evaluation of multimodal small language model robustness for industrial anomaly inspection—a domain where on-device deployment matters most but remains largely unexplored.

A deployment-motivated benchmark

RobustMAD systematically probes robustness via open-ended queries over object understanding, anomaly detection, unanswerable problems, and realistic visual-quality degradations—exposing failure modes standard benchmarks cannot.

Multi-dimensional scoring

A user-centered LLM-judge scheme rates technical accuracy, comprehensiveness, relevance, and clarity, enabling standardized, interpretable evaluation of open-ended responses. Human-verified ground truth is released.

Surprising findings & guidance

Even generalist MSLMs can beat much larger models—challenging the “bigger is better” assumption—while exposing overlooked failure modes that motivate concrete design guidance.

The RobustMAD Benchmark

RobustMAD captures both knowledge-based robustness — reasoning under diverse, imperfect, domain-intensive queries — and visual quality robustness, re-evaluating every question under deployment-relevant perturbations such as motion blur and low lighting. From 410 representative images of MVTec AD and VisA, we curate 4,510 multiple-choice and 7,380 open-ended questions across four categories.
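The two perturbations named above can be approximated in a few lines. The following is a minimal sketch, assuming images are H×W×C `uint8` NumPy arrays; the kernel length, gain, and gamma values are illustrative assumptions, not the settings used to build RobustMAD.

```python
import numpy as np


def motion_blur(image: np.ndarray, length: int = 9) -> np.ndarray:
    """Horizontal motion blur: average each pixel with its row neighbours.

    `length` is a hypothetical kernel size (odd), not a RobustMAD setting.
    """
    if length < 1 or length % 2 == 0:
        raise ValueError("length must be a positive odd integer")
    pad = length // 2
    # Edge-pad along the width axis so the output keeps the input shape.
    padded = np.pad(image.astype(np.float64), ((0, 0), (pad, pad), (0, 0)), mode="edge")
    blurred = np.zeros(image.shape, dtype=np.float64)
    width = image.shape[1]
    for offset in range(length):
        blurred += padded[:, offset:offset + width, :]
    return np.clip(np.rint(blurred / length), 0, 255).astype(np.uint8)


def low_light(image: np.ndarray, gain: float = 0.3, gamma: float = 1.5) -> np.ndarray:
    """Simulate low lighting with a gamma curve followed by a global gain.

    `gain` and `gamma` are illustrative values chosen for this sketch.
    """
    normalized = image.astype(np.float64) / 255.0
    darkened = gain * np.power(normalized, gamma)
    return np.clip(np.rint(darkened * 255.0), 0, 255).astype(np.uint8)
```

Re-evaluating a question then amounts to posing the same query against `motion_blur(image)` and `low_light(image)` as well as the clean image, and comparing the scores.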

Category 1

General Object Understanding

Fundamental understanding of an object’s name, visual attributes, and intended function.

Category 2

Stand-alone Anomaly Detection

Detecting and localizing defects from a single image, with deeper reasoning about functional impact.

Category 3

Pairwise Anomaly Detection

Cross-image reasoning against a defect-free reference—reflecting real inspection workflows.

Category 4

Unanswerable / Ill-posed Queries

Staying logically grounded when queries are unanswerable, ill-posed, or reference non-existent attributes.

Figure 2: Overview of the RobustMAD benchmark dataset showing four knowledge-based robustness categories with example multiple-choice and open-ended questions, re-evaluated under visual quality degradations.
Figure 2: Overview of the RobustMAD benchmark. Four knowledge-based robustness categories, each with carefully crafted MCQ and open-ended questions, are re-evaluated under deployment-relevant visual degradations (motion blur, low light).

How RobustMAD Was Built

A three-stage pipeline pairs GPT-5 generation with intensive human verification. Twelve graduate-level experts contributed ~560 person-hours of review, and a five-judge human study found 93.8% agreement with the LLM judge.
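This page does not spell out how the human–LLM agreement figure is defined. As one plausible reading, the sketch below measures the share of items where the LLM judge's pass/fail verdict matches the majority vote of the human judges. The function names and the majority-vote rule are assumptions made for illustration.

```python
from typing import Sequence


def majority_pass(votes: Sequence[bool]) -> bool:
    """True when strictly more than half of the human judges voted 'pass'."""
    return 2 * sum(votes) > len(votes)


def percent_agreement(
    human_votes: Sequence[Sequence[bool]],
    llm_verdicts: Sequence[bool],
) -> float:
    """Percentage of items where the LLM verdict equals the human majority.

    Assumed definition for this sketch; the paper may define agreement differently.
    """
    if len(human_votes) != len(llm_verdicts) or not llm_verdicts:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(
        majority_pass(votes) == verdict
        for votes, verdict in zip(human_votes, llm_verdicts)
    )
    return 100.0 * matches / len(llm_verdicts)
```

With five judges per item there are no ties, so the strict-majority rule is always well defined.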

Figure 3: The RobustMAD data construction pipeline with three stages: robustness problem construction, benchmark data generation, and two-stage quality control.
Figure 3: Data construction pipeline. Human experts co-design robustness categories and question archetypes; GPT-5 generates image-conditioned MCQ and open-ended QA pairs from domain-knowledge captions; a two-stage quality-control process (LLM screening + human verification) produces the final benchmark.
20 Object types
39 Condition categories
2 Image sources (MVTec AD, VisA)
3 Visual quality types
12 Expert annotators
11 Models evaluated

Key Findings

01 Bigger is not always better

Despite having fewer parameters, the best MSLMs outperform much larger models. Qwen3-VL-4B-Instruct surpasses Phi-4-Multimodal-Instruct (6B) and even the proprietary GPT-5 Nano—evidence that targeted architecture and domain-relevant training can matter more than raw scale for industrial anomaly detection.

02 MCQ success hides open-ended fragility

Under structured MCQs, top models look strong (e.g., 97.6% on object understanding). Under realistic open-ended queries, however, performance collapses: only 72.8% of the best model’s answers reach a passing score (at least 3 on the 1-5 scale), and that share drops to 36.0% for the weakest model. Aggregate accuracy masks this gap completely.

Figure 4: Distribution of overall scores (1-5) across four MSLMs, with the percentage of questions scoring at least 3 indicated for each model.
Figure 4: Open-ended overall-score distribution. Percentage of responses reaching a passing score (≥3) varies sharply across MSLMs of similar size, revealing large robustness differences invisible to MCQ accuracy.
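The pass-rate statistic reported in Figure 4 is straightforward to reproduce from per-question overall scores. A minimal sketch, assuming integer scores on the 1-5 scale and the ≥3 passing threshold stated in the caption; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable

PASSING_SCORE = 3  # threshold taken from the Figure 4 caption


def pass_rate(scores: Iterable[int], threshold: int = PASSING_SCORE) -> float:
    """Percentage of responses whose overall score is at least `threshold`."""
    scores = list(scores)
    if not scores:
        raise ValueError("scores must not be empty")
    return 100.0 * sum(score >= threshold for score in scores) / len(scores)


def score_distribution(scores: Iterable[int]) -> Dict[int, int]:
    """Count of responses at each score level from 1 to 5."""
    counts = Counter(scores)
    return {level: counts.get(level, 0) for level in range(1, 6)}


print(pass_rate([1, 2, 3, 4, 5]))  # 60.0
```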

03 Accuracy & comprehensiveness are the weak links

Decomposing the overall score shows that MSLMs score much lower on technical accuracy and comprehensiveness than on relevance and style. The dimensions that matter most for safety-critical inspection are exactly where models fail.

Figure 5: Average multi-dimensional sub-scores and overall score per robustness category, showing lower technical accuracy and comprehensiveness than relevance and style.
Figure 5: Multi-dimensional sub-scores. Across categories, technical accuracy and comprehensiveness (red-toned) lag behind relevance and style/clarity (green-toned), heavily penalizing the overall score (black).
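A per-category breakdown like the one in Figure 5 can be computed by grouping judge outputs by robustness category and averaging each sub-score. The sketch below assumes each record is a dictionary with a `category` key and one key per dimension; these field names are hypothetical. It deliberately does not compute an overall score, because the rule for combining sub-scores is not given on this page.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping

# Dimension names follow the scoring scheme described above; the keys are assumed.
DIMENSIONS = ("technical_accuracy", "comprehensiveness", "relevance", "clarity")


def mean_subscores(records: Iterable[Mapping]) -> Dict[str, Dict[str, float]]:
    """Average each sub-score within every robustness category."""
    grouped = defaultdict(lambda: defaultdict(list))
    for record in records:
        for dimension in DIMENSIONS:
            grouped[record["category"]][dimension].append(record[dimension])
    return {
        category: {dimension: mean(values) for dimension, values in dims.items()}
        for category, dims in grouped.items()
    }
```

Comparing the averaged `technical_accuracy` and `comprehensiveness` values against `relevance` and `clarity` for each category makes the gap described in Finding 03 directly visible.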

04 Domain depth & visual quality remain brittle

Broad object recognition does not imply deep domain knowledge: accuracy drops sharply on domain-intensive objects (e.g., PCBs, sensors). Realistic blur and low light further degrade already-weak scores—non-trivial risks for safety-critical, on-line inspection.

Three Recurring Failure Modes

i

Fragile multimodal grounding

Reasoning beyond macro visual attributes breaks down under fine-grained distinctions or degraded visuals—language priors override visual evidence, misclassifying objects and missing subtle defects.

ii

Insufficiently comprehensive responses

Even when broadly correct, answers omit precise object names, defect locations, or additional defects—falling short of the specificity production inspection standards demand.

iii

Weak logical grounding on unanswerable queries

Faced with ill-posed or unanswerable queries, models confidently hallucinate—e.g., a voltage rating on a medicine capsule—rather than recognizing missing evidence.

Figure 6: Representative qualitative examples of recurring failure modes across MSLMs, including language-prior bias, missed fine-grained defects, adversarial defects, image-quality vulnerability, vague responses, and hallucinations on unanswerable queries.
Figure 6: Qualitative failure examples. Representative cases spanning language-prior bias, missed fine-grained defects, defects acting as unintended adversarial signals, image-quality vulnerability, vague responses, and hallucinations on unanswerable queries.

Guidance for Next-Generation Inspection Assistants

These failures are not merely prompt-driven—safety-critical systems cannot rely on perfectly engineered prompts. Addressing them requires changes in model architecture and training.

Vision-side architecture is as critical as the LLM

Qwen3-VL-4B-Instruct’s edge stems from encoder design, tight vision–language token alignment, and cross-image reference grounding—outperforming InternVL3.5-4B despite a shared Qwen3 backbone. Vision-side design choices deserve first-class attention.

Train explicitly for the failure modes

Post-training should target fine-grained visual distinctions, multi-defect scenarios, image-quality degradations, and cross-image differences—with supervision that enforces comprehensive, precise outputs and explicit handling of unanswerable queries.

Robustness to unanswerable queries must be learned

It does not emerge implicitly, even with modern reasoning-oriented training. Targeted negative and counterfactual data are needed to teach models when not to answer.
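One simple way to produce such counterfactual data is to pair each object type with attributes that do not apply to it and to supervise an explicit "cannot be determined" target. The sketch below is purely illustrative: the attribute table, query template, and target wording are all hypothetical and are not taken from RobustMAD.

```python
from typing import Dict, List, Set

# Hypothetical attribute table, for illustration only.
OBJECT_ATTRIBUTES: Dict[str, Set[str]] = {
    "capsule": {"coating colour", "imprint text"},
    "pcb": {"voltage rating", "solder joint quality"},
    "cable": {"insulation condition", "voltage rating"},
}

# Hypothetical target wording that teaches the model to decline.
TARGET_TEMPLATE = (
    "This cannot be determined: {attr} is not an applicable attribute "
    "of the {obj} shown in the image."
)


def counterfactual_examples(object_attributes: Dict[str, Set[str]]) -> List[dict]:
    """Build unanswerable queries from attributes an object does not have."""
    all_attributes = set().union(*object_attributes.values())
    examples = []
    for obj in sorted(object_attributes):
        # Attributes valid for some other object but not for this one.
        for attr in sorted(all_attributes - object_attributes[obj]):
            examples.append({
                "object": obj,
                "query": f"What is the {attr} of the {obj} in the image?",
                "target": TARGET_TEMPLATE.format(attr=attr, obj=obj),
            })
    return examples
```

With this toy table, the capsule is paired with a voltage-rating query, mirroring the hallucination example under failure mode iii, while the PCB is never asked about an attribute it legitimately has.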

Small but well-designed models are a viable foundation for efficient, on-device inspection assistants—but closing the robustness gap demands architecture- and data-level innovation, not scale alone.

Citation

If you find RobustMAD useful, please cite our work:

@article{arunan2026robustmad,
  title   = {RobustMAD: Evaluating Real-World Robustness of Multimodal Small
             Language Models for Deployable Anomaly Detection Assistants},
  author  = {Arunan, Anushiya and Li, Xin and Qin, Yan and Tan, U-Xuan and
             Vuong, Nhu Khue and Li, Xiaoli and Yuen, Chau},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  issn    = {2835-8856},
  url     = {https://openreview.net/forum?id=skrA9UYNIZ}
}

Acknowledgments

The work of Anushiya Arunan is supported by the A*STAR Graduate Scholarship (Computing). This research is supported by the Ministry of Education, Singapore, under its MOE Tier 1 (SKI 2021_08_03). We thank the graduate-level experts who contributed to the human review and LLM-judge validation of RobustMAD.