Multimodal industrial anomaly inspection assistants are a critical component of next-generation smart factories, enabling interactive vision–language querying. However, multimodal large language models remain impractical for on-site deployment due to prohibitive computational demands and privacy risks from cloud inference. Compact multimodal small language models (MSLMs) offer a deployable alternative, yet progress is constrained by the lack of comprehensive robustness analyses and meaningfully challenging benchmarks that reflect real-world industrial conditions.
To address this gap, we develop RobustMAD, the first deployment-motivated benchmark designed to comprehensively evaluate model robustness through diverse open-ended queries spanning object understanding, anomaly detection, unanswerable problems, and visual quality degradations. Contrary to conventional assumptions, top-performing MSLMs exhibit promising capabilities—surprisingly outperforming even the larger GPT-5 Nano. However, they still fall short of safety-critical requirements, and RobustMAD reveals critical robustness gaps that pose operational risks. Three recurring failure modes emerge: (i) fragile multimodal grounding under fine-grained distinctions or degraded visual conditions, (ii) insufficiently comprehensive responses, and (iii) weak logical grounding on unanswerable or ill-posed queries, leading to hallucinated outputs. Grounded in these insights, we provide actionable guidance for the design of next-generation multimodal industrial inspection assistants.
The first comprehensive evaluation of multimodal small language model robustness for industrial anomaly inspection—a domain where on-device deployment matters most but remains largely unexplored.
RobustMAD systematically probes robustness via open-ended queries over object understanding, anomaly detection, unanswerable problems, and realistic visual-quality degradations—exposing failure modes standard benchmarks cannot.
A user-centered LLM-judge scheme rates technical accuracy, comprehensiveness, relevance, and clarity, enabling standardized, interpretable evaluation of open-ended responses. Human-verified ground truth is released.
Even generalist MSLMs can beat much larger models—challenging the “bigger is better” assumption—while exposing overlooked failure modes that motivate concrete design guidance.
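The four-dimension judge rubric named above can be pictured as a small record per response. The following is a minimal sketch, not the benchmark's actual implementation: the class name `JudgeScores`, the 1-5 scale, and the unweighted mean used for the overall score are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from statistics import mean


@dataclass(frozen=True)
class JudgeScores:
    """Scores for one open-ended response on the four rubric dimensions.

    The 1-5 integer scale is an assumption for illustration only.
    """
    technical_accuracy: int
    comprehensiveness: int
    relevance: int
    clarity: int

    def __post_init__(self):
        # Reject out-of-range values early so bad judge output is caught.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be in 1..5, got {value}")

    def overall(self) -> float:
        # Unweighted mean of the four dimensions (assumed aggregation rule).
        return float(mean(getattr(self, f.name) for f in fields(self)))


if __name__ == "__main__":
    scores = JudgeScores(technical_accuracy=2, comprehensiveness=3,
                         relevance=5, clarity=5)
    print(scores.overall())
```

Keeping the per-dimension values instead of only the overall number is what makes a later breakdown by dimension possible.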
RobustMAD captures both knowledge-based robustness — reasoning under diverse, imperfect, domain-intensive queries — and visual quality robustness, re-evaluating every question under deployment-relevant perturbations such as motion blur and low lighting. From 410 representative images of MVTec AD and VisA, we curate 4,510 multiple-choice and 7,380 open-ended questions across four categories.
Fundamental understanding of an object’s name, visual attributes, and intended function.
Detecting and localizing defects from a single image, with deeper reasoning about functional impact.
Cross-image reasoning against a defect-free reference—reflecting real inspection workflows.
Staying logically grounded when queries are unanswerable, ill-posed, or reference non-existent attributes.
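The visual quality perturbations mentioned above (motion blur and low lighting) can be approximated with very simple image operations. The sketch below is only illustrative and assumes a grayscale image stored as nested lists of 0-255 values; the function names, the box-kernel blur, and the `gain` and `gamma` defaults are hypothetical and are not the perturbation settings used in RobustMAD.

```python
def motion_blur_horizontal(image, kernel_size=5):
    """Average each pixel with its horizontal neighbours (box kernel).

    Indices past the image border are clamped to the nearest edge pixel.
    """
    if kernel_size < 1 or kernel_size % 2 == 0:
        raise ValueError("kernel_size must be a positive odd integer")
    half = kernel_size // 2
    blurred = []
    for row in image:
        width = len(row)
        new_row = []
        for x in range(width):
            window = [row[min(max(x + dx, 0), width - 1)]
                      for dx in range(-half, half + 1)]
            new_row.append(sum(window) / len(window))
        blurred.append(new_row)
    return blurred


def low_light(image, gain=0.3, gamma=1.5):
    """Darken an image: normalise to [0, 1], apply gamma, then scale by gain."""
    return [[255.0 * gain * (p / 255.0) ** gamma for p in row] for row in image]


if __name__ == "__main__":
    tiny = [[0, 0, 255, 0, 0]]
    print(motion_blur_horizontal(tiny, kernel_size=3))
    print(low_light(tiny))
```

Re-running every question on the perturbed copy of its image, as the text describes, then isolates the effect of image quality from the effect of the question itself.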
A three-stage pipeline pairs GPT-5 generation with intensive human verification. Twelve graduate-level experts contributed ~560 person-hours of review, and a five-judge human study found 93.8% agreement with the LLM judge.
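One simple way to quantify agreement between a panel of human judges and an LLM judge is to compare the LLM verdict with the human majority vote on each item. The sketch below assumes that definition and uses hypothetical "pass"/"fail" labels; the source does not state how its agreement figure was computed, so this should be read as one plausible formulation and not as the study's protocol.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most common label; an odd-sized panel avoids two-way ties."""
    return Counter(labels).most_common(1)[0][0]


def agreement_rate(human_labels_per_item, llm_labels):
    """Fraction of items where the LLM label equals the human majority label."""
    if not llm_labels or len(human_labels_per_item) != len(llm_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(majority_vote(humans) == llm
                  for humans, llm in zip(human_labels_per_item, llm_labels))
    return matches / len(llm_labels)


if __name__ == "__main__":
    humans = [
        ["pass"] * 5,
        ["pass", "pass", "fail", "fail", "fail"],
        ["fail"] * 5,
        ["pass", "pass", "pass", "fail", "fail"],
    ]
    llm = ["pass", "fail", "fail", "fail"]
    print(agreement_rate(humans, llm))
```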
Despite having fewer parameters, the best MSLMs outperform much larger models. Qwen3-VL-4B-Instruct surpasses Phi-4-Multimodal-Instruct (6B) and even the proprietary GPT-5 Nano—evidence that targeted architecture and domain-relevant training can matter more than raw scale for industrial anomaly detection.
Under structured MCQs, top models look strong (e.g., 97.6% on object understanding). But under realistic open-ended queries, performance collapses: only 72.8% of the best model’s answers reach a passing score, and the share drops to 36.0% for the weakest model. Aggregate accuracy completely masks this gap.
Decomposing the overall score shows that MSLMs score much lower on technical accuracy and comprehensiveness than on relevance and clarity: the dimensions that matter most for safety-critical inspection are exactly where models fail.
Broad object recognition does not imply deep domain knowledge: accuracy drops sharply on domain-intensive objects (e.g., PCBs, sensors). Realistic blur and low light further degrade already-weak scores—non-trivial risks for safety-critical, on-line inspection.
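The pass-rate and per-dimension views used in the findings above can be computed from judge scores in a few lines. This is a sketch under stated assumptions: responses are plain dictionaries keyed by the four rubric dimensions, scores are on a hypothetical 1-5 scale, the overall score is an unweighted mean, and the passing threshold of 4.0 is invented for the example.

```python
from statistics import mean

# Dimension names follow the judge rubric; the dict layout is hypothetical.
DIMENSIONS = ("technical_accuracy", "comprehensiveness", "relevance", "clarity")


def pass_rate(overall_scores, threshold):
    """Share of responses whose overall score reaches the threshold."""
    if not overall_scores:
        raise ValueError("overall_scores must be non-empty")
    return sum(score >= threshold for score in overall_scores) / len(overall_scores)


def dimension_means(responses):
    """Mean score per rubric dimension across all responses."""
    return {d: mean(r[d] for r in responses) for d in DIMENSIONS}


if __name__ == "__main__":
    responses = [
        {"technical_accuracy": 2, "comprehensiveness": 2, "relevance": 5, "clarity": 5},
        {"technical_accuracy": 4, "comprehensiveness": 3, "relevance": 5, "clarity": 4},
    ]
    overall = [mean(r[d] for d in DIMENSIONS) for r in responses]
    print(pass_rate(overall, threshold=4.0))
    print(dimension_means(responses))
```

In this toy data the relevance and clarity means are high while technical accuracy and comprehensiveness lag, which is the kind of pattern a single aggregate number would hide.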
Reasoning beyond macro visual attributes breaks down under fine-grained distinctions or degraded visuals: language priors override visual evidence, so models misclassify objects and miss subtle defects.
Even when broadly correct, answers omit precise object names, defect locations, or additional defects—falling short of the specificity production inspection standards demand.
Faced with ill-posed or unanswerable queries, models confidently hallucinate—e.g., a voltage rating on a medicine capsule—rather than recognizing missing evidence.
These failures are not merely prompt-driven—safety-critical systems cannot rely on perfectly engineered prompts. Addressing them requires changes in model architecture and training.
Qwen3-VL-4B-Instruct’s edge stems from encoder design, tight vision–language token alignment, and cross-image reference grounding—outperforming InternVL3.5-4B despite a shared Qwen3 backbone. Vision-side design choices deserve first-class attention.
Post-training should target fine-grained visual distinctions, multi-defect scenarios, image-quality degradations, and cross-image differences—with supervision that enforces comprehensive, precise outputs and explicit handling of unanswerable queries.
The ability to recognize and decline unanswerable queries does not emerge implicitly, even with modern reasoning-oriented training. Targeted negative and counterfactual data are needed to teach models when not to answer.
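As a rough illustration of what targeted counterfactual data could look like, the sketch below pairs each object with attributes it does not have and attaches an abstention target, in the spirit of the voltage-rating-on-a-capsule example. The attribute table, the question template, and the target wording are all hypothetical toy values chosen for this sketch; the source does not prescribe a data format.

```python
# Toy attribute table (hypothetical): which attributes make sense per object.
VALID_ATTRIBUTES = {
    "capsule": {"color", "imprint text"},
    "pcb": {"color", "voltage rating", "component count"},
    "cable": {"color", "voltage rating"},
}

# Assumed supervision target that teaches the model to decline.
ABSTAIN_TARGET = "This cannot be determined: the object has no such attribute."


def counterfactual_queries(valid_attributes):
    """Build unanswerable questions by asking about attributes an object lacks."""
    all_attributes = set().union(*valid_attributes.values())
    examples = []
    for obj in sorted(valid_attributes):
        # Attributes that exist for some other object but not for this one.
        for attr in sorted(all_attributes - valid_attributes[obj]):
            examples.append({
                "object": obj,
                "question": f"What is the {attr} of this {obj}?",
                "target": ABSTAIN_TARGET,
            })
    return examples


if __name__ == "__main__":
    for example in counterfactual_queries(VALID_ATTRIBUTES):
        print(example["question"])
```

In practice such negatives would be mixed with answerable questions about the same images, so the model learns to separate the two cases and does not simply learn to refuse.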
Small but well-designed models are a viable foundation for efficient, on-device inspection assistants—but closing the robustness gap demands architecture- and data-level innovation, not scale alone.
If you find RobustMAD useful, please cite our work:
@article{arunan2026robustmad,
  title   = {RobustMAD: Evaluating Real-World Robustness of Multimodal Small
             Language Models for Deployable Anomaly Detection Assistants},
  author  = {Arunan, Anushiya and Li, Xin and Qin, Yan and Tan, U-Xuan and
             Vuong, Nhu Khue and Li, Xiaoli and Yuen, Chau},
  journal = {Transactions on Machine Learning Research},
  year    = {2026},
  issn    = {2835-8856},
  url     = {https://openreview.net/forum?id=skrA9UYNIZ}
}
The work of Anushiya Arunan is supported by the A*STAR Graduate Scholarship (Computing). This research is supported by the Ministry of Education, Singapore, under its MOE Tier 1 (SKI 2021_08_03). We thank the graduate-level experts who contributed to the human review and LLM-judge validation of RobustMAD.