Enhancing reliability and explainability in industrial anomalous sound detection

Tomoya Nishida

Research & Development Group
Hitachi, Ltd.

Why physical AI needs to understand sound

Physical AI is becoming increasingly important as AI moves beyond digital data to sense, understand, and support decisions in the real world. In critical sectors like manufacturing, energy, and mobility, this technology is essential for keeping systems safe, reliable, and efficient.

Among the various ways AI "senses" its environment, sound is a particularly rich source of information—a direct window into a machine’s inner workings. Machines “speak” through sound, and subtle acoustic changes often reveal the earliest signs of wear, misalignment, or failure. This makes acoustic monitoring a natural approach for predictive maintenance.

By continuously listening, we can detect unusual patterns and intervene before a breakdown occurs. However, building reliable and explainable anomalous sound detection systems remains a significant challenge. The fundamental hurdle is a lack of data: how can we design and assess models when real-world anomalous sounds are so rarely available? Although methods exist that detect anomalies without relying on anomalous data, evaluating them is difficult without real samples, which makes practical deployment challenging. Moreover, even when an anomaly is detected, identifying its root cause has traditionally depended on human expertise, because current systems cannot reliably explain why a sound is irregular. As a result, conventional systems offer only limited benefit for streamlining subsequent repair and maintenance.

Breaking the data barrier: Synthetic anomaly generation

Traditional anomaly detection systems have relied on real fault data for evaluation. But what if we could generate anomalies before any actual faults occur? Two recent innovations make this possible:

  • MIMII-Gen [1] learns from a limited set of anomalous data collected across various machine types and operating conditions. It then uses its knowledge to generate synthetic anomalous data for entirely new conditions where data is unavailable. An example of anomalous data generated by MIMII-Gen is shown in Figure 1.
  • MIMII-Agent [2] takes this a step further by synthesizing “pseudo-anomalous” sounds without using any abnormal data during training. It leverages Large Language Models (LLMs) to infer the typical sounds that a machine might produce when a specific malfunction occurs. By applying suitable processing techniques—predefined for different scenarios—it creates synthetic anomalous sounds.
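The recipe-based synthesis behind MIMII-Agent can be illustrated with a toy sketch. Everything here is invented for illustration — the signal, the fault names, and the processing recipes; in the actual system an LLM with function calling selects and parameterizes the transformations, and real machine recordings are used instead of a pure tone:

```python
import math

SR = 16_000  # sample rate in Hz (illustrative choice)

def normal_tone(freq=120.0, seconds=0.5):
    """A steady hum standing in for a normal machine recording."""
    n = int(SR * seconds)
    return [math.sin(2 * math.pi * freq * t / SR) for t in range(n)]

def add_rattle(x, rate=8.0, depth=0.8):
    """Amplitude-modulate the hum to mimic a periodic rattle (e.g. a loose part)."""
    return [s * (1.0 + depth * math.sin(2 * math.pi * rate * t / SR))
            for t, s in enumerate(x)]

def add_harmonic_distortion(x, amount=0.5):
    """Clip and renormalize the waveform to add harmonics, mimicking bearing wear."""
    lim = 1.0 - amount
    return [max(-lim, min(lim, s)) / lim for s in x]

# Predefined processing recipes indexed by hypothetical fault type; the real
# pipeline maps an LLM's fault description to such recipes automatically.
RECIPES = {"rattle": add_rattle, "bearing_wear": add_harmonic_distortion}

normal = normal_tone()
pseudo_anomaly = RECIPES["rattle"](normal)
```

The point of the sketch is the structure, not the signal processing: normal data plus a small library of fault-conditioned transformations yields pseudo-anomalous sounds without any recorded faults.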

In both methods, the generated anomalous sounds were used to evaluate anomaly detection systems (Figure 2). The resulting performance ranking across the evaluated systems was consistent with the ranking obtained using real anomalous sounds, indicating that the generated data can reliably assess the systems' relative performance without relying solely on real fault data.
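Ranking consistency of this kind can be quantified with a rank correlation between the two score lists. A minimal sketch, using hypothetical AUC values (the numbers below are invented; the papers report their own scores) and a plain Spearman coefficient without tie handling:

```python
def rank(scores):
    """Rank values (1 = highest score); ties are not averaged, for brevity."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for pos, i in enumerate(order, start=1):
        r[i] = pos
    return r

def spearman(a, b):
    """Spearman rank correlation between two equal-length score lists."""
    n = len(a)
    ra, rb = rank(a), rank(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical AUCs of four detectors on real vs. synthetic anomalies.
auc_real      = [0.62, 0.71, 0.84, 0.77]
auc_synthetic = [0.60, 0.74, 0.88, 0.80]
rho = spearman(auc_real, auc_synthetic)  # → 1.0 (identical ranking)
```

A coefficient near 1 means the synthetic benchmark orders the detectors the same way a real-fault benchmark would, which is exactly what matters when choosing between candidate systems.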


Bridging the clarity gap: Making anomaly detection explainable

Simply detecting an anomaly is often not enough for practical industrial use. Maintenance workers need to know why a sound is considered anomalous to make the next decisions. Historically, generating explanations required pairs of normal and anomalous data—which brings us back to the challenge of data scarcity.

To bridge this gap, we have developed two innovative approaches that provide clarity without requiring extensive anomalous training data:

  • Retrieval-augmented difference captioning (RAG-difference) [3] uses audio embeddings for both detection and caption generation. As shown in the workflow in Figure 3, we first use audio captioning models to independently generate captions for the anomalous sound and for similar normal sounds. An LLM then compares these captions to produce a human-readable explanation of how the anomalous sound differs from the normal sound, without any extra training. A practical example of this "difference captioning" is shown in Figure 4.
  • Timbre-based explanation [4] leverages computable psychoacoustic timbre metrics (e.g., sharpness, roughness, boominess) to describe how an anomalous sound differs from normal sounds. Like our other methods, it works without any anomalous training data.
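The retrieval step of the RAG-difference workflow can be sketched as a nearest-neighbor search over embeddings. The embeddings, clip IDs, and captions below are toy stand-ins (a real system would use an audio-embedding model and audio captioning models), and the LLM comparison is only shown as the prompt it would receive:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_normal(query_emb, normal_db, k=1):
    """Return the k normal clips whose embeddings are closest to the query."""
    return sorted(normal_db, key=lambda item: -cosine(query_emb, item["emb"]))[:k]

# Toy 3-D embeddings and pre-computed captions standing in for real models.
normal_db = [
    {"id": "normal_01", "emb": [0.9, 0.1, 0.0], "caption": "steady low hum"},
    {"id": "normal_02", "emb": [0.1, 0.9, 0.0], "caption": "quiet fan noise"},
]
anomalous = {"emb": [0.8, 0.2, 0.1], "caption": "low hum with metallic squeal"}

nearest = retrieve_normal(anomalous["emb"], normal_db)[0]
# In the real pipeline an LLM compares the two captions; here we only
# assemble the prompt it would be given.
prompt = (f"Normal sound: {nearest['caption']}\n"
          f"Anomalous sound: {anomalous['caption']}\n"
          "Describe how the anomalous sound differs from the normal one.")
```

Because retrieval picks the closest normal clip, the LLM always compares the anomaly against a relevant baseline rather than an arbitrary one.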
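The timbre-based idea can likewise be sketched with a crude proxy. Psychoacoustic sharpness is a specific loudness-weighted metric; here we substitute the spectral centroid (the magnitude-weighted mean frequency) computed with a naive DFT, and the test signals are synthetic tones invented for illustration:

```python
import math

SR = 16_000  # sample rate in Hz (illustrative choice)

def spectral_centroid(x):
    """Magnitude-weighted mean frequency in Hz via a naive DFT; a rough
    stand-in for the psychoacoustic sharpness metric."""
    n = len(x)
    num = den = 0.0
    for k in range(1, n // 2):
        re = sum(s * math.cos(2 * math.pi * k * t / n) for t, s in enumerate(x))
        im = sum(-s * math.sin(2 * math.pi * k * t / n) for t, s in enumerate(x))
        mag = math.hypot(re, im)
        num += (k * SR / n) * mag
        den += mag
    return num / den if den else 0.0

n = 512
# Normal: a 187.5 Hz hum. Anomalous: the same hum plus a 3 kHz squeal.
normal = [math.sin(2 * math.pi * 187.5 * t / SR) for t in range(n)]
anomalous = [s + 0.5 * math.sin(2 * math.pi * 3000 * t / SR)
             for t, s in enumerate(normal)]

delta = spectral_centroid(anomalous) - spectral_centroid(normal)
explanation = (f"anomaly is {'sharper' if delta > 0 else 'duller'} "
               f"than normal ({delta:+.0f} Hz centroid shift)")
```

Mapping the score difference back to a word like "sharper" is what turns a numeric detector output into a statement a maintenance worker can act on.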

By linking anomalies to specific physical characteristics, these methods significantly enhance maintenance workflows. They can improve inspection reporting, support training for non-experts, and facilitate root-cause analysis, turning complex acoustic data into actionable insights.


Conclusion: Impact and future directions

The integration of audio, text, and AI opens new horizons for industrial technology. By combining these modalities, we are moving toward a future defined by:

  • Predictive maintenance ecosystems powered by generative AI and LLMs.
  • Enhanced interpretability for compliance and safety audits.

Together, these technologies pave the way for physical AI systems that continuously listen, interpret, and act—turning industrial assets into collaborators in their own maintenance. At Hitachi, these innovations will contribute to HMAX by Hitachi, a suite of next-generation solutions that brings the power of AI to societal infrastructure, driving value from raw data to actionable insights, and ultimately to business transformation. By pushing the boundaries of what AI can hear and explain, we are building a more resilient and reliable industrial future.

Acknowledgements

I would like to express my gratitude to my colleagues, Harsh Purohit, Ryoya Ogura, Kota Dohi, Takashi Endo and Yohei Kawaguchi, with whom this research was conducted.

References

[1] H. Purohit, T. Nishida, K. Dohi, T. Endo and Y. Kawaguchi, “MIMII-Gen: Generative Modeling Approach for Simulated Evaluation of Anomalous Sound Detection System,” EUSIPCO 2025.
[2] H. Purohit, T. Nishida, K. Dohi, T. Endo and Y. Kawaguchi, “MIMII-Agent: Leveraging LLMs with Function Calling for Relative Evaluation of Anomalous Sound Detection,” DCASE Workshop 2025.
[3] R. Ogura, T. Nishida, and Y. Kawaguchi, “Retrieval-Augmented Difference Captioning to Explain UASD,” APSIPA ASC 2025.
[4] T. Nishida, K. Dohi, H. Purohit, T. Endo and Y. Kawaguchi, “Timbre-Based Anomaly Explanation Without Anomalous Training Data,” EUSIPCO 2025.
