Making on-device speech AI practical for the industrial frontline — domain-aware, low-latency, and noise-robust

Natsuo Yamashita

Research & Development Group
Hitachi, Ltd.

Introduction

Industrial environments such as inspection floors, maintenance sites, and manufacturing plants are increasingly seeking to adopt speech AI to improve efficiency. However, practical deployment in these settings remains far more challenging than in consumer applications. To be truly effective on the frontline, speech AI must overcome three primary hurdles: recognizing domain-specific vocabulary, minimizing inference latency on-device, and maintaining robustness against heavy environmental noise.

At Hitachi, we are addressing these challenges to bridge the gap between advanced AI and industrial realities. In this article, we introduce how our recent research outcomes — domain-aware recognition, low-latency execution, and noise-robust preprocessing — jointly contribute to making on-device speech AI a practical reality for industry.

Domain-aware recognition for industrial vocabulary

Industrial speech often contains domain-specific terminology, such as equipment names, inspection codes, and structured expressions including numbers and symbols. These terms are rarely found in general-purpose training data, and conventional Automatic Speech Recognition (ASR) systems frequently misrecognize them. While Large Language Models (LLMs) can be used for Generative Error Correction (GER) of ASR outputs, they often suffer from "overcorrection," where unique technical codes are mistakenly replaced by more common words.

To address this issue, we propose an LLM-based GER framework constrained by phonetic information [1]. By incorporating phoneme-level consistency between the ASR output and candidate corrections, the system selectively corrects errors that are acoustically plausible while suppressing unnecessary rewrites. To further bolster performance, we utilize an LLM-driven synthetic data generation pipeline to create diverse training examples from a list of rare domain words (Figure 1). This constraint is particularly effective for rare words and structured expressions, such as numeric sequences and domain-specific codes, where overcorrection can significantly reduce user trust. As a result, the proposed approach improves reliability while preserving the stability required for industrial documentation and reporting workflows.
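The selection step behind this constraint can be illustrated with a minimal sketch. This is not the implementation from [1]: `toy_g2p`, the similarity threshold, and the candidate strings are all hypothetical stand-ins for a real grapheme-to-phoneme converter and real LLM rewrite candidates.

```python
from difflib import SequenceMatcher

def toy_g2p(text):
    # Hypothetical stand-in for a real grapheme-to-phoneme converter:
    # here we simply keep lowercase alphanumerics as pseudo-phonemes.
    return [c for c in text.lower() if c.isalnum()]

def phonetic_consistency(a, b):
    # Similarity of two pseudo-phoneme sequences, in [0, 1].
    return SequenceMatcher(None, toy_g2p(a), toy_g2p(b)).ratio()

def constrained_correction(asr_hyp, llm_candidates, threshold=0.6):
    """Accept an LLM rewrite only if it stays phonetically close to the ASR
    hypothesis; otherwise keep the original output to avoid overcorrection."""
    best, best_score = asr_hyp, threshold
    for cand in llm_candidates:
        score = phonetic_consistency(asr_hyp, cand)
        if score >= best_score:
            best, best_score = cand, score
    return best

# The phonetically consistent rewrite of a spelled-out code is accepted,
# while a fluent but acoustically implausible rewrite is rejected.
print(constrained_correction("valve a 1 2", ["valve A12", "valve a dozen"]))
# prints "valve A12"
```

The key design choice is the fallback: when no candidate clears the phonetic threshold, the system returns the ASR output unchanged, which is exactly the behavior that protects rare codes from being "corrected" into common words.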

[Figure 1]

On-device speech recognition with reduced inference latency

In industrial settings, inspection phrases and commands must be processed locally on handheld devices to ensure privacy and offline availability. However, generating text from speech is computationally expensive because standard ASR models predict each token one by one.

To accelerate this, we utilize speculative decoding—a technique in which the system first "guesses" a sequence of tokens and then uses the main ASR model to verify them all at once. While traditional methods require a secondary "draft model" to produce these guesses, we propose a model-free approach using Token Map Drafting [2]. By leveraging the repetitive nature of industrial language, we store common phrases in a simple "token map" (Figure 2). This allows the device to fetch likely token sequences instantly without running an extra AI model. This design significantly reduces latency while preserving the accuracy required for seamless, real-time inspection workflows.
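As an illustration only (hypothetical class and function names, not the code from [2]), a token map can be a prefix lookup over common phrases, and drafted tokens are kept only as long as the ASR model agrees with them:

```python
class TokenMap:
    """Minimal token map: maps a short token context to a likely
    continuation, built offline from frequent domain phrases."""
    def __init__(self, context_len=1, max_draft=4):
        self.context_len, self.max_draft, self.table = context_len, max_draft, {}

    def add_phrase(self, tokens):
        n = self.context_len
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            self.table.setdefault(key, tokens[i + n:i + n + self.max_draft])

    def draft(self, context):
        return self.table.get(tuple(context[-self.context_len:]), [])

def speculative_decode(model_next, token_map, prefix, eos, max_len=32):
    """Greedy decoding with token-map drafts. model_next(out) stands in for
    one ASR decoding step; in a real system all drafted tokens are verified
    in a single batched forward pass, which is where the speed-up comes from."""
    out = list(prefix)
    while out[-1] != eos and len(out) < max_len:
        for t in token_map.draft(out):
            if model_next(out) != t:
                break                       # draft diverges: stop accepting
            out.append(t)
            if t == eos:
                return out
        if out[-1] != eos:
            out.append(model_next(out))     # fall back to one standard step
    return out
```

Because every accepted token is still checked against the model, the output is identical to plain greedy decoding; only the number of sequential decoding steps changes.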

[Figure 2]

Noise-robust preprocessing through end-to-end optimization

Industrial environments are inherently noisy, with background sounds from machinery, ventilation systems, and protective equipment. Such noise affects not only speech recognition accuracy but also upstream processing stages such as voice activity detection (VAD), the critical first step that identifies which segments of audio contain speech. If a VAD module incorrectly clips the start or end of a phrase due to noise, the downstream tasks, such as speech recognition, will inevitably fail.

To overcome this, we investigate a framework that jointly optimizes speech segment detection and the downstream task — in this case, Speech Emotion Recognition (SER) — in an end-to-end manner [3]. Our approach uses a Self-Supervised Learning (SSL) encoder to extract speech representations, which are then passed to VAD and SER modules that are trained together (Figure 3). This allows the segmentation process to adapt to the final task's requirements rather than relying on fixed, independent thresholds.
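The core idea, letting the downstream loss shape segmentation, can be sketched in a few lines. In this toy illustration (a conceptual sketch, not the architecture of [3]), the VAD head outputs a soft speech probability per frame, which differentiably weights the frames pooled for the SER head, so a gradient from the emotion loss could flow back into the segmentation decision:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_vad_pool(frame_feats, vad_logits, eps=1e-8):
    """Differentiable pooling: each frame's SSL feature is weighted by the
    VAD head's speech probability before utterance-level pooling, instead of
    being hard-clipped by a fixed VAD threshold."""
    w = sigmoid(vad_logits)          # frame-wise speech probability
    w = w / (w.sum() + eps)          # normalize to a soft attention over frames
    return (w[:, None] * frame_feats).sum(axis=0)

# With one confidently speech-like frame and one noise-like frame,
# the pooled utterance feature is dominated by the speech frame.
feats = np.array([[1.0, 0.0],    # speech-like frame
                  [0.0, 1.0]])   # noise-like frame
pooled = soft_vad_pool(feats, np.array([8.0, -8.0]))
```

Contrast this with a hard VAD threshold: if noise pushes a speech frame below the cutoff, that frame is irrecoverably dropped before SER, whereas the soft weighting above keeps the decision trainable end to end.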

Experimental results show that this joint optimization is significantly more robust under noisy conditions, where conventional VAD-based pipelines often fail to maintain accuracy. Although originally studied for emotion recognition, this work highlights a broader implication for industrial speech AI: robust preprocessing should be designed together with downstream tasks. Such task-aware design forms the foundation for reliable on-device speech pipelines capable of operating in real-world industrial environments.

[Figure 3]

Conclusion

Making speech AI truly practical for industrial environments requires more than improving a single component. By combining domain-aware recognition, low-latency on-device execution, and noise-robust preprocessing, we move closer to speech interfaces that operators can trust in real-world conditions. These technologies demonstrate how Hitachi’s research bridges the gap between cutting-edge AI and the physical realities of industry. By ensuring that AI is not only accurate but also responsive and resilient to noise, we are enabling safer, faster, and more reliable human–machine interaction at the edge.

Acknowledgements

Part of this article was written with the assistance of Tuan Vu Ho. I would also like to acknowledge our colleagues, Masaaki Yamamoto, Hiroaki Kokubo, and Yohei Kawaguchi, with whom this research was conducted.

References

[1] N. Yamashita, M. Yamamoto, H. Kokubo, Y. Kawaguchi, “LLM-based Generative Error Correction for Rare Words with Synthetic Data and Phonetic Context”, in Proc. Interspeech, pp. 3653-3657, 2025.
[2] T. V. Ho, H. Kokubo, M. Yamamoto, Y. Kawaguchi, “Model-free Speculative Decoding for Transformer-based ASR with Token Map Drafting”, in Proc. EUSIPCO, pp. 361-365, 2025.
[3] N. Yamashita, M. Yamamoto, Y. Kawaguchi, “End-to-End Integration of Speech Emotion Recognition with Voice Activity Detection using Self-Supervised Learning Features”, in Proc. APSIPA, pp. 537-542, 2025.
