Enhancing circuit diagram understanding via near sight correction using VLMs

Remish Leonard Minz

R&D Centre
Hitachi India Pvt. Ltd.

Introduction

When an engineer needs to diagnose a fault or validate a design, the difference between minutes and hours depends upon how quickly a legacy circuit schematic can be interpreted and trusted.

At Hitachi, digital transformation is a core, organization-wide effort to convert deep engineering knowledge and legacy assets into scalable, data-driven capabilities. Electrical schematics represent a particularly high-value knowledge asset within this transformation, as they underpin system design, verification, maintenance, and long-term asset management across our industrial businesses. Much of this information, however, remains locked in image-based and document-centric formats. Moving from static schematics to machine-interpretable representations is therefore essential for enabling automation, analytics, and intelligent decision support across the full lifecycle of complex systems.

Vision Language Models (VLMs) offer a natural foundation for this transition, as they can read images and reason over visual content using language. In practice, however, their effectiveness on electrical schematics is limited. While modern VLMs identify individual components with high accuracy, they struggle to infer how those components are connected. Circuit semantics are encoded in fine-grained, spatially distributed cues, such as wire junctions, terminals, and crossings, that require coordinated attention across multiple regions of the image. Current VLM architectures tend to prioritize salient local features over global connectivity, leading to incomplete or incorrect structural understanding. Recognizing this limitation, my colleagues and I at Hitachi India R&D examined why VLMs underperform on electrical schematics and developed an approach to address this challenge, which we presented at a workshop of the IEEE/CVF International Conference on Computer Vision (ICCV) 2025 [1]. Our analysis builds on and is consistent with prior empirical findings showing that VLMs struggle to reliably capture connectivity-critical visual cues in circuit diagrams [2].

Industrial relevance and approach

Automating the interpretation of circuit diagrams can significantly reduce time, cost, and human error in design verification, maintenance planning, fault diagnosis, and the digitization of legacy schematics. In industrial environments such as those operated by Hitachi, circuit understanding is not an isolated technical task—it directly affects technician productivity, system reliability, and the speed at which engineering knowledge can be reused across the asset lifecycle. The core challenge, therefore, is enabling AI systems to understand not only which components are present in a schematic, but how those components are electrically connected.

This challenge is non-trivial for current Vision Language Models (VLMs). While they can recognize individual components reliably, they struggle with connectivity because circuit semantics are conveyed through fine-grained, spatially distributed cues—such as wire junctions, terminals, and wire crossings that may or may not represent true electrical connections. Accurately interpreting a circuit requires jointly attending to multiple distant regions of an image and reasoning about their relationships, a capability that existing VLMs handle inconsistently.

To address this gap, we developed a framework called Near Sight Correction (NSC). The name reflects the observation that vision language models tend to focus on local visual features while under-attending to long-range connectivity. NSC compensates for this structural short-sightedness by enhancing raw circuit images: before an image is processed by a VLM, it explicitly labels connection-critical key points such as junctions, terminals, and non-connecting crossovers. By making these structural cues explicit, NSC offsets the model’s limited ability to infer global connectivity from subtle local features, enabling more reliable reasoning about how components are connected.

[Image 2]

The NSC pipeline comprises several stages, including component and key-point detection using a YOLOv8-based detector, line and edge extraction, inference of true electrical junctions, and label refinement. The final output is a key-point-aware circuit image that preserves the original schematic while making its connectivity clearer and more machine-interpretable.
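The junction-inference stage can be illustrated with a minimal sketch. The function names and the endpoint-based rule below are illustrative assumptions, not the paper's exact algorithm, and the geometry is restricted to axis-aligned wire segments for brevity: where one segment terminates on another, a wire was deliberately drawn to end there (a true electrical junction), whereas two segments passing through each other without a shared endpoint are a non-connecting crossover.

```python
from itertools import combinations

def _intersection(a1, a2, b1, b2):
    """Meeting point of one horizontal and one vertical axis-aligned segment."""
    if a1[1] == a2[1] and b1[0] == b2[0]:      # a horizontal, b vertical
        h, v = (a1, a2), (b1, b2)
    elif a1[0] == a2[0] and b1[1] == b2[1]:    # a vertical, b horizontal
        h, v = (b1, b2), (a1, a2)
    else:
        return None                            # parallel or oblique: skip
    x, y = v[0][0], h[0][1]
    if (min(h[0][0], h[1][0]) <= x <= max(h[0][0], h[1][0])
            and min(v[0][1], v[1][1]) <= y <= max(v[0][1], v[1][1])):
        return (x, y)
    return None

def classify_meeting_points(segments):
    """segments: list of ((x1, y1), (x2, y2)) wire segments.
    Returns {point: 'junction' | 'crossover'}."""
    labels = {}
    for (a1, a2), (b1, b2) in combinations(segments, 2):
        p = _intersection(a1, a2, b1, b2)
        if p is None:
            continue
        if p in (a1, a2, b1, b2):      # a wire ends here -> true connection
            labels[p] = 'junction'
        else:                          # wires merely pass over each other
            labels.setdefault(p, 'crossover')
    return labels
```

In the real pipeline these decisions start from detector and line-extraction outputs rather than clean coordinates, and the resulting labels are rendered back onto the image as the key-point annotations.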

[Image 3]

Because circuit understanding is fundamentally about connectivity, we evaluate more than just question-answering accuracy. In addition to answering visual questions, we prompt VLMs to generate a Mermaid graph—a lightweight textual graph representation that encodes components as nodes and electrical connections as edges in a structured, machine-readable format. Graph generation provides a direct and verifiable test of structural understanding, making it easier to assess whether a model has correctly inferred the underlying circuit topology.
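To make the target representation concrete, here is a minimal sketch of serializing a circuit into Mermaid form, with components as labeled nodes and electrical connections as undirected edges. The helper name and the exact node/edge syntax chosen here are illustrative; the schema prompted for in the paper may differ.

```python
def circuit_to_mermaid(components, connections):
    """components: {node_id: human-readable label};
    connections: list of (node_a, node_b) electrical connections.
    Returns a Mermaid graph description: nodes first, then undirected edges."""
    lines = ["graph TD"]
    for node_id, label in components.items():
        lines.append(f'    {node_id}["{label}"]')
    for a, b in connections:
        lines.append(f"    {a} --- {b}")   # '---' marks an undirected link
    return "\n".join(lines)

# A simple series loop: battery -> resistor -> LED -> back to battery.
example = circuit_to_mermaid(
    {"V1": "Battery", "R1": "Resistor", "D1": "LED"},
    [("V1", "R1"), ("R1", "D1"), ("D1", "V1")],
)
```

Because the output is plain text with a fixed grammar, a generated graph can be checked edge-by-edge against ground truth, which is what makes graph generation a verifiable probe of structural understanding.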

[Image 4]

We evaluated our approach on three adaptations of the CircuitVQA [3] dataset, which benchmark (a) general circuit question answering, (b) connection-focused question answering, and (c) explicit graph generation. Table 1 compares a baseline pipeline that uses raw circuit images with the NSC-based pipeline that uses key-point-aware inputs, reporting for three vision language models the percentage of images in dataset (c) for which all connections were correctly recovered in the generated graphs. Across all models, the NSC-based pipeline substantially improves graph generation, and these gains in structural understanding carry over to downstream visual question answering (VQA), demonstrating that structured, connectivity-aware visual inputs are critical for reliable circuit interpretation in industrial AI systems.
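The headline metric, the share of circuits whose generated graph recovers every connection, amounts to an exact comparison of undirected edge sets. A minimal sketch, assuming predicted and gold graphs are available as per-circuit edge lists (the function name is illustrative):

```python
def all_connections_correct_rate(predictions, gold):
    """predictions, gold: parallel lists (one entry per circuit) of edge
    lists, where each edge (a, b) is an undirected connection.
    Returns the fraction of circuits whose predicted edge set exactly
    matches the gold edge set."""
    def canon(edges):
        # frozensets make (a, b) and (b, a) compare equal
        return {frozenset(edge) for edge in edges}
    hits = sum(canon(p) == canon(g) for p, g in zip(predictions, gold))
    return hits / len(gold)
```

Exact-match scoring is deliberately strict: a single missed or spurious connection counts the whole circuit as wrong, which is appropriate because one wrong edge changes the circuit's behavior.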

Additional studies described in the original paper revealed the following insights:

  1. Better inputs improve reasoning: Labelled images led to more accurate graph generation and higher F1 scores on circuit VQA, confirming that VLMs reason more effectively when provided with structured inputs.
  2. Component identification vs. connection understanding: Existing models identify components reliably but struggle to recognize how those components are connected.
  3. Claude excels in reasoning: Claude 3.5 Sonnet showed superior performance in connection-based reasoning, while GPT-4o excelled in component-level VQA.
  4. VLMs still struggle with graph reasoning: Even when accurate graphs are generated, some models fail to interpret them correctly during VQA tasks.
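The last point is notable because connectivity questions become mechanical once a textual graph exists. As a hedged sketch, assuming edges are written as `A --- B` (the paper's exact Mermaid schema may differ), a parser plus breadth-first search answers "is X connected to Y?" deterministically:

```python
from collections import defaultdict, deque

def parse_mermaid_edges(mermaid_text):
    """Collect undirected edges from lines of the form 'A --- B'."""
    adj = defaultdict(set)
    for line in mermaid_text.splitlines():
        if "---" in line:
            a, b = (tok.strip() for tok in line.split("---"))
            adj[a].add(b)
            adj[b].add(a)
    return adj

def connected(adj, start, goal):
    """BFS reachability: is there a wire path from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in adj[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return False
```

That some models answer such questions inconsistently even when handed a correct graph suggests the remaining gap is in symbolic graph reasoning, not perception.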
[Image 5]

Laying the foundation for AI tools to assist engineers

We believe this research lays the foundation for AI tools that assist engineers in circuit verification, digitization, and documentation. By labeling key points in schematic images, NSC improves VLM performance across multiple tasks — bringing both academic insights and actionable outcomes for industrial AI systems.

Going forward, we plan to release the datasets from this work and integrate NSC-based workflows into Hitachi’s digital platforms for maintenance, design validation, and technician support. Future directions include combining NSC with reinforcement learning and multi-agent systems to advance automation in diagnostics and design review, and deploying this solution across real-world use cases.

Acknowledgements

The author would like to acknowledge the contributions to this work from research colleagues at Hitachi India R&D: Prateek Mital, Vivek Kumar, Munender Varshney, Thiruvengadam Samon, Nikhil Kulkarni, and Nilanjan Chakravortty.

References

[1] S. Kulkarni, V. Kumar, R.L. Minz, M. Varshney, T. Samon, A. Mitra, N. Kulkarni, N. Chakravortty, P. Mital, and K. Banerjee. Enhancing circuit diagram understanding via near sight correction using VLMs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025.
https://openaccess.thecvf.com/content/ICCV2025W/MMFM/html/Kulkarni_Enhancing_Circuit_Diagram_Understanding_via_Near_Sight_Correction_Using_VLMs_ICCVW_2025_paper.html
[2] P. Rahmanzadehgervi, L. Bolton, M.R. Taesiri, and A.T. Nguyen. Vision language models are blind. In Proceedings of the Asian Conference on Computer Vision (ACCV), pp. 18–34, 2024.
[3] R. Mehta, B. Singh, V. Varma, and M. Gupta. CircuitVQA: A visual question answering dataset for electrical circuit images. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 440–460, Springer, 2024.

This article is a sponsored article.