
Shunichi Akatsuka
Research & Development Division
Hitachi America, Ltd.
Artificial intelligence (AI) is transforming how we address real-world challenges, from diagnosing diseases to personalizing education and enhancing customer service. Reinforcement learning (RL) has already proven its potential by mastering complex board games like Go [1] and reaching grandmaster-level play in real-time strategy games like StarCraft II [2]. But when multiple AI systems need to work together — or even compete — how do we ensure they achieve the best outcomes for everyone involved? This challenge is at the heart of three recent projects my colleagues and I at Hitachi R&D in Japan and the US have been working on in collaboration with Mila – Quebec AI Institute [3].
Training AI to cooperate without being exploited
In two related projects, we tackled the challenge of teaching AI systems to cooperate in competitive environments while avoiding exploitation. In the first approach, Best Response Shaping [4], we introduced a “detective”: an imaginary opponent that observes the agent’s action policy and uses Monte Carlo Tree Search to exploit weaknesses in the agent’s strategy. Figure 1 shows the concept of our method. By training with reinforcement learning against this detective, the agent learns to counter it, enabling it to cooperate effectively with cooperative opponents while refusing to be exploited by adversarial ones.
![Figure 1. Our approach in “Best Response Shaping” [4]](https://d1uzk9o9cg136f.cloudfront.net/f/16783696/rc/2025/02/25/228b59e29d7666e66722f4b888ea05a1d56f6869_large.jpg#lz:orig)
Figure 1. Our approach in “Best Response Shaping” [4]
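To make the detective idea concrete, here is a minimal, self-contained sketch of Best Response Shaping on the iterated prisoner's dilemma. Everything in it is illustrative: the paper trains neural policies and approximates the detective with Monte Carlo Tree Search, while this toy uses a tabular memory-one policy and computes the detective's best response exactly by value iteration, so names like `detective_best_response` are hypothetical rather than taken from the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.96
# Prisoner's dilemma payoffs; row = my action, column = the opponent's.
# Action 0 = cooperate, 1 = defect.
PAYOFF = np.array([[-1.0, -3.0],
                   [ 0.0, -2.0]])

def p_cooperate(theta):
    """Agent policy: P(cooperate | state), one sigmoid logit per state.
    States are the previous joint action (CC, CD, DC, DD) plus a start
    state, enough to represent strategies like tit-for-tat."""
    return 1.0 / (1.0 + np.exp(-theta))

def detective_best_response(p_coop):
    """The 'detective': an exact best response to the agent's current
    policy, found by value iteration (a stand-in for the paper's MCTS)."""
    V = np.zeros(5)
    for _ in range(300):
        Q = np.zeros((5, 2))
        for s in range(5):
            for b in range(2):            # detective's action
                for a in range(2):        # agent's (stochastic) action
                    pa = p_coop[s] if a == 0 else 1.0 - p_coop[s]
                    Q[s, b] += pa * (PAYOFF[b, a] + GAMMA * V[2 * a + b])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)               # deterministic exploiter policy

def train(steps=300, episodes=16, horizon=32, lr=0.1):
    theta = np.zeros(5)
    for _ in range(steps):
        p = p_cooperate(theta)
        br = detective_best_response(p)    # re-fit the detective, then...
        grad = np.zeros(5)
        for _ in range(episodes):          # ...REINFORCE against it
            s, traj, rews = 4, [], []
            for _ in range(horizon):
                a = 0 if rng.random() < p[s] else 1
                b = br[s]
                traj.append((s, a))
                rews.append(PAYOFF[a, b])
                s = 2 * a + b
            G = 0.0
            for (s_t, a_t), r in zip(reversed(traj), reversed(rews)):
                G = r + GAMMA * G          # discounted return-to-go
                # d log pi(a|s) / d theta[s] for a sigmoid policy
                grad[s_t] += ((1.0 - p[s_t]) if a_t == 0 else -p[s_t]) * G
        theta += lr * grad / episodes
    return theta

print("P(cooperate | CC, CD, DC, DD, start):",
      np.round(p_cooperate(train()), 2))
```

The hoped-for outcome in this toy is a tit-for-tat-like policy: one that the detective finds more profitable to cooperate with than to betray. Exact best responses are only feasible because the toy state and action spaces are tiny; making this scale is precisely what the paper's MCTS-based detective is for.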
The second approach, LOQA (Learning with Opponent Q-Learning Awareness) [5], has the agent model its opponents directly, assuming they behave as Q-learning* agents. This allows the agent to dynamically adjust its actions to encourage cooperation and deter exploitation. Figure 2 illustrates the concept behind this approach.
* Q-learning is a simple reinforcement learning algorithm that estimates the value (expected return) of taking a specific action, based on past experience.
![Figure 2. Our approach in “LOQA: Learning with Opponent Q-Learning Awareness” [5]](https://d1uzk9o9cg136f.cloudfront.net/f/16783696/rc/2025/02/25/ec92487b509b294c4c9090306d540915bf4ce891_large.jpg#lz:orig)
Figure 2. Our approach in “LOQA: Learning with Opponent Q-Learning Awareness” [5]
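Here is a similarly small sketch of the LOQA idea in the same iterated prisoner's dilemma setting, again under clearly labeled assumptions. The actual algorithm differentiates the agent's objective through a softmax over the opponent's Q-values; this toy agent instead simulates a Q-learning opponent model a few steps ahead for each candidate action and picks the better one, and for simplicity it reads the opponent's true Q-table, whereas LOQA only assumes the opponent learns by Q-learning. Function names like `lookahead` are illustrative; `q_update` is the standard Q-learning rule mentioned in the footnote above.

```python
import numpy as np

rng = np.random.default_rng(1)
ALPHA, GAMMA, TAU = 0.3, 0.9, 0.2
# Prisoner's dilemma payoffs; row = my action (0 = cooperate), col = theirs.
PAYOFF = np.array([[-1.0, -3.0],
                   [ 0.0, -2.0]])

def softmax(q_row):
    z = np.exp((q_row - q_row.max()) / TAU)
    return z / z.sum()

def q_update(Q, s, b, r, s2):
    """Standard Q-learning: move Q(s, b) toward r + gamma * max_b' Q(s', b')."""
    Q[s, b] += ALPHA * (r + GAMMA * Q[s2].max() - Q[s, b])

def lookahead(Q_opp, s, a_first, depth=8):
    """Agent's estimated return for playing `a_first` now, simulating the
    opponent as a Q-learner whose table keeps updating as play unfolds."""
    Q = Q_opp.copy()
    total, disc, a = 0.0, 1.0, a_first
    for _ in range(depth):
        b = rng.choice(2, p=softmax(Q[s]))     # opponent: softmax on its Q
        total += disc * PAYOFF[a, b]
        s2 = 2 * a + b                         # state = last joint action
        q_update(Q, s, b, PAYOFF[b, a], s2)    # simulate opponent learning
        s, disc = s2, disc * GAMMA
        # assume the agent then acts one-step greedily vs. the updated model
        a = int(np.argmax(PAYOFF @ softmax(Q[s])))
    return total

Q_opp = np.zeros((5, 2))   # the real opponent's Q-table; states are the
s = 4                      # 4 previous joint actions plus a start state
for _ in range(300):
    # choose the action whose simulated effect on the learning opponent
    # yields the best long-run return for the agent (averaged rollouts)
    vals = [np.mean([lookahead(Q_opp, s, a) for _ in range(20)])
            for a in range(2)]
    a = int(np.argmax(vals))
    b = rng.choice(2, p=softmax(Q_opp[s]))
    s2 = 2 * a + b
    q_update(Q_opp, s, b, PAYOFF[b, a], s2)    # the opponent really learns
    s = s2
print("Opponent P(cooperate | CC, CD, DC, DD, start):",
      np.round([softmax(Q_opp[st])[0] for st in range(5)], 2))
```

If the shaping works, the opponent's Q-values come to favor cooperation because the agent's anticipated reactions make defection unattractive, which is the "encourage cooperation and deter exploitation" dynamic described above.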
Both methods were tested in the Iterated Prisoner’s Dilemma and in the Coin Game, a more complex grid-world version of the Prisoner’s Dilemma, and in both settings they successfully balanced cooperation with strategic adaptability.
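For readers unfamiliar with it, below is a minimal Coin Game environment in one common formulation: two players on a small toroidal grid, where picking up any coin earns +1 but taking the other player's coin costs that player 2. The exact grid size, payoffs, and tie-breaking vary across papers, so treat these details as assumptions rather than the precise evaluation setup of [4] and [5].

```python
import numpy as np

class CoinGame:
    """Two players move on a toroidal grid with one coin, colored for one
    of the players. Any pickup gives +1, but taking the other player's
    coin gives that player -2: greedy pickups hurt the group, mirroring
    the prisoner's dilemma in a setting that requires learned behavior."""
    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

    def __init__(self, size=3, seed=0):
        self.size = size
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.pos = [self._random_cell(), self._random_cell()]
        self._spawn_coin()
        return self._obs()

    def _random_cell(self):
        return self.rng.integers(0, self.size, size=2)

    def _spawn_coin(self):
        self.coin = self._random_cell()
        self.owner = int(self.rng.integers(2))     # whose color the coin is

    def step(self, actions):
        rewards = [0.0, 0.0]
        for i, a in enumerate(actions):            # both players move
            self.pos[i] = (self.pos[i] + self.ACTIONS[a]) % self.size
        for i in range(2):                         # player 0 wins ties
            if np.array_equal(self.pos[i], self.coin):
                rewards[i] += 1.0
                if self.owner != i:
                    rewards[self.owner] -= 2.0
                self._spawn_coin()
                break
        return self._obs(), rewards

    def _obs(self):
        return (tuple(self.pos[0]), tuple(self.pos[1]),
                tuple(self.coin), self.owner)

env = CoinGame()
obs = env.reset()
obs, rewards = env.step([0, 3])   # player 0 moves up, player 1 moves right
print(obs, rewards)
```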
A manager for AI systems
In our third project [6], we took inspiration from the way businesses operate. When teams of people work together, there’s often a manager who helps align everyone’s efforts toward a shared goal. We applied this concept to AI systems.
![Figure 3. Architecture of manager AI in multi-agent environment [6]](https://d1uzk9o9cg136f.cloudfront.net/f/16783696/rc/2025/02/25/6e5ade5275fe18770652c7f146ddcd4997d7ac8a_large.jpg#lz:orig)
Figure 3. Architecture of manager AI in multi-agent environment [6]
Figure 3 illustrates the architecture of our framework, in which the manager AI operates alongside the agents in a multi-agent environment. As the orange arrows in the diagram indicate, the manager influences the agents’ decisions by providing additional information and tailored rewards (incentives).
In a simulated supply-chain problem, the manager AI helped independent factories coordinate their actions, resulting in faster deliveries and higher profits for every factory. This framework could have broader applications in optimizing complex systems such as manufacturing workflows or energy grids, where many different players need to cooperate.
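As a rough illustration of the incentive side of this framework, the sketch below places two self-interested Q-learning "factories" in a stage game where greedy choices look better individually but hurt the group. A hypothetical manager mixes a share of the group's total payoff into each agent's reward and searches for the incentive weight that maximizes overall welfare. The environment, the incentive form, and the manager's brute-force search over weights are all simplifying assumptions; the manager in [6] also provides information to the agents and adjusts incentives automatically rather than by grid search.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stage game for two "factories": action 0 = coordinate with the group,
# action 1 = act greedily. Greedy dominates individually (4 > 3, 1 > 0),
# but mutual greed (1 + 1) is far worse for the group than mutual
# coordination (3 + 3), a prisoner's-dilemma-style conflict.
PAYOFF = np.array([[3.0, 0.0],
                   [4.0, 1.0]])   # row = my action, col = the other's

def run(weight, episodes=3000, alpha=0.1, eps=0.1):
    """Two independent eps-greedy Q-learners; each one's reward is its own
    payoff plus `weight` times the group payoff (the manager's incentive).
    Returns the average group welfare per episode."""
    Q = np.zeros((2, 2))          # Q[agent, action]; the game is stateless
    total = 0.0
    for _ in range(episodes):
        acts = [int(rng.integers(2)) if rng.random() < eps
                else int(np.argmax(Q[i])) for i in range(2)]
        r = [PAYOFF[acts[0], acts[1]], PAYOFF[acts[1], acts[0]]]
        group = r[0] + r[1]
        total += group
        for i in range(2):
            shaped = r[i] + weight * group        # manager-shaped reward
            Q[i, acts[i]] += alpha * (shaped - Q[i, acts[i]])
    return total / episodes

# The manager tries a grid of incentive weights and keeps the best one.
best = max(np.linspace(0.0, 1.0, 11), key=run)
print(f"chosen incentive weight: {best:.1f}; "
      f"welfare {run(best):.2f} vs. {run(0.0):.2f} with no incentive")
```

In this toy, any weight above 0.5 makes coordination the dominant choice in the shaped game, so the welfare gap against the no-incentive baseline shows up directly in the printout.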
Summary
All three of these projects share a common goal: making AI systems that are not only smarter but also better at working together. Whether it’s helping them understand their “teammates” or managing their incentives, these innovations can make AI more effective in solving real-world problems.
Imagine logistics companies where autonomous delivery drones cooperate to share airspace and reduce delays, or factories that automatically align production schedules to minimize downtime. Or picture energy grids where AI systems coordinate to distribute power efficiently and reduce waste during peak demand. These methods make such scenarios not just possible but practical.
Acknowledgements
I would like to express my gratitude to Mila – Quebec AI Institute for their invaluable collaboration on this research. My colleagues and I are especially thankful to Professor Aaron Courville for his significant support throughout this research.
References
[1] D. Silver et al., Mastering the game of Go with deep neural networks and tree search, Nature, 529(7587): 484–489, 2016.
[2] O. Vinyals et al., Grandmaster level in StarCraft II using multi-agent reinforcement learning, Nature, 575(7782): 350–354, 2019.
[3] Mila – Quebec AI Institute, Montreal, Quebec, Canada.
Website: https://mila.quebec/en
[4] M. Aghajohari, T. Cooijmans, J.A. Duque, S. Akatsuka, A. Courville, Best Response Shaping, Reinforcement Learning Conference (RLC), Amherst, Massachusetts, August 9–12, 2024.
Link: https://rlj.cs.umass.edu/2024/papers/Paper108.html
[5] M. Aghajohari, T. Cooijmans, J.A. Duque, A. Courville, LOQA: Learning with Opponent Q-Learning Awareness, accepted for ICLR 2024.
Arxiv preprint: https://arxiv.org/abs/2405.01035
[6] S. Akatsuka, Y. Teramoto, A. Courville, Managing multiple agents by automatically adjusting incentives, the 2nd International Workshop on Democracy and AI, Macao S.A.R., 2023.
Arxiv preprint: https://arxiv.org/abs/2409.02960