**Terufumi Morishita**

Research & Development Group,

Hitachi, Ltd.

Ensemble learning (EL) is a method in machine learning that combines predictions from multiple AI models to produce a better prediction. This allows us to apply "higher-performance," more accurate AI to the complex tasks that must be solved to address challenges in society. Consider, for example, a system that detects toxic posts on social media and issues an alert on the content. To build such a system, we usually rely on text-based AI models that predict the class of the textual content of the post (e.g., dis-/mis-information, offensive, sarcastic, and so on). Rather than using a single prediction from a single model, we can combine multiple predictions from various models, for example by taking a vote on the predictions, to produce the final prediction. Such an ensemble prediction is often more accurate than the prediction of a single model, sometimes by a wide margin. Indeed, by combining state-of-the-art deep neural networks with a strong ensemble method, we have recently won 1st place in many international competitions targeting the above use cases *4, *6-*8. These use cases are just a few of the available examples: since the methodology of EL is generic, it can be applied to almost every field where AI models fit in.
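As a minimal sketch of the voting idea described above, the following Python snippet combines the labels predicted by several models into one final label. The function name `majority_vote` and the example labels are hypothetical and chosen only for illustration; they are not taken from our systems.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of models.

    Ties are resolved in favour of the label that appears first
    in `predictions`.
    """
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]


# Hypothetical labels from three text classifiers for a single post.
model_predictions = ["offensive", "sarcastic", "offensive"]
final_prediction = majority_vote(model_predictions)
print(final_prediction)
```

Note that if every model returns the same label, the vote simply returns that label, so an ensemble of identical models cannot improve on a single model.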

Due to its attractive nature, EL has been used in various domains of machine learning. Accordingly, the research community has studied EL extensively, and many empirical methods, e.g., Bagging *9, AdaBoost *10, and RandomForest *11, have been proposed so far. However, EL has not been well understood from a theoretical viewpoint. In particular, the fundamental question about EL, namely what factors make an ensemble method good or bad, has so far remained unanswered.

In this blog post, we introduce our paper on a theoretical framework for EL *1, which was presented at ICML 2022, a top-tier international conference in machine learning. We proposed a fundamental theory that evaluates a given ensemble strategy by a well-grounded set of metrics. The theory answers the above question by revealing the strengths and weaknesses of an ensemble strategy in terms of these metrics. To demonstrate this, we analyzed a powerful ensemble strategy that uses diverse types of deep neural networks.

The open question in EL: What factors make an ensemble strategy good or bad?

The popular answer to this question is that accurate and diverse models lead to better performance. Accuracy is trivially important. For diversity, consider the extreme case where the models have the least possible diversity: every model in the ensemble is identical. In such a case, since all models return the same prediction, the ensemble prediction must be that same prediction. The ensemble does not change the prediction and thus does not boost performance at all. From this extreme case, we can see that the diversity of models is also important.

This intuition is justified by the previous theory *2, *3 (Figure 1, left). That work derived a lower bound on the error rate of a given ensemble strategy, using a variant of Fano's inequality from information theory. The error rate lower bound denotes the best-case error rate, and thus it serves as a proxy for the performance of the strategy. The work then showed that the error rate lower bound decomposes into two terms, relevance I_relev and redundancy I_redun. Relevance is the information-theoretic counterpart of accuracy, and redundancy is the information-theoretic counterpart of (a lack of) diversity.
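To give a feel for how such a bound behaves, here is a small Python sketch of the standard textbook weakening of Fano's inequality, P_e >= (H(Y) - I(X;Y) - 1) / log2(|Y|), with all quantities in bits. This is an assumption made for illustration: the exact variant and constants used in the previous theory and in our paper may differ, and the function name and the numbers below are hypothetical.

```python
import math


def fano_error_lower_bound(label_entropy, mutual_information, num_classes):
    """Weakened Fano bound: P_e >= (H(Y) - I(X;Y) - 1) / log2(|Y|).

    All information quantities are in bits. The result is clipped at 0
    because a negative lower bound on an error rate carries no information.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    bound = (label_entropy - mutual_information - 1.0) / math.log2(num_classes)
    return max(0.0, bound)


# Hypothetical setting: 4 balanced classes, so H(Y) = 2 bits.
uninformative = fano_error_lower_bound(2.0, 0.0, 4)
informative = fano_error_lower_bound(2.0, 0.5, 4)
print(uninformative, informative)
```

The more information the predictions carry about the label, the smaller the bound becomes, which is why the information term is the quantity worth decomposing.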

In summary, (i) accuracy and diversity have been considered the important factors in EL, and (ii) the previous theory is promising since it derives these two factors in a theoretically valid manner.

The theory that we are proposing

So, are accuracy and diversity enough to explain EL?

To consider this, let's take a case where we use a simple voting ensemble with diverse models on a classification task. If we use diverse models, some models may predict the correct label for a dataset instance while other models predict incorrect labels for the same instance. Sometimes the incorrect models outvote the correct models, so that the voting prediction becomes incorrect. In that case, a certain amount of information is lost when the model predictions are combined into a voting prediction, since some of the model predictions were in fact correct. This shows that accuracy and diversity are not enough, and that we must also account for such information loss.

So, we proposed a generalized ensemble theory (Figure 1, right) that incorporates this information loss, which we named "combination loss" I_combloss. We used the original form of Fano's inequality to derive a tighter error rate lower bound for an ensemble strategy. Then, we showed that the new error rate lower bound decomposes into three terms: relevance, redundancy, and the combination loss.
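The information loss can be made concrete with a toy example. The Python sketch below measures, on a tiny hypothetical dataset, how much information about the label is carried by the full set of model outputs and how much survives after voting. It assumes the loss is measured as the gap between the two empirical mutual informations; the formal definition of I_combloss in our paper may differ, and all names and data here are illustrative only.

```python
import math
from collections import Counter


def entropy(samples):
    """Empirical Shannon entropy, in bits, of a list of hashable samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """Empirical I(X;Y) = H(X) + H(Y) - H(X,Y), in bits."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def vote(outputs):
    """Majority vote over one instance's model outputs."""
    return Counter(outputs).most_common(1)[0][0]


# Hypothetical data: the first model is always right, while the second
# and third models agree with each other and are wrong half of the time.
labels = [0, 0, 1, 1]
outputs = [(0, 0, 0), (0, 1, 1), (1, 1, 1), (1, 0, 0)]

voted = [vote(o) for o in outputs]
info_before = mutual_information(outputs, labels)  # all model outputs
info_after = mutual_information(voted, labels)     # after voting
information_lost = info_before - info_after
print(info_before, info_after, information_lost)
```

In this toy case the two weaker models outvote the accurate one on half of the instances, so the voted prediction carries less information about the label than the raw model outputs did.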

The proposed theory is promising as a fundamental theory of EL in the following sense: (i) it recovers the metric that was missing from the previous theory, and (ii) as we will see in the experimental sections below, it explains various behaviors of actual ensemble strategies well.

Experimental verification of the proposed theory

Now, let's verify the proposed theory by experiments. To this end, we made the following hypothesis about how the proposed theory describes actual ensemble strategies. The error rate lower bound denotes the best-case error rate. Thus, a strategy with a smaller error rate lower bound has a higher chance of having a smaller error rate. Therefore, we expect the theoretical error rate lower bounds calculated by the proposed theory to correlate with the actual error rates.

To check this hypothesis, we built various ensemble strategies and measured their error rates and error rate lower bounds (Figure 2). As can be seen, while the error rate lower bounds given by the previous theory (Lemma 2.3, left) do not correlate with the actual error rates, those given by the proposed theory (Lemma 3.1, right) do. This result supports the proposed theory.

Please refer to our paper for the detailed settings of these experiments. The paper also reports other verification experiments.

Unveiling ensemble strategies through the proposed theory

Next, we demonstrated how the proposed theory reveals the strengths and weaknesses of ensemble strategies through the three metrics.

Figure 3 shows each metric value for various ensemble strategies. For intuitive understanding, we show the “per-model” metric values (i_relev, i_redun, i_combloss), which are the metric values divided by the number of models in the ensemble. Each row shows the metric values for a specific ensemble strategy. In this blog, we focus mainly on the most powerful strategy, shown in the "Hetero-DNNs" row. The Hetero-DNNs strategy uses different types of DNNs. Please refer to our paper for the full analysis of the other ensemble strategies.

Firstly, Hetero-DNNs has the smallest i_relev value in the column; that is, Hetero-DNNs has the least accurate models. This is because Hetero-DNNs uses not only the strongest DNNs but also weaker DNNs. The smallest i_relev of Hetero-DNNs contrasts with the largest i_relev of Random-Seed, which uses only the strongest type of DNN and its seed variants.

Next, Hetero-DNNs has the smallest i_redun; that is, it has the most diverse models, thanks to the diverse types of DNNs. This again contrasts with the largest i_redun of Random-Seed, which uses only a single type of DNN.

Finally, Hetero-DNNs has a large combination loss (i_combloss) when the models are combined by voting ("voting"). However, the combination loss becomes smaller when the models are combined by a logistic regression stacking meta-estimator ("LogR"). A stacking meta-estimator makes a prediction for a given dataset instance using the base model predictions as its input. This decrease in combination loss again contrasts with the Random-Seed case, where the meta-estimator did not reduce the combination loss. These phenomena can be explained as follows. Since the models of Hetero-DNNs are diverse, their accuracy varies greatly. If we take a simple vote over such models, the inaccurate models sometimes outvote the accurate models and produce an incorrect voting prediction. A meta-estimator, however, can put more weight on accurate models and less on inaccurate models to ameliorate this issue (see our paper for details). This is reflected in a reduced combination loss. For Random-Seed, since the models are less diverse and have similar accuracy, the model weighting of the meta-estimator is not that useful.
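The effect of weighting can be illustrated with a deliberately simplified Python sketch. Instead of the logistic regression stacking meta-estimator used in our experiments, it uses a plain weighted vote whose weights are the log-odds of each model's accuracy on a held-out set; this is only a stand-in chosen to keep the example short. The model names, the validation data, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def log_odds_weight(acc, eps=1e-6):
    """Weight that grows with accuracy; clipped to avoid division by zero."""
    acc = min(max(acc, eps), 1.0 - eps)
    return math.log(acc / (1.0 - acc))


def weighted_vote(outputs, weights):
    """Return the label with the largest total weight."""
    scores = defaultdict(float)
    for label, weight in zip(outputs, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


def plain_vote(outputs):
    """Simple voting is a weighted vote with equal weights."""
    return weighted_vote(outputs, [1.0] * len(outputs))


# Hypothetical held-out predictions: one strong model, two weak models
# that make the same mistakes.
val_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
strong = [0, 1, 0, 1, 0, 1, 0, 1, 0, 0]
weak_a = [0, 1, 0, 1, 0, 1, 1, 0, 1, 0]
weak_b = list(weak_a)

weights = [log_odds_weight(accuracy(m, val_labels))
           for m in (strong, weak_a, weak_b)]

# A new instance on which the strong model disagrees with both weak models.
new_outputs = (1, 0, 0)
print(plain_vote(new_outputs), weighted_vote(new_outputs, weights))
```

With equal weights the two weak models outvote the strong one, whereas the accuracy-based weights let the strong model prevail. When all models have similar accuracy, the weights are nearly equal and the two schemes behave almost identically, which mirrors the Random-Seed observation above.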

In summary, Hetero-DNNs has less accurate but more diverse models, and when the models are combined by the meta-estimator, the combination loss is also small. The Hetero-DNNs strategy is indeed powerful: with this strategy built on top of state-of-the-art NLP models, we took multiple 1st place awards in international competitions *4-*8.

As explained in this section, the proposed theory can reveal the strengths and weaknesses of a given ensemble strategy. This answers the open question raised earlier: "What factors make an ensemble strategy good or bad?"

Conclusion

We proposed a fundamental theory of EL, a technology that uses multiple AI models to achieve higher performance. The theory advances our understanding of EL and will give us insights into designing more powerful ensemble strategies. Going forward, we will apply such ensemble strategies to Lumada AI technologies.

References

*1 T. Morishita, G. Morio, S. Horiguchi, H. Ozaki & N. Nukaga. (2022). Rethinking Fano's Inequality in Ensemble Learning. ICML.

*2 G. Brown, An information theoretic perspective on multiple classifier systems. In International Workshop on Multiple Classifier Systems, pp. 344–353, 2009.

*3 Z-H. Zhou and N. Li, Multi-information ensemble diversity. In Multiple Classifier Systems, pp. 134–144, 2010.

*4 Hitachi News Release (2 December 2020): Hitachi wins first place in multiple tasks of CoNLL 2020 Shared Task and SemEval 2020, international competitions for natural language processing.

*5 T. Morishita, G. Morio, H. Ozaki, and T. Miyoshi. 2020. Hitachi at SemEval-2020 Task 3: Exploring the Representation Spaces of Transformers for Human Sense Word Similarity. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 286–291, Barcelona (online). International Committee for Computational Linguistics.

*6 G. Morio, T. Morishita, H. Ozaki, and T. Miyoshi. 2020. Hitachi at SemEval-2020 Task 11: An Empirical Study of Pre-Trained Transformer Family for Propaganda Detection. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 1739–1748, Barcelona (online). International Committee for Computational Linguistics.

*7 T. Morishita, G. Morio, H. Ozaki and T. Miyoshi. 2020. Hitachi at SemEval-2020 Task 7: Stacking at Scale with Heterogeneous Language Models for Humor Recognition. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 791–803, Barcelona (online). International Committee for Computational Linguistics.

*8 T. Morishita, G. Morio, S. Horiguchi, H. Ozaki, and T. Miyoshi. 2020. Hitachi at SemEval-2020 Task 8: Simple but Effective Modality Ensemble for Meme Emotion Recognition. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 1126–1134, Barcelona (online). International Committee for Computational Linguistics.

*9 L. Breiman, “Bagging predictors,” Machine Learning, 24(2), pp. 123-140, 1996.

*10 Y. Freund and R. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, 55(1), pp. 119-139, 1997.

*11 L. Breiman, “Random Forests,” Machine Learning, 45(1), pp. 5-32, 2001.