**Terufumi MORISHITA**

Research & Development Group

Hitachi, Ltd.

Recent large language models (LLMs) have been shown to skillfully solve a wide range of tasks, foreshadowing the realization of artificial intelligence (AI) as “a machine that thinks like humans” [1]. To realize such AI, two elements have long been considered important: knowledge and reasoning [2-7]. In the context of natural language processing, “knowledge” refers to a collection of facts about the world, such as “things with mass generate a gravitational field” and “the Earth has mass.” “Reasoning,” on the other hand, is a form of thinking that combines multiple pieces of knowledge according to certain rules to gain new knowledge. For example, by applying the reasoning rule ∀x F(x)→G(x), F(a) ⇒ G(a) (F = ”has mass”, G = ”generates a gravitational field”, a = ”the Earth”) to the aforementioned knowledge, we gain the new knowledge that “the Earth generates a gravitational field.”

Recent observations suggest that LLMs solve tasks primarily by analogy from existing knowledge rather than by pure reasoning [8-10]. For instance, observations such as “being able to solve coding tests from past years but not the latest year” or “being able to solve famous arithmetic problems as they are, but not when the numbers are changed” imply that what seems like “reasoning” may actually be retrieval of knowledge “memorized” during pre-training. This knowledge bias exists even in state-of-the-art LLMs such as GPT-4 [11-14]. If LLMs struggle with reasoning, this poses a challenge for the realization of generalizable AI that can solve problems truly unknown to humans, because knowledge alone can solve only tasks “that have been seen before.” To achieve generalizable AI, we need further research to improve the reasoning capabilities of LLMs.

This blog introduces one such approach taken by my colleagues and me, which we presented at ICML 2023. Our approach was to teach LLMs logical reasoning using many synthetically generated training examples. Specifically, we proposed a framework named FLD (Formal Logic Deduction), which generates diverse patterns of deductive reasoning examples, as illustrated in Figure 1, based on formal logic theory. Given a set of facts and a hypothesis, an LM is required to generate proof steps that (dis-)prove the hypothesis, together with an answer.

We first investigated how well current LLMs, such as GPT-4, solve logic, specifically deductive reasoning problems, and found that even GPT-4 can solve only about half of the problems, suggesting that pure logical reasoning, isolated from knowledge, is still challenging for LLMs. Next, we empirically verified that training on FLD is effective for acquiring robust logical reasoning ability. The FLD corpora and code can be found here on GitHub, and they serve both as challenging benchmarks and as learning resources.

What is logic?

To consider how we should design the logical corpus, we first looked into the formal theory of logic, also known as symbolic logic or mathematical logic. We then considered the following single-step deductive reasoning:
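Using the gravitational-field example from the introduction, this step can be rendered, with the premises above the bar and the conclusion below it, as:

```latex
\frac{\text{Things with mass generate a gravitational field.} \quad \text{The Earth has mass.}}{\text{The Earth generates a gravitational field.}} \tag{1}
```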

This deduction step derives the conclusion, written under the bar, from the two premises.

We can abstract Eq. (1) using symbols as:
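With F = ”has mass”, G = ”generates a gravitational field”, and a = ”the Earth”, the abstraction is:

```latex
\frac{\forall x\, \big(F(x) \rightarrow G(x)\big) \qquad F(a)}{G(a)}
```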

A deduction step of this form is called modus ponens.

While modus ponens is the most intuitive deduction step, many others exist.

For example, a famous syllogism is:
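One standard rendering of the syllogism is:

```latex
\frac{\forall x\, \big(F(x) \rightarrow G(x)\big) \qquad \forall x\, \big(G(x) \rightarrow H(x)\big)}{\forall x\, \big(F(x) \rightarrow H(x)\big)}
```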

The example below formally defines the meaning of the conjunction ∧:
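This rule, conjunction introduction, states that from two premises we may conclude their conjunction:

```latex
\frac{F \qquad G}{F \land G}
```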

Of course, we can consider invalid steps such as:
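One illustrative invalid pattern (offered here as an example) is affirming the consequent, which wrongly reverses modus ponens:

```latex
\frac{F \rightarrow G \qquad G}{F}
```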

From these examples, we can draw two important points about deductive reasoning. First, deductive reasoning is a form of thought in which a conclusion is derived from a set of premises following specific rules. Second, since we can consider infinitely many patterns of formulas as premises and conclusions, there are infinitely many patterns of such rules (both valid and invalid).

Next, we considered multistep deductions. Figure 2 shows how a syllogism can be derived by multistep deduction constructed from other “atomic” rules.
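As a sketch (one possible reconstruction of the derivation illustrated in Figure 2), the syllogism decomposes into atomic rules as follows:

```latex
% Deriving the syllogism from atomic natural-deduction rules (a sketch).
\begin{align*}
&\forall x\,(F(x)\to G(x))      && \text{premise}\\
&\forall x\,(G(x)\to H(x))      && \text{premise}\\
&F(a)\to G(a)                   && \forall\text{-elimination}\\
&G(a)\to H(a)                   && \forall\text{-elimination}\\
&F(a)                           && \text{assumption}\\
&G(a)                           && \to\text{-elimination (modus ponens)}\\
&H(a)                           && \to\text{-elimination (modus ponens)}\\
&F(a)\to H(a)                   && \to\text{-introduction, discharging } F(a)\\
&\forall x\,(F(x)\to H(x))      && \forall\text{-introduction}
\end{align*}
```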

Indeed, in formal logic, there is a set of atomic rules called the axioms, and the following is known:

Theorem: Completeness of first-order predicate logic [16]

Any valid rule is derivable by multistep deduction constructed from the axioms. Further, any rule derivable by multistep deduction constructed from the axioms is valid.

Here, the completeness theorem gives us an important insight for corpus design: multistep deduction constructed from the axioms can express diverse patterns of deduction examples.

Generating deduction corpora

As discussed in the previous section, multistep deduction from the axioms yields diverse patterns of deduction examples.

We therefore introduce FLD, which generates deductive reasoning examples constructed from the axioms (see the overview in Figure 3 below).

FLD first constructs a deduction example using the axioms, in the form of a tree expressed in logical formulas (“Proof Tree Generator”). It then assigns natural language to each formula (“NL Assignments”). These assignments are random except for the logical structure, so that referring to existing knowledge cannot help solve the task. Finally, the example is converted into a deduction example (“Deduction Example”), which is composed of a set of facts, a hypothesis, a proof that (dis-)proves the hypothesis based on the facts, and an answer label. Given the set of facts and the hypothesis, the LLM is required to generate proof steps that (dis-)prove the hypothesis, together with an answer.
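The pipeline can be illustrated with a minimal, hypothetical sketch (not the actual FLD implementation): build a one-step proof tree from an atomic rule, assign random nonsense predicates and constants so that world knowledge cannot help, and emit the facts, hypothesis, proof, and answer label.

```python
import random

# Minimal sketch of an FLD-style generator (hypothetical; names and output
# format are illustrative, not those of the real FLD code).
# Step 1: a one-step "proof tree" built from an atomic rule (modus ponens):
#   premises: Ax (F(x) -> G(x)), F(a);  conclusion: G(a).
# Step 2: NL assignment with nonsense words, so the example cannot be
#   solved by recalling world knowledge.
# Step 3: emit the deduction example: facts, hypothesis, proof, answer.

PREDICATES = ["wumpus", "glorp", "snarfle", "quimble"]
CONSTANTS = ["blicket", "dax", "fep"]

def generate_example(rng: random.Random) -> dict:
    f, g = rng.sample(PREDICATES, 2)       # F, G (distinct predicates)
    a = rng.choice(CONSTANTS)              # constant a
    facts = [
        f"fact1: every {f} is a {g}",      # Ax (F(x) -> G(x))
        f"fact2: {a} is a {f}",            # F(a)
    ]
    hypothesis = f"{a} is a {g}"           # G(a)
    proof = "fact1 & fact2 -> hypothesis"  # one modus ponens step
    return {"facts": facts, "hypothesis": hypothesis,
            "proof": proof, "answer": "PROVED"}

example = generate_example(random.Random(0))
print(example)
```

A full generator would recurse over many atomic rules to build deeper trees, and would also produce DISPROVED and UNKNOWN examples by negating or omitting facts; this sketch shows only the PROVED case for a single rule.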

How well do LLMs solve logic without training on deduction corpora?

Using the FLD corpora, we first investigated how well current LLMs solve logic. Table 1 shows the performance of different LLMs under a 10-shot setting.

Table 1. LLM performance in a 10-shot in-context learning setting

We used two metrics: proof accuracy and answer accuracy. We also used two corpora: FLD and a more challenging version, FLD★, which includes deduction examples with deeper proof trees. As can be seen, even the most powerful LLM, GPT-4, performed poorly on both corpora in terms of both metrics.

How well do LMs solve logic after training on deduction corpora?

Table 1 also shows that T5, a much smaller LM fine-tuned on 30,000 examples, while not perfect, performed better than GPT-4. These results suggest that training on FLD can be beneficial.

We then conducted further experiments to compare the effectiveness of FLD with other deduction corpora (Table 2).

Table 2. Few-shot proof accuracies of provers transferred among deduction corpora that use different sets of deduction rules

We trained an LM (T5) on one deduction corpus (the “source corpus”) and measured its performance on the other corpora (the “target corpora”). The LM trained on FLD performed the best on average, i.e., it transferred the most robustly to the other corpora. Since the corpora in Table 2 differ in the sets of deduction rules used in their proofs, this result suggests that the LM trained on FLD generalized the best to other rules. This strong generalizability can be attributed to the axioms, which theoretically generalize the best to other deduction rules.

Conclusion

Towards the realization of AI that can reason logically, we proposed a synthetic corpus-based method to teach LLMs logical reasoning and verified its effectiveness through experiments. In the future, we will investigate whether the logical reasoning ability obtained by this approach will generalize to various tasks.

References

[1] J. McCarthy, M. Minsky, and N. Rochester, A proposal for the Dartmouth summer research project on artificial intelligence. 1955.

[2] J.W. McCarthy, Programs with common sense. In Proceedings of the Teddington Conf. on the Mechanization of Thought Processes, pp. 75–91, 1959.

[3] J. Weizenbaum, ELIZA – a computer program for the study of natural language communication between man and machine. Communications of the ACM, 9(1):36–45, 1966.

[4] T. Winograd, Procedures as a representation for data in a computer program for understanding natural language, Cambridge: MIT. 1971

[5] A. Colmerauer and P. Roussel, The birth of Prolog, in T.J. Bergin and R.G. Gibson (eds.), History of Programming Languages II, pp. 331-367, ACM, New York, Jan. 1996.

[6] E.H. Shortliffe, Computer-based Medical Consultations: Mycin. Elsevier, 1976.

[7] D.B. Lenat and R.V. Guha, Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project. Addison-Wesley, 1990.

[8] Y. Razeghi, R.L. Logan IV, M. Gardner and S. Singh, Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pp. 840–854, 2022.

[9] D. Hodel and J. West, Response: Emergent analogical reasoning in large language models, ArXiv, abs/2308.16118, 2023.

[10] I. Dasgupta, A.K. Lampinen, S.C.Y. Chan, H.R. Sheahan, A. Creswell, D. Kumaran, J.L. McClelland and F. Hill, Language models show human-like content effects on reasoning tasks, ArXiv, abs/2207.07051, 2023.

[11] Z. Wu, L. Qiu, A. Ross, E. Akyürek, B. Chen, B. Wang, N. Kim, J. Andreas and Y. Kim, Reasoning or reciting? Exploring the capabilities and limitations of language models through counterfactual tasks, ArXiv, abs/2307.02477, 2023.

[12] H. Liu, R. Ning, Z. Teng, J. Liu, Q. Zhou and Y. Zhang, Evaluating the logical reasoning ability of ChatGPT and GPT-4, ArXiv, abs/2304.03439, 2023.

[13] N. Dziri, X. Lu, M. Sclar, X.L. Li, L. Jiang, B.Y. Lin, P. West, C. Bhagavatula, R.L. Bras, J.D. Hwang, S. Sanyal, S. Welleck, X. Ren, A. Ettinger, Z. Harchaoui and Y. Choi, Faith and fate: Limits of transformers on compositionality, ArXiv, abs/2305.18654, 2023.

[14] M. Mitchell, Can large language models reason? Blog post, https://aiguide.substack.com/p/can-large-language-models-reason, September 2023.

[15] OpenAI. GPT-4 Technical Report. ArXiv, abs/2303.08774, 2023.

[16] K. Gödel, Über die Vollständigkeit des Logikkalküls. PhD thesis, University of Vienna, 1930.