AI’s Ability to Reason: Statistics vs. Logic

Tom Winans
Published in Better Programming
Nov 8, 2023 · 7 min read


As humans, we are enamored of the idea that we can make a machine reason “like we do”. We’d like to think that the latest AI innovations, e.g., large language models, have reached the point where machines can be made to reason. Our imaginations are captured, to say the least.

But the fact of the matter is that emerging AI, for all the awesomeness of its performance, does not reason: it “just” connects statistical dots, but not always reliably. To unlock its potential, we must recognize that its “reasoning” is based on probability, not logic, and this should guide us in how and with what guardrails we use it.

We want to believe that what modern AI produces is 100% correct, and that these models, or algorithms, can respond in thoughtful ways, much as humans ideally would. Neither belief is correct. While today’s computing power makes it possible to construct models using billions and possibly trillions of parameters that, in turn, make it possible for models to statistically connect dots and form responses in ways that feel human, these exciting large language models are no better than the data on which they’re trained. They’re not sentient. They haven’t formed some new sense of morality. They don’t think using the Socratic method. They don’t initiate their own conversations. It may look like all the above is true, but it’s not.

As a simple existence proof that today’s AI does not reason with logic, consider the following basic algebra problem, which I gave to Bing/OpenAI GPT to solve. The gist of the problem, shown in the figure below, is that there are two rectangles, each having the same height (though this detail is not clearly stated in the sixth-grade math text it comes from) but different widths. The area of each rectangle is given. The rectangles are positioned in the math text to suggest that they can be aggregated into a larger rectangle whose width is the sum of the widths of the smaller rectangles, perhaps as a hint toward finding the length. The request to find the length (height) and the widths is a test of whether OpenAI’s GPT, via Bing, would notice whether there are enough equations to match the unknowns. There aren’t. GPT didn’t discover that the number of equations is one too few. Instead, it attempted to find the length and widths, and it responded as though it had successfully solved the problem.

Algebra problem with 2 equations and 3 unknowns
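The original figure isn’t reproduced here, but working backward from the numbers in GPT’s output, the setup appears to be (my reconstruction) a shared length L and two widths W1 and W2:

L * W1 = 12
L * W2 = 27
L * (W1 + W2) = 39

Three equations are written down, but the third is just the sum of the first two, so only two are independent for the three unknowns.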

Everything started to go awry when GPT missed that the number of equations is insufficient for the number of unknowns; the third equation given above is simply a function of the other two. The problem in GPT’s output can be seen starting with the line beginning with “Substituting L and W2”. Specifically, the simplification of 12 / W1 * (W1 + 27 * W1 / 12) does not represent a correct application of the Distributive Property of Arithmetic. For illustration, the problem area is outlined in red in the figure below.

Incorrect application of the Distributive Property of Arithmetic
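For reference, a correct application of the Distributive Property reduces that expression to an identity:

12 / W1 * (W1 + 27 * W1 / 12) = (12 / W1) * W1 + (12 / W1) * (27 * W1 / 12) = 12 + 27 = 39

which is consistent with the 39 = 39 reduction discussed below.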

GPT tried to correct itself when I pointed out that the Distributive Property wasn’t correctly applied, but things only got worse.

Hmmm … not sure what happened here

I couldn’t explain GPT’s processing errors to myself until I focused more carefully on the simplification problem as shown below:

Blanks seem to matter. When removed, the Distributive Property is correctly applied, and the equation is rightly simplified

What appears to be happening is that the blanks in the expression as it was first given to GPT, the very rendering GPT itself produced earlier, contributed for statistical reasons to GPT’s inability to simplify the equation correctly (in truth, the problem could be caused by functionality outside the GPT model as easily as inside it). In the second request to simplify, however, I removed the blanks around fractions. I hoped that the other blanks would not be misinterpreted, and I hoped GPT had somehow been trained regarding Arithmetic’s Distributive Property (apparently it has). Omitting these blanks appears to have helped GPT analyze the equation and properly simplify and reduce it (39 = 39). Note that in the second simplification request, while I removed blanks from the equation to simplify, GPT created its response with the spaces restored. This suggests that the equation as I supplied it, i.e., with some blanks removed, may have been the actual text GPT analyzed when forming its last response (had it not been, I assume the output of the second request would have mirrored the first’s).
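One plausible, though unverified, mechanism is tokenization: language models see text as token sequences, and whitespace changes the tokens, so the “same” expression with and without blanks is statistically a different input. The sketch below uses the open-source tiktoken library purely as a stand-in (the tokenizer Bing/GPT actually applies may differ):

```python
import tiktoken

# cl100k_base is a GPT-4-era byte-pair encoding; used here only for illustration.
enc = tiktoken.get_encoding("cl100k_base")

for text in ("12 / W1 * (W1 + 27 * W1 / 12)", "12/W1*(W1+27*W1/12)"):
    tokens = enc.encode(text)
    print(f"{text!r} -> {len(tokens)} tokens: {[enc.decode([t]) for t in tokens]}")
```

The two spellings produce different token sequences, which is at least consistent with the behavior observed above.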

This illustrates that large language models, and GPT as an instance of them here, operate using statistics, not (symbolic) logic and a true understanding of mathematics (where we’d assume the implementation would understand the meaning of blanks and operator precedence). This should shape how we view such models: they are tools we, as humans, can use to amplify our own capabilities, but we need to stay engaged when we use them so we can handle answers that may not be correct. And we need to structure prompts so that we can debug; we may need to be more precise in the way we say things as we structure our approaches to leverage these large language models and other AI innovations in what we do. Fine-tuning will not prove sufficient.
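By contrast, a symbolic engine parses the expression by grammar rather than by token statistics, so the blanks are irrelevant and the reduction to 39 falls out mechanically. A minimal sketch using the SymPy library, with the exact expression from the example above:

```python
import sympy as sp

# The same expression, with and without the extra blanks;
# a symbolic parser is indifferent to whitespace.
spaced = sp.sympify("12 / W1 * (W1 + 27 * W1 / 12)")
unspaced = sp.sympify("12/W1*(W1+27*W1/12)")

print(sp.simplify(spaced))    # 39
print(sp.simplify(unspaced))  # 39
```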

The use of statistics to build these LLMs is an amazing outcome of AI work over the last 10+ years (certainly longer, but the acceleration within the last 10 years is noteworthy). Translating one language to another, e.g., French to English or C# to Lisp, with the benefit of statistics is simply gobsmacking! Autocompletion when coding or writing a paper is equally so! Creating images from a textual description is a bit mind-blowing, truth be told. But when we think about automating tasks with LLMs at the core of some system we’re implementing, essentially making activities hands-free, we need to take care to put guardrails around what we build: trust, but verify that solution requirements are met (else we re-submit, re-calculate, or take another solution path). We also may need to invest in introducing guardrails, i.e., pre-/post-conditions, invariants, and key assumptions, into model sessions or into the LLMs themselves. Introducing such guardrails may not solve the underlying problems, but they can signal exceptions to be addressed by handlers we create as part of the systems we put around LLMs and the like.
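To make the guardrail idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: call_llm stands in for whatever model client a system actually uses, and the specific pre-/post-condition checks are placeholders. The point is only the shape: check the request, verify the response, and retry or raise for a handler when verification fails.

```python
import re

MAX_ATTEMPTS = 3


def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; stands in for whatever client/API the system uses."""
    raise NotImplementedError


def precondition(prompt: str) -> str:
    # Example pre-condition: normalize math expressions (e.g., strip the blanks
    # around operators that tripped up GPT above) before they reach the model.
    return prompt.replace(" / ", "/").replace(" * ", "*")


def postcondition(response: str) -> bool:
    # Example post-condition: the response must say whether the system of
    # equations is even solvable, rather than silently "solving" it.
    return bool(re.search(r"unique solution|infinitely many|underdetermined", response, re.I))


def guarded_solve(prompt: str) -> str:
    prompt = precondition(prompt)
    for _ in range(MAX_ATTEMPTS):
        response = call_llm(prompt)
        if postcondition(response):
            return response
    # All attempts failed verification: signal an exception so a handler in the
    # surrounding system can re-submit, re-calculate, or take another path.
    raise RuntimeError("LLM response failed verification")
```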

We need to carefully consider our expectations of LLMs when we create AI Systems. We need to put safety checks into place to ensure request and response appropriateness and correctness. We need to pay attention to AI Safety, a field of research that aims to ensure artificial intelligence systems are aligned with foundational values and do not cause harm or unintended consequences. While there is no definitive or universally agreed-upon list of the pillars of AI System architecture, some possible candidates that overlap with how any software system should be put together are:

  • Alignment: AI systems should have goals and objectives that are compatible with foundational, black-and-white values, and should not pursue actions detrimental to human well-being or dignity.
  • Robustness: AI systems should be reliable, secure, and resilient to errors, attacks, or adversarial manipulation, and be able to cope with uncertainty and complexity.
  • Transparency: AI systems should be understandable, explainable, and accountable to humans, and provide clear and accurate information about their capabilities, limitations, and decision-making processes.
  • Ethics: AI systems should respect human rights, norms, and values, and avoid bias, discrimination, or unfairness.

This list is neither exhaustive nor mutually exclusive. Other considerations relating to how we put such systems together may include privacy, governance, regulation, social impact, and human-system collaboration. These are system-level concepts, not just AI model-focused concepts. We shouldn’t naively presume that a so-called safe LLM has been tested in all potential application contexts and against all common use cases. The people who construct systems with these models are responsible for doing this…

LLMs are REALLY COOL but not perfect, and the example above simply reinforces what we already know from the press they’re getting. Some might suggest that this test given to GPT is unfair in some way. Perhaps it is. HOWEVER, these LLMs and other modern AI are being held up as much more than finite automatons when they’re not, and our expectations of them need to be aligned with reality if we’re to unlock their true potential. And the systems we wrap around them, called AI Systems, are assembled by flawed people (aren’t we all?) who ultimately must be responsible for using appropriate data sets when training LLMs, putting guardrails into place to detect problems with both requests to and responses from our core systems, defining application contexts much more formally than we have to date, and removing a few bits of white space on occasion…
