LLMs Be Trippin’

I’ve long been skeptical about claims in the business press that genAI will streamline supply chains, design new drugs, and replace millions of jobs. A tool relying on word-association probabilities can’t be trusted to do anything requiring precision or dependability.

I didn’t have a scientific basis for my skepticism, until now.

A paper written by researchers at the National University of Singapore, "Hallucination is Inevitable: An Innate Limitation of Large Language Models," proves that LLMs, and therefore genAI, can't be trusted. In the six months since its publication, the paper has been cited in academic publications 77 times. That's a lot. Oddly, I can find only one reference to it in the business press, in an article predicated on its author's view that hallucinations—that is, nonsensical or inaccurate outputs—are "hopefully detectable and likely correctable."

That’s wishful thinking.

The Singapore authors use complex math to prove that AI hallucinations are inevitable. That section of the paper is daunting, so I’m not going to get into it, but if your math is better than mine, please follow the link to the paper and knock yourself out.

Luckily, the paper includes a section describing empirical tests that bear out the authors' mathematical results. One of the test problems is tightly defined and intellectually simple for human beings: "List all the strings of a fixed length using a fixed alphabet." For example, the set of two-character strings using the letters a and b is {aa, ab, ba, bb}. (This is the same as asking for all the two-letter permutations of a and b, with letter repetitions permitted.)
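
To see how trivially a deterministic program handles this task, here is a minimal Python sketch of my own (not the paper's test harness), built on the standard library's itertools.product:

    from itertools import product

    def all_strings(alphabet, length):
        # Every string of the given length over the given alphabet,
        # i.e. the alphabet combined with itself `length` times.
        return ["".join(chars) for chars in product(alphabet, repeat=length)]

    print(all_strings("ab", 2))        # ['aa', 'ab', 'ba', 'bb']
    print(len(all_strings("abc", 4)))  # 81 strings, none missing, none repeated

A few lines of ordinary code never skip or duplicate a string, no matter how long the strings get. That is the standard the LLMs were measured against.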

The researchers fed this problem to three LLMs. They all coped well with a string length of two using a two-character alphabet. However, consistent with the authors’ predictions, all three LLMs failed as the string length increased. GPT-4—the fourth iteration of the generative pre-trained transformer LLM and currently the most popular—did best, making it to six characters before it started hallucinating. But when tested with an alphabet of three characters—a, b, and c—all the LLMs failed sooner, and none made it past a string length of four.

The problem of hallucinations is well-recognized in the business world. However, it is generally thought to be solvable with larger, cleaner data sets, human-imposed guardrails, better prompts or questions (“prompt engineering”), and other devices.

The authors address these proposed solutions. Larger data sets will help with some problems, but not all. The same goes for prompt engineering, about which they say, "it is impossible to eliminate hallucination for all tasks by simply changing the prompts and hoping the LLM can automatically prevent itself from hallucinating." Human-imposed guardrails, such as automated fact-checking, can mitigate hallucinations, but it is not clear that they are scalable. LLMs might hand off certain questions, like the string problem, to an external tool such as a library of the Python programming language; the standard-library function itertools.product could generate every string instantly, but it's not clear that routing each problem to the right tool, every time, will scale.

And if a problem is better dealt with by another tool, wouldn't it make sense to use that tool in the first place? This free tool will generate all 243 five-character permutations of a, b, and c (with letter repetitions allowed) in a fraction of a second without omitting or repeating a single sequence; just enter 3 for n (number of objects) and 5 for r (sample size).
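
For readers who prefer code to a web form, a few lines of Python (again my own sketch, not anything from the paper) produce the same 243 sequences and confirm that none are missing or duplicated:

    from itertools import product

    # Every five-character string over the alphabet {a, b, c}.
    strings = ["".join(chars) for chars in product("abc", repeat=5)]

    print(len(strings))               # 243, i.e. 3 ** 5
    print(len(set(strings)) == 243)   # True: 243 distinct strings, so nothing is omitted or repeated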

The authors conclude that LLMs are bound to hallucinate and that because those hallucinations cannot be eliminated, they should not be used for critical decision-making. In business, innumerable decisions, from scheduling deliveries to redesigning process workflows to forecasting resource requirements, are critical.

So, returning to “hopefully detectable and likely correctable”: There is no reliable way to detect hallucinations. The paper’s authors show that using one LLM to check another’s output is not a solution. And if you have to supervise and correct your genAI tool to be sure of avoiding disaster, how helpful is it?

The sensible conclusion is to use genAI only for things it is good at, where precision is not important, like creating draft marketing copy or summarizing noncritical documents—and not for anything that requires accuracy or dependability. It's OK to admire LLMs, as many do. Just don't trust them.
