
What AI Can't Do

Realistic expectations for AI limitations, hallucination, and the work that still needs a human.

By Luka Filips

Key Takeaways

  • The most dangerous AI limitations are not the tasks models refuse, but the ones they fail at while sounding completely confident. Output fluency is not evidence of accuracy.
  • Even purpose-built professional AI is unreliable: Stanford found the LexisNexis and Thomson Reuters legal research tools each hallucinate more than 17% of the time, despite using retrieval to ground answers.
  • AI pattern-matches; it does not reason robustly. Apple's research shows frontier reasoning models hit a complete accuracy collapse as problems get harder, and past a certain complexity they even reduce their reasoning effort.
  • In Australia, trust is the real constraint, not adoption. With more than 50% of organisations already using AI, the competitive edge is responsible, human-overseen deployment that customers and regulators can verify.

Ask an AI model a hard question and it will answer with the same confident tone it uses for an easy one. That tone is the problem. The biggest AI limitations are not the tasks models refuse; they are the tasks they fail at while sounding certain. Knowing where that line sits is now a core business skill.

AI limitations are the categories of work that current systems cannot do reliably, regardless of how much data or compute you add: they cannot verify truth, reason robustly through novel problems, predict genuinely unpredictable events, or be held accountable for an outcome. These are not bugs awaiting a patch. They follow from how the technology works.

This article sets out what AI cannot do, why, and how to plan around it. We write as a small agency that builds these systems for Australian businesses, so the framing is practical rather than alarmist. Be excited about the capability. Be precise about the edge.

The limits of machine reasoning

A large language model predicts the next token in a sequence. It is trained on enormous text corpora to continue patterns. That mechanism produces fluent, useful, often impressive output. It does not produce understanding.
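To make "predicts the next token" concrete, here is a toy sketch. It assumes a tiny invented corpus and simple word-pair counts rather than a neural network, and the names `CORPUS`, `train_bigrams` and `predict_next` are hypothetical. Real models are vastly larger, but the principle of continuing the most likely pattern is the same.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus; a real model trains on billions of documents.
CORPUS = (
    "the capital of france is paris . "
    "the capital of italy is rome . "
    "the capital of spain is madrid . "
    "the capital of france is paris ."
)

def train_bigrams(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

model = train_bigrams(CORPUS)
print(predict_next(model, "is"))
```

In this toy, the word after "is" comes out as "paris" whichever country came before, because that continuation is the most frequent in the corpus. Asked to finish "the capital of italy is", it would give a fluent, confident and wrong answer.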

The clearest evidence comes from Apple's machine learning researchers. In their June 2025 paper The Illusion of Thinking, they tested frontier reasoning models on controllable puzzles and found that these systems "face a complete accuracy collapse beyond certain complexities". Performance fell into three regimes: on simple tasks, plain models beat the reasoning ones; on medium tasks, the extra reasoning steps helped; on hard tasks, both collapsed entirely.

One finding deserves attention from anyone betting on scale. As problems got harder, the models' reasoning effort increased up to a point, "then declines despite having an adequate token budget". Given more room to think, they thought less. The researchers also found the models "fail to use explicit algorithms and reason inconsistently across puzzles". Hand a model the exact procedure to solve a problem and it still cannot apply it consistently as the problem grows.

This is the heart of the matter. A system that pattern-matches brilliantly is not a system that reasons. It works when the answer resembles its training data and degrades, quietly, when the answer requires genuinely new steps. For routine work that resemblance holds often enough to be valuable. For anything novel or high-stakes, treat fluent output as a draft, not a conclusion.

What is AI hallucination

AI hallucination is when a model generates content that is plausible and confidently stated but false or unsupported by any real source. The model is not lying, because lying requires knowing the truth. It is doing exactly what it was built to do: produce statistically likely text. Sometimes the most likely text is wrong.

People expect hallucination from a free chatbot. The surprise is how stubborn it is in serious, purpose-built tools. Stanford researchers Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher Manning and Daniel Ho evaluated the legal research products from LexisNexis and Thomson Reuters, both of which use retrieval-augmented generation to ground answers in real case law. Their paper, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, reports that these tools "each hallucinate more than 17% of the time". These are professional products marketed to lawyers, built specifically to suppress the problem. Roughly one answer in six still goes wrong.

Retrieval-augmented generation reduces hallucination by feeding the model relevant documents at query time. It helps. It does not solve the problem, because the model can still misread, misquote, or stitch together passages that do not actually support its claim. We build retrieval-grounded knowledge assistants for clients precisely because grounding lifts reliability, and we design them knowing the floor is not zero.
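The idea can be sketched in a few lines. This is a minimal illustration, not how any particular product works: it assumes an invented two-document store, ranks documents by shared words instead of using embeddings, and adds a hypothetical `quote_is_supported` check that tests whether a quoted passage really appears in the retrieved source.

```python
import re

# Hypothetical document store; the contents are invented for illustration.
DOCUMENTS = {
    "policy-01": "Refunds are available within 30 days of purchase with a receipt.",
    "policy-02": "Gift cards cannot be refunded or exchanged for cash.",
}

def tokenize(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Rank documents by how many query words they share; return the best ids."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc_id: len(query_words & tokenize(documents[doc_id])),
        reverse=True,
    )
    return ranked[:top_k]

def quote_is_supported(quote, doc_ids, documents):
    """Check that a quoted passage appears verbatim in a retrieved document."""
    needle = " ".join(quote.lower().split())
    return any(
        needle in " ".join(documents[doc_id].lower().split())
        for doc_id in doc_ids
    )

sources = retrieve("can I get a refund within 30 days", DOCUMENTS)
print(sources)
print(quote_is_supported("within 30 days of purchase", sources, DOCUMENTS))
print(quote_is_supported("within 90 days of purchase", sources, DOCUMENTS))
```

Retrieval finds the right policy, and the check passes the genuine quote and rejects the altered one. A verbatim check like this only catches misquotes. It does not catch a model that quotes accurately and then draws a conclusion the passage does not support, which is why review still matters.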

The practical rule: confidence in the output tells you nothing about its accuracy. No prompt eliminates hallucination. You can reduce it, constrain it, and catch it with review, but you cannot trust an unverified AI claim in any context where being wrong has a cost.

Prediction is not the same as pattern recognition

Much of what gets sold as AI prediction is pattern recognition wearing a costume. The distinction decides whether a project succeeds or wastes a year of budget.

Pattern recognition works when three conditions hold: the future resembles the past, enough historical data exists, and the underlying system is stable. Demand forecasting for a stable product line fits. Spam detection fits. Fraud signals that repeat fit.

Prediction fails when the opposite is true. Novel situations, complex adaptive systems, and anything driven by human behaviour resist it. No model reliably predicts stock prices, election results, individual job performance, or who will commit a crime, regardless of the training set. In these domains the data describes a world that no longer behaves the way it did when the data was collected. Feeding more history into an unstable system does not produce foresight. It produces a confident guess dressed as a forecast.
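A small worked example shows the difference. It assumes invented weekly demand figures and the simplest possible forecaster, the average of past values; `forecast_next` and the numbers are hypothetical.

```python
from statistics import mean

def forecast_next(history):
    """Forecast the next value as the average of past values."""
    return mean(history)

# Hypothetical weekly demand for a stable product line.
stable_history = [100, 102, 98, 101, 99, 100]

# Invented outcomes: one week where the world stays the same,
# one week after a structural change such as a new competitor.
actual_if_stable = 100
actual_after_shift = 60

prediction = forecast_next(stable_history)
error_if_stable = abs(prediction - actual_if_stable)
error_after_shift = abs(prediction - actual_after_shift)

print(prediction, error_if_stable, error_after_shift)
```

The forecast is the same number in both cases, stated with the same confidence. It is accurate while the system stays stable and off by 40 units once the system changes, and nothing in the history warned of the change.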

When a vendor claims their system predicts an inherently unpredictable outcome, that is the moment to slow down. The maths does not improve because the marketing does.

AI snake oil and the vendor reality check

The term AI snake oil, popularised by Princeton researchers Arvind Narayanan and Sayash Kapoor, names products that promise far more than the underlying technology can deliver. It is worth separating three categories of claim, because the danger is rarely AI itself. It is AI applied to a problem it was never suited for.

Some AI genuinely works and is mature: image recognition, speech-to-text, translation, recommendation. Some is improving fast with real limits: language generation, code assistance, drafting. And some is snake oil: systems claiming to predict the unpredictable or to read character, intent, or future behaviour from thin signals.

When a vendor pitches, a few patterns separate substance from theatre:

| Warning sign | What to ask |
|---|---|
| "95% accurate" with no context | Accurate at what task, measured how, on whose data? |
| A flawless live demo | Can we run a paid pilot on our own messy data first? |
| "It will be able to do this soon" | What can it do today, in writing? |
| A black box that cannot be inspected | How do we audit a decision after the fact? |
| Only success stories | Show us where it fails and how often. |
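The first warning sign is worth a worked example. The sketch below assumes an invented fraud dataset in which 5% of transactions are fraudulent, and a "model" that never flags anything. The figures and function names are illustrative only.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def recall(predictions, labels, positive=1):
    """Fraction of actual positive cases the model caught."""
    on_positives = [p for p, y in zip(predictions, labels) if y == positive]
    if not on_positives:
        return 0.0
    return sum(p == positive for p in on_positives) / len(on_positives)

# Hypothetical dataset: 1,000 transactions, 50 of them fraudulent (label 1).
labels = [1] * 50 + [0] * 950

# A "model" that never flags anything as fraud.
always_negative = [0] * len(labels)

print(accuracy(always_negative, labels))
print(recall(always_negative, labels))
```

This do-nothing model scores 95% accuracy while catching none of the fraud. A headline accuracy figure means little until you know the task, the base rate, and which errors were counted.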

Demos are engineered to impress. Production is where the disadvantages of AI surface: edge cases, dirty data, and outputs no one checks. Ask for case studies with measured outcomes. Insist on a pilot. Verify the headline number yourself. A capable system survives that scrutiny; snake oil does not.

What AI cannot own: judgement and accountability

The deepest limit is not technical. AI cannot be accountable. When an automated decision harms a customer, "the model did it" satisfies no regulator, no court, and no person on the receiving end. Responsibility stays with people, which means people must stay in the decisions that carry weight.

There is a counter-intuitive pattern here that we see repeatedly. AI tends to automate the easy part of a job and leave the hard part standing. A lawyer reviews documents faster, but the legal judgement is still hers. A clinician gets a quicker first read, but the treatment call is still his. The routine, high-volume work compresses. The ambiguous, high-consequence work does not, and that is the work that needed a human all along.

This is why we design for human oversight from the start rather than bolting it on. The capabilities AI lacks are precisely the ones that justify human involvement: judgement when the rules do not fit, empathy for the person affected, ethics in a genuine trade-off, and ownership of the result. A system that knows when to defer to a person is more valuable than one that pretends it never needs to.

The design principle that follows is keeping a human in the loop: begin with human review of consequential decisions, expand the system's autonomy only as evidence of reliability accumulates, and route every uncertain or unusual case to a person by default. Errors in an unattended automation do not happen once. They repeat thousands of times before anyone notices.
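One way to express that principle in code is sketched below. It is an illustration under stated assumptions, not a prescribed design: the `TaskRecord` fields, the `route` function and both thresholds are hypothetical, and the sketch makes the conservative choice of never automating a consequential or unusual case.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    """Review history for one task type. Field names are illustrative."""
    reviewed: int = 0   # cases a human has checked
    correct: int = 0    # cases where the human agreed with the system

    def record(self, agreed):
        """Log the outcome of one human review."""
        self.reviewed += 1
        if agreed:
            self.correct += 1

def route(record, consequential, unusual, min_reviewed=200, min_agreement=0.98):
    """Return 'human' or 'auto'. Human review is the default; a case is
    automated only when it is routine and the track record justifies it."""
    if consequential or unusual:
        return "human"
    if record.reviewed < min_reviewed:
        return "human"  # not enough evidence yet
    if record.correct / record.reviewed < min_agreement:
        return "human"  # evidence exists, but reliability is too low
    return "auto"

history = TaskRecord()
print(route(history, consequential=False, unusual=False))  # no evidence yet

for _ in range(250):
    history.record(agreed=True)

print(route(history, consequential=False, unusual=False))
print(route(history, consequential=True, unusual=False))
```

The routing decision rests on the measured review history rather than on the model's own confidence, since a confident tone is not evidence of accuracy. The thresholds are placeholders to set per task, based on what an error would cost.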

The Australian angle: trust as the constraint

Australia has not been slow to adopt this technology. According to CSIRO, Australia's national science agency, "more than 50 per cent of organisations are using AI" and "almost half of Australians have used generative AI, outpacing the US and UK". Adoption is not the bottleneck here. Trust is.

CSIRO is blunt about what happens when systems lack proper safeguards: "biased decisions, privacy breaches, financial damage and real harm to people who rely on these systems". Those are not abstract risks. They are the predictable result of deploying a tool with the limits described above into a decision that matters, without oversight.

The agency's researchers frame the opportunity in commercial terms. Dr Ming Ding argues that "ensuring AI is responsible by design mitigates this risk from the beginning. It leads to systems that people can trust, and that trust has real value in the market". Dr Qinghua Lu puts it more sharply still: "Trust in AI is the new currency." For an Australian small business, that reframes the whole question. The competitive edge is not having AI. Most of your market already does. The edge is deploying it in a way customers and regulators can trust, with outputs people can check and decisions people can question.

The Enki approach

We would rather under-promise and over-deliver than sell something that fails in production. In our work with small businesses, the pattern that wins is unglamorous: match the tool to tasks it genuinely handles, keep people accountable for decisions that carry risk, verify outputs that feed real choices, and measure results against the claim.

That discipline starts before any build, in a discovery and audit phase that separates the problems AI can solve from the ones it cannot. It continues in treating the business as a system rather than a pile of features, and in future-proofing how you are found as search shifts toward AI answers. Realistic expectations are not a brake on ambition. They are how you spend a budget on the AI that works and skip the AI that does not.

Frequently Asked Questions

What is AI hallucination?

AI hallucination is when a model produces content that sounds plausible and is stated confidently but is false or unsupported by any real source. It happens because a large language model generates statistically likely text rather than verified truth, so the most probable wording is sometimes simply wrong. It cannot be eliminated by prompting, only reduced and caught through review. Stanford researchers found that even professional legal AI tools from LexisNexis and Thomson Reuters, which use retrieval to ground their answers, still hallucinate more than 17% of the time.

What can't AI do?

Current AI cannot verify whether its own output is true, cannot reason robustly through genuinely novel problems, cannot predict inherently unpredictable events such as stock prices or individual behaviour, and cannot be accountable for a decision. These limits follow from how the technology works rather than from a lack of data or compute. Apple's 2025 research showed that even frontier reasoning models suffer a complete accuracy collapse once problems become sufficiently complex.

Can AI reason like a human?

No. A large language model predicts likely sequences of text based on patterns in its training data, which produces fluent and often useful output but is not human reasoning. Apple's machine learning researchers found that frontier models fail to apply explicit algorithms consistently and reason inconsistently across similar puzzles, with accuracy collapsing entirely as complexity rises. Treat AI as excellent at pattern recognition and unreliable at novel, multi-step reasoning.

What is AI snake oil?

AI snake oil, a term popularised by Princeton researchers Arvind Narayanan and Sayash Kapoor, describes products that promise far more than the technology can deliver, especially systems claiming to predict inherently unpredictable outcomes or to read character, intent, or future behaviour from thin signals. The danger is rarely AI itself but AI applied to a task it was never suited for. Protect yourself by asking how accuracy was measured, running a paid pilot on your own data, and verifying headline claims independently.

Do these limitations mean businesses should avoid AI?

No. The limitations argue for selective, well-designed use rather than avoidance. AI genuinely works for tasks like summarisation, drafting, translation, recommendation, and routine automation. The approach that succeeds is matching AI to tasks it handles reliably, keeping a human accountable for decisions that carry risk, and verifying any output that feeds a real choice. In Australia, where more than half of organisations already use AI, the advantage comes from responsible, trustworthy deployment, not from adoption alone.
