
A Forensic Analysis of LLM Math Errors: What ChatGPT Got Wrong in 2023

Technical appendix to How ChatGPT Helped Me Bend Steel Poles. For the full AI & Expertise series, see Part 1: Bending Steel Poles, Part 2: LLMs Need Supervision, and Part 3: Cognitive RAM.


The Setup

In May 2023, I built a ninja course for my son's birthday party. The steel poles bent almost immediately. I wrote about what went wrong in How ChatGPT Helped Me Bend Steel Poles — how I trusted ChatGPT for structural engineering calculations, dismissed the correct answers as hallucinations, and learned the expensive lesson that expertise matters when evaluating AI output.

This is the technical appendix. The forensic autopsy.

The pole flexing under load during actual use—visual proof of the structural failure predicted by the calculations I dismissed.

When I migrated my old blog posts to my personal website, I decided to use modern LLMs to run a retrospective on my original mistakes. I exported the ChatGPT conversations from 2023 and had Claude analyze them with me: four threads, twenty-plus messages, March through May 2023. Not just "the AI was wrong," but specifically HOW it was wrong. What formulas it used. What values it got. What I said in response.

There's something fitting about using a 2026 AI to dissect a 2023 AI's failures. The newer model caught errors the older one made — and errors I made too.

The summary: ChatGPT made fundamental physics errors. Wrong formulas, wrong values, unit confusion. But the human errors — dismissing correct answers, confirmation bias, terminology confusion — were equally responsible for the failure.

Here's what the transcripts reveal.


ChatGPT's Math Errors

The Moment of Inertia Chaos

Same pipe. Same question. Four different values across different threads.

| Value Given | Thread Date | Notes |
|---|---|---|
| 0.0183 in⁴ | Mar 22, 2023 | First calculation |
| 0.00378 in⁴ | Mar 28, 2023 | 51,857 lbs thread |
| 0.091 in⁴ | Mar 25, 2023 | Alternative thread |
| 0.543 in⁴ | Mar 25, 2023 | Same thread, later |

The correct value for a 1.5" Schedule 40 pipe (OD: 1.9", ID: 1.61") using the standard formula:

I = π/64 × (OD⁴ - ID⁴)
I = π/64 × (1.9⁴ - 1.61⁴)
I ≈ 0.31 in⁴

ChatGPT's values ranged from 0.00378 to 0.543, roughly a factor of 140 between the lowest and highest. None matched the correct value.
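As a quick check on the arithmetic, the formula above can be evaluated directly. This is a minimal TypeScript sketch: the function name `hollowCircleInertia` is my own, and the four ChatGPT values are copied from the table above.

```typescript
// Area moment of inertia of a hollow circular section, in in^4:
// I = (pi / 64) * (OD^4 - ID^4)
function hollowCircleInertia(odIn: number, idIn: number): number {
  if (idIn < 0 || odIn <= idIn) {
    throw new RangeError("expected 0 <= ID < OD");
  }
  return (Math.PI / 64) * (odIn ** 4 - idIn ** 4);
}

// 1.5" Schedule 40 pipe dimensions quoted in the text.
const pipeInertia = hollowCircleInertia(1.9, 1.61);

// The four values ChatGPT produced across threads (in^4).
const chatGptValues = [0.0183, 0.00378, 0.091, 0.543];
const valueSpread = Math.max(...chatGptValues) / Math.min(...chatGptValues);

console.log(pipeInertia.toFixed(2));  // "0.31"
console.log(Math.round(valueSpread)); // 144
```

The closest ChatGPT value (0.543) is still about 75% too high, and the lowest is roughly 80 times too small.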

The inconsistency in this one variable cascaded through every downstream calculation. Wrong I means wrong M (bending moment), which means wrong F (force). The chaos was baked in from the start.


The "c" Value Disaster (51,857 lbs Explained)

The most spectacular failure had a specific, identifiable cause. Here's what ChatGPT said about the 2.5" pipe:

"For a 2.5-inch schedule 40 galvanized steel pipe, the outer diameter is 2.875 inches, and the inner diameter is 2.469 inches. The distance from the neutral axis to the outer edge can be approximated as half of the difference between the outer and inner diameters, which is 0.203 inches."

This is wrong. Fundamentally wrong.

In beam bending theory, "c" is the distance from the neutral axis (center) to the outer fiber where maximum stress occurs. For a round pipe, c = OD/2 = outer radius.

ChatGPT's calculation:

c = (2.875 - 2.469) / 2 = 0.203 inches (the wall thickness, not a distance from the neutral axis)

Correct calculation:

c = 2.875 / 2 = 1.4375 inches (outer radius)

That's a 7x error. And because capacity scales inversely with c (from the stress formula σ = Mc/I, we get M_allow = σ_y × I / c), making c 7x smaller makes the calculated force capacity 7x higher.

The 51,857 lbs answer wasn't random hallucination. It was the mathematically inevitable result of this specific error, compounded by the already-wrong moment of inertia.
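To isolate the effect of c, here is a small sketch that compares the two definitions for the same pipe. Because M_allow = σ_y × I / c, the yield stress and moment of inertia cancel out of the ratio, so no other values need to be assumed. The function names are illustrative only.

```typescript
// Correct: distance from the neutral axis to the outer fiber of a round pipe.
function outerFiberDistance(odIn: number): number {
  return odIn / 2;
}

// What ChatGPT computed instead: (OD - ID) / 2, which is the wall thickness.
function wallThickness(odIn: number, idIn: number): number {
  return (odIn - idIn) / 2;
}

// 2.5" Schedule 40 dimensions as quoted by ChatGPT.
const pipeOd = 2.875;
const pipeId = 2.469;

const cCorrect = outerFiberDistance(pipeOd);   // 1.4375 in
const cWrong = wallThickness(pipeOd, pipeId);  // about 0.203 in

// For the same sigma_y and I, capacity is overstated by cCorrect / cWrong.
const overstatement = cCorrect / cWrong;

console.log(cWrong.toFixed(3));        // "0.203"
console.log(overstatement.toFixed(1)); // "7.1"
```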

My response:

"That sounds batshit wrong."

ChatGPT's response: repeated the exact same calculation and gave the same answer with equal confidence.


Wrong Formula for Wrong Problem

When I pushed back on the low ~43 lbs answer, ChatGPT switched approaches:

"Let's try to resolve this with a different approach. The Euler's critical load formula is used to find the critical load at which a column will buckling."

Pcr = (π² × E × I) / L²

And calculated 3,203 lbs.

The problem: Euler's buckling formula is for slender columns under axial compression. When you push straight down on a tall thin column, it can buckle sideways.

My problem was a cantilever beam under lateral load — a horizontal force pulling at the top. Completely different physics. The 3,203 lbs answer came from solving the wrong problem entirely.

But it was higher than 43 lbs, so it "felt" more reasonable.

📐 Technical note on Euler buckling (optional)

Technically, the Euler formula for a fixed-free column also requires an effective length factor K. For fixed-free conditions (cantilever), K = 2, which reduces the buckling capacity by 4× compared to pinned-pinned (K = 1):

P_cr = π²EI / (KL)² = π²EI / (2L)² = π²EI / (4L²)

With the correct moment of inertia (I = 0.31 in⁴), and assuming a typical steel modulus of E ≈ 29 × 10⁶ psi and L = 84 in, this gives approximately 3,150 lbf, close to ChatGPT's 3,203 lbs. So while the formula is still the wrong physics for my lateral bending problem, the number isn't entirely nonsensical as an axial buckling load. It just doesn't apply to a rope pulling sideways on the pole.
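To make the gap between the two problems concrete, the sketch below evaluates both for the same 1.5" pipe: the axial Euler buckling load and the lateral tip load at first yield. The modulus (29 × 10⁶ psi) and yield stress (36,000 psi) are my assumptions for ordinary structural steel, not values taken from the transcripts.

```typescript
const E_PSI = 29e6;        // assumed Young's modulus for steel
const YIELD_PSI = 36_000;  // assumed yield stress (mild steel)
const POLE_IN = 84;        // 7 ft pole
const OD_IN = 1.9;
const ID_IN = 1.61;

const inertiaIn4 = (Math.PI / 64) * (OD_IN ** 4 - ID_IN ** 4);

// Axial buckling: P_cr = pi^2 * E * I / (K * L)^2, with K = 2 for fixed-free.
function eulerBucklingLoad(k: number): number {
  return (Math.PI ** 2 * E_PSI * inertiaIn4) / (k * POLE_IN) ** 2;
}

// Lateral tip load at first yield: M = sigma_y * I / c at the base, F = M / L.
function lateralYieldLoad(): number {
  const c = OD_IN / 2;
  return (YIELD_PSI * inertiaIn4) / c / POLE_IN;
}

console.log(Math.round(eulerBucklingLoad(2))); // roughly 3,140 lbf pushing straight down
console.log(Math.round(lateralYieldLoad()));   // roughly 140 lbf pulling sideways
```

Under these assumptions the same pole tolerates more than twenty times as much load along its axis as across it, which is why answering the buckling question instead of the bending one produced such a comfortable number.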


Dimensional Nonsense

From the May 13 thread about concrete-filled poles:

F = (24,000 psi × A) / 7 feet

Let's check the units:

  • psi × in² = pounds (force) — correct
  • pounds / feet = ???

Force divided by length is a distributed load (pounds per foot), not a force, so the result cannot be F. The formula is dimensionally inconsistent. ChatGPT was stringing together pieces of formulas without understanding what they represented.
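The unit check above can be done mechanically by tracking exponents. The following is a toy sketch, not a units library: the `Dim` type and helper names are invented for illustration, and it treats inches and feet as the same dimension (length), ignoring conversion factors.

```typescript
// A quantity's dimensions as exponents of force and length.
type Dim = { force: number; length: number };

const PSI: Dim = { force: 1, length: -2 };   // lb / in^2
const AREA: Dim = { force: 0, length: 2 };   // in^2
const HEIGHT: Dim = { force: 0, length: 1 }; // ft
const FORCE: Dim = { force: 1, length: 0 };  // lb

// Multiplying quantities adds exponents; dividing subtracts them.
const times = (a: Dim, b: Dim): Dim =>
  ({ force: a.force + b.force, length: a.length + b.length });
const over = (a: Dim, b: Dim): Dim =>
  ({ force: a.force - b.force, length: a.length - b.length });
const sameDim = (a: Dim, b: Dim): boolean =>
  a.force === b.force && a.length === b.length;

// Right-hand side of F = (24,000 psi * A) / 7 ft
const rhs = over(times(PSI, AREA), HEIGHT);

console.log(rhs);                 // { force: 1, length: -1 }: force per length
console.log(sameDim(rhs, FORCE)); // false: the right-hand side is not a force
```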


"Please Calculate This Value"

The absurd peak:

ChatGPT: "Please perform the calculations using the provided formulas, and let me know the values of I_concrete-filled and F."

My response:

"Why are you saying 'Please calculate this value'? You do it!"

An AI asked to calculate something... asking the human to do the calculation. In 2023, ChatGPT often failed at multi-step arithmetic without an external calculator. It would set up formulas, then punt to the user or hallucinate the result.


The Human Errors

Dismissing Correct Answers

Here's the nuance about the ~43 lbs and ~99 lbs answers: they're in the right ballpark for the real-world performance, even though the ideal theoretical calculation predicts higher.

For a 7' cantilevered 1.5" Schedule 40 pipe under ideal conditions—perfectly fixed base, uniform loading, no defects—the theoretical yield load is approximately 140-194 lbf (depending on steel grade). But my setup had:

  • Imperfect base fixity (concrete/soil allows some rotation)
  • Dynamic loading from swinging (impact forces could amplify static load by 1.5-2×)
  • Local stress concentrations and possible denting
  • Off-axis torsional forces from rope attachment

These real-world factors can reduce capacity significantly. The 43-99 lbs range might also represent initial visible deflection or serviceability limits rather than true yield strength. So while the numbers felt low, they were plausible for actual backyard conditions—not the idealized cantilever beam in a textbook.
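A rough sketch of how the ideal figure and the dynamic amplification combine. The two yield stresses (36,000 and 50,000 psi) are my assumption, chosen because they reproduce the 140-194 lbf range quoted above; the amplification factors are the 1.5-2× mentioned in the list. The other real-world effects are not modeled.

```typescript
const POLE_LENGTH_IN = 84; // 7 ft
const PIPE_OD_IN = 1.9;
const PIPE_ID_IN = 1.61;

// Static tip load at first yield for an ideal cantilever:
// F = sigma_y * I / (c * L)
function staticYieldLoad(yieldPsi: number): number {
  const inertia = (Math.PI / 64) * (PIPE_OD_IN ** 4 - PIPE_ID_IN ** 4);
  const c = PIPE_OD_IN / 2;
  return (yieldPsi * inertia) / (c * POLE_LENGTH_IN);
}

// Static-equivalent load that reaches yield once dynamic effects
// multiply the applied load by `amplification`.
function allowableStaticLoad(yieldPsi: number, amplification: number): number {
  return staticYieldLoad(yieldPsi) / amplification;
}

for (const yieldPsi of [36_000, 50_000]) {
  for (const amplification of [1, 1.5, 2]) {
    const load = allowableStaticLoad(yieldPsi, amplification);
    console.log(`${yieldPsi} psi, x${amplification}: ${load.toFixed(0)} lbf`);
  }
}
```

With the lower grade and a 2× amplification, the static-equivalent load drops to about 70 lbf, which sits inside the 43-99 lbs range before base rotation or denting is even considered.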

My response when ChatGPT gave me 43.43 lbs:

"A 7' tall 1.5" schedule 40 galvanized steel pipe will bend with 43.43 lbs of pull from the top? That seems really low."

It "seemed" low because my mental model was wrong. Steel = strong. 43 lbs = a small child could bend this. Must be a hallucination.

But a small child CAN bend a 7-foot cantilevered steel pipe if they're hanging on a rope attached to the top. That's the whole point of the leverage calculation. The pole's length turns a modest pull at the top into a large bending moment at the base.

The low numbers weren't hallucinations. They were in the plausible range. I dismissed them because they didn't match what I expected to hear.


Confirmation Bias Loop

The pattern across these conversations:

  1. Ask question
  2. Get answer that seems wrong
  3. Push back: "That seems low"
  4. ChatGPT apologizes and recalculates with different approach
  5. Get different answer
  6. If still "too low," keep pushing
  7. Eventually accept the answer that matches expectations

I was effectively training ChatGPT to give me the wrong answer by rejecting the right ones.
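The loop above amounts to a selection rule, sketched here in a few lines. The answer values come from this post; their order and the 500 lbf "gut feeling" threshold are hypothetical, chosen only to illustrate the pattern.

```typescript
// Walks the answers in the order they arrived and returns the first one
// the asker does not push back on. The rule never looks at the physics:
// it only compares each answer to the asker's expectation.
function firstAcceptedAnswer(
  answersLbf: number[],
  expectedMinimumLbf: number,
): number | undefined {
  for (const answer of answersLbf) {
    if (answer >= expectedMinimumLbf) {
      return answer; // "feels right", so the pushing stops here
    }
    // otherwise: "That seems low", and the model tries another approach
  }
  return undefined; // every answer was rejected
}

const answersInOrder = [43.43, 99, 3203]; // illustrative ordering
const gutFeelingLbf = 500;                // hypothetical expectation

console.log(firstAcceptedAnswer(answersInOrder, gutFeelingLbf)); // 3203
```

Whatever the model produces, the output of this process is determined by the threshold, which is exactly the part that was wrong.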

The 3,203 lbs from Euler's buckling formula? I didn't push back on that one. It was higher. It felt safer. So I moved on.


Terminology Confusion

Throughout these conversations, I used terms I didn't understand interchangeably:

| Term I Used | What I Thought | What It Actually Means |
|---|---|---|
| Breaking point | When it stops working | Complete material fracture (ChatGPT gave values 2500+ lbs) |
| Bending | Same as breaking? | Permanent deformation begins (ChatGPT gave ~70-100 lbs) |
| Flexural failure | Sounds serious | Structural failure from bending (ChatGPT gave ~150-200 lbs) |
| Buckling | ? | Column instability under compression (ChatGPT gave ~3000 lbs) |
| Yield strength | ? | Stress at which permanent deformation begins |

When I asked about "breaking point" and "critical failure," I got answers for when the steel would snap. What I needed was when it would bend permanently — which happens at much lower forces.

ChatGPT didn't clarify. It answered whatever question I asked, in the same confident tone, regardless of whether it was the right question.


The 2023 LLM Context

Why GPT-3.5/Early GPT-4 Was Bad at Physics

Looking back with what we know now:

  1. Training data problem. The model learned formulas as text patterns, not as mathematical operations. It could recite σ = Mc/I but didn't understand what the variables meant.
  2. No computation layer. In 2023, ChatGPT couldn't reliably perform multi-step arithmetic. It would predict what the answer "should look like" based on training data, not calculate it.
  3. No verification. The model couldn't check its own work. It had no way to plug numbers back in and verify that the answer made physical sense.
  4. Pattern matching without understanding. It recognized this as "a beam bending problem" and retrieved relevant formulas, but substituted variables incorrectly because it didn't understand the geometry.

The Confidence Problem

Nearly every response included a disclaimer:

"It is always recommended to consult a structural engineer for accurate assessments based on your specific conditions."

This appeared with equal confidence alongside the wrong numbers. The model couldn't signal uncertainty. 51,857 lbs was delivered with the same tone as 43.43 lbs.

I ignored the disclaimers every time. They felt like boilerplate.


What's Changed (and What Hasn't)

Changed since 2023:

  • Code interpreter / calculator tools for reliable arithmetic
  • Chain-of-thought prompting for multi-step problems
  • Better at catching obvious dimensional errors
  • Generally improved at mathematical reasoning

Not changed:

  • Still susceptible to wrong formula selection
  • Still doesn't know what it doesn't know
  • Still sounds confident when wrong about domain specifics
  • Domain expertise still required to evaluate output

The core lesson hasn't expired: if you can't evaluate the output, you're gambling.


The Software Engineering Parallels

These failure modes aren't unique to structural engineering. Every pattern here has a direct parallel in software development.

The Vocabulary Trap → Ubiquitous Language (DDD)

I learned terms FROM ChatGPT — cantilever, yield strength, flexural failure — without grounding them in real understanding. I could ask sophisticated-sounding questions but couldn't evaluate answers.

Domain-Driven Design's core insight applies here: shared vocabulary without shared understanding is worse than no vocabulary at all.

When business says "user" and engineering says "user" but they mean different things (account vs session vs person), you get systems that are internally consistent but wrong. The code compiles. The tests pass. The feature doesn't solve the problem.

I had the vocabulary of structural engineering without the bounded context. I was the developer who uses "aggregate root" in meetings but doesn't actually understand where the consistency boundary should be.

Type Safety False Confidence → Why The Errors Were Invisible

The c-value error passed dimensional analysis. The units worked out. The formula was syntactically correct. Only someone who knew c should be ~1.4" would catch it.

This is exactly like type-safe code that's semantically wrong:

```typescript
// Types are perfect. Logic is broken.
function calculatePrice(weightKg: number, distanceKm: number): number {
  return weightKg * distanceKm; // Wrong formula, right types
}
```

The compiler is happy. The linter is happy. The code review checks syntax. But the formula is wrong, and only someone who knows the domain would catch it.

ChatGPT's math was syntactically valid (formula structure, unit consistency) but semantically wrong (wrong value for c). My "review" caught nothing because I was checking syntax, not physics.
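The same point can be made with the pipe itself. In this sketch, a branded `Inches` type (my own illustrative construct, not from the original conversations) enforces that c is a length in inches, and both the correct and the incorrect c satisfy it.

```typescript
// A branded type: the compiler treats Inches as distinct from a bare number.
type Inches = number & { readonly __unit: "in" };
const inches = (value: number): Inches => value as Inches;

// Both functions return Inches, so both type-check as a valid "c".
const outerRadius = (od: Inches): Inches => inches(od / 2);
const halfDiameterDifference = (od: Inches, id: Inches): Inches =>
  inches((od - id) / 2);

// M_allow = sigma_y * I / c. The signature demands Inches for c,
// but it cannot know which length in inches is the right one.
function allowableMoment(yieldPsi: number, inertiaIn4: number, c: Inches): number {
  return (yieldPsi * inertiaIn4) / c;
}

const odIn = inches(2.875);
const idIn = inches(2.469);

// Placeholder sigma_y and I of 1: only the ratio matters here.
const rightMoment = allowableMoment(1, 1, outerRadius(odIn));
const wrongMoment = allowableMoment(1, 1, halfDiameterDifference(odIn, idIn));

console.log((wrongMoment / rightMoment).toFixed(1)); // "7.1": compiles cleanly, 7x too high
```

Stricter types narrow the space of mistakes, but choosing the right length out of several valid lengths is a domain decision that no type signature encodes.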

Agile Without Acceptance Criteria → The Feedback Loop

Each time I said "that seems wrong," ChatGPT tried a different approach. I kept rejecting until I got a number I liked. I was training the AI to give me the wrong answer.

This is sprint churn without definition of done:

Stakeholder: "That's not what I wanted."

Dev: rewrites feature

Stakeholder: "Still not right."

Dev: rewrites again

Stakeholder: "Hmm, closer..."

The feature that ships is whatever made the stakeholder stop saying "no" — not necessarily what solves the problem.

I had no acceptance criteria for ChatGPT's answers. I couldn't say "the answer should be X because Y." I could only say "that feels wrong." So we iterated until something felt right — which was wrong.

Alert Fatigue → Why I Ignored "Consult an Engineer"

"Consult a structural engineer" appeared in nearly every response, delivered with the same tone as everything else. I ignored it every time.

This is alert fatigue in monitoring systems. When your Slack channel has 200 alerts per day and 198 of them are noise, you stop reading them. The 2 real ones get missed.

ChatGPT's disclaimer was a P4 alert firing on every response. Same severity. Same wording. Same channel as the actual answer. No escalation path.

When everything sounds equally confident, warnings become noise. The model that cried wolf.

Goodhart's Law → Optimizing the Wrong Metric

I treated the failure threshold like a credit limit: higher = better. But the number meant "fails AT this load," not "safe UP TO this load." I optimized in the wrong direction.

"When a measure becomes a target, it ceases to be a good measure."

  • Lines of code → incentivizes verbosity
  • Code coverage % → incentivizes tests that don't assert anything
  • Velocity points → incentivizes point inflation
  • "Breaking strength" → I optimized for how spectacularly it would fail

Before optimizing a metric, understand what it actually measures.

Implicit Requirements → The Wrong Question

I asked about "breaking" when I meant "stops working." ChatGPT answered the literal question. The spec was in my head, not in the query.

This is the PM who says "we need user search" and gets exact-match when they assumed fuzzy. The requirement was implicit. The developer built exactly what was specified — which wasn't what was needed.

In DDD: make the domain model explicit. In requirements: write acceptance criteria. In AI prompts: define your terms.


The Lessons

1. Wild inconsistency = stop and get expert help

43 lbs to 51,857 lbs for essentially the same question. That spread should have been the unmistakable signal to stop trusting the AI and find someone who actually understood structural engineering.

Instead, I kept asking, hoping for convergence.
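A crude version of that signal is easy to compute. The sketch below measures the ratio between the largest and smallest answers quoted in this post; the cutoff of 10 is an arbitrary assumption of mine, not an engineering rule.

```typescript
// Ratio between the largest and smallest answer to the same question.
function spreadRatio(answers: number[]): number {
  if (answers.length < 2) {
    throw new RangeError("need at least two answers");
  }
  const smallest = Math.min(...answers);
  const largest = Math.max(...answers);
  if (smallest <= 0) {
    throw new RangeError("answers must be positive");
  }
  return largest / smallest;
}

const MAX_TRUSTED_SPREAD = 10; // assumed threshold, pick your own
const answersLbf = [43.43, 99, 3203, 51857];

const ratio = spreadRatio(answersLbf);
const verdict = ratio > MAX_TRUSTED_SPREAD ? "stop and ask an expert" : "keep going";

console.log(Math.round(ratio), verdict); // 1194 "stop and ask an expert"
```

A check like this says nothing about which answer is right. It only says that the source cannot be trusted to tell you.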

2. A lower number doesn't mean a wrong one

The low answers (43 lbs, 99 lbs) weren't hallucinations. They were the right ballpark. I dismissed them because they didn't match my incorrect mental model.

Before rejecting an answer as "too low" or "too high," ask: do I actually understand what this number represents?

3. Terminology matters more than you think

When I said "breaking," I meant "stops working." The AI heard "material fracture." These are different failure modes with different force thresholds.

Ambiguity gets amplified. If you don't know the difference between bending and breaking, the AI won't clarify.

4. Confirmation bias works on AI outputs too

I rejected answers I didn't like and accepted answers that felt right. Classic bias — just applied to a new medium.

The answer that "felt" right (higher = safer) was wrong. The answers that felt wrong (lower than expected) were the ones closest to reality.

5. The correct answers may be the ones you don't like

If 43 lbs had felt acceptable, I would have built it differently or not built it at all. The uncomfortable answer was the right one.

Expertise isn't just knowing the formulas. It's knowing which uncomfortable answer to trust.


I am not a structural engineer. While I've worked with a professional to review the physics in this post, some mathematical details may still be imperfect—this analysis is illustrative of the failure modes and evaluation errors, not a reference for structural calculations. The order-of-magnitude differences and core lessons are real: the poles bent because I couldn't evaluate the AI's output.

Closing

The poles are still standing now, two years later, with guy wires providing the counter-tension I should have designed in from the start.

Close-up of the bent pole showing stress fractures

The physical evidence: stress fractures from compression and bending.

Neither the AI nor the human was solely at fault here. The failure was collaborative: an overconfident AI meeting an under-qualified human, each amplifying the other's weaknesses.

ChatGPT made physics errors. I made judgment errors. Together, we bent steel poles.

The lesson for AI-assisted work in any domain: understand enough to evaluate the output, or find someone who does. The AI will sound confident either way.