Status Was Green the Whole Time
Big system programs fail at a measured base rate, on causes the data has named for thirty years. The status deck is the last place the failure shows up.
YASHRAJ PATEL · LATENT VARIABLES
The number to start from
Begin with the base rate, because every other claim in this field has to argue against it. The Standish Group has tracked software project outcomes since 1994[1], and the first CHAOS report set the floor: 16.2 percent of projects landed on time, on budget, and at full scope; 52.7 percent were challenged; 31.1 percent were cancelled outright. The 2015 refresh moved the numbers a little and the shape not at all, roughly 29 percent successful against 52 percent challenged. I lead with Standish not because it is beyond criticism (the firm guards its methodology, and academics have pushed back on the sampling) but because no competing dataset of comparable size tells a kinder story, and the failure causes it names have been widely replicated since.
What it names is the part worth fixing on. The two highest-weighted success factors in the CHAOS rankings are user involvement and complete requirements, and the two highest-weighted failure factors are their absence: lack of user involvement at 12.8 percent of the failure weight, incomplete requirements at 12.3. Both sit above every technical factor on the list. Read that carefully. The thing most likely to sink a system program is not the platform, the integration architecture, or the cloud migration. It is that the people who do the work were never asked how the work is done. Panorama Consulting, which runs an ERP recovery and expert-witness practice[2], reaches the same place from the wreckage end: their published position is that software problems are usually a symptom of skipped discovery and weak change management, not a cause. When a firm that reconstructs dead programs for litigation and a firm that counts live ones agree on the root cause, I stop hedging.
[Figure: share of projects by outcome. Source: Standish Group, CHAOS Report (1994) and 2015 refresh via InfoQ.]
The tail is where the money goes
Averages undersell the risk, and this is where I trust the most rigorous source in the canon. Bent Flyvbjerg and the McKinsey-Oxford study of 5,400-plus large IT projects[3] put the central tendency at 45 percent over budget with 56 percent less value delivered than promised, and found that 17 percent of large IT projects go badly enough to threaten the sponsoring organization's existence. The number that should change how anyone forecasts is not the average. It is the shape. Flyvbjerg's later work on the empirical reality of overruns[4] shows IT cost is fat-tailed: about 18 percent of projects overrun by more than 50 percent, and that tail averages a 447 percent overrun. A fat tail means the inside view, the project team's own plan, is close to worthless as a predictor, because the disasters are not the plan plus noise; they are a different distribution entirely. The honest forecast asks what happened to the reference class of projects that looked like this one, then goes hunting for the local facts that put a program in the tail.
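The arithmetic behind the fat tail is worth making explicit. What follows is a minimal sketch in Python that uses only the two tail figures quoted above (about 18 percent of projects, averaging a 447 percent overrun). The function name and the 10 percent average for the remaining projects are my own illustrative assumptions, not figures from Flyvbjerg's paper.

```python
def mean_overrun(tail_share, tail_mean, body_mean):
    """Mean overrun (in percent) of a two-part mixture: ordinary projects plus the tail."""
    return (1 - tail_share) * body_mean + tail_share * tail_mean

TAIL_SHARE = 0.18   # from the text: ~18% of projects overrun by more than 50%
TAIL_MEAN = 447.0   # from the text: those projects average a 447% overrun
BODY_MEAN = 10.0    # ASSUMPTION, illustration only: average overrun of the other 82%

# Percentage points of the overall mean that the tail contributes on its own.
tail_contribution = TAIL_SHARE * TAIL_MEAN
overall = mean_overrun(TAIL_SHARE, TAIL_MEAN, BODY_MEAN)

print(f"tail contribution to the mean: {tail_contribution:.2f} points")
print(f"overall mean overrun under the assumed body: {overall:.2f}%")
print(f"share of the mean that comes from the tail: {tail_contribution / overall:.0%}")
```

Under that assumed body, roughly nine-tenths of the expected overrun comes from fewer than one project in five. A plan calibrated on how typical projects behave says very little about the outcome that carries the money.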
Those local facts have names, and the public post-mortems supply them. TSB's 2018 core-banking migration[10] went live with 4,424 open defects and had tested only one of its two data centers; the independent review priced the failure near 366 million pounds. Queensland Health's payroll[11] carried more than 24,000 combinations of calculation rules and went live despite known defects, at an estimated A$1.2 billion. Canada's Phoenix pay system[12] laid off about 2,700 payroll clerks before go-live and then mispaid close to 80 percent of 290,000 public servants, partly because roughly 40 percent of retroactive collective-agreement increases could not be processed automatically and the people who did them by hand were gone. None of these is a story about bad code in the abstract. Each is a story about a specific fact, known to specific people, that did not reach the go-live decision in time.
[Figure: fat-tailed IT cost risk. Source: Flyvbjerg et al., "The Empirical Reality of IT Project Cost Overruns".]
The pattern repeats with operating-model conflicts the configuration never resolved. Eric Kimberling's ERP failure post-mortems[5] trace Lidl's roughly 500-million-euro SAP write-off to an inventory valuation convention, purchase price versus retail price, that the business never accepted and that customization was used to paper over until the build broke. Hershey scheduled its go-live into the pre-Halloween peak in 1999 and missed more than 100 million dollars in orders. The repeated diagnostic Kimberling extracts is mundane and damning: was the operating-model decision made before configuration, or was customization quietly substituting for a fight nobody wanted to have? That fight lives in the heads of the people who run the exceptions, and it is exactly what the status deck cannot show.
The watermelon problem
There is a name for the central failure of measurement here, and it is precise enough to keep. A program is reported green at the top while it is red underneath, the way a watermelon is green on the rind and red at the core. The mechanism is not fraud. It is aggregation. A clerk knows the exception rate and softens it. The first-line manager rounds the softened version up. The PMO averages it into a status color. The steering committee receives the color. Each step is locally rational, each person is protecting a reasonable interest, and the sum is a decision-maker who is structurally the last to learn the one thing the floor knew first.
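The aggregation chain can be sketched in a few lines of Python. Everything in it is hypothetical: the unit names, the exception rates, the size of the clerk's softening, the manager's rounding rule, and the color thresholds are invented for illustration and describe no real program.

```python
import math

# HYPOTHETICAL exception rates per unit: percent of work handled by workaround.
FLOOR_TRUTH = {"order_entry": 4, "scheduling": 6, "billing": 5, "payroll": 38}

def clerk_report(exception_pct):
    """Clerk softens the figure. ASSUMPTION: reports half the real rate."""
    return exception_pct / 2

def manager_report(exception_pct):
    """Manager turns the rate into a readiness score and rounds it up to the next 5 points."""
    readiness = 100 - exception_pct
    return math.ceil(readiness / 5) * 5

def pmo_color(readiness_scores, green_at=90, amber_at=75):
    """PMO averages the unit scores into a single color. Thresholds are ASSUMPTIONS."""
    avg = sum(readiness_scores) / len(readiness_scores)
    if avg >= green_at:
        return avg, "green"
    if avg >= amber_at:
        return avg, "amber"
    return avg, "red"

# Each unit's figure passes through clerk, then manager, before the PMO sees it.
reported = [manager_report(clerk_report(pct)) for pct in FLOOR_TRUTH.values()]
avg, color = pmo_color(reported)
worst_unit = max(FLOOR_TRUTH, key=FLOOR_TRUTH.get)

print(f"steering committee sees: {color} (average readiness {avg:.2f})")
print(f"floor knows: {worst_unit} runs {FLOOR_TRUTH[worst_unit]}% of its work by exception")
```

With these invented inputs the committee sees green at an average readiness of 96.25 while one unit handles 38 percent of its work by exception. No step in the chain lies outright; the signal is lost in the softening, the rounding, and above all the averaging.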
Standish gives this a measurable handle that I find more useful than any survey question: user involvement is the top success factor, and it is checkable by asking a frontline user for an episode. Were you asked? Did anyone watch you work? Did your exception get into the design? Beyer and Holtzblatt established the reason this works in their contextual inquiry method[6]: people cannot accurately describe their own routine work from memory, but they can show it or narrate a specific recent instance. So a project that gathered requirements from documents and workshops, rather than from watching the work, has a known statistical defect, and the defect is invisible until cutover.
“Software issues are usually a symptom of inadequate change management and skipped discovery, not the root cause of failure.”
The clinical side has the cleanest evidence that the organization, not the software, is the variable. The KLAS Arch Collaborative has run more than 700,000 standardized EHR experience surveys[7] across 300-plus healthcare organizations, and the headline finding is that the same EHR scores wildly differently from one organization to the next, which is only possible if the determinant is local. The cross-cuts are specific: physicians reporting six or more hours of weekly after-hours charting are about twice as likely to report burnout, and physicians without adequate training are 3.5 times more likely to report a poor experience. The lived metric here is pajama time, the documentation that gets done after the shift, and it never appears in a status report because usage logs count clicks, not the hour at 10pm when the chart finally gets closed.
Why the truth stays on the floor
The deepest part of the problem is that the workarounds that predict a failed cutover are rational, and people will only narrate them under the right conditions. Ross Koppel's five-hospital study of barcode medication administration[8] is the canonical evidence: observed overrides on 4.2 percent of patients and 10.3 percent of medications, fifteen distinct workaround types, thirty-one distinct causes. His finding that matters for any discovery effort is that staff demonstrate these workarounds proudly when asked about a specific shift, and conceal them when asked to confess rule-breaking. The framing of the question determines whether the data exists at all.
And the spreadsheets deserve their own warning, because they are treated as scratch paper when they are the system of record. Raymond Panko's spreadsheet-error research[9] is unambiguous: more than 90 percent of operational spreadsheets contain errors, the average cell error rate runs near 3.9 percent, and users put the odds their own sheet contains an error at about 18 percent when the measured figure is 86 percent. In most legacy environments the shadow spreadsheet is the de facto application, full of business rules that exist nowhere else, and every conversion plan should treat it as untested code, because that is what it is. The to-be design, the new system as configured, gets built against the documented as-is. The work runs on the as-configured-by-a-clerk-in-2014, and the gap between them is where day-one failures live.
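To see how a cell error rate near 3.9 percent turns into near-certainty at the sheet level, a rough calculation in Python helps. It assumes that errors in different cells are independent, which is a simplification of mine and not a claim from Panko's research, and the cell counts are arbitrary.

```python
def p_sheet_has_error(cell_error_rate, n_cells):
    """Probability that at least one of n_cells is wrong, ASSUMING independent errors."""
    return 1 - (1 - cell_error_rate) ** n_cells

CELL_ERROR_RATE = 0.039  # from the text: average cell error rate near 3.9%

for n in (10, 50, 100, 500):  # arbitrary illustrative sheet sizes
    p = p_sheet_has_error(CELL_ERROR_RATE, n)
    print(f"{n:>4} cells -> {p:.1%} chance of at least one error")
```

Under that independence assumption, a sheet with only 50 cells at risk already has about an 86 percent chance of containing an error, and one with 100 has about 98 percent. A shadow spreadsheet carrying real business rules is rarely that small.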
[Figure source: Raymond Panko, "What We Know About Spreadsheet Errors" (University of Hawaii) and EuSpRIG.]
Put the threads together and they converge on one mechanism. The information that predicts whether a program is ready does not live in the data systems. It lives in the heads of order-entry clerks, schedulers, super-users, the one engineer who maintains the nightly job, the legacy administrator near retirement, and the developers below the team leads who know when the date actually died. It stays there because every channel built to collect it punishes honesty. Write the real exception rate and you indict your own unit. Log the defect and you slip the gate your manager promised. Tell the steering committee the date is dead and you are the one who killed it. So the clerk softens, the manager rounds, the PMO averages, and the committee makes the most expensive decision in the program, go or wait at cutover, on a number that is wrong by construction.
The grim part is that the recovery move is not exotic. Every named expert above is describing the same thing underneath the framework: get a neutral party to ask a specific person about a specific recent episode, somewhere the answer cannot be used against them. Koppel's observed shift. Beyer and Holtzblatt's narrated task. Panorama's independent assessment that the implementing team cannot run on itself. The answer is not unknowable. It is unasked, because the only channels most programs have built are the ones the floor learned to lie into.
Which points at the kind of instrument the evidence has been describing for thirty years and few have operationalized at scale: a neutral, confidential conversation, anchored to a real recent shift rather than a status request, run for enough people that no answer traces to one person, and read back at the altitude where each person actually knows something. That is the instrument we are building at Latent Variables. The base rates already told us where the failures sit and what causes them. The open problem was only ever reaching the floor before the cutover date arrives.
REFERENCES
- 1. Standish Group, CHAOS Report (1994): success/challenged/cancelled rates and the user-involvement and complete-requirements success factors; 2015 refresh via InfoQ. personal.utdallas.edu/~chung/SYSM6309/chaos_report.pdf
- 2. Panorama Consulting Group, ERP Implementation Rescue and project-recovery practice; published position that software issues are symptoms of skipped discovery and weak change management. www.panorama-consulting.com/erp-implementation-rescue
- 3. McKinsey and Oxford (BT Centre), "Delivering large-scale IT projects on time, on budget, and on value" (study of 5,400+ large IT projects): 45% over budget, 56% less value, 17% threaten the firm. www.mckinsey.com/capabilities/tech-and-ai/our-insights/delivering-large-scale-it-projects-on-time-on-budget-and-on-value
- 4. Bent Flyvbjerg et al., "The Empirical Reality of IT Project Cost Overruns": fat-tailed cost risk, ~18% of projects over 50% overrun, tail averaging 447%. arxiv.org/pdf/2210.01573
- 5. Eric Kimberling, Third Stage Consulting, "Lidl's ~600M SAP disaster": operating-model valuation conflict and customization substituting for change management. www.thirdstage-consulting.com/lidls-600-million-sap-disaster
- 6. Hugh Beyer and Karen Holtzblatt, Contextual Design (1997), contextual inquiry chapter: people cannot describe routine work from memory; they can show it or narrate a recent episode. courses.cs.washington.edu/courses/cse440/15wi/readings/ContextualInquiry-BeyerHoltzblatt1997.pdf
- 7. KLAS Arch Collaborative: 700,000+ standardized EHR experience surveys across 300+ organizations; after-hours charting and training cross-cuts; JAMIA 2021 burnout study. academic.oup.com/jamia/article/28/5/960/6242740
- 8. Ross Koppel et al., JAMIA (2008), five-hospital BCMA workaround study: overrides on 4.2% of patients and 10.3% of medications; 15 workaround types, 31 causes. todayshospitalist.com/study-finds-big-gaps-in-barcode-safety
- 9. Raymond Panko, "What We Know About Spreadsheet Errors" (University of Hawaii) and EuSpRIG: >90% of operational spreadsheets contain errors; ~3.9% cell error rate. panko.shidler.hawaii.edu/SSR/Mypapers/whatknow.htm
- 10. TSB Bank, Slaughter and May independent review (2019): 4,424 open defects at go-live, one of two data centers tested; via Computer Weekly. www.computerweekly.com/news/252474170/TSB-programme-pulled-apart-in-report-on-IT-meltdown
- 11. Queensland Health Payroll System Commission of Inquiry (2013): 24,000+ rule combinations, go-live despite known defects, ~A$1.2B cost. cabinet.qld.gov.au/documents/2013/Aug/Health%20payroll%20response/Attachments/Report.pdf
- 12. Office of the Auditor General of Canada, Phoenix Pay reports: ~2,700 clerks laid off before stabilization; ~40% of retro increases not processable automatically. www.oag-bvg.gc.ca/internet/English/parl_oag_201711_01_e_42666.html