Field Notes
Brief, real-world notes on how software behaves in production. Failure modes, operational drift, and the hidden complexity behind day-to-day operations.
Ford had to hire back former engineers to fix mistakes made by its automated systems
I didn't expect to see reversals like this for a bit longer:
“Mistakenly, we thought that by just introducing artificial intelligence and adjusting the design requirements that we had, that that would produce a high-quality product,” said Charles Poon, VP of vehicle hardware engineering, in a briefing this week with reporters.
Reality is sinking in. The dream of being able to fire competent employees and replace them with a chat bot is fading.
This is why having a competent, experienced steward for your systems matters so much. If the person at the controls doesn't know what they're doing and is supported only by AI, they will inevitably introduce issues that either can't be fixed or are extremely expensive to fix later. Caught too late, those issues can lead to catastrophic failures.
When Claude changed, everything changed: Managing AI blast radius in production
A good write-up on over-trusting AI systems to make the correct decision:
LLM-backed systems break this assumption. The component that produces your output is not under your control. You cannot diff a model version bump from 4.0 to 4.5. It is a wholesale replacement of the functionality on which your system depends.
This is what we mean by an infinite blast radius: a change whose downstream effects cannot be enumerated in advance because the input space (natural language) and the failure modes (anything the model might do differently) are both unbounded.
This is why it's unwise to put AI/LLMs into environments where predictable, consistent outcomes are necessary. A big part of software engineering in the post-AI age is knowing when AI is the right tool to pull out of the drawer and when it is best reserved for work that doesn't require certainty of output or outcome.
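One way to narrow (not eliminate) that blast radius is to pin the model version and refuse to move off the pin until a candidate passes a fixed set of known cases. The sketch below is only an illustration under stated assumptions: `PINNED_MODEL`, `GoldenCase`, `passes_golden_set`, `choose_model`, and the `call_model(model, prompt)` callable are hypothetical names, not part of any vendor's API. A golden set covers only the inputs you thought to write down, so the unbounded part of the problem remains.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical pinned identifier kept in config; never a floating "latest" alias.
PINNED_MODEL = "example-model-4.0"


@dataclass
class GoldenCase:
    prompt: str
    must_contain: str  # a minimal, checkable expectation for this prompt


def passes_golden_set(
    call_model: Callable[[str, str], str],
    model: str,
    cases: List[GoldenCase],
    min_pass_rate: float = 1.0,
) -> bool:
    """Return True only if the model meets the pass rate on the known cases."""
    if not cases:
        # No evidence is not a pass: refuse to promote an unchecked model.
        return False
    passed = sum(
        1
        for case in cases
        if case.must_contain.lower() in call_model(model, case.prompt).lower()
    )
    return passed / len(cases) >= min_pass_rate


def choose_model(
    call_model: Callable[[str, str], str],
    candidate: str,
    cases: List[GoldenCase],
) -> str:
    """Promote the candidate only if it passes; otherwise stay on the pin."""
    if passes_golden_set(call_model, candidate, cases):
        return candidate
    return PINNED_MODEL
```

In practice `call_model` would wrap whatever client your provider ships. The point of the design is that a version bump becomes a deliberate, checked decision instead of something that happens to you.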
Existential Threat or Leverage: Your Choice
From a post on my personal developer blog about choosing how best to think about AI and how it fits into your work:
The mistake is treating AI like either a savior to be knelt-before or an outright Appetite for Destruction apocalypse.
It's neither. It's a more efficient way to make bad choices worse or good choices better. You can choose to deploy it in a valuable, conscientious way...or you can be the guy who uses it to invent a pyramid scheme and rationality cult.
This is the part nobody wants to hear. AI will not save you from needing standards. It will not save you from learning your craft. It will not save you from understanding customers, systems, tradeoffs, or consequences.
AI is not a magic wand, but it is a valuable tool. In the right hands with the right intent: it can make incredible things possible. But in the wrong hands with the wrong intent? All I can ask is..."have you seen Uncut Gems?"
A helpful reminder of why a system's architecture is often the (invisible) deciding line between a quick solution and a prolonged outage:
When you’re in the throes of an incident that involves an unexpected interaction, this architecture that was built for managing complexity now works against you. Because you’ve built an analysis solution but you’re now faced with a synthesis problem. You need to understand how the pieces all normally fit together to function in order to determine what is going wrong with the system right now. You’ve optimized to avoid requiring anybody to understand how the whole thing works, but now the whole thing isn’t working, and no one person knows how the whole thing works.
This quote stood out. It's why a stewardship approach to software makes sense: instead of trying to wrangle a coherent picture across several minds, you have a singular eye that knows and understands the whole system. And more importantly, that system is built in a way that makes stewardship effortless, rather than as a maze of clever-but-kludgy half-decisions that get bundled into what we soften down to "technical debt."
The Newest Instagram "Exploit" is the Goofiest I've Seen
This is why replacing human support with “AI support” is not some harmless efficiency upgrade:
All the attacker needs to kick this off is your account username. Then, they hop on a VPN or proxy close to your city so Instagram's security algorithms don't suspect a thing. (You can quite easily get this from your public profile or "About" section or a hundred other ways.) Once it looks like the request is coming from the correct region, they tell the Meta support AI that the account is hacked and ask it to send the verification codes to an arbitrary email address they control.
The real failure here is not “AI.” The failure is handing an automated system a high-privilege recovery flow without hard, boring, conservative guardrails around identity, ownership, notification, escalation, and rollback (read: check the database to see if the account requesting the reset is who they say they are).
This is what happens when companies confuse support deflection with operational competence. They remove the expensive human layer, replace it with a chat box, and then act surprised when the chat box opens the vault door and says "come on in!"
Pure insanity.
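The "hard, boring, conservative" part can be small. As a minimal sketch, assuming a hypothetical account record with a set of already-verified email addresses, the rule is that the automated layer may only send a code to an address already on file, and everything else goes to a human. `Account`, `RecoveryRequest`, `Decision`, and `decide_recovery` are illustrative names, not anything from Meta's systems, and notification and rollback are left out for brevity.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Decision(Enum):
    SEND_CODE = "send_code"
    ESCALATE_TO_HUMAN = "escalate_to_human"


@dataclass
class Account:
    username: str
    verified_emails: Set[str] = field(default_factory=set)


@dataclass
class RecoveryRequest:
    username: str
    destination_email: str


def decide_recovery(account: Account, request: RecoveryRequest) -> Decision:
    """Allow a code to go only to an address already on file for the account."""
    if request.username != account.username:
        return Decision.ESCALATE_TO_HUMAN
    # Compare addresses case-insensitively and ignore stray whitespace.
    known = {email.strip().lower() for email in account.verified_emails}
    if request.destination_email.strip().lower() in known:
        return Decision.SEND_CODE
    # An unknown destination is never approved by the automated layer.
    return Decision.ESCALATE_TO_HUMAN
```

The check is deterministic and sits outside the chat model, so no amount of persuasive conversation can talk it into a different answer.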
A post from my personal developer blog. Thoughts about our collective obsession with speed, moving fast, and rarely if ever considering the consequences:
People rush through the part where understanding is supposed to happen, then spend ten times longer cleaning up the consequences. The original rushed work gets recorded as “fast.” The cleanup gets recorded as “unexpected.” The rework gets recorded as “iteration.” The confusion gets recorded as “alignment.” The preventable failure gets recorded as “learning.”
These things happen because they're rewarded, not because they're "correct" or "good." More often than not, speed is just a convenient excuse, not an actual strategy.
Good to see popular consensus starting to shift in this direction:
I’m calling it now, the adoption of AI agents into software development will be one of the most costly mistakes in the field’s history. Agents cannot program, and it’s taking longer and longer to realize that they can’t. They are a highly sophisticated statistical model designed to mimic the distribution of programming. The output is broken, but in a way that’s getting harder and harder to detect. Which is exactly what you’d expect from an increasingly accurate statistical model.
I wouldn't go so far as to say that LLMs can't program at all (my approach allows me to see both good and bad outputs from AI/LLMs), but rather, they can't program systems. The dream of typing in a few sentences and getting back an app is just that: a dream. In order to ship real, viable, production-grade software, the only way to get as close to "certain" as you can is to have a skilled human at the controls.
Otherwise, you're just gambling.
Why Japanese companies do so many different things
Really enjoyed this read on the inner workings of Japanese companies.
But Aoki points out that the horizontal coordination embodied by the andon cord doesn’t work without other practices as well.
For example: horizontal coordination requires that workers know each other’s jobs, since a worker who spots a problem in one area of the line can only act on it if he understands what that area is supposed to be doing.
But in order to understand each other’s jobs, workers cannot be specialized: they have to rotate across different workplace functions to the point where they’re familiar with much of the plant’s operations.
One section stood out: the andon cord only works because workers understand more than their own narrow station. They rotate across functions, learn adjacent jobs, and develop enough context to recognize when something elsewhere in the system is drifting. That is the part most organizations miss.
Stability does not come from everyone being highly specialized and locally efficient. It comes from enough people understanding how the whole operating environment fits together that they can notice failure before it becomes institutional. Stewardship of software systems is no different.
A steward is not just another specialist assigned to one slice of the system. The work requires perspective across the production environment: the software, the workflows, the integrations, the edge cases, the informal workarounds, and the points where small inconsistencies turn into operational risk.
The H-firm exists to make money, or rather to return money to shareholders; but the J-firm [Japanese-firm], run by its employees and largely indifferent to the interests of shareholders, exists simply to continue existing.
That’s why Japanese companies are so protean and willing to change what they do. Nintendo was founded in 1889 as a maker of handmade playing cards; in the 1960s, it was pushed out of the playing cards game by a wave of competition; and it spent several years experimenting with new markets—taxi services and instant rice, though contrary to the rumors not love hotels—before finding its way to video games.
The second section makes the same point from a different angle. The Japanese firm, as described here, exists to continue existing. That orientation changes how adaptation works.
Nintendo could move from handmade playing cards, to experiments in other markets, to video games because the company’s continuity mattered more than attachment to a single product category.
That kind of adaptability depends on organizational memory. A company can change what it does when it understands what it is, how it operates, and what must remain stable while everything else changes.
That is the deeper lesson: long-term stability is produced not by rigidity, but by continuity, shared context, and enough whole-system understanding to adapt without breaking the business.
You can no longer Google the word ‘disregard’
The reality of rushing into a technology: strange second- and third-order effects. This is why jumping on "hype" feels promising at first, but later introduces knock-on effects that can't be easily predicted. In order for a system to achieve stability (while remaining functional), how it's built (process) matters just as much as what it's built with (technology).
The Code Nobody Read Is Already in Production
A good write-up that gets into the "why" of experienced engineers being cautious with how AI is used in their orgs:
Software development is accelerating in a way that has decoupled velocity from review. Code generated by AI tools ships to production without the kind of detailed human scrutiny that production code used to require. This must be the case for productivity gains to be real; otherwise, the bottleneck would simply move from reading to review.
This is the uncomfortable part of the current AI productivity story. The gain often depends on less review, not better review.
That matters because production software should not be judged by how fast it was shipped to production. It should be judged by whether the organization can rely on it, operate it, debug it, change it, and explain its behavior when something goes wrong.
Code that “works on screen” can still be wrong in ways that matter. It can be wrong about state. Wrong about permissions. Wrong about failure. Wrong about money. Wrong about timing. Wrong about the business process it was supposed to support.
Calling that productivity is too generous. In many cases, it is just releasing unexamined behavior into a production system and hoping the blast radius stays manageable (hint: it won’t).
Careless use of AI is tantamount to unscrewing the discharge valve on a tanker truck full of Elmer’s glue and letting it pour all over your operations. Just because you can, doesn't mean you should.