LLM Coding

(I made this image with ChatGPT’s Sora 💅🏼)

The past few years have seen numerous AI companies emerge. Investors have poured trillions of dollars into them hoping for an eventual payday. The hype coming from people working at these companies is so loud it can be deafening!

As a software engineer, it’s become tough to discern what is truly valuable and what might be a waste of time…

The goal of this post is to separate the hype from the practical value of using LLMs while developing software.


📊📚 The State of the Art – Recent Research Highlights

To ground us a bit I want to start by summarizing some key research findings from the last 12 months.

The Good

Stanford Study - Predicting Expert Evaluations in Software Code Reviews

[Image: LLM productivity graph]

This study found significant productivity gains for engineers in the workplace, but with a fair amount of nuance.

  • Complexity and Maturity: The largest gains (30-40%) were seen in low-complexity, greenfield tasks, while high-complexity, brownfield tasks showed the smallest gains (0-10%).

  • Language Popularity: AI was found to be less helpful for low-popularity languages and more effective for popular languages like Python (10-20% gains).

  • Codebase Size: As the codebase grows, the productivity gains from AI decrease significantly due to limitations in context windows and signal-to-noise ratio.

In conclusion, the study found that AI increases developer productivity by an average of 15-20%, but its effectiveness is highly dependent on the specific task, codebase, and language.

The Bad

[Image: LLM productivity graph showing tech-debt drag]

Continuing with the Stanford study above, we should also highlight that code created with LLM tools has issues with accumulating tech debt.

Alongside the productivity gains described above, these tools seem to generate a lot of technical debt, or re-work. So much so that it cuts the productivity gains in half. Even with this friction, they still seem to be a net positive for productivity.

See the YouTube video by the lead author here.

Importing Phantoms: Measuring LLM Package Hallucination Vulnerabilities

All tested models exhibited package hallucination, with rates ranging from 0.22% to 46.15%.

Organizations must carefully balance the trade-offs between model performance, security, and resource constraints when selecting models for code generation tasks.

The Ugly

NANDA - an MIT-led initiative - ROI Study

95% of enterprise generative AI pilot projects fail to deliver a meaningful return on investment (ROI) within 6 months.

Projects often stall in “pilot purgatory” because AI outputs are “confidently wrong,” requiring employees to spend extra time double-checking and correcting the results. This “verification tax” erases any potential ROI.

METR Study

Using AI tools actually slowed experienced open-source developers down by ~20% on average, contrary to their own predictions of a ~20% speedup.

The slowdown was attributed to developers spending more time reviewing AI outputs and prompting, due to AI’s unreliability and lack of implicit context in complex, familiar codebases.

The Takeaway

LLM-based coding tools seem to add the most value when you are working with:

  • type-ahead (where you can digest and verify each line more readily),
  • a prototype,
  • a new feature,
  • a greenfield codebase,
  • a small to mid-size codebase, or
  • a more compact brownfield feature.

Beyond that, their value begins to erode.

Sadly, these tools are not magic 🪄😞

If you are working with:

  • highly ambiguous specs,
  • a large, highly complex codebase written in a low-popularity language, or
  • generative features built on your company’s own bespoke model,

you should ask yourself: “Is using this tool in this context going to help bring about an optimal business outcome?”

Still, at the end of the day, there seem to be clear productivity boosts in many, many projects using these tools.


✍🏼 Anecdotal Experiences – Real‑World Stories

The One‑Line Fix

I can’t tell you how many times I’ve asked an LLM for a regex or SQL query syntax and got a correct solution in seconds, saving hours of time!
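As a hypothetical example of the kind of one-liner I mean (this exact pattern is my illustration, not a transcript of any model's output): a regex for pulling ISO-8601 dates out of a log line.

```python
import re

# Matches bare ISO-8601 dates (YYYY-MM-DD) -- the kind of throwaway
# regex an LLM can hand back in seconds.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

line = "deploy finished 2024-07-01, next window opens 2024-07-15"
print(ISO_DATE.findall(line))  # ['2024-07-01', '2024-07-15']
```

The trick, as the research above suggests, is that a snippet this small is trivial to verify by eye before you trust it.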

LLM-based auto-complete is my favorite feature of all these tools (thank you, Tabnine)! Once in a while you hit the tab key and something surprising appears that saves you days!

AI can tempt veterans to take shortcuts

I’ve seen good engineers rely on Claude like a crutch and stop thinking critically. I’ve even seen staff engineers open PRs with embarrassing levels of AI slop. PRs created with AI must be read and reviewed critically, even by those with more seniority.

Pair‑Programming with an LLM

I love to explore a new framework or an unfamiliar part of a codebase with Claude. It’s great at summarizing a class or feature set. It can also be a very helpful “pair” when you prompt it with smaller questions and actively partner with it.

It is Magic with Boilerplate Code

Recently I used Cursor to help a team transition from Rails ERB templates to React views by generating boilerplate code. I’ve also been able to verbally describe the shape and attributes of a JSON payload to the LLM. I used this to create examples while testing API endpoints.

This can spare engineers from mind-numbing tasks.
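For instance (a made-up payload shape, not from any real API): after describing “a user with an id, an email, and a list of roles” out loud, the model can emit a fixture like this in seconds.

```python
import json

# Hypothetical fixture generated from a verbal description of the payload:
# "a user object with an id, an email, and a list of roles"
example_payload = {
    "user": {
        "id": 42,
        "email": "dev@example.com",
        "roles": ["admin", "editor"],
    }
}

# Serialize it for use as a request body when poking at an endpoint.
body = json.dumps(example_payload)
print(body)
```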

When the Model Hallucinates

Multiple times I have been excited to see an LLM suggest a perfect library or module to complement the feature I’m working on, only to find out it doesn’t exist. 😿
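A cheap guard against this, at least for Python dependencies, is to confirm a suggested package actually resolves in your environment before you build on it (a sketch; `installed_version` is my own hypothetical helper, and the package name below is deliberately fake):

```python
from importlib import metadata

def installed_version(name):
    """Return the installed version of a package, or None if it isn't
    installed -- a quick sanity check against hallucinated dependencies."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

# A hallucinated package simply won't resolve:
print(installed_version("totally-made-up-package-xyz"))  # None
```

For packages you haven’t installed yet, checking the registry (PyPI, npm, RubyGems) by hand before you commit serves the same purpose.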

Wile E. Coyote Effect

Sometimes while working with LLMs you can think you are a based, borg-minded genius.

Then days later… after the LLM-assisted code is deployed to production, you discover a major untested regression.

You realize at that moment that you are in fact Wile E. Coyote standing off the edge of a cliff.


🛠️🧰 Practical Tips – Making LLMs Work for You

Below are some tips I’ve compiled while working with AI the past few years.

| Tip | Why It Helps | Quick Example |
|-----|--------------|---------------|
| Focus on Smaller Problems | The larger your codebase, the more the context window is degraded. | Build boilerplate for a form that updates the user model. |
| Start with a Clear Prompt | Reducing ambiguity yields more accurate snippets. | "Generate a Python function that parses ISO-8601 dates." |
| Validate with Tests | Catch hallucinations early. | Write a unit test that must pass before accepting the LLM's suggestion; write an e2e test before you start the feature. |
| Always Verify Packages | LLMs will confidently give you a magic package; sometimes it doesn't exist, and in the worst case it can link you to something with security vulnerabilities. | Google the NPM package before you commit! |
| Assume Code Was Written by a Junior | This challenges you to think critically about the context (LLMs don't) and surfaces issues. | Read every line in your editor as if a junior engineer wrote it. |
| Iterate & Refine | Treat the model as a conversational partner. | Follow up: "Add error handling for invalid strings." Apply Fowler's Refactoring principles to evaluate trade-offs. |
| Leverage Contextual Files | Upload relevant code files so the model sees the surrounding architecture; you'll get better results. | Use OpenAI's file upload feature with spreadsheets, integrate your issue tracker and wiki with MCP, and use Cursor rule files in each directory. |
| Documentation & Knowledge Transfer | Living documentation helps teams move quicker. | Have the LLM auto-generate API docs, inline comments, and migration guides. |
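To make the "Validate with Tests" tip concrete, here's a minimal sketch (the function and its behavior are my own illustration, not any particular model's output): write the test first, paste in the LLM's suggestion, and only accept it once the test passes.

```python
import unittest
from datetime import datetime

# Suppose the LLM suggested this for "parse an ISO-8601 date":
def parse_iso_date(value):
    return datetime.strptime(value, "%Y-%m-%d")

class TestParseIsoDate(unittest.TestCase):
    def test_valid_date(self):
        self.assertEqual(parse_iso_date("2024-07-01"), datetime(2024, 7, 1))

    def test_invalid_input_raises(self):
        # LLM code often skips error paths; pin them down explicitly.
        with self.assertRaises(ValueError):
            parse_iso_date("not-a-date")

if __name__ == "__main__":
    unittest.main()
```

The tests act as a contract: if a later "improved" suggestion from the model breaks them, you find out before review, not after deploy.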

🛠️🧰 Practical Tips – Good Engineering Practices

I’d like to add some good engineering practices here as well. They are helpful for all teams, especially ones that use LLMs.

| Tip | Why It Helps | Quick Example |
|-----|--------------|---------------|
| Design Over Implementation | A good upfront design leads to long-term development advantages. | LLMs can help generate scaffolding, but they shouldn't dictate high-level design. |
| Do Smoke Tests Often | Catch issues early on. | Run the app in the background and click through it every 15 minutes. |
| Linting Is More Important than Ever | Generated code is prone to small, mechanical mistakes that linters catch cheaply. | Harness tools like ESLint, Pylint, and RuboCop, deeply and expansively. |
| Stay Security-Aware | Prevent accidental inclusion of secrets or vulnerable patterns. | Run static analysis on generated code. |
| Keep Pull Requests Focused | Always assume the code can and will have flaws; the more slop you let into the codebase, the harder it will be to clean up. | Agree as a team to keep PRs to 15 files or fewer, or 750 lines or fewer. |

🤖👨🏽 Closing Thoughts – The Future of Human‑LLM Collaboration

Returning to the original question of this post: is AI “Autopilot or Co-Pilot?”

Personally, I think AI currently performs a role closer to co-pilot. It’s clear that software engineering will never be the same again. However, LLMs are augmentative, not a replacement for deep expertise. We don’t currently have fully agentic programmers, and we can’t get away with just vibe coding (at least not yet!).

In many ways, the role of the senior/staff engineer is more important than ever in this context. If we are using these tools for productivity boosts, we must also double down on ensuring our codebases are maintainable, performant, secure, and well-architected.

I encourage folks to experiment with these tools and evolve your best practices. As new tools and models come out, we should continue to engage with them and find their best purposes.

Happy coding! 🎉
