Vibe Coding in Healthcare: What the Prototype Proves and What It Doesn’t

Josh Koenig

Last updated: May 2026
By: Josh Koenig, Product & Strategy, Sidebench

Healthcare buyers now arrive at the vendor conversation with a working prototype in hand. They built it themselves, or their innovation team did, using Claude, Cursor, Lovable, or Bolt. The screens click through. The flows look right. The demo lands in the boardroom.

This is new in 2026, and it changes what the first thirty days of an engagement should look like. The right vendor response is neither “we’ll redo it” nor “great, we’ll build on it.” It’s a structured read of what the prototype proves, what it doesn’t, and what has to be true for the production build to work.

What a vibe-coded prototype actually proves

A vibe-coded prototype is a design artifact, not an engineering artifact. The distinction matters because everything downstream (scope, budget, timeline, vendor selection) hinges on which one you think you have.

A design artifact proves intent. It shows what the workflow should feel like, where the friction points live in the current process, and what the team has converged on after weeks of internal debate. It captures decisions that would otherwise sit scattered across Figma comments, Slack threads, and someone’s notebook, and it pulls those decisions into one place that a vendor or a new engineering team can read at a glance. That is genuine value, and it accelerates everything that comes after.

An engineering artifact proves something different: that the application actually works. Data persists across sessions, identity gets verified against a real provider, integrations make real calls into real systems, errors surface to operators in ways they can act on, sessions time out per policy, and audit logs capture what compliance requires. The application would survive contact with a real patient, a real EHR, and a real audit.

The two artifacts look almost identical from the outside. That’s the problem.

Two recent reads on healthcare prototypes show where the spectrum runs.

A longevity-tech and cardio-cognitive prevention clinic brought us a coded prototype of their patient companion application: twenty-six screens with a clickable intake-to-dashboard journey and wearable integration flows that looked complete. The engineering reality underneath was no persistence, no real authentication, and wearable “integrations” that were three constants sitting in a config file. The product decisions the prototype encoded were genuinely good (the pillar-based health-metric model, the modular dashboard, the order of operations from intake to first reading), so we carried those forward as specification for the production build and discarded the code. The prototype had done its job, which was to specify the product rather than to build it.

A patient engagement platform for a multi-specialty group sat at a different point on the spectrum. It had the same surface-level polish, a multi-screen application that demoed cleanly, but the design system and frontend components were structured well enough that a production team could carry them forward as-is. The service layer, EHR integrations, authentication, persistence, and audit logging did not exist in any form a vendor could build on. The prototype contributed real value above the service boundary and nothing below it.

The pattern across both engagements, and across every other healthcare prototype we’ve read recently, is consistent enough to call it a rule: in a vibe-coded healthcare prototype, the salvage line almost always falls above the service boundary. UI components, design tokens, navigation structures, and workflow logic can carry forward, sometimes substantially. The service layer, integrations, auth, persistence, compliance scaffolding, and observability rarely do. The AI tools that generate prototypes are strongest at the layer closest to the user and weakest at the layer closest to the data.

That gives a useful working answer to what a vibe-coded prototype is genuinely good for. It locks in the workflow, so the team has stopped arguing about screen order and information hierarchy. It functions as an executable design spec that a new engineering team can read faster than they could read a PRD. And it accelerates vendor onboarding, because the vendor’s first week is faster when the product is already specified.

What it does not do is establish that the application works. That distinction is where the next thirty days of the engagement live.


Four patterns where the gap shows up

Across the healthcare prototypes we’ve reviewed in the last several months, the same four patterns appear together, and they appear as a set rather than in isolation. Each one looks innocuous on its own, which is part of why they are easy to miss.

The application looks done, so it feels done. AI coding assistants are excellent at the surface layer: the animations, the polished UI, the realistic fake data, the navigation flows that complete cleanly end to end. The application feels finished to anyone who is not specifically looking at the codebase, and that surface impression is what shapes the budget conversation that follows. The gap between visual fidelity and functional fidelity is invisible from the outside, and that gap is where almost all of the real engineering work lives.

“Sandboxed and functional” gets redefined silently. The product owner believes the application is running against a mock service layer, with seed data and stubbed integrations that a vendor can swap out for production calls. What actually exists, in most cases we’ve reviewed, is hardcoded demo content sitting in a single fixtures file, with no service layer to swap into. The AI satisfied the visual goal completely without ever needing to build the underlying structure, because the visual goal never required it. The product owner did not know to ask. The AI did not volunteer.

The architecture becomes real in the product owner’s mind before it’s real in the code. Vibe-coding tools are particularly good at producing architecture diagrams in chat: a clean four-layer system with a frontend, a service interface, a backend, and EHR adapters, described with real precision. The description is coherent and the architecture document is accurate as a design, but the system it describes has not been built. The AI helped describe a system, then helped build screens that look like that system exists, without ever building the foundation the screens assume.

Effort is measured in screens, not systems. A non-technical product owner naturally tracks progress by what they can see: screen count, flow completeness, whether the UI matches the design intent. The invisible work (authentication, persistence, error handling, compliance scaffolding, integration contracts, observability) never enters that ledger. A prototype with twenty-six screens and no infrastructure is a more common shape than the reverse, because screens are what the team built toward and infrastructure is what the tooling never required them to build.
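The second pattern, a swappable mock service layer versus hardcoded fixtures, is easier to recognize with a small example. The sketch below is illustrative only: `SchedulingService`, `MockSchedulingService`, `render_upcoming`, and the seed values are hypothetical names, not taken from any reviewed codebase. It shows a screen that depends on an interface next to a screen that reads a constant directly.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Appointment:
    patient_id: str
    slot: str  # ISO 8601 start time, kept as a string for brevity


class SchedulingService(Protocol):
    """The seam: screens depend on this interface, not on a data source."""

    def list_appointments(self, patient_id: str) -> list[Appointment]: ...


class MockSchedulingService:
    """Seed data behind the interface; replaceable without touching screens."""

    def __init__(self) -> None:
        self._seed = [Appointment("demo-patient", "2026-05-01T09:00:00Z")]

    def list_appointments(self, patient_id: str) -> list[Appointment]:
        return [a for a in self._seed if a.patient_id == patient_id]


def render_upcoming(service: SchedulingService, patient_id: str) -> list[str]:
    """Stand-in for a screen: it only knows about the interface."""
    return [f"Visit at {a.slot}" for a in service.list_appointments(patient_id)]


# The hardcoded shape: the screen reads a constant directly, so there is
# no service object to replace when production calls are needed.
HARDCODED_FIXTURES = [{"patient": "demo-patient", "slot": "2026-05-01T09:00:00Z"}]


def render_upcoming_hardcoded() -> list[str]:
    return [f"Visit at {row['slot']}" for row in HARDCODED_FIXTURES]
```

Both versions render the same screen in a demo. Only the first gives a vendor somewhere to plug in a production implementation.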

The four patterns become expensive when they shape the next conversation. A budget anchored to “the prototype is mostly done” under-funds the production program. A vendor who scoped against the visible work either walks away or signs the contract and discovers the gap during sprint two. The industry data is consistent with what we see in the codebases themselves: recent research found 25.1% of AI-generated code contains confirmed vulnerabilities across the major models (AppSec Santa, 2026), and Georgia Tech’s Vibe Security Radar tracked 35 CVEs traced to AI-generated code in March 2026 alone, up from six in January.


Four diagnostics you can run yourself

Before the vendor conversation, four self-diagnostics separate prototypes that are closer to ready from prototypes that are further than the team thinks. None require an engineer. All of them surface the gap between how the prototype looks and how it is actually structured. Skipping them anchors the vendor conversation to wrong assumptions, which is what makes the eventual scope correction expensive.

Each diagnostic is a prompt you can paste directly into Claude Code, Cursor, Codex, or whichever AI coding assistant has access to the prototype’s repository.

1. Pressure-test the confirmations

After any user action in the prototype (booking, submitting, saving), the app shows a confirmation. The question is whether anything actually happened, or whether the confirmation is a UI state change against in-memory data that disappears on reload. The fastest way to find out is to ask the AI directly:

Pick the most important user action in this app (booking,
submission, save). Trace exactly what happens when that
action fires. Tell me every place the data is read from
or written to, what would happen if the user closed the
app immediately after, what would happen if the network
dropped, and whether any of this is real or hardcoded.
Be specific.

If the answer involves “state” without “database” or “server,” the action is not real. The signal is in the vocabulary the AI uses to describe the action.
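For readers who want to see the difference between “state” and “database” concretely, here is a minimal sketch. It assumes a local SQLite file as a stand-in for whatever datastore a real build would use, and the class names are hypothetical. A “reload” is simulated by constructing a fresh object.

```python
import os
import sqlite3
import tempfile
from contextlib import closing


class InMemoryBookings:
    """What a prototype often does: 'save' only mutates process state."""

    def __init__(self) -> None:
        self._rows: list[str] = []

    def save(self, slot: str) -> None:
        self._rows.append(slot)

    def count(self) -> int:
        return len(self._rows)


class SqliteBookings:
    """A minimal persisted alternative: the write lands in a file on disk."""

    def __init__(self, path: str) -> None:
        self._path = path
        with closing(sqlite3.connect(self._path)) as conn, conn:
            conn.execute("CREATE TABLE IF NOT EXISTS bookings (slot TEXT NOT NULL)")

    def save(self, slot: str) -> None:
        # The inner 'conn' context commits on success and rolls back on error.
        with closing(sqlite3.connect(self._path)) as conn, conn:
            conn.execute("INSERT INTO bookings (slot) VALUES (?)", (slot,))

    def count(self) -> int:
        with closing(sqlite3.connect(self._path)) as conn:
            return conn.execute("SELECT COUNT(*) FROM bookings").fetchone()[0]


def simulate_reload_demo() -> tuple[int, int]:
    """Save once with each store, then count from a freshly built object."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bookings.db")

        InMemoryBookings().save("2026-05-01T09:00:00Z")
        in_memory_after_reload = InMemoryBookings().count()

        SqliteBookings(path).save("2026-05-01T09:00:00Z")
        persisted_after_reload = SqliteBookings(path).count()

    return in_memory_after_reload, persisted_after_reload
```

Both stores can drive an identical confirmation screen. Only the second still has the booking after the object that saved it is gone.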

2. Ask for the folder structure and what’s in it

A production codebase has a recognizable shape: API clients, data models, authentication modules, error handling, validation. A vibe-coded prototype often has a much simpler shape, sometimes a single fixtures file holding the entire data layer. Asking for the folder structure surfaces the gap in under a minute:

Show me the folder structure of this project. For each
folder and file, briefly say what it does, mark whether
the implementation is real or mocked, and flag any file
with hardcoded data that should come from a service. Then
tell me what a senior engineer would say is missing if
they took this codebase to production.

The closing question is what gets the most useful answer. AI assistants are better at honest self-assessment when explicitly asked for it than at volunteering one.
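If someone on the team is comfortable running a script, a rough cross-check on the AI’s answer is to list files whose names suggest demo data. The sketch below is a heuristic under stated assumptions: the name patterns, file suffixes, and skipped directories are guesses about a typical web prototype, not a standard, and a clean result does not mean the data layer is real.

```python
import re
from pathlib import Path

# Assumed heuristics: name fragments that often indicate demo or mocked data.
SUSPECT_NAMES = re.compile(r"(fixture|mock|seed|dummy|sample)", re.IGNORECASE)
SOURCE_SUFFIXES = {".ts", ".tsx", ".js", ".jsx", ".json", ".py"}
SKIP_DIRS = {"node_modules", ".git", "dist", "build"}


def find_suspect_files(root: str) -> list[str]:
    """Return relative paths of source files whose names suggest mocked data."""
    base = Path(root)
    hits = []
    for path in base.rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        rel = path.relative_to(base)
        # Ignore dependencies and build output; they are not the team's code.
        if any(part in SKIP_DIRS for part in rel.parts):
            continue
        if SUSPECT_NAMES.search(path.name):
            hits.append(rel.as_posix())
    return sorted(hits)
```

Calling `find_suspect_files(".")` from the repository root returns a list of candidate files to ask the assistant about by name.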

3. Trace one workflow all the way through

A prototype with six workflows that all terminate in fake confirmations is less useful than a prototype with one workflow that actually persists data. Depth on one feature surfaces the real complexity; breadth across many features creates an illusion of completeness. Before extending the prototype to a seventh screen, the more useful move is to ask what it would take to make one workflow real:

Pick the most important workflow in this app. For that
one workflow only, tell me everything that would need to
change for it to work in production with real users:
persistence, auth, error handling, edge cases, partial
states, input validation. In order of difficulty, hardest
first.

The hardest-first ordering matters. AI assistants left to their own ordering tend to lead with the visible work and bury the structural work. Asking for difficulty order surfaces what would actually take the most time.

4. Ask for a production-readiness read

The fourth diagnostic is the closest to what a senior engineering review would do. Ask the AI to write the audit for a non-technical sponsor, and ask it not to soften the findings.

Audit this codebase as if you were an engineer hired to
take it to production. Tell me what looks production-ready,
what is mocked or hardcoded, what is missing entirely (auth,
persistence, error handling, compliance scaffolding), what
the architecture would actually need to be, and your honest
estimate of how much work remains as a percentage of total
build. Write the audit for a non-technical sponsor or
product owner. Don't soften the findings.

The percentage estimate is usually directionally accurate and almost always lower than the product owner’s intuition. The gap between the two is the size of the conversation that has to happen before the vendor scope gets written.

These four diagnostics are not a substitute for a senior engineering review, but a non-technical product owner can run all four in under an hour. The output is enough to enter the vendor conversation knowing what’s actually built versus what looks built.


The three things the codebase could be

Once the diagnostics have surfaced what’s actually built, the next question is what to do about it. There are three honest answers, and the right one depends less on the prototype itself than on where the prototype sits relative to the production target.

Build forward on the existing code. Rare in healthcare. The codebase would need to have a real service layer, real persistence, real authentication, and an architecture that anticipates the regulatory burden ahead. Most vibe-coded prototypes lack all four. When the code does qualify for this path, it usually means the prototype was built by someone with an engineering background using AI as an accelerator rather than as the engineer. Worth checking for. Worth being honest when it’s not the answer.

Salvage above the service boundary, rebuild below. The most common outcome for healthcare prototypes with reasonably structured frontends. The design system, the components, the navigation, and the workflow logic carry forward. The service layer, the integrations, the auth, the persistence, the audit logging, and the compliance scaffolding get rebuilt from the ground up. The multi-specialty group example from Section 1 sat here. So do most prototypes from teams who chose their AI tooling carefully and gave it enough direction.

Use as specification, rebuild end to end. The right answer when the code itself isn’t structured well enough to carry forward, but the product decisions encoded in the screens are good. The longevity-tech and cardio-cognitive prevention clinic from Section 1 sat here. The prototype did its job by specifying the product. The vendor builds the production application from scratch, using the prototype as the design and workflow reference. Faster than starting from nothing. Slower than building forward.

Which of these three paths fits a given codebase is a judgment call. It depends on how the code is structured, what regulatory burden the application will carry, how much custom integration work is ahead, and how much time the team has before they need paying users. Two prototypes that look identical from a demo perspective can sit on different paths once a senior engineer reads the underlying code. How that read happens is the subject of the next section.


A structured codebase review, and what it covers

The diagnostics in Section 3 are the smallest possible self-assessment. A structured review is what a vendor should do before scoping a production build, and it answers questions the self-diagnostics cannot.

The version Sidebench runs combines AI-assisted pattern matching with senior engineering review at two checkpoints, but the eight dimensions below apply regardless of who does the read. The AI handles what AI is genuinely good at: pattern-matching across a codebase, cataloging what exists and what doesn’t, citing specific files and line numbers, and surfacing inconsistencies at a scale no single engineer would catch in a comparable time window. Senior engineering review handles what AI is less reliable at: calibrating the severity of findings, recognizing architectural smells that don’t surface in static analysis, applying domain-specific judgment around healthcare regulations, and signing off on conclusions before they reach a client.

The codebase gets read across eight dimensions: data, identity, integration, UX, quality, operability, security, and architecture.

Every finding is cited to specific files and line ranges. Severity is rated against the production target, not against generic engineering hygiene, which means a finding that would be a minor issue for an internal tool can be a critical one for an application that will handle protected health information.
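One way to picture a finding that is cited to a file and line range, with severity rated against the production target, is the record below. This is a sketch, not Sidebench’s actual format: the field names, the example path, and the escalation rule are all hypothetical, and the rule is deliberately blunt to illustrate the idea.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1
    MAJOR = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Finding:
    dimension: str      # one of the eight review dimensions, e.g. "security"
    file: str           # path within the repository
    line_start: int
    line_end: int
    summary: str
    touches_phi: bool   # does the affected code path handle PHI?
    baseline: Severity  # severity as generic engineering hygiene


def severity_for_target(finding: Finding, target_handles_phi: bool) -> Severity:
    """Assumed rule: a finding on a PHI path is critical for a PHI target."""
    if target_handles_phi and finding.touches_phi:
        return Severity.CRITICAL
    return finding.baseline


# Hypothetical example finding; the path and line numbers are made up.
example = Finding(
    dimension="security",
    file="src/api/session.ts",
    line_start=12,
    line_end=30,
    summary="Session never expires",
    touches_phi=True,
    baseline=Severity.MINOR,
)
```

With this rule, `severity_for_target(example, target_handles_phi=False)` stays at the baseline, while the same finding rated against a PHI-handling target is escalated.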

Senior engineering review enters at two points. The first validates the AI’s findings: spot-checking citations to confirm they hold up, recalibrating severity where the AI was too charitable or too harsh, and adding architectural and domain-specific judgments the AI missed. The second signs off on the final assessment before it reaches the client, with particular attention to anything healthcare-specific: HIPAA technical safeguards, EHR integration quirks, patient identity resolution, audit log requirements, behavioral health carve-outs where they apply.

The output is a structured assessment of what carries forward, what gets rebuilt, and what the team should know before committing to a production build.


A pre-build checklist

Once the codebase has been read and the path forward is clear, seven items need to be settled before the production build begins. None of these are technical decisions in isolation. Each one shapes the timeline, the cost, and the organizational path to launch in ways that are easier to address now than after contracts are signed.

Identify security and compliance gaps. Authentication flows, data handling, audit logging, encryption at rest and in transit, session management. For healthcare, also: HIPAA technical safeguards under 45 CFR 164.312, BAA coverage for every infrastructure vendor, behavioral health carve-outs where they apply. Most vibe-coded prototypes have none of these.

Test integration feasibility against reality, not documentation. API documentation describes what the integration should do. Sandbox behavior shows what it actually does. Rate limits, auth constraints, undocumented edge cases, partial outages, and inconsistent error responses are the rule, not the exception. EHR integrations (Epic, Cerner, athenahealth) carry vendor-specific quirks and certification timelines that don’t appear in the developer docs. Confirm the integration works against the real sandbox before scoping the build.

Map the failure states. Every workflow in the prototype completes. Production has to handle the cases where one doesn’t. What happens when the EHR is down, when the patient’s session expires mid-form, or when a network drop interrupts a write? The failure states need to be designed before they need to be handled.

Clarify identity and authentication. Who logs in, how identity is verified, how account recovery works, whether multi-factor is required, how the application distinguishes patient from clinician from administrator. Healthcare adds proxy access (guardians, family members), behavioral health confidentiality rules, and minor consent considerations that vary by state.

Reduce scope to a real MVP. The prototype shows the full product vision. The MVP is a subset of it. The point of the first launch is to learn, not to finish everything. The features that get cut from MVP are usually the ones the prototype demonstrated most beautifully, which is what makes the cut hard.

Align budget, ownership, and approvals. Who pays for what, who signs off on what, who owns the product after launch. Procurement, security review, compliance review, legal review, and partner certifications each carry their own calendar time. Naming the sequence early compresses the timeline.

Map the organizational path to launch. Internal security review, internal compliance review, vendor procurement, App Store and Google Play submissions, partner certifications. Each of these has its own queue. The build can be done and the launch can still be six weeks out if the organizational path wasn’t mapped at the start.
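The failure-state item in the checklist above can be made concrete by naming each outcome of a write instead of assuming success. The sketch below is a minimal illustration: the exception classes, the `SaveOutcome` values, and the `write` callable are hypothetical stand-ins for whatever the production service layer defines.

```python
from enum import Enum
from typing import Callable


class SaveOutcome(Enum):
    SAVED = "saved"
    SESSION_EXPIRED = "session_expired"          # re-authenticate, keep the draft
    EHR_UNAVAILABLE = "ehr_unavailable"          # queue for retry, tell the user
    NETWORK_INTERRUPTED = "network_interrupted"  # unknown whether the write landed


class SessionExpired(Exception):
    """Hypothetical: raised when the user's session is no longer valid."""


class EhrUnavailable(Exception):
    """Hypothetical: raised when the EHR cannot be reached or rejects calls."""


def save_form(write: Callable[[dict], None], form: dict) -> SaveOutcome:
    """Attempt a write and map each designed failure state to an outcome."""
    try:
        write(form)
    except SessionExpired:
        return SaveOutcome.SESSION_EXPIRED
    except EhrUnavailable:
        return SaveOutcome.EHR_UNAVAILABLE
    except (ConnectionError, TimeoutError):
        return SaveOutcome.NETWORK_INTERRUPTED
    return SaveOutcome.SAVED
```

Each outcome then needs its own screen and its own operator signal, which is the design work a prototype that always completes never had to do.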

| What AI delivers by default | What production actually needs | Where the gap shows up |
|---|---|---|
| Twenty-six screens with realistic-looking data | A service layer with API contracts and data persistence | Scope discovery, week two |
| An onboarding flow with a six-digit code | Identity verification, session management, MFA, account recovery | Security review |
| A booking screen that shows a confirmation message | An appointment record persisted to a database, written back to an EHR via FHIR, with conflict handling | EHR integration kickoff |
| Hardcoded patient names and demo data | PHI handling, encryption at rest and in transit, audit logging, role-based access | HIPAA compliance review |
| A multi-screen flow that always completes | Failure handling for network drops, API errors, validation failures, partial states | QA, first real user test |
| A clean visual representation of an architecture | The architecture itself, built and tested under real load | First production incident |
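To give a sense of what the booking row involves beyond a confirmation message, here is a rough sketch of the conflict check and the payload. It assumes the FHIR R4 Appointment resource shape (`status`, `start`, `end`, `participant`) and should be verified against the specification and the target EHR’s sandbox. The function names and references are hypothetical, and the database write and the call to the EHR are omitted.

```python
from datetime import datetime, timezone


def overlaps(start_a: datetime, end_a: datetime,
             start_b: datetime, end_b: datetime) -> bool:
    """Half-open interval overlap: back-to-back slots do not conflict."""
    return start_a < end_b and start_b < end_a


def build_appointment(patient_ref: str, practitioner_ref: str,
                      start: datetime, end: datetime) -> dict:
    """Build a minimal Appointment payload (assumed FHIR R4 shape)."""
    return {
        "resourceType": "Appointment",
        "status": "booked",
        "start": start.isoformat(),
        "end": end.isoformat(),
        "participant": [
            {"actor": {"reference": patient_ref}, "status": "accepted"},
            {"actor": {"reference": practitioner_ref}, "status": "accepted"},
        ],
    }


def book(existing: list[dict], patient_ref: str, practitioner_ref: str,
         start: datetime, end: datetime) -> dict:
    """Reject a slot that overlaps the practitioner's existing appointments."""
    for appt in existing:
        taken_start = datetime.fromisoformat(appt["start"])
        taken_end = datetime.fromisoformat(appt["end"])
        if overlaps(start, end, taken_start, taken_end):
            raise ValueError("slot conflicts with an existing appointment")
    return build_appointment(patient_ref, practitioner_ref, start, end)


# Example with timezone-aware times and placeholder references.
nine = datetime(2026, 5, 1, 9, 0, tzinfo=timezone.utc)
nine_thirty = datetime(2026, 5, 1, 9, 30, tzinfo=timezone.utc)
first = book([], "Patient/example", "Practitioner/example", nine, nine_thirty)
```

Even this simplified version has a decision the prototype never faced: what the user sees when `book` raises because the slot was taken between page load and submit.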

What this means in 2026

The conversation about AI-generated code has moved in the last twelve months. Andrej Karpathy coined “vibe coding” in February 2025 to describe what happens when developers stop reading the code their AI assistants produce and just trust the output. One year later, in February 2026, he declared the term essentially obsolete and introduced “agentic engineering” as the professional discipline that comes next. The framing has since been picked up across Sequoia’s AI Ascent 2026, the technology press, and most enterprise software conversations about how AI fits into a production workflow.

The distinction Karpathy drew is direct. Vibe coding raises the floor for everyone in terms of what they can do with software. Agentic engineering preserves the quality bar of what existed before in professional software. Two different goals, two different disciplines, and conflating them is how teams end up with either a toy that cannot scale or a workflow slower than it should be because production code is being treated like a weekend hack.

For healthcare buyers, the distinction matters in a specific way. A vibe-coded prototype is fine as a design artifact, which is the first half of this article. A production healthcare application is not something you vibe-code. It is something that needs the agentic engineering discipline: human direction over what gets built, human judgment over architectural decisions, human review of what the AI produced before any of it touches protected health information. The leverage achievable by a developer who deeply understands system architecture and uses agents at scale is dramatically higher than it was a year ago. The leverage achievable by a novice who treats AI as a substitute for that understanding is dramatically lower.

That’s what changes about the first thirty days of a healthcare engagement in 2026. The buyer arrives with a vibe-coded prototype that proves the product vision. The vendor’s job is not to redo the prototype and not to build on it uncritically, but to bring agentic engineering discipline to the production build: structured architecture decisions, human-in-the-loop validation, senior engineering review of what AI tooling produces, and the kind of governance that lets a HIPAA-regulated application ship without the technical debt that the prototype likely accumulated.

Sidebench has run this practice for regulated healthcare clients for fourteen years across more than 60 healthcare implementations. The Cortica AXON build (Sidebench is an investor), the IEHP member platform, the NOCD telehealth platform (Sidebench is an investor), and the CHLA Baby Steps app were all production builds shipped under the discipline Karpathy now calls agentic engineering. Karpathy named it. Sidebench has been doing it. The benefit of the name is that it gives buyers and vendors a shared vocabulary for what the work actually looks like, which makes the conversation about scope, methodology, and timeline easier on both sides.


What changes on Monday

The four diagnostics in Section 3 take about an hour to run. They cost nothing. The output is enough to recalibrate scope expectations before the next vendor conversation, which is the most expensive conversation in the program if it gets anchored to the wrong assumptions.

What changes after that is a question for the buyer and the vendor together. A vibe-coded prototype is a real asset when it is treated as the design artifact it is, and a real liability when it is treated as the engineering artifact it is not. The work of separating the two is the work of the first thirty days of a healthcare engagement in 2026.

If you’d like a senior-led read on your prototype before the next vendor conversation, including the eight-dimension review described above and the structured findings that come with it, here’s how to start the conversation.


Frequently Asked Questions

What is vibe coding?

Vibe coding is the practice of using AI coding assistants (Claude Code, Cursor, Lovable, Bolt) to generate working software based on high-level prompts rather than reading and writing the underlying code. Andrej Karpathy coined the term in February 2025 and declared it obsolete one year later, introducing “agentic engineering” as the professional discipline that comes next.

Can a HIPAA-compliant healthcare app be built from a vibe-coded prototype?

The prototype itself almost never satisfies HIPAA’s technical safeguards under 45 CFR 164.312. But the prototype can serve as the design and workflow specification for a HIPAA-compliant production build. The compliance work (encryption, audit logging, access controls, session management, BAA-covered infrastructure) needs to be built in the production phase by engineers who understand the regulatory burden.

What’s the difference between a vibe-coded prototype and a production application?

A prototype proves intent: the workflow, the UI, the product decisions. A production application proves that something works: data persists, identity is verified, integrations make real calls, errors surface to operators, sessions time out per policy. The two look nearly identical from a demo perspective and almost nothing alike from inside the codebase.

How much of a vibe-coded healthcare prototype usually carries forward to production?

The salvage line almost always falls above the service boundary. UI components, design tokens, navigation structures, and workflow logic can carry forward, sometimes substantially. The service layer, integrations, auth, persistence, compliance scaffolding, and observability rarely do. AI tools are strongest at the layer closest to the user and weakest at the layer closest to the data.

What does the first 30 days of a healthcare build look like when the client arrives with an AI-built prototype?

The first thirty days are a structured read of the prototype rather than a build. The vendor reviews the codebase across data, identity, integration, UX, quality, operability, security, and architecture; identifies which components carry forward and which need rebuilding; maps the integration and compliance work ahead; and produces a scope, budget, and timeline anchored to what is actually built rather than what looks built.

What is agentic engineering?

Agentic engineering is the disciplined use of AI coding agents under human direction, judgment, and review. Karpathy introduced the term in February 2026 as the professional successor to vibe coding. It emphasizes architectural decisions made by senior engineers, human-in-the-loop validation, and code review of what AI tools produce. For regulated industries like healthcare, it is what the production build needs to look like.

How do I know if my AI-built prototype is production-ready?

Run four diagnostics in your AI coding assistant: trace the most important user action and see whether it actually persists data; ask for the folder structure and what’s real versus mocked; pick one workflow and ask what it would take to make it production-grade; ask for an honest audit of how much work remains. If the answers reveal no service layer, no real auth, and no persistence, the prototype is a design artifact, not a production system.

Should I rebuild my vibe-coded prototype from scratch, or build on top of it?

There are three honest paths. Build forward if the code has real architecture, persistence, and auth (rare in healthcare). Salvage above the service boundary, rebuild below, if the frontend is structured well but the backend layers are missing or mocked. Use the prototype as specification and rebuild end to end if the code itself isn’t structured well enough to carry forward.

How much does it cost to take a healthcare prototype to production?

The cost depends on the regulatory burden, the scope of EHR or third-party integrations, the salvage path, and the team’s launch timeline. A vibe-coded prototype usually represents a small fraction of total production work, so cost anchored to “the prototype is mostly done” almost always under-funds the program. A structured codebase review before the vendor scope conversation is the most reliable way to anchor the budget.

Why do some vendors recommend rebuilding a vibe-coded prototype instead of building on it?

The recommendation depends on what’s actually in the codebase. When a prototype has no service layer, no real authentication, no persistence, and no compliance scaffolding, rebuilding the production layers is faster than retrofitting them onto a frontend that wasn’t designed to support them. When the codebase has reasonable structure, the right path is to keep what works and rebuild what doesn’t. The honest answer requires reading the code first.


Considering a pre-vendor audit?

4 to 12 hours of senior engineering and product time. One-page summary, detailed findings, build / salvage / translate recommendation, draft vendor scope. Most of the cost of a misaligned vendor engagement is set before the engagement starts, in assumptions baked into the scope. The audit catches those assumptions while they’re still cheap to correct.

Get in touch →


About the author

Josh Koenig leads product strategy and business development at Sidebench, a Los Angeles digital transformation consultancy and product studio with 60+ healthcare implementations over 14 years. Sidebench has shipped HIPAA-compliant platforms for clients including Cortica, NOCD, IEHP, CHLA, AppliedVR, and Hoag, alongside design and product work for Sony, Microsoft, HP, Oakley, Meta, a16z, Red Bull, NBC Universal, Lightspeed, Cedars-Sinai, and the American Heart Association.

sidebench.com
