
Vibe Coding Ships Fast — Here's What It Ships With

In peer-reviewed research, roughly 40% of AI-generated code written for security-relevant scenarios contained exploitable vulnerabilities. This post examines the specific failure modes, why they survive review, and what a minimal check layer looks like.

Tags: AI Code Generation, Security, Code Review

Key Takeaways

  • NYU researchers found ~40% of Copilot output in security-relevant scenarios contained vulnerabilities (CWE-79, CWE-89, CWE-798, CWE-22).
  • A 2024 Springer study across multiple LLMs found 62% of generated programs were vulnerable.
  • The failures are predictable: auth shortcuts, missing input validation, hardcoded secrets, path traversal. Catching them doesn't require novel tooling — just consistent application.

The research is in, and it's not ambiguous

In a study first released in 2021 and presented at IEEE S&P 2022, researchers at NYU evaluated GitHub Copilot across 89 security-relevant code generation scenarios. Approximately 40% of the generated programs contained exploitable vulnerabilities. The most common categories were cross-site scripting (CWE-79), SQL injection (CWE-89), hardcoded credentials (CWE-798), and path traversal (CWE-22).

A larger 2024 study published in Empirical Software Engineering (Springer) tested multiple LLMs and found that at least 62% of generated programs were vulnerable. A separate formal verification study using Z3 SMT solvers reported a 55.8% vulnerability rate by default across 3,500 code artifacts.

These aren't edge cases found by adversarial prompting. They're the output of normal development prompts — 'write a login handler,' 'create a file upload endpoint,' 'add JWT authentication.' The models produce code that works functionally but fails silently on security properties.

Why these bugs survive human review

The dangerous thing about AI-generated vulnerabilities is that they look like ordinary code. A JWT helper that calls decode without signature verification. A SQL query built with string concatenation inside an ORM wrapper. A file path constructed from user input without canonicalization.
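To make the SQL case concrete, here is a minimal sketch using Python's standard-library sqlite3 module. The users table, its columns, and all three function names are hypothetical and exist only for illustration. The two lookups return the same result on ordinary input and diverge only when the input is hostile, which is why the unsafe version looks routine in a diff.

```python
import sqlite3


def find_user_unsafe(conn: sqlite3.Connection, username: str) -> list:
    # Vulnerable: user input is spliced into the SQL text (CWE-89).
    query = "SELECT id, username FROM users WHERE username = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn: sqlite3.Connection, username: str) -> list:
    # Parameterized: the value is passed separately from the SQL text.
    return conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    ).fetchall()


def make_demo_db() -> sqlite3.Connection:
    # Hypothetical in-memory table used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn
```

With the input x' OR '1'='1, the concatenated query matches every row in the table, while the parameterized query treats the whole string as a literal username and matches nothing.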

Each of these passes the 'does it work?' test. The demo runs. The tests are green. The reviewer sees a small diff and pattern-matches it as routine. This is the core problem: AI coding tools shift the bottleneck from writing to judgment, but review processes are still calibrated for human-speed output.

When a developer produces 3x more code per day, the reviewer doesn't get 3x more attention to spend. Review capacity stays flat while volume triples, so each line gets roughly a third of the scrutiny it used to.

The specific patterns that keep recurring

Across the research and public CVE databases, the same vulnerability classes appear repeatedly, and they are the classes AI-generated code tends to reproduce:

  • JWT signature bypass: AI models frequently generate jwt.decode() without algorithm restriction or signature verification. The same class of mistake has produced CVEs in hand-written libraries: CVE-2024-54150 (cjwt algorithm confusion), CVE-2023-51774 (Ruby JSON::JWT sign/encryption confusion), CVE-2024-37568 (Authlib HMAC/RSA confusion).
  • Missing input validation: path traversal via unsanitized user input in file operations, SQL injection through string formatting even when ORMs are available, XSS through unescaped template interpolation.
  • Hardcoded secrets: API keys, database credentials, and signing keys placed directly in source files rather than environment variables or secret managers.
  • Overly permissive defaults: CORS set to '*', debug modes left enabled, admin routes without authentication middleware.
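As one illustration of the path traversal item in the list above, the following sketch canonicalizes a user-supplied path and refuses anything that lands outside a base directory. The function name resolve_inside and the directory names in the usage note are assumptions made for this example, not part of any framework.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Return the canonical path for user_path, or raise if it escapes base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    # An absolute user_path replaces base entirely, so it fails the check too.
    candidate = (base / user_path).resolve()
    if candidate != base and base not in candidate.parents:
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Called as resolve_inside("uploads", "reports/q1.txt") it returns a path under uploads; called with "../secret.txt" or "/etc/passwd" it raises ValueError. The check runs on the resolved path on purpose: comparing raw strings before canonicalization is the mistake this pattern is meant to avoid.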

What a minimal check layer actually looks like

You don't need a security team or a month-long initiative. You need automated checks that run on every pull request and flag the specific patterns above. The bar is low because the bugs are predictable.

A useful automated reviewer should: flag JWT decode calls without algorithm restriction, reject hardcoded credential patterns, warn on path construction from user input without sanitization, and catch CORS/auth middleware gaps on sensitive routes.

The key insight from the research is that these aren't subtle logic bugs requiring deep context. They're mechanical patterns. A static check catches them. The problem isn't detection difficulty — it's that most teams don't run the check at all.

The uncomfortable math

If 40-62% of AI-generated code in security-relevant scenarios is vulnerable, and your team is generating 50+ pull requests per week with AI assistance, then even a small share of security-relevant pull requests adds up. At 200 or more pull requests a month, the expected number of vulnerabilities that slip past review and into your codebase is not a rounding error.
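A back-of-the-envelope version of that calculation is sketched below. Only the 50 pull requests per week and the 40-62% range come from the figures above. The 10% share of security-relevant pull requests and the 50% review catch rate are assumptions chosen purely for illustration, and applying a per-program benchmark rate to whole pull requests is itself a simplification.

```python
def expected_escapes(prs_per_week, security_share, vuln_rate, review_catch_rate, weeks=4):
    """Expected vulnerable PRs merged per month under the stated assumptions."""
    # PRs per month that touch security-relevant code (security_share is assumed).
    security_prs = prs_per_week * weeks * security_share
    # Of those, the fraction that is vulnerable and not caught in review.
    return security_prs * vuln_rate * (1 - review_catch_rate)


# Assumed inputs: 10% of PRs are security-relevant, review catches half.
low = expected_escapes(50, 0.10, 0.40, 0.50)
high = expected_escapes(50, 0.10, 0.62, 0.50)
print(f"{low:.1f} to {high:.1f} vulnerable PRs merged per month")
# prints "4.0 to 6.2 vulnerable PRs merged per month"
```

Changing the assumed inputs moves the result, but it takes fairly optimistic values to bring it below one per month.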

The teams that will avoid incidents aren't the ones that reject AI coding tools. Those tools are too productive to abandon. The teams that avoid incidents are the ones that put a mechanical check between generation and merge. Not a heavy process. Just a consistent one.