
AI Development Flow Integration

AI · Agentic AI · Multi-agent Systems · LLM · CI/CD · n8n · Automation · GPT-4

AI isn't just a feature you ship to users — it's a force multiplier for engineering teams. This case study explores how we integrated AI agents directly into development workflows, from pre-human code review gates to hybrid QA pipelines, with a focus on the context engineering that makes AI outputs actually useful rather than noisy.

AI as a Pre-Human Review Gate

The most impactful use of LLMs in a development workflow isn't replacing human code reviewers — it's filtering the noise before human reviewers even see a pull request. Senior engineers are expensive, and most of their review time is spent on mechanical checks: style violations, missing null checks, inconsistent naming, obvious security anti-patterns. These are exactly the tasks where LLMs excel.

We positioned a GPT-4-class model to run automatically on every pull request before it enters the human review queue. The system is triggered by a CI pipeline step that extracts the diff, gathers relevant context, and sends structured prompts to the model via API. The results are posted as inline PR comments — exactly where a human reviewer would leave feedback.
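The shape of that CI step can be sketched as two pure functions: one that assembles the structured prompt payload, and one that maps the model's findings onto inline PR comment objects. This is a minimal illustration, not the production pipeline — the JSON findings schema and function names are assumptions, and the comment dict only roughly follows the fields GitHub's pull-request review-comments API expects.

```python
def build_review_request(diff: str, ticket: str, model: str = "gpt-4") -> dict:
    """Assemble the chat-completions payload sent for each pull request.

    The system prompt pins the reviewer role and output format; the diff
    and ticket description go in the user message so the model sees the
    intent alongside the change.
    """
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": (
                    "You are a code reviewer. Return findings as JSON: "
                    '[{"file": "...", "line": 1, "comment": "..."}]'
                ),
            },
            {"role": "user", "content": f"Ticket:\n{ticket}\n\nDiff:\n{diff}"},
        ],
        "temperature": 0,  # reviews should be reproducible run-to-run
    }


def to_inline_comments(findings: list[dict], commit_sha: str) -> list[dict]:
    """Map parsed model findings onto inline PR comment objects."""
    return [
        {
            "path": f["file"],
            "line": f["line"],
            "commit_id": commit_sha,
            "body": f["comment"],
        }
        for f in findings
    ]
```

In practice the CI step would POST `build_review_request(...)` to the model API, parse the JSON reply, and push each dict from `to_inline_comments(...)` to the PR.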

Rather than a single monolithic review prompt, we built a multi-agent architecture where each agent specializes in one concern. The security agent checks for SQL injection patterns, hardcoded credentials, and insecure deserialization. The performance agent flags N+1 query patterns, missing indexes, and unbounded loops. The architecture agent checks for proper separation of concerns and adherence to project-specific patterns. The test coverage agent identifies untested code paths and suggests test cases.

The key insight is that each agent gets a different context window optimized for its concern. The security agent receives the full diff plus any database migration files. The architecture agent receives the diff plus the project's architecture decision records. This specialization produces dramatically better results than a single 'review everything' prompt, and it scales — adding a new review dimension is just adding a new agent with its own prompt and context template.
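One way to express that specialization is an agent registry where each entry pairs a system prompt with its own context selector. The sketch below is illustrative — the agent names and prompts are condensed from the description above, and the `gather_context` callables stand in for whatever artifact-fetching the real pipeline does.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ReviewAgent:
    name: str
    system_prompt: str
    # Selects which gathered PR artifacts this agent gets to see.
    gather_context: Callable[[dict], str]

AGENTS = [
    ReviewAgent(
        name="security",
        system_prompt="Flag SQL injection, hardcoded credentials, insecure deserialization.",
        # Security agent: full diff plus any database migration files.
        gather_context=lambda pr: "\n".join([pr["diff"], *pr.get("migrations", [])]),
    ),
    ReviewAgent(
        name="architecture",
        system_prompt="Check separation of concerns against the ADRs provided.",
        # Architecture agent: diff plus the project's architecture decision records.
        gather_context=lambda pr: "\n".join([pr["diff"], *pr.get("adrs", [])]),
    ),
]

def build_prompts(pr: dict) -> dict[str, list[dict]]:
    """One specialized prompt per agent; a new review dimension is just a new agent."""
    return {
        agent.name: [
            {"role": "system", "content": agent.system_prompt},
            {"role": "user", "content": agent.gather_context(pr)},
        ]
        for agent in AGENTS
    }
```

Each agent's messages would then be sent as an independent model call, so a noisy answer on one dimension never pollutes the others.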

Hybrid QA Pipelines

Early experiments with fully autonomous QA agents — 'here's the code change, figure out what to test' — produced wildly inconsistent results. Sometimes the agent would generate brilliant edge-case tests we hadn't considered. Other times it would produce syntactically valid but semantically meaningless tests that tested implementation details rather than behavior. The variability made it impossible to rely on as a consistent quality gate.

The solution was a hybrid architecture that combines deterministic and AI-driven components. The deterministic layer is a traditional CI pipeline: run the existing test suite, check code coverage thresholds, validate linting rules, and execute integration tests against a staging environment. This layer produces reliable, reproducible results on every run.

The AI layer activates after the deterministic layer completes. An LLM agent analyzes test failures (if any), correlating the failure output with the code changes to generate regression hypotheses — specific explanations of how the code change likely caused the failure. For new code that lacks test coverage, the agent drafts test cases covering happy paths, error cases, and boundary conditions.

Critically, AI-generated tests go into a 'proposed tests' queue for human approval, not directly into the test suite. A developer reviews each proposed test, approves or modifies it, and only then does it become part of the deterministic layer. This keeps humans in the loop on test strategy — deciding what's worth testing and how — while automating the grunt work of actually writing the test boilerplate.

Test execution priority is determined by a risk score calculated from the code change analysis: files with high cyclomatic complexity, recent bug history, or proximity to security-critical code paths get their tests run first. This means the most important feedback arrives fastest, even in large test suites that take 20+ minutes to run completely.
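A risk scorer in this spirit might look like the following — the weights are illustrative assumptions, not the tuned values, and the input fields stand in for whatever the real change analysis extracts:

```python
def risk_score(complexity: int, bugs_last_90d: int, touches_security_path: bool) -> float:
    """Heuristic test-ordering priority. Weights are illustrative:
    recent bug history dominates, security proximity adds a flat boost."""
    score = 0.1 * complexity + 0.5 * bugs_last_90d
    if touches_security_path:
        score += 2.0
    return score

def prioritize(changes: list[dict]) -> list[str]:
    """Order changed files so the highest-risk tests run first."""
    ranked = sorted(
        changes,
        key=lambda c: risk_score(c["complexity"], c["bugs"], c["security"]),
        reverse=True,
    )
    return [c["file"] for c in ranked]
```

With an ordering like this, a 20-minute suite can surface its most important failures in the first minutes of the run.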

Context Engineering Over Model Upgrades

After months of iteration, one principle became clear: context injection quality determines output quality far more than model choice. A well-contextualized prompt sent to GPT-4 produces better code reviews than a poorly contextualized prompt sent to a hypothetical GPT-5. The model is the engine, but the context is the fuel.

Our context engineering focuses on three dimensions: relevance (only include information the model actually needs), recency (recent commits and decisions matter more than old ones), and specificity (project-specific patterns and conventions, not generic best practices).

For code review, the context window includes: the PR diff (obviously), but also the linked Jira/Linear ticket description (so the model understands intent), the 5 most recent commits to the same files (so it understands the evolution), and a project-specific coding standards document maintained by the team. This standards document is the highest-leverage investment — it teaches the model to flag patterns specific to your codebase that no generic model would catch.
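Assembling that context window is mechanically simple — the leverage is in what gets included. A minimal sketch (section headings and the commit cap are illustrative assumptions):

```python
def assemble_review_context(
    diff: str,
    ticket: str,
    recent_commits: list[str],
    standards: str,
    max_commits: int = 5,
) -> str:
    """Build one prompt context from the four sources: house standards
    (specificity), ticket intent, the last N commits to the same files
    (recency), and the diff itself (relevance)."""
    sections = [
        ("Coding standards", standards),
        ("Ticket", ticket),
        ("Recent commits", "\n".join(recent_commits[:max_commits])),
        ("Diff", diff),
    ]
    # Skip empty sections so a missing ticket doesn't inject a blank heading.
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections if body)
```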

The context pipeline is orchestrated via n8n workflows that pull data from multiple sources (GitHub API, Jira API, internal documentation), assemble structured prompts, and manage the API calls to the LLM. n8n's visual workflow builder made it accessible to the entire team — anyone can modify the review pipeline without writing code, which means the AI review system evolves with the team's practices rather than being frozen at whatever the original developer configured.

We measured the impact of context quality experimentally: the same set of 50 PRs was reviewed with minimal context (raw diff only), moderate context (diff + ticket description), and full context (diff + ticket + recent commits + standards doc). Full context produced 3.2x more actionable findings and 70% fewer false positives compared to minimal context. This data justified the engineering investment in the context pipeline.

Autonomous Triage & Routing

Support tickets and bug reports represent another high-volume, repetitive workflow where AI provides leverage. Incoming reports are often poorly structured — a customer describes symptoms without reproduction steps, or a monitoring alert fires with a stack trace but no business context. A human triaging these spends most of their time gathering context rather than diagnosing the issue.

We built an LLM pipeline that processes incoming tickets through three stages. First, classification: the model categorizes the issue by type (bug, feature request, question, infrastructure), severity (based on business impact signals like 'checkout is broken' vs 'logo looks wrong'), and affected system component (catalog, checkout, payments, infrastructure).

Second, enrichment: the model extracts or infers reproduction steps from the report text, queries the error logging system for matching stack traces within the relevant time window, and links related open issues from the backlog. If the report mentions a specific order number or customer, the model pulls relevant transaction logs. This enriched context is attached to the ticket before any human sees it.

Third, routing: based on classification and enrichment, the ticket is automatically assigned to the appropriate team and prioritized within their queue. P1 issues (checkout broken, payments failing) trigger immediate Slack notifications and PagerDuty escalations. P2 issues get queued for the next business day. P3 issues go into the backlog with suggested investigation notes.
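The routing stage is the deterministic tail of the pipeline: classification and enrichment need the model, but mapping a classification onto queues and alerts is a lookup. A sketch of that final step, with the severity-to-channel table condensed from the description above (the exact channel names are assumptions):

```python
# Severity -> notification channels and queue, per the escalation policy.
SEVERITY_ROUTES = {
    "P1": {"notify": ["slack", "pagerduty"], "queue": "immediate"},
    "P2": {"notify": ["slack"], "queue": "next-business-day"},
    "P3": {"notify": [], "queue": "backlog"},
}

def route(classification: dict) -> dict:
    """Stage 3: turn the model's classification into an assignment.

    `classification` is the parsed output of stages 1-2, e.g.
    {"severity": "P1", "component": "payments"}.
    """
    routes = SEVERITY_ROUTES[classification["severity"]]
    return {
        "team": classification["component"],  # component maps 1:1 to owning team
        "notify": routes["notify"],
        "queue": routes["queue"],
    }
```

Keeping this stage as plain code (rather than another model call) makes the escalation behavior auditable: a P1 always pages, regardless of how the model phrased its reasoning.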

The impact was significant: first-response time dropped by 70% because the enrichment step eliminated the back-and-forth of 'can you provide reproduction steps?' and 'which browser were you using?'. Engineers receiving tickets now get a pre-investigated issue with context, logs, and related issues already gathered — they can start diagnosing immediately instead of spending 30 minutes gathering the information the AI already assembled.

Results

  • 70% reduction in first-response time for support tickets
  • 40% fewer production bugs reaching users
  • 2x faster PR review cycles with higher-quality feedback
  • 3.2x more actionable findings with full context engineering
  • 70% fewer false positives compared to raw-diff-only review
  • Automated test coverage increased by 35%
  • Senior engineer review time refocused on architecture decisions

Want to discuss a similar challenge? Get in touch →