
How to Build an AI Skill for Diagnosing Flaky Tests

Asked 2026-05-04 19:07:48 Category: Finance & Crypto

Introduction

If you've spent any time in software development, you've likely encountered flaky tests: tests that pass or fail unpredictably without any change to the code. They undermine trust in your test suite and waste hours of debugging time. But what if you could teach an AI agent to hunt down the root cause systematically? With AI Agent Skills (reusable instruction sets for AI), you can. This guide walks you through creating a skill that lets your AI diagnose flaky tests methodically and repeatably. We'll use a real-world example: a TOCTOU (time-of-check to time-of-use) bug causing duplicate invoice numbers in a Spring Boot webshop. By the end, you'll have a working skill that turns your AI into a flaky test detective.

How to Build an AI Skill for Diagnosing Flaky Tests
Source: blog.jetbrains.com

What You Need

  • An AI Agent Platform that supports Skills (e.g., a custom LLM integration or a tool like LangChain).
  • Access to the source code of the project you want to debug (the example uses a Spring Boot webshop).
  • A flaky test case—ideally one that fails sporadically. For our example, it's InvoiceServiceTest.firstTwoOrdersGetInvoiceNumbersOneAndTwo.
  • Familiarity with Java, Spring Boot, and concurrent programming concepts (or equivalent in your language).
  • Developer tools: a debugger, log analysis tools, and a CI/CD pipeline to reproduce flaky behavior.

Step-by-Step Guide

Step 1: Understand the Nature of Flaky Tests

Before writing any skill, you must grasp what makes a test flaky. Flaky tests often stem from non-deterministic behavior such as race conditions, network timeouts, or resource contention. In our example, the test firstTwoOrdersGetInvoiceNumbersOneAndTwo places two orders concurrently (via CompletableFuture) and expects them to receive unique invoice numbers. The bug is a TOCTOU issue: the invoice service reads the last number and then writes back the incremented value, and if a second thread reads between those two steps, both orders get the same number. Whether the test passes or fails depends on thread scheduling.

Your AI skill needs to recognize such patterns. So, begin by documenting the common causes of flakiness in your environment (e.g., timing dependencies, shared mutable state). This knowledge becomes part of the Skill's context.
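To make the pattern concrete, here is a minimal sketch of the kind of code the example describes. The class names RacyInvoiceService and FlakyTestShape are hypothetical stand-ins, and the real webshop code may look different; only the method name getNextInvoiceNumber and the use of CompletableFuture come from the example.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical stand-in for the invoice service; the real class may differ.
class RacyInvoiceService {
    private long lastNumber = 0;

    // Non-atomic read-modify-write: another thread can run between the read and the write.
    long getNextInvoiceNumber() {
        long last = lastNumber;   // time of check
        long next = last + 1;
        lastNumber = next;        // time of use
        return next;
    }
}

class FlakyTestShape {
    // Mirrors the shape of the flaky test: two orders placed concurrently.
    static long[] placeTwoOrdersConcurrently(RacyInvoiceService service) {
        CompletableFuture<Long> first = CompletableFuture.supplyAsync(service::getNextInvoiceNumber);
        CompletableFuture<Long> second = CompletableFuture.supplyAsync(service::getNextInvoiceNumber);
        return new long[] { first.join(), second.join() };
    }

    public static void main(String[] args) {
        long[] numbers = placeTwoOrdersConcurrently(new RacyInvoiceService());
        // Depending on scheduling, the two numbers are distinct (1 and 2) or both 1.
        System.out.println(numbers[0] + ", " + numbers[1]);
    }
}
```

A test that asserts the two numbers are 1 and 2 passes on most runs and fails only when both threads read the old value before either writes, which is why the failure looks random.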

Step 2: Define the Skill's Purpose and Scope

Decide exactly what your AI Skill will do. For our case: "Given a flaky test report and source code, identify the root cause by analyzing race conditions, shared state, and concurrency patterns." Keep the scope narrow to avoid overwhelming the AI. Write this as a clear one-sentence objective in the Skill document.

Step 3: Structure the Skill Document

An AI Skill is a plain text file with a consistent format. Use these sections:

  1. Title and Description
  2. Input Requirements (e.g., test name, code file paths, logs)
  3. Analysis Steps (the core procedure)
  4. Output Format (e.g., a JSON report with root cause, confidence, reproduction steps)

For our example, the analysis steps should include: check for concurrent execution (like CompletableFuture), inspect shared resources (e.g., invoice number generation), verify atomicity of read-modify-write operations, and suggest fixes.
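If the Output Format section asks for a JSON report, it helps to pin down the fields before writing the Skill. The sketch below is one possible shape, assuming Java 16 or later for records; the record name FlakyTestReport and its field names are illustrative assumptions, not part of any Skill standard.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical report shape; field names are illustrative only.
record FlakyTestReport(String testName, String rootCause, double confidence,
                       List<String> reproductionSteps) {

    // Escapes only backslashes and double quotes; enough for a sketch, not a full JSON encoder.
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    String toJson() {
        String steps = reproductionSteps.stream()
                .map(FlakyTestReport::quote)
                .collect(Collectors.joining(", ", "[", "]"));
        return "{"
                + "\"testName\": " + quote(testName) + ", "
                + "\"rootCause\": " + quote(rootCause) + ", "
                + "\"confidence\": " + confidence + ", "
                + "\"reproductionSteps\": " + steps
                + "}";
    }
}

class ReportDemo {
    public static void main(String[] args) {
        FlakyTestReport report = new FlakyTestReport(
                "InvoiceServiceTest.firstTwoOrdersGetInvoiceNumbersOneAndTwo",
                "Non-atomic read-modify-write in getNextInvoiceNumber()",
                0.9,
                List.of("Place two orders concurrently", "Delay the write in one thread"));
        System.out.println(report.toJson());
    }
}
```

Asking the AI to fill a fixed structure like this makes its diagnoses easier to compare across runs and to check automatically.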

Step 4: Write the Core Diagnosis Logic

This is the heart of the Skill. In bullet points, describe what the AI must look for:

  • Identify concurrent operations: Look for multi-threading constructs (e.g., @Async, CompletableFuture, threads).
  • Spot shared mutable state: Find variables or objects accessed from multiple threads without synchronization.
  • Check atomicity: Does the code check a condition (e.g., getLastNumber()) and then act (e.g., setLastNumber()) in a way that can be interrupted? That's a TOCTOU bug.
  • Reproduce the flakiness: Suggest increasing thread count or adding intentional delays to force failure.

Provide concrete examples from your project. For the invoice service, point to the InvoiceService class where synchronized blocks are missing.
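The "reproduce the flakiness" step can be made fully deterministic. The sketch below is one way to do it, under the assumption that you can inject a test-only hook between the read and the write; HookedInvoiceCounter and ForcedRaceDemo are hypothetical names. A CyclicBarrier stands in for the intentional delay: it holds each thread after its read until the other thread has also read.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical counter with a test-only hook between the read and the write.
class HookedInvoiceCounter {
    // volatile makes writes visible to other threads, but does not make read-then-write atomic.
    private volatile long lastNumber = 0;
    private final Runnable betweenReadAndWrite;

    HookedInvoiceCounter(Runnable betweenReadAndWrite) {
        this.betweenReadAndWrite = betweenReadAndWrite;
    }

    long next() {
        long last = lastNumber;       // check
        betweenReadAndWrite.run();    // widened race window
        lastNumber = last + 1;        // use
        return last + 1;
    }
}

class ForcedRaceDemo {
    // Runs two callers that both read before either writes; returns the numbers handed out.
    static long[] forceDuplicate() {
        CyclicBarrier bothHaveRead = new CyclicBarrier(2);
        HookedInvoiceCounter counter = new HookedInvoiceCounter(() -> {
            try {
                bothHaveRead.await(5, TimeUnit.SECONDS);
            } catch (Exception e) {
                throw new IllegalStateException("threads did not meet at the barrier", e);
            }
        });
        Callable<Long> task = counter::next;
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Long> first = pool.submit(task);
            Future<Long> second = pool.submit(task);
            return new long[] { first.get(), second.get() };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("forced race did not complete", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        long[] numbers = forceDuplicate();
        System.out.println(numbers[0] + ", " + numbers[1]); // prints "1, 1"
    }
}
```

Once a duplicate can be produced on every run, the AI (or a developer) can confirm a suspected TOCTOU bug instead of waiting for the scheduler to expose it.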

Step 5: Integrate Developer Tools

An AI alone isn't enough. Your Skill should instruct the AI to leverage tools like:

  • Static analyzers (e.g., SpotBugs, the successor to the discontinued FindBugs) to flag suspicious synchronization.
  • Log analysis to correlate failures with timestamps.
  • Debugger breakpoints to pause threads at critical sections.

In the Skill, include commands or API calls the AI can execute to run these tools. For instance: "Run mvn spotbugs:check and examine the output for multithreaded-correctness findings such as IS2_INCONSISTENT_SYNC or VO_VOLATILE_INCREMENT."
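If your agent platform lets the AI call helper code instead of raw shell commands, a small wrapper can run the analyzer and keep only the concurrency-related findings. The sketch below assumes Maven is on the PATH and the SpotBugs Maven plugin is configured in the project. The class name SpotBugsConcurrencyScan is hypothetical, the pattern list is a small sample from SpotBugs' multithreaded-correctness category, and the exact format of the plugin's output lines is not assumed beyond "the pattern name appears somewhere in the line".

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.stream.Collectors;

class SpotBugsConcurrencyScan {
    // A few SpotBugs multithreaded-correctness pattern names; extend as needed.
    static final List<String> PATTERNS = List.of(
            "IS2_INCONSISTENT_SYNC",
            "VO_VOLATILE_INCREMENT",
            "AT_OPERATION_SEQUENCE_ON_CONCURRENT_ABSTRACTION");

    // Keeps only the output lines that mention one of the patterns above.
    static List<String> concurrencyFindings(List<String> outputLines) {
        return outputLines.stream()
                .filter(line -> PATTERNS.stream().anyMatch(line::contains))
                .collect(Collectors.toList());
    }

    // Runs "mvn spotbugs:check" in the given project and returns its combined stdout/stderr.
    static List<String> runSpotBugs(File projectDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder("mvn", "spotbugs:check")
                .directory(projectDir)
                .redirectErrorStream(true)
                .start();
        List<String> lines;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            lines = reader.lines().collect(Collectors.toList());
        }
        process.waitFor();
        return lines;
    }
}
```

A call such as concurrencyFindings(runSpotBugs(new File("."))) then gives the AI a short list to reason about instead of the full build log.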

Step 6: Test the Skill on the Example Project

Load the webshop demo project from the original article. Feed the flaky test report to your AI with the Skill activated. The AI should:

  • Recognize the two concurrent CompletableFuture calls.
  • Trace the checkout method to InvoiceService.getNextInvoiceNumber().
  • Identify the non-atomic read-modify-write.
  • Recommend using synchronized or an atomic counter such as AtomicLong.

Iterate until the AI consistently produces accurate diagnoses.

Step 7: Refine and Expand the Skill

After initial success, add more root causes (e.g., network flakiness, database contention). Update the Skill document with new patterns. Also include remediation steps for each cause, so the AI can suggest fixes. For our example, the fix is to make getNextInvoiceNumber atomic via synchronization or AtomicLong.
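As a reference for the remediation section, here is a minimal sketch of the AtomicLong variant of the fix, together with a quick concurrency check. It assumes the counter lives in memory inside a single application instance, which the example does not confirm; AtomicInvoiceService and AtomicFixCheck are hypothetical names.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical fixed service: the read-modify-write is a single atomic operation.
class AtomicInvoiceService {
    private final AtomicLong lastNumber = new AtomicLong(0);

    long getNextInvoiceNumber() {
        return lastNumber.incrementAndGet();
    }
}

class AtomicFixCheck {
    // Issues threads * callsPerThread numbers concurrently and returns the distinct values seen.
    static Set<Long> issueConcurrently(int threads, int callsPerThread) {
        AtomicInvoiceService service = new AtomicInvoiceService();
        Set<Long> seen = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                for (int i = 0; i < callsPerThread; i++) {
                    seen.add(service.getNextInvoiceNumber());
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // With 8 threads making 1000 calls each, every number is unique.
        System.out.println(issueConcurrently(8, 1000).size()); // prints 8000
    }
}
```

Marking the original method synchronized gives the same guarantee; AtomicLong simply avoids holding a lock for a single increment.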

Tips for Success

  • Start Small: Focus on one type of flakiness (like concurrency) before generalizing.
  • Use Clear Language: Avoid ambiguous terms; define technical jargon in the Skill.
  • Include Negative Examples: Show cases where a test is not flaky to sharpen the AI's detection.
  • Version Control Your Skill: Treat the Skill document like code—track changes and review updates.
  • Combine with Human Review: Let developers validate the AI's findings before acting on them.
  • Monitor Performance: Keep a log of accuracy and false positives, and adjust the Skill accordingly.

By following these steps, you'll transform your AI agent into a reliable debugger for flaky tests, saving your team time and frustration. Ready to give it a try? Start with the first step and build your Skill today.