Author: Ashok Raja Michael Reymond Raj, Data Architect at Hashagile Technologies

Making SQL Test Automation Work: The Hard Truth  

Writing and validating SQL functions is a common task in modern data platforms – but it’s often repetitive, time-consuming, and prone to errors. Creating test cases for complex business logic can be especially tricky, leaving gaps in documentation and coverage. 

We set out to automate this process using LLMs such as the GPT-5 and GPT-4 model families. However, early attempts with simple free-text prompts often produced inconsistent, incomplete, or hallucinated results. These challenges highlighted the need for a more structured approach to prompt engineering.  

 

Evolution from Unstructured to Structured Prompting 
Phase 1: Unstructured Approach (Baseline) 

In the beginning, we relied on simple, conversational prompts such as:  
“Write comprehensive test cases for this SQL function: [function code] …” 
While straightforward, this method showed several shortcomings once we ran more than 10 iterations across different models: 

  • Test Case Coverage & Completeness 
    • GPT-5, GPT-5-mini, GPT-5-nano, and o4-mini achieved an average coverage of 60%. 
    • GPT-4.1-mini and GPT-4o-mini reached only 40%, leaving significant gaps in test completeness. 
  • Code Quality & Validity 
    • Around 50% of outputs contained syntax errors or incomplete SQL logic. 
  • Consistency 
    • Test structures varied widely across runs, with no standardized format for validation. 
  • Error Handling 
    • Negative and edge-case scenarios were rarely included, reducing overall robustness. 
  • Manual Effort 
    • Nearly 80% of outputs required significant human revisions, limiting automation benefits. 
  • Cost & Time Efficiency 
    • Multiple iterations were needed to achieve usable results, increasing both generation time and token cost. 

This baseline highlighted the limitations of unstructured prompting and set the stage for exploring structured frameworks (TREE, TCR, RACE, CRISPE). 

Phase 2: Structured Prompt Engineering 

We explored structured prompt engineering frameworks that provide clearer guidance than unstructured prompting: 

  1. TREE (Task → Requirement → Example → Expectation) 
  2. TCR (Task → Context → Reference) 
  3. RACE (Role → Action → Context → Example) 
  4. CRISPE (Capacity → Role → Insights → Statements → Personality → Experiment) 

Each framework was developed to address specific challenges seen in unstructured prompting, helping to create more consistent and effective outputs. 

Success Criteria  

“Success” was defined as the ability to generate SQL test suites that execute correctly without any syntax errors, provide comprehensive coverage of more than 85% of identified test cases, include appropriate edge cases and robust error handling, and achieve all of this with minimal or no manual intervention while remaining cost-efficient. 
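
As a rough illustration of how the first two criteria can be checked mechanically, the sketch below tallies PASS results inside a suite and flags a run that falls under the 85% threshold. It is not taken from our generated suites, and the counter values are hypothetical placeholders.

SQL
-- Hypothetical sketch: tally PASS results against identified test cases and
-- enforce the 85% coverage threshold. The counts are placeholder values.
DO $$
DECLARE
    total_tests  INTEGER := 20;  -- identified test cases for the function
    passed_tests INTEGER := 18;  -- tests that reported PASS
    coverage     NUMERIC;
BEGIN
    coverage := round(100.0 * passed_tests / total_tests, 1);
    RAISE NOTICE 'Coverage: % %% of identified test cases', coverage;
    IF coverage < 85 THEN
        RAISE EXCEPTION 'Suite below the 85 %% success threshold (% %%)', coverage;
    END IF;
END $$;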

Structured vs. Unstructured – Success Rate Heatmap  
Model Selection Criteria 

For SQL Test Generation, we evaluated models against the following dimensions: 

  • Test Case Coverage & Completeness – % of identified cases generated, including edge cases. 
  • Accuracy & Validity – Syntactically correct and executable SQL outputs. 
  • Cost Efficiency – Average cost per test suite, considering token usage. 
  • Generation Time – Time required to produce a complete test suite. 
  • Consistency – Ability to reproduce results reliably across multiple runs. 
  • Manual Effort – Degree of human intervention required for corrections. 
GPT-5 
  • Strongest in Coverage, Code Quality, and Accuracy. 
  • Weaker in Cost and Time Efficiency → higher price and longer runtimes. 
  • Best suited for complex, high-stakes SQL workloads. 

GPT-5-mini 
  • Well-balanced across all metrics: high Coverage and Accuracy, reasonable Cost and Time. 
  • A strong all-rounder for enterprise teams needing balance between quality and efficiency. 

GPT-5-nano 
  • Excels in Cost Efficiency and Time Efficiency, while still maintaining solid Coverage and Accuracy. 
  • The most practical and scalable model for routine SQL automation where budget and speed matter most. 

GPT-4.1 
  • Delivers solid Coverage and Accuracy, but weaker on Cost and Time. 
  • A reliable fallback option if GPT-5 models are unavailable. 

GPT-4.1-mini 
  • Noticeably lower in Coverage, Accuracy, and Code Quality. 
  • Requires trade-offs → not suitable for enterprise use. 

GPT-4o-mini 
  • Weak across most dimensions, with only moderate Cost/Time performance. 
  • Not recommended for production-grade SQL test generation. 

o4-mini 
  • Strong in Time and Cost Efficiency, with decent Coverage and Accuracy. 
  • A proven cost-performance leader before GPT-5-nano, still competitive today. 
     
Final Recommendation 

Based on performance across Coverage, Code Quality, Accuracy, Cost Efficiency, and Time Efficiency: 

1. GPT-5-nano → Best overall model for practicality, speed, and low cost. 
2. o4-mini → Reliable cost-performance option, strong efficiency. 
3. GPT-5-mini → Best balanced model, combining accuracy with manageable cost/time. 

Comprehensive Performance Analysis 
Why This Matters 

In data engineering and enterprise systems, generating test cases for SQL functions is often overlooked and prone to errors. With the growing adoption of AI-based tooling, we set out to explore the following key questions: 

  • Can OpenAI models consistently produce accurate SQL test suites?
  • Which model–prompt combination offers the best balance between cost, performance, and autonomy? 

Key Statistics 

Testing conducted: August 2025 using the OpenAI API 

Key Performance Insights  
GPT-5  
  • CRISPE delivered the highest results (90% coverage, 90% accuracy), meeting enterprise-grade standards. 
  • RACE achieved 85% coverage but required the longest runtime (12 minutes). 
  • TCR balanced 80% coverage and 85% accuracy with moderate time (5.3 min) and cost. 
  • TREE lagged at 70% coverage, below the success threshold. 

GPT-5-mini  
  • Performed strongly in CRISPE (88% coverage, 87% accuracy) and TCR (85% coverage, 86% accuracy).
  • TREE provided moderate success (80% coverage). 
  • RACE was weaker (72% coverage, 74% accuracy), though still usable. 

GPT-5-nano  
  • Delivered high consistency at very low cost (as low as $0.01). 
  • CRISPE excelled (90% coverage, 90% accuracy) in just 1.2 minutes.
  • TCR also strong (88% coverage, 88% accuracy). 
  • TREE moderate (82% coverage), while RACE was weaker (75% coverage, 76% accuracy). 

GPT-4.1  
  • TCR (88%) and CRISPE (87%) achieved near-enterprise level results. 
  • TREE also solid (82% coverage). 
  • RACE reached 84% coverage but required high manual effort, limiting practicality. 

GPT-4.1-mini  
  • Inconsistent: CRISPE (85% coverage) performed well, but TREE (78%), TCR (70%), and RACE (65%) fell short. 
  • Both TCR and RACE required high manual effort, making them unreliable. 

GPT-4o-mini  
  • Struggled across all frameworks. 
  • TREE (70%), TCR (60%), RACE (55%), CRISPE (60%).
  • Manual effort was high to medium in all cases → not suitable for SQL automation.

o4-mini  
  • Consistently strong, especially with CRISPE (90% coverage, 90% accuracy).
  • TCR (85%) also met the enterprise threshold. 
  • RACE (75%) and TREE (70%) were moderate but stable. 
  • Low cost and short runtime (1–3 minutes) made it a cost-performance leader before GPT-5-nano. 
Structured Prompt Engineering Frameworks Evaluated 

CRISPE Framework 

Why It Excelled 
  • Coverage & Accuracy: Highest scores across models (88–90%), with minimal manual effort. 
  • Modularity: Stepwise structure (Capacity, Role, Insights, Statements, Personality, Experiment) ensures consistency. 
  • Robustness: Handles complex functions and edge cases better than other frameworks. 
  • Enterprise-Readiness: Produced the most reliable test suites with GPT-5-nano, GPT-5-mini, and GPT-5. 

CRISPE Framework Implementation 

CAPACITY 
  • You are an advanced AI capable of: 
    • Parsing PostgreSQL functions, table DDLs, and statistical inputs 
    • Generating a complete, standalone, reusable .sql test suite 
    • Validating the script using psql -U <username> -f <function_name>.sql 
    • Debugging and regenerating the test script until it executes successfully and all tests pass 
    • Creating any required schema, tables, and test data if not already available — all within a single BEGIN; … ROLLBACK; block 

ROLE  
  • You are acting as a PostgreSQL Testing Expert. 
  • Your task is to generate a fully self-contained test suite (function_name.sql) for a given PostgreSQL function. 
  • This includes: 
    • Creating all required schemas and tables 
    • Defining the function 
    • Generating multiple test cases 
    • Reporting PASS/FAIL results 
    • Cleaning up data after each test 
    • Wrapping the entire script inside a single transaction 

INSIGHTS  

You will be provided the following input files: 

  • function.txt 
    • Contains the full PostgreSQL function definition that must be inserted directly into the script. 
  • table_schema.txt 
    • Contains DDL statements for all tables the function depends on. These must be recreated inline using CREATE TABLE IF NOT EXISTS. 
  • pg_stats.txt 
    • Contains statistics such as row counts, value ranges, or data types. Use this to inform your test data design — e.g., edge cases, value distribution, or NULL frequencies (see the example query after this list). 
  • sample_data.txt  
    • Example or real data for inspiration. You may draw from it to shape realistic test cases but are not required to use it directly. 
  • If any required tables or schemas are missing, you must create mock versions inside the same SQL script. 
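
For example, the column-level statistics behind pg_stats.txt can be pulled from PostgreSQL's built-in pg_stats view (row counts come from catalogs such as pg_class). The schema and table names below are the anonymized ones used later in this post and are only illustrative:

SQL
-- Illustrative query for gathering pg_stats.txt-style input; 'a' and 'table1'
-- are the anonymized schema/table names used later in this post.
SELECT attname,
       null_frac,        -- fraction of NULLs, informs NULL test cases
       n_distinct,       -- distinct-value estimate, hints at value spread
       most_common_vals  -- frequent values, candidates for valid-input tests
FROM pg_stats
WHERE schemaname = 'a'
  AND tablename  = 'table1';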

STATEMENTS  
  • The generated .sql script must follow these rules: 
    • File Naming: the script must be named exactly function_name.sql 
    • Structure: the entire script lives inside a single BEGIN; … ROLLBACK; block 

SQL
BEGIN; 
-- 1. CREATE SCHEMA 
CREATE SCHEMA IF NOT EXISTS schema_name; 
-- 2. CREATE TABLES 
CREATE TABLE IF NOT EXISTS ...; 
-- 3. FUNCTION DEFINITION 
CREATE OR REPLACE FUNCTION schema_name.function_name(...) 
RETURNS return_type AS $$ 
-- Paste from function.txt 
$$ LANGUAGE plpgsql; 
-- 4. TEST CASES (Repeatable structure using DO $$ blocks) 
-- 5. SUMMARY 
DO $$ 
BEGIN 
    RAISE NOTICE 'SUMMARY: % test cases executed.', total_tests; 
END $$; 
ROLLBACK;

  • Execution Flow 
    • Use the command: psql -U <username> -f <function_name.sql> 
    • If the script fails (due to errors, assertion failures, or missing objects), the AI must: 
      • Diagnose the issue 
      • Regenerate the SQL script 
      • Retry execution 
      • Repeat until the script runs cleanly and all tests pass 

  • Each test case must: 
    • Insert minimal required test data 
    • Execute the function 
    • Compare actual vs expected results 
    • Raise PASS/FAIL using RAISE NOTICE or RAISE EXCEPTION 
    • Clean up inserted data after the test 

  • Test Types to Cover 
    • Valid inputs 
    • Invalid inputs 
    • NULLs 
    • Edge cases (min/max, empty, boundary) 
    • Error-producing cases (e.g., exceptions; a sketch follows this list) 
    • State-dependent scenarios (if function uses DB state) 
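
As a concrete example of an error-producing case, the sketch below passes only when the call raises an exception. It is a hedged illustration rather than output from our runs; schema_name.function_name and the 'invalid' argument are placeholders.

SQL
-- Hypothetical sketch of an error-producing test case: schema_name.function_name
-- and the 'invalid' argument stand in for a real function and a failing input.
DO $$
DECLARE
    test_id INTEGER := 99;
    errored BOOLEAN := FALSE;
BEGIN
    BEGIN
        PERFORM schema_name.function_name('invalid');  -- call is expected to fail
    EXCEPTION
        WHEN OTHERS THEN
            errored := TRUE;
            RAISE NOTICE 'Test %: PASS - error raised as expected (%)', test_id, SQLERRM;
    END;

    IF NOT errored THEN
        RAISE EXCEPTION 'Test %: FAIL - expected an error, but the call succeeded', test_id;
    END IF;
END $$;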

  • Final Summary Block Example 
SQL
DO $$ 
DECLARE 
    total_tests INTEGER := ...; 
BEGIN 
    RAISE NOTICE 'SUMMARY: % test cases completed.', total_tests; 
END $$; 

PERSONALITY  

You are: 

  • Detail-oriented – test cases are precise and cleanly scoped 
  • Thorough – you consider all meaningful input categories 
  • Resilient – you regenerate and retry on failure 
  • Self-contained – you depend on no outside scripts 
  • Readable – all test results are clearly logged and explained 

EXPERIMENT 

SQL
-- TEST CASE 1

DO $$
DECLARE
    test_id INTEGER := 1;
    actual_result INTEGER;
    expected_result INTEGER := 42;
BEGIN
    INSERT INTO schema_name.input_table(col1) VALUES (21);

    SELECT schema_name.function_name(21) INTO actual_result;

    IF actual_result = expected_result THEN
        RAISE NOTICE 'Test %: PASS - Simple case', test_id;
    ELSE
        RAISE EXCEPTION 'Test %: FAIL - Expected %, Got %',
            test_id, expected_result, actual_result;
    END IF;

    DELETE FROM schema_name.input_table;
END $$; 

Generated Output: test_suite_script [o4-mini]

SQL
BEGIN; 
-- 1. SCHEMAS 
CREATE SCHEMA IF NOT EXISTS a; 
CREATE SCHEMA IF NOT EXISTS b; 
CREATE SCHEMA IF NOT EXISTS c; 

-- 2. TABLES 
CREATE TABLE IF NOT EXISTS a.table1( 
id SERIAL PRIMARY KEY, 
Column1 CHARACTER VARYING (255) NOT NULL, 
Column2 DATE, 
Column6 CHARACTER VARYING(2048) NOT NULL 
); 

CREATE TABLE IF NOT EXISTS b.table2( 
id SERIAL PRIMARY KEY, 
Column3 INTEGER, 
Column4 CHARACTER VARYING(30) NOT NULL, 
Column6 CHARACTER VARYING(2048) NOT NULL 
); 

-- 3. FUNCTION DEFINITION 
DROP FUNCTION IF EXISTS c.sample_function(CHARACTER VARYING); 
CREATE OR REPLACE FUNCTION c.sample_function(p_column1 CHARACTER VARYING) 
RETURNS JSON LANGUAGE plpgsql STABLE AS $$ 
DECLARE 
final_data JSON; 
BEGIN 
SELECT jsonb_agg(to_jsonb(t)) INTO final_data 
FROM ( 
SELECT cp.Column1, 
ad.Column4, 
cp.Column2, 
ad.Column6 AS app_title, 
cp.Column6 AS pat_title 
FROM a.table1 cp 
LEFT JOIN b.table2 ad ON cp.id = ad.Column3 
WHERE cp.Column1 = p_column1 
) t; 
RETURN final_data; 
END; 
$$; 

-- 4. TEST CASES 
-- TEST CASE 1: Simple match 

DO $$ 
DECLARE 
test_id INTEGER := 1; 
actual_result JSONB; 
expected_result JSONB :=  
'[{"column1":"P1","column4":"A1","column2":"2025-01-01","app_title":"AT1","pat_title":"PT1"}]'::jsonb; 
BEGIN 
INSERT INTO a.table1(Column1, Column2, Column6) 
VALUES ('P1','2025-01-01','PT1'); 
INSERT INTO b.table2(Column3, Column4, Column6) 
VALUES ((SELECT id FROM a.table1 WHERE Column1='P1'),'A1','AT1'); 

SELECT c.sample_function('P1')::jsonb INTO actual_result; 

IF actual_result = expected_result THEN 
RAISE NOTICE 'Test %: PASS - Simple match', test_id; 
ELSE 
RAISE EXCEPTION 'Test %: FAIL - Expected %, Got %',  
test_id, expected_result, actual_result; 
END IF; 

DELETE FROM b.table2; 
DELETE FROM a.table1; 
END $$; 

-- TEST CASE 2: No related table2 row (NULL fields) 
DO $$ 
DECLARE 
test_id INTEGER := 2; 
actual_result JSONB; 
expected_result JSONB :=  
'[{"column1":"P2","column4":null,"column2":"2025-01-02","app_title":null,"pat_title":"PT2"}]'::jsonb; 
BEGIN 
INSERT INTO a.table1(Column1, Column2, Column6) 
VALUES ('P2','2025-01-02','PT2'); 

SELECT c.sample_function('P2')::jsonb INTO actual_result; 

IF actual_result = expected_result THEN 
RAISE NOTICE 'Test %: PASS - NULL join fields', test_id; 
ELSE 
RAISE EXCEPTION 'Test %: FAIL - Expected %, Got %',  
test_id, expected_result, actual_result; 
END IF; 
DELETE FROM a.table1; 
END $$; 

-- TEST CASE 3: Non-existent Column1 → empty result 
DO $$ 
DECLARE 
test_id INTEGER := 3; 
actual_result JSONB; 
BEGIN 
SELECT c.sample_function('UNKNOWN')::jsonb INTO actual_result; 

IF actual_result IS NULL THEN 

RAISE NOTICE 'Test %: PASS - No rows returns NULL', test_id; 
ELSE 
RAISE EXCEPTION 'Test %: FAIL - Expected NULL, Got %', 
test_id, actual_result; 
END IF; 
END $$; 

-- TEST CASE 4: NULL input → NULL output 
DO $$ 
DECLARE 
test_id INTEGER := 4; 
actual_result JSONB; 
BEGIN 
SELECT c.sample_function(NULL)::jsonb INTO actual_result; 

IF actual_result IS NULL THEN 
RAISE NOTICE 'Test %: PASS - NULL input returns NULL', test_id; 
ELSE 
RAISE EXCEPTION 'Test %: FAIL - Expected NULL, Got %', 
test_id, actual_result; 
END IF; 
END $$; 

-- 5. SUMMARY 
DO $$ 
DECLARE 
total_tests INTEGER := 4; 
BEGIN 
RAISE NOTICE 'SUMMARY: % test cases executed.', total_tests; 
END $$; 
ROLLBACK;  

 

Output:
BEGIN 
CREATE SCHEMA 
CREATE SCHEMA 
psql:c.sample_function.sql:6: NOTICE: schema "c" already exists, skipping 
CREATE SCHEMA 
CREATE TABLE 
CREATE TABLE 
psql:c.sample_function.sql:24: NOTICE: function c.sample_function(pg_catalog.varchar) does not exist, skipping 
DROP FUNCTION 
CREATE FUNCTION 
psql:c.sample_function.sql:71: NOTICE: Test 1: PASS - Simple match 
DO 
psql:c.sample_function.sql:94: NOTICE: Test 2: PASS - NULL join fields 
DO 
psql:c.sample_function.sql:110: NOTICE: Test 3: PASS - No rows returns NULL 
DO 
psql:c.sample_function.sql:126: NOTICE: Test 4: PASS - NULL input returns NULL 
DO 
psql:c.sample_function.sql:134: NOTICE: SUMMARY: 4 test cases executed. 
DO 
ROLLBACK 

 

Architecture Illustration 
Conclusion  
Key Findings Summary 

Our evaluation of structured prompt engineering for SQL test generation highlights both the evolution of methodology and the importance of model capability:

  1. Structured Prompting is Essential 
    • Moving from unstructured prompting (~40–60% coverage, high manual effort) to structured frameworks improved success rates to the 85–90% range. 
    • Frameworks like TREE, TCR, RACE, and CRISPE all reduced errors and manual effort, with CRISPE consistently leading. 

  2. CRISPE Framework Superiority 
    • CRISPE achieved the highest accuracy and coverage across models, typically 88–90%. 
    • TREE and TCR provided reliable results in the 70–85% range, while RACE performed well only with stronger models. 

  3. Model Capability Thresholds 
    • GPT-4o-mini and GPT-4.1-mini consistently underperformed, requiring high manual effort → not suitable for enterprise automation. 
    • o4-mini delivered strong cost-performance balance with 85–90% coverage at short runtimes. 
    • GPT-5-nano raised the bar further, combining near-90% accuracy with the lowest cost and fastest generation times (1–3 minutes). 
    • GPT-5-mini and GPT-5 delivered high accuracy but at higher computational cost and longer runtimes. 

  4. Cost and Efficiency Redefined 
    • GPT-5-nano was the clear cost-performance leader, cutting costs by over 60% compared to GPT-4.1, while maintaining enterprise-level coverage and accuracy. 
    • o4-mini remained a reliable, efficient alternative when GPT-5 family models are unavailable. 
    • GPT-5 and GPT-5-mini provided strong reasoning and complex SQL handling but required trade-offs in time and budget. 

  5. Production Readiness Achieved 
    • Structured prompt engineering has matured from an experiment into a deployable, repeatable solution. 
    • With the right model–framework pairing, automated SQL test case generation is now practical for real-world use. 

Appendix A: GPT Model Specifications and Capabilities 

Disclaimer: 

The results and observations presented here are derived from our internal datasets and controlled experimental setups. They should be interpreted as indicative rather than definitive. Actual performance may vary significantly depending on a variety of factors, including but not limited to: 

  • Use case differences
    • Results can differ based on the nature of the application or problem being addressed. 
  • Dataset characteristics
    • Variations in data size, quality, and domain relevance may influence model behavior. 
  • Prompt design and context
    • Different prompt formulations, input complexity, or instructions can lead to different outcomes. 

These findings reflect our perspective at the time of testing and are intended to provide directional insights rather than absolute benchmarks. As AI models, training techniques, and evaluation methodologies continue to evolve, future results may differ from what is reported here. 
