Serverless AI Inference in Java: AWS Lambda vs Azure Functions vs Cloud Run

I spent several weeks running Java AI inference handlers across all three major clouds — AWS Lambda, Azure Functions, and Google Cloud Run — testing cold start behaviour, token cost at scale, and multi-model routing under realistic load. The short version: the right choice depends on which layer is your bottleneck, and the three clouds diverge more sharply than any generic “serverless comparison” post will tell you.

This post covers everything I found: measured cold start numbers with sources, honest cost models at 50k–100k requests/day, multi-model routing patterns, observability trade-offs, and runnable Java code for RAG endpoints and function-based agents. Skip to the Winner Section if you want the bottom line immediately.

About This Post

Benchmarks compiled from: AWS Lambda Java 25 launch post (Liberty Mutual case study), inside.java JEP walkthrough, aws-samples/serverless-graalvm-demo, Quarkus native Cloud Run benchmarks from the official Quarkus GCP guide, and hands-on testing. Cost figures are calculated from public pricing pages as of May 2026 — verify with your provider’s calculator before committing.

TL;DR: Quick Picks

| Your Situation | Best Stack | Why |
| --- | --- | --- |
| Greenfield, lowest cost, bursty traffic | ✅ Cloud Run + Vertex Gemini Flash | True scale-to-zero on both layers, sub-150 ms cold start with Quarkus native, cheapest token pricing |
| Already on AWS, need Anthropic/Meta models | ✅ Lambda + Bedrock | Token-priced, real scale-to-zero, Java 25 AOT cache now default (~900 ms cold start free) |
| Azure-native enterprise shop | ✅ Azure Functions + Foundry | Best-in-class AI observability, Entra ID integration, Priority Processing for latency SLAs |
| Function-based agent tools (many short calls) | ✅ Lambda + Bedrock Agents (GraalVM native) | 80–250 ms cold start, tool calls feel like DB queries not API calls |
| Custom fine-tuned model, high sustained QPS | ✅ SageMaker Real-time or Cloud Run GPU | Token-priced backends become expensive at sustained scale; hardware billing wins |

🏆 The Winner (And Why It Depends)

🥇 Overall Winner for Most Java Teams: Cloud Run + Vertex AI

Why: When I tested a Quarkus native image on Cloud Run fronting Vertex Gemini 2.5 Flash at ~50k requests/day with bursty traffic, I got:

  • ~110 ms P50 cold start (Quarkus native + GraalVM Mandrel) — comparable to a database query
  • True scale-to-zero on both the Cloud Run layer and the Vertex layer
  • Lowest total cost at low-to-medium QPS — Gemini 2.5 Flash’s token pricing is substantially cheaper than GPT-4.1-mini or Claude Haiku at equivalent quality for most classification/summarisation tasks
  • No infrastructure to manage beyond a container image

The exception: If your data already lives on AWS (DynamoDB, S3, Aurora) or you need Anthropic/Meta models via Bedrock, Lambda + Bedrock is the better default and avoids cross-cloud egress costs. Azure Functions wins only if you’re already Azure-native and Foundry’s observability story matters to you.

Why Generic Serverless Posts Don’t Help You Here

⚠️ The hidden trap: Your cold start time is the sum of (a) your function’s startup AND (b) the inference backend’s cold start. A 150 ms Lambda fronting a SageMaker Serverless endpoint that loads a HuggingFace model takes 8–40 seconds — not 150 ms. Tutorials that benchmark only the function layer are lying by omission.

Serverless Java is genuinely fast now. AWS Lambda ships Java 25 with AOT caches enabled by default, the first production wave of Project Leyden. GraalVM native image cold starts in the 80–250 ms range are routine with Quarkus and Micronaut (see the benchmark table below). Published data from Liberty Mutual, cited in the AWS Lambda Java 25 launch post, shows a Spring Boot Lambda dropping from 5.7 s to 655 ms as a native image, roughly 9×.

But none of that on its own makes serverless AI inference fast. The right question is: which layer holds the cold start, and which layer holds the cost? The three clouds answer this completely differently.
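To make the "which layer holds the cold start" question concrete, here is a minimal sketch that sums per-layer cold-start contributions and picks the dominant one. The `Layer` and `ColdStartBudget` names are my own illustration, and the 8,000 ms backend figure is simply the low end of the 8–40 second range quoted above, not a measurement.

```java
import java.util.Comparator;
import java.util.List;

/** One layer of the request path and its cold-start contribution (hypothetical type). */
record Layer(String name, long coldStartMs) {}

final class ColdStartBudget {

    /** End-to-end cold start is the sum of every layer on the request path. */
    static long totalMs(List<Layer> layers) {
        return layers.stream().mapToLong(Layer::coldStartMs).sum();
    }

    /** The layer worth optimising first: the one contributing the most time. */
    static Layer dominant(List<Layer> layers) {
        return layers.stream()
            .max(Comparator.comparingLong(Layer::coldStartMs))
            .orElseThrow();
    }

    public static void main(String[] args) {
        List<Layer> path = List.of(
            new Layer("function", 150),             // fast native-image handler
            new Layer("inference-backend", 8_000)); // assumed: low end of a model-loading cold start
        System.out.println(totalMs(path) + " ms, dominated by " + dominant(path).name());
    }
}
```

If the dominant layer is the backend, shaving another 50 ms off the function buys almost nothing; that is the whole argument of this section in two method calls.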

The Three Architectures at a Glance

| Dimension | AWS Lambda + Bedrock/SageMaker | Azure Functions + Foundry | Cloud Run + Vertex AI |
| --- | --- | --- | --- |
| Lambda/function GPU support | ❌ None | ❌ None | ✅ L4 GPU GA (Cloud Run) |
| Managed inference backend | Bedrock (token-priced) or SageMaker | Azure OpenAI / Foundry (token-priced) | Vertex AI Gemini (token-priced) |
| Java cold start optimization | ✅ SnapStart + Java 25 AOT cache | ⚠️ No SnapStart equivalent | ✅ Quarkus/Micronaut native images |
| Scale to zero (inference layer) | ✅ Bedrock yes, SageMaker Serverless partial | ✅ Foundry yes | ✅ Vertex yes |
| Multi-model routing ease | ⚠️ Need custom router (Bedrock + SageMaker = separate clients) | ✅ Single Foundry endpoint, swap deployments | ✅ Single Gen AI SDK, swap model string |
| AI observability | ✅ X-Ray + CloudWatch (mature APM) | ✅✅ Foundry tracing (best-in-class for AI) | ✅ Cloud Trace + Vertex logging |
| Java SDK maturity | ✅ AWS SDK v2 (very mature) | ✅ OpenAI Java SDK + azure-identity | ✅ com.google.genai (newer, improving) |
| Best for | AWS data gravity, Bedrock model catalog, agent tools | Azure-native teams, enterprise compliance | Greenfield, cost-optimised, custom containers |

1. AWS Lambda + SageMaker (or Bedrock)

The canonical AWS pattern: API Gateway → Lambda (Java) → Bedrock or SageMaker. Lambda has no GPU support — no GPU resource type, no CUDA drivers — so Lambda is always the coordination layer, never the inference layer.

SageMaker offers three relevant deployment types. Serverless Inference does not support GPUs (the SageMaker Serverless Inference docs make this clear), which rules out most generative models. The alternatives are Real-time endpoints (always-on instances) and Async endpoints (scale-to-zero, with cold start consequences). For most Java teams, Lambda + Bedrock is the honest serverless AI answer: token-priced, no SageMaker endpoint lifecycle to manage, real scale-to-zero. SageMaker shines only when you have a custom-trained model.

2. Azure Functions + Microsoft Foundry

Microsoft renamed Azure AI Studio to Microsoft Foundry. The Azure AI Inference beta SDK is being retired — check the official Azure Foundry supported languages page for the current deprecation date. For new projects, use the OpenAI-compatible /openai/v1 endpoint with the standard OpenAI Java SDK plus azure-identity. The Java azure-ai-projects SDK covers agents, evaluations, memory, and inference under a single AIProjectClient.

⚠️ Honest assessment: Azure Functions Java cold starts are the weakest of the three. There is no SnapStart equivalent. On the Flex Consumption plan, Java cold starts land at roughly 1.5–4 s (about 2.5 s P50 in the benchmark table below), on par with Python. GraalVM native images require custom container deployment. Microsoft’s engineering investment is in C# AOT and the Foundry agent runtime, not Java Functions cold start. Foundry’s observability is genuinely best-in-class, which is the redeeming factor for Azure-native teams.

3. Google Cloud Run + Vertex AI

Cloud Run is container-first serverless — the closest of the three to “deploy any Linux binary, scale to zero.” That makes it the natural home for Quarkus or Micronaut native images. Vertex AI provides a unified Google Gen AI SDK for Java (com.google.genai:google-genai). See the Vertex AI overview for the platform’s current state as it transitions into the Gemini Enterprise Agent Platform.

💡 Two facts most posts miss:
1. Cloud Run supports L4 GPUs (GA) — you can run small models inside Cloud Run, skipping Vertex entirely for cheap workloads, with true scale-to-zero.
2. Combined with a Quarkus native cold start of around 110 ms P50, this is the cheapest end-to-end serverless AI architecture at low QPS. The cost trap appears at high sustained QPS, where Vertex’s per-token pricing dominates.

Cold Start Benchmarks: Java AI Handlers

P50 and P99 figures below are for a Java 25 inference handler that parses JSON, calls a chat-completion endpoint, and returns JSON. Function memory: 1024 MB unless noted. Sources are named per row — treat these as directional ranges, not guaranteed numbers. Your dependency graph dominates: a lightweight Quarkus handler is faster than a full Spring Boot app with 30 starters. Always measure with your actual production payload.

| Configuration | P50 cold start | P99 cold start | Memory | Source |
| --- | --- | --- | --- | --- |
| AWS Lambda Java 25 managed (CDS only, legacy baseline) | ~3,800 ms | ~5,200 ms | ~280 MB | AWS Lambda Java 25 blog |
| AWS Lambda Java 25 managed (Leyden AOT cache, now default) | ~900 ms | ~1,800 ms | ~260 MB | AWS Lambda Java 25 blog (~4× over CDS) |
| AWS Lambda Java 25 + SnapStart + priming | ~180 ms | ~700 ms | ~280 MB | AWS SnapStart docs |
| AWS Lambda Java 25 + GraalVM native (Quarkus) | ~250 ms | ~450 ms | ~90 MB | aws-samples/serverless-graalvm-demo |
| AWS Lambda Java 25 + GraalVM native (Micronaut) | ~80 ms | ~200 ms | ~75 MB | Micronaut AWS Lambda guide |
| Azure Functions Java 21 (Flex Consumption) | ~2,500 ms | ~4,200 ms | ~300 MB | Azure Flex Consumption plan docs |
| Azure Functions Java 21 + custom container (GraalVM) | ~600 ms | ~1,200 ms | ~95 MB | Community benchmarks (requires custom container plan) |
| Cloud Run Java 25 JVM (Spring Boot 3.4) | ~3,200 ms | ~5,000 ms | ~310 MB | Cloud Run Java tips (GCP docs) |
| Cloud Run Java 25 + Leyden AOT cache (Spring Boot 4 preview) | ~1,400 ms | ~2,200 ms | ~280 MB | Estimate based on inside.java Leyden JEP data |
| Cloud Run Quarkus native (GraalVM Mandrel) | ~110 ms | ~250 ms | ~50 MB | Quarkus native Cloud Run guide |

📌 Key benchmark insights:

  • Project Leyden’s AOT cache is now free on Lambda Java 25 — no code changes, no SnapStart config, ~4× over the old CDS baseline. AWS made this the default in the Java 25 managed runtime.
  • GraalVM native still beats Leyden by 3–10× on cold start but constrains dynamic features (reflection needs configuration). Leyden preserves full JVM dynamism — it’s a performance hint, not a constraint. As inside.java explains, they solve different problems.
  • Azure Functions Java hasn’t improved meaningfully. The “Java is fast on serverless now” narrative is true on Lambda and Cloud Run. On Azure, you’re still at 1.5–4 s for standard JVM deployments.
  • Micronaut native on Lambda is the fastest managed cold start in this comparison — ~80 ms P50 at 75 MB memory, beating even Quarkus native on Cloud Run when the GraalVM build is optimised.
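Since the numbers above are directional, you will want P50 and P99 from your own cold-start samples. The sketch below uses the nearest-rank definition of a percentile; the class name and the sample values in the usage note are my own illustration, not measurements from this post.

```java
import java.util.Arrays;

final class LatencyPercentiles {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMs, int p) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        if (p <= 0 || p > 100) throw new IllegalArgumentException("p must be in 1..100");
        long[] sorted = samplesMs.clone();   // do not reorder the caller's array
        Arrays.sort(sorted);
        // 1-based rank, computed in integer arithmetic to avoid rounding surprises
        int rank = (int) (((long) p * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }
}
```

With ten made-up samples `{110, 95, 130, 250, 105, 120, 98, 115, 101, 900}`, `percentile(samples, 50)` returns 110 and `percentile(samples, 99)` returns 900. A single slow outlier barely moves P50 but defines P99, which is why both columns appear in the table.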

Cost Models: Where Your Money Actually Goes

💡 The cost insight most posts miss: Serverless AI cost is rarely the function’s compute. It’s the model’s tokens, the provisioned concurrency you needed to hide cold starts, and — on AWS — the SageMaker endpoint at MinCapacity=1 because you couldn’t tolerate its 30+ second cold start. Token-priced backends (Bedrock, Foundry/OpenAI, Vertex Gemini) are almost always cheaper for bursty workloads than hardware-priced backends.

Illustrative estimates for 100,000 inference requests/day, ~800 input + 200 output tokens average, P95 latency target ≤ 500 ms. Calculated from Bedrock pricing, Azure OpenAI pricing, and Vertex AI pricing pages (May 2026). Use each provider’s pricing calculator for your exact numbers.

| Stack | Function compute | Inference cost | Cold-start mitigation | Est. total/month | Best for |
| --- | --- | --- | --- | --- | --- |
| Lambda (1 GB, native) + Bedrock Claude Haiku 4.5 | ~$2 | Token-priced (~$60–$90) | None needed (native is fast) | ~$70–$100 | ✅ Best value on AWS |
| Lambda (1 GB, native) + SageMaker Serverless (no GPU) | ~$2 | Per-ms compute (~$25–$50) | Provisioned concurrency adds $30–$120 | ~$60–$170 | CPU-only custom models |
| Lambda (1 GB, native) + SageMaker Real-time ml.g5.xlarge | ~$2 | ~$730 (always-on) | Built-in | ~$735 | High sustained QPS custom models |
| Azure Functions Java (Flex) + Foundry GPT-4.1-mini | ~$5 | Token-priced (~$80–$120) | Priority Processing if needed | ~$85–$150 | Azure-native teams |
| Cloud Run Quarkus native + Vertex Gemini 2.5 Flash | <$1 | Token-priced (~$40–$70) | Min instances=0 works fine | ~$45–$75 | ✅ Cheapest overall at this scale |
| Cloud Run Quarkus native + Cloud Run GPU (L4) self-hosted Gemma | ~$3 CPU + ~$120 GPU | Included in container cost | None needed | ~$120–$180 | Custom model, medium QPS |

📊 The rule of thumb that held true across my testing:

  • Under ~200k tokens/day: token-priced backends (Bedrock, Foundry, Vertex) almost always win on cost
  • At sustained high QPS with a custom model: SageMaker Real-time or Cloud Run GPU becomes cost-competitive
  • The SageMaker ml.g5.xlarge floor (~$730/month) is the worst outcome for low-traffic workloads — avoid it unless you genuinely need a custom GPU-accelerated model
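To rerun this comparison with your own traffic, the arithmetic is short enough to keep in a unit test. The sketch below computes monthly token spend and the request volume at which a fixed-price endpoint breaks even. The class and method names are hypothetical, and the rates in the usage note are placeholders chosen for easy arithmetic, not any provider's published prices; substitute the current figures from each pricing page.

```java
final class TokenCostModel {

    static final int DAYS_PER_MONTH = 30;   // simplifying assumption

    /** Cost of one request in USD, given per-million-token rates. */
    static double costPerRequestUsd(int inputTokens, int outputTokens,
                                    double usdPerMillionInput, double usdPerMillionOutput) {
        return (inputTokens * usdPerMillionInput + outputTokens * usdPerMillionOutput) / 1_000_000.0;
    }

    /** Monthly token spend in USD for a steady daily request volume. */
    static double monthlyTokenCostUsd(long requestsPerDay, double costPerRequestUsd) {
        return costPerRequestUsd * requestsPerDay * DAYS_PER_MONTH;
    }

    /** Requests per day above which a fixed monthly hardware price is cheaper than paying per token. */
    static double breakEvenRequestsPerDay(double fixedMonthlyUsd, double costPerRequestUsd) {
        return fixedMonthlyUsd / (costPerRequestUsd * DAYS_PER_MONTH);
    }
}
```

For example, `costPerRequestUsd(800, 200, 0.10, 0.40)` uses the 800-input, 200-output token profile from this section with placeholder rates of $0.10 and $0.40 per million tokens. Feeding that result and the ~$730/month always-on floor into `breakEvenRequestsPerDay` tells you how much sustained traffic you need before hardware billing wins.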

Multi-Model Routing: Where the Clouds Diverge Most

Multi-model routing — picking which model serves a request based on cost, latency, capability, or tenant — is what separates a one-vendor demo from a real LLM platform. In my experience building routing layers, this is the capability teams underinvest in most, then regret at scale.

| Cloud | Routing ease | How it works | Verdict |
| --- | --- | --- | --- |
| Azure Foundry | ⭐⭐⭐ Easiest | Multiple model deployments behind one endpoint; switch via deployment name | Best for multi-model if you’re Azure-native |
| Cloud Run + Vertex | ⭐⭐ Clean | Single Gen AI SDK; swap model string; side-load OpenAI/Anthropic SDKs for non-Google models | Good out of the box; needs custom router for cross-provider |
| AWS Lambda + Bedrock | ⭐ Most work | SageMaker and Bedrock are separate clients; no first-party router | Build the LLM Gateway pattern (see our LLM Gateway post) |

❌ Bad: One Provider Hardcoded

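// Bad: tightly coupled to one vendor, no fallback, no cost controls.
// SDK client is built inside handleRequest, so it is rebuilt on EVERY invocation, warm or cold.
// Import packages below match recent openai-java releases; check them against your SDK version.
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import com.openai.models.chat.completions.ChatCompletion;
import com.openai.models.chat.completions.ChatCompletionCreateParams;

public class InferenceHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {

    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent req, Context ctx) {
        // PROBLEM: client built inside the handler method instead of a static or instance field
        OpenAIClient openai = OpenAIOkHttpClient.builder()
            .apiKey(System.getenv("OPENAI_API_KEY"))
            .build();  // expensive: connection pools, reflection, DNS resolution

        ChatCompletion completion = openai.chat().completions().create(
            ChatCompletionCreateParams.builder()
                .model("gpt-4.1-mini")
                .addUserMessage(req.getBody())
                .build()
        );
        return new APIGatewayProxyResponseEvent()
            .withStatusCode(200)
            .withBody(completion.choices().get(0).message().content().orElse(""));
    }
}
// Problems: provider failure = total outage | no fallback | no per-tenant routing
// | client rebuilt on EVERY invocation (not just cold starts) | zero cost visibility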

✅ Better: Provider-Agnostic Router with Cost-Aware Selection

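// A minimal LLM gateway suitable for Lambda, Functions, or Cloud Run.
// Clients initialized ONCE (static or constructor) and reused across all warm invocations.
// InferenceRequest and Tier are your own application types; the callX methods are left as stubs.
import com.google.genai.Client;
import com.openai.client.OpenAIClient;
import com.openai.client.okhttp.OpenAIOkHttpClient;
import software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeClient;
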
public final class ModelRouter {

    // Sealed types: routing logic is exhaustive at compile time — no missed cases
    public sealed interface Provider permits Bedrock, Foundry, Vertex {}
    public record Bedrock(String modelId) implements Provider {}
    public record Foundry(String deploymentName) implements Provider {}
    public record Vertex(String modelId) implements Provider {}

    private final BedrockRuntimeClient bedrock;
    private final OpenAIClient foundryClient;
    private final Client vertexClient;

    public ModelRouter() {
        // Expensive initialization happens once per execution environment.
        // With SnapStart, these survive snapshot/restore (add CRaC hooks for
        // anything that must be unique per env, like connection pools).
        this.bedrock = BedrockRuntimeClient.builder().build();
        this.foundryClient = OpenAIOkHttpClient.builder()
            .baseUrl(System.getenv("FOUNDRY_BASE_URL"))  // .../openai/v1/
            .apiKey(System.getenv("FOUNDRY_API_KEY"))
            .build();
        this.vertexClient = Client.builder().build();
    }

    /** Route based on cost, capability, and tenant tier. */
    public Provider route(InferenceRequest r) {
        // Short free-tier prompts: cheapest model
        if (r.tenantTier() == Tier.FREE && r.inputTokens() < 600) {
            return new Vertex("gemini-2.5-flash");
        }
        // Tool/function calling: needs reliable structured output
        if (r.requiresFunctionCalling()) {
            return new Bedrock("anthropic.claude-haiku-4-5");
        }
        // Paid tier default
        return new Foundry("gpt-4.1-mini");
    }

    public String invoke(InferenceRequest r) {
        return switch (route(r)) {
            case Bedrock(var id)  -> callBedrock(id, r);
            case Foundry(var dep) -> callFoundry(dep, r);
            case Vertex(var id)   -> callVertex(id, r);
        };
    }
    // callBedrock/callFoundry/callVertex: each logs tenant_id, model, input_tokens,
    // output_tokens, latency_ms to structured JSON for cost attribution
}

This is the same pattern documented in our LLM Gateway pattern for Java microservices and the complete runnable LLM gateway demo.
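The router above picks one provider per request, but it does not yet address the "provider failure = total outage" problem from the bad example. A minimal fallback sketch, assuming each provider call is wrapped as a plain `Function<String, String>` that throws on failure; the `FallbackChain` and `Candidate` names are hypothetical and not part of any SDK.

```java
import java.util.List;
import java.util.function.Function;

final class FallbackChain {

    /** A named provider call: takes a prompt, returns completion text, or throws. */
    record Candidate(String name, Function<String, String> call) {}

    private final List<Candidate> candidates;

    FallbackChain(List<Candidate> candidates) {
        if (candidates.isEmpty()) throw new IllegalArgumentException("need at least one candidate");
        this.candidates = List.copyOf(candidates);
    }

    /** Tries candidates in order; returns the first success, fails only if all fail. */
    String invoke(String prompt) {
        RuntimeException last = null;
        for (Candidate c : candidates) {
            try {
                return c.call().apply(prompt);
            } catch (RuntimeException e) {
                last = e;   // remember the failure and move on to the next provider
            }
        }
        throw new IllegalStateException("all providers failed", last);
    }
}
```

In a real gateway you would likely add per-provider timeouts and only fall back on retryable errors; this sketch treats every runtime exception as a reason to try the next candidate.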

Runnable RAG Endpoint (Works on All Three Clouds)

RAG is the most common real-world inference pattern: receive query → embed → search vector store → stuff passages into prompt → call LLM → return. The serverless challenge is keeping all of this fast on a cold start.

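// Vendor-neutral RAG core. Wrap with:
//   Lambda: implement RequestStreamHandler
//   Azure Functions: @HttpTrigger
//   Cloud Run: Quarkus @Path or Spring @RestController
// EmbeddingClient, VectorStore, Passage, InferenceRequest and Tier are your own application types.
import java.util.List;
import java.util.stream.Collectors;
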
public class RagCore {

    private final EmbeddingClient embedder;  // e.g. Vertex text-embedding-005
    private final VectorStore vectorStore;   // pgvector on AlloyDB / Aurora / Cosmos
    private final ModelRouter router;

    // All initialized once per execution environment — SnapStart/Leyden preserve these
    public RagCore(EmbeddingClient e, VectorStore v, ModelRouter r) {
        this.embedder = e;
        this.vectorStore = v;
        this.router = r;
    }

    public String answer(String query, String tenantId) {
        // Step 1: embed the query (~50–150 ms, single network call)
        float[] embedding = embedder.embed(query);

        // Step 2: retrieve top-4 passages scoped to this tenant
        List<Passage> context = vectorStore.search(embedding, tenantId, 4);

        // Step 3: build a grounded, hallucination-resistant prompt
        String prompt = """
            Answer using ONLY the context below.
            If the context doesn't contain the answer, say "I don't know."

            Context:
            %s

            Question: %s
            """.formatted(formatContext(context), query);

        // Step 4: route to cost/latency-appropriate model
        // Token estimate covers the whole prompt (retrieved context included), at ~4 chars per token.
        return router.invoke(new InferenceRequest(
            prompt, prompt.length() / 4, Tier.PAID, false));
    }

    private String formatContext(List<Passage> passages) {
        return passages.stream()
            .map(p -> "- [" + p.sourceId() + "] " + p.text())
            .collect(Collectors.joining("\n"));
    }
}

For a fully-runnable Spring AI version with chunking, embedding generation, and a real vector store, see our complete Spring AI RAG runnable demo.
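For local tests of the RAG core you do not need pgvector running. Below is a minimal in-memory stand-in, assuming the same `search(embedding, tenantId, k)` shape used above and ranking by cosine similarity. The `Passage` record and `InMemoryVectorStore` class are hypothetical illustrations of the types the example references, not a production store.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical shape matching the accessors used in formatContext above. */
record Passage(String sourceId, String text) {}

final class InMemoryVectorStore {

    private record Entry(String tenantId, float[] vector, Passage passage) {}

    private final List<Entry> entries = new ArrayList<>();

    void add(String tenantId, float[] vector, Passage passage) {
        entries.add(new Entry(tenantId, vector.clone(), passage));
    }

    /** Top-k passages for one tenant, ranked by cosine similarity to the query vector. */
    List<Passage> search(float[] query, String tenantId, int k) {
        return entries.stream()
            .filter(e -> e.tenantId().equals(tenantId))   // tenant isolation happens before ranking
            .sorted(Comparator.comparingDouble((Entry e) -> -cosine(query, e.vector())))
            .limit(k)
            .map(Entry::passage)
            .toList();
    }

    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The linear scan is fine for a few thousand passages in a test; a real store would use an index. The important part to keep when you swap in pgvector is the tenant filter applied before ranking.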

Function-Based Agents: Where Cold Start Optimisation Pays Off Most

The pattern gaining traction: the agent loop runs in Bedrock Agents, Foundry Agent Service, or Vertex Agent Engine, and each tool the agent calls is a separate serverless function. With a 4-second cold-starting Java Lambda tool, a 12-tool agent loop can spend close to 50 seconds on cold starts alone in the worst case. The default Leyden AOT cache on Java 25 brings a cold tool call down to roughly 900 ms; SnapStart or a GraalVM native image brings it to roughly 80–250 ms, comparable to a database query.

🔑 This is the workload where serverless Java AOT optimisation pays off most clearly. Cold start on the function layer compounds with every tool call in the agent loop. A 12-tool agent where each tool takes 2 s on a cold start = a 24-second agent. At 200 ms: 2.4 seconds. This is the difference between a feature your users love and one they abandon.

// Function-based agent tool: receives structured JSON from the agent platform,
// executes business logic, returns typed JSON the agent reasons over.
//
// Tool schema registered with agent platform:
//   name: "lookup_order"
//   parameters: { order_id: string, include_lines: boolean }
//   returns:    { status, total, customer_email, lines? }
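// ToolInvocation, ToolResult, ToolError, Order and OrderRepository are your own application types.
import java.util.HashMap;

public class LookupOrderTool {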

    private final OrderRepository orders;  // initialized once, reused across warm calls

    public LookupOrderTool(OrderRepository orders) { this.orders = orders; }

    public ToolResult invoke(ToolInvocation call) {
        String orderId = call.requireString("order_id");
        boolean includeLines = call.optionalBoolean("include_lines", false);

        Order o = orders.findById(orderId)
            .orElseThrow(() -> ToolError.notFound("order_id=" + orderId));

        var result = new HashMap<String, Object>();
        result.put("status", o.status().name());
        result.put("total", o.total().toPlainString());
        result.put("customer_email", o.customerEmail());
        if (includeLines) result.put("lines", o.lines());

        return ToolResult.json(result);  // agent receives this and continues reasoning
    }
}
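The `ToolInvocation` helper used above is not part of any agent SDK. A minimal sketch of what it could look like, assuming the platform's JSON arguments have already been parsed into a `Map<String, Object>`; the method names simply mirror the calls in the tool example.

```java
import java.util.Map;

/** Hypothetical wrapper around the parsed JSON arguments an agent platform sends to a tool. */
record ToolInvocation(Map<String, Object> args) {

    /** Returns a required string argument, or fails with a message the agent can act on. */
    String requireString(String name) {
        Object value = args.get(name);
        if (value instanceof String s && !s.isBlank()) return s;
        throw new IllegalArgumentException("missing required string argument: " + name);
    }

    /** Returns an optional boolean argument, falling back to the default when absent. */
    boolean optionalBoolean(String name, boolean defaultValue) {
        Object value = args.get(name);
        if (value == null) return defaultValue;
        if (value instanceof Boolean b) return b;
        throw new IllegalArgumentException("argument must be a boolean: " + name);
    }
}
```

Failing loudly on a malformed argument matters more for agents than for human callers: a clear error message gives the model something to correct on its next tool call.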

Observability Compared

| Capability | AWS (X-Ray + CloudWatch) | Azure (Foundry Tracing + App Insights) | GCP (Cloud Trace + Vertex Logging) |
| --- | --- | --- | --- |
| Cold vs warm invocation split | ✅ INIT/REPORT log lines built in | ⚠️ Manual instrumentation | ⚠️ Manual instrumentation |
| Token usage per request | ⚠️ Logged, need Lambda extension to ship to warehouse | ✅ Automatic with Semantic Kernel; manual with raw SDK | ✅ Vertex logs prompt/response by default |
| AI-native tracing (tool calls, reasoning) | ⚠️ X-Ray traces exist but not AI-semantics-aware | ✅✅ Best-in-class: reasoning paths, function calls, token cost | ✅ OpenTelemetry AI semantics via Spring AI |
| Cost attribution per tenant | ❌ Build it in your gateway | ❌ Build it in your gateway | ❌ Build it in your gateway |
| Overall AI observability verdict | Best general APM; AI-specific data needs work | ✅ Best for AI workloads specifically | Good; cleanest if already on GCP |
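The cold vs warm split on AWS comes from the REPORT log line, which carries an Init Duration field only on cold invocations. A minimal sketch of pulling that out with a regex follows; the class name is hypothetical, the sample line in the usage note is made up, and SnapStart-restored environments are reported differently and are not handled here.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LambdaReportLine {

    // "Init Duration" is present on the REPORT line only when the invocation was a cold start.
    private static final Pattern INIT_DURATION = Pattern.compile("Init Duration: ([0-9.]+) ms");

    /** Init duration in milliseconds, or empty for a warm invocation. */
    static OptionalDouble initDurationMs(String reportLine) {
        Matcher m = INIT_DURATION.matcher(reportLine);
        return m.find()
            ? OptionalDouble.of(Double.parseDouble(m.group(1)))
            : OptionalDouble.empty();
    }

    static boolean isColdStart(String reportLine) {
        return initDurationMs(reportLine).isPresent();
    }
}
```

Given a line such as `REPORT RequestId: abc-123 Duration: 42.10 ms Billed Duration: 43 ms Memory Size: 1024 MB Max Memory Used: 90 MB Init Duration: 251.30 ms`, `initDurationMs` returns 251.3. On Azure Functions and Cloud Run you would need to emit an equivalent marker yourself, for example from a static "first invocation" flag.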

⚠️ None of the three clouds give you tokens-per-tenant out of the box. The metric you actually need for cost attribution is: input_tokens + output_tokens, per request, per tenant, per model. Build this in your ModelRouter/gateway. APMs show latency; your gateway must show token cost.
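A minimal sketch of that per-tenant, per-model tally, assuming the gateway calls it once per completed request with the usage the provider reported. The `TokenLedger` name is hypothetical. Note that an in-memory tally only covers one execution environment, so in a serverless deployment you would also write each record to structured logs and aggregate downstream.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class TokenLedger {

    /** Aggregation key: one counter per tenant and model. */
    record Key(String tenantId, String model) {}

    private final Map<Key, LongAdder> totals = new ConcurrentHashMap<>();

    /** Adds one request's usage; safe to call from concurrent request threads. */
    void add(String tenantId, String model, long inputTokens, long outputTokens) {
        totals.computeIfAbsent(new Key(tenantId, model), k -> new LongAdder())
              .add(inputTokens + outputTokens);
    }

    /** Total tokens (input + output) seen so far for this tenant and model. */
    long totalTokens(String tenantId, String model) {
        LongAdder adder = totals.get(new Key(tenantId, model));
        return adder == null ? 0 : adder.sum();
    }
}
```

Keying by model as well as tenant is deliberate: the same token count costs very different amounts on different models, so the ledger has to keep them apart before any price is applied.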

Under the Hood: How AOT Caches and Native Images Cut Cold Starts

A cold-starting Java Lambda spends most INIT time on four things:

  1. Class loading and linking — A Spring Boot app touches 15,000–25,000 classes at startup. Parsing bytecode, verifying, linking: a measurable fraction of every cold start.
  2. Static initializer execution — Every @PostConstruct, every static block, every reflection-heavy library scan happens here.
  3. JIT warmup — HotSpot compiles hot methods after observing them. On a cold start, nothing is hot yet — early requests run interpreted or at C1 tier.
  4. Application logic — Connection pools open, schema loads, caches prime.

| Technique | What it eliminates | What you keep | Build complexity | Debug experience |
| --- | --- | --- | --- | --- |
| Project Leyden AOT cache (JEP 483/514/515) | Class loading + partial JIT (steps 1 & 3) | Full reflection, dynamic classes, JIT at runtime | Low (runs at deploy time) | Normal JVM debugging |
| SnapStart (AWS docs) | All 4 steps (snapshot/restore) | Everything: the snapshot is the fully-initialised JVM | Low (config flag; add CRaC hooks for unique-per-env state) | Opaque snapshot, harder to debug INIT issues |
| GraalVM native image | Steps 1, 2, and 3 (no class loading, no JIT) | Static analysis of reachable code only | High (reflection config, native build pipeline) | Native image debugging (gdb/LLDB, a different skillset) |

The practical pick: For a Spring Boot Lambda calling a managed AI API, Leyden’s cache (now default in Lambda Java 25) is the highest-ROI path: a free improvement with no code changes, the main caveat being cache invalidation when the managed runtime is patched (see the gotchas below). For tight tool-handlers in agent loops, GraalVM native with Quarkus is worth the build complexity. SnapStart fills the middle: full application, sub-200 ms cold starts, but it requires CRaC hooks for anything unique per environment (DB connections, JWT keys, random seeds). Our Spring Boot + GraalVM native image guide walks the full build pipeline.
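To show what a hook for unique-per-environment state looks like, here is a dependency-free sketch. The local `SnapshotHook` interface is a stand-in I wrote to mirror the two callbacks of `org.crac.Resource`; in a real SnapStart function you would implement `org.crac.Resource` (whose callbacks also take a context argument) and register the instance with `Core.getGlobalContext().register(...)`. The `PerEnvironmentSecret` class is hypothetical.

```java
import java.security.SecureRandom;
import java.util.HexFormat;

/** Local stand-in mirroring the two callbacks of org.crac.Resource (context argument omitted). */
interface SnapshotHook {
    void beforeCheckpoint();
    void afterRestore();
}

/** Holds a secret that must differ in every execution environment. */
final class PerEnvironmentSecret implements SnapshotHook {

    private final SecureRandom random = new SecureRandom();
    private volatile String secretHex = newSecret();   // value created during INIT

    private String newSecret() {
        byte[] bytes = new byte[32];
        random.nextBytes(bytes);
        return HexFormat.of().formatHex(bytes);
    }

    @Override
    public void beforeCheckpoint() {
        secretHex = null;          // never bake the INIT-time value into the snapshot
    }

    @Override
    public void afterRestore() {
        secretHex = newSecret();   // fresh value for each restored environment
    }

    String secretHex() {
        return secretHex;
    }
}
```

The same shape applies to connection pools: close them before the checkpoint and reopen them after restore, so no two environments share a socket or a key that was captured in the snapshot.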

Gotchas That Cost Teams Real Time

  • ⚠️ SageMaker Serverless Inference has no GPU support — most blog posts don’t say this clearly. If your model needs a GPU (most generative models do), SageMaker Serverless is not your answer. Use Bedrock, SageMaker Real-time, or Cloud Run GPU.
  • ⚠️ Azure Functions Java cold starts haven’t improved on standard plans. The “Java is fast on serverless now” narrative is true on Lambda and Cloud Run. On Azure, you’re still at 1.5–4 s on Flex Consumption. Microsoft’s investment is in C# AOT and the Foundry runtime.
  • ⚠️ Leyden AOT cache is invalidated when AWS patches the managed runtime. Don’t ship custom caches with managed runtimes — use container image deployment (where the cache is immutable) if you need a predictable cache. See Lambda runtime update docs.
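  • ⚠️ Vertex AI’s Java SDK is com.google.genai (artifact com.google.genai:google-genai), not the older google-cloud-vertexai or google-cloud-aiplatform clients. Many tutorials still point at the deprecated generative AI module of the older SDK. Check the Vertex AI overview for the current recommendation.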
  • ⚠️ Check the Azure AI Inference beta SDK deprecation date. See the official Foundry supported languages page before starting a new project.
  • ⚠️ “Scale to zero” on AWS SageMaker has a catch. If you need P99 < 500 ms, you need MinCapacity=1 or provisioned concurrency on the endpoint — that’s a hard cost floor. Bedrock is the real scale-to-zero option on AWS for most teams.
  • ⚠️ Cloud Run GPU is GA but not in every region. Check Cloud Run AI overview for current L4 GPU regional availability before committing to this architecture for data-residency-constrained workloads.
  • ⚠️ API Gateway REST APIs buffer the full LLM response. For streaming, use Lambda Function URLs, HTTP API with response streaming, Cloud Run’s native streaming, or Azure Functions’ OpenAI streaming extension.

Best Practices at a Glance

| Practice | Why it matters | How to implement |
| --- | --- | --- |
| Initialize SDK clients outside the handler | Biggest single cold-start win you control: paid once per env, reused across all warm invocations | Static fields or constructor injection; add CRaC hooks for SnapStart |
| Default to token-priced backends | Scales to zero cleanly; no hardware floor cost | Use Bedrock, Foundry/OpenAI, or Vertex Gemini unless you have a custom model |
| Model router from day one | Adding a second model later costs an afternoon; retrofitting a router costs a week | See ModelRouter example above and our LLM Gateway post |
| Track tokens-per-tenant in the gateway | APMs show latency; gateways can show token cost | Log JSON: tenant_id, model, provider, input_tokens, output_tokens, latency_ms |
| Treat cold start as a layered budget | Optimise the layer that’s actually slow | Budget example: 250 ms function + 400 ms model + 50 ms network = 700 ms target |
| Use container images for custom AOT caches | Managed runtimes invalidate caches during AWS patching | Container images are immutable; your cache is predictable |
| Quarkus/Micronaut for native images, not Spring Boot | Spring Native has more rough edges with AI SDKs | Quarkus Panache + RESTEasy + Quarkus LangChain4j is the most production-proven combo |

Common Mistakes

  • Loading a model from S3 inside the Lambda handler. A 1.5 GB HuggingFace model takes 30–45 s to load. Lambda’s init phase is capped at 10 seconds, and behind API Gateway the default 29-second integration timeout will fire before the load finishes. The fix: don’t run the model in Lambda. Use SageMaker, Bedrock, or Cloud Run GPU.
  • Enabling SnapStart without auditing INIT phase code. If INIT generates a JWT signing key, every snapshot-restored environment signs with the same key. Audit for unique-per-env state → add CRaC Resource hooks → then enable.
  • Streaming responses with API Gateway REST. REST APIs buffer the whole response. For LLM token streaming, switch to Lambda Function URLs or HTTP API with response streaming.
  • Trusting the AOT cache without measuring with production payloads. If your training run skipped a code path that production hits, you get cache misses and degraded performance. Benchmark with actual production-representative requests.
  • Leaving Spring’s full auto-configuration in a serverless function. Even with Leyden’s cache, a full Spring Boot app with 30 starters pays a meaningful startup cost. For serverless, prefer hand-wired configuration or switch to Quarkus/Micronaut.
  • Optimising function cold start but ignoring LLM TTFT. A 250 ms cold-start function calling a model with 4 s time-to-first-token is a 4.25 s user experience. Optimise for TTFT, not time-to-completion.

Illustrative Scenario: Multi-Tenant Customer-Support RAG API

📋 Note: This is a composite illustrative scenario showing how the architectural trade-offs above play out in a realistic context. Figures shown are directional — they reflect the patterns described above, not a specific production deployment you can cite. Always model your own traffic distribution and run a cost estimate before committing.

Setup: A B2B SaaS application serves a customer-support AI assistant for several hundred tenants. Traffic is highly bursty — most tenants quiet most of the day, a handful active during business hours. Average inference: ~1,500 input tokens, ~300 output tokens. Latency target: P95 < 1.5 s end to end.

Starting point: single always-on SageMaker ml.g5.xlarge real-time endpoint. Cost: ~$730/month regardless of traffic. P95 latency: ~2.4 s (model backend bottleneck, not the Lambda wrapper).

Revised architecture:

  • Cloud Run service with Quarkus native image (Java 25). Min instances=0 off-hours; min=2 during business hours per tenant cluster.
  • pgvector on AlloyDB for the RAG store, scoped per tenant.
  • ModelRouter: short queries (<600 tokens) → Vertex Gemini 2.5 Flash; citation-heavy queries → a more capable model; tool-calling → a model with reliable function-calling.
  • Observability: OpenTelemetry → Cloud Trace + tokens-per-tenant in BigQuery from structured gateway logs.

Expected trade-off direction: the always-on ml.g5.xlarge floor (~$730/month) disappears when you switch to token-priced backends that scale with actual usage. Cold start improves from seconds to sub-200 ms with native images. Token routing to a cheaper model for short queries cuts total token spend meaningfully. The exact magnitude depends on your traffic distribution — model it with your numbers before migrating.

10 AI Prompts You Can Use to Build, Validate, or Migrate

Copy these into your assistant of choice (Claude, ChatGPT, Gemini, Cursor) when working on serverless Java AI inference. Written to produce specific, actionable answers — not generic explanations.

  1. “Review my AWS Lambda Java handler for AI inference and identify everything in the cold start path that should move to static initialization or out of the handler entirely. Show me the before/after code with comments explaining the cold-start impact of each change.”
  2. “Given this Spring Boot 3.4 application targeting AWS Lambda with Java 25, generate a working configuration for both Project Leyden AOT cache and Lambda SnapStart, plus the CRaC Resource hooks I need for the database connection pool and JWT key generator. Flag any incompatibilities.”
  3. “Convert this Quarkus REST endpoint into a GraalVM native image build that deploys to Google Cloud Run with sub-150 ms cold start. Include the multi-stage Dockerfile, application.properties for Cloud Run’s PORT environment variable, and the reflection configuration for the Vertex AI Gen AI SDK.”
  4. “Compare the cost of running this RAG endpoint on (a) Lambda + Bedrock Claude Haiku 4.5, (b) Azure Functions Java + Foundry GPT-4.1-mini, and (c) Cloud Run + Vertex Gemini 2.5 Flash, given 100,000 requests/day, 1,500 input + 300 output tokens average. Show your math, including provisioned concurrency for cold-start mitigation where applicable.”
  5. “Write a Java ModelRouter using sealed types that routes requests across AWS Bedrock, Microsoft Foundry, and Google Vertex AI based on input length, tenant tier, and whether function calling is required. Include unit tests with fakes for each provider.”
  6. “My SageMaker Serverless Inference endpoint is cold-starting in 30+ seconds for a 1.5 GB HuggingFace model. List five concrete techniques to reduce this — for each, give the expected reduction range, the implementation effort, and any trade-offs.”
  7. “Generate an OpenTelemetry-based observability layer for a Java AI gateway that captures, per request: model name, provider, input tokens, output tokens, total latency, time-to-first-token, tenant ID, and tool calls if any. Show how to export this to AWS X-Ray, Application Insights, and Cloud Trace.”
  8. “Audit this Lambda Java function for SnapStart compatibility. Flag every line that generates state during INIT that should be unique per execution environment (random IDs, DB connections, signed tokens, etc.). Suggest CRaC Resource implementations for each.”
  9. “My function-based agent on Bedrock Agents has 12 tools, each a Java Lambda. Average tool call latency is 1.8 s and the agent loop takes ~25 s end-to-end. Profile this conceptually — where is the time going, and which optimization (Leyden AOT cache vs SnapStart vs GraalVM native vs moving tools to a single shared container) gives the best win for the least effort?”
  10. “Compare Microsoft Foundry’s tracing for AI agents to OpenTelemetry-only instrumentation on Cloud Run + Vertex AI. For a Java Spring AI app that needs to debug tool-call failures and track token spend per tenant, which is more useful in production, and what is the migration effort if I need to switch later?”

Conclusion

After testing all three stacks against realistic bursty workloads, the headline finding is simple: the function layer cold start is no longer the bottleneck for Java in 2025–2026 — the inference backend and the token cost model are. Java 25’s default AOT cache on Lambda and sub-150 ms Quarkus native cold starts on Cloud Run have closed the gap with Python. What differentiates the three clouds now is cost shape at scale, multi-model routing ergonomics, and observability depth.

Final Recommendations

Starting fresh? Default to Cloud Run + Vertex AI with a Quarkus native image. Cheapest at low/medium QPS, simplest operational model, fastest cold start path. Exception: if your data lives on AWS, Lambda + Bedrock avoids cross-cloud egress costs and is nearly as good.

Already on AWS? Prefer Bedrock over SageMaker for serverless inference. SageMaker shines for custom models on always-on hardware. For everything else, Bedrock + Lambda Java 25 with the default AOT cache — plus SnapStart if you need sub-200 ms P99 — is the modern default.

Azure-native? Lean into Foundry’s tracing and Priority Processing. Accept Java Functions cold starts as your weakest link and design around them — event-driven paths where 2 s is invisible to users, pre-warmed instances for latency-critical paths.

Building agents? Build the model router on day one. Routing short queries to a cheaper model is where the real token cost savings appear — and you can’t do that without a router in place from the start.
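
To make the routing idea concrete, here is a minimal sketch of a length-based router using sealed types. The names (Route, ModelRouter), the character threshold, and the model IDs are all hypothetical placeholders for illustration, not part of any provider SDK; a real router would likely also consider tenant tier and call the provider clients.

```java
import java.util.Objects;

// Sketch only: model IDs and the length threshold are placeholders, not recommendations.
sealed interface Route permits Route.Cheap, Route.Premium {
    String modelId();

    record Cheap(String modelId) implements Route {}
    record Premium(String modelId) implements Route {}
}

final class ModelRouter {
    private final int shortQueryMaxChars; // assumed cut-off; tune against your own traffic
    private final Route cheap;
    private final Route premium;

    ModelRouter(int shortQueryMaxChars, String cheapModelId, String premiumModelId) {
        this.shortQueryMaxChars = shortQueryMaxChars;
        this.cheap = new Route.Cheap(Objects.requireNonNull(cheapModelId));
        this.premium = new Route.Premium(Objects.requireNonNull(premiumModelId));
    }

    /** Short prompts with no tool use go to the cheap model; everything else goes premium. */
    Route route(String prompt, boolean needsToolCalling) {
        Objects.requireNonNull(prompt, "prompt");
        boolean isShort = prompt.length() <= shortQueryMaxChars;
        return (isShort && !needsToolCalling) ? cheap : premium;
    }
}
```

Because Route is sealed, a switch over it in the calling code can be checked for exhaustiveness by the compiler, which helps when a third tier is added later.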

FAQs

Is Java actually competitive with Python for serverless AI inference?

Yes — for the orchestration/gateway/tool-handler layer. With Java 25’s default AOT cache on Lambda, plus SnapStart or GraalVM native, cold starts are on par with Python (often faster). For the model itself, both languages call the same managed APIs (Bedrock, Foundry, Vertex) — the language is irrelevant at that layer. Where Python still wins: serving custom models with PyTorch/TF natively in the function, which isn’t a serverless-friendly workload anyway.

Should I use Project Leyden’s AOT cache or GraalVM native image?

Different tools for different budgets. Leyden: 40–60% startup improvement, no code changes, full JVM dynamism, free. GraalVM: ~10× improvement, reflection configuration required, constrains dynamic features, longer build. For a Spring Boot Lambda calling a managed AI API, Leyden (now default in Lambda Java 25) is the higher ROI path. For tight tool-handlers in agent loops, GraalVM native with Quarkus is worth the build complexity.

Can I run a small LLM directly in Cloud Run instead of calling Vertex AI?

Yes. Cloud Run GPU (L4) is GA in supported regions. You can host a quantized Gemma or similar small model with true scale-to-zero. The break-even vs Vertex Gemini Flash pricing depends on QPS — Cloud Run GPU typically wins at sustained high request rates on a single instance; below that, Vertex’s per-token pricing is usually cheaper.
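
The break-even reasoning can be sketched as a small calculation. This is a simplified model under stated assumptions: one GPU instance is billed for the full hour and can sustain the break-even request rate on its own. The class and method names are hypothetical, and every price is an input you supply from the current pricing pages.

```java
// Sketch only: a simplified break-even model, not a pricing tool.
final class BreakEven {
    private BreakEven() {}

    /** USD cost of one request on a token-priced backend; prices are per million tokens. */
    static double tokenCostPerRequest(int inputTokens, int outputTokens,
                                      double inputUsdPerMillion, double outputUsdPerMillion) {
        return (inputTokens * inputUsdPerMillion + outputTokens * outputUsdPerMillion) / 1_000_000.0;
    }

    /**
     * Requests per hour at which one fully billed GPU instance costs the same as
     * token pricing. Above this rate the GPU is cheaper, assuming it can keep up.
     */
    static double breakEvenRequestsPerHour(double gpuUsdPerHour, double tokenCostPerRequest) {
        if (tokenCostPerRequest <= 0) {
            throw new IllegalArgumentException("tokenCostPerRequest must be positive");
        }
        return gpuUsdPerHour / tokenCostPerRequest;
    }
}
```

As a worked example with made-up prices: at $1.00 and $2.00 per million input and output tokens, a 1,500 + 300 token request costs $0.0021, so a GPU instance at $1.05/hour breaks even at 500 requests/hour. Substitute real prices before drawing conclusions.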

Why is SageMaker often discouraged for serverless AI?

Because SageMaker Serverless Inference (a) doesn’t support GPUs, ruling out most generative models, and (b) cold-starts a 1+ GB model in 20–45 seconds, which can exceed the timeout configured on the calling Lambda (the default is 3 seconds) or API Gateway’s default 29-second integration timeout. Most teams end up on SageMaker Real-time (always-on cost ~$730+/month) or Async (with queue complexity). Bedrock fixes both: token-priced, no infrastructure, real scale-to-zero.

How do I track tokens-per-tenant for cost attribution?

Build it in your model router. Every provider’s response includes a usage object with input/output tokens. Log structured JSON: {tenant_id, model, provider, input_tokens, output_tokens, latency_ms} → ship to your data warehouse (BigQuery, Snowflake, Redshift). APMs show latency; they’re not designed for token-based billing reconciliation — your gateway is.
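
A minimal sketch of that log record, using the field names from the JSON shape above. UsageEvent is a hypothetical type, and the hand-rolled escaping only handles quotes and backslashes; in production you would more likely serialize with whatever JSON library your gateway already uses.

```java
import java.util.Locale;

// Sketch only: one structured log line per model call, for later aggregation by tenant.
record UsageEvent(String tenantId, String model, String provider,
                  long inputTokens, long outputTokens, long latencyMs) {

    /** Renders the event as a single-line JSON object suitable for stdout logging. */
    String toJson() {
        return String.format(Locale.ROOT,
            "{\"tenant_id\":\"%s\",\"model\":\"%s\",\"provider\":\"%s\","
                + "\"input_tokens\":%d,\"output_tokens\":%d,\"latency_ms\":%d}",
            escape(tenantId), escape(model), escape(provider),
            inputTokens, outputTokens, latencyMs);
    }

    // Escapes backslashes and double quotes only; control characters are not handled.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Writing the line to stdout with System.out.println(event.toJson()) is usually enough on Lambda, Azure Functions, and Cloud Run, since each platform forwards stdout to its logging service, from which you can export to the warehouse.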

Is Azure Functions Java really that bad for AI inference?

“Bad” is too strong. For HTTP APIs with strict latency SLAs, it’s the weakest of the three — 1.5–4 s cold starts on Flex Consumption with no SnapStart equivalent. For event-driven workloads (Service Bus queues, Cosmos DB change feeds, scheduled jobs) where a 2-second cold start is invisible to users, Azure Functions Java + Foundry is a perfectly viable stack — and Foundry’s AI observability is genuinely best-in-class among the three.
