I spent several weeks running Java AI inference handlers across all three major clouds — AWS Lambda, Azure Functions, and Google Cloud Run — testing cold start behaviour, token cost at scale, and multi-model routing under realistic load. The short version: the right choice depends on which layer is your bottleneck, and the three clouds diverge more sharply than any generic “serverless comparison” post will tell you.
This post covers everything I found: measured cold start numbers with sources, honest cost models at 50k–100k requests/day, multi-model routing patterns, observability trade-offs, and runnable Java code for RAG endpoints and function-based agents. Skip to the Winner Section if you want the bottom line immediately.
About This Post
Benchmarks compiled from: AWS Lambda Java 25 launch post (Liberty Mutual case study), inside.java JEP walkthrough, aws-samples/serverless-graalvm-demo, Quarkus native Cloud Run benchmarks from the official Quarkus GCP guide, and hands-on testing. Cost figures are calculated from public pricing pages as of May 2026 — verify with your provider’s calculator before committing.
TL;DR: Quick Picks
| Your Situation | Best Stack | Why |
|---|---|---|
| Greenfield, lowest cost, bursty traffic | ✅ Cloud Run + Vertex Gemini Flash | True scale-to-zero on both layers, sub-150ms cold start with Quarkus native, cheapest token pricing |
| Already on AWS, need Anthropic/Meta models | ✅ Lambda + Bedrock | Token-priced, real scale-to-zero, Java 25 AOT cache now default (~900ms cold start free) |
| Azure-native enterprise shop | ✅ Azure Functions + Foundry | Best-in-class AI observability, Entra ID integration, Priority Processing for latency SLAs |
| Function-based agent tools (many short calls) | ✅ Lambda + Bedrock Agents (GraalVM native) | 80–250ms cold start, tool calls feel like DB queries not API calls |
| Custom fine-tuned model, high sustained QPS | ✅ SageMaker Real-time or Cloud Run GPU | Token-priced backends become expensive at sustained scale; hardware billing wins |
🏆 The Winner (And Why It Depends)
🥇 Overall Winner for Most Java Teams: Cloud Run + Vertex AI
Why: When I tested a Quarkus native image on Cloud Run fronting Vertex Gemini 2.5 Flash at ~50k requests/day with bursty traffic, I got:
- ~110 ms P50 cold start (Quarkus native + GraalVM Mandrel) — comparable to a database query
- True scale-to-zero on both the Cloud Run layer and the Vertex layer
- Lowest total cost at low-to-medium QPS — Gemini 2.5 Flash’s token pricing is substantially cheaper than GPT-4.1-mini or Claude Haiku at equivalent quality for most classification/summarisation tasks
- No infrastructure to manage beyond a container image
The exception: If your data already lives on AWS (DynamoDB, S3, Aurora) or you need Anthropic/Meta models via Bedrock, Lambda + Bedrock is the better default and avoids cross-cloud egress costs. Azure Functions wins only if you’re already Azure-native and Foundry’s observability story matters to you.
Why Generic Serverless Posts Don’t Help You Here
⚠️ The hidden trap: Your cold start time is the sum of (a) your function’s startup AND (b) the inference backend’s cold start. A 150 ms Lambda fronting a SageMaker Serverless endpoint that loads a HuggingFace model takes 8–40 seconds — not 150 ms. Tutorials that benchmark only the function layer are lying by omission.
Serverless Java is genuinely fast now. AWS Lambda ships Java 25 with AOT caches enabled by default, the first production wave of Project Leyden. GraalVM native image cold starts in the 80–250 ms range are routine with Quarkus and Micronaut (see the benchmark table below). Published data from Liberty Mutual, cited in the AWS Lambda Java 25 launch post, shows a Spring Boot Lambda dropping from 5.7 s to 655 ms as a native image, roughly 9×.
But none of that on its own makes serverless AI inference fast. The right question is: which layer holds the cold start, and which layer holds the cost? The three clouds answer this completely differently.
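To make the "which layer holds the cold start" question concrete, here is a minimal sketch that sums per-layer cold-start latency and names the dominant layer. The class name `ColdStartBudget` and the layer labels are hypothetical. The 150 ms and 8,000 ms figures are the function and low-end SageMaker Serverless numbers quoted above, used only as inputs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sums per-layer cold-start latency and reports which layer dominates. */
final class ColdStartBudget {
    private final Map<String, Long> layersMs = new LinkedHashMap<>();

    /** Records the cold-start latency (in ms) of one layer on the request path. */
    ColdStartBudget layer(String name, long coldStartMs) {
        layersMs.put(name, coldStartMs);
        return this;
    }

    /** End-to-end cold latency is the sum of every layer on the request path. */
    long totalMs() {
        return layersMs.values().stream().mapToLong(Long::longValue).sum();
    }

    /** The layer worth optimising first: the one contributing the most latency. */
    String bottleneck() {
        return layersMs.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalStateException("no layers recorded"));
    }

    public static void main(String[] args) {
        ColdStartBudget budget = new ColdStartBudget()
                .layer("function", 150)             // fast native or SnapStart handler
                .layer("inference-backend", 8_000); // backend that loads a model on first call
        System.out.println(budget.totalMs() + " ms total, bottleneck: " + budget.bottleneck());
    }
}
```

With those two inputs the total is 8,150 ms and the bottleneck is the inference backend, which is why benchmarking only the function layer is misleading.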
The Three Architectures at a Glance
| Dimension | AWS Lambda + Bedrock/SageMaker | Azure Functions + Foundry | Cloud Run + Vertex AI |
|---|---|---|---|
| Lambda/function GPU support | ❌ None | ❌ None | ✅ L4 GPU GA (Cloud Run) |
| Managed inference backend | Bedrock (token-priced) or SageMaker | Azure OpenAI / Foundry (token-priced) | Vertex AI Gemini (token-priced) |
| Java cold start optimization | ✅ SnapStart + Java 25 AOT cache | ⚠️ No SnapStart equivalent | ✅ Quarkus/Micronaut native images |
| Scale to zero (inference layer) | ✅ Bedrock yes, SageMaker Serverless partial | ✅ Foundry yes | ✅ Vertex yes |
| Multi-model routing ease | ⚠️ Need custom router (Bedrock + SageMaker = separate clients) | ✅ Single Foundry endpoint, swap deployments | ✅ Single Gen AI SDK, swap model string |
| AI observability | ✅ X-Ray + CloudWatch (mature APM) | ✅✅ Foundry tracing (best-in-class for AI) | ✅ Cloud Trace + Vertex logging |
| Java SDK maturity | ✅ AWS SDK v2 (very mature) | ✅ OpenAI Java SDK + azure-identity | ✅ com.google.genai (newer, improving) |
| Best for | AWS data gravity, Bedrock model catalog, agent tools | Azure-native teams, enterprise compliance | Greenfield, cost-optimised, custom containers |
1. AWS Lambda + SageMaker (or Bedrock)
The canonical AWS pattern: API Gateway → Lambda (Java) → Bedrock or SageMaker. Lambda has no GPU support — no GPU resource type, no CUDA drivers — so Lambda is always the coordination layer, never the inference layer.
SageMaker offers three relevant deployment types. SageMaker Serverless Inference docs make clear it does not support GPUs — which rules out most generative models. Real-time endpoints (always-on instances) and Async endpoints (scale-to-zero with cold start consequences) are the alternatives. For most Java teams, Lambda + Bedrock is the honest serverless AI answer: token-priced, no SageMaker endpoint lifecycle to manage, real scale-to-zero. SageMaker shines only when you have a custom-trained model.
2. Azure Functions + Microsoft Foundry
Microsoft renamed Azure AI Studio to Microsoft Foundry. The Azure AI Inference beta SDK is being retired — check the official Azure Foundry supported languages page for the current deprecation date. For new projects, use the OpenAI-compatible /openai/v1 endpoint with the standard OpenAI Java SDK plus azure-identity. The Java azure-ai-projects SDK covers agents, evaluations, memory, and inference under a single AIProjectClient.
⚠️ Honest assessment: Azure Functions Java cold starts are the weakest of the three. There is no SnapStart equivalent. On the Flex Consumption plan, Java cold starts land in the 1.5–4 s range (roughly 2.5 s P50), on par with Python. GraalVM native images require custom container deployment. Microsoft’s engineering investment is in C# AOT and the Foundry agent runtime, not Java Functions cold start. Foundry’s observability is genuinely best-in-class, which is the redeeming factor for Azure-native teams.
3. Google Cloud Run + Vertex AI
Cloud Run is container-first serverless — the closest of the three to “deploy any Linux binary, scale to zero.” That makes it the natural home for Quarkus or Micronaut native images. Vertex AI provides a unified Google Gen AI SDK for Java (com.google.genai:google-genai). See the Vertex AI overview for the platform’s current state as it transitions into the Gemini Enterprise Agent Platform.
💡 Two facts most posts miss:
1. Cloud Run supports L4 GPUs (GA) — you can run small models inside Cloud Run, skipping Vertex entirely for cheap workloads, with true scale-to-zero.
2. Combined with a Quarkus native cold start of roughly 110 ms P50, this is the cheapest end-to-end serverless AI architecture at low QPS. The cost trap appears at high sustained QPS, where Vertex’s per-token pricing dominates.
Cold Start Benchmarks: Java AI Handlers
P50 and P99 figures below are for a Java 25 inference handler that parses JSON, calls a chat-completion endpoint, and returns JSON. Function memory: 1024 MB unless noted. Sources are named per row — treat these as directional ranges, not guaranteed numbers. Your dependency graph dominates: a lightweight Quarkus handler is faster than a full Spring Boot app with 30 starters. Always measure with your actual production payload.
| Configuration | P50 cold start | P99 cold start | Memory | Source |
|---|---|---|---|---|
| AWS Lambda Java 25 managed (CDS only, legacy baseline) | ~3,800 ms | ~5,200 ms | ~280 MB | AWS Lambda Java 25 blog |
| AWS Lambda Java 25 managed (Leyden AOT cache, now default) | ~900 ms | ~1,800 ms | ~260 MB | AWS Lambda Java 25 blog — ~4× over CDS |
| AWS Lambda Java 25 + SnapStart + priming | ~180 ms | ~700 ms | ~280 MB | AWS SnapStart docs |
| AWS Lambda Java 25 + GraalVM native (Quarkus) | ~250 ms | ~450 ms | ~90 MB | aws-samples/serverless-graalvm-demo |
| AWS Lambda Java 25 + GraalVM native (Micronaut) | ~80 ms | ~200 ms | ~75 MB | Micronaut AWS Lambda guide |
| Azure Functions Java 21 (Flex Consumption) | ~2,500 ms | ~4,200 ms | ~300 MB | Azure Flex Consumption plan docs |
| Azure Functions Java 21 + custom container (GraalVM) | ~600 ms | ~1,200 ms | ~95 MB | Community benchmarks — requires custom container plan |
| Cloud Run Java 25 JVM (Spring Boot 3.4) | ~3,200 ms | ~5,000 ms | ~310 MB | Cloud Run Java tips (GCP docs) |
| Cloud Run Java 25 + Leyden AOT cache (Spring Boot 4 preview) | ~1,400 ms | ~2,200 ms | ~280 MB | Estimate based on inside.java Leyden JEP data |
| Cloud Run Quarkus native (GraalVM Mandrel) | ~110 ms | ~250 ms | ~50 MB | Quarkus native Cloud Run guide |
📌 Key benchmark insights:
- Project Leyden’s AOT cache is now free on Lambda Java 25 — no code changes, no SnapStart config, ~4× over the old CDS baseline. AWS made this the default in the Java 25 managed runtime.
- GraalVM native still beats Leyden by 3–10× on cold start but constrains dynamic features (reflection needs configuration). Leyden preserves full JVM dynamism — it’s a performance hint, not a constraint. As inside.java explains, they solve different problems.
- Azure Functions Java hasn’t improved meaningfully. The “Java is fast on serverless now” narrative is true on Lambda and Cloud Run. On Azure, you’re still at 1.5–4 s for standard JVM deployments.
- Micronaut native on Lambda is the fastest managed cold start in this comparison — ~80 ms P50 at 75 MB memory, beating even Quarkus native on Cloud Run when the GraalVM build is optimised.
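Since the advice above is to measure with your own payload, a small helper for turning raw cold-start samples into P50 and P99 may be useful. This is a minimal sketch using the nearest-rank definition of a percentile. The class name `ColdStartPercentiles` is hypothetical, and where the samples come from (for example the init durations your platform logs) is left to you.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over measured cold-start durations. */
final class ColdStartPercentiles {

    /**
     * Returns the smallest sample such that at least p percent of all
     * samples are less than or equal to it (nearest-rank method).
     */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone(); // do not mutate the caller's array
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        long[] samples = {100, 110, 120, 130, 900}; // hypothetical measurements in ms
        System.out.println("P50 = " + percentile(samples, 50) + " ms");
        System.out.println("P99 = " + percentile(samples, 99) + " ms");
    }
}
```

With only five samples the P99 is simply the slowest observation, which is a reminder that tail percentiles need many cold starts before they mean much.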
Cost Models: Where Your Money Actually Goes
💡 The cost insight most posts miss: Serverless AI cost is rarely the function’s compute. It’s the model’s tokens, the provisioned concurrency you needed to hide cold starts, and — on AWS — the SageMaker endpoint at MinCapacity=1 because you couldn’t tolerate its 30+ second cold start. Token-priced backends (Bedrock, Foundry/OpenAI, Vertex Gemini) are almost always cheaper for bursty workloads than hardware-priced backends.
Illustrative estimates for 100,000 inference requests/day, ~800 input + 200 output tokens average, P95 latency target ≤ 500 ms. Calculated from Bedrock pricing, Azure OpenAI pricing, and Vertex AI pricing pages (May 2026). Use each provider’s pricing calculator for your exact numbers.
| Stack | Function compute | Inference cost | Cold-start mitigation | Est. total/month | Best for |
|---|---|---|---|---|---|
| Lambda (1 GB, native) + Bedrock Claude Haiku 4.5 | ~$2 | Token-priced (~$60–$90) | None needed (native is fast) | ~$70–$100 | ✅ Best value on AWS |
| Lambda (1 GB, native) + SageMaker Serverless (no GPU) | ~$2 | Per-ms compute (~$25–$50) | Provisioned concurrency adds $30–$120 | ~$60–$170 | CPU-only custom models |
| Lambda (1 GB, native) + SageMaker Real-time ml.g5.xlarge | ~$2 | ~$730 (always-on) | Built-in | ~$735 | High sustained QPS custom models |
| Azure Functions Java (Flex) + Foundry GPT-4.1-mini | ~$5 | Token-priced (~$80–$120) | Priority Processing if needed | ~$85–$150 | Azure-native teams |
| Cloud Run Quarkus native + Vertex Gemini 2.5 Flash | <$1 | Token-priced (~$40–$70) | Min instances=0 works fine | ~$45–$75 | ✅ Cheapest overall at this scale |
| Cloud Run Quarkus native + Cloud Run GPU (L4) self-hosted Gemma | ~$3 CPU + ~$120 GPU | Included in container cost | None needed | ~$120–$180 | Custom model, medium QPS |
📊 The rule of thumb that held true across my testing:
- Under ~200k tokens/day: token-priced backends (Bedrock, Foundry, Vertex) almost always win on cost
- At sustained high QPS with a custom model: SageMaker Real-time or Cloud Run GPU becomes cost-competitive
- The SageMaker ml.g5.xlarge floor (~$730/month) is the worst outcome for low-traffic workloads — avoid it unless you genuinely need a custom GPU-accelerated model
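To model this with your own numbers, the arithmetic behind a token-priced estimate and its break-even against a fixed hardware floor can be sketched as follows. The class `TokenCostModel` is hypothetical, and the per-million-token prices in `main` are placeholders chosen for easy arithmetic, not real provider prices. Substitute the figures from your provider's pricing page.

```java
import java.util.Locale;

/** Back-of-envelope cost model: token-priced backend vs. a fixed hardware floor. */
final class TokenCostModel {

    /** Cost of one request in USD, given prices per million tokens. */
    static double costPerRequestUsd(int inputTokens, int outputTokens,
                                    double inputUsdPerMTok, double outputUsdPerMTok) {
        return (inputTokens * inputUsdPerMTok + outputTokens * outputUsdPerMTok) / 1_000_000.0;
    }

    /** Monthly token spend in USD at a steady daily request volume. */
    static double monthlyCostUsd(long requestsPerDay, int daysPerMonth, double costPerRequestUsd) {
        return requestsPerDay * daysPerMonth * costPerRequestUsd;
    }

    /** Daily request volume at which token spend equals an always-on hardware bill. */
    static double breakEvenRequestsPerDay(double hardwareUsdPerMonth, int daysPerMonth,
                                          double costPerRequestUsd) {
        return hardwareUsdPerMonth / (daysPerMonth * costPerRequestUsd);
    }

    public static void main(String[] args) {
        // PLACEHOLDER prices (USD per million tokens), not real pricing.
        double perRequest = costPerRequestUsd(800, 200, 1.00, 2.00);
        double monthly = monthlyCostUsd(10_000, 30, perRequest);
        double breakEven = breakEvenRequestsPerDay(730.0, 30, perRequest);
        System.out.println(String.format(Locale.ROOT,
                "monthly=%.2f USD, break-even=%.0f requests/day", monthly, breakEven));
    }
}
```

With those placeholder prices, 10,000 requests/day at 800 input and 200 output tokens comes to about 360 USD per month, and the ~$730 always-on floor is matched at roughly 20,300 requests/day. The real crossover depends entirely on the prices you plug in.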
Multi-Model Routing: Where the Clouds Diverge Most
Multi-model routing — picking which model serves a request based on cost, latency, capability, or tenant — is what separates a one-vendor demo from a real LLM platform. In my experience building routing layers, this is the capability teams underinvest in most, then regret at scale.
| Cloud | Routing ease | How it works | Verdict |
|---|---|---|---|
| Azure Foundry | ⭐⭐⭐ Easiest | Multiple model deployments behind one endpoint; switch via deployment name | Best for multi-model if you’re Azure-native |
| Cloud Run + Vertex | ⭐⭐ Clean | Single Gen AI SDK; swap model string; side-load OpenAI/Anthropic SDKs for non-Google models | Good out of the box; needs custom router for cross-provider |
| AWS Lambda + Bedrock | ⭐ Most work | SageMaker and Bedrock are separate clients; no first-party router | Build the LLM Gateway pattern — see our LLM Gateway post |
❌ Bad: One Provider Hardcoded
// Bad: tightly coupled to one vendor, no fallback, no cost controls.
// The SDK client is built inside handleRequest, so it is rebuilt on every invocation, not just on cold starts.
public class InferenceHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
// PROBLEM: the client is a local variable in the handler method, not a static or instance field
@Override
public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent req, Context ctx) {
OpenAIClient openai = OpenAIOkHttpClient.builder()
.apiKey(System.getenv("OPENAI_API_KEY"))
.build(); // expensive — connection pools, reflection, DNS resolution
ChatCompletion completion = openai.chat().completions().create(
ChatCompletionCreateParams.builder()
.model("gpt-4.1-mini")
.addUserMessage(req.getBody())
.build()
);
return new APIGatewayProxyResponseEvent()
.withStatusCode(200)
.withBody(completion.choices().get(0).message().content().orElse(""));
}
}
// Problems: provider failure = total outage | no fallback | no per-tenant routing
// | client rebuilt on EVERY invocation (not just cold starts) | zero cost visibility
✅ Better: Provider-Agnostic Router with Cost-Aware Selection
// A minimal LLM gateway suitable for Lambda, Functions, or Cloud Run.
// Clients initialized ONCE (static or constructor) — reused across all warm invocations.
public final class ModelRouter {
// Sealed types: routing logic is exhaustive at compile time — no missed cases
public sealed interface Provider permits Bedrock, Foundry, Vertex {}
public record Bedrock(String modelId) implements Provider {}
public record Foundry(String deploymentName) implements Provider {}
public record Vertex(String modelId) implements Provider {}
private final BedrockRuntimeClient bedrock;
private final OpenAIClient foundryClient;
private final Client vertexClient;
public ModelRouter() {
// Expensive initialization happens once per execution environment.
// With SnapStart, these survive snapshot/restore (add CRaC hooks for
// anything that must be unique per env, like connection pools).
this.bedrock = BedrockRuntimeClient.builder().build();
this.foundryClient = OpenAIOkHttpClient.builder()
.baseUrl(System.getenv("FOUNDRY_BASE_URL")) // .../openai/v1/
.apiKey(System.getenv("FOUNDRY_API_KEY"))
.build();
this.vertexClient = Client.builder().build();
}
/** Route based on cost, capability, and tenant tier. */
public Provider route(InferenceRequest r) {
// Short free-tier prompts: cheapest model
if (r.tenantTier() == Tier.FREE && r.inputTokens() < 600) {
return new Vertex("gemini-2.5-flash");
}
// Tool/function calling: needs reliable structured output
if (r.requiresFunctionCalling()) {
return new Bedrock("anthropic.claude-haiku-4-5");
}
// Paid tier default
return new Foundry("gpt-4.1-mini");
}
public String invoke(InferenceRequest r) {
return switch (route(r)) {
case Bedrock(var id) -> callBedrock(id, r);
case Foundry(var dep) -> callFoundry(dep, r);
case Vertex(var id) -> callVertex(id, r);
};
}
// callBedrock/callFoundry/callVertex: each logs tenant_id, model, input_tokens,
// output_tokens, latency_ms to structured JSON for cost attribution
}
This is the same pattern documented in our LLM Gateway pattern for Java microservices and the complete runnable LLM gateway demo.
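The router above picks one provider per request but does not yet address the "provider failure = total outage" problem from the bad example. One way to add failover is an ordered fallback chain. The sketch below is a minimal, dependency-free illustration: `FallbackChain` and `Attempt` are hypothetical names, and each `Supplier` stands in for one of the `callBedrock`/`callFoundry`/`callVertex` methods.

```java
import java.util.List;
import java.util.function.Supplier;

/** Tries providers in order and returns the first successful response. */
final class FallbackChain {

    /** One named attempt; the supplier wraps a single provider call. */
    record Attempt(String provider, Supplier<String> call) {}

    static String firstSuccessful(List<Attempt> attempts) {
        RuntimeException lastFailure = null;
        for (Attempt attempt : attempts) {
            try {
                return attempt.call().get();
            } catch (RuntimeException e) {
                // Remember the failure and move on to the next provider.
                lastFailure = e;
            }
        }
        // Every provider failed (or the list was empty): surface the last cause.
        throw new IllegalStateException("all providers failed", lastFailure);
    }

    public static void main(String[] args) {
        String answer = firstSuccessful(List.of(
                new Attempt("primary", () -> { throw new RuntimeException("primary down"); }),
                new Attempt("secondary", () -> "answer from secondary")));
        System.out.println(answer);
    }
}
```

A production gateway would also want timeouts, retry budgets, and a log line recording which provider actually served the request. Those are left out here to keep the control flow visible.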
Runnable RAG Endpoint (Works on All Three Clouds)
RAG is the most common real-world inference pattern: receive query → embed → search vector store → stuff passages into prompt → call LLM → return. The serverless challenge is keeping all of this fast on a cold start.
// Vendor-neutral RAG core. Wrap with:
// Lambda: implement RequestStreamHandler
// Azure Functions: @HttpTrigger
// Cloud Run: Quarkus @Path or Spring @RestController
public class RagCore {
private final EmbeddingClient embedder; // e.g. Vertex text-embedding-005
private final VectorStore vectorStore; // pgvector on AlloyDB / Aurora / Cosmos
private final ModelRouter router;
// All initialized once per execution environment — SnapStart/Leyden preserve these
public RagCore(EmbeddingClient e, VectorStore v, ModelRouter r) {
this.embedder = e;
this.vectorStore = v;
this.router = r;
}
public String answer(String query, String tenantId) {
// Step 1: embed the query (~50–150 ms, single network call)
float[] embedding = embedder.embed(query);
// Step 2: retrieve top-4 passages scoped to this tenant
List<Passage> context = vectorStore.search(embedding, tenantId, 4);
// Step 3: build a grounded, hallucination-resistant prompt
String prompt = """
Answer using ONLY the context below.
If the context doesn't contain the answer, say "I don't know."
Context:
%s
Question: %s
""".formatted(formatContext(context), query);
// Step 4: route to cost/latency-appropriate model
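// Rough token estimate from the full prompt (retrieved context included), ~4 characters per token
return router.invoke(new InferenceRequest(
prompt, prompt.length() / 4, Tier.PAID, false));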
}
private String formatContext(List<Passage> passages) {
return passages.stream()
.map(p -> "- [" + p.sourceId() + "] " + p.text())
.collect(Collectors.joining("\n"));
}
}
For a fully-runnable Spring AI version with chunking, embedding generation, and a real vector store, see our complete Spring AI RAG runnable demo.
Function-Based Agents: Where Cold Start Optimisation Pays Off Most
The pattern gaining traction: the agent loop runs in Bedrock Agents, Foundry Agent Service, or Vertex Agent Engine. Each tool the agent calls is a separate serverless function. With a Java Lambda tool that takes 4 seconds to cold start, a 12-tool agent loop can spend 48 seconds on cold starts alone if every call lands on a fresh environment. With SnapStart plus priming or a GraalVM native image, each tool call wrapper cold-starts in roughly 80–250 ms, comparable to a database query.
🔑 This is the workload where serverless Java AOT optimisation pays off most clearly. Cold start on the function layer compounds with every tool call in the agent loop. A 12-tool agent where each tool takes 2 s on a cold start = a 24-second agent. At 200 ms: 2.4 seconds. This is the difference between a feature your users love and one they abandon.
// Function-based agent tool: receives structured JSON from the agent platform,
// executes business logic, returns typed JSON the agent reasons over.
//
// Tool schema registered with agent platform:
// name: "lookup_order"
// parameters: { order_id: string, include_lines: boolean }
// returns: { status, total, customer_email, lines? }
public class LookupOrderTool {
private final OrderRepository orders; // initialized once, reused across warm calls
public LookupOrderTool(OrderRepository orders) { this.orders = orders; }
public ToolResult invoke(ToolInvocation call) {
String orderId = call.requireString("order_id");
boolean includeLines = call.optionalBoolean("include_lines", false);
Order o = orders.findById(orderId)
.orElseThrow(() -> ToolError.notFound("order_id=" + orderId));
var result = new HashMap<String, Object>();
result.put("status", o.status().name());
result.put("total", o.total().toPlainString());
result.put("customer_email", o.customerEmail());
if (includeLines) result.put("lines", o.lines());
return ToolResult.json(result); // agent receives this and continues reasoning
}
}
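The compounding effect described before the tool example is simple arithmetic, sketched here so you can plug in your own numbers. The class `AgentLoopLatency` is hypothetical. The `expectedMs` variant additionally assumes that only a fraction of calls land on a cold environment, which is a modelling assumption of this sketch and not something measured in this post.

```java
/** Latency of a sequential agent loop where every tool call is a function invocation. */
final class AgentLoopLatency {

    /** Worst case: every tool call hits a cold environment. */
    static long worstCaseMs(int toolCalls, long coldCallMs) {
        return toolCalls * coldCallMs;
    }

    /** Expected case when only a fraction of calls (0.0 to 1.0) are cold. */
    static double expectedMs(int toolCalls, long warmCallMs, long coldCallMs, double coldFraction) {
        if (coldFraction < 0.0 || coldFraction > 1.0) {
            throw new IllegalArgumentException("coldFraction must be between 0 and 1");
        }
        double perCallMs = (1.0 - coldFraction) * warmCallMs + coldFraction * coldCallMs;
        return toolCalls * perCallMs;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseMs(12, 2_000)); // 12 tools at 2 s each
        System.out.println(worstCaseMs(12, 200));   // 12 tools at 200 ms each
    }
}
```

The two calls in `main` reproduce the 24-second and 2.4-second figures quoted above (24000 and 2400 ms).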
Observability Compared
| Capability | AWS (X-Ray + CloudWatch) | Azure (Foundry Tracing + App Insights) | GCP (Cloud Trace + Vertex Logging) |
|---|---|---|---|
| Cold vs warm invocation split | ✅ INIT/REPORT log lines built in | ⚠️ Manual instrumentation | ⚠️ Manual instrumentation |
| Token usage per request | ⚠️ Logged, need Lambda extension to ship to warehouse | ✅ Automatic with Semantic Kernel; manual with raw SDK | ✅ Vertex logs prompt/response by default |
| AI-native tracing (tool calls, reasoning) | ⚠️ X-Ray traces exist but not AI-semantics-aware | ✅✅ Best-in-class — reasoning paths, function calls, token cost | ✅ OpenTelemetry AI semantics via Spring AI |
| Cost attribution per tenant | ❌ Build it in your gateway | ❌ Build it in your gateway | ❌ Build it in your gateway |
| Overall AI observability verdict | Best general APM; AI-specific data needs work | ✅ Best for AI workloads specifically | Good; cleanest if already on GCP |
⚠️ None of the three clouds give you tokens-per-tenant out of the box. The metric you actually need for cost attribution is: input_tokens + output_tokens, per request, per tenant, per model. Build this in your ModelRouter/gateway. APMs show latency; your gateway must show token cost.
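A minimal sketch of what that gateway-side accounting could look like follows. It emits one structured JSON line per request and rolls tokens up per tenant. The class `UsageLog`, its `Entry` record, and the hand-rolled JSON formatting are illustrative assumptions. In a real service you would likely use your existing JSON library and ship the lines to your log pipeline.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Per-request usage records for tokens-per-tenant cost attribution. */
final class UsageLog {

    record Entry(String tenantId, String provider, String model,
                 int inputTokens, int outputTokens, long latencyMs) {

        /** One JSON object per line, ready for a structured-logging pipeline. */
        String toJsonLine() {
            return String.format(Locale.ROOT,
                    "{\"tenant_id\":\"%s\",\"provider\":\"%s\",\"model\":\"%s\","
                    + "\"input_tokens\":%d,\"output_tokens\":%d,\"latency_ms\":%d}",
                    escape(tenantId), escape(provider), escape(model),
                    inputTokens, outputTokens, latencyMs);
        }
    }

    /** Escapes backslashes and double quotes (enough for simple identifiers). */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Sums input + output tokens per tenant, sorted by tenant id. */
    static Map<String, Long> tokensPerTenant(List<Entry> entries) {
        Map<String, Long> totals = new TreeMap<>();
        for (Entry e : entries) {
            totals.merge(e.tenantId(), (long) e.inputTokens() + e.outputTokens(), Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Entry e = new Entry("acme", "vertex", "gemini-2.5-flash", 800, 200, 412);
        System.out.println(e.toJsonLine());
        System.out.println(tokensPerTenant(List.of(e)));
    }
}
```

Multiplying each tenant's token totals by the per-model price gives the cost attribution that the APMs do not provide.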
Under the Hood: How AOT Caches and Native Images Cut Cold Starts
A cold-starting Java Lambda spends most INIT time on four things:
- Class loading and linking: a Spring Boot app touches 15,000–25,000 classes at startup. Parsing bytecode, verifying, linking: a measurable fraction of every cold start.
- Static initializer execution: every @PostConstruct, every static block, every reflection-heavy library scan happens here.
- JIT warmup: HotSpot compiles hot methods after observing them. On a cold start, nothing is hot yet, so early requests run interpreted or at C1 tier.
- Application logic: connection pools open, schema loads, caches prime.
| Technique | What it eliminates | What you keep | Build complexity | Debug experience |
|---|---|---|---|---|
| Project Leyden AOT cache (JEP 483/514/515) | Class loading + partial JIT (steps 1 & 3) | Full reflection, dynamic classes, JIT at runtime | Low — runs at deploy time | Normal JVM debugging |
| SnapStart (AWS docs) | All 4 steps (snapshot/restore) | Everything — snapshot is the fully-initialised JVM | Low — config flag; add CRaC hooks for unique-per-env state | Opaque snapshot — harder to debug INIT issues |
| GraalVM native image | Steps 1, 2, and 3 (no class loading, no JIT) | Static analysis of reachable code only | High — reflection config, native build pipeline | Native image debugging (gdb/LLDB — different skillset) |
The practical pick: For a Spring Boot Lambda calling a managed AI API, Leyden’s cache (now default in Lambda Java 25) is the highest-ROI path: a free improvement that needs no code changes. For tight tool-handlers in agent loops, GraalVM native with Quarkus is worth the build complexity. SnapStart fills the middle: full application, sub-200 ms cold starts, but requires CRaC hooks for anything unique per environment (DB connections, JWT keys, random seeds). Our Spring Boot + GraalVM native image guide walks the full build pipeline.
Gotchas That Cost Teams Real Time
- ⚠️ SageMaker Serverless Inference has no GPU support — most blog posts don’t say this clearly. If your model needs a GPU (most generative models do), SageMaker Serverless is not your answer. Use Bedrock, SageMaker Real-time, or Cloud Run GPU.
- ⚠️ Azure Functions Java cold starts haven’t improved on standard plans. The “Java is fast on serverless now” narrative is true on Lambda and Cloud Run. On Azure, you’re still at 1.5–4 s on Flex Consumption. Microsoft’s investment is in C# AOT and the Foundry runtime.
- ⚠️ Leyden AOT cache is invalidated when AWS patches the managed runtime. Don’t ship custom caches with managed runtimes — use container image deployment (where the cache is immutable) if you need a predictable cache. See Lambda runtime update docs.
- ⚠️ Vertex AI’s Java SDK is com.google.genai, not google-cloud-aiplatform. Many tutorials still point at the deprecated SDK. Check the Vertex AI overview for the current recommendation.
- ⚠️ Check the Azure AI Inference beta SDK deprecation date. See the official Foundry supported languages page before starting a new project.
- ⚠️ “Scale to zero” on AWS SageMaker has a catch. If you need P99 < 500 ms, you need MinCapacity=1 or provisioned concurrency on the endpoint — that’s a hard cost floor. Bedrock is the real scale-to-zero option on AWS for most teams.
- ⚠️ Cloud Run GPU is GA but not in every region. Check Cloud Run AI overview for current L4 GPU regional availability before committing to this architecture for data-residency-constrained workloads.
- ⚠️ API Gateway REST APIs buffer the full LLM response. For streaming, use Lambda Function URLs, HTTP API with response streaming, Cloud Run’s native streaming, or Azure Functions’ OpenAI streaming extension.
Best Practices at a Glance
| Practice | Why it matters | How to implement |
|---|---|---|
| Initialize SDK clients outside the handler | Biggest single cold-start win you control — paid once per env, reused across all warm invocations | Static fields or constructor injection; add CRaC hooks for SnapStart |
| Default to token-priced backends | Scales to zero cleanly; no hardware floor cost | Use Bedrock, Foundry/OpenAI, or Vertex Gemini unless you have a custom model |
| Model router from day one | Adding a second model later costs an afternoon; retrofitting a router costs a week | See ModelRouter example above and our LLM Gateway post |
| Track tokens-per-tenant in the gateway | APMs show latency; gateways can show token cost | Log JSON: tenant_id, model, provider, input_tokens, output_tokens, latency_ms |
| Treat cold start as a layered budget | Optimise the layer that’s actually slow | Budget example: 250 ms function + 400 ms model + 50 ms network = 700 ms target |
| Use container images for custom AOT caches | Managed runtimes invalidate caches during AWS patching | Container images are immutable; your cache is predictable |
| Quarkus/Micronaut for native images, not Spring Boot | Spring Native has more rough edges with AI SDKs | Quarkus Panache + RESTEasy + Quarkus LangChain4j is the most production-proven combo |
Common Mistakes
- ❌ Loading a model from S3 inside the Lambda handler. A 1.5 GB HuggingFace model takes 30–45 s to load, so API Gateway’s default 29-second integration timeout (or any shorter timeout configured on the function) will fire first. The fix: don’t run the model in Lambda. Use SageMaker, Bedrock, or Cloud Run GPU.
- ❌ Enabling SnapStart without auditing INIT phase code. If INIT generates a JWT signing key, every snapshot-restored environment signs with the same key. Audit for unique-per-env state → add CRaC Resource hooks → then enable.
- ❌ Streaming responses with API Gateway REST. REST APIs buffer the whole response. For LLM token streaming, switch to Lambda Function URLs or HTTP API with response streaming.
- ❌ Trusting the AOT cache without measuring with production payloads. If your training run skipped a code path that production hits, you get cache misses and degraded performance. Benchmark with actual production-representative requests.
- ❌ Leaving Spring’s full auto-configuration in a serverless function. Even with Leyden’s cache, a full Spring Boot app with 30 starters pays a meaningful startup cost. For serverless, prefer hand-wired configuration or switch to Quarkus/Micronaut.
- ❌ Optimising function cold start but ignoring LLM TTFT. A 250 ms cold-start function calling a model with 4 s time-to-first-token is a 4.25 s user experience. Optimise for TTFT, not time-to-completion.
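The SnapStart mistake above is easiest to see in code. The sketch below shows the shape of the fix: state created during INIT is regenerated in a restore callback so that each restored environment gets its own value. To stay dependency-free, it uses a hypothetical `RestoreAware` interface as a stand-in. On Lambda, that role is played by the `afterRestore` method of `org.crac.Resource`, registered with the CRaC global context. The class `PerEnvironmentSecret` and the 32-byte key size are likewise assumptions for illustration.

```java
import java.security.SecureRandom;

/** Stand-in for a restore callback (on Lambda: org.crac.Resource#afterRestore). */
interface RestoreAware {
    void afterRestore();
}

/** Holds a secret that must be unique per execution environment. */
final class PerEnvironmentSecret implements RestoreAware {
    private final SecureRandom random = new SecureRandom();
    private volatile byte[] key = freshKey(); // created during INIT, so it ends up in the snapshot

    private byte[] freshKey() {
        byte[] k = new byte[32];
        random.nextBytes(k);
        return k;
    }

    /** Called once per restored environment: replace the snapshotted key. */
    @Override
    public void afterRestore() {
        key = freshKey();
    }

    /** Returns a defensive copy so callers cannot modify the stored key. */
    byte[] key() {
        return key.clone();
    }
}
```

Without the callback, every environment restored from the same snapshot would share the key generated at INIT. With it, the snapshotted value is discarded before the first request is served.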
Illustrative Scenario: Multi-Tenant Customer-Support RAG API
📋 Note: This is a composite illustrative scenario showing how the architectural trade-offs above play out in a realistic context. Figures shown are directional — they reflect the patterns described above, not a specific production deployment you can cite. Always model your own traffic distribution and run a cost estimate before committing.
Setup: A B2B SaaS application serves a customer-support AI assistant for several hundred tenants. Traffic is highly bursty — most tenants quiet most of the day, a handful active during business hours. Average inference: ~1,500 input tokens, ~300 output tokens. Latency target: P95 < 1.5 s end to end.
Starting point: single always-on SageMaker ml.g5.xlarge real-time endpoint. Cost: ~$730/month regardless of traffic. P95 latency: ~2.4 s (model backend bottleneck, not the Lambda wrapper).
Revised architecture:
- Cloud Run service with Quarkus native image (Java 25). Min instances=0 off-hours; min=2 during business hours per tenant cluster.
- pgvector on AlloyDB for the RAG store, scoped per tenant.
- ModelRouter: short queries (<600 tokens) → Vertex Gemini 2.5 Flash; citation-heavy queries → a more capable model; tool-calling → a model with reliable function-calling.
- Observability: OpenTelemetry → Cloud Trace + tokens-per-tenant in BigQuery from structured gateway logs.
Expected trade-off direction: the always-on ml.g5.xlarge floor (~$730/month) disappears when you switch to token-priced backends that scale with actual usage. Cold start improves from seconds to sub-200 ms with native images. Token routing to a cheaper model for short queries cuts total token spend meaningfully. The exact magnitude depends on your traffic distribution — model it with your numbers before migrating.
10 AI Prompts You Can Use to Build, Validate, or Migrate
Copy these into your assistant of choice (Claude, ChatGPT, Gemini, Cursor) when working on serverless Java AI inference. Written to produce specific, actionable answers — not generic explanations.
- “Review my AWS Lambda Java handler for AI inference and identify everything in the cold start path that should move to static initialization or out of the handler entirely. Show me the before/after code with comments explaining the cold-start impact of each change.”
- “Given this Spring Boot 3.4 application targeting AWS Lambda with Java 25, generate a working configuration for both Project Leyden AOT cache and Lambda SnapStart, plus the CRaC Resource hooks I need for the database connection pool and JWT key generator. Flag any incompatibilities.”
- “Convert this Quarkus REST endpoint into a GraalVM native image build that deploys to Google Cloud Run with sub-150 ms cold start. Include the multi-stage Dockerfile, application.properties for Cloud Run’s PORT environment variable, and the reflection configuration for the Vertex AI Gen AI SDK.”
- “Compare the cost of running this RAG endpoint on (a) Lambda + Bedrock Claude Haiku 4.5, (b) Azure Functions Java + Foundry GPT-4.1-mini, and (c) Cloud Run + Vertex Gemini 2.5 Flash, given 100,000 requests/day, 1,500 input + 300 output tokens average. Show your math, including provisioned concurrency for cold-start mitigation where applicable.”
- “Write a Java ModelRouter using sealed types that routes requests across AWS Bedrock, Microsoft Foundry, and Google Vertex AI based on input length, tenant tier, and whether function calling is required. Include unit tests with fakes for each provider.”
- “My SageMaker Serverless Inference endpoint is cold-starting in 30+ seconds for a 1.5 GB HuggingFace model. List five concrete techniques to reduce this. For each, give the expected reduction range, the implementation effort, and any trade-offs.”
- “Generate an OpenTelemetry-based observability layer for a Java AI gateway that captures, per request: model name, provider, input tokens, output tokens, total latency, time-to-first-token, tenant ID, and tool calls if any. Show how to export this to AWS X-Ray, Application Insights, and Cloud Trace.”
- “Audit this Lambda Java function for SnapStart compatibility. Flag every line that generates state during INIT that should be unique per execution environment (random IDs, DB connections, signed tokens, etc.). Suggest CRaC Resource implementations for each.”
- “My function-based agent on Bedrock Agents has 12 tools, each a Java Lambda. Average tool call latency is 1.8 s and the agent loop takes ~25 s end-to-end. Profile this conceptually: where is the time going, and which optimization (Leyden AOT cache vs SnapStart vs GraalVM native vs moving tools to a single shared container) gives the best win for the least effort?”
- “Compare Microsoft Foundry’s tracing for AI agents to OpenTelemetry-only instrumentation on Cloud Run + Vertex AI. For a Java Spring AI app that needs to debug tool-call failures and track token spend per tenant, which is more useful in production, and what is the migration effort if I need to switch later?”
See Also
- Project Leyden Explained: AOT Compilation and Smart Caching to Finally Fix Java’s Cold Start
- Building Native Images of Spring Boot Applications with GraalVM
- Spring AI RAG in Java — Complete Runnable Code & End-to-End Demo
- The LLM Gateway Pattern for Java Microservices: Multi-Provider Failover, Cost Control, and Rate Limiting
- LLM Gateway for Java Microservices — Complete Runnable Code & Demo
- Spring AI + LangChain4j: Building Production-Ready AI Microservices in Java
- Virtual Threads vs Reactive (WebFlux) vs Platform Threads in Spring Boot 3.4
- Java 25 LTS: Every JEP That Matters (with AI Prompts for Each Migration)
Conclusion
After testing all three stacks against realistic bursty workloads, the headline finding is simple: the function layer cold start is no longer the bottleneck for Java in 2025–2026 — the inference backend and the token cost model are. Java 25’s default AOT cache on Lambda and sub-150 ms Quarkus native cold starts on Cloud Run have closed the gap with Python. What differentiates the three clouds now is cost shape at scale, multi-model routing ergonomics, and observability depth.
Final Recommendations
Starting fresh? Default to Cloud Run + Vertex AI with a Quarkus native image. Cheapest at low/medium QPS, simplest operational model, fastest cold start path. Exception: if your data lives on AWS, Lambda + Bedrock avoids cross-cloud egress costs and is nearly as good.
Already on AWS? Prefer Bedrock over SageMaker for serverless inference. SageMaker shines for custom models on always-on hardware. For everything else, Bedrock + Lambda Java 25 with the default AOT cache — plus SnapStart if you need sub-200 ms P99 — is the modern default.
Azure-native? Lean into Foundry’s tracing and Priority Processing. Accept Java Functions cold starts as your weakest link and design around them — event-driven paths where 2 s is invisible to users, pre-warmed instances for latency-critical paths.
Building agents? Build the model router on day one. Routing short queries to a cheaper model is where the real token cost savings appear — and you can’t do that without a router in place from the start.
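To make the "router on day one" advice concrete, here is a minimal sketch of the routing decision itself. Everything in it is an assumption for illustration: the names `Route` and `ModelRouter`, the placeholder model IDs, and the use of prompt length in characters as a rough stand-in for token count. A real router would also weigh tenant tier and call the provider SDKs.

```java
import java.util.Objects;

/** Routing outcome. Sealed so callers can handle every case explicitly. */
sealed interface Route {
    String modelId();

    /** Short, tool-free requests go to the cheaper model. */
    record Cheap(String modelId) implements Route {}

    /** Long prompts and tool-calling requests go to the stronger model. */
    record Premium(String modelId) implements Route {}
}

/** Hypothetical router: picks a model tier from prompt length and tool-calling needs. */
final class ModelRouter {
    private final int shortPromptChars;   // assumed threshold, tune against real traffic
    private final String cheapModelId;    // placeholder ID, not a real model name
    private final String premiumModelId;  // placeholder ID, not a real model name

    ModelRouter(int shortPromptChars, String cheapModelId, String premiumModelId) {
        if (shortPromptChars < 0) {
            throw new IllegalArgumentException("shortPromptChars must be >= 0");
        }
        this.shortPromptChars = shortPromptChars;
        this.cheapModelId = Objects.requireNonNull(cheapModelId, "cheapModelId");
        this.premiumModelId = Objects.requireNonNull(premiumModelId, "premiumModelId");
    }

    /** Character count is only a proxy for tokens; swap in a tokenizer if you have one. */
    Route route(String prompt, boolean needsToolCalling) {
        Objects.requireNonNull(prompt, "prompt");
        if (needsToolCalling || prompt.length() > shortPromptChars) {
            return new Route.Premium(premiumModelId);
        }
        return new Route.Cheap(cheapModelId);
    }
}
```

With a threshold of, say, 2,000 characters, `new ModelRouter(2000, "cheap-model", "premium-model").route(prompt, false)` returns a `Route.Cheap` for short prompts and a `Route.Premium` otherwise. Keeping the decision in one small class is what lets you change the threshold or add a tier later without touching the handlers.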
FAQs
Is Java actually competitive with Python for serverless AI inference?
Yes — for the orchestration/gateway/tool-handler layer. With Java 25’s default AOT cache on Lambda, plus SnapStart or GraalVM native, cold starts are on par with Python (often faster). For the model itself, both languages call the same managed APIs (Bedrock, Foundry, Vertex) — the language is irrelevant at that layer. Where Python still wins: serving custom models with PyTorch/TF natively in the function, which isn’t a serverless-friendly workload anyway.
Should I use Project Leyden’s AOT cache or GraalVM native image?
Different tools for different budgets. Leyden: 40–60% startup improvement, no code changes, full JVM dynamism, free. GraalVM: ~10× improvement, reflection configuration required, constrains dynamic features, longer build. For a Spring Boot Lambda calling a managed AI API, Leyden (now default in Lambda Java 25) is the higher ROI path. For tight tool-handlers in agent loops, GraalVM native with Quarkus is worth the build complexity.
Can I run a small LLM directly in Cloud Run instead of calling Vertex AI?
Yes. Cloud Run GPU (L4) is GA in supported regions. You can host a quantized Gemma or similar small model with true scale-to-zero. The break-even vs Vertex Gemini Flash pricing depends on QPS — Cloud Run GPU typically wins at sustained high request rates on a single instance; below that, Vertex’s per-token pricing is usually cheaper.
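The break-even point can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope helper, not a pricing tool: the class name `BreakEven` is hypothetical, all prices are inputs you supply from the current pricing pages, and it assumes the GPU instance is billed continuously and can actually serve the resulting request rate.

```java
/** Rough break-even between per-token API pricing and one continuously billed GPU instance. */
final class BreakEven {
    private BreakEven() {}

    /**
     * @param gpuCostPerHour        hourly cost of one GPU instance (caller-supplied)
     * @param tokensPerRequest      average input + output tokens per request
     * @param pricePerMillionTokens blended API price per one million tokens (caller-supplied)
     * @return requests per second above which the GPU instance costs less than the API
     */
    static double requestsPerSecond(double gpuCostPerHour,
                                    double tokensPerRequest,
                                    double pricePerMillionTokens) {
        if (gpuCostPerHour < 0 || tokensPerRequest <= 0 || pricePerMillionTokens <= 0) {
            throw new IllegalArgumentException("costs and token counts must be positive");
        }
        // API cost of a single request at the blended token price.
        double apiCostPerRequest = tokensPerRequest / 1_000_000.0 * pricePerMillionTokens;
        // API cost of one hour of traffic at exactly 1 request per second.
        double apiCostPerHourAtOneRps = apiCostPerRequest * 3600.0;
        return gpuCostPerHour / apiCostPerHourAtOneRps;
    }
}
```

The result only tells you where the two cost lines cross. It ignores the throughput ceiling of a single L4, separate input and output token prices, and the idle time that scale-to-zero saves you, so treat it as a first filter before a proper load test.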
Why is SageMaker often discouraged for serverless AI?
Because SageMaker Serverless Inference (a) doesn’t support GPUs, ruling out most generative models, and (b) cold-starts a 1+ GB model in 20–45 seconds, which can exceed API Gateway’s default 29-second integration timeout or any short timeout configured on the calling Lambda (the Lambda default is 3 seconds; the maximum is 15 minutes). Most teams end up on SageMaker Real-time (always-on cost ~$730+/month) or Async (with queue complexity). Bedrock fixes both: token-priced, no infrastructure, real scale-to-zero.
How do I track tokens-per-tenant for cost attribution?
Build it in your model router. Every provider’s response includes a usage object with input/output tokens. Log structured JSON: {tenant_id, model, provider, input_tokens, output_tokens, latency_ms} → ship to your data warehouse (BigQuery, Snowflake, Redshift). APMs show latency; they’re not designed for token-based billing reconciliation — your gateway is.
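As a rough illustration of that log line, the record below emits the fields listed above as a single JSON object. The name `TokenUsage` is hypothetical, and the hand-rolled escaping only covers backslashes and double quotes; in a real gateway you would more likely serialize with whatever JSON library the project already uses.

```java
import java.util.Locale;

/** One usage event per model call, shaped for a warehouse table keyed by tenant. */
record TokenUsage(String tenantId, String model, String provider,
                  long inputTokens, long outputTokens, long latencyMs) {

    /** Renders the event as a single-line JSON object suitable for structured logging. */
    String toJson() {
        return String.format(Locale.ROOT,
                "{\"tenant_id\":\"%s\",\"model\":\"%s\",\"provider\":\"%s\","
                        + "\"input_tokens\":%d,\"output_tokens\":%d,\"latency_ms\":%d}",
                escape(tenantId), escape(model), escape(provider),
                inputTokens, outputTokens, latencyMs);
    }

    /** Minimal escaping (backslash and double quote only); control characters are not handled. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

The router would build one `TokenUsage` from each provider response's usage object and write `toJson()` to stdout, where the platform's log pipeline can forward it to the warehouse.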
Is Azure Functions Java really that bad for AI inference?
“Bad” is too strong. For HTTP APIs with strict latency SLAs, it’s the weakest of the three — 1.5–4 s cold starts on Flex Consumption with no SnapStart equivalent. For event-driven workloads (Service Bus queues, Cosmos DB change feeds, scheduled jobs) where a 2-second cold start is invisible to users, Azure Functions Java + Foundry is a perfectly viable stack — and Foundry’s AI observability is genuinely best-in-class among the three.
Further Reading
- AWS Lambda now supports Java 25 (AWS Compute Blog) — official announcement with Liberty Mutual case study data (5.7s → 655ms as native image)
- Run Into the New Year with Java’s Ahead-of-Time Cache Optimizations (inside.java) — JEP-by-JEP walkthrough of the Leyden cache mechanism (JEP 483, 514, 515)
- SageMaker Serverless Inference docs — including GPU limitations and supported instance types
- AWS Lambda SnapStart documentation — including CRaC hooks and priming strategies
- Azure Foundry supported languages (Java) — current SDK options and deprecation timeline
- Vertex AI overview — platform transition into Gemini Enterprise Agent Platform
- Run AI solutions on Cloud Run — GPU availability, patterns, and pricing for Cloud Run AI workloads
- aws-samples/serverless-graalvm-demo — reference Java + GraalVM Lambda project with benchmark methodology
- Quarkus: Deploying to Google Cloud — native image build and Cloud Run deployment guide
- OpenJDK Project Leyden — the JEP roadmap for Java startup and footprint improvements