Spring AI + LangChain4j: Building Production-Ready AI Microservices in Java

The first time I tried to ship an AI feature in a Spring Boot service, I reached for LangChain4j because I knew Python’s LangChain and wanted the Java equivalent. It worked — but the setup was verbose. Six months later I tried the same with Spring AI and had streaming chat running in forty minutes. The problem was that Spring AI’s agent and tool-calling support was too limited for what I needed. The answer, annoyingly, was to use both. This guide is the result of running both frameworks in a production order-management assistant — Spring AI 1.0.0-M6 with LangChain4j 0.33.0 on Spring Boot 3.4.1, Java 21.0.3, backed by PostgreSQL 16 with PGVector 0.7.0. Spring AI handles the ChatClient, RAG pipeline, and Micrometer observability. LangChain4j handles the agent loop and tool registry where its control over iteration and retries is genuinely superior. The division is deliberate, not accidental.

What Broke When I First Used Both Frameworks Together

Before the code: three problems I hit that are not in any documentation.

Bean conflict at startup. Spring AI’s auto-configuration registers an OpenAiChatModel bean. LangChain4j’s Spring Boot starter also registers one. The result is a NoUniqueBeanDefinitionException on startup — two beans of the same type, neither qualified. Fix: exclude LangChain4j’s auto-configuration for the chat model and wire it manually in a @Configuration class, which is what AgentConfig.java in this post does.

The agent loop ran indefinitely on tool errors. When a @Tool method threw an unchecked exception, LangChain4j’s default agent resubmitted the same tool call on the next iteration rather than propagating the error. The agent burned through its maxSteps limit before failing. Fix: catch exceptions inside @Tool methods and return a structured error string — "ERROR: Order not found: ORD-999" — so the LLM can reason about the failure and decide what to do next.
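The fix can be sketched in plain Java, leaving out the LangChain4j annotations so the snippet runs on its own. `ToolErrors.guard` is a hypothetical helper name of mine, not a LangChain4j API; in a real tool class you would call it from inside each @Tool method.

```java
import java.util.function.Supplier;

// Hypothetical helper: converts an unchecked exception thrown by tool logic
// into a structured "ERROR: ..." string that the LLM can read and reason about.
final class ToolErrors {
    private ToolErrors() {}

    static String guard(Supplier<String> toolBody) {
        try {
            return toolBody.get();
        } catch (RuntimeException e) {
            // Keep the message short and factual; never pass stack traces to the model.
            String detail = e.getMessage() != null ? e.getMessage() : e.getClass().getSimpleName();
            return "ERROR: " + detail;
        }
    }
}
```

A tool body that throws `new IllegalStateException("Order not found: ORD-999")` then comes back to the model as the string "ERROR: Order not found: ORD-999" rather than as an exception inside the agent loop.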

Semantic cache threshold needed tuning. I initially set CACHE_HIT_THRESHOLD = 0.85. That was too low — semantically similar but subtly different questions (“What is the status of order 123?” vs “Has order 123 shipped yet?”) were hitting the cache and returning the wrong cached answer. I raised it to 0.92, which reduced false hits to near zero at the cost of a lower cache-hit rate. For your use case, run 200 representative queries and tune this threshold empirically before going to production.
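To make the effect of the threshold concrete, here is a toy sketch. The two-dimensional vectors are made up for illustration and are not real embeddings of the questions above; the class and method names are also mine.

```java
// Toy illustration of how the cache-hit threshold changes the outcome.
final class CacheThresholdDemo {
    private CacheThresholdDemo() {}

    // Cosine similarity of two equal-length vectors.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    static boolean isCacheHit(double[] query, double[] cached, double threshold) {
        return cosine(query, cached) >= threshold;
    }

    public static void main(String[] args) {
        double[] cachedQuestion = {1.0, 0.0}; // made-up 2-d "embedding"
        double[] newQuestion = {2.0, 1.0};    // similarity = 2 / sqrt(5), about 0.894
        System.out.println(isCacheHit(newQuestion, cachedQuestion, 0.85)); // true
        System.out.println(isCacheHit(newQuestion, cachedQuestion, 0.92)); // false
    }
}
```

The same pair of questions is a hit at 0.85 and a miss at 0.92, which is exactly the band where my false hits lived.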

Why Spring AI and LangChain4j Together?

Spring AI (from the Spring team) provides opinionated, auto-configured abstractions over popular LLM providers (OpenAI, Azure OpenAI, Ollama, Mistral). LangChain4j is a lower-level, Java-native framework inspired by Python’s LangChain, giving you fine-grained control over chains, agents, memory, and tool use. Used together, you get Spring Boot’s dependency injection and production ecosystem plus LangChain4j’s composable AI primitives.

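| Feature               | Spring AI                       | LangChain4j                                   |
|-----------------------|---------------------------------|-----------------------------------------------|
| Auto-configuration    | Yes (Spring Boot starters)      | Starter available; wired manually in this post |
| Streaming support     | Yes (Flux/SSE)                  | Yes (custom listener)                         |
| RAG pipeline          | VectorStore + EmbeddingModel    | EmbeddingModel + Retriever                    |
| Tool/Function calling | @Tool annotation (limited)      | Full agent + tool registry                    |
| Observability         | Micrometer integration          | Custom listener chain                         |
| Best for              | Spring-native apps, quick start | Complex agents, fine control                  |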

Project Setup

Add both dependencies in your pom.xml:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework.ai</groupId>
            <artifactId>spring-ai-bom</artifactId>
            <version>1.0.0-M6</version>
            <type>pom</type>
            <scope>import</scope>
        </dependency>
    </dependencies>
</dependencyManagement>

<dependencies>
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-openai-spring-boot-starter</artifactId>
    </dependency>

    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
    </dependency>

    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j</artifactId>
        <version>0.33.0</version>
    </dependency>

    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-open-ai</artifactId>
        <version>0.33.0</version>
    </dependency>

    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-spring-boot-starter</artifactId>
        <version>0.33.0</version>
    </dependency>
</dependencies>

application.yml baseline configuration:

# application.yml
spring:
  ai:
    openai:
      api-key: ${OPENAI_API_KEY}
      chat:
        options:
          model: gpt-4o-mini
          temperature: 0.2
          max-tokens: 1024
      embedding:
        options:
          model: text-embedding-3-small
  datasource:
    url: jdbc:postgresql://localhost:5432/aidb
    username: ${DB_USERNAME}
    password: ${DB_PASSWORD}
    driver-class-name: org.postgresql.Driver

Start with the section that matches your immediate need. If you are adding a chat endpoint to an existing Spring Boot service, go to the ChatClient section. If you are building a document Q&A feature, go straight to the RAG pipeline. If you need a fully autonomous agent that calls your existing REST APIs, skip to the LangChain4j agent section. The code in each section is self-contained.

Part 1: Basic Chat Client with Spring AI

// ChatService.java - Spring AI ChatClient wrapper
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.stereotype.Service;
import reactor.core.publisher.Flux;
import java.util.List;

@Service
public class ChatService {
    private final ChatClient chatClient;

    // Constructor injection with builder pattern
    public ChatService(ChatClient.Builder builder) {
        this.chatClient = builder
            .defaultSystem("You are a senior Java architect. Be precise and concise.")
            .build();
    }

    // Synchronous call (for simple use cases)
    public String ask(String userMessage) {
        return chatClient.prompt()
            .user(userMessage)
            .call()
            .content();
    }

    // Streaming response (for real-time UIs)
    public Flux<String> stream(String userMessage) {
        return chatClient.prompt()
            .user(userMessage)
            .stream()
            .content();
    }

    // Structured output - map response to a Java record
    public ArchitectureReview reviewCode(String codeSnippet) {
        return chatClient.prompt()
            .user(u -> u.text("Review this Java code for production readiness: {code}")
                        .param("code", codeSnippet))
            .call()
            .entity(ArchitectureReview.class); // Spring AI converts JSON -> POJO
    }
}

// ArchitectureReview.java - structured output record
// Spring AI's entity() conversion relies on Jackson, which the Spring Boot starters already provide
import java.util.List;

public record ArchitectureReview(
    String verdict,           // e.g., "APPROVED", "NEEDS_CHANGES"
    List<String> issues,      // List of specific problems found
    List<String> suggestions, // Actionable fixes
    int score                 // 1-10 production readiness score
) {}

How the Code Works

  1. ChatClient.Builder is auto-configured by Spring AI’s starter — you inject the builder, not the client directly, allowing per-instance configuration.
  2. defaultSystem() sets a persistent system prompt for all calls through this client instance.
  3. stream().content() returns a Flux<String> (Project Reactor) that emits tokens as they arrive — wire this to SSE for a streaming chat UI.
  4. .entity(ArchitectureReview.class) uses Spring AI’s built-in JSON-to-POJO conversion — the model is prompted to return JSON matching your record schema.

Part 2: RAG Pipeline — Retrieval Augmented Generation

RAG is the dominant pattern for grounding LLMs in your own data without fine-tuning. The pipeline has three stages: ingest, embed, retrieve.

// DocumentIngestionService.java
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentReader;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.Map;

@Slf4j
@Service
public class DocumentIngestionService {
    private final VectorStore vectorStore;          // Spring AI PGVector store
    private final TokenTextSplitter textSplitter;   // Chunk documents by token count

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
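        // Target 512-token chunks. The remaining arguments are: minimum chunk size in
        // characters (50), minimum chunk length to embed (5), maximum number of chunks
        // (10000), and whether to keep separators. TokenTextSplitter has no overlap setting.
        this.textSplitter = new TokenTextSplitter(512, 50, 5, 10000, true);
    }

    // Ingest a PDF document into the vector store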
    public void ingestDocument(Resource resource, Map<String, Object> metadata) {
        // 1. Load the document
        DocumentReader reader = new PagePdfDocumentReader(resource,
            PdfDocumentReaderConfig.builder()
                .withPageTopMargin(0)
                .build());

        List<Document> pages = reader.get();

        // 2. Split into chunks
        List<Document> chunks = textSplitter.apply(pages);

        // 3. Attach metadata (source, date, category)
        chunks.forEach(doc -> doc.getMetadata().putAll(metadata));

        // 4. Embed and store (Spring AI calls embedding API automatically)
        vectorStore.add(chunks);

        log.info("Ingested {} chunks from {}", chunks.size(), resource.getFilename());
    }
}

// RAGService.java - retrieval + generation
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

import java.util.List;
import java.util.stream.Collectors;

@Service
public class RAGService {
    private final ChatClient chatClient;
    private final VectorStore vectorStore;

    public RAGService(ChatClient.Builder builder, VectorStore vectorStore) {
        this.chatClient = builder.build();
        this.vectorStore = vectorStore;
    }

    // Retrieve top-K relevant chunks and inject into prompt
    public String queryWithContext(String question, int topK) {
        // 1. Semantic search against vector store
        List<Document> relevantDocs = vectorStore.similaritySearch(
            SearchRequest.query(question)
                .withTopK(topK)
                .withSimilarityThreshold(0.75)  // discard low-relevance results
        );

        // 2. Build context string from retrieved chunks
        String context = relevantDocs.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n---\n\n")); // clear separator between chunks

        // 3. Prompt with injected context (RAG pattern)
        return chatClient.prompt()
            .system(s -> s.text("""
                You are a knowledgeable assistant.
                Answer questions using ONLY the provided context.
                If the context does not contain enough information, say so clearly.
                Do NOT invent or assume facts not present in the context.

                Context:
                {context}
                """)
                .param("context", context))
            .user(question)
            .call()
            .content();
    }
}

How the Code Works

  1. TokenTextSplitter(512, 50, …): targets 512 tokens per chunk. The second argument is the minimum chunk size in characters, not an overlap. This splitter does not overlap chunks, so add your own overlap step if continuity across chunk boundaries matters for your documents.
  2. vectorStore.add(chunks): Spring AI’s PGVector store calls the embedding API (text-embedding-3-small) for the chunks and stores the vectors in PostgreSQL.
  3. withSimilarityThreshold(0.75): cosine similarity threshold; chunks below 0.75 similarity are discarded, which keeps noise out of the prompt.
  4. Collectors.joining("\n\n---\n\n"): a clear separator between chunks helps the LLM distinguish individual source passages.
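The retrieve step can be sketched without a vector store. This is an in-memory stand-in under my own assumptions: `ScoredChunk` and `ContextAssembler` are hypothetical names, and the similarity scores are supplied by the caller instead of coming from PGVector.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// In-memory stand-in for the retrieve step: filter by threshold, keep top-K, join.
final class ContextAssembler {
    private ContextAssembler() {}

    // Hypothetical value type: a chunk of text plus its similarity to the query.
    record ScoredChunk(String text, double similarity) {}

    static String buildContext(List<ScoredChunk> candidates, int topK, double threshold) {
        return candidates.stream()
            .filter(c -> c.similarity() >= threshold)      // drop low-relevance chunks
            .sorted(Comparator.comparingDouble(ScoredChunk::similarity).reversed())
            .limit(topK)                                   // keep the K best matches
            .map(ScoredChunk::text)
            .collect(Collectors.joining("\n\n---\n\n"));   // visible chunk boundary
    }
}
```

If nothing clears the threshold, the context is an empty string, which is the case the system prompt's "say so clearly" instruction is there to handle.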

Part 3: LangChain4j AI Agent with Tool Calling

For autonomous workflows — where the AI must call your services — LangChain4j’s agent architecture is more expressive than Spring AI’s current tool support.

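// OrderManagementTools.java - tools exposed to the AI agent
import dev.langchain4j.agent.tool.P;
import dev.langchain4j.agent.tool.Tool;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;

@Component // registered as a Spring bean so AgentConfig can inject it
@RequiredArgsConstructor
public class OrderManagementTools {
    private final OrderRepository orderRepo;
    private final InventoryService inventoryService;

    // Each @Tool method becomes a function the LLM can call.
    // Failures come back as "ERROR: ..." strings so the agent can reason about them
    // instead of resubmitting the same call.
    @Tool("Get current order status by order ID")
    public String getOrderStatus(@P("The unique identifier of the order") String orderId) {
        try {
            return orderRepo.findById(orderId)
                .map(o -> "Order %s: Status=%s, ETA=%s".formatted(
                    orderId, o.getStatus(), o.getEstimatedDelivery()))
                .orElse("ERROR: Order not found: " + orderId);
        } catch (RuntimeException e) {
            return "ERROR: Could not look up order " + orderId + ": " + e.getMessage();
        }
    }

    @Tool("Check inventory availability for a product SKU")
    public String checkInventory(@P("The product SKU to check") String sku) {
        try {
            return "SKU %s: %d units available".formatted(
                sku, inventoryService.getAvailableStock(sku));
        } catch (RuntimeException e) {
            return "ERROR: Could not check inventory for " + sku + ": " + e.getMessage();
        }
    }

    @Tool("Cancel an order. Only call after explicit user confirmation")
    public String cancelOrder(
            @P("The order ID to cancel") String orderId,
            @P("The reason for cancellation provided by the user") String reason) {
        try {
            orderRepo.updateStatus(orderId, OrderStatus.CANCELLED, reason);
            return "Order " + orderId + " cancelled. Reason: " + reason;
        } catch (RuntimeException e) {
            return "ERROR: Could not cancel order " + orderId + ": " + e.getMessage();
        }
    }
}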

// AgentConfig.java - wiring the agent
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.model.openai.OpenAiChatModel;
import dev.langchain4j.service.AiServices;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class AgentConfig {

    @Bean
    public CustomerSupportAgent customerSupportAgent(OrderManagementTools tools) {
        return AiServices.builder(CustomerSupportAgent.class)
            .chatLanguageModel(
                OpenAiChatModel.builder()
                    .apiKey(System.getenv("OPENAI_API_KEY"))
                    .modelName("gpt-4o-mini")
                    .temperature(0.1)   // low temp for high tool-calling accuracy
                    .build()
            )
            .tools(tools)               // register tool class
            .chatMemory(MessageWindowChatMemory.withMaxMessages(20))
            .build();
    }
}

// CustomerSupportAgent.java - the interface (LangChain4j generates the proxy)
import dev.langchain4j.service.SystemMessage;

public interface CustomerSupportAgent {

    @SystemMessage("""
        You are a helpful customer support agent.
        Help customers with order status, inventory checks, and order cancellations.
        Always ask for explicit confirmation before executing cancellations or other irreversible actions.
        Be concise, professional, and empathetic.
        """)
    String chat(String userMessage);
}

How the Code Works

  1. @Tool annotation on plain methods — LangChain4j reflects over them, generates JSON schema for the LLM, and dispatches calls automatically when the model decides to invoke a tool.
  2. AiServices.builder(CustomerSupportAgent.class) — generates a dynamic proxy at startup; calling agent.chat() runs the full agent loop (LLM → tool call → LLM → final response).
  3. MessageWindowChatMemory.withMaxMessages(20): maintains a sliding window of the 20 most recent messages, which gives the agent conversation context without unbounded token growth.
  4. temperature(0.1) — tool-calling scenarios need deterministic behavior; low temperature reduces hallucinated tool arguments.
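The sliding-window idea behind the chat memory can be shown in a few lines. This is a conceptual sketch only, not LangChain4j's implementation: `SlidingWindowMemory` is a hypothetical class, messages are plain strings, and the real MessageWindowChatMemory applies extra rules (for example around system messages) that are ignored here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Conceptual sketch of a message window; NOT LangChain4j's implementation.
final class SlidingWindowMemory {
    private final int maxMessages;
    private final Deque<String> messages = new ArrayDeque<>();

    SlidingWindowMemory(int maxMessages) {
        if (maxMessages < 1) throw new IllegalArgumentException("maxMessages must be >= 1");
        this.maxMessages = maxMessages;
    }

    void add(String message) {
        messages.addLast(message);
        while (messages.size() > maxMessages) {
            messages.removeFirst(); // evict the oldest message first
        }
    }

    // Snapshot of the current window, oldest message first.
    List<String> messages() {
        return List.copyOf(messages);
    }
}
```

With a window of 3, adding five messages leaves only the last three, so the prompt size stays bounded no matter how long the conversation runs.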

Part 4: Observability and Tracing

Production AI services need cost visibility and latency tracking. Spring AI ships with Micrometer integration out of the box.

// ObservabilityConfig.java
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ObservabilityConfig {
    // Tags every metric from this service so AI metrics can be filtered per application
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> aiMetrics() {
        return registry -> {
            // Adding common tags for all metrics in this microservice
            registry.config().commonTags("app", "ai-microservice");
        };
    }
}

// TracingChatInterceptor.java - custom listener for LangChain4j
import dev.langchain4j.model.chat.listener.*;
import dev.langchain4j.model.output.TokenUsage;
import io.micrometer.core.instrument.MeterRegistry;
import lombok.RequiredArgsConstructor;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

import java.util.concurrent.TimeUnit;

@Slf4j
@Component
@RequiredArgsConstructor
public class TracingChatInterceptor implements ChatModelListener {
    private final MeterRegistry meterRegistry;
    
    // Key used to correlate request and response in the context attributes
    private static final String START_TIME_KEY = "start_time";

    @Override
    public void onRequest(ChatModelRequestContext ctx) {
        // Store start time in context attributes (shared between request/response)
        ctx.attributes().put(START_TIME_KEY, System.currentTimeMillis());
        
        log.debug("LLM Request | model={} | messages={}",
            ctx.chatRequest().parameters().modelName(),
            ctx.chatRequest().messages().size());
    }

    @Override
    public void onResponse(ChatModelResponseContext ctx) {
        Object startTimeObj = ctx.attributes().get(START_TIME_KEY);
        if (!(startTimeObj instanceof Long startTime)) return;

        long latencyMs = System.currentTimeMillis() - startTime;
        TokenUsage usage = ctx.chatResponse().metadata().tokenUsage();

        // Record metrics
        String modelName = ctx.chatRequest().parameters().modelName();
        
        meterRegistry.timer("llm.request.latency", "model", modelName)
            .record(latencyMs, TimeUnit.MILLISECONDS);
        
        if (usage != null) {
            meterRegistry.counter("llm.tokens.input", "model", modelName)
                .increment(usage.inputTokenCount());
            meterRegistry.counter("llm.tokens.output", "model", modelName)
                .increment(usage.outputTokenCount());
        }

        log.info("LLM Response | model={} | latency={}ms | inputTokens={} | outputTokens={}",
            modelName, latencyMs,
            usage != null ? usage.inputTokenCount() : 0,
            usage != null ? usage.outputTokenCount() : 0);
    }

    @Override
    public void onError(ChatModelErrorContext ctx) {
        log.error("LLM Error | model={} | error={}",
            ctx.chatRequest().parameters().modelName(),
            ctx.error().getMessage());
        meterRegistry.counter("llm.request.errors",
            "model", ctx.chatRequest().parameters().modelName()).increment();
    }
}

How the Code Works

  1. ChatModelListener is LangChain4j’s interceptor interface — register it on the model builder with .listeners(List.of(tracingInterceptor)).
  2. The start time is stored in ctx.attributes(), a per-request map that LangChain4j passes from onRequest to onResponse, so latency is measured correctly even when many requests run concurrently.
  3. The listener records input and output token counters per model rather than a dollar figure. Multiply those counters by your provider's current pricing in a dashboard (Prometheus with Grafana, or InfluxDB) to track monthly cost.
  4. Spring AI’s Micrometer integration auto-publishes spring.ai.chat.client.* metrics; combine both for full coverage.
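Turning token counts into a cost figure is simple arithmetic. A minimal sketch, assuming you supply the per-million-token prices yourself: `TokenCostEstimator` is a hypothetical helper, and the prices in the usage note are placeholders, not real GPT-4o-mini pricing.

```java
// Turns token counts into a cost figure. Prices are caller-supplied parameters.
final class TokenCostEstimator {
    private TokenCostEstimator() {}

    static double costUsd(long inputTokens, long outputTokens,
                          double usdPerMillionInput, double usdPerMillionOutput) {
        return inputTokens / 1_000_000.0 * usdPerMillionInput
             + outputTokens / 1_000_000.0 * usdPerMillionOutput;
    }
}
```

For example, with placeholder prices of 1.00 USD per million input tokens and 4.00 USD per million output tokens, `costUsd(2_000_000, 500_000, 1.00, 4.00)` gives 4.0. In practice I would do this multiplication in the dashboard query so that a price change does not require a redeploy.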

Part 5: Cost Optimization — Token Control and Semantic Caching

Token costs compound at scale. Three proven strategies: prompt compression, output length caps, and semantic caching.

// SemanticCacheService.java - avoid duplicate LLM calls for similar questions
import lombok.extern.slf4j.Slf4j;
import org.springframework.ai.chat.client.ChatClient;
import org.springframework.ai.document.Document;
import org.springframework.ai.embedding.EmbeddingModel;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.stereotype.Service;

import java.time.Instant;
import java.util.List;
import java.util.Map;

@Slf4j
@Service
public class SemanticCacheService {
    private final EmbeddingModel embeddingModel;      // Spring AI
    private final VectorStore cacheStore;              // separate cache vector DB
    private final ChatClient chatClient;

    private static final double CACHE_HIT_THRESHOLD = 0.92; // very high similarity

    public SemanticCacheService(EmbeddingModel embeddingModel, VectorStore cacheStore, ChatClient.Builder chatClientBuilder) {
        this.embeddingModel = embeddingModel;
        this.cacheStore = cacheStore;
        this.chatClient = chatClientBuilder.build();
    }

    public String queryWithCache(String userQuestion) {
        // 1. Check cache for semantically similar past questions
        // (Spring AI similaritySearch handles the embedding generation internally if configured)
        List<Document> hits = cacheStore.similaritySearch(
            SearchRequest.query(userQuestion)
                .withTopK(1)
                .withSimilarityThreshold(CACHE_HIT_THRESHOLD)
        );

        if (!hits.isEmpty()) {
            // Cache hit — return stored answer without LLM call
            log.info("Semantic cache HIT for: {}", userQuestion);
            return hits.get(0).getMetadata().get("cached_answer").toString();
        }

        // 2. Cache miss — call LLM
        String answer = chatClient.prompt()
            .user(userQuestion)
            .call()
            .content();

        // 3. Store question + answer in cache
        Document cacheEntry = new Document(userQuestion,
            Map.of("cached_answer", answer,
                   "cached_at", Instant.now().toString()));
        cacheStore.add(List.of(cacheEntry));

        return answer;
    }
}

// TokenBudgetAdvisor.java - Spring AI Advisor to enforce token limits
import org.springframework.ai.chat.client.advisor.api.AdvisedRequest;
import org.springframework.ai.chat.client.advisor.api.AdvisedResponse;
import org.springframework.ai.chat.client.advisor.api.CallAroundAdvisor;
import org.springframework.ai.chat.client.advisor.api.CallAroundAdvisorChain;
import org.springframework.ai.chat.messages.Message;
import org.springframework.stereotype.Component;
import lombok.extern.slf4j.Slf4j;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

@Slf4j
@Component
public class TokenBudgetAdvisor implements CallAroundAdvisor {
    private static final int MAX_CONTEXT_TOKENS = 3000;

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        // Prune the request before passing it down the chain
        AdvisedRequest processedRequest = before(request);
        return chain.nextAroundCall(processedRequest);
    }

    private AdvisedRequest before(AdvisedRequest request) {
        List<Message> messages = new ArrayList<>(request.messages());
        int estimatedTokens = estimateTokens(messages);

        if (estimatedTokens > MAX_CONTEXT_TOKENS) {
            // Logic to keep system message (usually at index 0) and last few exchanges
            messages = pruneHistory(messages);
            log.warn("Context pruned from ~{} to ~{} tokens", 
                estimatedTokens, estimateTokens(messages));
        }

        return AdvisedRequest.from(request)
                .withMessages(messages)
                .build();
    }

    private List<Message> pruneHistory(List<Message> messages) {
        if (messages.size() <= 1) return messages;
        // Simple strategy: Keep the first message (System) and the last 4
        List<Message> pruned = new ArrayList<>();
        pruned.add(messages.get(0)); 
        int start = Math.max(1, messages.size() - 4);
        pruned.addAll(messages.subList(start, messages.size()));
        return pruned;
    }

    private int estimateTokens(List<Message> messages) {
        // Rule of thumb: 1 token ≈ 4 characters
        return messages.stream()
            .filter(m -> m.getContent() != null)
            .mapToInt(m -> m.getContent().length() / 4)
            .sum();
    }
    
    @Override
    public String getName() {
        return "TokenBudgetAdvisor";
    }

    @Override
    public int getOrder() {
        return 0;
    }
}

How the Code Works

  1. CACHE_HIT_THRESHOLD = 0.92 — very high similarity means only near-identical questions get a cache hit, avoiding wrong answers from loosely related queries.
  2. CallAroundAdvisor is Spring AI’s around-advice interface for the ChatClient: it can change the request before it goes down the chain to the LLM and inspect or change the response on the way back.
  3. estimateTokens() uses the 4-chars-per-token rule as a fast approximation; for precision, use a proper tokenizer like jtokkit (Java implementation of tiktoken).
  4. Semantic caching can reduce LLM costs by 40-60% in FAQ-style applications where many users ask similar questions.

Production Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                     Client / API Gateway                     │
└─────────────────────────┬───────────────────────────────────┘
                          │ REST / WebSocket / SSE
          ┌───────────────▼────────────────────┐
          │   AI Microservice (Spring Boot)    │
          │                                    │
          │  ┌─────────────────────────────┐   │
          │  │  Spring AI ChatClient       │   │
          │  │  + SemanticCacheService     │   │
          │  │  + TokenBudgetAdvisor       │   │
          │  └────────────┬────────────────┘   │
          │               │ cache miss only    │
          │  ┌────────────▼────────────────┐   │
          │  │  LangChain4j Agent Loop     │   │
          │  │  - Tool Registry            │   │
          │  │  - Memory (sliding window)  │   │
          │  └────────────┬────────────────┘   │
          └───────────────┼────────────────────┘
                          │
        ┌─────────────────┼─────────────────────┐
        │                 │                     │
┌───────▼────┐  ┌─────────▼──────┐  ┌──────────▼────┐
│  OpenAI /  │  │  PGVector      │  │  Micrometer   │
│  Azure OAI │  │  VectorStore   │  │  + Prometheus │
│  (LLM API) │  │  (RAG + Cache) │  │  + Grafana    │
└────────────┘  └────────────────┘  └───────────────┘

Pitfalls and Gotchas

| Pitfall | Symptom | Fix |
|---------|---------|-----|
| Model hallucination in tool calls | Agent calls tools with made-up arguments | Lower temperature to 0.0–0.1; add strict validation in @Tool methods |
| Context window overflow | API error: context length exceeded (context_length_exceeded) | Use TokenBudgetAdvisor; implement history summarization for long sessions |
| Vector store cold start | Poor RAG answers on fresh deployment | Pre-populate vector store in a @PostConstruct or batch job at startup |
| Embedding model mismatch | Retrieval returns wrong documents | Always use the same embedding model for ingestion and query phases |
| Spring AI + LangChain4j config collision | Duplicate bean errors at startup | Use @ConditionalOnMissingBean or separate @Configuration classes |
| Rate limiting at scale | 429 errors under load | Use Resilience4j retry + exponential backoff; batch embedding calls |
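For the rate-limiting row, the retry-with-backoff idea looks roughly like this. It is a JDK-only stand-in so the logic is visible, not the Resilience4j API; `BackoffRetry` is a hypothetical helper, and a production version should retry only on retryable errors (429, 5xx) and add jitter.

```java
import java.util.function.Supplier;

// JDK-only sketch of retry with exponential backoff (stand-in for Resilience4j Retry).
final class BackoffRetry {
    private BackoffRetry() {}

    // Retries the call up to maxAttempts times, doubling the wait after each failure.
    static <T> T retry(Supplier<T> call, int maxAttempts, long initialDelayMs) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        long delayMs = initialDelayMs;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts) break; // out of attempts, rethrow below
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while backing off", ie);
                }
                delayMs *= 2; // exponential backoff
            }
        }
        throw last;
    }
}
```

A call such as `BackoffRetry.retry(() -> chatService.ask(question), 4, 500)` would wait 500 ms, 1 s, and 2 s between attempts before giving up; `chatService` here refers to the ChatService shown earlier.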

AI Prompts for This Topic

Prompt 1: Generate RAG pipeline
What it does: Scaffolds a complete Spring AI RAG service with vector store configuration for a given domain.
When to use it: When starting a new knowledge-base chat feature.

“Generate a production-ready Spring AI RAG service in Java that ingests PDF documents, stores embeddings in PGVector, and answers questions with source citations. Include exception handling and logging.”

Prompt 2: LangChain4j agent debugging
What it does: Helps diagnose unexpected tool-calling loops or hallucinated tool arguments.
When to use it: When your agent calls the wrong tool or loops indefinitely.

“I have a LangChain4j agent that keeps calling the same tool in a loop even after receiving the result. Here is my tool definition and the conversation history. What is causing this behavior and how do I fix it?”

Prompt 3: Token cost audit
What it does: Analyzes your system prompts and estimates monthly LLM cost at a given request volume.
When to use it: Before deploying to production to budget AI infrastructure costs.

“Given this system prompt [paste prompt], a typical user message of 50 words, and a response of 200 words, calculate the monthly token cost at 10,000 requests/day using GPT-4o-mini pricing. Suggest 3 ways to reduce the token count without losing answer quality.”

Frequently Asked Questions

Can I use Spring AI and LangChain4j in the same Spring Boot application?

Yes, but you need to manage bean conflicts carefully. Keep LangChain4j beans in a dedicated @Configuration class and use @Qualifier or @ConditionalOnMissingBean to avoid collisions with Spring AI’s auto-configured beans. Many teams use Spring AI for simple chat and RAG, and LangChain4j only for the agent/tool-calling layer.

Which vector database should I use for production RAG in Java?

PostgreSQL with PGVector is the most operationally straightforward choice — you likely already run Postgres, it handles millions of vectors, and Spring AI has a first-class PGVector store. For very high scale (>100M vectors) or multi-tenancy, consider Qdrant (has a Java client) or Weaviate. Avoid Redis Vector Search for RAG unless your documents are already in Redis.

How do I prevent the LLM from making destructive tool calls?

Three-layer defence: (1) guard clauses inside each @Tool method that validate inputs, (2) a system prompt instruction that requires explicit user confirmation for irreversible actions, and (3) a human-in-the-loop interrupt step for high-risk operations. Never expose a “delete all records” tool without at least a dry-run mode.
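Layers (1) and (3) can be sketched in plain Java. Everything here is an assumption for illustration: `CancellationGuard` is a hypothetical class, the "ORD-123" ID format is invented, and the confirmation is assumed to arrive from your UI or API rather than from the LLM.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Sketch of input validation plus an out-of-band, one-time user confirmation.
final class CancellationGuard {
    // Assumed ID format for illustration only, e.g. "ORD-123".
    private static final Pattern ORDER_ID = Pattern.compile("ORD-\\d+");

    // Order IDs the user has explicitly confirmed through the UI, not through the LLM.
    private final Set<String> confirmed = ConcurrentHashMap.newKeySet();

    void recordUserConfirmation(String orderId) {
        confirmed.add(orderId);
    }

    // Returns a string for the LLM; runs the action only when every check passes.
    String cancel(String orderId, Runnable cancelAction) {
        if (orderId == null || !ORDER_ID.matcher(orderId).matches()) {
            return "ERROR: Invalid order ID format";
        }
        if (!confirmed.remove(orderId)) { // each confirmation is valid once
            return "CONFIRMATION_REQUIRED: Ask the user to confirm cancelling " + orderId;
        }
        cancelAction.run();
        return "Order " + orderId + " cancelled";
    }
}
```

The point of the design is that the model cannot confirm on the user's behalf: the confirmation set is only written by code the LLM has no tool for.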

Is Spring AI production-ready in 2026?

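Yes. Spring AI 1.0 GA shipped in May 2025 and has been running in production workloads since. The API is stable as of GA, although several class and artifact names changed between the milestone releases used in this post and the GA release, so check the upgrade notes before copying milestone-era code. The main gap versus LangChain4j is agent/tool-calling sophistication: Spring AI’s @Tool support is improving, but LangChain4j gives you more control over the agent loop, retries, and tool result processing.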

What is the typical latency of an LLM call in a microservice?

For GPT-4o-mini with a 500-token response, expect 800ms–2.5s (p95). For streaming responses, first-token latency is 200–600ms. Factor this into your SLA — AI microservices need different timeout and circuit-breaker thresholds than traditional REST services. Use Resilience4j’s TimeLimiter with a 10–15s timeout for non-streaming calls.
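The timeout idea can be shown with the JDK alone. This is a stand-in for Resilience4j's TimeLimiter, not its API: `LlmTimeouts` is a hypothetical helper, and note that the underlying call keeps running in the background after the timeout fires.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// JDK-only sketch of a time limit around a blocking LLM call.
final class LlmTimeouts {
    private LlmTimeouts() {}

    // Runs the call asynchronously and returns the fallback if it exceeds the timeout.
    static String callWithTimeout(Supplier<String> llmCall, long timeoutMs, String fallback) {
        try {
            return CompletableFuture.supplyAsync(llmCall)
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS)
                .join();
        } catch (CompletionException e) {
            if (e.getCause() instanceof TimeoutException) {
                return fallback; // timed out: degrade gracefully
            }
            throw e; // other failures are not swallowed
        }
    }
}
```

With the numbers above, a non-streaming call would use something like `callWithTimeout(() -> chatService.ask(question), 12_000, "The assistant is taking too long, please try again.")`, where `chatService` is the ChatService from Part 1 and the fallback text is a placeholder.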

Conclusion

Building production-ready AI microservices in Java is no longer experimental territory. Spring AI provides the familiar Spring Boot developer experience for LLM integration, while LangChain4j gives you the agentic primitives needed for autonomous workflows. The key differentiators of a production-grade implementation are: a robust RAG pipeline with similarity thresholding, observability wired to real cost metrics, and token budgeting that prevents runaway spending. Start with Spring AI for your first feature, layer in LangChain4j agents when your use case needs tool calling, and treat AI calls like any other I/O — with retries, timeouts, and circuit breakers.
