Leveraging LLM Models and Deployments in SAP AI Core
In the previous post, we set up SAP AI Core and made our first LLM call from a CAP application. Now it's time to go deeper into the world of foundation models - understanding the available options, managing deployments programmatically, implementing streaming for better user experience, and optimizing our token usage.
This post is part of a series about building AI-powered applications with SAP BTP:
- Getting Started with SAP AI Core and the SAP AI SDK in CAP
- Leveraging LLM Models and Deployments in SAP AI Core (this post)
- Orchestrating AI Workflows with SAP AI Core (coming soon)
- Document Grounding with RAG in SAP AI Core (coming soon)
- Production-Ready AI Applications with SAP AI Core (coming soon)
What are we building?
In this post, we'll enhance our Support Ticket Intelligence System by:
- Understanding and using different foundation models
- Managing deployments via the AI Core API
- Implementing streaming responses for real-time feedback
- Building a more sophisticated ticket response suggester
- Optimizing prompts and token usage for cost efficiency
By the end, you'll have a solid understanding of how to work effectively with LLMs in the SAP ecosystem.
Available Foundation Models in SAP AI Core
SAP AI Core provides access to multiple foundation models from different providers. The availability depends on your region and service plan, but typically includes:
OpenAI Models (via Azure)
| Model | Use Case | Context Window |
|---|---|---|
| `gpt-4o` | Best overall performance, multimodal | 128K tokens |
| `gpt-4o-mini` | Cost-effective for simpler tasks | 128K tokens |
| `gpt-4` | Previous generation, still powerful | 8K/32K tokens |
| `gpt-3.5-turbo` | Fast and economical | 16K tokens |
Anthropic Models
| Model | Use Case | Context Window |
|---|---|---|
| `claude-3-opus` | Most capable, complex reasoning | 200K tokens |
| `claude-3-sonnet` | Balanced performance and cost | 200K tokens |
| `claude-3-haiku` | Fastest, most economical | 200K tokens |
Google Models
| Model | Use Case | Context Window |
|---|---|---|
| `gemini-1.5-pro` | Long context, multimodal | 1M tokens |
| `gemini-1.5-flash` | Fast responses | 1M tokens |
Choosing the Right Model
The model you choose depends on several factors:
- Task complexity: For simple classification, `gpt-4o-mini` or `claude-3-haiku` work well. For nuanced responses, use `gpt-4o` or `claude-3-sonnet`.
- Context length: If you need to process long documents, Gemini or Claude models offer larger context windows.
- Cost: Models vary significantly in price per token. For high-volume applications, optimize for cost.
- Latency: Smaller models respond faster, which matters for real-time applications.
For our Support Ticket System, we'll use `gpt-4o` for response generation (quality matters) and `gpt-4o-mini` for classification (a simpler task).
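One way to keep this choice in a single place is a small task-to-model lookup that the rest of the service imports. This is a minimal sketch of our own convention (the file name, the MODEL_FOR map, and the task keys are illustrative, not part of the SAP AI SDK):
// /srv/lib/models.js - central place to decide which model a task uses
const MODEL_FOR = {
  classification: 'gpt-4o-mini', // cheap and fast, good enough for structured labels
  response: 'gpt-4o'             // higher quality for customer-facing text
};

/**
 * Return the model for a task, falling back to the cheaper model.
 */
function modelFor(task) {
  return MODEL_FOR[task] || 'gpt-4o-mini';
}

module.exports = { MODEL_FOR, modelFor };
Swapping models later (for example, trying a Claude or Gemini deployment) then becomes a one-line change instead of a search through every handler.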
Managing Deployments Programmatically
While the AI Launchpad UI is convenient, you'll often need to manage deployments via code - especially for automation and CI/CD pipelines.
Listing Available Deployments
Let's create a utility to work with deployments. Install the AI API package if you haven't:
npm install @sap-ai-sdk/ai-api
Create /srv/lib/deployment-manager.js:
const { DeploymentApi } = require('@sap-ai-sdk/ai-api');
class DeploymentManager {
constructor(resourceGroup = 'default') {
this.resourceGroup = resourceGroup;
}
/**
* List all deployments in the resource group
*/
async listDeployments() {
try {
// deploymentQuery(queryParameters, headerParameters) returns a request builder; execute() sends the call
const response = await DeploymentApi.deploymentQuery(
{},
{ 'AI-Resource-Group': this.resourceGroup }
).execute();
return response.resources.map(deployment => ({
id: deployment.id,
configurationId: deployment.configurationId,
configurationName: deployment.configurationName,
status: deployment.status,
deploymentUrl: deployment.deploymentUrl,
createdAt: deployment.createdAt
}));
} catch (error) {
console.error('Error listing deployments:', error);
throw error;
}
}
/**
* Get details of a specific deployment
*/
async getDeployment(deploymentId) {
try {
const deployment = await DeploymentApi.deploymentGet(
deploymentId,
{ 'AI-Resource-Group': this.resourceGroup }
).execute();
return deployment;
} catch (error) {
console.error(`Error getting deployment ${deploymentId}:`, error);
throw error;
}
}
/**
* Find a running deployment for a specific model
*/
async findDeploymentForModel(modelName) {
const deployments = await this.listDeployments();
// Filter for running deployments that match the model
const matching = deployments.filter(d =>
d.status === 'RUNNING' &&
d.configurationName?.toLowerCase().includes(modelName.toLowerCase())
);
if (matching.length === 0) {
throw new Error(`No running deployment found for model: ${modelName}`);
}
return matching[0];
}
/**
* Create a new deployment from a configuration
*/
async createDeployment(configurationId) {
try {
const response = await DeploymentApi.deploymentCreate(
{ configurationId },
{ 'AI-Resource-Group': this.resourceGroup }
).execute();
console.log(`Deployment created: ${response.id}`);
return response;
} catch (error) {
console.error('Error creating deployment:', error);
throw error;
}
}
/**
* Delete a deployment
*/
async deleteDeployment(deploymentId) {
try {
// First, set to STOPPED
await DeploymentApi.deploymentModify(
deploymentId,
{ targetStatus: 'STOPPED' },
{ 'AI-Resource-Group': this.resourceGroup }
).execute();
// Then delete (in practice, you may need to poll until the status actually reaches STOPPED)
await DeploymentApi.deploymentDelete(
deploymentId,
{ 'AI-Resource-Group': this.resourceGroup }
).execute();
console.log(`Deployment ${deploymentId} deleted`);
} catch (error) {
console.error(`Error deleting deployment ${deploymentId}:`, error);
throw error;
}
}
}
module.exports = DeploymentManager;
This utility class provides a clean interface for deployment management. The key points (a short usage sketch follows the list):
- Resource Groups: All operations are scoped to a resource group. The default is usually `default`.
- Status management: Deployments have states like `PENDING`, `RUNNING`, and `STOPPED`. You can only delete stopped deployments.
- Configuration vs Deployment: A configuration defines the model settings; a deployment is a running instance of that configuration.
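As a quick sanity check, here is how the class above might be used from a small Node.js script. This assumes the AI Core credentials are available to the SDK (for example through a bound service instance, or an AICORE_SERVICE_KEY environment variable when testing locally):
const DeploymentManager = require('./lib/deployment-manager');

async function main() {
  const manager = new DeploymentManager('default');

  // List everything in the resource group
  const deployments = await manager.listDeployments();
  console.table(deployments);

  // Find a running deployment whose configuration name mentions gpt-4o
  const gpt4o = await manager.findDeploymentForModel('gpt-4o');
  console.log('Using deployment:', gpt4o.id, gpt4o.deploymentUrl);
}

main().catch(console.error);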
Using the Deployment Manager
Let's add an admin endpoint to our service. Update /srv/ticket-service.cds:
using support.db as db from '../db/schema';
service TicketService @(path: '/api') {
entity Tickets as projection on db.Tickets;
action generateResponse(ticketId: UUID) returns String;
action classifyTicket(ticketId: UUID) returns String;
// Admin actions
@requires: 'admin'
function listDeployments() returns array of {
id: String;
configurationName: String;
status: String;
};
};
And update the handler in /srv/ticket-service.js:
const cds = require('@sap/cds');
const DeploymentManager = require('./lib/deployment-manager');
module.exports = class TicketService extends cds.ApplicationService {
async init() {
const { Tickets } = this.entities;
const deploymentManager = new DeploymentManager();
// List deployments
this.on('listDeployments', async () => {
return await deploymentManager.listDeployments();
});
// ... rest of handlers
await super.init();
}
};
Implementing Streaming Responses
For better user experience, especially with longer AI responses, streaming allows you to display the response as it's generated rather than waiting for the complete response.
Why Streaming Matters
Without streaming:
- User clicks "Generate Response"
- Waits 5-10 seconds seeing nothing
- Suddenly sees the complete response
With streaming:
- User clicks "Generate Response"
- Immediately starts seeing text appear word by word
- Can read the beginning while the rest generates
This perceived responsiveness dramatically improves UX.
Implementing Streaming in CAP
Create /srv/lib/streaming-llm.js:
/**
* LLM client with streaming support
*/
class StreamingLLMClient {
/**
* Generate a streaming response
* @param {Object} options - Generation options
* @param {Function} onChunk - Callback for each chunk received
* @returns {Promise<string>} - Complete response
*/
async generateStream(options, onChunk) {
const { AzureOpenAiChatClient } = await import('@sap-ai-sdk/foundation-models');
const client = new AzureOpenAiChatClient(options.model || 'gpt-4o');
// stream() returns a streaming response; its .stream property is an async iterator over chunks
const response = await client.stream({
messages: options.messages,
max_tokens: options.maxTokens || 1000,
temperature: options.temperature || 0.7
});
let fullContent = '';
// Process the stream chunk by chunk
for await (const chunk of response.stream) {
const content = chunk.getDeltaContent();
if (content) {
fullContent += content;
if (onChunk) {
onChunk(content);
}
}
}
return fullContent;
}
/**
* Generate a non-streaming response (for comparison)
*/
async generate(options) {
const { AzureOpenAiChatClient } = await import('@sap-ai-sdk/foundation-models');
const client = new AzureOpenAiChatClient(options.model || 'gpt-4o');
const response = await client.run({
messages: options.messages,
max_tokens: options.maxTokens || 1000,
temperature: options.temperature || 0.7
});
return {
content: response.getContent(),
usage: response.getTokenUsage(),
finishReason: response.getFinishReason()
};
}
}
module.exports = StreamingLLMClient;
Exposing Streaming via Server-Sent Events (SSE)
CAP doesn't natively support SSE, but we can add a custom Express endpoint. Update /srv/ticket-service.js:
const cds = require('@sap/cds');
const StreamingLLMClient = require('./lib/streaming-llm');
const DeploymentManager = require('./lib/deployment-manager');
module.exports = class TicketService extends cds.ApplicationService {
async init() {
const { Tickets } = this.entities;
const llmClient = new StreamingLLMClient();
// Register a custom Express route for the streaming endpoint.
// cds.app is the underlying Express app and is available once the server has bootstrapped.
cds.app.get('/api/tickets/:id/stream-response', async (req, res) => {
const ticketId = req.params.id;
// Set SSE headers
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');
res.setHeader('Connection', 'keep-alive');
try {
// Fetch ticket
const ticket = await SELECT.one.from(Tickets).where({ ID: ticketId });
if (!ticket) {
res.write(`event: error\ndata: Ticket not found\n\n`);
res.end();
return;
}
const messages = [
{
role: 'system',
content: `You are a helpful customer support assistant.
Provide professional, empathetic responses to customer tickets.`
},
{
role: 'user',
content: `Generate a response for this ticket:
Subject: ${ticket.subject}
Description: ${ticket.description}`
}
];
// Stream the response
let fullResponse = '';
await llmClient.generateStream(
{ messages, model: 'gpt-4o' },
(chunk) => {
fullResponse += chunk;
res.write(`data: ${JSON.stringify({ chunk })}\n\n`);
}
);
// Send completion event
res.write(`event: complete\ndata: ${JSON.stringify({ fullResponse })}\n\n`);
// Update ticket with response
await UPDATE(Tickets).set({ aiResponse: fullResponse }).where({ ID: ticketId });
res.end();
} catch (error) {
console.error('Streaming error:', error);
res.write(`event: error\ndata: ${error.message}\n\n`);
res.end();
}
});
// ... rest of handlers
await super.init();
}
};
Client-Side Consumption
Here's how a frontend would consume the SSE stream:
// Frontend JavaScript example
async function streamTicketResponse(ticketId) {
const responseContainer = document.getElementById('response');
responseContainer.textContent = '';
const eventSource = new EventSource(`/api/tickets/${ticketId}/stream-response`);
eventSource.onmessage = (event) => {
const data = JSON.parse(event.data);
responseContainer.textContent += data.chunk;
};
eventSource.addEventListener('complete', (event) => {
console.log('Response complete:', JSON.parse(event.data));
eventSource.close();
});
eventSource.addEventListener('error', (event) => {
// Fires for server-sent 'error' events (event.data set) and for connection errors (no data)
console.error('Stream error:', event.data || 'connection error');
eventSource.close();
});
}
Building a Sophisticated Response Suggester
Now let's build a more intelligent response generator that considers context and provides structured output.
Enhanced Prompt Engineering
The quality of AI responses heavily depends on your prompts. Here's an improved version:
Create /srv/lib/prompts.js:
/**
* Prompt templates for the Support Ticket AI
*/
const PROMPTS = {
RESPONSE_SYSTEM: `You are an expert customer support agent for a software company.
Your responsibilities:
1. Provide helpful, accurate, and empathetic responses
2. Address the customer's specific concern
3. Offer clear steps or solutions when applicable
4. Maintain a professional yet friendly tone
5. Ask clarifying questions if the issue is unclear
Guidelines:
- Keep responses concise but complete
- Use bullet points for multi-step solutions
- Acknowledge the customer's frustration when appropriate
- Never make promises you can't keep
- If you don't know something, say so honestly`,
RESPONSE_USER: (ticket) => `Please draft a response for the following support ticket:
**Ticket ID:** ${ticket.ID}
**Subject:** ${ticket.subject}
**Priority:** ${ticket.priority || 'Not set'}
**Category:** ${ticket.category || 'Uncategorized'}
**Customer Message:**
${ticket.description}
---
Provide a professional response that:
1. Acknowledges their issue
2. Provides helpful information or next steps
3. Ends with an offer for further assistance`,
CLASSIFICATION_SYSTEM: `You are a ticket classification system. Analyze support tickets and provide structured classification.
Categories available:
- Technical Issue
- Billing Question
- Feature Request
- Account Access
- Bug Report
- General Inquiry
Priority levels:
- Critical: System down, data loss, security issue
- High: Major functionality broken, blocking issue
- Medium: Feature not working as expected, workaround exists
- Low: Minor issue, cosmetic, nice-to-have
Sentiment:
- Frustrated: Customer is upset, angry, or disappointed
- Neutral: Standard inquiry, no strong emotion
- Positive: Customer is satisfied, providing praise
Respond ONLY with valid JSON in this exact format:
{
"category": "string",
"priority": "string",
"sentiment": "string",
"confidence": number,
"reasoning": "string"
}`,
CLASSIFICATION_USER: (ticket) => `Classify this support ticket:
Subject: ${ticket.subject}
Description: ${ticket.description}
Provide classification as JSON.`
};
module.exports = PROMPTS;
Implementing Classification
Add the classification logic to your service. Update /srv/ticket-service.js:
const cds = require('@sap/cds');
const PROMPTS = require('./lib/prompts');
const StreamingLLMClient = require('./lib/streaming-llm');
module.exports = class TicketService extends cds.ApplicationService {
async init() {
const { Tickets } = this.entities;
const llmClient = new StreamingLLMClient();
// Handler for generating AI response
this.on('generateResponse', async (req) => {
const { ticketId } = req.data;
const ticket = await SELECT.one.from(Tickets).where({ ID: ticketId });
if (!ticket) {
return req.error(404, `Ticket ${ticketId} not found`);
}
try {
const result = await llmClient.generate({
model: 'gpt-4o',
messages: [
{ role: 'system', content: PROMPTS.RESPONSE_SYSTEM },
{ role: 'user', content: PROMPTS.RESPONSE_USER(ticket) }
],
maxTokens: 800,
temperature: 0.7
});
await UPDATE(Tickets)
.set({ aiResponse: result.content })
.where({ ID: ticketId });
console.log(`Token usage: ${JSON.stringify(result.usage)}`);
return result.content;
} catch (error) {
console.error('AI generation error:', error);
return req.error(500, 'Failed to generate response');
}
});
// Handler for classifying ticket
this.on('classifyTicket', async (req) => {
const { ticketId } = req.data;
const ticket = await SELECT.one.from(Tickets).where({ ID: ticketId });
if (!ticket) {
return req.error(404, `Ticket ${ticketId} not found`);
}
try {
// Use a smaller, faster model for classification
const result = await llmClient.generate({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: PROMPTS.CLASSIFICATION_SYSTEM },
{ role: 'user', content: PROMPTS.CLASSIFICATION_USER(ticket) }
],
maxTokens: 200,
temperature: 0.3 // Lower temperature for consistent classification
});
// Parse the JSON response
const classification = JSON.parse(result.content);
// Update ticket with classification
await UPDATE(Tickets)
.set({
category: classification.category,
priority: classification.priority,
sentiment: classification.sentiment
})
.where({ ID: ticketId });
return JSON.stringify(classification);
} catch (error) {
console.error('Classification error:', error);
return req.error(500, 'Failed to classify ticket');
}
});
// Auto-classify new tickets
this.after('CREATE', 'Tickets', async (ticket) => {
// Fire and forget - classify in background
setImmediate(async () => {
try {
const result = await llmClient.generate({
model: 'gpt-4o-mini',
messages: [
{ role: 'system', content: PROMPTS.CLASSIFICATION_SYSTEM },
{ role: 'user', content: PROMPTS.CLASSIFICATION_USER(ticket) }
],
maxTokens: 200,
temperature: 0.3
});
const classification = JSON.parse(result.content);
await UPDATE(Tickets)
.set({
category: classification.category,
priority: classification.priority,
sentiment: classification.sentiment
})
.where({ ID: ticket.ID });
console.log(`Auto-classified ticket ${ticket.ID}:`, classification);
} catch (error) {
console.error(`Auto-classification failed for ${ticket.ID}:`, error);
}
});
});
await super.init();
}
};
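One practical caveat about the classification handlers above: JSON.parse on raw model output can throw if the model wraps its answer in a Markdown code fence or adds a stray sentence, even with a strict system prompt. A small defensive helper like the one below (our own addition, not part of the SAP AI SDK) makes the parsing step more forgiving:
/**
 * Best-effort extraction of a JSON object from an LLM reply.
 * Strips Markdown code fences and falls back to the first {...} block.
 */
function parseModelJson(text) {
  const cleaned = text.replace(/```(?:json)?/gi, '').trim();
  try {
    return JSON.parse(cleaned);
  } catch (err) {
    const match = cleaned.match(/\{[\s\S]*\}/);
    if (match) {
      return JSON.parse(match[0]);
    }
    throw new Error(`Could not parse model output as JSON: ${err.message}`);
  }
}

module.exports = { parseModelJson };
In the handlers above, JSON.parse(result.content) could then be replaced with parseModelJson(result.content).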
Understanding Temperature and Token Settings
Two critical parameters affect LLM output:
Temperature (0.0 - 2.0):
- `0.0-0.3`: Deterministic, consistent output. Use for classification, data extraction.
- `0.5-0.8`: Balanced creativity and consistency. Use for general responses.
- `0.9-1.2`: Creative, varied output. Use for brainstorming, creative writing.
- `>1.2`: Highly random, often incoherent. Rarely useful.
Max Tokens:
- Controls the maximum length of the response
- 1 token ≈ 4 characters in English
- Set based on expected output length
- Too low = truncated responses
- Too high = can still cost you, since rate limits and quota are often estimated from the requested max_tokens rather than the tokens actually generated (see the short sketch after this list)
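To make this concrete, here is how the two settings might be combined for our two tasks, reusing the StreamingLLMClient from earlier. The numbers are starting points to tune, and classificationMessages / responseMessages stand in for message arrays built from the prompt templates above:
// Deterministic, short output: classification
const classification = await llmClient.generate({
  model: 'gpt-4o-mini',
  messages: classificationMessages,
  temperature: 0.2,  // keep labels stable across runs
  maxTokens: 200     // a small JSON object needs few tokens
});

// Balanced, longer output: a customer-facing response draft
const draft = await llmClient.generate({
  model: 'gpt-4o',
  messages: responseMessages,
  temperature: 0.7,  // some variety, still professional
  maxTokens: 800     // enough for a few paragraphs without over-reserving
});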
Token Optimization Strategies
AI services charge per token, so optimization matters for production applications.
1. Choose the Right Model
// For simple tasks, use smaller models
const classificationResult = await llmClient.generate({
model: 'gpt-4o-mini', // significantly cheaper per token than gpt-4o
// ...
});
// For complex tasks, use powerful models
const responseResult = await llmClient.generate({
model: 'gpt-4o', // Better quality
// ...
});
2. Optimize Prompts
// ❌ Verbose prompt (more tokens)
const badPrompt = `
I would like you to please analyze the following customer support ticket
and then provide me with a detailed and comprehensive response that the
support agent could potentially use to reply to this customer. The response
should be professional and helpful and address all of the customer's concerns...
`;
// ✅ Concise prompt (fewer tokens)
const goodPrompt = `
Analyze this support ticket and draft a professional response:
${ticket.description}
`;
3. Implement Caching
Create /srv/lib/response-cache.js:
const crypto = require('crypto');
/**
* Simple in-memory cache for AI responses
* In production, use Redis or similar
*/
class ResponseCache {
constructor(ttlMs = 3600000) { // 1 hour default TTL
this.cache = new Map();
this.ttlMs = ttlMs;
}
_generateKey(messages) {
const content = JSON.stringify(messages);
return crypto.createHash('md5').update(content).digest('hex');
}
get(messages) {
const key = this._generateKey(messages);
const entry = this.cache.get(key);
if (!entry) return null;
if (Date.now() > entry.expiresAt) {
this.cache.delete(key);
return null;
}
console.log('Cache hit for:', key);
return entry.value;
}
set(messages, value) {
const key = this._generateKey(messages);
this.cache.set(key, {
value,
expiresAt: Date.now() + this.ttlMs
});
}
clear() {
this.cache.clear();
}
}
module.exports = new ResponseCache();
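To actually benefit from the cache, wrap the LLM call with a lookup before and a store after. A minimal sketch, assuming llmClient is the StreamingLLMClient instance from earlier:
const responseCache = require('./lib/response-cache');

async function generateWithCache(llmClient, options) {
  // Reuse an earlier answer for an identical message payload
  const cached = responseCache.get(options.messages);
  if (cached) {
    return cached;
  }

  const result = await llmClient.generate(options);
  responseCache.set(options.messages, result);
  return result;
}

module.exports = { generateWithCache };
Note that caching only pays off where identical inputs recur, such as classifying duplicate tickets; personalized response drafts will rarely hit the cache.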
4. Monitor Usage
Add usage tracking to your service:
// Track token usage
let totalTokensUsed = 0;
this.on('generateResponse', async (req) => {
// ... generation logic
const result = await llmClient.generate(options);
// Track usage
const usage = result.usage;
totalTokensUsed += usage.total_tokens;
console.log(`Request tokens: ${usage.total_tokens}`);
console.log(`Session total: ${totalTokensUsed}`);
// In production, store this in a database for billing/monitoring
});
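If you want that data to survive restarts, one row per LLM call in a dedicated table is usually enough. The sketch below assumes a hypothetical db.TokenUsage entity (with model, promptTokens, completionTokens, totalTokens, and ticket_ID fields) that is not part of the schema from the previous post:
const cds = require('@sap/cds');

// Hypothetical helper: persist one row per LLM call for billing/monitoring
async function recordTokenUsage(model, usage, ticketId) {
  const { TokenUsage } = cds.entities('support.db');
  await INSERT.into(TokenUsage).entries({
    model,
    promptTokens: usage.prompt_tokens,
    completionTokens: usage.completion_tokens,
    totalTokens: usage.total_tokens,
    ticket_ID: ticketId
  });
}

module.exports = { recordTokenUsage };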
Testing Our Enhanced Service
Update /test/requests.http:
### Create a new support ticket
POST http://localhost:4004/api/Tickets
Content-Type: application/json
{
"subject": "Payment failed but money was deducted",
"description": "I tried to purchase the premium plan yesterday and the payment failed with an error. However, I can see that $99 was deducted from my bank account. I need this resolved urgently as I'm being charged for something I don't have access to. This is very frustrating!"
}
### Get all tickets (check auto-classification)
GET http://localhost:4004/api/Tickets
### Manually classify a ticket
POST http://localhost:4004/api/classifyTicket
Content-Type: application/json
{
"ticketId": "YOUR-TICKET-ID"
}
### Generate AI response
POST http://localhost:4004/api/generateResponse
Content-Type: application/json
{
"ticketId": "YOUR-TICKET-ID"
}
### Test streaming (in browser or with curl)
# curl -N http://localhost:4004/api/tickets/YOUR-TICKET-ID/stream-response
Recap
In this post, we've significantly expanded our AI capabilities:
- Explored foundation models: Understood the different models available and when to use each
- Managed deployments programmatically: Built utilities to list, create, and manage AI Core deployments
- Implemented streaming: Added real-time response streaming for better UX
- Enhanced our prompts: Created structured, effective prompts for different tasks
- Added classification: Auto-classify tickets by category, priority, and sentiment
- Optimized for cost: Learned strategies to minimize token usage
Our Support Ticket System now automatically classifies incoming tickets and can generate helpful responses on demand.
Next Steps
In the next post, Orchestrating AI Workflows with SAP AI Core, we'll learn how to:
- Use the orchestration service to chain multiple AI operations
- Build complex workflows (classify → analyze → respond)
- Implement content filtering and guardrails
- Add templating for consistent prompt management
Stay tuned!
