Leveraging LLM Models and Deployments in SAP AI Core

In the previous post, we set up SAP AI Core and made our first LLM call from a CAP application. Now it's time to go deeper into the world of foundation models: understanding the available options, managing deployments programmatically, implementing streaming for a better user experience, and optimizing token usage.

This post is part of a series about building AI-powered applications with SAP BTP:

  1. Getting Started with SAP AI Core and the SAP AI SDK in CAP
  2. Leveraging LLM Models and Deployments in SAP AI Core (this post)
  3. Orchestrating AI Workflows with SAP AI Core (coming soon)
  4. Document Grounding with RAG in SAP AI Core (coming soon)
  5. Production-Ready AI Applications with SAP AI Core (coming soon)

What are we building?

In this post, we'll enhance our Support Ticket Intelligence System by:

  • Understanding and using different foundation models
  • Managing deployments via the AI Core API
  • Implementing streaming responses for real-time feedback
  • Building a more sophisticated ticket response suggester
  • Optimizing prompts and token usage for cost efficiency

By the end, you'll have a solid understanding of how to work effectively with LLMs in the SAP ecosystem.

Available Foundation Models in SAP AI Core

SAP AI Core provides access to multiple foundation models from different providers. The availability depends on your region and service plan, but typically includes:

OpenAI Models (via Azure)

| Model | Use Case | Context Window |
|---|---|---|
| gpt-4o | Best overall performance, multimodal | 128K tokens |
| gpt-4o-mini | Cost-effective for simpler tasks | 128K tokens |
| gpt-4 | Previous generation, still powerful | 8K/32K tokens |
| gpt-3.5-turbo | Fast and economical | 16K tokens |

Anthropic Models

| Model | Use Case | Context Window |
|---|---|---|
| claude-3-opus | Most capable, complex reasoning | 200K tokens |
| claude-3-sonnet | Balanced performance and cost | 200K tokens |
| claude-3-haiku | Fastest, most economical | 200K tokens |

Google Models

| Model | Use Case | Context Window |
|---|---|---|
| gemini-1.5-pro | Long context, multimodal | 1M tokens |
| gemini-1.5-flash | Fast responses | 1M tokens |

Choosing the Right Model

The model you choose depends on several factors:

  1. Task complexity: For simple classification, gpt-4o-mini or claude-3-haiku work well. For nuanced responses, use gpt-4o or claude-3-sonnet.

  2. Context length: If you need to process long documents, Gemini or Claude models offer larger context windows.

  3. Cost: Models vary significantly in price per token. For high-volume applications, optimize for cost.

  4. Latency: Smaller models respond faster, which matters for real-time applications.

For our Support Ticket System, we'll use gpt-4o for response generation (quality matters) and gpt-4o-mini for classification (simpler task).
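
Since these choices tend to drift as requirements change, it can help to keep them in one small module rather than hard-coding model names in every handler. A minimal sketch, assuming a helper file of our own (the file name and task names are our convention, not anything prescribed by the SDK):

// /srv/lib/models.js (hypothetical helper, our own convention)
// Central place for "which model for which task", so a later model swap
// is a one-line change.
const MODELS = {
  classification: 'gpt-4o-mini', // simple, high-volume task
  response: 'gpt-4o'             // customer-facing text, quality matters
};

function modelForTask(task) {
  return MODELS[task] || MODELS.response;
}

module.exports = { modelForTask };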

Managing Deployments Programmatically

While the AI Launchpad UI is convenient, you'll often need to manage deployments from code, especially for automation and CI/CD pipelines.

Listing Available Deployments

Let's create a utility to work with deployments. Install the AI API package if you haven't:

npm install @sap-ai-sdk/ai-api

Create /srv/lib/deployment-manager.js:

const { DeploymentApi } = require('@sap-ai-sdk/ai-api');

class DeploymentManager {
  constructor(resourceGroup = 'default') {
    this.resourceGroup = resourceGroup;
  }

  /**
   * List all deployments in the resource group
   */
  async listDeployments() {
    try {
      const response = await DeploymentApi.deploymentQuery(
        {},
        { 'AI-Resource-Group': this.resourceGroup }
      ).execute();
      
      return response.resources.map(deployment => ({
        id: deployment.id,
        configurationId: deployment.configurationId,
        configurationName: deployment.configurationName,
        status: deployment.status,
        deploymentUrl: deployment.deploymentUrl,
        createdAt: deployment.createdAt
      }));
    } catch (error) {
      console.error('Error listing deployments:', error);
      throw error;
    }
  }

  /**
   * Get details of a specific deployment
   */
  async getDeployment(deploymentId) {
    try {
      const deployment = await DeploymentApi.deploymentGet(
        deploymentId,
        { 'AI-Resource-Group': this.resourceGroup }
      ).execute();
      return deployment;
    } catch (error) {
      console.error(`Error getting deployment ${deploymentId}:`, error);
      throw error;
    }
  }

  /**
   * Find a running deployment for a specific model
   */
  async findDeploymentForModel(modelName) {
    const deployments = await this.listDeployments();
    
    // Filter for running deployments that match the model
    const matching = deployments.filter(d => 
      d.status === 'RUNNING' && 
      d.configurationName?.toLowerCase().includes(modelName.toLowerCase())
    );
    
    if (matching.length === 0) {
      throw new Error(`No running deployment found for model: ${modelName}`);
    }
    
    return matching[0];
  }

  /**
   * Create a new deployment from a configuration
   */
  async createDeployment(configurationId) {
    try {
      const response = await DeploymentApi.deploymentCreate(
        { configurationId },
        { 'AI-Resource-Group': this.resourceGroup }
      ).execute();
      
      console.log(`Deployment created: ${response.id}`);
      return response;
    } catch (error) {
      console.error('Error creating deployment:', error);
      throw error;
    }
  }

  /**
   * Delete a deployment
   */
  async deleteDeployment(deploymentId) {
    try {
      // Request STOPPED first; stopping is asynchronous, so in practice
      // poll getDeployment until the status is STOPPED before deleting
      await DeploymentApi.deploymentModify(
        deploymentId,
        { targetStatus: 'STOPPED' },
        { 'AI-Resource-Group': this.resourceGroup }
      ).execute();
      
      // Then delete
      await DeploymentApi.deploymentDelete(
        deploymentId,
        { 'AI-Resource-Group': this.resourceGroup }
      ).execute();
      
      console.log(`Deployment ${deploymentId} deleted`);
    } catch (error) {
      console.error(`Error deleting deployment ${deploymentId}:`, error);
      throw error;
    }
  }
}

module.exports = DeploymentManager;

This utility class provides a clean interface for deployment management; a short usage example follows the list below. The key points:

  1. Resource Groups: All operations are scoped to a resource group. Unless you have created your own, this is the group named default.

  2. Status management: Deployments have states like PENDING, RUNNING, STOPPED. You can only delete stopped deployments.

  3. Configuration vs Deployment: A configuration defines the model settings; a deployment is a running instance of that configuration.
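
Before wiring the class into the service, a quick standalone check can confirm that a deployment exists for the model you plan to use. A small sketch, assuming your gpt-4o configuration name contains the model name (which is what findDeploymentForModel relies on):

const DeploymentManager = require('./lib/deployment-manager');

(async () => {
  const manager = new DeploymentManager('default');

  // Throws if no RUNNING deployment matches 'gpt-4o'
  const deployment = await manager.findDeploymentForModel('gpt-4o');
  console.log(`gpt-4o deployment: ${deployment.id} (${deployment.deploymentUrl})`);
})();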

Using the Deployment Manager

Let's add an admin endpoint to our service. Update /srv/ticket-service.cds:

using support.db as db from '../db/schema';

service TicketService @(path: '/api') {
  entity Tickets as projection on db.Tickets;
  
  action generateResponse(ticketId: UUID) returns String;
  action classifyTicket(ticketId: UUID) returns String;
  
  // Admin actions
  @requires: 'admin'
  function listDeployments() returns array of {
    id: String;
    configurationName: String;
    status: String;
  };
}

And update the handler in /srv/ticket-service.js:

const cds = require('@sap/cds');
const DeploymentManager = require('./lib/deployment-manager');

module.exports = class TicketService extends cds.ApplicationService {
  
  async init() {
    const { Tickets } = this.entities;
    const deploymentManager = new DeploymentManager();
    
    // List deployments
    this.on('listDeployments', async () => {
      return await deploymentManager.listDeployments();
    });
    
    // ... rest of handlers
    await super.init();
  }
};

Implementing Streaming Responses

For better user experience, especially with longer AI responses, streaming allows you to display the response as it's generated rather than waiting for the complete response.

Why Streaming Matters

Without streaming:

  1. User clicks "Generate Response"
  2. Waits 5-10 seconds seeing nothing
  3. Suddenly sees the complete response

With streaming:

  1. User clicks "Generate Response"
  2. Immediately starts seeing text appear word by word
  3. Can read the beginning while the rest generates

This perceived responsiveness dramatically improves UX.

Implementing Streaming in CAP

Create /srv/lib/streaming-llm.js:

/**
 * LLM client with streaming support
 */
class StreamingLLMClient {
  
  /**
   * Generate a streaming response
   * @param {Object} options - Generation options
   * @param {Function} onChunk - Callback for each chunk received
   * @returns {Promise<string>} - Complete response
   */
  async generateStream(options, onChunk) {
    const { AzureOpenAiChatClient } = await import('@sap-ai-sdk/foundation-models');
    
    const client = new AzureOpenAiChatClient(options.model || 'gpt-4o');
    
    // The SDK exposes streaming via stream(); the returned response object
    // carries an async-iterable stream of chunks
    const response = await client.stream({
      messages: options.messages,
      max_tokens: options.maxTokens || 1000,
      temperature: options.temperature || 0.7
    });
    
    let fullContent = '';
    
    // Process the stream chunk by chunk
    for await (const chunk of response.stream) {
      const content = chunk.getDeltaContent();
      if (content) {
        fullContent += content;
        if (onChunk) {
          onChunk(content);
        }
      }
    }
    
    return fullContent;
  }
  
  /**
   * Generate a non-streaming response (for comparison)
   */
  async generate(options) {
    const { AzureOpenAiChatClient } = await import('@sap-ai-sdk/foundation-models');
    
    const client = new AzureOpenAiChatClient(options.model || 'gpt-4o');
    
    const response = await client.run({
      messages: options.messages,
      max_tokens: options.maxTokens || 1000,
      temperature: options.temperature || 0.7
    });
    
    return {
      content: response.getContent(),
      usage: response.getTokenUsage(),
      finishReason: response.getFinishReason()
    };
  }
}

module.exports = StreamingLLMClient;

Exposing Streaming via Server-Sent Events (SSE)

CAP doesn't natively support SSE, but we can register a custom Express route. The route has to be added before CAP mounts its own protocol adapters, and the standard place for that is a custom server file. Create /srv/server.js:

const cds = require('@sap/cds');
const StreamingLLMClient = require('./lib/streaming-llm');

const llmClient = new StreamingLLMClient();

// 'bootstrap' fires once the Express app is created, before CAP mounts its
// protocol adapters, so this route is matched first for its exact path
cds.on('bootstrap', (app) => {
  app.get('/api/tickets/:id/stream-response', async (req, res) => {
    const ticketId = req.params.id;
    
    // Set SSE headers
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');
    
    try {
      // Fetch ticket via its fully qualified entity name
      const { Tickets } = cds.entities('support.db');
      const ticket = await SELECT.one.from(Tickets).where({ ID: ticketId });
      if (!ticket) {
        res.write(`event: error\ndata: Ticket not found\n\n`);
        res.end();
        return;
      }
      
      const messages = [
        {
          role: 'system',
          content: `You are a helpful customer support assistant. 
Provide professional, empathetic responses to customer tickets.`
        },
        {
          role: 'user',
          content: `Generate a response for this ticket:
Subject: ${ticket.subject}
Description: ${ticket.description}`
        }
      ];
      
      // Stream the response, forwarding each chunk as an SSE data event
      let fullResponse = '';
      await llmClient.generateStream(
        { messages, model: 'gpt-4o' },
        (chunk) => {
          fullResponse += chunk;
          res.write(`data: ${JSON.stringify({ chunk })}\n\n`);
        }
      );
      
      // Send completion event
      res.write(`event: complete\ndata: ${JSON.stringify({ fullResponse })}\n\n`);
      
      // Persist the generated response on the ticket
      await UPDATE(Tickets).set({ aiResponse: fullResponse }).where({ ID: ticketId });
      
      res.end();
    } catch (error) {
      console.error('Streaming error:', error);
      res.write(`event: error\ndata: ${error.message}\n\n`);
      res.end();
    }
  });
});

// Hand control back to CAP's default server
module.exports = cds.server;

Client-Side Consumption

Here's how a frontend would consume the SSE stream:

// Frontend JavaScript example
async function streamTicketResponse(ticketId) {
  const responseContainer = document.getElementById('response');
  responseContainer.textContent = '';
  
  const eventSource = new EventSource(`/api/tickets/${ticketId}/stream-response`);
  
  eventSource.onmessage = (event) => {
    const data = JSON.parse(event.data);
    responseContainer.textContent += data.chunk;
  };
  
  eventSource.addEventListener('complete', (event) => {
    console.log('Response complete:', JSON.parse(event.data));
    eventSource.close();
  });
  
  eventSource.addEventListener('error', (event) => {
    console.error('Stream error:', event.data);
    eventSource.close();
  });
}

Building a Sophisticated Response Suggester

Now let's build a more intelligent response generator that considers context and provides structured output.

Enhanced Prompt Engineering

The quality of AI responses heavily depends on your prompts. Here's an improved version:

Create /srv/lib/prompts.js:

/**
 * Prompt templates for the Support Ticket AI
 */
const PROMPTS = {
  
  RESPONSE_SYSTEM: `You are an expert customer support agent for a software company.

Your responsibilities:
1. Provide helpful, accurate, and empathetic responses
2. Address the customer's specific concern
3. Offer clear steps or solutions when applicable
4. Maintain a professional yet friendly tone
5. Ask clarifying questions if the issue is unclear

Guidelines:
- Keep responses concise but complete
- Use bullet points for multi-step solutions
- Acknowledge the customer's frustration when appropriate
- Never make promises you can't keep
- If you don't know something, say so honestly`,

  RESPONSE_USER: (ticket) => `Please draft a response for the following support ticket:

**Ticket ID:** ${ticket.ID}
**Subject:** ${ticket.subject}
**Priority:** ${ticket.priority || 'Not set'}
**Category:** ${ticket.category || 'Uncategorized'}

**Customer Message:**
${ticket.description}

---

Provide a professional response that:
1. Acknowledges their issue
2. Provides helpful information or next steps
3. Ends with an offer for further assistance`,

  CLASSIFICATION_SYSTEM: `You are a ticket classification system. Analyze support tickets and provide structured classification.

Categories available:
- Technical Issue
- Billing Question
- Feature Request
- Account Access
- Bug Report
- General Inquiry

Priority levels:
- Critical: System down, data loss, security issue
- High: Major functionality broken, blocking issue
- Medium: Feature not working as expected, workaround exists
- Low: Minor issue, cosmetic, nice-to-have

Sentiment:
- Frustrated: Customer is upset, angry, or disappointed
- Neutral: Standard inquiry, no strong emotion
- Positive: Customer is satisfied, providing praise

Respond ONLY with valid JSON in this exact format:
{
  "category": "string",
  "priority": "string",
  "sentiment": "string",
  "confidence": number,
  "reasoning": "string"
}`,

  CLASSIFICATION_USER: (ticket) => `Classify this support ticket:

Subject: ${ticket.subject}

Description: ${ticket.description}

Provide classification as JSON.`
};

module.exports = PROMPTS;

Implementing Classification

Add the classification logic to your service. Update /srv/ticket-service.js:

const cds = require('@sap/cds');
const PROMPTS = require('./lib/prompts');
const StreamingLLMClient = require('./lib/streaming-llm');

module.exports = class TicketService extends cds.ApplicationService {
  
  async init() {
    const { Tickets } = this.entities;
    const llmClient = new StreamingLLMClient();
    
    // Handler for generating AI response
    this.on('generateResponse', async (req) => {
      const { ticketId } = req.data;
      
      const ticket = await SELECT.one.from(Tickets).where({ ID: ticketId });
      if (!ticket) {
        return req.error(404, `Ticket ${ticketId} not found`);
      }
      
      try {
        const result = await llmClient.generate({
          model: 'gpt-4o',
          messages: [
            { role: 'system', content: PROMPTS.RESPONSE_SYSTEM },
            { role: 'user', content: PROMPTS.RESPONSE_USER(ticket) }
          ],
          maxTokens: 800,
          temperature: 0.7
        });
        
        await UPDATE(Tickets)
          .set({ aiResponse: result.content })
          .where({ ID: ticketId });
        
        console.log(`Token usage: ${JSON.stringify(result.usage)}`);
        
        return result.content;
      } catch (error) {
        console.error('AI generation error:', error);
        return req.error(500, 'Failed to generate response');
      }
    });
    
    // Handler for classifying ticket
    this.on('classifyTicket', async (req) => {
      const { ticketId } = req.data;
      
      const ticket = await SELECT.one.from(Tickets).where({ ID: ticketId });
      if (!ticket) {
        return req.error(404, `Ticket ${ticketId} not found`);
      }
      
      try {
        // Use a smaller, faster model for classification
        const result = await llmClient.generate({
          model: 'gpt-4o-mini',
          messages: [
            { role: 'system', content: PROMPTS.CLASSIFICATION_SYSTEM },
            { role: 'user', content: PROMPTS.CLASSIFICATION_USER(ticket) }
          ],
          maxTokens: 200,
          temperature: 0.3  // Lower temperature for consistent classification
        });
        
        // Parse the JSON response
        const classification = JSON.parse(result.content);
        
        // Update ticket with classification
        await UPDATE(Tickets)
          .set({
            category: classification.category,
            priority: classification.priority,
            sentiment: classification.sentiment
          })
          .where({ ID: ticketId });
        
        return JSON.stringify(classification);
      } catch (error) {
        console.error('Classification error:', error);
        return req.error(500, 'Failed to classify ticket');
      }
    });
    
    // Auto-classify new tickets
    this.after('CREATE', 'Tickets', async (ticket) => {
      // Fire and forget - classify in background
      setImmediate(async () => {
        try {
          const result = await llmClient.generate({
            model: 'gpt-4o-mini',
            messages: [
              { role: 'system', content: PROMPTS.CLASSIFICATION_SYSTEM },
              { role: 'user', content: PROMPTS.CLASSIFICATION_USER(ticket) }
            ],
            maxTokens: 200,
            temperature: 0.3
          });
          
          const classification = JSON.parse(result.content);
          
          await UPDATE(Tickets)
            .set({
              category: classification.category,
              priority: classification.priority,
              sentiment: classification.sentiment
            })
            .where({ ID: ticket.ID });
          
          console.log(`Auto-classified ticket ${ticket.ID}:`, classification);
        } catch (error) {
          console.error(`Auto-classification failed for ${ticket.ID}:`, error);
        }
      });
    });
    
    await super.init();
  }
};
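
One caveat with the classification handlers above: despite the "respond ONLY with valid JSON" instruction, models occasionally wrap their output in Markdown code fences, which makes JSON.parse throw. A small helper of our own (not part of the SDK) can strip fences before parsing:

/**
 * Parse JSON from a model response, tolerating Markdown code fences.
 */
function parseModelJson(text) {
  const cleaned = text
    .trim()
    .replace(/^```(?:json)?\s*/i, '')  // strip a leading ```json fence
    .replace(/\s*```$/, '')            // strip a trailing fence
    .trim();
  return JSON.parse(cleaned);
}

// Usage in the handlers above:
// const classification = parseModelJson(result.content);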

Understanding Temperature and Token Settings

Two critical parameters affect LLM output:

Temperature (0.0 - 2.0):

  • 0.0-0.3: Deterministic, consistent output. Use for classification, data extraction.
  • 0.5-0.8: Balanced creativity and consistency. Use for general responses.
  • 0.9-1.2: Creative, varied output. Use for brainstorming, creative writing.
  • >1.2: Highly random, often incoherent. Rarely useful.

Max Tokens:

  • Controls the maximum length of the response
  • 1 token ≈ 4 characters in English
  • Set based on expected output length
  • Too low = truncated responses
  • Too high = no direct extra charge (billing is based on tokens actually generated), but it leaves room for unnecessarily long, and therefore more expensive, responses (see the example below)
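
To make this concrete, here is how the two parameters map onto our two tasks; the numbers are illustrative starting points, not SDK defaults:

// Classification: deterministic labels, short JSON output
const classifyOptions = {
  model: 'gpt-4o-mini',
  temperature: 0.3,  // near-deterministic, consistent across runs
  maxTokens: 200     // the JSON result is short
};

// Customer-facing response: some variety, a few paragraphs
const respondOptions = {
  model: 'gpt-4o',
  temperature: 0.7,  // natural, varied phrasing
  maxTokens: 800     // enough for a complete reply without rambling
};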

Token Optimization Strategies

AI services charge per token, so optimization matters for production applications.

1. Choose the Right Model

// For simple tasks, use smaller models
const classificationResult = await llmClient.generate({
  model: 'gpt-4o-mini',  // substantially cheaper per token than gpt-4o
  // ...
});

// For complex tasks, use powerful models
const responseResult = await llmClient.generate({
  model: 'gpt-4o',  // Better quality
  // ...
});

2. Optimize Prompts

// ❌ Verbose prompt (more tokens)
const badPrompt = `
I would like you to please analyze the following customer support ticket 
and then provide me with a detailed and comprehensive response that the 
support agent could potentially use to reply to this customer. The response
should be professional and helpful and address all of the customer's concerns...
`;

// ✅ Concise prompt (fewer tokens)
const goodPrompt = `
Analyze this support ticket and draft a professional response:
${ticket.description}
`;

3. Implement Caching

Create /srv/lib/response-cache.js:

const crypto = require('crypto');

/**
 * Simple in-memory cache for AI responses
 * In production, use Redis or similar
 */
class ResponseCache {
  constructor(ttlMs = 3600000) { // 1 hour default TTL
    this.cache = new Map();
    this.ttlMs = ttlMs;
  }
  
  _generateKey(messages) {
    const content = JSON.stringify(messages);
    return crypto.createHash('md5').update(content).digest('hex');
  }
  
  get(messages) {
    const key = this._generateKey(messages);
    const entry = this.cache.get(key);
    
    if (!entry) return null;
    
    if (Date.now() > entry.expiresAt) {
      this.cache.delete(key);
      return null;
    }
    
    console.log('Cache hit for:', key);
    return entry.value;
  }
  
  set(messages, value) {
    const key = this._generateKey(messages);
    this.cache.set(key, {
      value,
      expiresAt: Date.now() + this.ttlMs
    });
  }
  
  clear() {
    this.cache.clear();
  }
}

module.exports = new ResponseCache();
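
The cache only pays off once it sits in front of the LLM call. A minimal sketch of wrapping StreamingLLMClient.generate with it (the wrapper function is our own):

const responseCache = require('./lib/response-cache');

/**
 * Generate a response, serving identical message arrays from the cache.
 */
async function generateCached(llmClient, options) {
  const cached = responseCache.get(options.messages);
  if (cached) return cached;
  
  const result = await llmClient.generate(options);
  responseCache.set(options.messages, result);
  return result;
}

module.exports = { generateCached };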

4. Monitor Usage

Add usage tracking to your service:

// Track token usage
let totalTokensUsed = 0;

this.on('generateResponse', async (req) => {
  // ... generation logic
  
  const result = await llmClient.generate(options);
  
  // Track usage
  const usage = result.usage;
  totalTokensUsed += usage.total_tokens;
  
  console.log(`Request tokens: ${usage.total_tokens}`);
  console.log(`Session total: ${totalTokensUsed}`);
  
  // In production, store this in a database for billing/monitoring
});
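
To make that history survive restarts, you could persist each request's usage instead of keeping a counter in memory. A sketch assuming a hypothetical TokenUsage entity in the schema (not part of what this series has defined so far):

// Hypothetical entity: support.db.TokenUsage with these fields
await INSERT.into('support.db.TokenUsage').entries({
  model: 'gpt-4o',
  promptTokens: usage.prompt_tokens,
  completionTokens: usage.completion_tokens,
  totalTokens: usage.total_tokens,
  recordedAt: new Date().toISOString()
});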

Testing Our Enhanced Service

Update /test/requests.http:

### Create a new support ticket
POST http://localhost:4004/api/Tickets
Content-Type: application/json

{
  "subject": "Payment failed but money was deducted",
  "description": "I tried to purchase the premium plan yesterday and the payment failed with an error. However, I can see that $99 was deducted from my bank account. I need this resolved urgently as I'm being charged for something I don't have access to. This is very frustrating!"
}

### Get all tickets (check auto-classification)
GET http://localhost:4004/api/Tickets

### Manually classify a ticket
POST http://localhost:4004/api/classifyTicket
Content-Type: application/json

{
  "ticketId": "YOUR-TICKET-ID"
}

### Generate AI response
POST http://localhost:4004/api/generateResponse
Content-Type: application/json

{
  "ticketId": "YOUR-TICKET-ID"
}

### Test streaming (in browser or with curl)
# curl -N http://localhost:4004/api/tickets/YOUR-TICKET-ID/stream-response

Recap

In this post, we've significantly expanded our AI capabilities:

  1. Explored foundation models: Understood the different models available and when to use each
  2. Managed deployments programmatically: Built utilities to list, create, and manage AI Core deployments
  3. Implemented streaming: Added real-time response streaming for better UX
  4. Enhanced our prompts: Created structured, effective prompts for different tasks
  5. Added classification: Auto-classify tickets by category, priority, and sentiment
  6. Optimized for cost: Learned strategies to minimize token usage

Our Support Ticket System now automatically classifies incoming tickets and can generate helpful responses on demand.

Next Steps

In the next post, Orchestrating AI Workflows with SAP AI Core, we'll learn how to:

  • Use the orchestration service to chain multiple AI operations
  • Build complex workflows (classify → analyze → respond)
  • Implement content filtering and guardrails
  • Add templating for consistent prompt management

Stay tuned!

Resources