Inference

Inference is the runtime phase where a trained model produces outputs for real user requests.

Main Priorities

  • Latency — time from request arrival to response (or to first token)
  • Throughput — requests or tokens served per second
  • Reliability — consistent availability and correct outputs under load
  • Cost efficiency — compute spend per request served
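The tension between the first two priorities can be made concrete with a toy benchmark. The sketch below is illustrative only: `dummy_model` is a hypothetical stand-in for a real model call, and a sequential loop is used so that per-request latency and aggregate throughput can both be read off directly.

```python
import time

def dummy_model(prompt):
    # Hypothetical stand-in for a real model call.
    return prompt[::-1]

def benchmark(requests):
    # Measure per-request latency and overall throughput
    # for a simple sequential serving loop.
    latencies = []
    start = time.perf_counter()
    for r in requests:
        t0 = time.perf_counter()
        dummy_model(r)
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "p50_latency_s": sorted(latencies)[len(latencies) // 2],
        "throughput_rps": len(requests) / elapsed,
    }

stats = benchmark(["hello world"] * 100)
```

In a real server, concurrent requests would overlap, so throughput can rise even as individual latencies grow — which is exactly the trade-off batching exploits.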

Common Levers

  • Quantization — lower-precision weights and activations to cut memory and compute
  • Caching — reuse of previously computed results (e.g. KV caches, response caches)
  • Batching — grouping concurrent requests to improve hardware utilization
  • Hardware selection — matching accelerators to the workload's memory and compute profile
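As an illustration of the first lever, the sketch below implements symmetric int8 quantization in plain Python. It is a minimal sketch under simplifying assumptions (per-tensor scaling, no zero point); function names are hypothetical, and real deployments would rely on a framework's quantization tooling.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map floats into [-127, 127]
    # using a single scale derived from the largest magnitude.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float values from int8 codes.
    return [x * scale for x in q]

w = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(w)
restored = dequantize(q, scale)
```

Storing `q` takes one byte per value instead of four (float32), at the cost of rounding error bounded by half the scale — the basic memory/accuracy trade quantization makes.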