Model Optimization
Techniques to make models faster and smaller for deployment:
- ONNX Runtime (export to a cross-framework graph format and run it through an optimizing inference runtime)
- TensorRT (compile the graph into an engine tuned for NVIDIA GPUs, with kernel fusion and precision calibration)
- Quantization (reduce the numerical precision of weights and activations, e.g. float32 to int8)
- Pruning (zero out or remove unimportant weights to make the network sparser)
- Distillation (train a small "student" model to mimic the outputs of a large "teacher")

Minimal sketches of quantization, pruning, and distillation follow below.
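A minimal sketch of post-training dynamic quantization using PyTorch's `torch.ao.quantization.quantize_dynamic` API. The two-layer model and input sizes here are stand-ins for illustration, not anything from this note.

```python
import torch
import torch.nn as nn

# Stand-in for a trained model; in practice, load your own weights.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Dynamic quantization: Linear weights are stored as int8, and
# activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller weight storage
```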
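Pruning can likewise be sketched with PyTorch's built-in `torch.nn.utils.prune` utilities. The 30% sparsity ratio below is an arbitrary choice for illustration; note that unstructured pruning only zeros weights, so actual speedups require sparse-aware kernels or structured pruning.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # L1 unstructured pruning: zero out the 30% of weights
        # with the smallest absolute value.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent (bake the zeros into the weight tensor
        # and drop the reparameterization hooks).
        prune.remove(module, "weight")
```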
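And a sketch of a standard distillation loss (Hinton-style soft targets). The temperature `T` and mixing weight `alpha` are illustrative hyperparameters, not values from this note.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: match the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after dividing logits by T
    # Hard-target term: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```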
Related
- Inference Optimization (LLM-specific optimization)
- Edge Deployment (optimization for resource-constrained devices)