Model Optimization

Back to Model Serving

Techniques to make models faster and smaller for deployment: ONNX Runtime (cross-framework graph optimization), TensorRT (NVIDIA GPU optimization), quantization (reduce numerical precision, e.g. FP32 to INT8), pruning (remove unimportant weights), and knowledge distillation (train a small student model to mimic a large teacher). Sketches for quantization and distillation below.
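
A minimal sketch of post-training dynamic quantization in PyTorch, followed by an ONNX export so the float model can be served under ONNX Runtime. The toy model, tensor shapes, and file name are placeholders, not part of any real pipeline.

```python
import torch
import torch.nn as nn

# Hypothetical small model standing in for a real trained network.
model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Dynamic quantization: weights stored as INT8, activations quantized
# on the fly at inference time. Works well for Linear/LSTM-heavy models.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Separately, export the float model to ONNX so ONNX Runtime can apply
# its own graph-level optimizations at serving time.
dummy_input = torch.randn(1, 128)
torch.onnx.export(model, dummy_input, "model.onnx", opset_version=17)
```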
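
And a minimal sketch of a knowledge-distillation loss, assuming a trained teacher and a smaller student (both hypothetical here): the student is trained against the teacher's softened outputs plus the true labels.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    # Soft targets: student matches the teacher's softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard cross-entropy against the true labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```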


mlops optimization deployment