Topic 1 Question 32
You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?
A. Significantly increase the max_batch_size TensorFlow Serving parameter.
B. Switch to the tensorflow-model-server-universal version of TensorFlow Serving.
C. Significantly increase the max_enqueued_batches TensorFlow Serving parameter.
D. Recompile TensorFlow Serving from source to support CPU-specific optimizations, and instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes.
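For context, option D in practice might look like the sketch below. This is a hedged illustration, not a definitive recipe: the exact bazel flags depend on your TensorFlow Serving version, the available CPU platform names vary by GKE zone, and `serving-pool` / `my-cluster` are placeholder names.

```shell
# 1) Rebuild TensorFlow Serving from source with CPU-specific optimizations.
#    -march=native targets the instruction sets of the build machine
#    (e.g. AVX2/FMA), which the prebuilt generic binary does not assume.
git clone https://github.com/tensorflow/serving.git
cd serving
bazel build -c opt --copt=-march=native \
  tensorflow_serving/model_servers:tensorflow_model_server

# 2) Pin the serving nodes to a matching baseline CPU platform, so every
#    node supports the instruction sets the binary was compiled for.
#    (--min-cpu-platform is a real gcloud flag; the platform name is an example.)
gcloud container node-pools create serving-pool \
  --cluster=my-cluster \
  --min-cpu-platform="Intel Skylake"
```

The point of step 2 is that a binary compiled with, say, AVX-512 will crash on a node whose CPU lacks those instructions, so the build target and the node pool's minimum CPU platform must be chosen together.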
Comments (14)
D is correct, since this question focuses on serving performance without changing the underlying infrastructure. The pods are already throttled, so increasing the pressure on them won't help, and both A and C essentially do that. B is a bit mysterious, but we can be confident D would work.
👍 18 · Y2Data · 2021/09/14

It should be A. https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md#batch-scheduling-parameters-and-tuning — "max_batch_size: The maximum size of any batch. This parameter governs the throughput/latency tradeoff, and also avoids having batches that are so large they exceed some resource constraint (e.g. GPU memory to hold a batch's data)." As with D, it will change the infrastructure.
👍 3 · DucLee3110 · 2021/06/30

Answer D. "In addition, optimizing the saved model before deploying it (for example, by stripping unused parts) can reduce prediction latency. If you're training a TensorFlow model, we recommend that you optimize the SavedModel using the Graph Transformation Tools." https://cloud.google.com/architecture/minimizing-predictive-serving-latency-in-machine-learning#optimizing_models_for_serving However, I currently do not understand what "CPU-specific optimizations" exactly means. Any ideas?
A is not correct: a bigger batch size increases latency (i.e., the opposite outcome).
👍 2 · ramen_lover · 2021/11/06
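The max_batch_size debate above can be sanity-checked with simple queueing arithmetic: the first request placed into a batch must wait for the batch to fill before any computation starts. A hedged back-of-the-envelope sketch (the 2,000 QPS figure and batch sizes are made-up illustrations, not measurements from the question):

```python
def batch_fill_wait_ms(batch_size: int, arrival_rate_qps: float) -> float:
    """Worst-case time (ms) the first request in a batch waits for the
    remaining batch_size - 1 requests to arrive and fill the batch."""
    return (batch_size - 1) / arrival_rate_qps * 1000.0

# At an illustrative 2,000 QPS per pod, a batch of 8 fills almost instantly...
print(batch_fill_wait_ms(8, 2000))     # 3.5 ms
# ...but a batch of 1024 adds half a second of queueing before any compute.
print(batch_fill_wait_ms(1024, 2000))  # 511.5 ms
```

This supports the "bigger batch size => higher latency" point: raising max_batch_size trades latency for throughput, which is the wrong direction when latency is the complaint.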