Topic 1 Question 32
You developed an ML model with AI Platform, and you want to move it to production. You serve a few thousand queries per second and are experiencing latency issues. Incoming requests are served by a load balancer that distributes them across multiple Kubeflow CPU-only pods running on Google Kubernetes Engine (GKE). Your goal is to improve the serving latency without changing the underlying infrastructure. What should you do?
A. Significantly increase the max_batch_size TensorFlow Serving parameter.
B. Switch to the tensorflow-model-server-universal version of TensorFlow Serving.
C. Significantly increase the max_enqueued_batches TensorFlow Serving parameter.
D. Recompile TensorFlow Serving from source to support CPU-specific optimizations, and instruct GKE to choose an appropriate baseline minimum CPU platform for serving nodes.
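For context, option D in practice might look like the sketch below. This is a hedged illustration, not a definitive recipe: the exact bazel flags depend on your TensorFlow Serving version, the available CPU platform names vary by GKE zone, and `serving-pool` / `my-cluster` are placeholder names.

```shell
# 1) Rebuild TensorFlow Serving from source with CPU-specific optimizations.
#    -march=native targets the instruction sets of the build machine
#    (e.g. AVX2/FMA), which the prebuilt generic binary does not assume.
git clone https://github.com/tensorflow/serving.git
cd serving
bazel build -c opt --copt=-march=native \
  tensorflow_serving/model_servers:tensorflow_model_server

# 2) Pin the serving nodes to a matching baseline CPU platform, so every
#    node supports the instruction sets the binary was compiled for.
#    (--min-cpu-platform is a real gcloud flag; the platform name is an example.)
gcloud container node-pools create serving-pool \
  --cluster=my-cluster \
  --min-cpu-platform="Intel Skylake"
```

The point of step 2 is that a binary compiled with, say, AVX-512 will crash on a node whose CPU lacks those instructions, so the build target and the node pool's minimum CPU platform must be chosen together.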
Comments (14)
D is correct, since this question focuses on serving performance without changing the underlying infrastructure. The pods are already throttled, so increasing the pressure on them won't help, and both A and C essentially do that. B is a bit mysterious, but we can be confident D would work.
👍 18 · Y2Data · 2021/09/14

It should be A. https://github.com/tensorflow/serving/blob/master/tensorflow_serving/batching/README.md#batch-scheduling-parameters-and-tuning — "max_batch_size: The maximum size of any batch. This parameter governs the throughput/latency tradeoff, and also avoids having batches that are so large they exceed some resource constraint (e.g. GPU memory to hold a batch's data)." As with D, it will change the infrastructure.
👍 3 · DucLee3110 · 2021/06/30

Answer D. "In addition, optimizing the saved model before deploying it (for example, by stripping unused parts) can reduce prediction latency. If you're training a TensorFlow model, we recommend that you optimize the SavedModel using the Graph Transformation Tools." https://cloud.google.com/architecture/minimizing-predictive-serving-latency-in-machine-learning#optimizing_models_for_serving However, I currently do not understand what "CPU-specific optimizations" exactly means. Any ideas?
A is not correct: a bigger batch size increases latency (i.e., the opposite outcome).
👍 2 · ramen_lover · 2021/11/06
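The max_batch_size debate above can be sanity-checked with simple queueing arithmetic: the first request placed into a batch must wait for the batch to fill before any computation starts. A hedged back-of-the-envelope sketch (the 2,000 QPS figure and batch sizes are made-up illustrations, not measurements from the question):

```python
def batch_fill_wait_ms(batch_size: int, arrival_rate_qps: float) -> float:
    """Worst-case time (ms) the first request in a batch waits for the
    remaining batch_size - 1 requests to arrive and fill the batch."""
    return (batch_size - 1) / arrival_rate_qps * 1000.0

# At an illustrative 2,000 QPS per pod, a batch of 8 fills almost instantly...
print(batch_fill_wait_ms(8, 2000))     # 3.5 ms
# ...but a batch of 1024 adds half a second of queueing before any compute.
print(batch_fill_wait_ms(1024, 2000))  # 511.5 ms
```

This supports the "bigger batch size => higher latency" point: raising max_batch_size trades latency for throughput, which is the wrong direction when latency is the complaint.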