Topic 1 Question 150
You want to build a managed Hadoop system as your data lake. The data transformation process is composed of a series of Hadoop jobs executed in sequence. To separate storage from compute, you decided to use the Cloud Storage connector to store all input, output, and intermediate data. However, you noticed that one Hadoop job runs much more slowly on Cloud Dataproc than in the on-premises bare-metal Hadoop environment (8-core nodes with 100 GB of RAM). Analysis shows that this particular Hadoop job is disk I/O intensive. You want to resolve the issue. What should you do?
A. Allocate sufficient memory to the Hadoop cluster, so that the intermediate data of that particular Hadoop job can be held in memory
B. Allocate sufficient persistent disk space to the Hadoop cluster, and store the intermediate data of that particular Hadoop job on native HDFS
C. Allocate more CPU cores to the virtual machine instances of the Hadoop cluster, so that the networking bandwidth of each instance scales up
D. Attach an additional network interface card (NIC) to each instance, and configure link aggregation in the operating system to use the combined throughput when working with Cloud Storage
Comments (17)
Correct: B
Local HDFS storage is a good option if:
- Your jobs require a lot of metadata operations. For example, you have thousands of partitions and directories, and each file is relatively small.
- You modify the HDFS data frequently or you rename directories. (Cloud Storage objects are immutable, so renaming a directory is an expensive operation: it consists of copying all objects to a new key and deleting them afterwards.)
- You heavily use the append operation on HDFS files.
- You have workloads that involve heavy I/O. For example, you have a lot of partitioned writes, such as the following:
  spark.read().write.partitionBy(...).parquet("gs://")
- You have I/O workloads that are especially sensitive to latency. For example, you require single-digit millisecond latency per storage operation.
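Following that guidance, option B amounts to giving the Dataproc workers enough persistent disk (and optionally local SSD) for on-cluster HDFS to hold the intermediate data, while input and final output stay in Cloud Storage. A minimal provisioning sketch is below; the cluster name, region, and all sizing values are illustrative assumptions, not tuned recommendations.

```shell
# Create a Dataproc cluster whose workers carry enough disk for local HDFS.
# The disk-I/O-intensive job then writes its intermediate data to hdfs://
# paths and only the initial input / final output to gs:// paths.
# All values below (name, region, counts, sizes) are illustrative assumptions.
gcloud dataproc clusters create io-heavy-cluster \
    --region=us-central1 \
    --num-workers=4 \
    --worker-machine-type=n1-highmem-8 \
    --worker-boot-disk-size=1000GB \
    --num-worker-local-ssds=2
```

Attaching local SSDs is optional: when present, Dataproc places HDFS data on them, which further improves I/O throughput for shuffle-heavy intermediate stages.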
👍 32 · [Removed] · 2020/03/25: Answer B. It's the Google-recommended approach to use local disk/HDFS to store intermediate results and Cloud Storage for initial and final results.
👍 11 · Rajokkiyam · 2020/04/02: The correct answer is Option B (adding persistent disk space), for two reasons: the question explicitly says the job is "disk I/O intensive", and Option B stores the intermediate data on local HDFS, which is generally a good choice for I/O-intensive work.
👍 5 · Alasmindas · 2020/11/13