Topic 1 Question 58
You architect a system to analyze seismic data. Your extract, transform, and load (ETL) process runs as a series of MapReduce jobs on an Apache Hadoop cluster. The ETL process takes days to process a data set because some steps are computationally expensive. Then you discover that a sensor calibration step has been omitted. How should you change your ETL process to carry out sensor calibration systematically in the future?
A. Modify the transform MapReduce jobs to apply sensor calibration before they do anything else.
B. Introduce a new MapReduce job to apply sensor calibration to raw data, and ensure all other MapReduce jobs are chained after this.
C. Add sensor calibration data to the output of the ETL process, and document that all users need to apply sensor calibration themselves.
D. Develop an algorithm through simulation to predict the variance of the data output from the last MapReduce job based on calibration factors, and apply the correction to all data.
Comments (15)
Should go with B, for two reasons: it is the cleaner approach, with a single job handling calibration before the data is used anywhere in the pipeline; and doing this step in later stages is more complex, making those jobs harder to maintain in the future.
👍 53

SteelWarrior, 2020/09/22:
Answer: A. My take on this is that for sensor calibration you just need to update the transform function, rather than creating a whole new MapReduce job and storing/passing the values to the next job.
👍 19

[Removed], 2020/03/27:
Should be B. It's a data-quality step which has to go right after raw ingest. Otherwise you repeat the same step an unknown number of times (see "job_s_" in A), possibly for no reason, thereby extending the ETL time.
👍 5

YuriP, 2020/08/03:
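The chained-calibration approach discussed above (option B: one dedicated job applied to raw data, with every other job chained after it) can be sketched as a Hadoop Streaming mapper. This is a minimal illustration, not part of the question: the record format, sensor IDs, and calibration factors are all assumptions, and in a real job the factors would typically ship via the distributed cache rather than be hard-coded.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop Streaming mapper for a dedicated calibration job.

It runs on raw sensor records before any downstream MapReduce job,
so every later step sees only calibrated data.
"""
import sys

# Assumed per-sensor calibration factors; in practice these would be
# loaded from a file shipped with the job via the distributed cache.
CALIBRATION = {"sensor-a": 1.5, "sensor-b": 0.97}

def calibrate(line, factors):
    """Scale one tab-separated `sensor_id<TAB>reading` record by its factor."""
    sensor_id, raw = line.rstrip("\n").split("\t")
    factor = factors.get(sensor_id, 1.0)  # pass unknown sensors through unchanged
    return f"{sensor_id}\t{float(raw) * factor}"

def main():
    # Map-only job: emit exactly one calibrated record per raw input record.
    for line in sys.stdin:
        print(calibrate(line, CALIBRATION))

if __name__ == "__main__":
    main()
```

A mapper like this would be submitted with `hadoop jar hadoop-streaming.jar -mapper calibrate.py ...`, and its output directory used as the input of the first existing transform job, which is what makes the calibration systematic rather than something each downstream job (or user, as in option C) must remember to apply.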