Topic 1 Question 49
Your company produces 20,000 files every hour. Each data file is formatted as a comma-separated values (CSV) file that is less than 4 KB. All files must be ingested on Google Cloud Platform before they can be processed. Your company site has a 200 ms latency to Google Cloud, and your Internet connection bandwidth is limited to 50 Mbps. You currently deploy a secure FTP (SFTP) server on a virtual machine in Google Compute Engine as the data ingestion point. A local SFTP client runs on a dedicated machine to transmit the CSV files as is. The goal is to make reports with data from the previous day available to the executives by 10:00 a.m. each day. This design is barely able to keep up with the current volume, even though the bandwidth utilization is rather low. You are told that, due to seasonality, your company expects the number of files to double for the next three months. Which two actions should you take? (Choose two.)
A. Introduce data compression for each file to increase the rate of file transfer.
B. Contact your Internet service provider (ISP) to increase your maximum bandwidth to at least 100 Mbps.
C. Redesign the data ingestion process to use the gsutil tool to send the CSV files to a storage bucket in parallel.
D. Assemble 1,000 files into a tape archive (TAR) file. Transmit the TAR files instead, and disassemble the CSV files in the cloud upon receiving them.
E. Create an S3-compatible storage endpoint in your network, and use Google Cloud Storage Transfer Service to transfer on-premises data to the designated storage bucket.
Comments (17)
E cannot be correct: the Transfer Service is recommended for connections of 300 Mbps or faster. https://cloud.google.com/storage-transfer/docs/on-prem-overview
Bandwidth is not an issue, so B is not an answer.
Cloud Storage uploads get better throughput the larger the files are, so making them smaller with compression does not seem like a solution. The -m option for parallel uploads is what is recommended. Therefore A is not an answer and C is. https://medium.com/@duhroach/optimizing-google-cloud-storage-small-file-upload-performance-ad26530201dc
That leaves D as the other option. It is true you cannot use tar directly with gsutil, but you can load the TAR file to Cloud Storage, move it to a Compute Engine instance running Linux, use tar to unpack the files, and copy them back to Cloud Storage. Batching many files into a larger TAR will improve Cloud Storage throughput.
So, given the alternatives, I think the answer is CD.
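A rough sketch of what that C+D flow could look like; every bucket, directory, and archive name below is a placeholder, not something given in the question or the comment:

  # On premises: batch the CSVs (e.g. ~1,000 at a time) into a single TAR, then upload archives in parallel
  tar -cf /staging/batch-0001.tar -C /data/csv .
  gsutil -m cp /staging/batch-*.tar gs://example-ingest/incoming/

  # On a Compute Engine Linux VM: fetch an archive, unpack it, and copy the CSVs back
  gsutil cp gs://example-ingest/incoming/batch-0001.tar .
  tar -xf batch-0001.tar
  gsutil -m cp ./*.csv gs://example-ingest/csv/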
👍 46

Toto2020, 2020/12/16:
Should be AC
👍 32

[Removed], 2020/03/20:
20k files * 24 hours = 480k files; doubled for seasonality and at 4 KB each, that is about 3.8 GB every day, which must be ingested within 10 hours.
- Either C or E (not both): the total size is less than 1 TB per day, so C (gsutil) is the right tool.
- With gsutil, the process can be parallelized (so you can actually utilize the bandwidth), and it can also compress data in transit (to increase the transfer rate), with no need to decompress at the target (no untar or decompress step to slow down processing there). So A is better than D.
The following command does the job (-m for parallel transfers, -j to gzip the .csv files in transit): gsutil -m rsync -j csv sourceDir gs://targetBucket
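For comparison, a one-shot parallel copy (instead of rsync) could achieve the same thing; the paths and bucket below are placeholders. Note that -j applies gzip transport encoding, so the CSVs are compressed only on the wire and land uncompressed in the bucket, whereas -z would store them gzip-encoded:

  # Hypothetical paths and bucket; -m parallelizes, -j csv gzips .csv files in transit only
  gsutil -m cp -j csv /data/outbox/*.csv gs://example-ingest/csv/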
👍 3

Tanzu, 2022/02/12