Topic 1 Question 5
An external customer provides you with a daily dump of data from their database. The data flows into Google Cloud Storage (GCS) as comma-separated values (CSV) files. You want to analyze this data in Google BigQuery, but the data could have rows that are formatted incorrectly or corrupted. How should you build this pipeline?
A. Use federated data sources, and check data in the SQL query.
B. Enable BigQuery monitoring in Google Stackdriver and create an alert.
C. Import the data into BigQuery using the gcloud CLI and set max_bad_records to 0.
D. Run a Google Cloud Dataflow batch pipeline to import the data into BigQuery, and push errors to another dead-letter table for analysis.
Comments (17)
Agreed: D
👍 23 · [Removed] · 2020/03/14: The answer is D. An ETL pipeline should be implemented for this scenario. See the post on handling invalid inputs in Cloud Dataflow:
https://cloud.google.com/blog/products/gcp/handling-invalid-inputs-in-dataflow
ParDos... and don'ts: handling invalid inputs in Dataflow using side outputs as a "dead letter" file
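For reference, here is a minimal Apache Beam (Python) sketch of the pattern described in that post, not the post's exact code: a ParDo parses each CSV line, valid rows are written to the main BigQuery table, and rows that fail parsing are routed through a tagged side output to a separate dead-letter table. The bucket, project, table names, and the two-column schema are hypothetical placeholders.

# Sketch only: batch pipeline with a dead-letter side output for bad CSV rows.
import csv
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class ParseCsvRow(beam.DoFn):
    """Emit parsed rows on the main output, bad rows on a tagged side output."""
    DEAD_LETTER = "dead_letter"

    def process(self, line):
        try:
            # Hypothetical two-column schema: name,value
            name, value = next(csv.reader([line]))
            yield {"name": name, "value": int(value)}
        except (ValueError, StopIteration) as err:
            # Malformed or corrupted rows go to the dead-letter output.
            yield beam.pvalue.TaggedOutput(
                self.DEAD_LETTER, {"raw_line": line, "error": str(err)})


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        results = (
            p
            | "ReadCsv" >> beam.io.ReadFromText("gs://example-bucket/daily/*.csv")
            | "Parse" >> beam.ParDo(ParseCsvRow()).with_outputs(
                ParseCsvRow.DEAD_LETTER, main="valid"))

        # Valid rows land in the analysis table.
        results.valid | "WriteValid" >> beam.io.WriteToBigQuery(
            "my-project:dataset.daily_data",
            schema="name:STRING,value:INTEGER",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)

        # Bad rows land in a dead-letter table for later analysis.
        results[ParseCsvRow.DEAD_LETTER] | "WriteDeadLetter" >> beam.io.WriteToBigQuery(
            "my-project:dataset.daily_data_errors",
            schema="raw_line:STRING,error:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)


if __name__ == "__main__":
    run()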
👍 12 · Radhika7983 · 2020/11/04: Disagree a bit here. Could well be A. One Coursera course (https://www.coursera.org/learn/batch-data-pipelines-gcp/lecture/SkDus/how-to-carry-out-operations-in-bigquery) has a video on when to just use a SQL query to find bad data instead of creating a Dataflow pipeline. The question says "SQL" as a language, not Cloud SQL as a service. Federated sources are great because you can federate a CSV file in GCS with BigQuery. From the video: "In this section, we'll take a look at exactly how BigQuery can help with some of those data quality issues we just described. Let's start with validity. What do we mean by invalid? It can mean things like corrupted data, maybe data that is missing a timestamp."
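A minimal sketch of option A, assuming the google-cloud-bigquery Python client: the daily CSV files in GCS are registered as a temporary external (federated) table, and bad rows are filtered out directly in the SQL query with SAFE_CAST. The project, bucket, and column names are hypothetical.

# Sketch only: query GCS CSV files in place via a federated (external) table.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

# Define the GCS CSV files as a temporary external data source.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://example-bucket/daily/*.csv"]
external_config.schema = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("value", "STRING"),  # read as STRING, validate in SQL
]
external_config.options.skip_leading_rows = 1

# Drop rows that are missing a name or whose value is not a valid integer.
sql = """
    SELECT name, SAFE_CAST(value AS INT64) AS value
    FROM daily_csv
    WHERE name IS NOT NULL
      AND SAFE_CAST(value AS INT64) IS NOT NULL
"""

job_config = bigquery.QueryJobConfig(
    table_definitions={"daily_csv": external_config})

for row in client.query(sql, job_config=job_config).result():
    print(row.name, row.value)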
👍 5 · fire558787 · 2021/08/16