Topic 1 Question 18
You are working with a large dataset of customer reviews stored in Cloud Storage. The dataset contains several inconsistencies, such as missing values, incorrect data types, and duplicate entries. You need to clean the data to ensure that it is accurate and consistent before using it for analysis. What should you do?
A. Use the PythonOperator in Cloud Composer to clean the data and load it into BigQuery. Use SQL for analysis.
B. Use BigQuery to batch load the data into BigQuery. Use SQL for cleaning and analysis.
C. Use Storage Transfer Service to move the data to a different Cloud Storage bucket. Use event triggers to invoke Cloud Run functions to load the data into BigQuery. Use SQL for analysis.
D. Use Cloud Run functions to clean the data and load it into BigQuery. Use SQL for analysis.
User votes
Comments (2)
- Selected answer: A
PythonOperator allows leveraging Python libraries (e.g., Pandas, PySpark) to perform robust data cleaning tasks:
- Handle missing values (e.g., imputation, filtering).
- Fix incorrect data types (e.g., string-to-date conversions).
- Remove duplicates (e.g., using deduplication logic).
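The three cleaning steps listed above can be sketched with pandas; the sample records and column names below are hypothetical, chosen only to mirror the inconsistencies described in the question:

```python
import pandas as pd

# Hypothetical raw review records with the described inconsistencies:
# a duplicate entry, ratings stored as strings, and a missing value.
raw = pd.DataFrame({
    "review_id": ["r1", "r2", "r2", "r3"],
    "rating": ["5", "4", "4", None],
    "review_date": ["2024-01-05", "2024-01-06", "2024-01-06", "2024-01-07"],
})

# Remove duplicate entries.
clean = raw.drop_duplicates(subset="review_id")

# Fix incorrect data types: string -> numeric, string -> datetime.
clean = clean.assign(
    rating=pd.to_numeric(clean["rating"], errors="coerce"),
    review_date=pd.to_datetime(clean["review_date"]),
)

# Handle missing values, e.g. impute with the median rating.
clean = clean.assign(rating=clean["rating"].fillna(clean["rating"].median()))

print(len(clean))                    # 3 rows after deduplication
print(clean["rating"].tolist())      # [5.0, 4.0, 4.5]
```

In an Option A pipeline, a function like this would run inside a PythonOperator task, writing the cleaned frame out before a downstream task loads it into BigQuery.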
👍 2 · SaquibHerman · 2025/02/18

- Selected answer: B
The best option is B: batch load the data into BigQuery and use SQL for both cleaning and analysis. Loading directly into BigQuery and cleaning with SQL gives the best balance of efficiency and simplicity, leveraging BigQuery's scalable processing for both loading and transformation. Option A (Cloud Composer + PythonOperator) adds unnecessary workflow orchestration and external processing before loading. Option C (Storage Transfer Service + Cloud Run) overcomplicates the pipeline with extra data movement and event-driven functions, making it less direct for data cleaning. Option D (Cloud Run functions) is less efficient for large-scale cleaning than BigQuery SQL's parallel processing and adds complexity before the data is in BigQuery for analysis. Loading into BigQuery and cleaning with SQL is therefore the most efficient and straightforward approach for this scenario.
👍 1 · n2183712847 · 2025/02/27
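The "clean with SQL inside the warehouse" pattern from the comment above looks roughly like this. In BigQuery you would use Standard SQL (e.g. `SAFE_CAST`, `COALESCE`, `ROW_NUMBER`); the sketch below uses Python's built-in sqlite3 only so the same dedup/cast/impute query is runnable locally, and the table and values are hypothetical:

```python
import sqlite3

# Load raw data as-is into a staging table, then clean it with SQL --
# the Option B approach, demonstrated here on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_reviews (review_id TEXT, rating TEXT)")
conn.executemany(
    "INSERT INTO raw_reviews VALUES (?, ?)",
    [("r1", "5"), ("r2", "4"), ("r2", "4"), ("r3", None)],
)

# Deduplicate with a window function, cast string ratings to integers,
# and impute missing values with a default.
rows = conn.execute("""
    SELECT review_id,
           COALESCE(CAST(rating AS INTEGER), 0) AS rating
    FROM (
        SELECT review_id, rating,
               ROW_NUMBER() OVER (PARTITION BY review_id) AS rn
        FROM raw_reviews
    )
    WHERE rn = 1
    ORDER BY review_id
""").fetchall()
print(rows)  # [('r1', 5), ('r2', 4), ('r3', 0)]
```

Because the cleaning happens after loading, no extra orchestration or compute service sits between Cloud Storage and BigQuery, which is the efficiency argument for answer B.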