
Build ETL Data Pipeline

Create robust ETL pipelines with AI. Get code, architecture, and best practices for extracting, transforming, and loading data efficiently.

Works with: ChatGPT, Claude, Gemini

Prompt Template

You are a senior data engineer tasked with designing and implementing a comprehensive ETL (Extract, Transform, Load) data pipeline. Create a detailed solution for the following requirements:

**Data Sources:** [DATA_SOURCES]
**Target System:** [TARGET_SYSTEM]
**Data Volume:** [DATA_VOLUME]
**Processing Frequency:** [PROCESSING_FREQUENCY]
**Business Requirements:** [BUSINESS_REQUIREMENTS]

Provide a complete ETL pipeline solution including:

1. **Architecture Overview:** Design the overall pipeline architecture with clear data flow diagrams and component relationships
2. **Extraction Strategy:** Detail how to extract data from each source, including connection methods, incremental vs. full loads, error handling, and data validation
3. **Transformation Logic:** Specify all data transformations needed, including data cleaning, normalization, aggregations, joins, and business rule implementations
4. **Loading Strategy:** Define how data will be loaded into the target system, including staging strategies, upsert logic, and performance optimizations
5. **Code Implementation:** Provide production-ready code examples using appropriate tools and languages (Python, SQL, etc.)
6. **Monitoring & Error Handling:** Include logging, alerting, data quality checks, and failure recovery mechanisms
7. **Performance Optimization:** Recommend indexing strategies, partitioning, parallel processing, and resource allocation
8. **Deployment & Orchestration:** Suggest scheduling tools, dependency management, and CI/CD considerations

Ensure the solution is scalable, maintainable, and follows data engineering best practices.

Variables to Customize

[DATA_SOURCES]

The systems and formats you're extracting data from

Example: PostgreSQL customer database, Salesforce CRM API, CSV files from FTP server, real-time Kafka streams

[TARGET_SYSTEM]

Where the processed data will be stored

Example: Amazon Redshift data warehouse with daily reporting tables

[DATA_VOLUME]

Scale of data being processed

Example: 10GB daily incremental updates, 2TB historical data

[PROCESSING_FREQUENCY]

How often the pipeline should run

Example: Daily batch processing at 2 AM, with real-time streaming for critical events

[BUSINESS_REQUIREMENTS]

Specific business needs and transformations required

Example: Customer 360 view combining sales, support, and marketing data with data quality scores and churn predictions

Example Output

## ETL Pipeline Architecture

### 1. Architecture Overview

The pipeline follows a medallion architecture with Bronze (raw), Silver (cleaned), and Gold (business-ready) layers in Amazon S3, orchestrated by Apache Airflow.

### 2. Extraction Strategy

```python
import pandas as pd

# PostgreSQL extraction with incremental loading. `connection` and
# `last_run_timestamp` are supplied by the pipeline's connection
# manager and watermark store.
def extract_customer_data(connection, last_run_timestamp):
    query = "SELECT * FROM customers WHERE updated_at > %s"
    return pd.read_sql(query, connection, params=[last_run_timestamp])
```

### 3. Transformation Logic

- Data deduplication using customer_id and timestamp
- Standardize phone numbers and addresses
- Calculate customer lifetime value
- Apply data quality scores (completeness, accuracy)

### 4. Loading Strategy

Using UPSERT operations with staging tables:

```sql
MERGE INTO dim_customers
USING staging_customers
  ON dim_customers.customer_id = staging_customers.customer_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
```

### 5. Monitoring

- CloudWatch alerts for pipeline failures
- Data quality metrics dashboard
- Row count validation and schema drift detection

### 6. Performance Optimizations

- Partition Redshift tables by date
- Use the COPY command for bulk loads
- Parallel processing for independent transformations
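The monitoring section above calls for row-count validation and data quality checks. A minimal, dependency-free sketch of such a check, where the field names, thresholds, and sample batch are illustrative rather than taken from any particular pipeline:

```python
def check_batch_quality(rows, required_fields, max_null_rate=0.05, min_rows=1):
    """Return a list of human-readable violations for one extracted batch."""
    violations = []
    if len(rows) < min_rows:
        violations.append(f"row count {len(rows)} below minimum {min_rows}")
        return violations  # no point checking null rates on an empty batch
    for field in required_fields:
        nulls = sum(1 for row in rows if row.get(field) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            violations.append(
                f"{field}: null rate {rate:.1%} exceeds {max_null_rate:.1%}"
            )
    return violations

batch = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 2, "email": None},
    {"customer_id": None, "email": "c@example.com"},
]
violations = check_batch_quality(batch, ["customer_id", "email"])
# two violations: both fields exceed the 5% null-rate threshold
```

In a real pipeline the returned violations would feed the alerting layer (e.g. a CloudWatch metric or an Airflow task failure) rather than just being inspected in code.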

Pro Tips for Best Results

  • Start with a small dataset to test your pipeline logic before scaling to full production volumes
  • Implement comprehensive logging and monitoring from day one - pipeline failures are inevitable and need quick detection
  • Use staging tables and atomic operations to ensure data consistency and enable easy rollbacks
  • Design for idempotency - your pipeline should produce the same results when run multiple times on the same data
  • Include data quality checks at each stage and establish clear SLAs for acceptable data freshness and accuracy
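The idempotency tip above can be sketched in a few lines: if loading is modeled as an upsert keyed on a stable primary key, replaying the same batch leaves the target unchanged. The in-memory dict stands in for a target table, and all names are illustrative:

```python
def load_batch(target, batch):
    """Upsert each record into `target`, a dict keyed by customer_id.

    Re-running the same batch overwrites records with identical values,
    so the load is idempotent: no duplicates, same final state.
    """
    for record in batch:
        target[record["customer_id"]] = record
    return target

target = {}
batch = [
    {"customer_id": 1, "ltv": 120.0},
    {"customer_id": 2, "ltv": 80.5},
]
after_first_run = dict(load_batch(target, batch))
after_second_run = dict(load_batch(target, batch))  # replay the same batch
assert after_first_run == after_second_run  # replays are harmless
```

The same property is what the staging-table MERGE pattern gives you in SQL: matched rows are updated in place instead of inserted again, so a failed run can simply be re-executed.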
