- 12 Sections
- 82 Lessons
- 10 Weeks
- Module 1 – Big Data & Hadoop Foundations (9 lessons)
  - 1.1 Big Data overview
  - 1.2 Hadoop HDFS commands hands-on
  - 1.3 Hadoop vs Spark – Architectural differences & when to use each
  - 1.4 Hadoop Ecosystem overview – HDFS, YARN, MapReduce
  - 1.5 Apache Hive – Data warehousing on Hadoop
  - 1.6 Apache Sqoop – Importing/exporting data with Sqoop
  - 1.7 Apache Oozie – Workflow scheduling and job orchestration
  - 1.8 Differences between Hadoop & Spark, advantages of Spark
  - 1.9 Hands-on: Creating Hive tables and querying (sketch below)
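A minimal sketch for the Hive hands-on (1.9), assuming a Hive-enabled Spark session; the `retail.orders` database and table names are invented for illustration:

```python
from pyspark.sql import SparkSession

# Hive-enabled session; the retail.orders database/table are placeholders.
spark = (SparkSession.builder
         .appName("hive-hands-on")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS retail")
spark.sql("""
    CREATE TABLE IF NOT EXISTS retail.orders (
        order_id INT, customer STRING, amount DOUBLE
    ) STORED AS PARQUET
""")
spark.sql("INSERT INTO retail.orders VALUES (1, 'alice', 120.5), (2, 'bob', 80.0)")

# Query it back like any warehouse table.
spark.sql("""
    SELECT customer, SUM(amount) AS total
    FROM retail.orders GROUP BY customer
""").show()
```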
- Module 2 – AWS for Data Engineering (6 lessons)
  - 2.1 EC2: Launch Linux/Windows servers, connect via SSH
  - 2.2 S3: Create buckets, use boto3 and S3 CLI commands
  - 2.3 RDS: Create MySQL, MS SQL Server, and PostgreSQL databases
  - 2.4 IAM & Roles: Secure access control for Databricks and Spark jobs
  - 2.5 CloudWatch: Monitor Databricks workloads, set alerts, trigger autoscaling
  - 2.6 Databricks: Mount S3 data and process it (sketch below)
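A hedged boto3 sketch for the S3 basics in 2.2; the bucket name, file paths, and region are placeholders:

```python
import boto3

# Bucket name, file paths, and region are placeholders.
s3 = boto3.client("s3", region_name="us-east-1")

s3.create_bucket(Bucket="my-demo-bucket-12345")  # CLI equivalent: aws s3 mb
s3.upload_file("data/sales.csv", "my-demo-bucket-12345", "raw/sales.csv")

# List what landed under the raw/ prefix.
resp = s3.list_objects_v2(Bucket="my-demo-bucket-12345", Prefix="raw/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

The same operations map directly to the CLI, e.g. `aws s3 mb` and `aws s3 cp`.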
- Module 3 – Azure for Data Engineering (11 lessons)
  - 3.1 Azure Storage: Blob Storage vs Data Lake Storage Gen2, folder structures, ACLs
  - 3.2 Azure Virtual Machines: Provisioning for Spark/Hadoop workloads
  - 3.3 Azure SQL Database: Create, connect, and integrate with Databricks
  - 3.4 ADLS Gen2 integration with Databricks: Mounting, secure access with Azure Key Vault
  - 3.5 Azure Databricks integration: Data Factory triggers, Synapse Analytics connections
  - 3.6 Other Azure concepts: Event Hub, Azure Active Directory, Azure Stream Analytics
  - 3.7 ADF: 20 important activities
  - 3.8 ADF: Data Flow – 20 use cases
  - 3.9 Load data from Azure SQL to Delta Lake
  - 3.10 Use Azure Event Hub as a streaming source
  - 3.11 Hands-on: Mount ADLS to Databricks (sketch below)
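A sketch of the ADLS Gen2 mount from 3.11, following the standard OAuth service-principal pattern in a Databricks notebook (where `dbutils` is predefined); the secret scope, application ID, tenant ID, container, and storage account are placeholders:

```python
# Runs inside a Databricks notebook; all names below are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="kv-scope", key="sp-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://raw@mystorageacct.dfs.core.windows.net/",
    mount_point="/mnt/raw",
    extra_configs=configs,
)

display(dbutils.fs.ls("/mnt/raw"))
```

Backing the client secret with a Key Vault-backed secret scope keeps credentials out of notebook code.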
- Module 4 – Apache Spark Essentials (7 lessons)
  - 4.1 Spark architecture & components
  - 4.2 RDD, DataFrame, Dataset APIs – Use cases & performance tradeoffs
  - 4.3 Transformations vs Actions – Lazy evaluation & DAG execution
  - 4.4 SparkContext, SparkSession, and SQLContext deep dive
  - 4.5 RDDs: 20 different use-case examples
  - 4.6 Hands-on: Build & optimize Spark jobs for CSV, JSON, XML, Avro, Parquet (sketch below)
  - 4.7 DAGs, stages, and memory management in PySpark
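A minimal sketch of lazy evaluation and DAG execution (4.3/4.6); the input paths and column names are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("spark-essentials").getOrCreate()

# Input path is a placeholder. A read defines a DataFrame; nothing runs yet.
orders = spark.read.option("header", True).csv("/data/orders.csv")

# Transformations are lazy: they only extend the logical plan (the DAG).
big_orders = (orders
              .withColumn("amount", F.col("amount").cast("double"))
              .filter(F.col("amount") > 100))

# An action (count/show/write) triggers execution as stages and tasks.
print(big_orders.count())

# Columnar Parquet is usually the preferred sink for downstream Spark jobs.
big_orders.write.mode("overwrite").parquet("/data/out/big_orders")
```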
- Module 5 – PySpark Advanced Data Processing (7 lessons)
  - 5.1 Spark memory management & resource optimization
  - 5.2 Integration with RDBMS (MySQL, Oracle) & NoSQL
  - 5.3 Data ingestion patterns: Batch vs Streaming
  - 5.4 Data pipeline orchestration with Airflow & Oozie
  - 5.5 Date, string, window, and regular-expression functions in PySpark
  - 5.6 Hands-on: Optimizing joins, partitions, caching in PySpark (sketch below)
  - 5.7 Spark job tuning – Shuffle management, broadcast joins, skew handling
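A sketch for the join-tuning hands-on (5.6/5.7); table paths, the join key, and the partition count are illustrative assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join-tuning").getOrCreate()

facts = spark.read.parquet("/data/transactions")   # large table (placeholder)
dims = spark.read.parquet("/data/customers")       # small lookup table

# Broadcast the small side to avoid shuffling the large table.
joined = facts.join(broadcast(dims), "customer_id")

# Repartition on the hot key to spread skewed work more evenly.
joined = joined.repartition(200, "customer_id")

# Cache only when the result is reused by multiple downstream actions.
joined.cache()
print(joined.count())          # first action materializes the cache
joined.groupBy("customer_id").count().show()
```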
- Module 6 – Databricks Fundamentals (8 lessons)
  - 6.1 Databricks vs Spark vs Snowflake – Architectural considerations
  - 6.2 Navigating the Databricks Workspace & notebooks
  - 6.3 DBFS (Databricks File System) commands & utilities
  - 6.4 Managing clusters – Job vs All-Purpose, High Concurrency, autoscaling
  - 6.5 Hands-on: Setting up clusters with AWS & Azure storage integration
  - 6.6 Delta tables: Incremental data loads, update, delete, time travel (sketch below)
  - 6.7 Databricks SQL Warehouses vs clusters – Cost/performance optimization
  - 6.8 Photon execution engine in Databricks
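A sketch of the Delta operations in 6.6, assuming a Delta-enabled session (e.g. a Databricks notebook, where `spark` is predefined); the table path and column values are placeholders:

```python
from delta.tables import DeltaTable

path = "/mnt/demo/customers_delta"   # placeholder Delta table path
dt = DeltaTable.forPath(spark, path)

# In-place UPDATE and DELETE – operations plain Parquet cannot do.
dt.update(condition="id = 42", set={"status": "'inactive'"})
dt.delete("status = 'purged'")

# Time travel: read the table as of an earlier version.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```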
- Module 7 – Databricks & Cloud Storage Integration (6 lessons)
  - 7.1 AWS S3 with Databricks: Buckets, policies, and data access
  - 7.2 Azure Blob & Data Lake Storage Gen2 with Databricks
  - 7.3 Mounting cloud storage in Databricks securely (Secrets utility, Key Vault, IAM)
  - 7.4 Hands-on: Reading/writing large datasets to cloud storage (sketch below)
  - 7.5 Data migration projects
  - 7.6 End-to-end logging strategy (Spark logs, Databricks logs, CloudWatch / Azure Monitor)
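A short sketch for 7.4, assuming a Databricks cluster with IAM/instance-profile access to S3 and a previously created ADLS mount; bucket, mount, and column names are placeholders:

```python
# Read a large Parquet dataset directly from S3 (placeholder bucket).
df = spark.read.parquet("s3a://my-demo-bucket-12345/raw/transactions/")

# Partition output on a low-cardinality column so later scans can prune files.
(df.write
   .mode("overwrite")
   .partitionBy("ingest_date")
   .parquet("/mnt/raw/curated/transactions/"))
```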
- Module 8 – Streaming Data Engineering (5 lessons)
  - 8.1 Structured Streaming in Databricks – Kafka, Event Hub, Kinesis integration
  - 8.2 Handling schema drift, bad records, and regex-based cleaning
  - 8.3 Real-time ingestion into Delta Lake & external databases
  - 8.4 Hands-on: Streaming ETL pipeline from Kafka to Delta Lake with CDC (sketch below)
  - 8.5 Clickstream analytics pipeline (web/app events → Kafka → Structured Streaming)
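A minimal Kafka-to-Delta Structured Streaming sketch for 8.4; the broker address, topic, and output/checkpoint paths are placeholders:

```python
# Placeholders: broker address, topic name, and storage paths.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "orders")
       .option("startingOffsets", "earliest")
       .load())

# Kafka delivers bytes; cast the payload before parsing/cleaning it.
events = raw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

(events.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/chk/orders")
       .outputMode("append")
       .start("/mnt/delta/orders"))
```

The checkpoint location is what gives the pipeline exactly-once recovery after a restart.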
- Module 9 – Delta Lake & Lakehouse Architecture (7 lessons)
  - 9.1 Delta Lake fundamentals – ACID transactions & schema enforcement
  - 9.2 Delta best practices – OPTIMIZE, Z-Order, VACUUM
  - 9.3 Implementing Slowly Changing Dimensions (SCD Types 1 & 2) (sketch below)
  - 9.4 Deduplication techniques in batch & streaming
  - 9.5 Hands-on: Build an end-to-end Lakehouse pipeline
  - 9.6 CI/CD pipelines with Azure DevOps / GitHub Actions / AWS CodePipeline
  - 9.7 Git integration with Databricks Repos
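An illustrative SCD Type 1 upsert for 9.3 using a Delta `MERGE`; the staging path, table name, and join key are assumptions:

```python
from delta.tables import DeltaTable

# Changed rows staged upstream; path and key column are placeholders.
updates_df = spark.read.parquet("/mnt/staging/customer_updates")

target = DeltaTable.forName(spark, "dim_customer")

# SCD Type 1: overwrite matched rows in place, insert new ones.
(target.alias("t")
 .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
 .whenMatchedUpdateAll()
 .whenNotMatchedInsertAll()
 .execute())
```

A Type 2 variant would instead close out the matched row (end date / current flag) and insert the new version, preserving history.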
- Module 10 – Databricks Unity Catalog & Security (6 lessons)
  - 10.1 Centralized governance with Unity Catalog
  - 10.2 Hive metastore vs Unity Catalog
  - 10.3 Schema & table creation, external tables
  - 10.4 Row-level & column-level security, masking strategies
  - 10.5 Role-based access control (RBAC) & Azure Active Directory/IAM integration
  - 10.6 Hands-on: Secure a multi-tenant Databricks environment (sketch below)
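A sketch of Unity Catalog grants for 10.1/10.5, run from a UC-enabled workspace; the catalog, schema, and `data-analysts` group are invented names:

```python
# Catalog, schema, and the data-analysts group are placeholders.
spark.sql("CREATE CATALOG IF NOT EXISTS analytics")
spark.sql("CREATE SCHEMA IF NOT EXISTS analytics.sales")

# A group needs USE CATALOG and USE SCHEMA before SELECT inside them works.
spark.sql("GRANT USE CATALOG ON CATALOG analytics TO `data-analysts`")
spark.sql("GRANT USE SCHEMA ON SCHEMA analytics.sales TO `data-analysts`")
spark.sql("GRANT SELECT ON SCHEMA analytics.sales TO `data-analysts`")
```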
- Module 11 – Advanced Databricks Workflows (4 lessons)
- Capstone Project – Multi-Cloud Data Engineering Pipeline (6 lessons)
  - 12.1 Ingest data from AWS S3 & Azure ADLS into Databricks
  - 12.2 Process batch + streaming data using PySpark
  - 12.3 Store results in Delta Lake and serve to analytics tools
  - 12.4 Apply governance with Unity Catalog
  - 12.5 Monitor with AWS CloudWatch & Azure Monitor (sketch below)
  - 12.6 Clickstream analytics pipeline (web/app events → Kafka → Delta Lake → BI)
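A sketch for the AWS side of 12.5, publishing a custom pipeline metric to CloudWatch; the namespace, metric name, and value are illustrative only (the Azure side would use Azure Monitor's own SDK or diagnostic settings):

```python
import boto3

# Namespace, metric name, and value are illustrative placeholders.
cw = boto3.client("cloudwatch", region_name="us-east-1")
cw.put_metric_data(
    Namespace="CapstonePipeline",
    MetricData=[{"MetricName": "RowsIngested", "Value": 125000.0, "Unit": "Count"}],
)
```

An alarm on such a metric can then alert when a run ingests far fewer rows than expected.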