Semester 6 · Year 3 · Even · Core Subject · ★★★ Moderate
CS 604

Big Data Analytics

Study of Hadoop, MapReduce, Spark, NoSQL databases, and big data processing frameworks.

4 Units
28 Topics
4 Credits
60 h Lecture hrs
100 Max marks
Overview
🎯
Why it matters
Facebook processes petabytes of data, and Netflix's recommendation system analyzes billions of records. Big data means big money: companies pay a premium for engineers who can handle massive datasets.
💼
Placement relevance
Data Engineer roles at FAANG companies and analytics positions value Hadoop and Spark skills. It is a growing field, with ₹20-45 LPA for big data specialists, and cloud companies need big data expertise.
🔗
Prerequisites for
Data Engineering · Data Science · Cloud Data Platforms · Stream Processing · Data Warehousing
📚
Recommended books
Hadoop: The Definitive Guide by Tom White · Learning Spark by Holden Karau et al. · Big Data: Principles and Best Practices of Scalable Realtime Data Systems by Nathan Marz and James Warren · MongoDB: The Definitive Guide by Shannon Bradshaw et al.
Curriculum — 4 Units
Unit 1 · 7 Topics
Big Data Basics
Key Formulae
MapReduce: Map(key, value) → Shuffle/Sort → Reduce(key, list<values>)
HDFS: NameNode (metadata) + DataNodes (blocks, replication factor 3)
3Vs (Volume, Velocity, Variety)
Big Data Characteristics (Veracity, Value)
Distributed Systems Concepts
Hadoop Ecosystem Overview
HDFS Architecture
MapReduce Programming Model
YARN (Resource Management)
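The HDFS formula above (blocks spread over DataNodes, replication factor 3) lends itself to a quick storage calculation. The sketch below is only an illustration: the 128 MB block size is an assumption (it is the usual default in Hadoop 2 and later, but it is configurable), and the function name `hdfs_footprint` and the 1024 MB file are hypothetical.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (logical blocks, stored block replicas, raw storage in MB).

    block_size_mb=128 is an assumed default; replication=3 follows the
    Key Formulae entry for HDFS.
    """
    # A file is split into fixed-size blocks; the last block may be smaller.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across DataNodes.
    replicas = blocks * replication
    # The last block is not padded, so raw usage is file size x replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

blocks, replicas, raw_mb = hdfs_footprint(1024)  # hypothetical 1 GB file
print(blocks, replicas, raw_mb)  # 8 24 3072
```

The NameNode would track metadata for the 8 logical blocks, while the DataNodes together hold the 24 replicas.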
Unit 2 · 7 Topics
NoSQL Databases
Key Formulae
CAP: Consistency, Availability, Partition Tolerance (at most 2 of 3; when a network partition occurs, the choice is between Consistency and Availability)
BASE: Basically Available, Soft state, Eventual consistency
CAP Theorem
BASE vs ACID
MongoDB (Document Store)
Cassandra (Wide-Column Store)
HBase (Column-Oriented)
Redis (Key-Value Store)
Graph Databases (Neo4j)
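The CAP and BASE entries above can be made concrete with a toy model. The sketch below is not how any real database works: `Cluster` and `Replica` are hypothetical classes, a single shared counter stands in for timestamps, and conflicts are resolved by a simple last-write-wins rule. It only shows the trade-off during a partition: the CP mode rejects writes, while the AP mode accepts them, lets replicas diverge, and converges later (eventual consistency).

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0

class Cluster:
    """Two replicas; mode is 'CP' or 'AP' (toy model, not a real database)."""
    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False
        self.clock = 0  # simplification: one global logical clock

    def write(self, replica, value):
        if self.partitioned and self.mode == "CP":
            # CP: refuse the write rather than let replicas diverge.
            raise RuntimeError("unavailable during partition")
        self.clock += 1
        # During a partition only the contacted replica sees the write.
        targets = [replica] if self.partitioned else [self.a, self.b]
        for r in targets:
            r.value, r.version = value, self.clock

    def heal(self):
        self.partitioned = False
        # Last-write-wins reconciliation: the highest version is copied everywhere.
        winner = max(self.a, self.b, key=lambda r: r.version)
        for r in (self.a, self.b):
            r.value, r.version = winner.value, winner.version

ap = Cluster("AP")
ap.write(ap.a, "cart=1")
ap.partitioned = True
ap.write(ap.a, "cart=2")          # accepted, replicas now disagree
print(ap.a.value, ap.b.value)     # cart=2 cart=1
ap.heal()
print(ap.a.value, ap.b.value)     # cart=2 cart=2

cp = Cluster("CP")
cp.partitioned = True
try:
    cp.write(cp.a, "cart=1")
except RuntimeError as err:
    print("CP write rejected:", err)
```

Real CP systems usually keep serving requests on the majority side of a partition; rejecting every write is a simplification to keep the sketch short.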
Unit 3 · 7 Topics
Apache Spark
Key Formulae
RDD Operations: Lazy transformations + eager actions
DAG: Directed Acyclic Graph for execution optimization
RDDs (Resilient Distributed Datasets)
Transformations (map, filter, flatMap)
Actions (collect, count, reduce)
Spark SQL & DataFrames
Spark Streaming
MLlib (Machine Learning Library)
Spark vs MapReduce
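The "lazy transformations + eager actions" idea from this unit can be sketched without Spark installed. `LazyDataset` below is a hypothetical toy class, not the Spark API: transformations only record a step in the lineage, and work happens when an action such as `collect` or `count` is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records lineage, computes only on an action."""
    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = lineage  # tuple of (kind, function) steps

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._lineage + (("filter", fn),))

    def _compute(self):
        # Replay the recorded lineage over the source data.
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

calls = []

def square(x):
    calls.append(x)  # record that real work happened
    return x * x

ds = LazyDataset([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
print(len(calls))    # 0  (nothing computed yet)
print(ds.collect())  # [4, 16]
print(len(calls))    # 4  (work happened only on the action)
```

Because the dataset stores only its source and lineage, a lost result can be rebuilt by replaying the same steps, which is the intuition behind RDD lineage for fault tolerance.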
Unit 4 · 7 Topics
Data Analytics & Tools
Key Formulae
ETL: Extract → Transform → Load (data pipeline)
Lambda Architecture: Batch Layer + Speed Layer + Serving Layer
Hive (SQL on Hadoop)
Pig (Data Flow Language)
Apache Kafka (Streaming)
Data Warehousing
ETL Processes
Data Visualization
Real-Time Analytics
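The Extract → Transform → Load pipeline from this unit can be sketched with the Python standard library alone. Everything specific here is an assumption made for illustration: the `orders` table, its columns, and the inline CSV data are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw data: one row has a missing amount, country codes are inconsistent.
RAW_CSV = """order_id,amount,country
1,250.0,in
2,,us
3,100.5,IN
"""

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete rows, fix types, normalise country codes."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        clean.append((int(row["order_id"]), float(row["amount"]), row["country"].upper()))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
print(conn.execute("SELECT country, SUM(amount) FROM orders GROUP BY country").fetchall())
# [('IN', 350.5)]
```

Keeping the three stages as separate functions mirrors how each stage is usually tested and scheduled independently in a pipeline.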
Previous Year Questions
Unit 1 · 2023 · End Semester · 10 marks
Write MapReduce pseudocode for the Word Count problem. Given the input 'hello world hello', show the Map output, the Shuffle phase, and the Reduce output step by step.
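One way to rehearse this question is to simulate the three phases on a single machine. The sketch below is plain Python, not Hadoop code, and the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical; it only mirrors the Map → Shuffle/Sort → Reduce flow for the given input.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit (word, 1) for every word in the input line."""
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle/Sort: group values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Reduce: sum the list of values for each key."""
    return {key: sum(values) for key, values in groups.items()}

mapped = map_phase("hello world hello")
shuffled = shuffle_phase(mapped)
reduced = reduce_phase(shuffled)
print(mapped)    # [('hello', 1), ('world', 1), ('hello', 1)]
print(shuffled)  # {'hello': [1, 1], 'world': [1]}
print(reduced)   # {'hello': 2, 'world': 1}
```

The three printed lines correspond to the three stages the question asks for, so they can be copied into a table with one row per phase.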
Unit 2 · 2023 · End Semester · 8 marks
Explain CAP theorem with examples. For an e-commerce site, would you prioritize CA, CP, or AP? Justify. Compare MongoDB and Cassandra.
Unit 3 · 2022 · End Semester · 6 marks
What are RDDs in Spark? Explain transformations vs actions with examples. Why is Spark faster than MapReduce?
Exam Strategy
🗺️
MapReduce examples
Word count, average calculation, max value: practice 5 problems. Show the Map output (key-value pairs), the Shuffle phase, and the Reduce output. A tabular format helps.
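The average calculation is worth practising because it has a catch: an average of partial averages is generally wrong, so the mapper should emit (sum, count) pairs that can be merged safely. The sketch below is a single-machine illustration in plain Python; the city temperatures and the names `map_record`, `merge`, and `average_by_key` are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def map_record(record):
    """Map: emit (key, (sum, count)) so partial results stay mergeable."""
    city, temp = record
    return city, (temp, 1)

def merge(a, b):
    """Combine two (sum, count) pairs; associative, so it also suits a combiner."""
    return a[0] + b[0], a[1] + b[1]

def average_by_key(records):
    # Shuffle: group the mapped (sum, count) pairs by key.
    groups = defaultdict(list)
    for key, value in map(map_record, records):
        groups[key].append(value)
    # Reduce: merge the pairs per key, then divide once at the end.
    result = {}
    for key, values in groups.items():
        total, count = reduce(merge, values)
        result[key] = total / count
    return result

readings = [("Pune", 30), ("Delhi", 40), ("Pune", 34)]  # hypothetical data
print(average_by_key(readings))  # {'Pune': 32.0, 'Delhi': 40.0}
```

A max-value problem follows the same shape: the mapper emits (key, value) and the merge step becomes `max`.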
🎯
CAP theorem is gold
The CAP theorem and the ACID vs BASE comparison appear in EVERY exam. Make a comparison table. Give examples: MongoDB (usually classified as CP), Cassandra (usually classified as AP).
Spark vs Hadoop
Know why Spark is faster (it keeps intermediate data in memory, while MapReduce writes it to disk between stages), how RDD lineage provides fault tolerance, and what lazy evaluation means. Always asked in exams.
Related Subjects
Semester 5
Machine Learning
CS 501
Semester 4
Database Management Systems
CS 401
Semester 5
Cloud Computing
CS 505