Many students ask me, "Which is better, Hadoop or Spark?"
The answer is: both, depending on the scenario. Hadoop is the origin of big data, built to process data in parallel on disk, whereas Spark was built to process data in memory. Both are important for implementing big data applications.
First of all, "Hadoop" means the combination of HDFS + YARN + MapReduce + Hive/Pig.
Similarly, "Spark" means the combination of HDFS + YARN + Spark Core + Spark SQL/Streaming.
That means HDFS and YARN are common to both Hadoop and Spark. The only difference is the processing engine and its architecture. In other words, Spark is a replacement for Hadoop's processing engine, MapReduce, but not a replacement for Hadoop itself.
Hadoop or Spark Which is the best?
In my experience, Hadoop is highly recommended for understanding and learning big data, whereas Spark is highly recommended for getting a job. The main reason: almost everything you do with Hadoop, you can also do with Spark. Additionally, Spark supports streaming data and machine learning, which Hadoop MapReduce does not. That is why most big data companies look for Spark developers instead of Hadoop-only developers.
Now, if you want to learn Spark, you don't need any special skills, but Scala and SQL knowledge are highly recommended. Every framework must be implemented in a programming language: the Spark framework is written in Scala, while the Hadoop framework is written in Java.
In the programming world, "Hello, World" is the fundamental program; similarly, in big data, word count is the fundamental program. Many freshers get scared when they see the MapReduce word-count program, mainly because it runs to 50-80 lines, while in Spark it is just a few lines. Spark makes the code concise and programmer friendly.
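To show how concise word count is in Spark, here is a minimal sketch. The classic Spark RDD pipeline (flatMap → map → reduceByKey) appears in the comments; the runnable body mirrors the same pipeline on plain Scala collections so it works without a cluster, and the input lines are made up for illustration.

```scala
object WordCount {
  def main(args: Array[String]): Unit = {
    // Illustrative input; in Spark this would come from sc.textFile("input.txt")
    val lines = Seq("spark is fast", "hadoop is reliable", "spark is concise")

    // The equivalent Spark RDD version is essentially:
    //   sc.textFile("input.txt")
    //     .flatMap(_.split(" "))
    //     .map(word => (word, 1))
    //     .reduceByKey(_ + _)
    val counts = lines
      .flatMap(_.split(" "))                    // split each line into words
      .groupMapReduce(identity)(_ => 1)(_ + _)  // count occurrences (Scala 2.13+)

    counts.toSeq.sortBy(-_._2).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```

The whole counting logic is two transformation calls, which is exactly why freshers find Spark so much friendlier than the equivalent MapReduce mapper/reducer/driver classes.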
If you know C, you can easily learn Java; similarly, if you know core Java, you can easily learn Scala. The main reason is that both Scala and Java run on top of the JVM. Likewise, if you know Hadoop, you can easily understand Spark, but it is not mandatory.
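Because Scala compiles to JVM bytecode, your existing Java knowledge carries over directly: Scala code can call Java's standard library with no wrappers. A small illustrative sketch:

```scala
// Scala using a plain Java collection directly: no bindings needed,
// because both languages run on the same JVM.
import java.util.{ArrayList => JArrayList}

object JvmInterop {
  def main(args: Array[String]): Unit = {
    val list = new JArrayList[String]() // a regular java.util.ArrayList
    list.add("hadoop")
    list.add("spark")
    // Scala's concise syntax on top of a familiar Java API
    println(s"tools = $list, size = ${list.size()}")
  }
}
```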
Future of Hadoop
Of course, I am not saying Hadoop is a waste. Many organizations still use Hive, Sqoop, and Pig in production environments, especially in data warehouse projects. For example, if you want to pull incremental data from Oracle, it's difficult with Spark but easy with Sqoop. Spark defaults to the Parquet format, while Hive defaults to ORC; compared with Parquet, ORC is more heavily optimized. So if you only need to process historical data, with no streaming, Hadoop/Hive is highly recommended instead of Spark.
Hadoop improves your big data knowledge, but to get a job, Spark is highly recommended. The main reason: in production environments you receive both streaming and batch data, and Spark is best suited to processing both at the same time.