Your Big Data Buzzword Dictionary in 6 minutes
Data is the new oil. With evolving data technologies, organizations are able to unlock business value from data. However, with all these fancy big data technologies and buzzwords, can you tell what each of them means and how it is used?
You can use this story as a quick big data buzzword dictionary, or even as a cheat sheet for your data engineer interview, to sort out the terms that people always confuse or mix up.
A quick quiz: can you explain the differences between Python, Scala, Hive, Spark and Hadoop? How about YARN, MapReduce, or even Databricks? If you can clearly explain them all, you can probably stop reading here. If instead they all sound so similar that you can't really tell them apart, you may find this story useful.
What are Spark, Hadoop and Hive?
All three are actually Apache software projects. Apache Hadoop and Apache Spark are both open-source frameworks for big data processing.
The key differences between them are:
Data processing: Hadoop uses MapReduce, while Spark uses Resilient Distributed Datasets (RDDs)
File storage: Hadoop ships with the Hadoop Distributed File System ("HDFS"), which stores data files across multiple machines, while Spark has no file system of its own and is mainly used for computation on top of existing storage (which could be Hadoop / Azure Blob / local storage)
Performance: Hadoop is designed for batch processing of huge volumes, but not for iterative data processing, while Spark is designed for in-memory iterative data processing; see the sketch after this list (there are a number of research papers deep-diving into MapReduce vs. RDDs that explain the differences and how they operate)
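To make the RDD idea concrete, here is a minimal PySpark sketch (the file path is made up for illustration). The key point is that an RDD can be cached in memory, so repeated passes over the data don't re-read it from disk:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Build an RDD from a file (the path is hypothetical; HDFS, Azure Blob
# or local storage all work) and cache it in memory
lines = sc.textFile("hdfs:///data/logs.txt").cache()

# Each pass below reuses the cached, in-memory RDD instead of
# re-reading from disk, which is what makes iterative jobs fast
errors = lines.filter(lambda l: "ERROR" in l).count()
warnings = lines.filter(lambda l: "WARN" in l).count()
```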
Which one is the best framework?
Spark itself is also a Hadoop-compatible cluster-computing platform, so it can easily run on datasets stored in Hadoop. Developers can adopt it through the PySpark API (in Python) or in a Databricks notebook.
Many people describe Spark as a Hadoop enhancement to MapReduce, with data processing speeds up to 100x faster than MapReduce. So what is the value of Hadoop itself? Can we just adopt Spark for all big data applications?
Even with the performance differences, Hadoop itself is still very useful thanks to HDFS, which protects data against hardware failure and scales to multiple machines for big data sets (from gigabytes to petabytes). In reality, however, not every organization has the huge data volume that requires Hadoop. Spark can be a more efficient framework for organizations to enable big data applications (e.g. ML, AI) without Hadoop.
For individuals, given that Hadoop MapReduce is natively Java-based, it can be hard for data analysts without a developer background to pick up. Spark can be a good starting point for beginners.
How about Hive?
You may notice that the Hive icon looks similar to the Hadoop elephant. That's simply because Apache Hive is built on top of Apache Hadoop to provide data query and analysis as a distributed data warehouse system. It operates on HDFS and provides data operations through a SQL-like interface called HiveQL. In other words, it enables SQL-based querying on Hadoop, so you can run queries and analysis just like on a relational database.
Then what is the Hive equivalent in the Spark framework? It's Spark SQL, a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
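As a concrete illustration, here is a minimal PySpark sketch of Spark SQL (the table, columns and data are made up). A DataFrame is registered as a temporary view and queried with SQL, much as HiveQL queries a Hive table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Build a small DataFrame and expose it as a temporary SQL view
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

# A plain SQL query over the view; the same statement could run as
# HiveQL against a Hive table
spark.sql("SELECT name FROM people WHERE age > 30").show()
```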
What are Python and Scala?
They are the programming languages that can be used with Apache Spark. In particular, PySpark and Scala Spark are the two most popular choices in the big data community for the Spark framework. You may also hear of people using the Java and R APIs.
In fact, these programming languages are used in many other kinds of development (both frontend and backend systems/applications); they are not limited to data work.
How to choose the most suitable programming language?
Choosing the right language is an important decision for a development team / enterprise. It's hard to switch once you have developed core libraries in one language.
There are some common misconceptions, like "Scala is 10x faster than Python", that make Scala sound much more powerful than Python. Here are some fun comments from online discussions and my own experience to give you some ideas:
Python
- Python is a first class citizen in Spark
- More people are familiar with Python than Scala and R nowadays (R was the most popular when I learnt data science eight years ago)
- Python has great libraries, but most of them are not Spark-compatible, i.e. they don't run distributed across the cluster (see the sketch after this list for one common bridge)
- Python is more analytics-oriented, with NLP / machine learning / artificial intelligence libraries
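As one example of that compatibility gap: single-machine libraries such as pandas don't automatically distribute, but Spark offers bridges like pandas UDFs. A minimal sketch (the DataFrame and function are made up for illustration; requires pyarrow):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

@pandas_udf("double")
def times_two(v: pd.Series) -> pd.Series:
    # Ordinary pandas code, but Spark ships it out to each partition
    return v * 2

df.select(times_two("value")).show()
```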
Scala
- Many people feel Scala is a complex language because it is a compiled, type-safe language
- It offers certain features that PySpark cannot, like typed Datasets (in Python you only get DataFrames). Compile-time checks give an awesome developer experience when working in an IDE compared with a notebook (e.g. Databricks); see the sketch after this list
- Scala is more engineering-oriented; it still has data science libraries, but it is relatively weak in visualization
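To make the type-safety contrast concrete from the Python side: in PySpark, a mistake such as a misspelled column name passes every static check and only surfaces at runtime, whereas Scala's typed Datasets would reject it at compile time. A minimal sketch (names are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("typo-demo").getOrCreate()
df = spark.createDataFrame([("alice", 34)], ["name", "age"])

# The typo below is perfectly valid Python; it only fails when this
# line executes, raising an AnalysisException from Spark
df.select("agee")
```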
Conclusion? There is no simple answer
In reality, there are many other factors you may consider beyond performance.
- Usage / business use cases
- Performance
- Ease of use
- Programmer preferences / Team skillset
- Specific features / library ecosystems
If you are an individual developer without an enterprise-preferred language, Scala (despite its more arcane syntax) is the more powerful way to utilize the full potential of Spark, and it gives you access to the latest Spark features first, as Apache Spark itself is written in Scala.
BONUS: YARN, MapReduce and Databricks
YARN ("Yet Another Resource Negotiator"), MapReduce and HDFS are key components of the Hadoop ecosystem, used at different stages of data processing. We touched briefly on MapReduce and HDFS in the Spark vs. Hadoop section.
YARN: Cluster resource manager that schedules tasks and allocates resources to applications in Hadoop.
MapReduce: Splits a big data processing job into smaller tasks, distributes them across different nodes, runs them in parallel, then aggregates the results (see the sketch after this list)
HDFS: File storage system that manages large data sets running on commodity hardware. It also provides high-throughput data access, high fault tolerance and scalability
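MapReduce itself is Java-native, but Hadoop Streaming lets any executable act as the map and reduce steps, so the classic word-count pattern can be sketched in Python (the file name and invocation are illustrative; in practice the two steps are usually separate scripts passed to hadoop-streaming.jar):

```python
# wordcount.py -- a minimal map/reduce sketch in Hadoop Streaming style
import sys

def mapper():
    # Map step: emit "word<TAB>1" for every word in the input split
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce step: input arrives sorted by key, so counts per word
    # can be summed in one streaming pass
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    # Local dry run: python wordcount.py map < in.txt | sort |
    #                python wordcount.py reduce
    mapper() if sys.argv[1] == "map" else reducer()
```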
There are many other components within the Hadoop ecosystem (e.g. HBase as a NoSQL database, Pig for scripting, Oozie for scheduling) that you should explore further if you are keen to learn more about Hadoop.
A new generation data platform — Databricks
Databricks provides a unified, open platform built on Apache Spark and optimized for performance. Databricks notebooks support multiple programming languages, including Python and Scala.
It optimizes the developer experience for data engineers and data scientists with a user-friendly UI for development, dataset management, cluster management and Spark job flows, and it is easy to deploy on different public cloud providers.
It's difficult to tell the differences without real experience and coding. If you are really keen to get your hands dirty, you should check out the frameworks and programming languages above! Just pick an online course and start your journey now!