The typical response I get when I mention that we use Spark is something along the lines of “Oh, it must be about the extra speed over Hadoop you get from the in-memory processing.” Speed and in-memory processing are certainly two things Spark is known for, and both are touted prominently on the project’s website. However, neither is among the primary reasons why I invested resources to move my team to Spark as our default Big Data framework. Let’s take a look at what makes the difference.
First of all, with PySpark you can express your processing in Python easily and quickly. This is not necessarily a major point for a regular development team (for which I’d gravitate towards Spark’s Scala API), but it certainly helps a data science team. Python is well known among data scientists, and it lets us tap into existing processing libraries. For example, integrating existing code to fetch malware samples and dissect them for static analysis has been a breeze.
While Pig (as an example) has been very useful for us for quick ad-hoc queries and jobs, PySpark also makes it easy to express complex processing. It is easier to write reusable components and to structure the code so that new ideas can be tried out quickly. Furthermore, iterative processing (i.e., processing where later steps depend on the results of earlier ones) is easy to accomplish.
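To make the points about reusable components and iterative processing concrete, here is a minimal sketch, not our production code. The input format (tab-separated host and sample identifier), the function names, and the local master setting are all assumptions made for illustration. Each step is an ordinary Python function over an RDD, and the loop in expand_hosts decides what to do in each round based on what the previous round found.

```python
from operator import add


def parse_event(line):
    """Parse a 'host<TAB>sample' line into a (host, sample) pair; None if malformed."""
    parts = line.strip().split("\t")
    if len(parts) != 2 or not all(parts):
        return None
    return parts[0], parts[1]


def parse_events(lines):
    """Reusable step: RDD of raw lines -> RDD of distinct (host, sample) pairs."""
    return lines.map(parse_event).filter(lambda e: e is not None).distinct()


def samples_per_host(events):
    """Reusable step: RDD of (host, sample) -> RDD of (host, number of samples)."""
    return events.map(lambda e: (e[0], 1)).reduceByKey(add)


def expand_hosts(events, seed_hosts, max_rounds=5):
    """Iterative step: grow a set of hosts by following shared samples.

    Each round uses the hosts found so far, so its input depends on the
    results of the previous round.
    """
    events = events.cache()  # the same data is scanned in every round
    known = set(seed_hosts)
    for _ in range(max_rounds):
        current = known
        samples = set(
            events.filter(lambda e: e[0] in current).map(lambda e: e[1]).distinct().collect()
        )
        hosts = set(
            events.filter(lambda e: e[1] in samples).map(lambda e: e[0]).distinct().collect()
        )
        if hosts <= known:  # nothing new was found, so stop early
            break
        known = known | hosts
    return known


def demo():
    """Run both steps on a tiny in-memory data set (requires pyspark)."""
    from pyspark import SparkConf, SparkContext  # lazy import: parsing works without Spark

    conf = SparkConf().setMaster("local[2]").setAppName("reusable-steps-demo")
    sc = SparkContext.getOrCreate(conf)
    lines = sc.parallelize(["h1\ts1", "h2\ts1", "h2\ts2", "h3\ts2", "h4\ts9", "garbage"])
    events = parse_events(lines)
    print(sorted(samples_per_host(events).collect()))
    print(sorted(expand_hosts(events, {"h1"})))
    sc.stop()
```

Calling demo() runs the whole thing on a local machine. Because the steps are plain functions that take and return RDDs, they can be recombined in other jobs without modification.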
It is also straightforward to wrap unit tests around PySpark code. This has been quite important for us as jobs get more and more complex. Being able to make a change at the end of the data pipeline and quickly test both the change itself and how it affects processing across multiple stages has saved us a lot of headaches.
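As a rough illustration of what such a test can look like, the sketch below uses Python's standard unittest module with a local Spark context. The pipeline stage (clean_hashes), the helper (normalize_hash), and the SHA-256 input format are hypothetical stand-ins rather than code from our jobs. The test is skipped if pyspark is not installed.

```python
import unittest


def normalize_hash(value):
    """Lower-case and strip a hex digest; return None unless it is 64 hex characters."""
    cleaned = value.strip().lower()
    if len(cleaned) == 64 and all(c in "0123456789abcdef" for c in cleaned):
        return cleaned
    return None


def clean_hashes(rdd):
    """Pipeline stage: RDD of raw strings -> RDD of distinct, valid SHA-256 digests."""
    return rdd.map(normalize_hash).filter(lambda h: h is not None).distinct()


class CleanHashesTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        try:
            from pyspark import SparkConf, SparkContext
        except ImportError:
            raise unittest.SkipTest("pyspark is not installed")
        # A local master keeps the test self-contained: no cluster is needed.
        conf = SparkConf().setMaster("local[2]").setAppName("clean-hashes-test")
        cls.sc = SparkContext.getOrCreate(conf)

    @classmethod
    def tearDownClass(cls):
        cls.sc.stop()

    def test_drops_invalid_and_duplicate_values(self):
        digest = "A" * 64
        raw = self.sc.parallelize([digest, " " + digest.lower() + "\n", "not-a-hash"])
        self.assertEqual(clean_hashes(raw).collect(), ["a" * 64])
```

Keeping the per-record logic in a plain function such as normalize_hash means it can be tested without Spark at all, while the test class exercises the stage end to end on a small RDD.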
Spark is also very easy to use on a local laptop—download and unpack the tarball, done. There’s no need to set up a cluster or download a lot of components to get started. We can do a lot of development (including unit testing) on a local machine before moving the code to a cluster in the cloud.
Lastly, there is the ecosystem. Spark comes with a lot of components such as GraphX and MLlib. With Spark Streaming, code developed initially for batch jobs can later be reused for stream processing. While we are not yet making much use of many of these extras (in some cases due to maturity issues), access to these features will allow us to move faster as they become more robust in later versions.
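The reuse of batch code for streaming can be sketched as follows. This is an illustration under assumptions, not something we run in production: the word-count logic, the function names, and the socket source are made up for the example, and it uses the DStream-based Spark Streaming API, which newer Spark releases treat as legacy. The key idea is that a function from RDD to RDD can be applied to a file in a batch job and to every micro-batch of a stream via transform.

```python
from operator import add


def tokenize(line):
    """Split a line of text into lower-cased words."""
    return line.lower().split()


def count_words(lines):
    """Batch logic: RDD of text lines -> RDD of (word, count)."""
    return lines.flatMap(tokenize).map(lambda word: (word, 1)).reduceByKey(add)


def run_batch(sc, path):
    """Apply the logic once to a text file at `path` (a hypothetical input location)."""
    return count_words(sc.textFile(path)).collect()


def run_streaming(sc, host, port, batch_seconds=10):
    """Apply the same logic to every micro-batch read from a text socket."""
    from pyspark.streaming import StreamingContext  # lazy import: only needed here

    ssc = StreamingContext(sc, batch_seconds)
    # transform() hands each micro-batch to count_words as an ordinary RDD.
    ssc.socketTextStream(host, port).transform(count_words).pprint()
    ssc.start()
    ssc.awaitTermination()
```

Both entry points expect an existing SparkContext as sc. The counts in the streaming case are per micro-batch, not running totals across the stream.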
Of course, there is also a cost to using Spark. Aside from developing new connectors, we had to invest a good chunk of time in operationalizing Spark. Small jobs generally ran without problems, but as jobs became more complex and data volumes grew very large, we ran into frequent errors. Spark’s error messages are not always helpful, and the error message you see first is frequently not the root cause of the problem. There are many, many knobs to tune, and some of them are not documented. The documentation in general can be spotty. The good news is that a lot of progress has been made over the last couple of releases.
So at the end of the day, it is about speed: not the speed of execution, but the speed of development and of getting from an idea to results. And yes, if you use it right, your jobs run faster too.