Scala is becoming a major influence in the big data world -- Spark, Kafka and much of Twitter's data stack are written in it -- so it's probably as good a time as any to learn it. You can do just that next month at the Scala by the Bay and Big Data Scala conferences, taking place August 13-18 in Oakland.
The conferences will cover a lot of ground with regard to programming in Scala and its associated ecosystem, including the so-called SMACK (Spark, Mesos, Akka, Cassandra, Kafka) stack for building big data pipelines. We recently spoke with conference organizer Alexy Khrabrov, who discussed why Scala is so important, how the big data ecosystem is shaping up around Spark, and what attendees can expect to learn at next month's events.
Here's what he had to say.
MESOSPHERE: Can you describe what people should expect from the Scala by the Bay and Big Data Scala events?
ALEXY KHRABROV: This is the third year of Scala by the Bay. It's a classic, traditional kind of Scala developer conference similar to Scala Days. You have APIs, Akka, Play, functional programming, and all the other good stuff about Scala such as type safety and idiomatic functional style. It's for folks who know Scala pretty well, or folks who know other programming languages pretty well -- maybe who want to upgrade from Java, for example.
Scala by the Bay takes place August 14 and 15 at the Kaiser Center in Oakland, and Twitter just added a FinagleCon day -- Finagle is Twitter's basic RPC fabric -- on August 13 at the new Twitter building.
Then we have complete data pipeline training on Sunday, August 16, which leads into Big Data Scala on Monday and Tuesday, August 17 and 18. That will focus on end-to-end big data pipelines in Scala, and on data science on the JVM. So we'll have talks about Akka, Kafka, Spark, Cassandra, Mesos, Scalding and basically big data systems. We'll also have talks on machine learning, data processing, data mining and more using Scala.
Is there a connection to Mesos in all of this?
As many folks know, Mesos was a project at AMPLab to manage distributed clusters, and Spark was a little demo app for Mesos. So there's an inherent connection in that they were both developed by the same group, and then Ben Hindman went to Twitter to productionize Mesos and Matei Zaharia went on to Databricks to commercialize Spark.
And the connection is still there because Spark runs very well on Mesos. In fact, many consultants will deploy Mesos for you as part of a Spark installation, so you don't need the whole Hadoop ecosystem to start running Spark. You can just run it standalone, often in production, with features such as resilience and resource management. Mesos is a well-understood way to run Spark flows.
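For a concrete taste of how simple that can be, here is a minimal sketch of submitting a Spark job straight to a Mesos master; the host, port, class and jar names are placeholders invented for illustration, not anything from the conference setup:

```
# Point spark-submit at a Mesos master instead of YARN or Spark standalone.
# The master URL, class name and jar are hypothetical.
spark-submit \
  --master mesos://mesos-master.example.com:5050 \
  --class com.example.PipelineJob \
  --conf spark.executor.memory=4g \
  pipeline-assembly.jar
```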
With Mesosphere helping with pipeline training at the conference, I think it will become obvious that you can start Spark, Kafka and Cassandra all as services. It's a very nice and easy way to productionize the pipelines we're talking about.
And that -- Spark, Mesos, Akka, Cassandra, Kafka -- is called the SMACK stack?
Yes, that's the acronym. You can also call it the SMACK HARD stack, because it's highly available, resilient and distributed by default.
What's fundamentally better or different about this new stack than the old way of doing things?
I think it signifies, maybe, the teenage years of big data. Big data itself appeared pretty recently -- it's all been less than 10 years. When Hadoop emerged, it was basically MapReduce jobs and HDFS, and then HBase, but the way to run those jobs was extremely convoluted.
The emergence of Spark owes directly to its simplicity and the high level of abstraction of Scala. You have these distributed collections that are functional by default, and you apply transformations to them. And also they're interactive, so you can play with them just like you do with Python or Ruby. Spark takes that exact idea, without any change, and puts it on a cluster.
In my view, Spark is basically distributed Scala, and we'll have some talks connecting the collections in Scala and the resilient distributed datasets (RDDs) in Spark. I think it's the right way for many data scientists and computer scientists to look at data. Data is basically these sets, and in MapReduce you don't have this inherent notion. It's gotten lost in a lot of glue, and Spark gives you this overall approach where it becomes very easy and interactive to experiment with data.
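To make the "distributed Scala" point concrete, here is a minimal sketch contrasting the two APIs; the SparkContext setup and data are invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistributedScalaSketch {
  def main(args: Array[String]): Unit = {
    // Plain Scala: functional transformations on a local collection.
    val local = (1 to 100).filter(_ % 2 == 0).map(_ * 2)

    // Spark: the same transformations on a resilient distributed dataset.
    // A local master is used here so the sketch runs standalone; on a real
    // cluster the master comes from spark-submit.
    val sc = new SparkContext(
      new SparkConf().setAppName("distributed-scala").setMaster("local[*]"))
    val distributed = sc.parallelize(1 to 100).filter(_ % 2 == 0).map(_ * 2)

    println(local.take(5))             // Vector(4, 8, 12, 16, 20)
    println(distributed.take(5).toSeq) // same values, computed by Spark
    sc.stop()
  }
}
```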
Mesos is a great way to run this thing because while Spark is beautiful, you need to run it 24-7 if you want to run it as a business. Twitter clearly showed that it's a successful business that can run at web scale 24-7 on Mesos.
Cassandra is a very convenient NoSQL store, and all this processed data needs to be persisted somewhere, so Cassandra shows up as a very easy way to persist it. Akka and Kafka are the pieces of the data pipeline that take data from the API and pump it into Spark. So, together, the SMACK stack basically gives you an end-to-end data pipeline.
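To show how those pieces snap together in code -- this is not the conference's training material, just a rough sketch assuming Spark Streaming, the spark-streaming-kafka artifact and the DataStax spark-cassandra-connector, with every host, topic, keyspace and table name invented:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector.streaming._

object SmackPipelineSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("smack-pipeline")
      .set("spark.cassandra.connection.host", "127.0.0.1") // hypothetical
    val ssc = new StreamingContext(conf, Seconds(5))

    // Kafka feeds events from the front end into Spark Streaming.
    val events = KafkaUtils.createStream(
      ssc, "zookeeper:2181", "pipeline-group", Map("events" -> 1))

    // Transform each micro-batch in Spark, then persist it to Cassandra.
    events.map { case (_, line) => (line, line.length) }
      .saveToCassandra("pipeline", "events") // keyspace and table are invented

    ssc.start()
    ssc.awaitTermination()
  }
}
```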
Explain the data pipeline training sessions you'll be doing at the events.
We have multiple focused courses happening, but I'm going to talk specifically about the complete pipeline training. This is the unique training we're building specifically for this conference, and I don't think anybody has had this type of training before.
We're taking the five pieces of the SMACK stack and we're building an open source repository that will show a single app with all the pieces. The app is going to be Spark After Dark, a dating site where you can match yourself with other people and use machine learning to find a life partner. We'll build the app by taking a web app with your preferences, and then going to Kafka, doing the machine learning on Spark, using Cassandra as a data store, and the whole thing will run on Mesos.
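For a flavor of the machine-learning piece, matching is often framed as collaborative filtering; a minimal sketch using MLlib's alternating least squares, where the ratings file and all IDs are invented for illustration, might look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object MatchSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("match-sketch").setMaster("local[*]"))

    // Hypothetical input: "userId,profileId,score" lines; in the full
    // pipeline this would arrive via Kafka rather than a text file.
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(u, p, score) = line.split(',')
      Rating(u.toInt, p.toInt, score.toDouble)
    }

    // Train a collaborative-filtering model and suggest ten matches.
    val model = ALS.train(ratings, rank = 10, iterations = 10, lambda = 0.01)
    model.recommendProducts(user = 42, num = 10).foreach(println)
    sc.stop()
  }
}
```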
It will be a hands-on training for hundreds of developers. All five segments of the SMACK stack will be presented by experienced engineers from companies very familiar with the technologies. We'll have folks from Typesafe, Confluent, DataStax, Databricks and Mesosphere. Mesosphere will actually open the training by showing
how to use the Mesosphere DCOS cluster to spin up the Spark, Cassandra and Kafka clusters. Each segment will involve lectures and explanations of what these things are while following the code, and we'll also ask folks to do some light coding.
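As a hedged sketch of what that opening segment might look like from the command line, assuming a running DCOS cluster and the Mesosphere CLI of that era:

```
# Install the core SMACK services as packages on the cluster.
dcos package install spark
dcos package install cassandra
dcos package install kafka

# Then hand work to the Spark service; the jar and class are hypothetical.
dcos spark run --submit-args='--class com.example.PipelineJob pipeline-assembly.jar'
```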
You'll basically end the day with a complete webscale startup similar to Match.com. Hopefully, you'll take this knowledge and build your own startup, or take these technologies and use them to improve the business at your existing company. We want it to inspire people and make them better engineers.
So is Hadoop still a piece of this puzzle?
At this point, people mean different things by "Hadoop."
It used to be essentially MapReduce jobs and HDFS, HBase and Hive, but now many of these pieces have been replaced by pieces of the Spark ecosystem. MapReduce has certainly given way to Spark for most datasets, when they can fit in RAM and even when they can't. Hive has started to give way to Spark SQL, which is a huge driver of Spark adoption. I think 100 percent of Spark users now use Spark SQL.
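To illustrate the kind of workload moving from Hive to Spark SQL, here is a minimal sketch against the 2015-era SQLContext API, with the input file and query invented:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("spark-sql-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Build a DataFrame over semi-structured data, then query it with
    // plain SQL -- the workload that used to go through Hive.
    val logs = sqlContext.read.json("weblogs.json") // hypothetical path
    logs.registerTempTable("logs")
    sqlContext.sql("SELECT status, COUNT(*) AS hits FROM logs GROUP BY status")
      .show()
    sc.stop()
  }
}
```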
HBase is certainly an alternative to Cassandra, so between those two databases you have most of the webscale NoSQL space covered.
HDFS, I think, is the mainstay. I don't think there's currently an alternative to HDFS in terms of infinite storage where you can just add machines and keep dumping web logs into it. But HDFS can just be run as a service on Mesosphere.
It's an evolution, and what used to be referred to as "Hadoop" now means "Spark." It just means an ensemble of technologies enabling a data pipeline, storage and analytics, which the SMACK stack now does.
Is there a lot of overlap between the conferences -- Scala by the Bay and Big Data Scala? Should I come to one or both?
I think it really depends. Scala is now becoming this glue that connects the dots in big data. It sounds kind of corny, but it's actually true that if everything is moving through the same pipeline you don't need to do ETL. We're unveiling a NoETL campaign and NoETL.org, because basically ETL should be a thing of the past. You should not extract and then load again. Everything should flow through the system, and if you keep it in the system, you have these tools available such as Akka, Play and Kafka.
If you really want to master the pipeline, you should definitely come to Scala by the Bay and Big Data Scala. If you also want to know how Twitter does it -- Twitter has a slightly alternative version, where they have open source equivalents for most of these things -- FinagleCon should be great. It will include a bunch of functional programming and Scala knowledge that you can use for Spark, Kafka and in other contexts.
Depending on your specific interests, you can come to Scala by the Bay to really learn about programming in Scala. And everyone who registers gets a free ticket to FinagleCon. If you also want to learn big data and data science, like if you want to learn Scala for Spark, then Big Data Scala is a good choice.