Why DC/OS and Apache Spark are better together

May 10, 2016

D2iQ


 
According to Mesosphere co-founder and Apache Mesos co-creator Benjamin Hindman, the popular Apache Spark data-processing system was conceived in one of the last places you might expect: his parents' suburban Colorado home.
 
It was the summer of 2009, and the University of California, Berkeley team building Mesos took a working retreat in the mountains at Ben's parents' home in Colorado. Fellow doctoral students Matei Zaharia and Andy Konwinski, who worked with Ben in the university's RAD Lab (which became the AMPLab in 2011), wanted a break from campus and Bay Area life and a chance to focus on writing some Mesos code. One goal for the retreat was to build a framework (application) that would show off Mesos' capabilities.
 
Incidentally, Matei was already focused on the idea of speeding up Hadoop by moving computation in-memory. By the end of the week, he had killed two birds with one stone by creating a prototype of his Hadoop alternative that also served as a proof of concept for Mesos. Thus Spark was born.
 
However, the now-hugely-popular Apache project almost wasn't. The trio of creators still thought of Spark as merely a proof of concept for Mesos, a notion their faculty encouraged. It was members of the broader open source community that really saw the promise of Spark and convinced Ben, Matei and Andy that Spark could stand on its own because it improved so many things that Hadoop users found very frustrating.
 
Spark was open sourced in 2010 and became an Apache Incubator Project in 2012. Mesos became an Apache Incubator Project in 2010. Since then, both technologies have matured to top-level Apache projects and have made the leap from academia into mainstream enterprise IT. Ben used Mesos to harpoon the fail whale at Twitter before joining Mesosphere as co-founder and chief architect in 2014, while Matei and Andy went on to co-found Spark-based startup Databricks in 2013.
 
Together again on DC/OS
 
Since then, both Mesos and Spark have focused on changing the game in large-scale computing, albeit more in parallel than in conjunction with one another. The Mesos community has concentrated on making it easy to manage distributed systems on shared clusters, while the Spark community has focused on streamlining processes and improving the performance of real-time analytics. Both open source projects are now very popular, running in production at some of the world's largest and most innovative companies.
 
Now, Mesosphere is bringing Mesos and Spark closer together again via our recently open sourced DC/OS platform. DC/OS adds layers of operational simplicity and features above and around its Mesos kernel, creating a foundation for running Spark (and other distributed systems) in ways not possible on other platforms. Among the benefits of running Spark on DC/OS are:
 
  • Isolation: In large organizations, teams in different departments often share a Spark cluster but may run different versions of Spark. They want isolation between the versions so one doesn't interfere with another, and accessibility so each Spark instance has access to enough resources to run its job. Multi-version Spark support is unique to DC/OS.
  • True shared infrastructure: Overall, Mesosphere is breaking down datacenter silos by allowing organizations to run any database, web service or anything, really, alongside Spark. The flexibility to co-locate these other services on the same cluster (even the same machines or cores) as Spark will make for a much more functional Spark experience, especially as data-processing systems need to integrate more closely with each other and other modern application components, such as containers and microservices.
  • Performance: Mesosphere engineers are also optimizing the performance of Spark on DC/OS, including through a new scheduling algorithm that ensures all Spark tasks get an equal amount of CPU and RAM. This feature is available in the Spark DC/OS package today, and will land in the 2.0 release of Apache Spark.
  • Apache Zeppelin: We're integrating Apache Zeppelin, a popular open source graphing project by NFLabs. Apache Zeppelin, an incubating Apache project, enables users to make data-driven, interactive and collaborative documents with SQL, Scala and more.
  • One-click installation for all Spark components: These include Dispatcher, History, UIs and other hard-to-install projects. With Spark History Server, users can run a job and read its event logs afterwards to recreate the results.
  • Tight tracking with Apache Spark releases: Spark packages for DC/OS are up to date with Apache Spark within 24 hours of new releases. This means DC/OS users don't have to choose between operational efficiency and running the latest version of Spark.
  • Integration with secure HDFS clusters: Users can secure their data at rest inside HDFS via the Kerberos support that is built into Mesosphere DC/OS, and easily access it for analysis with Spark. In-motion data is secured via SSL.
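
To make the resource side of the isolation point above concrete, here is a minimal sketch of how one team might cap a Spark job's share of a Mesos-backed cluster. It only assembles a spark-submit command line using the standard Spark properties spark.cores.max and spark.executor.memory. The master hostname, jar name, class name and the helper function itself are hypothetical, and this is not the DC/OS Spark package's own tooling.

```python
import shlex


def spark_submit_args(master_url, app_jar, main_class, max_cores, executor_memory):
    """Build a spark-submit argument list for a job with capped resources.

    Hypothetical helper for illustration only. The two properties are
    standard Spark settings; everything else is an assumed placeholder.
    """
    conf = {
        # Upper bound on the total cores the application may claim.
        "spark.cores.max": str(max_cores),
        # Memory requested for each executor process.
        "spark.executor.memory": executor_memory,
    }
    args = ["spark-submit", "--master", master_url, "--class", main_class]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_jar)
    return args


# Assumed example values: one team's job capped at 8 cores, 4 GB per executor.
team_a = spark_submit_args(
    "mesos://mesos-master.example:5050",
    "analytics.jar",
    "com.example.Analytics",
    8,
    "4g",
)
command_line = shlex.join(team_a)
print(command_line)
```

A second team could call the same helper with different limits, so each job declares its own ceiling while the cluster manager decides where the work actually runs.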
 
But Spark is only one piece of a bigger puzzle when it comes to building the data infrastructure for modern applications. DC/OS also supports a number of other technologies for storing and processing data, including Kafka, Cassandra, HDFS and numerous distributed databases. Our Infinity solution takes the hard work out of building data platforms by delivering Spark, Kafka, Cassandra and Akka as an integrated real-time data pipeline.
 
You can learn more about running Spark, Infinity and a whole lot more (including Docker containers, databases and web apps) in our documentation. For guidance on installing DC/OS, check out its documentation, as well as this handy step-by-step guide from our blog.
