Differences in Spark Implementations & Why They Matter


Spark has become one of the most widely supported big data technologies, with quite possibly the fastest-growing ecosystem in the field. In fact, according to KDnuggets, it is now the largest open source data processing project, with more than 750 contributors from over 200 organizations.

Why is it so popular? In part, it’s because Spark exposes several different interfaces. But that variety also leads to varying levels of support for the product and, unfortunately, those differing profiles tend to be conflated under one broad heading, as if they were all the same.


The reality, of course, is that not all Spark support is the same, and it’s crucial that you understand the differences before making a buying decision.

Spark SQL: The Ticket In

Many BI products on the market have added Spark support through Spark SQL, a technology that allows tabular data to be queried with the same dialect of SQL used by an earlier open source technology, Apache Hive. Spark SQL, therefore, is akin to earlier SQL-on-Hadoop technologies. Though it uses the Spark engine behind the scenes, Spark SQL shares a key disadvantage with Hive and Impala: data must be in a structured, tabular format to be queried. This forces Spark to be treated as if it were a relational database, which undermines many of the advantages of big data technology.

  • Pros: Many analysts are already comfortable with SQL, so there is little new to learn.
  • Cons: Data must be in a structured, tabular format to be queried, which limits the benefits of Spark, particularly for unstructured data.

No (Programming) Guts, No Glory?

If you want the full power of Spark, you’ll want to use its processing engine more directly. A couple of vendors do offer that support in their products, exposing Spark’s programming models through either a command line or a “Notebook” interface. With this approach, all barriers to functionality are removed, but a barrier to entry is introduced, and it’s a big one: you need to be a developer to use Spark within these products. For a lot of organizations, that’s simply not feasible. And even for organizations that have developer resources in-house, assigning them to data analytics projects may amount to an intolerable hidden cost.

  • Pros: Leverages the full power of Spark; very granular control.
  • Cons: Must be a Spark developer to use; requires specialized skills that are not broadly available.
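
To give a sense of what working with the engine more directly asks of a user, here is a minimal sketch in plain Python of the functional, transformation-chaining style that Spark’s programming model exposes. It deliberately does not use Spark: `flat_map` and `reduce_by_key` are hypothetical stand-ins written for this illustration, loosely modeled on the Spark operations of similar names, and the log lines are invented sample data.

```python
from __future__ import annotations

from collections import defaultdict
from functools import reduce
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
U = TypeVar("U")
K = TypeVar("K")
V = TypeVar("V")


def flat_map(func: Callable[[T], Iterable[U]], items: Iterable[T]) -> list[U]:
    """Apply func to each item and flatten the results into one list."""
    return [out for item in items for out in func(item)]


def reduce_by_key(
    func: Callable[[V, V], V], pairs: Iterable[tuple[K, V]]
) -> dict[K, V]:
    """Group (key, value) pairs by key, then fold each group's values with func."""
    groups: dict[K, list[V]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# Invented sample input: free-text log lines, not rows in a table.
LOG_LINES = [
    "ERROR disk full on node-3",
    "WARN retrying job 42",
    "ERROR job 42 failed on node-3",
]


def count_words(lines: Iterable[str]) -> dict[str, int]:
    """Count word occurrences by chaining split, pair-up, and per-key sum steps."""
    words = flat_map(str.split, lines)             # split each line into words
    pairs = [(word.lower(), 1) for word in words]  # emit a (word, 1) pair per word
    return reduce_by_key(lambda a, b: a + b, pairs)  # sum the ones per word


word_counts = count_words(LOG_LINES)
```

Nothing here requires the input to be tabular, which is the flexibility this approach buys. The cost is equally visible: even a simple word count means thinking in terms of functions, pairs, and reductions rather than a familiar query.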

UI on Top; Full Power Underneath

The next approach is to abstract Spark’s complexity away behind a business user interface while still taking full advantage of its power behind the scenes. This allows users to leverage Spark well beyond its SQL query interface, and to do so without developers on board or the need for developer training.

There are products on the market that offer such an architecture, and they are both powerful and popular. But their apparent strength, that they are tightly coupled with the Spark engine, is also their weakness, because one day Spark will be supplanted by some newer engine. When that inevitable engine churn takes place, these products will either have to stick with Spark or be reengineered for something new. And such a reengineering would be far from trivial.

  • Pros: Get the full power of Spark without needing specialized skills.
  • Cons: Increased dependence on Spark makes it difficult to swap out engines as new tools emerge.

A Modular, Poly-Engine Design

Smart Execution

Datameer’s Smart Execution engine intelligently selects the best processing engine or combination of engines for every single job.

Just because Spark was the best choice of execution engine when a product was built doesn’t mean it will remain so. The big data industry moves quickly and, some may argue, often rather destructively. The only way to manage that threat is to expect it and plan for it. The best way to build a product for Spark while keeping customers future-proofed is to provide a business user-oriented interface and a modular design that works well with multiple engines. This is the approach Datameer takes with our Smart Execution engine.

Such a design provides the same benefits as products that have user interfaces built directly over Spark, but it also lets customers take advantage of other engines that may be more appropriate for certain workloads. More to the point, this modular architecture allows the work that customers do in a product today to run on big data execution frameworks that don’t even exist yet.
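
The modular idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Datameer’s Smart Execution implementation: the `Job` and `Engine` types, the two toy engines, the `select_engine` function, and the row-count threshold are all hypothetical names and values invented for this example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Job:
    """Engine-neutral description of work: what to compute, not how."""
    name: str
    rows: Sequence[int]
    transform: Callable[[int], int]


class Engine(ABC):
    """Hypothetical interface that every pluggable engine implements."""
    name: str

    @abstractmethod
    def can_handle(self, job: Job) -> bool:
        """Report whether this engine is suitable for the given job."""

    @abstractmethod
    def run(self, job: Job) -> list[int]:
        """Execute the job and return the transformed rows."""


class InMemoryEngine(Engine):
    """Toy engine that only accepts small jobs."""
    name = "in-memory"
    MAX_ROWS = 1_000  # arbitrary threshold, chosen only for the example

    def can_handle(self, job: Job) -> bool:
        return len(job.rows) <= self.MAX_ROWS

    def run(self, job: Job) -> list[int]:
        return [job.transform(row) for row in job.rows]


class BatchEngine(Engine):
    """Toy engine that accepts any job and processes it in fixed-size chunks."""
    name = "batch"
    CHUNK_SIZE = 500  # arbitrary chunk size, chosen only for the example

    def can_handle(self, job: Job) -> bool:
        return True

    def run(self, job: Job) -> list[int]:
        results: list[int] = []
        for start in range(0, len(job.rows), self.CHUNK_SIZE):
            chunk = job.rows[start:start + self.CHUNK_SIZE]
            results.extend(job.transform(row) for row in chunk)
        return results


def select_engine(job: Job, engines: Sequence[Engine]) -> Engine:
    """Return the first registered engine that reports it can handle the job."""
    for engine in engines:
        if engine.can_handle(job):
            return engine
    raise LookupError(f"no engine can handle job {job.name!r}")


# Engines are listed in order of preference; jobs never name an engine.
ENGINES: list[Engine] = [InMemoryEngine(), BatchEngine()]

small_job = Job("small", range(10), lambda x: x * 2)
large_job = Job("large", range(5_000), lambda x: x * 2)

small_result = select_engine(small_job, ENGINES).run(small_job)
large_result = select_engine(large_job, ENGINES).run(large_job)
```

In this sketch, supporting a new engine means writing one more `Engine` subclass and adding it to `ENGINES`. The `Job` definitions, which stand in for the work a customer has already done, stay unchanged.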

Put the Data in the Spotlight

Breaking the tight coupling between product and engine allows the customer to focus on the data itself, rather than the implementation details of which engine will process it. That freedom of focus applies not just to working with the data but also to building an investment strategy around big data technology.

A modular design provides portability between engines and reduces the high-stakes pressure of picking just one. This removes a huge barrier to starting a big data analytics journey and shortens the time to reaping that journey’s rewards. Customers benefit from ever-improving technologies, greatly reduce their risk in adopting them, and shield themselves from the turbulence of the fast-changing big data technology marketplace.

Want to learn more? Read your free whitepaper, Understanding Differences in Spark Implementations.

