Datameer Blog post
Closing Out 2015: Let’s Talk Data Governance, More Ecosystem Support, Azure HDInsight Deployment and a Cool New Spark Connector
by Datameer on Nov 25, 2015
Version 5.11 of Datameer is here, and it’s a big release. Each new feature is impressive on its own, but when you consider all of them together, we’re closing out 2015 with a bang.
We’re delivering governance features that we first announced in June, with a new goody or two thrown in. We’ve got enhanced ecosystem support too, including new Hadoop distribution compatibility and native connectors for the SWIFT file system as well as Amazon Redshift. If that weren’t enough, Datameer now runs beautifully on Azure HDInsight, Microsoft’s cloud Hadoop offering – in fact, we’re now deployable from the Azure Marketplace. And Datameer’s new Apache Spark connector is shipping in the box, too.
See? Put all that together in a single release and you’ve got quite a package. Let’s unpack this manifest a bit, so we can better understand the features and their value.
On the data governance side, we’re now shipping some features we announced earlier in the year. Within the product itself, you’ll see this manifested in a new data lineage feature, available from the toolbar. Similar to the sheet dependency graph, this new visual aid shows you the full lineage of data all the way from original source to the sheets it’s used in and even the business infographics it helps drive. Beyond the user interface, though, lies a listener-based API that provides full audit information about each entity in Datameer. Whether a filter is added to a sheet, a permission is modified on a data import job, or a new user, group or role is created, the listener API can broadcast the event to your own application or to a third-party application integrated with Datameer.
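To make the listener idea concrete, here is a minimal sketch of the general publish/subscribe pattern in Python. It is not Datameer's actual API: the AuditEvent and AuditBus names, their fields, and the action strings are all hypothetical, chosen only to mirror the kinds of events described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AuditEvent:
    """A hypothetical audit record; the field names are illustrative only."""
    entity_type: str  # e.g. "sheet", "import_job", "user"
    action: str       # e.g. "filter_added", "permission_modified"
    actor: str        # who triggered the change
    details: Dict[str, str] = field(default_factory=dict)


class AuditBus:
    """Minimal hub: listeners register once, then receive every published event."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[AuditEvent], None]] = []

    def subscribe(self, listener: Callable[[AuditEvent], None]) -> None:
        self._listeners.append(listener)

    def publish(self, event: AuditEvent) -> None:
        # Broadcast to every registered listener, in registration order.
        for listener in self._listeners:
            listener(event)


received: List[AuditEvent] = []
bus = AuditBus()
bus.subscribe(received.append)  # e.g. your own application's collector
bus.subscribe(lambda e: print(f"{e.actor} -> {e.entity_type}: {e.action}"))

bus.publish(AuditEvent("sheet", "filter_added", "alice", {"column": "region"}))
bus.publish(AuditEvent("import_job", "permission_modified", "bob"))
```

The point of the pattern is that the code raising the event does not need to know who is listening, which is what lets an external application plug in without changes to the product itself.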
We were so excited by the API that we built out a detailed example of its use, resulting in integration between Datameer and the Git version control system. And we were so excited by the sample that we decided to ship it as a supported component. That means that all audit-related information can be made available in the form of a Git repository, and can be viewed in the Git client of your choice. This integration includes visual file differencing of Datameer’s underlying JSON serialization format, so that not only will you know about changes that have taken place, but you’ll also be able to see the “before” and “after” states of your Datameer project, as you browse through a chronological view of tracked events. And, again, it’s a Datameer-supported scenario.
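For a feel of what a “before” and “after” view of a JSON document looks like, here is a small sketch using only Python's standard library. The sample dictionaries are made up for illustration and do not reflect Datameer's real serialization format; the helper name json_diff is likewise an assumption.

```python
import difflib
import json
from typing import List


def json_diff(before: dict, after: dict) -> List[str]:
    """Return unified-diff lines between two JSON-serializable states."""
    # Sorted keys and fixed indentation keep the text stable, so the diff
    # shows real changes rather than key-ordering noise.
    old = json.dumps(before, indent=2, sort_keys=True).splitlines()
    new = json.dumps(after, indent=2, sort_keys=True).splitlines()
    return list(difflib.unified_diff(old, new, "before", "after", lineterm=""))


# Hypothetical states of a sheet definition before and after adding a filter.
before_state = {"sheet": "Sales", "filters": []}
after_state = {"sheet": "Sales", "filters": ["region == 'EMEA'"]}

diff_lines = json_diff(before_state, after_state)
print("\n".join(diff_lines))
```

Lines prefixed with a minus sign come from the earlier state and lines prefixed with a plus sign from the later one, which is the same convention a Git client uses when it renders a commit.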
Embracing the Big Data ecosystem
Does your organization use Amazon Redshift for data warehousing in the cloud? If so, you’re not alone; it’s been one of Amazon Web Services’ fastest growing services to date. One of the neat things about Redshift is that its code base shares some common heritage with that of the popular open source relational database, PostgreSQL. That design meant that Redshift had compatibility on day one with all software that had a Postgres connector, including Datameer. That was a great start, but now that Redshift is more established, Datameer ships with a native Redshift connector of its own.
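As a rough illustration of that Postgres compatibility, the sketch below builds a PostgreSQL-style (libpq keyword/value) connection string aimed at a Redshift cluster. The host, database and user are hypothetical placeholders, and build_dsn is a helper invented for this example; the one Redshift-specific detail is the default port, 5439, where PostgreSQL itself defaults to 5432.

```python
REDSHIFT_DEFAULT_PORT = 5439  # Redshift's default; PostgreSQL's is 5432


def build_dsn(host: str, dbname: str, user: str,
              port: int = REDSHIFT_DEFAULT_PORT) -> str:
    """Build a libpq-style keyword/value connection string."""
    return f"host={host} port={port} dbname={dbname} user={user}"


# Hypothetical endpoint and account names, for illustration only.
dsn = build_dsn("example-cluster.example.com", "analytics", "report_user")
print(dsn)
```

A PostgreSQL driver such as psycopg2 accepts a string of this shape in its connect call, with the password supplied separately. This shows the generic Postgres route only; it says nothing about how Datameer's native connector is configured.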
Azure HDInsight, here we come!
As of our 5.10 release, we support version 3.2 of HDInsight, Microsoft’s cloud Hadoop offering. You can now visit the Azure Marketplace and, in one fell swoop of provisioning, deploy an HDInsight Hadoop cluster and a Datameer instance running on an edge node, pre-configured to work with the deployed cluster. All the virtual private networking, storage configuration and credential management are set up at the same time, automatically. If you’re an Azure customer, even one with a trial subscription, this is an excellent way to evaluate Datameer against a real Hadoop cluster, with minimal expense and no on-premises infrastructure.
SWIFT is here
While we’re on the subject of the cloud, let’s face facts: the hybrid cloud is real and OpenStack is a big driver of it. OpenStack’s object file system, SWIFT, has a great affinity to Hadoop’s own HDFS: it’s distributed, redundant and it works on commodity hard drives. So we thought it would make sense to add a connector for SWIFT, making it usable as both a data source and a destination. It’s just one more offering in the 5.11 smorgasbord.
Spark: it’s time
And then there’s Apache Spark. With its in-memory efficiencies, support for multiple programming languages and built-in modules for machine learning, graph processing, SQL query and streaming data processing, it’s a data processing engine that has grown immensely in popularity over a short period of time. Datameer has been watching Spark closely, monitoring its enterprise-readiness and keeping tabs on interest in it from our customers.
And as our first tranche of Spark support, we now ship a connector that lets you use Spark as a data source or data destination. On the data source side, we make use of Spark SQL to query data you might be using in Spark. That’s more efficient than reading the contents of an entire file…and if the data is already in memory, it’s better still. On the destination side, Datameer can push data into Spark SQL as well (using data definition language queries). Once the data is pushed out, Spark users can do interesting things with it, like building predictive models using Spark MLlib.
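To picture the two directions, here is a small sketch that only builds the SQL text involved: a selective query for the source side and a data definition statement for the destination side. It does not show how the Datameer connector works internally. The function, table and column names are hypothetical, and the helpers do no identifier escaping, so they are not suitable for untrusted input.

```python
from typing import Dict, List, Optional


def create_table_sql(table: str, columns: Dict[str, str]) -> str:
    """Destination side: DDL that declares the table before rows are loaded."""
    cols = ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    return f"CREATE TABLE IF NOT EXISTS {table} ({cols})"


def select_sql(table: str, columns: List[str],
               where: Optional[str] = None) -> str:
    """Source side: ask only for the columns and rows that are needed."""
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    return f"{sql} WHERE {where}" if where else sql


# Hypothetical table and column names, for illustration only.
ddl = create_table_sql("sales_summary", {"region": "STRING", "revenue": "DOUBLE"})
query = select_sql("sales", ["region", "revenue"], "year = 2015")
print(ddl)
print(query)
```

Statements like these could be handed to a Spark SQL entry point, for example the sql method of a PySpark session, though whether a particular statement is accepted depends on the Spark version and catalog configuration. The query shows why this route can beat a full file read: only two columns and one year of rows are requested.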
In fact, there’s even more than I’ve discussed here, including integration with Apache Sentry; enhancements to our Google Analytics connectivity; SHA-encrypted data masking; scripted deployment support for metadata management systems like Collibra; REST API support for setting Datameer permissions; and Pearson and Spearman correlation support in our Smart Analytics module’s Column Dependencies implementation. I told you this is a big release. There’s more cool stuff to come, too. I’ll be back to chat about that in due time. Meanwhile, the Datameer team wishes you happy holidays, and great insights for the new year.
At Datameer, we’re obsessed with making data the most valuable asset in any organization. We believe that when people have unconstrained access to explore massive amounts of data at the speed of thought, they can make data-driven decisions that can wholly impact the future of any business.