
Datameer Blog

Closing Out 2015: Let’s Talk Data Governance, More Ecosystem Support, Azure HDInsight Deployment and a Cool New Spark Connector

By Andrew Brust on November 25, 2015

Version 5.11 of Datameer is here, and it’s a big release. Each new feature is impressive on its own, but when you consider all of them together, we’re closing out 2015 with a bang.

We’re delivering governance features that we first announced in June, with a new goody or two thrown in. We’ve got enhanced ecosystem support too, including new Hadoop distribution compatibility and native connectors for the SWIFT file system as well as Amazon Redshift. If that weren’t enough, Datameer now runs beautifully on Azure HDInsight, Microsoft’s cloud Hadoop offering – in fact, we’re now deployable from the Azure Marketplace. And Datameer’s new Apache Spark connector is shipping in the box, too.

See? Put all that together in a single release and you’ve got quite a package. Let’s unpack this manifest so we can understand the features, and their value, a bit better.

Hullo gov’nah!

On the data governance side, we’re now shipping some features we announced earlier in the year. Within the product itself, you’ll see this manifested in a new data lineage feature, available from the toolbar. Similar to the sheet dependency graph, this new visual aid shows you the full lineage of data, all the way from the original source to the sheets it’s used in and even the business infographics it helps drive. Beyond the user interface, though, lies a listener-based API that provides full audit information about each entity in Datameer. Whether a filter is added to a sheet, a permission is modified on a data import job, or a new user, group or role is created, the listener API can broadcast the event to your own application or to a third-party application that may be integrated with Datameer.
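To make the listener idea concrete, here is a minimal sketch of the general pattern in Python. It is not Datameer's actual API: the AuditEvent and AuditBroadcaster names, their fields and the action strings are all hypothetical, chosen only to show how registered listeners each receive every broadcast event.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AuditEvent:
    """Hypothetical audit record; the field names are illustrative only."""
    entity_type: str   # e.g. "sheet", "import_job", "user"
    entity_id: str
    action: str        # e.g. "filter_added", "permission_modified", "created"
    details: Dict[str, str] = field(default_factory=dict)


class AuditBroadcaster:
    """Keeps a list of listeners and hands every event to each of them."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[AuditEvent], None]] = []

    def register(self, listener: Callable[[AuditEvent], None]) -> None:
        self._listeners.append(listener)

    def broadcast(self, event: AuditEvent) -> None:
        # Listeners are called in registration order.
        for listener in self._listeners:
            listener(event)


def demo() -> List[str]:
    """Register one listener that records a readable line per event."""
    log: List[str] = []
    broadcaster = AuditBroadcaster()
    broadcaster.register(
        lambda e: log.append(f"{e.action} on {e.entity_type} {e.entity_id}")
    )
    broadcaster.broadcast(
        AuditEvent("sheet", "sales_q4", "filter_added", {"column": "region"})
    )
    broadcaster.broadcast(AuditEvent("user", "jdoe", "created"))
    return log
```

In this sketch the listener simply appends to a list. A real integration would more likely forward each event to a logging system, a message queue or a third-party governance tool.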

We were so excited by the API that we built out a detailed example of its use, resulting in integration between Datameer and the Git version control system. And we were so excited by the sample that we decided to ship it as a supported component. That means that all audit-related information can be made available in the form of a Git repository, and can be viewed in the Git client of your choice. This integration includes visual file differencing of Datameer’s underlying JSON serialization format, so that not only will you know about changes that have taken place, but you’ll also be able to see the “before” and “after” states of your Datameer project, as you browse through a chronological view of tracked events. And, again, it’s a Datameer-supported scenario.
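If you’re curious what a “before” and “after” comparison of serialized JSON looks like in principle, the sketch below uses Python’s standard difflib and json modules. The snapshot contents (a sheet name and a filter list) are made up for illustration and are not Datameer’s real serialization format; the Git integration does this kind of comparison for you.

```python
import difflib
import json


def json_diff(before: dict, after: dict) -> list:
    """Return unified-diff lines between two JSON-serializable snapshots."""
    # Sorting keys and fixing the indent gives a stable layout,
    # so the diff reflects real changes rather than key ordering.
    old = json.dumps(before, indent=2, sort_keys=True).splitlines()
    new = json.dumps(after, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(old, new, fromfile="before", tofile="after", lineterm="")
    )


# Hypothetical snapshots of one sheet, before and after a filter is added.
before = {"sheet": "sales_q4", "filters": []}
after = {"sheet": "sales_q4", "filters": ["region == 'EMEA'"]}

for line in json_diff(before, after):
    print(line)
```

Lines prefixed with “-” exist only in the earlier snapshot and lines prefixed with “+” only in the later one, which is the same convention a Git client uses when it displays a change.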

Embracing the Big Data ecosystem

Does your organization use Amazon Redshift for data warehousing in the cloud? If so, you’re not alone; it’s been one of Amazon Web Services’ fastest growing services to date. One of the neat things about Redshift is that its code base shares some common heritage with that of the popular open source relational database, PostgreSQL. That design meant that Redshift had compatibility on day one with all software that had a Postgres connector, including Datameer. That was a great start, but now that Redshift is more established, Datameer ships with its own native Redshift connector.

Azure HDInsight, here we come!

As of our 5.10 release, we support version 3.2 of HDInsight, Microsoft’s cloud Hadoop offering. You can now visit the Azure Marketplace and, in one fell swoop of provisioning, deploy an HDInsight Hadoop cluster and a Datameer instance running on an edge node, pre-configured to work with the deployed cluster. All the virtual private networking, storage configuration and credential management are set up at the same time, automatically. If you’re an Azure customer, even one with a trial subscription, this is an excellent way to evaluate Datameer against a real Hadoop cluster, with minimal expense and no on-premises infrastructure.

SWIFT is here

While we’re on the subject of the cloud, let’s face facts: the hybrid cloud is real and OpenStack is a big driver of it. OpenStack’s object file system, SWIFT, has a great affinity with Hadoop’s own HDFS: it’s distributed, it’s redundant and it works on commodity hard drives. So we thought it would make sense to add a connector for SWIFT, making it usable as both a data source and a destination. It’s just one more offering in the 5.11 smorgasbord.

Spark: it’s time

And then there’s Apache Spark. With its in-memory efficiencies, support for multiple programming languages and built-in modules for machine learning, graph processing, SQL query and streaming data processing, it’s a data processing engine whose popularity has grown immensely over a short period of time. Datameer has been watching Spark closely, monitoring its enterprise-readiness and keeping tabs on interest in it from our customers.

And as our first tranche of Spark support, we now ship a connector that lets you use Spark as a data source or data destination. On the data source side, we make use of Spark SQL to query data you might be using in Spark. That’s more efficient than reading the contents of an entire file…and if the data is already in memory, it’s better still. On the destination side, Datameer can push data into Spark SQL as well (using data definition language queries). Once the data is pushed out, Spark users can do interesting things with it, like building predictive models using Spark MLlib.

What else?

In fact, there’s even more than I’ve discussed here, including integration with Apache Sentry; enhancements to our Google Analytics connectivity; SHA-based data masking; scripted deployment support for metadata management systems like Collibra; REST API support for setting Datameer permissions; and Pearson and Spearman correlation support in our Smart Analytics module’s Column Dependencies implementation. I told you this is a big release. There’s more cool stuff to come, too. I’ll be back to chat about that in due time. Meanwhile, the Datameer team wishes you happy holidays, and great insights for the new year.




Andrew Brust

Andrew is Datameer's Sr. Director of Market Strategy and Intelligence. He covers big data and analytics for ZDNet, is conference co-chair for Visual Studio Live! and is a Microsoft Data Platform MVP.
