Recently, I came across blog posts from two different analysts that caught my attention. Doug Henschen at Constellation Research penned a post titled “Democratize the Data Lake: Make Big Data Accessible.” And over at Forrester, Brian Hopkins wrote an entry in his blog that he called “Insights-Driven Businesses Are Stealing Your Customers.” The two posts, which I’ll summarize in a moment, cover different topics. But upon reading both, I realized they spoke to different symptoms of a common underlying problem, and it’s one I’d like to address.
Filling the Gaps on the Data Lake
Henschen’s post acts as a companion to his report “Democratize Big Data: How to Bring Order and Accessibility to Data Lakes.” The post, and presumably the report, discusses the phenomenon of companies building data lakes and filling them with all sorts of data, but nonetheless finding it difficult to connect to, use and analyze that data. The apparent result is low ROI on the data lake.
Henschen sees tools in the categories of:
- Data cataloging and metadata management
- Data discovery and self-service data prep
These tools fill the gaps that make data lakes hard to use. He mentions Datameer as one of those helpful tools and says that “both camps are bringing automation and repeatability to data lake management and governance.”
Don’t Be Data Driven — Be Insights Driven
Hopkins’ post is also a companion to a report, specifically “The Insight-Driven Business.” Hopkins focuses on the notion that amassing data is insufficient to help businesses get ahead. In fact, he argues the very notion of being “data-driven” is ill-conceived. Why? Because, Hopkins says, it’s not about the data at all. It’s about the knowledge that comes from that data and the repeatable application of that knowledge to a business process.
Hopkins calls businesses that apply their data in this way “insights-driven,” and says that from last year through 2020, they can expect a compound annual growth rate between 27 percent and 40 percent. Hopkins says that projection is based on a revenue model built by Forrester that “conservatively forecasts” the revenue of insights-driven companies from now through 2020. He also says that companies in this group extend beyond those that were born digital to include the likes of The Washington Post and Alaska Airlines.
The Initial Stage of the Big Data Maturity Model
Henschen is focused on tools; Hopkins on process and business practices. But both their posts are reactions to something that I think a lot of us instinctively know, but no one has explicitly called out.
It’s also something no one may care to admit, but here goes: the initial stage of the Big Data maturity model involves using the technology for cheap storage.
Think about it.
The Hadoop Distributed File System (HDFS) is based on federating cheap commodity Direct Attached Storage (DAS) drives into a distributed storage system with built-in redundancy. Compare that to the Network Attached Storage (NAS) hardware built into storage and data warehousing appliances. There’s at least an order of magnitude difference in price between the two, and with most HDFS implementations, there’s no lock-in to a specific vendor.
A Crossroads in Big Data
Now, take all that cheap storage, combine it with Big Data’s credo that all data is important and none of it should be discarded, and what do you get? A data lake that’s about as organized and usable as that stuffed-to-the-gills hall closet that you’re scared to open and can only close if you lean your whole body against the door.
With that approach to data, you get a data lake that’s not very accessible, owned by a business that’s storage-laden rather than insights-driven. And, yes, data preparation, data discovery and automated data lake governance/management can help. So too can a commitment to mine the data for insights and create institutionalized actions around them.
But a majority of companies aren’t there yet. In fact, many enterprises out there have only reached a Big Data adoption level that has them doing three things:
- Running pilot projects amongst a core, elite group of Hadoop specialists and their brave business customers
- Storing raw data in the data lake, which consists of a folder structure in HDFS
- Trying to measure the ROI of the above two activities
And therein lies the root cause of data lake accessibility and insights-poor processes: only the preparatory steps have been taken.
Best Practices for Data Lake Accessibility
What’s the remedy? First, let’s be mindful that companies at the above adoption level are not at fault. In fact, if anyone is to blame, it’s vendors for not giving enough proactive guidance. But let’s take blame out of the equation, because this is really just about a natural maturity model around adoption of a new technology. Now customers need help getting to the next level in, and eventually all the way through, the maturity model.
That next level has a few different facets to it:
- Getting the data lake well-documented and curated
- Enabling customers to explore, shape and analyze the data even if it’s not well-curated
- Empowering customers to find patterns in the data so they can do something about them
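To make the first facet concrete, here is a minimal sketch of what documenting and curating a data lake might look like in code. This is purely illustrative: the record fields, the dataset name and the catalog functions are all hypothetical, not drawn from any particular cataloging product.

```python
from dataclasses import dataclass, field

# A hypothetical, minimal catalog record for one dataset in the lake.
@dataclass
class DatasetRecord:
    name: str
    path: str                  # location in the lake, e.g. an HDFS folder
    owner: str
    description: str
    columns: dict = field(default_factory=dict)  # column name -> type

catalog = {}

def register(record: DatasetRecord) -> None:
    """Add a dataset to the catalog so analysts can discover it."""
    catalog[record.name] = record

def search(keyword: str) -> list:
    """Find datasets whose name or description mentions the keyword."""
    kw = keyword.lower()
    return [r for r in catalog.values()
            if kw in r.name.lower() or kw in r.description.lower()]

# Example registration: a raw point-of-sale feed, described well enough
# that a business user can find it without asking a Hadoop specialist.
register(DatasetRecord(
    name="pos_sales",
    path="/lake/raw/pos",
    owner="retail-analytics",
    description="Point-of-sale scans from all stores",
    columns={"store_id": "string", "sku": "string", "amount": "double"},
))

print([r.name for r in search("point-of-sale")])  # -> ['pos_sales']
```

Even a lightweight registry like this moves a lake from “a folder structure in HDFS” toward something explorable; commercial cataloging tools add lineage, automation and governance on top of the same basic idea.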
Steps to Data Lake Accessibility
The best way to get there is good old-fashioned outreach. Vendors and consultants should check in with their customers and help them through an end-to-end exercise of grabbing, preparing and analyzing a data set. Here are the steps in outline:
- If possible, assist the customer in institutionalizing certain actions or decisions around the findings. If not, at the very least get them to conjecture what specific actions should be taken.
- Next, check on the customer to make sure she goes through the process again, but mostly on her own. Do not withhold support, but make sure the customer is in the driver’s seat for the process.
- And lastly, while this is really something only the customer can do, help her form habits around doing this process early and often.
Examples of Data Lake Use Cases
We can give some concrete examples. Start with a retail operation that is dumping supply chain data (including warehouse merchandise scans, GPS data from trucks and point-of-sale data from stores) into a data lake. Beyond gathering and storing the data, the retailer needs to blend it, aggregate it (perhaps by day, geography, shift or session) and enhance it with weather data. Ultimately, it can get a sense of which products sell best where, during which seasons and weather conditions, and then change distribution patterns accordingly.
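The blend-aggregate-enhance pipeline just described can be sketched in miniature. The data below is fabricated for illustration, and the field names (`day`, `region`, `sku`, `units`) are assumptions, not a real retailer's schema; the point is the shape of the logic, not the implementation.

```python
from collections import defaultdict

# Hypothetical point-of-sale records already landed in the lake.
pos = [
    {"day": "2016-07-01", "region": "NE", "sku": "umbrella",  "units": 30},
    {"day": "2016-07-01", "region": "NE", "sku": "sunscreen", "units": 5},
    {"day": "2016-07-02", "region": "NE", "sku": "sunscreen", "units": 40},
    {"day": "2016-07-02", "region": "NE", "sku": "umbrella",  "units": 3},
]

# External enhancement data: observed weather per (day, region).
weather = {
    ("2016-07-01", "NE"): "rain",
    ("2016-07-02", "NE"): "sun",
}

# Blend: attach the weather observation to each sale, then aggregate
# units sold by (region, weather condition, product).
totals = defaultdict(int)
for sale in pos:
    condition = weather[(sale["day"], sale["region"])]
    totals[(sale["region"], condition, sale["sku"])] += sale["units"]

# The best seller per (region, condition) pair is what would drive
# distribution decisions.
best = {}
for (region, condition, sku), units in totals.items():
    key = (region, condition)
    if key not in best or units > best[key][1]:
        best[key] = (sku, units)

print(best[("NE", "rain")])  # -> ('umbrella', 30)
print(best[("NE", "sun")])   # -> ('sunscreen', 40)
```

In production this aggregation would run over the whole lake with a distributed engine rather than in-memory Python, but the insight it produces, and the distribution change it suggests, are the same.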
Another possibility: an elevator maintenance firm should do more than accumulate ride data from the various cars in the various buildings it serves. It should also aggregate that data by elevator bank and/or date.
The firm should then enhance it with security desk data and build predictive models that optimize which elevator cars should home-base on which floors, and which parts and personnel should be on site or close by, at which times.
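A toy version of that aggregation step might look like the following. This is a sketch under stated assumptions, not any vendor's actual model: the ride records and the `home_base` heuristic (park each car on the floor that generates the most calls in a given hour) are invented for illustration.

```python
from collections import Counter

# Hypothetical ride logs: which floor called a car, in which bank, at what hour.
rides = [
    {"bank": "A", "hour": 8,  "origin_floor": 1},
    {"bank": "A", "hour": 8,  "origin_floor": 1},
    {"bank": "A", "hour": 8,  "origin_floor": 5},
    {"bank": "A", "hour": 17, "origin_floor": 9},
    {"bank": "A", "hour": 17, "origin_floor": 9},
    {"bank": "A", "hour": 17, "origin_floor": 2},
]

# Aggregate call counts by (bank, hour, origin floor).
calls = Counter()
for r in rides:
    calls[(r["bank"], r["hour"], r["origin_floor"])] += 1

def home_base(bank: str, hour: int) -> int:
    """Return the busiest origin floor for a bank at a given hour —
    a simple stand-in for a real predictive model."""
    slot = {f: n for (b, h, f), n in calls.items() if b == bank and h == hour}
    return max(slot, key=slot.get)

print(home_base("A", 8))   # -> 1 (morning lobby traffic)
print(home_base("A", 17))  # -> 9 (evening traffic from upper floors)
```

A real deployment would replace the frequency count with a trained model and fold in the security desk data, but the institutionalized action, repositioning cars on a schedule derived from the data, is exactly the kind of repeatable decision Hopkins describes.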
The Future for Big Data ROI
Changes like these are matters of policy and culture, and effecting the change is far from easy. While vendors usually wouldn’t be involved in such fundamental change, it’s necessary here because the technology is still young and customers won’t derive benefit otherwise. So outreach and training are required. Ultimately, these circumstances are temporary, but it will probably be years before the outreach is unnecessary.
The tech industry tends to sell its products based on vision. More often than not, these visions are bold and bring benefits to the customers that come on board. But vision doesn’t implement itself. It takes hard work, initial apparent setbacks, then more hard work and perseverance to get customers to the next level.
Vendors got customers this far, and those customers paid good money for it. Now it’s up to vendors and implementers to help customers proceed. If you’re a customer, insist on that help from your vendor; the best ones will welcome the opportunity.