Stefan's Blog

Big Data Musings From Datameer's CEO

Big Data & Brews: Justin Borgman, CEO of Hadapt

January 7, 2014

I can now say that I enjoy a gluten-free beer, thanks to Justin Borgman, the CEO of Hadapt. We got a chance to talk about one of my favorite ;) topics… SQL on Hadoop, and a bit about what it’s like to be an entrepreneur in the Big Data space. Check out the first of our 2-part conversation:

TRANSCRIPT
Justin Borgman from Hadapt joins us today for Big Data & brews. Justin, welcome.
Thank you for having me.
Please introduce your brew, and then yourself and your company.
Sounds good. So this is a beer called Green’s Double Dark Ale. It’s actually gluten-free, so I am sure that I’m the first one on the show who is a gluten-free person. I’ve actually been gluten-free for about 12 years. I was sort of diagnosed with an allergy, and I like to think of myself as a trend-setter. Now it’s cool, gluten-free is cool. Everyone’s gluten-free.
Well we will feel good drinking this beer. It’s a healthy beer.
Healthier than all the others.
Uh oh, did it get shaken up a little bit?
We had one explode with Michael Stack actually that was a lot of fun.
Oh really?
Yeah.
So, what’s Hadapt?
Hadapt, we’ve been around 3 years now. We are basically building an analytical database on top of Hadoop. So if you’re familiar with this whole SQL on Hadoop space.
No.
No? You’ve never heard of it? Haha.
It must be something new.
Yeah, well, nothing new anymore. But we like to think of ourselves as kind of the pioneers in the space, way back three years ago, based on some research in the Yale computer science department. So my co-founder is actually Daniel Abadi, who is a professor there, whose prior research actually led him to founding Vertica before becoming a professor. So effectively we built this platform on top of Hadoop. It sits on top of the Hadoop distribution and it allows you to work with data inside of Hadoop using SQL.
[1:51] Literally on top, or under?
Sort of embedded inside of it. This gives me a chance to draw. We like to think of Hadoop as sort of the operating system for us. It has all that fault tolerance and resource management and so forth, and we work with any distribution, so just like there are different distros of Linux, there are different distros of Hadoop. And then Hadapt installs on top of that. You’ve already got HDFS as part of Hadoop, certainly, but we actually add a relational database component, and that’s one of the things that makes us unique architecturally. Hopefully I can spell here on the fly. So what that does is it allows us to leverage relational database technology, which is more mature and has a lot of advantages from a performance standpoint in terms of indexing and so forth, as a part of the platform. And then we’ve got our own SQL query engine up here that allows you to query that data in SQL. You can query both data inside of HDFS and also data in this relational store.
And did you rewrite those parts? One of the challenges we see is that some folks are trying to reinvent the wheel with SQL on Hadoop. And one really interesting observation: it looks like your RDBMS is not sitting on top of HDFS, right? If you think about other SQL-related things, they’re usually super slow because HDFS is a sequentially optimized file system, and this is kind of more of a B-tree, random access. So your RDBMS is not on top of HDFS, do I understand that correctly?
It’s not on top of HDFS, it is certainly installed on the same data nodes, but to your point this is written in C, this is optimized, and therefore you get a lot of those advantages.
So you are by far the fastest SQL on Hadoop then.
Yeah we certainly think so.
Let’s get in that conversation. Who is the biggest, fastest, etc.
Well, that would be a good benchmarking exercise, which certainly some of our customers have done, but you know, we think that we’re the fastest and also the most mature. It turns out it is really hard to build a database. To your point, a lot of the open source vendors are trying to reinvent the wheel, and that means adding a lot of database features from scratch if you’re building them within that platform, whereas we’re able to bring a lot more to the table because we’ve been working on this a lot longer. So our query engine is more mature, with more SQL support and more optimizations built into it.
[4:41] Let’s try this beer.
So what do you think, for a gluten-free beer?
It’s good. You know, some people bring pale ales, or IPAs, and I’m not a big fan, but I like the dark beers. It’s good. I didn’t even know there’s gluten in beer. Well, that feels good. Let’s have more gluten-free beer.
So tell me a little bit about the history of the company. So you guys all started together at which university?
Yale University. And yeah, I was actually in business school at the time. I was a software engineer before business school, but while I was there, I happened to meet a few folks in the computer science department, Daniel being one of them, Kamil being another one. He was Daniel’s Ph.D. student and is also my co-founder. And they had this really interesting research called HadoopDB, and the idea was basically taking a lot of the lessons learned from C-Store and Vertica (C-Store was the paper that became Vertica) and trying to apply them to the Hadoop world, which at that time was still emerging. That was in 2009, and we incorporated in 2010. Effectively we were trying to bring some of that performance and functionality to this platform and also make it more accessible. And I think this is one area where you and I certainly agree: how can we make this all more usable for enterprise customers, not just the really smart guys at Facebook, for example. So that was what was really attractive to me when I first read this paper. I sort of got to know them, and we hit it off really well. I was an MBA student, so initially there was some skepticism…
They were like “hmm do you have sneakers?”
Right, haha, exactly. But yeah, it worked out well. We decided we were going to go for it and commercialize this, and so we went through the licensing process with the university. There were a number of patents pending on that technology at the time, so we secured that, raised some money, and ended up moving to Boston. So we’re the Hadoop company of Boston.
How many big data companies are in Boston? And so Vertica used to be there, right? Or is still there.
Still there, yep, part of HP now. It’s a great database town. Netezza is another great example, which IBM acquired, and certainly there was Endeca, which Oracle acquired, and a number of startups actually. The big data scene, I would say, blossomed in the last 2 years. So it’s a great town. You know, when we started in New Haven, Connecticut, we kind of knew we had to move somewhere, and we debated between Silicon Valley and Boston, but we felt like Boston had a lot of advantages in terms of the infrastructure engineering talent that is abundant there, and I don’t have to compete with Google and Facebook like I’m sure you do here.
[7:37] Yeah we actually don’t have engineers in Silicon Valley, yeah.
Oh really?
We have engineers in NY and Germany. I mean, we have a few engineers here, and services, sales and marketing, but yeah, exactly: building a technology startup and hiring a whole bunch of engineers in Silicon Valley is a money game, and if you’re trying to build a lean startup you don’t want to play that game. You will not out-compete Google in salaries.
Right.
So that’s why… It’s a globalized environment, you know, we’re trying to be a global company. It’s fun.
What part of Germany?
So, close to Berlin. Yeah. That’s where my old company was. It’s a small university town called Halle, we have 6 universities within 200 kilometers, so that’s for us the really nice hub. 1,700 year town, a lot of chateaus, and 25% students. Check it out, Halle.
So yeah how big are you guys now? Like how many engineers do you have in Boston?
So we have about 30 engineers just in Boston. And then we have a small office in Poland, which interestingly goes back to our initial founding. When we first started we were sort of struggling to find good people in New Haven, and one of my co-founders, Kamil, is Polish, so these are people he knew. So we kept that office, and we’ve got about 10 people in Warsaw.
Nice city.
Not quite as nice architecturally perhaps as others, but no, it’s an awesome city and I’ve been over there a couple times.
Cool. So now you are 3 years old, you built a technology, tell me about your experience as an entrepreneur, you know, bringing this to market, what are the learnings and where is the technology resonating the most?
Well, you know, I would say first of all, and I would be curious to know if you agree with me, but it’s been an overwhelming learning experience, right?
That’s what it is.
Yeah, for sure. And I’ve really enjoyed that process. It sort of makes me want to do it again at some point in the future, you know, just to leverage everything that I’ve learned so far.
See all the gray hairs, think about that twice.
Haha, but yeah, it’s been wonderful. I think that the market is so robust for what we’re doing right now. It’s hot, there’s so much interest. You know, a lot of people sort of have these visions of being able to move away from legacy Teradata environments, what have you, and go to Hadoop. Certainly Hadoop can’t do all those things yet today, but that kind of vision, that future, is exciting to people and it’s great to be a part of that.
[10:26] Yep and so how did you guys bring your product to the market? You said you work with all the different Hadoop distributions. Are those guys reselling, are you selling directly, help me to understand.
Good question. So if our relationship was on a Facebook page, it’d be “it’s complicated” with some of them, because when it started out we were certainly very good partners. We were very complementary to what they do, but now a lot of them have started offering their own SQL on Hadoop offerings, like Impala, and so forth. So now we still run on those distributions but we sort of co-opetate with them, I guess.
So you basically have your own folks that bring in your customers. And your customers download the trial, or they call you up if they want to try it? How would I try your product?
Yeah, so you’d call us up or fill out the contact form online and then we’ll work with you to do the trial. We give you an evaluation agreement and let you work with the software and do a proof of concept that, hopefully, if all goes well, ends up with you wanting to buy the software.
Ok so no online trial or download?
We don’t have like a free download at this point. We’ve been thinking about some ways to create like a sandbox online, perhaps, where people can work with it in a virtualized environment. That’s something we may do, but up to this point it’s been more people reaching out to us and then us working with customers.
Cool, so what are some of the use cases people are implementing, where are you seeing your technology being very strong? SQL, structured data, the classical use case for log file analytics on Hadoop, maybe help me to contrast that. Why would I say I really want to have this Hadapt thing?
Yeah so do we have an eraser?
No, I basically use my elbow. Haha. We should. Note to self, we need an eraser.
Um, so all I was going to say here was, the way we look at it, there’s certainly this spectrum of structured and unstructured data. And SQL on Hadoop is largely about the structured data: how can you do the traditional SQL workloads on structured data? That’s certainly where we began. But where we’ve evolved is by allowing enterprise users to query this entire spectrum. So in addition to SQL we also introduced full text search, which works on this end of the spectrum, and then we just introduced a feature in version 2 of our software back in September that we call schema-less SQL, which is basically the ability to query semi-structured data like JSON or XML, so key-value data, and you can query that directly.
[13:24] Oh cool, is that more of an XPath-like query language for JSON or is it still straight SQL?
It’s straight SQL. So what we actually do is, and don’t ask me too many questions about how this works or we’ll have to get one of our engineers involved…
We’ll just call them in.
Haha, it’s all magic. No, but basically what we do is we take the key-value pairs in the JSON data and we materialize them in a tabular format, so the keys are treated like columns and you can continue to query JSON, XML, etc. using SQL, even as that schema is changing. So you don’t have to change your ETL process (there is no real ETL process per se, or it’s sort of happening automatically in a way), and you also don’t have to change the schema of your database, because we’re automatically sort of materializing that. So you may change your JSON, add a new attribute to your JSON file, and now that’s immediately queryable. We did that because, to answer your question about use cases, we found a lot of customers that want to do analytics on JSON and XML data, whether that’s clickstream data or some kind of event log data, or data coming from a key-value store like a Mongo or HBase or what have you, and so this was kind of an easier way to allow them to query that directly. I can talk about one customer example. This is a customer in Boston that has a really cool business. It’s called Objective Logistics, and what they do is they take point-of-sale data from restaurants and bars, the receipt data, the check data, and they analyze that data to determine the best wait staff in the restaurant and the bar. Who’s selling the most, what sort of margin items are they selling (you know, some items are higher margin than others), what kind of tip are they getting. They stack rank these folks and then they do a couple things with that. First of all they reveal that to everyone in the restaurant, so there’s sort of some inherent competition.
Right, gamifying it.
Gamifying it, that’s exactly right. And then secondly, the people that are at the top get rewarded by choosing their shifts first. And if you work in a shift business, especially in a restaurant, that’s important. So that’s a case where all that data is stored in JSON, it’s coming from all these different point-of-sale systems, and it’s changing all the time. Every restaurant is different, every point of sale is different, so having that flexibility is important.
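
To make the schema-less SQL idea from this part of the conversation a bit more concrete, here is a minimal Python sketch of the general technique: keys in incoming JSON records become table columns, and a new key simply adds a column, so the data stays queryable with plain SQL. This is not Hadapt’s implementation. It uses SQLite from the Python standard library as a stand-in relational store, and the names `flatten`, `load_json_records`, the `checks` table and the sample records are all hypothetical.

```python
import json
import sqlite3


def flatten(obj, prefix=""):
    """Flatten nested objects: {"tip": {"amount": 6}} -> {"tip_amount": 6}."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)  # keep arrays as JSON text
        else:
            flat[name] = value
    return flat


def load_json_records(conn, table, json_lines):
    """Insert JSON objects into `table`, adding a column for each new key."""
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" (_id INTEGER PRIMARY KEY)')
    # Column names already present in the table (field 1 of table_info rows).
    known = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    for line in json_lines:
        record = flatten(json.loads(line))
        if not record:
            continue  # nothing to store for an empty object
        for key in record:
            if key not in known:
                # A key we have not seen before becomes a new column.
                conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{key}"')
                known.add(key)
        columns = ", ".join(f'"{key}"' for key in record)
        placeholders = ", ".join("?" for _ in record)
        conn.execute(
            f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
            list(record.values()),
        )


# Hypothetical point-of-sale records; the second one adds a nested "tip" key.
conn = sqlite3.connect(":memory:")
load_json_records(conn, "checks", [
    '{"server": "ana", "total": 42.5}',
    '{"server": "ben", "total": 30.0, "tip": {"amount": 6.0}}',
])
rows = conn.execute(
    "SELECT server, tip_amount FROM checks ORDER BY server"
).fetchall()
print(rows)
```

Records loaded before the `tip` key appeared simply have no value in the new `tip_amount` column, which is how the query keeps working while the JSON changes shape. A real system would also need to handle type conflicts and column-name collisions, which this sketch ignores.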
[15:57] Cool that’s interesting. And what kind of industries are you guys seeing? We see a lot of financial services, telco, retail, the new thing is kind of optimizing production systems, so lean production based on big data, do you guys have hot spots, do you have any vertical where you say oh, 90% of our customers are in retail, or something like that?
Yeah I would say the top three for us are what I would call Internet, SaaS-based businesses doing funnel analysis and that sort of thing, retail is one as well, and then financial services around security and fraud detection, that sort of thing.
Ok interesting, cool. And what kind of deployment sizes are you guys seeing with your technology?
It ranges from 5-6 nodes, and we even see our customers deploying this on EC2 in Amazon.
Oh interesting we don’t see this at all.
Really? Interesting. I mean, I would say the majority would still be on premise, but we do see it as a trend, more and more people using Amazon. And then deployment size, that was the question: everything up to a couple hundred nodes so far. The interesting thing is that because we price on a per-node basis, customers very smartly decide to build beefier nodes in some of those on-premise situations.
Interesting so we price on data.
Data ingestion?
Yeah, data ingestion.
Kind of like a Splunk model?
Yeah, exactly. We don’t care how many users you have, we don’t care if you have 10 or 500 machines. The beauty is we see really high ROIs around even 10-node Hadoop clusters. The success of your Hadoop or Big Data strategy isn’t related to the size of your Hadoop cluster. Funnily enough, the engineers are always measuring themselves by the size of their Hadoop cluster, but I don’t know why that is.
Right.
[18:14] Anyway, cool. So is there anything you’re really excited about in the market? What’s next? Is YARN playing an important role for you guys?
Um, it will. That’s certainly something we’ve been paying a lot of attention to, and I think that’s something we’ll be leveraging very soon, so that’s certainly one thing. But you know, I think it’s just exciting to watch the market evolve, and watch more and more people move from experimental mode to production mode, and that’s what’s very exciting for us. And that’s where we think our value prop is more appealing. I’d imagine you probably feel the same way. We don’t find too many customers that want to just sort of run Hadoop in the lab and want Hadapt for that. It’s more that next step when they’re like, “Okay, now we actually want to use Hadoop in a real production case, how can we make it more accessible and usable and build applications against it and so forth.”
Cool, well, let’s have another drink of the gluten-free beer. We’ll be back next week with the next episode of Big Data & Brews.
Cheers.

1 comment

  1. […] This week’s episode of Big Data & Brews is part 2 of my discussion with Justin Borgman, the CEO of Hadapt. We talk about the future of Hadoop & Hadapt, schemaless SQL,  company culture & the benefits(?) of beers on coding, Lucene, Solr, AWS, YARN, and more. In case you missed part 1, you can view that here. […]