I thought it would be fun this week to flip the tables and show one of our first recorded episodes that we did with Kelly, one of our senior software engineers. The episode never got aired and I thought it was time to bring it out and share it. He joined our team from EMI Music where he worked on the digital distribution infrastructure. We touch on a little bit of everything, including his collection of cool “nerd shirts.”
Stefan: Hey, welcome back to Big Data Brews today, with Kelly.
Kelly, do you want to introduce yourself? What do you do? Where do you come from?
Kelly: I’m living in New York right now, and am a senior engineer here at Datameer.
Stefan: A kick-ass senior engineer.
Kelly: Well, some days are better than others.
Stefan: What days are better?
Kelly: When you actually win.
Stefan: Over the bugs?
Kelly: (Laughs) Yes, always.
Stefan: Tell us a little bit of what’s your background and what you’re working on.
Kelly: My background has been a mix of both consulting and product development. Before Datameer I was working at EMI Music in kind of like a skunkworks group, building newer technology or using newer technology to solve other business problems.
Stefan: So EMI Music’s not really known as a technology company. What are they usually known for? What is their biggest product?
Stefan: Yeah? Contracts?
Kelly: Yeah. Music contracts.
Stefan: Like, for example? I don’t even know EMI Music.
Kelly: Like the Beatles?
Stefan: The Beatles! Oh, okay.
Kelly: Yeah, kind of.
Stefan: Oh, so you’ve worked for The Beatles?
Kelly: Well, no.
Stefan: Okay, but with The Beatles somehow?
Kelly: Yeah, kind of.
Stefan: What did you do in that group?
Kelly: They were trying to change how, internally, a lot of software is developed in the company, and try to work with more of the large amount of the data that they could be using to figure out how to sell music better or who to advertise to or where they should be providing different services for the musicians or things like that. A lot of it started out going through a lot of their IT Systems and seeing what they were currently doing, and then we started working on targeted interim products and projects. [2:11]
Stefan: What was the coolest project from maybe a technology and kind of from the IT perspective?
Kelly: Once we started collecting a lot of social networking data, a lot of their marketing spend and collecting their entity data, like how they structured their products, like records or digital downloads of some sort, those sorts of things from the internal systems. Once you started mixing that data together, they were able to do some pretty interesting things. One of the cooler products was basically like a … not a self-service analytics system, but kind of self-service. What was pre-packaged for them they could go access, based off the data we’ve collected from those different systems. [3:02]
Stefan: You talked about mixing. What’s about the mix here?
Kelly: This, somebody got me here, because I did them a favor. They went and bought me some beer, so I figured I would share.
Stefan: (Laughs) Oh, so you re-gift the beer is what you’re saying?
Kelly: Yeah. (Laughter)
Stefan: Well okay, fair enough. What do we have? What’s that? Beer with natural flavors added.
Kelly: Yeah, so it doesn’t…
Stefan: Yeah, you wouldn’t call that beer in Germany.
Kelly: This is not German beer, no.
Stefan: No. What’s your favorite beer then? Do you even drink beer in New York? Isn’t it all fancy cocktails in New York?
Kelly: No, I’ve been switching more over to beer. Cocktails get a little too expensive in New York.
Stefan: Oh, okay. (Laughs) What’s yours?
Kelly: I think one of my favorite beers is actually Green Flash, it’s out of San Diego. It’s a very good, kind of floral, hoppy IPA, which you wouldn’t like either, so.
Stefan: No. (Laughs) Okay. Let’s open this.
Kelly: You want to open this? (Laughs) Let’s see if I can do this without cutting myself.
Kelly: Oh, you’ve got to be kidding me.
Stefan: You must be an engineer.
Kelly: I know, right? I can never do this. Ah, there we go.
Kelly: Here you go.
Stefan: Thank you. Are you sure it’s not a twist? (Laughter)
Kelly: Yeah, exactly.
Stefan: Prost! It’s actually not too bad.
Stefan: It’s kind of refreshing, yeah.
Kelly: Good workday beer.
Stefan: Yeah. Tell me a little bit about that activity feed that you guys built based on Hadoop. I mean that’s what, maybe four years, five years ago? That was a cool idea. What were the challenges you tried to solve in the beginning, and how did you guys overcome them? [4:52]
Kelly: Initially we started out knowing that we were basically going to be drinking from the firehose of the data. We started out working with Hadoop to build it out. Even though the data sizing didn’t really require it, we knew that once we started collecting a lot more of this data it was going to get into larger and larger data sets. We wanted to be able to scale quickly for that need.
Stefan: So you basically preprocessed, stored the result in AWS, and then you didn’t have to do any database queries, or…?
Stefan: Yeah, okay, but everybody would say, “Oh my god, why would you store all this?” and “Isn’t that expensive?”
Kelly: Yeah, but if you already know the queries the users are going to execute, then it’s easy enough to basically pre-generate all of them and store them in S3. Then the cost of adding a database layer and then query and everything else, it didn’t really solve the underlying problem, which was like we were basically doing report generation. Since we had Hadoop there, we could just generate all the reports at once. [6:25]
Stefan: I guess that’s against the gut feeling of a normal engineer, but in the end, economically, it makes a lot of sense, right? Storage is incredibly cheap.
Stefan: The user perceives it as super-fast because it’s pre-processed.
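A minimal Go sketch of the pre-generation idea discussed above, not the actual EMI system: if the set of queries is known in advance, a batch job can compute every report in one pass and store each one under a predictable key, so serving a report is a plain lookup with no database layer. The `Sale` record, the report paths, and the in-memory map standing in for S3 are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// Sale is a hypothetical input record; the interview does not describe
// the real schema.
type Sale struct {
	Artist  string
	Country string
	Units   int
}

// pregenerate computes every report users are assumed to ask for
// (total units per artist, and per artist and country) in a single pass.
// The returned map stands in for objects written to S3: the key is a
// path-like name and the value is the finished report body.
func pregenerate(sales []Sale) map[string]string {
	totals := map[string]int{}
	for _, s := range sales {
		totals["reports/"+s.Artist+"/all.txt"] += s.Units
		totals["reports/"+s.Artist+"/"+s.Country+".txt"] += s.Units
	}
	store := make(map[string]string, len(totals))
	for key, units := range totals {
		store[key] = fmt.Sprintf("units=%d", units)
	}
	return store
}

func main() {
	store := pregenerate([]Sale{
		{Artist: "A", Country: "US", Units: 3},
		{Artist: "A", Country: "DE", Units: 2},
		{Artist: "B", Country: "US", Units: 5},
	})

	// Serving a report is just a key lookup; print them in a stable order.
	keys := make([]string, 0, len(store))
	for k := range store {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Println(k, store[k])
	}
}
```

The trade-off is the one raised in the conversation: more storage is used, but every read is cheap because the work happened up front in the batch job.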
Stefan: Huh, interesting. Okay.
Stefan: Nice. Definitely EMI was one of the very early Hadoop users that did quite sophisticated stuff there.
Stefan: What are you working on today? What are kind of the exciting projects you’re working on today from a technology perspective? [7:18]
Kelly: I’ve been getting into our visual analytics, so more of the data-mining-like algorithms we have in the product, helping to make those very fast, and performant, and hopefully usable in the product. Most of the time I spend going through the backend code base, especially our integrations into Hadoop, and working on how we process data in the workbooks and in the import jobs.
Stefan: What do you like and what do you absolutely not like about Hadoop as you work with the beast on a daily basis? [8:02]
Kelly: It’s gotten a lot better.
Stefan: Oh, that’s good.
Stefan: Well, cheers on that! Let’s drink on that.
Kelly: Yeah, exactly. (Laughs)
Kelly: Thanks to all the people at Yahoo, and Hortonworks, and Cloudera for making it a lot better.
Stefan: And MapR, and IBM, and Oracle, and Microsoft, and NetApp, and all of you. Thank you.
Kelly: (Laughs) And whoever we forget.
Stefan: Sponsor now. (Laughter)
Kelly: When we first started, especially even back at EMI, things would change, a lot of API changes, rather rapidly. Worst case is semantic changes: even though the API didn’t change, semantically it worked differently, and those are harder to deal with. Here at Datameer that can be a bigger challenge, because we want to support all these different distributions and things. [8:58]
As I’ve said, it’s gotten a lot better. Some of the initial issues, even like HDFS not having a good bulk interface, when you’re dealing with lots of files, that’s been addressed in 2.0 now. Same with … I just checked out … we’ve had issues before with class-loading, because we use a lot of jars and stuff and libraries that make our jobs easier. But also Hadoop communities use them and sometimes they use an older version, and you always…
Stefan: Run into some problems?
Stefan: What are the new features in 2.0 you’re most excited about?
Kelly: I’m most excited about? I’m actually really excited that they’re starting to publish what APIs are more fixed, what is really published versus just public for easier coding on their side, things like that. YARN is definitely very interesting. Especially for what we do here, there’s going to be a lot of cool stuff we can do with it, especially building some of our analytics and stuff on top of that, I think would be a very big one. [10:08]
Stefan: Yeah, but you didn’t answer my question about, “What do you really hate about Hadoop?” For a guy from New York you seem to be very polite.
Stefan: Go for it.
Kelly: Yeah. I think initially there was a lot of, especially using that programming model for even smaller amounts of data just to get some simple parallelism, kind of like a sandbox that MapReduce gives you. That wasn’t really dealt with very well by the core, and…
Stefan: They always thought of 2000 machines, right? They never thought that there might be normal people that have a couple, three machines or maybe just a single multi-core CPU machine, yeah.
Kelly: Certain things I think would actually make MapReduce still faster and relevant. Like some of the things that they wanted to do in Hive with Stinger: there have been certain papers from Google that have shown that even removing the sort from the MapReduce, the shuffle, can in some cases give you really good speed, and that’s one thing that they don’t really want to fix. Which, to me, is like one of those, you know, for us it would be pretty useful, and I’m sure for a lot of people. Other projects, like Pig and Hive, would also find it very useful as well.
Stefan: In all that work that you did in Hadoop in what, six, seven years now, what are the … I know you’re like the research paper guy, you always have the latest and you send stuff around. What are the coolest publications around that, that you might want to see in Hadoop? [11:52]
Kelly: Okay. (Laughs)
Stefan: What’s the latest one?
Kelly: As I said, the Google paper where they built the SQL engine on top of MapReduce and did some of those optimizations, I think that would be very useful to see in Hadoop.
Stefan: Are you a fan of SQL in Hadoop? I mean aren’t you in the wrong company for that? [12:16]
Kelly: (Laughs) You know, it’s a tool. If you can model your problem like that and it’s faster for you to do it that way, I think it’s fine. I think a lot of the power that we have, especially in the way our aggregation system works, and the series calculations and things, I think there’s way more you can do outside of SQL, because SQL being set-based…
Stefan: Good job. (Laughter)
Kelly: Yeah. For me, I try to be very pragmatic about it, and know that some people, they think in it. Sometimes it’s easier to solve a problem a different way. [13:00]
Stefan: Outside of Hadoop, any technologies you’re really excited about? You’ve talked a little bit about interactive visualization you do. Are you involved with D3, or are you excited about any other cool stuff in the backend? What’s your hobby, what’s your technology hobby, the thing that excites you? [13:24]
Kelly: The thing that excites me is, I guess, working with distributed systems and language support for that. I’ve been looking at Go as a language, and I think technology-wise there’s a lot of great stuff in there.
Stefan: Tell us more about this. Go.
Stefan: That’s coming out of Google, right?
Kelly: Yeah. Basically they’re kind of marketing it as a replacement for other system languages like C and C++. It has a lot of great ideas in it, and it’s very simple, and one of my favorite things about it is how quick the compiler is. [14:07]
Stefan: What cool ideas do they have? They have inversion of control, for example?
Kelly: No. The biggest thing … well, it’s kind of like inversion of control, is goroutines and channels for communication between them. Instead of concurrency with locks and threads and stuff, you actually…
Stefan: It’s more kind of an actor model maybe?
Kelly: Similar, yeah. A lot of it’s based on Tony Hoare’s paper on CSP, Communicating Sequential Processes. That influenced Rob Pike I think for a long time, and so that’s why that’s in Go. Based on some of the stuff I’ve had to build, even for Datameer, like when we are dealing with something simple, like parsing values out of a file, we never know. It’s like, “Well are you going to give us more than one value out of that, or not?” That is really easy to describe in Go, where in Java you have to spin up a thread, or work on a buffer, or… [15:08]
Stefan: Is this more event based? I’ve never worked with Go.
Kelly: It’s not really event based. It’s more, as I said, you can spawn up a lot of little mini-processes that manage that stack for you. In Java, where you end up having to kind of create a state machine to do these things efficiently, in Go, you just—
Stefan: Like in JavaCC, or some fun thing.
Kelly: But in Go, you can just say, like, “I’m going to send something on a channel to the next thing,” and it will pause and wait at that point, or if it’s buffered it will keep on going. For me what it does, it’s like the language is helping solve a lot of those common cases that I run into a lot. [15:54]
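A minimal Go sketch of the pattern described above, assuming a made-up comma-separated input format: a parser that may produce zero, one, or many values per line simply sends each value on a channel from its own goroutine, and the consumer ranges over the channel. The function names `parseLine`, `parseAll`, and `collect` are hypothetical and chosen only for this illustration. With a buffer size of 0 each send waits until the receiver is ready; with a larger buffer the parser can keep going until the buffer fills.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLine is a hypothetical parser: a single input line may yield
// zero, one, or many values, and each one is sent on out as it is found.
func parseLine(line string, out chan<- string) {
	for _, field := range strings.Split(line, ",") {
		if field = strings.TrimSpace(field); field != "" {
			out <- field // blocks here if the channel cannot accept the value yet
		}
	}
}

// parseAll runs the parser in its own goroutine and returns the channel
// the values arrive on. bufSize 0 gives an unbuffered channel; a larger
// bufSize lets the parser run ahead of the consumer.
func parseAll(input string, bufSize int) <-chan string {
	out := make(chan string, bufSize)
	go func() {
		defer close(out) // tell the consumer there is nothing more to read
		for _, line := range strings.Split(input, "\n") {
			parseLine(line, out)
		}
	}()
	return out
}

// collect drains a channel into a slice.
func collect(ch <-chan string) []string {
	var vals []string
	for v := range ch {
		vals = append(vals, v)
	}
	return vals
}

func main() {
	input := "a, b\n\nc" // first line has two values, second none, third one

	fmt.Println(collect(parseAll(input, 0))) // unbuffered
	fmt.Println(collect(parseAll(input, 4))) // buffered
}
```

Neither side needs an explicit state machine or lock: the goroutine keeps its place in the loop while it waits on the channel.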
Stefan: How’s the syntax look? Do you want to write a little “Hello World” on the…
Kelly: (Laughs) No, no, no, it’s…
Stefan: No, why not? Hello World in Go. I guess it’s just Go.
Kelly: I cannot. I’m not a good white-board programmer, it’s not what I am. (Laughs)
Stefan: It’s a special skill set.
Kelly: Yeah, I think so.
Stefan: Okay, Go. Let’s switch topics. Good engineering books.
Kelly: Engineering books?
Stefan: Yeah, like books you read. Blogs, I guess. What should I read? [16:25]
Kelly: What are you trying to do?
Stefan: You’re always sharing really cool content, so for people who are maybe interested in big data, what are good sources to follow?
Kelly: I think the last book that I really liked, that I just quickly ran through, was when I was coming up to speed with a lot of the stuff we were doing with the data-mining algorithms. I think it was a little older. It was Programming Collective Intelligence. It’s an O’Reilly book.
Kelly: From a high level, and the Python code’s pretty easy to read, it gives you a good idea of what these algorithms are good at, what they’re not good at. The thing I really liked about it was it had great examples of using both generated data, so what’s the best case, but then also where to go find other data from Web services and you could build stuff out of this. Or use that data from the Web service to then apply those algorithms to, to see how well or not it’s finding, or it’s classifying something, or building clusters or whatever. [17:36]
Kelly: Most of the knowledge I guess I glean now from, most of the books I kind of stay away from. It’s a lot of just searching the internet and finding stuff.
Stefan: Yeah, yeah. Okay, so you are very well known in the company for having really awesome nerd shirts. [18:00]
Stefan: I have to bring this up, but where is the secret? What’s the secret sauce to get really cool engineering shirts?
Kelly: Engineering shirts? That’s different. I mean, most of mine are nerdy I would say, more than engineering, lots of video games. I think being a child of the 80’s and having a Nintendo, so anything associated with that was usually where I would go. Uniqlo, which is kind of a Japanese Gap, they usually have really great T-shirts, prints of different things. I’ve had some good jazz album covers and a lot of video game shirts from them. Also Shirt.Woot has got some good stuff on it.
Stefan: The most WTF emails I get in this company are actually from you. Anything specific that may be in the market, in the technology, that you really scratch your head, “Why the hell would that happen?” [18:56]
Kelly: Yeah. I think the focus on speed kind of bothers me sometimes.
Kelly: That it’s always, “How quick can I get this?” and…
Stefan: Is it query speed or time to insight? What do you mean with speed?
Kelly: Well, I think it ends up being query speed, right?
Stefan: Yeah, like, “Oh, my Hadoop is faster than yours.”
Kelly: Yeah, and can you actually react in that time? Or it’s like maybe you just don’t know. If you’re doing like ad hoc queries and things, sometimes you want that to come back quickly, so you know where to go to next, so you’re not waiting like an hour or two hours or whatever, which I can understand some of that. As I said before, even doing batch stuff, like at EMI and other places, if you know what you have to solve you can just solve it in a more general sense and just run it, like n-versions of it and build way more reports. Like why…
Stefan: Parallelism, yeah.
Kelly: Yeah, use parallelism instead. I think it’s great that people are working on these technologies, but I think if people are just starting out with these technologies it’s I think kind of the wrong thing to look at.
Stefan: It’s like the best distributed system is an undistributed system, right?
Stefan: The refactoring guy, what’s his name?
Kelly: I’m terrible with names too, but I know who you’re talking about.
Kelly: That’s a good book too, by the way.
Stefan: Yeah. Do you like Java puzzles?
Kelly: Yeah, kind of.
Kelly: But I don’t like finding them in a code base, which is usually where I find them.
Stefan: (Laughs) What’s important in an engineering team for you? You’re a super-senior. How do you make building a complex product like Datameer work? What’s kind of … big teams, small teams, waterfall. [21:00]
Kelly: I think everything should be done with waterfall.
Kelly: Yeah, perfectly designed out the first time.
Stefan: Okay. Like, you model the algorithms and then you press “generate code”?
Kelly: Yeah, and it just works, the first time.
Stefan: I know. I was there. UML, ’98 maybe.
Kelly: No. I do like small teams. I like kind of small, good, strong teams together, where each person has a strength but is still kind of a generalist, so you don’t have to deal with bandwidth issues or whatever. Like “Oh this is the only person I know who knows how this works,” right? Everybody should own it. I think obviously the testing stuff is kind of important and all that fun stuff, but I also think about solving the real problem and not either scratching some technology itch, or things like that. [22:03]
Stefan: Well it’s a common problem, especially in the big data space, where people are just scratching the technology itch.
Kelly: Yeah. Also, willingness to fail and being very vocal about it, I think it’s very important. I think a lot of times as engineers we try to be very smart and smarter than the next person. Like you’re in some meeting with somebody, and you feel like you have to maybe show them how smart you are.
Stefan: “This really needs to be in memory, real-time, analytics, undistributed, something…”
Kelly: “I know all the latest and greatest.” I think if you can step back from that and be like, “Oh, that didn’t work, that failed,” and the team is very vocal about that, and then learning from that. Obviously not failing every time and then doing the same thing over and over again, but actually having real feedback.
Stefan: Do you feel that things like Git help to nurture those kind of new processes? Which tools help to move in that right direction? Or is it a more social problem? [23:05]
Kelly: Yes. I think it’s more of a social problem. Because the tools can either help you or hurt you, depending on your understanding of them or whatever. I think once you learn how to use things like Git or Mercurial, it’s like night and day.
Stefan: Compared to CVS?
Kelly: Yeah, because the whole failure thing. It’d be like, “Oh, I broke this, well okay, go back in time, it’s fine.” It’s not a big deal. The decisions you make, especially in a large code base, it’s very easy to be like, “Oh, I’m going to experiment with this, let’s see if this works,” and then come back and be like, “Okay, it didn’t. So, now what? Okay, let’s go down this path.” Right?
The tools enable you to do that, but if the culture’s not there to have that feedback and to accept the fact that you’re going to make mistakes, you’re going to create bugs, you’re not going to just fix them, without that I think it’s a lot harder. For me I like being very vocal about it. [24:01]
Stefan: In context of teams, Frank our VP of product has this theory, right, and I want to see how that resonates with you. He says, “A perfect team is four people.” Not four people, but basically four characters. One character that is an innovator; one character that questions things, right, to the innovator; one character that just gets, excuse my German, make a beep, gets shit done; and then there’s one guy that tries to kind of non-refactor, just never touch the working stuff. Kind of the character that tries to maintain existing stuff. What kind of characters do you see in that perfect team that you described? Like, a generalist with some focus areas, or…?
Kelly: Yeah, I would hope that some people would kind of wear those hats in a little team. I’ll make a distributed programming joke, but you should probably be five so you have a quorum, so it’s not two against two, because then you’ll never get anything done. But yeah, I could kind of see that actually. I see that in our team.
Stefan: But it needs to be five to get a quorum?
Kelly: Yeah, that’s what I’m saying, right? You have to have at least five, so there’s maybe two getter-doners or whatever.
Stefan: Anything you want to give away, or advice for big data folks working in the space? Like, “Never touch that API,” or…? (Laughs)
Kelly: Keep it simple.
Stefan: Keep it simple?
Kelly: Yeah, I think a lot of people have a tendency to kind of run to the latest technology just to use it, even though maybe there’s a better way to solve their current problem. Because the more things you add to the system, the more complex it gets, and the more that you then have to deal with those complexities.
You know, it’s funny because we talk about this, what we did at EMI, and it was like, “Well, we’re using Hadoop for things that maybe initially it wasn’t necessary for, we’re just adding some complexity,” but what we were able to do rather quickly with it, by building a simple data warehouse system on top of S3 with it, by just using it like files. Which most developers are used to. If you know UNIX or know any of that stuff, you can start there. You don’t have to go directly into some of the other more interesting technologies that run on top of it.
I think the quicker you can solve the real problems, then you’ll get more and more interesting problems to solve. Then eventually you can start adding those other— [26:37]
Stefan: Rather than over engineering.
Stefan: Okay. Is there a Go binding for MapReduce, or something like a SCADA?
Kelly: I don’t know, actually. It’s one of those things where it’s like that’s what I do for my job, so sometimes it’s nice to not have to think about it for that, right? I can be like, “Oh, I’m going to code something else.”
Stefan: Yeah, cool.
Hey, thank you very much, Kelly, stopping by from New York. Enjoy the wonderful winter. Good luck with that. Come visit in California if you get cold feet.
Kelly: Anytime. Thanks.