Meet Christina Kirby, Innovation Leader and Senior Data Engineer with Comcast Engineering & Platform Services. We met Christina at a Datameer User Group meeting at Comcast and were impressed with how she gets value from big data on a day-to-day basis. Part one of this two-part series focuses on her background, the big data technologies her team uses and how her team is organized to deliver results. Part two, next week, will cover her thoughts on data preparation and analysis.
Datameer (DM): Hi Christina, tell us a little bit about you and what you do at Comcast.
Christina: Sure, I have been at Comcast for 10 years now. I am the lead big data engineer for the Network Engineering team. Network Engineering is responsible for managing the network infrastructure at Comcast and delivering service levels.
As for myself, I am a big data evangelist. I architect end-to-end data solutions, primarily in the Apache stack around Hadoop, and am currently evaluating Spark.
Our team manages the tools that monitor network performance and related data flows across the network to provide guidance on network planning and ensure a seamless customer experience.
DM: How did you get into big data?
Christina: I was originally introduced to the world of Hadoop when my prior team was outgrowing our SQL Server implementation. We came to a crossroads of deciding how much data to retain or whether to change our data platform altogether. It seemed natural to move to Hadoop and gain both: the ability to retain historical data and the ability to run queries in real time, or close to it.
DM: You mentioned your data analysis needs were surpassing the technology that you were using, and then that’s when you started looking at Hadoop. How has big data technology evolved and how does it help you make your job easier?
Christina: I don’t know if it necessarily makes it easier, but rather makes it possible! Hadoop is only 10 years old, so it still has some growing up to do. With any evolving technology, there are growing pains as you grow and learn with it as it matures. While it has enabled great things that we could never have imagined ten years ago, you have to be ready to ride that wave of instability. You may have to work through bugs and some pain points at times, but that just means you are moving up the adoption curve and getting closer to the cutting edge. Partnering with a good support team and Hadoop distribution provider will help you navigate your big data journey.
It’s very exciting, but it can be turbulent at times. The transition is a culture shift as much as a technology shift. Be ready to communicate to team members and management who were accustomed to stable relational environments that greater things are now possible, but be ready for some bumps along the way.
DM: Tell me about Spark. What are your thoughts on how you would use Spark differently from Hadoop?
Christina: I think Spark and Hadoop are very complementary. Spark enables faster processing and is generating even more excitement than Hadoop saw a decade ago. While we are doing primarily batch processing in Hadoop, Spark will enable more real-time interactions with machine learning, as well as streaming functionality. Historically, you had to go to the Apache stack and put multiple components together to enable this. Now it’s all in one package and ten times faster.
DM: This allows you to think up new use cases and new analyses that you can do?
Christina: Yes, this broadens the list of possibilities. Not only does it let us do what we’re doing today faster, but we’re also in the midst of building out additional use cases. One of Spark’s top use cases is its ability to process streaming data. With so much data being processed on a daily basis, it has become essential to be able to stream and analyze data in real time, especially with the emergence and rapid growth of the Internet of Things (IoT).
Another of the many Apache Spark use cases is its machine learning capabilities. MapReduce was built to handle batch processing, and higher-level engines layered on Hadoop, such as Hive or Pig, are frequently too slow for interactive analysis. Apache Spark, however, is fast enough to perform exploratory queries without sampling. Over time, Apache Spark will continue to develop its own ecosystem, becoming even more versatile.
DM: One thing we talked about before is that traditionally, you had to be fairly technical to use big data tools for analysis, but the business owners seeking answers typically aren’t technical. So the business owner, the analyst and maybe IT need to get involved in an iterative loop, going back and forth, until the answer is found. Is there a way to minimize that iteration?
Christina: As additional technologies come onto the scene, it is getting easier, enabling more people to put the data to productive use for our various network needs. Not everyone needs to be a MapReduce or Java programmer to take advantage of the benefits of Hadoop and the Apache stack. While there is great control in using a lower-level language if you have the background and skill set, the introduction of Hive and Pig and tools like Datameer provides higher-level interfaces into your data. These tools make collaboration between your business counterparts and analysts easier and more streamlined. Previously, a business user would need to go to IT and say, “here are my requirements,” and then wait six months – all the while the technology is changing and the data transformed a hundred times over. Now it is becoming a partnership between the business and the data team. Tools and technologies are bringing the team together to understand and find insights from the data – the analysts understand the data better, and the business users are engaged to understand the process and the technology that is enabling the magic to happen.
DM: Would you say there is a level of self-service that is required for big data to be successfully used and adopted within a company?
Christina: Delivering high-quality governed data to end users for analysis is critical to the success of the organization as it is not possible for a single team to conduct all of the analysis. The more you expose the insights created from your big data environment for consumption, the greater the population of users across the organization that can review and leverage them.
Creating a data-driven culture and the accompanying shift in mindset is not a simple undertaking. This is particularly true for companies that were not built with data in mind. This kind of transformation and adoption must be seen as a journey that may take multiple years. Data-sharing internally and organizing the company’s data into a centralized location is a key first step to this adoption.
DM: In attaining self-service, what part of the analytics workflow is most challenging?
Christina: Ensuring the users consuming the data also understand the data is vital to success, but this can also be the most challenging part. The more users interact with the data, the more perspectives you have, and this can lead to multiple interpretations. The process of analyzing and interpreting the data can be quite complex. New users must be briefed on the data so they understand what they are using, where it comes from and what it represents.
DM: What do you think is the biggest misconception about big data in the marketplace?
Christina: With all the hype around big data, it is often dismissed as a buzzword; however, there is great value in data, and it needs to be adopted as part of your company’s strategy. Big data is here to stay, and our mission really has become to transform big data into smart data. I think that companies have not realized the assets they are sitting on in terms of data, or how they can use those assets to become more efficient and improve networks and operations.
DM: A lot of people still seem to be skeptical about the ‘power’ of big data, for lack of a better term. How would you respond to that?
Christina: There are definitely many skeptics when it comes to big data. As with other technology transformations, big data comes with great promise and excitement, which in turn can lead to impatience and insistence on immediate results. Big data carries enormous potential, but the process of integrating it into your operations is evolutionary and requires a lot of work to deliver the desired result. Embracing data and analytics as part of an organization’s culture is key to optimizing business processes, uncovering new business opportunities and delivering a more compelling customer experience.
DM: It’s up to your team to make the data smart so that you can use it, and to look through and find insightful meaning in your data.
Christina: Yes, we support our peers across the organization while, at the same time, trying to understand trends in the network infrastructure – what the data is showing and how the network is evolving. Because we’re at the forefront of visibility into the data, we can see shifts in traffic.
DM: Does your team have a charter, or do you take on more ad hoc project requests that come your way?
Christina: We do both. Every year, we develop our charter and goals for the year of what analysis we want the team to accomplish. In addition, we receive questions from our peers and other departments throughout the year.
DM: Your group actually takes a proactive role in analysis. You don’t just wait for requests to come from management on what to analyze?
Christina: Correct, it is a combination of both. We’re at the ready for questions, and they come pretty fast and furious at times. We’re also doing comparisons, looking at our data daily, weekly and monthly, in addition to collaborating with other analysis teams to see how the data correlates.
DM: How do you make the distinction between a data scientist and an analyst?
Christina: There are multiple career paths when it comes to data. Some companies will even use these titles interchangeably. We split up the roles so that the data scientists tackle the open-ended questions and the related exploratory work. Harvard Business Review has compared data scientists to experimental physicists, as they design equipment, gather data, conduct multiple experiments, and communicate their results. Data scientists define the statistical models and leverage tools like R, Python and other modeling programs.
On the other hand, the analyst is focused on interpreting the data through reporting and analysis and generally handles ad hoc questions from the business. Analysts typically conduct analysis that focuses on describing the past, while data scientists typically emphasize manipulating data and creating models to improve the future.
Behind the analysis are data engineers who, through governance models built into the tools, ensure the data is right and ready for use – not only that the data is available, but that it is in a format data scientists and analysts can question and interrogate, through cleaning, transforming, organizing and even turning unstructured data into structured data. And don’t forget we need to work with the legal and compliance teams to make sure all of our data collection and use complies with the law and our policies for the privacy and security of data. All of these roles and interactions are essential in the data lifecycle and in developing the meaningful stories to tell with the data.
DM: What do you think big data teams will look like in 2-5 years? How will the role of data scientists change in that time?
Christina: Data teams will be integrated into the business units that they serve. Big data enables managers to measure and increasingly know more about their businesses. This in turn translates into knowledge and improved decision making. Data scientists will be a core team member in a cross-functional team making recommendations to the senior leadership team. Data scientists bring the data science skills and analytical perspective needed to solve business problems. As companies become more data-driven, data professionals need to be immersed in the business and strategy conversations alongside fellow business experts.
Be sure to check out Part 2 of this Q&A, where Christina shares her expertise on how to do big data analysis.