We continue our Q&A with Christina Kirby, Innovation Leader and Senior Data Engineer with Comcast Engineering & Platform Services. See part one of the Q&A here. Christina has over 10 years of experience in big data, and today we discuss what she has learned about big data analysis over the years.
DM: Describe a typical analysis process that you go through.
Christina: To summarize, here are the high-level steps that we perform when embarking on an analysis:
- Define the business question(s) you are trying to answer
- Understand what to analyze
- Assess what data you need and what you already have at your disposal
- Determine ways to enrich that data
- Evaluate analysis methods to apply
- Analyze and interpret data
- Visualize and deliver the results
For example, suppose you wanted to see the adoption of the latest Internet Protocol, IPv6: what’s the allocation of v4 addresses vs. v6 addresses around the world? What’s the adoption curve from v4 to v6? On the surface, it may seem like a simple computation. However, it is much more complicated to pull out the data that you need, because you can’t simply ask the data which bucket it goes in: a device could be running dual stack (in which case it is able to run IPv4 and IPv6 in parallel).
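To make the bucketing problem concrete, here is a minimal Python sketch. It assumes each device comes with a list of observed addresses; the device names, the sample addresses (taken from documentation ranges), and the `classify_device` helper are all hypothetical and are not Comcast's actual method.

```python
import ipaddress
from collections import Counter

def classify_device(addresses):
    """Return 'v4-only', 'v6-only', 'dual-stack', or 'none' for one device."""
    # ip_address() parses either family and exposes .version (4 or 6)
    versions = {ipaddress.ip_address(a).version for a in addresses}
    if versions == {4, 6}:
        return "dual-stack"
    if versions == {4}:
        return "v4-only"
    if versions == {6}:
        return "v6-only"
    return "none"

# Hypothetical sample: device id -> addresses observed for that device
devices = {
    "dev-a": ["192.0.2.10"],
    "dev-b": ["2001:db8::1"],
    "dev-c": ["198.51.100.7", "2001:db8::2"],
}

# Three buckets, not two: a dual-stack device is neither "v4" nor "v6"
counts = Counter(classify_device(addrs) for addrs in devices.values())
print(counts)
```

Counting addresses alone would place `dev-c` in both the v4 and v6 totals, which is why the classification is done per device here.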
We have some advantages as we own some of the datasets, and can leverage peers across the organization to provide additional information to enrich the data and obtain a better understanding of what the data represents. When you have a complete dataset, you can better slice and dice the data as opposed to a one-dimensional view.
DM: You mentioned data preparation. Would you say that’s the most important part and once you have your arms around that, then you can do your analysis?
Christina: Data preparation and integration are definitely key. The more data prep that you do up front, the smoother your analysis will go. Even though the data filtering you do up front might be time consuming, you will become more intimately engaged with the data, and that will help you understand what the data is telling you.
After collecting these data sets, you then need to verify the data and check for anomalies. For example, a network circuit cannot run at over 100% utilization, as that is not physically feasible.
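A check like this can be written as a simple range filter. The sketch below is only an illustration: the record layout, the `utilization_pct` field name, and the sample values are assumptions.

```python
def find_utilization_anomalies(samples):
    """Return the samples whose utilization falls outside 0-100 percent."""
    return [s for s in samples if not 0.0 <= s["utilization_pct"] <= 100.0]

# Hypothetical circuit measurements
samples = [
    {"circuit": "ckt-1", "utilization_pct": 42.5},
    {"circuit": "ckt-2", "utilization_pct": 137.0},  # physically impossible
    {"circuit": "ckt-3", "utilization_pct": -3.0},   # also impossible
]

anomalies = find_utilization_anomalies(samples)
for s in anomalies:
    print("check source data for", s["circuit"], s["utilization_pct"])
```

Flagging the records rather than silently dropping them keeps the decision about what went wrong with the people who know the data.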
DM: What happens after data prep? Do you then do your analysis? Is it a one-way data pipeline like that or is it more iterative between data prep and analysis?
Christina: After data preparation, there is data exploration and data munging. Do we need to add any additional variables or computed variables? Is there missing data? Are there additional datasets needed?
When gathering and integrating other datasets into your assessment, be sure to ask, “Is it safe to join my data with yours?” and understand the data you are receiving. You don’t want to join the data on network id when you really meant device id.
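One inexpensive safeguard is to measure how well the key values line up before joining. The following Python sketch assumes two small lists of records; the `key_overlap` helper, the field names, and the data are hypothetical.

```python
def key_overlap(left_rows, right_rows, left_key, right_key):
    """Fraction of distinct left-side key values that also appear on the right."""
    left_values = {row[left_key] for row in left_rows}
    right_values = {row[right_key] for row in right_rows}
    if not left_values:
        return 0.0
    return len(left_values & right_values) / len(left_values)

# Hypothetical datasets from two different teams
usage = [
    {"device_id": "d1", "bytes": 100},
    {"device_id": "d2", "bytes": 250},
]
inventory = [
    {"device_id": "d1", "network_id": "n9"},
    {"device_id": "d2", "network_id": "n9"},
]

good = key_overlap(usage, inventory, "device_id", "device_id")
bad = key_overlap(usage, inventory, "device_id", "network_id")
print(good, bad)
```

A very low overlap suggests the two columns do not describe the same thing, which is a prompt to talk to the team that owns the other dataset before joining.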
We may take an iterative approach between data preparation and analysis if we identify additional issues and will continue to do so until we have a complete dataset to move forward with and further analyze.
DM: How do you know, once you have cleansed your data, that it’s at the right quality level? Are there certain checks that you can put in place?
Christina: A significant part of it is knowing your data and your domain. For example, on a typical day, peak hours will follow a certain trend. When peak hours shift too much and do not follow the pattern one would expect, it will often be necessary to drill into “is this an actual shift or is there some data quality issue that’s creating this?” Just as with any QA process, question your data and ask, “Does this fit? Does this align with what you’re expecting?” It could be that the results did surprise you, and you do a check against another system or another data set. What did this look like a week ago? Does it follow a different trend? Is this unique? Seek out other sources to vet against. Be sure to always question your results. If you are not familiar with the data, ask questions of the teams that produce the data.
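The "what did this look like a week ago?" question can be turned into a small automated check. This is a sketch under stated assumptions: each day is a list of 24 hourly values, and the two-hour tolerance and both helper functions are illustrative choices rather than an established rule.

```python
def peak_hour(hourly_values):
    """Index (0-23) of the hour with the highest value."""
    return max(range(len(hourly_values)), key=lambda h: hourly_values[h])

def peak_shifted(today, week_ago, max_shift_hours=2):
    """True if today's peak hour moved more than max_shift_hours from last week's.

    Assumes both inputs hold 24 hourly values.
    """
    diff = abs(peak_hour(today) - peak_hour(week_ago))
    diff = min(diff, 24 - diff)  # treat the clock as circular (23:00 is near 00:00)
    return diff > max_shift_hours

# Hypothetical traffic: last week peaked in the evening, today in the morning
week_ago = [10] * 24
week_ago[20] = 90
today = [10] * 24
today[9] = 90

if peak_shifted(today, week_ago):
    print("Peak moved: real shift, or a data quality issue?")
```

The check only raises the question; deciding whether the shift is real still takes domain knowledge and a second source to vet against.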
DM: What’s an important part of analysis that’s often overlooked?
Christina: I think it is key when communicating your analysis to tell a story and ensure that it’s your story that’s being told, because if you simply share a graph or sample dataset, it may produce multiple stories. If you share your data in Excel, someone else could look at it and aggregate it differently and come to a different conclusion.
Share the results of your analysis clearly by highlighting the key findings and takeaways that you want your audience to focus on, as they may not always be obvious from the visualization or how it’s presented. When sharing your analysis results, be sure to provide the story along with any visualizations.
DM: So the story and the data that supports that story are both equally important.
Christina: Yes, very much so.
DM: Is it possible to go through the analysis process only to arrive at the wrong conclusion?
Christina: Oh definitely, as statistics can lie. When you’re approaching an analysis, do not seek the answer that you want. Instead, let the data speak for itself and be conscious of analysis paralysis.
DM: How do you know? Have you ever done an analysis and then arrived at the end and realized, oh, this doesn’t look right, or the way I approached it is wrong? Were there any lessons learned from your experience?
Christina: We like to conduct peer reviews. When sharing insights at the company level, we always approach it as a team and compare our findings before presenting them to a larger audience. The intent is not to be critical of each other but to learn from the approaches we’ve taken and see how the team’s findings compare to arrive at a consistent message. We collaborate often with other analysis groups within the organization through consortiums. For your conclusions to become something that is adopted, either as part of strategy or business plan, you want a high level of confidence in what you’re presenting and that you have significant data to back that up.
DM: We talked about analysis being part art and part science. What would you say is the art piece and what is the science piece?
Christina: There is definitely a creative edge. Science is the mechanics (the process and the tools), while art is the creative part that is needed to discern what the data is telling you. If you give the data to two different people, you’re not necessarily going to get the same results or the same thought process, even if they arrive at the same end result. There is a level of creativity in terms of what approaches the analysts take and even what algorithms they apply. The science is in getting the data right, and the art is in seeing what that data is telling you and extracting insights from it.
DM: What advice would you give to someone embarking on big data analytics?
Christina: When embarking on data analysis, just as with adopting new technologies, you should be prepared for a lot of experimentation and be open to failure. Keep in mind not all data is equally important.
By embracing the mantra of fail fast, you can learn from mistakes, apply those learnings and accelerate your results. In today’s world of massive data volumes and advanced analytics, data scientists can test, experiment and fail forward faster, enabling their companies to succeed at a faster pace.
DM: Is there an example that you can give with that?
Christina: Many times analysts and data scientists may start with best practices or use an algorithm that has worked well in the past; however, that may not be best for the dataset at hand. For example, say you were looking at a particular website metric and there was a preconceived notion that the data follows a specific distribution, such as a Gaussian, log-normal or Poisson distribution. If the data does not actually follow that distribution, an approach built on that assumption may not fit.
Not all datasets will work with a particular algorithm. Algorithms are designed to discover broad generalizations based on some expectations about the dataset distribution. If algorithm A fails, then try algorithm B. There are some simple statistical techniques, as well as analytical functions that you can use to profile and understand your data and its characteristics, such as how spread out your data is, to best determine its distribution.
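As an illustration of that kind of profiling, the Python sketch below computes a few summary statistics with the standard library. The `profile` helper and the sample page-view counts are hypothetical, and the hints in the comments are rough rules of thumb, not a formal distribution test.

```python
import statistics

def profile(values):
    """Summary statistics that hint at the shape of a distribution."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    variance = stdev ** 2
    # Population skewness: about 0 for symmetric data,
    # positive when there is a long right tail
    if stdev > 0:
        skew = sum((v - mean) ** 3 for v in values) / len(values) / stdev ** 3
    else:
        skew = 0.0
    return {
        "mean": mean,
        "median": statistics.median(values),
        "variance": variance,
        "skewness": skew,
        # A ratio near 1 is consistent with Poisson-like counts
        "variance_to_mean": variance / mean if mean else float("nan"),
    }

# Hypothetical website metric with one very large value
page_views = [1, 2, 2, 3, 3, 3, 4, 4, 5, 30]
summary = profile(page_views)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Here the mean sits well above the median and the skewness is positive, which already argues against treating this metric as Gaussian before any algorithm is chosen.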
DM: To me, it’s really important that there’s a lot of discovery and a lot of experimentation in the analysis process. It’s often not a straight line to your answer, even when you think it’s a simple question.
Christina: It is rarely a straight line. The process will likely improve over time, but it never quite goes as straight as the crow flies. The key is to use flexible, open-data infrastructure that allows for continually refining your approach until your efforts yield results. In this way, organizations can eliminate the fear and iterate toward more effective use of big data.
Big data is all about asking the right questions, which is why it’s so important to rely on domain knowledge.
Thank you, Christina, for sharing your experience with us. To summarize the key points:
To learn more about Comcast’s big data journey, be sure to check out this presentation they gave at Hadoop Summit San Jose.
For more information about how Datameer is helping other organizations with their big data journey, visit our customer testimonials page.