How Comcast Turns Big Data into Smart Data (Part 2)
- Datameer, Inc.
- February 23, 2018
Datameer: Describe a typical analysis process that you go through.
Christina: To summarize, here are the high-level steps that we perform when embarking on an analysis:
- Define the business question(s) you are trying to answer
- Understand what to analyze
- Assess what data you need and what you already have at your disposal
- Determine ways to enrich that data
- Evaluate analysis methods to apply
- Analyze and interpret data
- Visualize and deliver the results
For example, suppose you wanted to measure the adoption of the latest Internet Protocol, IPv6: what’s the allocation of v4 addresses vs. v6 addresses around the world? What’s the adoption curve from v4 to v6? On the surface, it may seem like a simple computation. However, pulling out the data you need is much more complicated, because you can’t simply ask the data which bucket it goes in: a device could be running dual-stack (in which case it runs IPv4 and IPv6 in parallel).
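To illustrate why dual-stack breaks a simple two-bucket count, here is a minimal Python sketch. It assumes the raw data is a list of (device ID, IP address) pairs; the sample records and the `classify_devices` helper are hypothetical and are not taken from Comcast's pipeline.

```python
import ipaddress
from collections import Counter


def classify_devices(observations):
    """Label each device as ipv4-only, ipv6-only, or dual-stack."""
    versions_by_device = {}
    for device_id, address in observations:
        # ip_address() parses both families; .version is 4 or 6.
        version = ipaddress.ip_address(address).version
        versions_by_device.setdefault(device_id, set()).add(version)

    labels = {}
    for device_id, versions in versions_by_device.items():
        if versions == {4, 6}:
            labels[device_id] = "dual-stack"
        elif versions == {6}:
            labels[device_id] = "ipv6-only"
        else:
            labels[device_id] = "ipv4-only"
    return labels


# Hypothetical observations using documentation address ranges.
observations = [
    ("dev-a", "192.0.2.10"),
    ("dev-a", "2001:db8::10"),
    ("dev-b", "198.51.100.7"),
    ("dev-c", "2001:db8::beef"),
]
labels = classify_devices(observations)
counts = Counter(labels.values())
```

Counting addresses alone would report two v4 and two v6 entries here, while grouping by device shows that one of the three devices belongs in neither single-protocol bucket.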
We have some advantages as we own some of the datasets and can leverage peers across the organization to provide additional information to enrich the data and better understand what the data represents. When you have a complete dataset, you can better slice and dice the data as opposed to a one-dimensional view.
Datameer: You mentioned data preparation. Would you say that’s the most important part, and once you have your arms around that, you can do your analysis?
Christina: Data preparation and integration are definitely key. The more data prep that you do upfront, the smoother your analysis will go. Even though that upfront data filtering can be time-consuming, you will become more intimately engaged with the data, which will help you understand what the data is telling you.
After collecting these datasets, you then need to verify the data and check for anomalies. For example, a network circuit cannot run at over 100% utilization, as that is not physically feasible.
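A sanity check of this kind can be sketched in a few lines of Python. This is only an illustration: it assumes the "over 100%" figure refers to circuit utilization, and the field name `utilization_pct` and the sample rows are hypothetical.

```python
def find_impossible_utilization(samples, max_pct=100.0):
    """Return samples whose utilization falls outside the feasible 0..max_pct range."""
    return [
        sample for sample in samples
        if not (0.0 <= sample["utilization_pct"] <= max_pct)
    ]


# Hypothetical measurements; two of them cannot be physically correct.
samples = [
    {"circuit": "ckt-1", "utilization_pct": 72.5},
    {"circuit": "ckt-2", "utilization_pct": 131.0},
    {"circuit": "ckt-3", "utilization_pct": -4.0},
]
anomalies = find_impossible_utilization(samples)
flagged_circuits = [sample["circuit"] for sample in anomalies]
```

Flagged rows are better investigated than silently dropped, since an impossible value often points to a unit mismatch or a collection problem upstream.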
Datameer: What happens after data prep? Do you then do your analysis? Is it a one-way data pipeline like that, or is it more iterative between data prep and analysis?
Christina: After data preparation, there is data exploration and data munging. Do we need to add any additional variables or computed variables? Is there missing data? Are there additional datasets needed?
When gathering and integrating other datasets into your assessment, be sure to ask, “Is it safe to join my data with yours?” and make sure you understand the data you are receiving. You don’t want to join the data on network ID when you really meant device ID.
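One way to test whether a join is safe before running it is to profile the candidate key on both sides. The sketch below is a hedged illustration in plain Python; the `join_key_report` helper, the column names, and the sample rows are all assumptions made for the example.

```python
def join_key_report(left_rows, right_rows, key):
    """Summarize how well `key` lines up between two lists of dict rows."""
    left_keys = [row[key] for row in left_rows]
    right_keys = [row[key] for row in right_rows]
    left_set, right_set = set(left_keys), set(right_keys)
    matched = left_set & right_set
    return {
        # Share of distinct left-side keys that find a partner on the right.
        "left_match_rate": len(matched) / len(left_set) if left_set else 0.0,
        # Duplicate keys on the right would multiply rows in the join.
        "right_has_duplicates": len(right_keys) != len(right_set),
        "unmatched_left": sorted(left_set - right_set),
    }


# Hypothetical datasets from two different teams.
usage = [
    {"device_id": "d1", "bytes": 10},
    {"device_id": "d2", "bytes": 25},
    {"device_id": "d3", "bytes": 40},
]
inventory = [
    {"device_id": "d1", "network_id": "n1"},
    {"device_id": "d2", "network_id": "n1"},
]
report = join_key_report(usage, inventory, "device_id")
```

A very low match rate or unexpected duplicates is a hint that the two sides do not mean the same thing by the key, which is the moment to go back and ask the team that owns the data.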
We may take an iterative approach between data preparation and analysis if we identify additional issues and will continue to do so until we have a complete dataset to move forward with and further analyze.
Datameer: How do you know that your data is at the right quality level once you have cleansed it? Are there certain checks that you can put in place?
Christina: A significant part of it is knowing your data and your domain. For example, on a typical day, peak hours will follow a certain trend. When peak hours shift too much and do not follow the pattern one would expect, it will often be necessary to drill into “is this an actual shift or is there some data quality issue that’s creating this?” Just as with any QA process, question your data and ask, “Does this fit? Does this align with what you’re expecting?” If the results do surprise you, check them against another system or another dataset. What did this look like a week ago? Does it follow a different trend? Is this unique? Seek out other sources to vet against. Be sure always to question your results. If you are not familiar with the data, ask questions of the teams that produce the data.
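The "what did this look like a week ago?" check can be sketched as a simple comparison of peak hours. This is an illustrative Python example only: it assumes each day is a list of 24 hourly traffic values, and the two-hour threshold and the sample data are arbitrary assumptions.

```python
def peak_hour(hourly_traffic):
    """Return the hour index (0-23) with the highest traffic value."""
    return max(range(len(hourly_traffic)), key=lambda hour: hourly_traffic[hour])


def peak_shift_needs_review(today, week_ago, max_shift_hours=2):
    """Flag a day whose peak hour moved more than max_shift_hours vs. a week ago."""
    shift = abs(peak_hour(today) - peak_hour(week_ago))
    shift = min(shift, 24 - shift)  # treat 23:00 and 00:00 as one hour apart
    return shift > max_shift_hours


# Hypothetical traffic: flat baseline with a single peak per day.
week_ago = [10] * 24
week_ago[21] = 90   # evening peak
today = [10] * 24
today[14] = 95      # peak moved to mid-afternoon

needs_review = peak_shift_needs_review(today, week_ago)
same_day_check = peak_shift_needs_review(week_ago, week_ago)
```

A flag here does not say whether the shift is real or a data quality problem; it only tells you where to start drilling in and which other sources to vet against.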
Datameer: What’s an important part of the analysis that’s often overlooked?
Christina: I think it is key when communicating your analysis to tell a story and ensure that it’s your story that’s being told because if you share a graph or sample dataset, it may produce multiple stories. If you share your Excel data, someone else could look at it and aggregate it differently and come to a different conclusion.
Ensure that your analysis results are clearly shared by highlighting the key findings and takeaways that you want your audience to focus on from the analysis. It may not always be obvious in the visualization or how it’s presented. When sharing your analysis results, be sure to provide the story along with any visualizations.
Datameer: So the story and the data that supports that story are both equally important.
Christina: Yes, very much so.
Datameer: Is it possible to go through the analysis process only to arrive at the wrong conclusion?
Christina: Oh definitely, as statistics can lie. When you’re approaching an analysis, do not seek the answer that you want. Instead, let the data speak for itself and be conscious of analysis paralysis.
Datameer: How do you know? Have you ever done the analysis, arrived at the end, and realized, “Oh, this doesn’t look right,” or “The way I approached it is wrong”? Were there any lessons learned from your experience?
Christina: We like to conduct peer reviews. When sharing insights at the company level, we always approach it as a team and compare our findings before presenting them to a larger audience. The intent is not to be critical of each other but to learn from the approaches we’ve taken and see how the team’s findings compare to arrive at a consistent message. We often collaborate with other analysis groups within the organization through consortiums. For your conclusions to become something that is adopted, either as part of a strategy or business plan, you want a high level of confidence in what you’re presenting and that you have significant data to back that up.
Datameer: We talked about analysis being part art and part science. What would you say is the art piece and what is the science piece?
Christina: There is definitely a creative edge. Science is the mechanics (the process and the tools), while art is the creative part needed to discern what the data is telling you. If you give the data to two different people, you’re not necessarily going to get the same results or the same thought process, even where you might expect a single result. There is a level of creativity in terms of what approaches the analysts take and even what algorithms they apply. The science is in getting the data right, and the art is in seeing what that data is telling you and extracting insights from the data.
Datameer: What advice would you give to someone embarking on big data analytics?
Christina: When embarking on data analysis, just as with adopting new technologies, you should be prepared for a lot of experimentation and be open to failure. Keep in mind not all data is equally important.
Embracing the mantra of fail fast will enable you to learn from mistakes, apply those learnings, and accelerate your results. In today’s world of massive data volumes and advanced analytics available, data scientists can test, experiment, and fail forward faster, enabling their companies to succeed at a faster pace.
Datameer: Is there an example that you can give?
Christina: Many times, analysts and data scientists may start with best practices or use an algorithm that has worked well in the past; however, that may not be best for the dataset at hand. For example, say you are looking at a particular website metric, and there is a preconceived notion that the data follows a specific distribution, such as a Gaussian, log-normal, or Poisson distribution.
Not all datasets will work with a particular algorithm. Algorithms are designed to discover broad generalizations based on some expectations about the dataset distribution. If algorithm A fails, then try algorithm B. There are some simple statistical techniques and analytical functions that you can use to profile and understand your data and its characteristics, such as how spread out your data is, to best determine its distribution.
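As a rough example of such profiling, the Python sketch below computes a few summary statistics with the standard library. The `profile` helper and the sample values are hypothetical; a large gap between mean and median, or a skewness far from zero, is a hint (not proof) that a symmetric, Gaussian-style assumption does not fit.

```python
import statistics


def profile(values):
    """Compute basic center, spread, and skewness statistics for a sample."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev:
        # Population skewness: the third standardized moment.
        third_moment = sum((v - mean) ** 3 for v in values) / len(values)
        skewness = third_moment / stdev ** 3
    else:
        skewness = 0.0
    return {
        "mean": mean,
        "median": median,
        "stdev": stdev,
        "min": min(values),
        "max": max(values),
        "skewness": skewness,
    }


# Hypothetical website metric with one very large value.
values = [1, 2, 2, 3, 3, 3, 4, 50]
summary = profile(values)
```

In this sample the mean sits well above the median and the skewness is positive, so an algorithm that expects roughly symmetric data would be a questionable first choice.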
Datameer: To me, it’s crucial that there’s a lot of discovery and experimentation in the analysis process. It’s often not a straight line to your answer, even when you think it’s a simple question.
Christina: It is rarely a straight line. The process will likely improve over time, but it never quite goes as straight as the crow flies. The key is to use flexible, open data infrastructure to continually refine your approach until your efforts yield results. In this way, organizations can eliminate the fear and iterate toward more effective use of big data.
Big data is all about asking the right questions, which is why it’s so important to rely on domain knowledge.
Thank you, Christina, for sharing your experience with us. To summarize the key points:
- For big data to work, you need to be willing to try and experiment with new technologies.
- Fail fast and learn quickly to get yourself back on the right track—whether it’s trying new technologies or new approaches to analysis.
- Big data analysis needs to be self-service and not reserved for the few with technical expertise.
- Data analysis is part art and part science. There is a creative art to data analysis – science and mathematics are well-founded and provide the tools for exploration. Their timely application and combination are left to the artful eye with the intuition of what is being sought.
- Telling the story about your analysis is as important as the outcome of the analysis.
- Build a team and/or peer review into your analysis process.