Datameer Blog

The small print: You’re going to have to clean up all that big data

By Stefan Groschupf on May 19, 2014

**This post originally appeared on VentureBeat**

Garbage. Credit: Meaduva flickr

Last year, surveyors checked in with more than 2,000 IT leaders in the U.S. and Canada, and 60 percent believed their organizations lacked accountability for data quality, while more than 50 percent questioned the validity of their data.

Recent reports have also uncovered that much of the data collected by the U.S. Department of Education is riddled with errors and missing information.

As the number and variety of raw data sources grow exponentially, data-quality issues can wreak havoc on an organization if the data isn’t vetted at every point of the analytics workflow, from ingest to final visualization.

Danger: Look out for dirty data

For example, consider the problems bad data can cause for a typical retailer.

Plenty of data problems can crop up as the retailer gathers information, such as missing product IDs or inaccurate product descriptions. When the product data isn’t standardized, different systems will contain inconsistent information, leading to problems with the retailer’s inventory, fulfillment and logistics.
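As a rough illustration of the kind of check described above, here is a minimal Python sketch that counts records with missing product IDs and lists products whose descriptions disagree across systems. The record layout, system names and the `profile_products` helper are hypothetical, invented for this example; a real retailer's schema will differ.

```python
from collections import defaultdict

def profile_products(records):
    """Count records lacking a product ID and find IDs whose
    descriptions differ between systems (hypothetical record layout:
    (system, product_id, description))."""
    missing_ids = 0
    descriptions = defaultdict(set)
    for system, product_id, description in records:
        if not product_id:  # None or empty string counts as missing
            missing_ids += 1
            continue
        # Normalize case and whitespace so trivial differences are ignored
        descriptions[product_id].add(description.strip().lower())
    conflicts = {
        pid: sorted(descs)
        for pid, descs in descriptions.items()
        if len(descs) > 1
    }
    return missing_ids, conflicts

# Illustrative sample data, not taken from any real system
records = [
    ("inventory", "P-100", "Blue Mug"),
    ("fulfillment", "P-100", "blue mug "),
    ("inventory", "P-200", "Red Mug"),
    ("logistics", "P-200", "Red Cup"),
    ("fulfillment", "", "Green Mug"),
    ("logistics", None, "Green Mug"),
]

missing, conflicts = profile_products(records)
print(missing)    # 2
print(conflicts)  # {'P-200': ['red cup', 'red mug']}
```

Running a check like this at ingest, rather than after inventory reports are built, is what keeps the inconsistency from reaching fulfillment and logistics.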

Inconsistent data about product inventory can lead to overproduction — resulting in write-downs — or underproduction, which can cause late deliveries and out-of-stock notices. Bad distribution data can lead to duplicate shipping orders, returns and reshipments. These basic data issues translate to significant wasted time and money for the company.

Because of such risks, organizations need to be smart about how they’re approaching data from the very beginning of the process and each time new data is added.

While companies have been able to monitor the quality of small data sets for some time now, the increasing size and scope of the data organizations deal with on a daily basis has made this task much more complicated. This is where new big data analytics technologies that enable data profiling during every step of the analytics cycle become critical: they help organizations pick out anomalies in enormous data sets from the get-go. This helps them avoid wasting resources on bad data, and also frees up time for businesses to discover additional analytics use cases.

Gauge data quality first

There are plenty of instances of companies using technology to measure the quality of their data early on, ultimately saving resources and reducing problems down the road.

One bank uses a self-service big data analytics tool to identify loans that have high risk and quantify risk exposure. The bank’s analysis identified loans made to borrowers whose credit scores fell below the normal range for the borrower’s zip code (credit scores often correlate closely with zip codes, with more affluent areas tending to have higher-than-average scores). This helped the bank highlight risky loans and better track its loan portfolios’ overall exposure to defaults, which amounted to over $13 million.
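The article does not say how the bank defined the "normal range" for a zip code, so the following Python sketch makes an assumption: a loan is flagged when the borrower's score falls more than `k` standard deviations below the mean score for that zip code. The loan fields, the threshold `k` and the `flag_risky_loans` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def flag_risky_loans(loans, k=2.0):
    """Flag loans whose credit score is more than k standard deviations
    below the mean score of loans in the same zip code (an assumed
    definition of 'below the normal range')."""
    scores_by_zip = defaultdict(list)
    for loan in loans:
        scores_by_zip[loan["zip"]].append(loan["score"])

    flagged = []
    for loan in loans:
        scores = scores_by_zip[loan["zip"]]
        floor = mean(scores) - k * pstdev(scores)
        if loan["score"] < floor:
            flagged.append(loan)
    return flagged

# Illustrative sample portfolio, not real data
loans = [
    {"id": "L1", "zip": "94105", "score": 760, "amount": 400_000},
    {"id": "L2", "zip": "94105", "score": 770, "amount": 350_000},
    {"id": "L3", "zip": "94105", "score": 780, "amount": 500_000},
    {"id": "L4", "zip": "94105", "score": 750, "amount": 300_000},
    {"id": "L5", "zip": "94105", "score": 765, "amount": 450_000},
    {"id": "L6", "zip": "94105", "score": 560, "amount": 250_000},
    {"id": "L7", "zip": "10001", "score": 640, "amount": 200_000},
    {"id": "L8", "zip": "10001", "score": 650, "amount": 220_000},
    {"id": "L9", "zip": "10001", "score": 660, "amount": 180_000},
]

flagged = flag_risky_loans(loans)
exposure = sum(loan["amount"] for loan in flagged)
print([loan["id"] for loan in flagged])  # ['L6']
print(exposure)                          # 250000
```

Summing the flagged loan amounts gives a simple exposure figure of the kind the bank tracked; a production model would likely use a more robust baseline than a per-zip mean that includes the outliers themselves.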

A telecommunications company took an entirely different approach to data quality analysis to plan its spending on new infrastructure more accurately. The company analyzed its customer information to find incorrect subscriber data (invalid email addresses, for example) that skewed results on usage in different areas. By correctly correlating subscriber information with network performance data, the company was able to keep up with existing and forecasted demand. And by knowing exactly what infrastructure it needed, the company said, it was able to avoid wasting an estimated $140 million on unnecessary capital expenditures.
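To show how bad subscriber records can skew a per-area usage figure, here is a small Python sketch that sets aside records with an invalid email address before totaling usage. The field names, the `usage_by_area` helper and the deliberately simple email pattern are assumptions for illustration; the article does not describe the company's actual validation rules.

```python
import re
from collections import defaultdict

# Deliberately simple pattern: something@something.something, no spaces.
# Real email validation is more involved; this is only a sketch.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def usage_by_area(subscribers):
    """Total usage per area from records with a plausible email address,
    returning the rejected records separately for review."""
    totals = defaultdict(float)
    rejected = []
    for sub in subscribers:
        if EMAIL_RE.match(sub["email"] or ""):
            totals[sub["area"]] += sub["gb_used"]
        else:
            rejected.append(sub)
    return dict(totals), rejected

# Illustrative sample data, not real subscriber records
subscribers = [
    {"email": "a@example.com", "area": "north", "gb_used": 10.0},
    {"email": "b@example.com", "area": "north", "gb_used": 5.0},
    {"email": "not-an-email", "area": "north", "gb_used": 500.0},
    {"email": "c@example.com", "area": "south", "gb_used": 8.0},
    {"email": "", "area": "south", "gb_used": 300.0},
]

totals, rejected = usage_by_area(subscribers)
print(totals)         # {'north': 15.0, 'south': 8.0}
print(len(rejected))  # 2
```

In this toy data the two invalid records account for most of the apparent usage, which is the kind of distortion that could push infrastructure spending toward the wrong areas.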

Data quality has become an important, if sometimes overlooked, piece of the big data equation. Until companies rethink their big data analytics workflow and ensure that data quality is considered at every step of the big data analytics process — from integration all the way through to the final visualization — the benefits of big data will only be partly realized.

Stefan Groschupf

Stefan Groschupf is a big data veteran and serial entrepreneur with strong roots in the open source community. He was one of the very few early contributors to Nutch, the open source project that spun out Hadoop, which, 10 years later, is considered a 20 billion dollar business. Open source technologies designed and coded by Stefan can be found running in all 20 of the Fortune 20 companies in the world, and innovative open source technologies like Kafka, Storm, Katta and Spark all rely on technology Stefan designed more than a half decade ago. In 2003, Groschupf was named one of the most innovative Germans under 30 by Stern Magazine. In 2013, Fast Company named Datameer one of the most innovative companies in the world. Stefan is currently CEO and Chairman of Datameer, the company he co-founded in 2009 after several years of architecting and implementing distributed big data analytic systems for companies like Apple, EMI Music, Hoffmann La Roche, AT&T, the European Union, and others. After two years in the market, Datameer was commercially deployed in more than 30 percent of the Fortune 20. Stefan is a frequent conference speaker and contributor to industry publications and books, holds patents, and advises a set of startups on product, scale and operations. When not working, Stefan is backpacking, sea kayaking, kite boarding or mountain biking. He lives in San Francisco, California.