Analytics Use Cases With Public Data
- Datameer, Inc.
- November 21, 2019
The most challenging aspect of building a new analytics solution or product is collecting reliable data. I will walk through some analytics use cases that harness public data. These standard public datasets can be used for validation or as a good starting point for building more tailored solutions and analytics. Public or third-party assets and datasets are standard and a great way to enrich your proprietary data. These data run the gamut from census data, weather data, and consumer segmentation to more specific datasets for particular industries or applications such as natural language processing (NLP).
We are going to explore some very useful and accurate datasets that you and your team can use today. I will also walk through analytics use cases on how to apply these datasets to proprietary data in order to make better business decisions.
Three Basic Principles For Public Data
The key is to find high-quality datasets that you can use almost immediately. Here are three principles to hold to when evaluating a dataset – particularly public data:
- No muss, no fuss: Look for datasets that don’t require much time to clean. There are plenty of datasets out there – if one is messy or has too many rows or columns, then move on.
- Keep your eyes on the prize: Start from an end goal, framed as a question that the data can answer.
- Version control: Once you find the asset you want, make sure to evangelize it with the broader team. Have everyone on the team work from the same file, should they need similar data for their projects. Datameer Spotlight is a great place to store and tag these files so everyone can use them – more on that later.
Let’s explore some common needs, sources, and use cases!
There is a plethora of public and third-party demographics assets available. Holding to the basic principles previously mentioned, we can easily find many that can be used for a variety of applications. Demographic assets are used for market research, product development, and many other initiatives in both academia and industry.
Using these three principles, we can explore and apply some very useful and reliable analytical assets.
The U.S. Census Bureau provides demographic data as well as data for voting, redistricting, and congressional affairs. Geography is central in the output of the Bureau and provides the framework for survey design, sample selection, data collection, and tabulation.
The Census Bureau collects data about the economy and the people living in the United States from many different sources. Some data are collected from respondents directly, through censuses and surveys. The Bureau collects additional data from other sources. Primary sources for these additional data are federal, state, and local governments, as well as commercial entities.
Many of the census tables are very easy to download and use, with very little formatting and preparation required. Although the U.S. census survey is performed every 10 years, reliable estimates are provided for “non-survey” years.
Here are some excellent datasets easily accessed via the U.S. Census:
The datasets for population and housing unit estimates are released on a flow basis throughout each year. Each new series of data incorporates the latest administrative record data, geographic boundaries, and methodology. The City and Town Population Totals and Metropolitan and Micropolitan Statistical Areas (MSA) Population Totals are incredibly valuable for tracking population changes, in terms of both number and characteristics. The MSA Population Totals dataset contains not only the MSA number but also the county, making it incredibly versatile and a great MSA-to-County crosswalk table.
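As a minimal sketch of the crosswalk idea, the snippet below builds a county-to-MSA lookup from a small inline sample. The rows are invented placeholders, and the column names (`cbsa_code`, `msa_name`, `county_fips`, `county_name`) are assumptions for illustration rather than the Census Bureau's actual headers, so they would need to be mapped to the real file layout.

```python
import csv
import io

# Invented sample rows. The column names are assumptions for illustration,
# not the Census Bureau's actual headers.
SAMPLE = """cbsa_code,msa_name,county_fips,county_name
90001,Example Metro A,99001,Alpha County
90001,Example Metro A,99003,Beta County
90002,Example Metro B,99005,Gamma County
"""


def build_crosswalk(csv_text):
    """Map each county FIPS code to the (code, name) of the MSA it belongs to."""
    crosswalk = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        crosswalk[row["county_fips"]] = (row["cbsa_code"], row["msa_name"])
    return crosswalk


crosswalk = build_crosswalk(SAMPLE)
print(crosswalk["99003"])
```

With a lookup like this, any proprietary table keyed by county can be rolled up to the MSA level.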
ANALYTICS USE CASE: High Growth Areas For Medical Services Using US Census and Homeland Infrastructure Data (HIFLD)
A multinational healthcare services company wanted to bolster their national health care delivery platform operating through hospitals and outpatient centers in the United States. The company was looking to strategically optimize the location of new hospitals and outpatient facilities in new geographic markets.
The company had proprietary financial and utilization data for its own facilities, but it lacked new market information at scale. There were two important questions management wanted to answer with data.
Where is an opportune, profitable market to enter and invest in new facilities?
Census data can be used to answer this question. By analyzing MSA population estimates against the company’s location data, management can identify growing (and diminishing) populations at the MSA level. Then, this data can be plotted on a map to determine ideal markets – and even precise locations to begin construction on new facilities.
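One way to sketch this first stage is to compute the percent change in population per MSA between two estimate years, and then rank the MSAs where the company has no facility yet. Everything in the snippet is hypothetical: the MSA names, the years, the population figures, and the `company_msas` set are placeholders, not real Census or company data.

```python
# Hypothetical population estimates per MSA for two years (invented numbers).
population = {
    "Example Metro A": {2010: 100_000, 2018: 125_000},
    "Example Metro B": {2010: 200_000, 2018: 190_000},
    "Example Metro C": {2010: 50_000, 2018: 56_000},
}

# Hypothetical set of MSAs where the company already operates facilities.
company_msas = {"Example Metro C"}


def growth_rate(estimates, start, end):
    """Percent change in population between the start and end years."""
    return (estimates[end] - estimates[start]) / estimates[start] * 100


def rank_new_markets(population, company_msas, start, end):
    """Return (msa, growth %) for MSAs without a facility, fastest growing first."""
    rows = [
        (msa, round(growth_rate(estimates, start, end), 1))
        for msa, estimates in population.items()
        if msa not in company_msas
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


ranked = rank_new_markets(population, company_msas, 2010, 2018)
for msa, pct in ranked:
    print(f"{msa}: {pct:+.1f}%")
```

Negative values flag shrinking markets, which is just as useful for ruling locations out.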
How can we differentiate? Which hospital and outpatient services are needed?
This is stage two of the analysis. In stage one, we determined which markets are potentially ripe for hospital services. But what about competing hospitals in the area? And what services are underrepresented in the new market? For example, should a new-market hospital facility have a helipad or is there already an existing Level 1 Trauma Center (with a helipad) in proximity?
We can enrich our analysis further with public health data from Homeland Infrastructure (HIFLD) to answer this question. The HIFLD Hospital dataset includes hospital facilities in all 50 states and territories, based on data acquired from various state departments or federal sources. The standard file holds not only each hospital’s ZIP code but also its FIPS code and latitude/longitude. Furthermore, the data contains valuable information on hospital type (critical, general, etc.), NAICS code, ownership, and beds (capacity).
Now a thorough, confident analysis can be performed for management. Not only can competitive hospitals be plotted against the company’s facilities, but also capacity and medical services planning can be performed. Market analysis can further be tuned against high-growth areas to determine the optimal locations for new facilities. Finally, all this enriched data can also be cited when it is time to provide evidence to regulatory bodies in the new market(s) – making those downstream regulatory processes easier.
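The capacity side of this analysis could look roughly like the sketch below, which sums beds per county, counts hospitals by type, and expresses capacity as beds per 1,000 residents. The records, the field names (`county_fips`, `type`, `beds`), and the populations are invented for illustration and are not the HIFLD file's actual headers or values.

```python
from collections import defaultdict

# Invented hospital records. Field names are assumptions, not HIFLD's headers.
hospitals = [
    {"name": "Hospital 1", "county_fips": "99001", "type": "GENERAL", "beds": 300},
    {"name": "Hospital 2", "county_fips": "99001", "type": "CRITICAL", "beds": 25},
    {"name": "Hospital 3", "county_fips": "99003", "type": "GENERAL", "beds": 120},
]

# Invented county populations, e.g. taken from a census estimate table.
county_population = {"99001": 250_000, "99003": 400_000}


def capacity_by_county(hospitals, county_population):
    """Summarize bed capacity and hospital mix for each county."""
    beds = defaultdict(int)
    types = defaultdict(lambda: defaultdict(int))
    for hospital in hospitals:
        fips = hospital["county_fips"]
        beds[fips] += hospital["beds"]
        types[fips][hospital["type"]] += 1

    summary = {}
    for fips, residents in county_population.items():
        summary[fips] = {
            "beds_per_1000": round(beds[fips] / residents * 1000, 2),
            "hospitals_by_type": dict(types[fips]),
        }
    return summary


summary = capacity_by_county(hospitals, county_population)
for fips, stats in summary.items():
    print(fips, stats)
```

Counties with few beds per resident, or with no hospital of a given type, are candidates for a closer look in the market analysis.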
Income & Financial Statistics
This is by far the most valuable public data I have encountered. The IRS collects a wealth of incredibly accurate tax statistics every year. This information is provided via the agency’s “Tax Statistics” page. The SOI Tax Stats, also called “ZIP Code Data,” provide median household income for all states. Data files can be downloaded with or without Adjusted Gross Income (AGI) included. The 2017 Individual Income Tax Statistics dataset is available for download from the IRS.
ANALYTICS USE CASE: Finding Affluent Customers Using IRS Tax Statistics
A large insurance company wanted to target members for outreach to sell premium products, in this case, insurance coverage with richer benefits. The expectation was to identify existing members who would be strong candidates for premium products. Members identified would be included in a direct mail marketing campaign. To maximize the marketing return on investment, only the best candidates would be targeted in the costly campaign.
Affluent members of a certain age and income bracket would be the best candidates for these “buy-up” products. Furthermore, members with property and investments would be considered more affluent compared to those members who did not have these assets.
The company had proprietary membership data including age, gender, number of dependents, and mailing address as well as claims information for its member base. How would the team find the supplemental financial information on its members?
The IRS’ SOI Tax Stats contains a wealth of data collected in tax returns. Metrics such as adjusted gross income, taxable income, real estate taxes, and investment interest paid can all be found in this data. Any data that is reported on an IRS tax return can be found and used at the ZIP code level. This information can be used to determine trends in local markets, or it can be used to enrich your proprietary customer data. In this case, we use these statistics for the latter: to get an idea of where the affluent members are located and advertise premium products to them.
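A minimal sketch of this enrichment step follows, assuming the tax statistics have already been reduced to one row per ZIP code. The field names (`returns`, `agi_total`), the member records, the ZIP codes, and the thresholds are all hypothetical, not the IRS file's real columns or any real customer data.

```python
# Invented ZIP-level tax statistics. Field names are assumptions for illustration.
zip_stats = {
    "00001": {"returns": 1000, "agi_total": 150_000_000},
    "00002": {"returns": 2000, "agi_total": 90_000_000},
}

# Invented member records standing in for proprietary membership data.
members = [
    {"member_id": "M1", "age": 52, "zip": "00001"},
    {"member_id": "M2", "age": 47, "zip": "00002"},
    {"member_id": "M3", "age": 60, "zip": "00003"},  # ZIP missing from the stats
]


def enrich(members, zip_stats):
    """Attach the average AGI per return of each member's ZIP code."""
    enriched = []
    for member in members:
        stats = zip_stats.get(member["zip"])
        avg_agi = stats["agi_total"] / stats["returns"] if stats else None
        enriched.append({**member, "zip_avg_agi": avg_agi})
    return enriched


def campaign_candidates(enriched, min_age, min_avg_agi):
    """Select members who meet the age cutoff and live in a high-AGI ZIP code."""
    return [
        member["member_id"]
        for member in enriched
        if member["zip_avg_agi"] is not None
        and member["age"] >= min_age
        and member["zip_avg_agi"] >= min_avg_agi
    ]


enriched = enrich(members, zip_stats)
candidates = campaign_candidates(enriched, min_age=45, min_avg_agi=100_000)
print(candidates)
```

Note that a ZIP-level average describes the area, not the individual member, so it works as a targeting signal rather than as a statement about any one person's income.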
We have reviewed some principles to keep in mind when searching for public analytics assets. Remember to keep the goal in mind, and if a dataset you found is too challenging to work with, find another one that is easier to use. In addition, I’ve explored some useful, reliable, and free public data assets that can be applied across a variety of situations, with sample use cases showing how to combine them with the proprietary data your company uses.
These public datasets won’t be used on their own, however. You will typically combine these with your existing data as part of a data enrichment process, and further shape and organize the data from there.
Datameer is a powerful SaaS data transformation platform that runs in Snowflake, your modern, scalable cloud data warehouse. Together, they provide a highly scalable and flexible environment to transform your data into meaningful analytics. With Datameer, you can:
- Allow your non-technical analytics team members to work with your complex data without the need to write code using Datameer’s no-code and low-code data transformation interfaces,
- Collaborate amongst technical and non-technical team members to build data models and the data transformation flows to fulfill these models, each using their skills and knowledge,
- Fully enrich analytics datasets to add even more flavor to your analysis using the diverse array of graphical formulas and functions,
- Generate rich documentation and add user-supplied attributes, comments, tags, and more to share searchable knowledge about your data across the entire analytics community,
- Use the catalog-like documentation features to crowd-source your data governance processes for greater data democratization and data literacy,
- Maintain full audit trails of how data is transformed and used by the community to further enable your governance and compliance processes,
- Deploy and execute data transformation models directly in Snowflake to gain the scalability you need over your large volumes of data while keeping compute and storage costs low.
Learn more about our innovative SaaS data transformation solution by scheduling a personalized demo today!