Can You Trust Your Data In Alteryx?
- John Morrell
- March 5, 2018
Unfortunately, other vendors such as Alteryx don’t think this way, as seen with their most recent careless data breach. Cavalier use and attitudes towards data lead to missteps that cause breaches.
But it is also important to remember that with security and governance, architecture matters. In their most recent Market Guide for Data Preparation, Gartner states:
“Data preparation — the most time-consuming task in analytics and BI — is evolving from a self-service activity to an enterprise imperative.”
And as data preparation moves to an enterprise level, there are important architectural aspects (as well as features) that will determine how effective your security and governance policies are at protecting your data and keeping you in regulatory compliance.
Why Is This Important?
One result of big data analytics is that companies gather and use more information about their customers to help serve them better and more efficiently. Much of the data used in the analytics process revolves around “who” the customer is, “what” actions they take, and “how” they behave.
A great deal of this data is “personal” information, or PII, about the customer, which requires extra care and safekeeping. Many government regulations stipulate how PII must be secured to protect consumers, and they deter non-compliance with stiff fines for breaches.
Additional regulations, such as GDPR, stipulate that you must secure data and also regulate how you can use it based on consent. This means you need an end-to-end audit trail of how data is used.
Typical Self-Service Data Prep Architectures
Many self-service data preparation tools such as Alteryx are designed to run on an analyst’s desktop or a local server. The self-service UIs and the ability to work on data locally are the two initial aspects that attract individual analysts or groups to these products. This gives them independence from the IT teams.
But these architectures have one inherent and rather large flaw: data movement. Set aside the fact that moving data can be very costly in resources. Moving data creates a highly disjointed overall architecture that cannot provide end-to-end security and governance.
Once data is moved from your data lake, it is up to the local users or teams to apply security policies and govern how data is used. No enterprise policies can be applied and enforced. You lose track of security and how the data is being used and cannot effectively govern the data any longer.
Products such as Alteryx can be even more dangerous because they have specific add-ons to enrich your data with consumer “demographic, segmentation, and firmographic data from Experian, D&B, the US Census Bureau, and more.” This data was discovered exposed on the internet and deemed highly dangerous by Chris Vickery, a cybersecurity researcher from UpGuard.
While intended to serve a good purpose in analytics, independent self-service data preparation products running on desktops or local servers can leave this data unsecured and open to misuse. This could bring major fines from regulatory bodies and loss of trust with the public, leading to lost sales.
The Right Approach
Datameer has the industry’s most extensive security, governance, and auditing features to ensure your data is protected and to continually verify that it is being used properly. It prepares and processes data in one place, eliminating security holes due to data movement.
Besides data movement, you should examine other architectural considerations and features to best understand the security and governance implications for your big data. Let’s look at five key ways you can properly secure your big data and analytics.
1. Keep the Data in Place
As we’ve discussed, processing the data directly on your data lake is the first rule of good big data security and governance. Once data is moved, you immediately start to lose track of how the data is secured, governance becomes difficult, and auditing how the data is used becomes impossible.
A big data platform that prepares, curates, and lets analysts explore data directly on the data lake eliminates these problems. A single suite of security, governance, and management policies can be applied to the data, letting the IT teams ensure the data is fully secured.
Another feature that helps with security and minimizes data movement is data links. With data links, copies of data are not created in the data lake at all. Data remains in the source system until execution time. Then the data is processed for preparation and curation purposes, and only the result is maintained.
Lastly, data retention policies ensure you don’t keep older copies or excess volumes of data at any processing stage on the data lake. This ensures that only the data needed for the current processing and analysis resides on the data lake and further minimizes copies.
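To make the retention idea concrete, here is a minimal sketch in Python of how a retention rule might pick stored copies to purge. It is not Datameer’s API: the `Snapshot` class, the `select_expired` function, and the 30-day window are hypothetical names and values chosen for illustration. The sketch assumes a policy that purges copies older than a maximum age while always keeping the newest copy of each dataset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Snapshot:
    """One stored copy of a dataset at some processing stage (hypothetical model)."""
    dataset: str
    created_at: datetime


def select_expired(snapshots, now, max_age, keep_latest=1):
    """Return the snapshots that the retention policy would purge.

    A snapshot is expired when it is older than max_age, except that the
    newest keep_latest snapshots of each dataset are always retained.
    """
    by_dataset = {}
    for snap in snapshots:
        by_dataset.setdefault(snap.dataset, []).append(snap)

    expired = []
    for group in by_dataset.values():
        # Newest first, so the first keep_latest entries are protected.
        group.sort(key=lambda s: s.created_at, reverse=True)
        for snap in group[keep_latest:]:
            if now - snap.created_at > max_age:
                expired.append(snap)
    return expired


now = datetime(2018, 3, 5)
snapshots = [
    Snapshot("customers", now - timedelta(days=1)),
    Snapshot("customers", now - timedelta(days=10)),
    Snapshot("customers", now - timedelta(days=40)),
]
to_purge = select_expired(snapshots, now, max_age=timedelta(days=30))
print([s.created_at.date().isoformat() for s in to_purge])
```

In this example only the 40-day-old copy is selected. The 10-day-old copy is inside the window, and the newest copy is protected regardless of age.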
2. Secure the Data and Connections
At the heart of any good data preparation platform is solid, strong role-based security. This ensures only the right people in the proper roles have access to specific data. One should use role-based security to ensure PII access is only made available to the people and processes that require it.
However, be aware that many data preparation platforms only apply security to entire data workflows or pipelines. Thus, analysts gain access to the artifacts and data at any stage in the pipeline.
In Datameer, artifacts in a data workflow are secured and managed independently, allowing administrators to clamp down security at various points in the pipeline while enabling freedom at other points. Data connections and source data can be highly secured with more accessibility later in the pipeline to see the resulting output.
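The difference between pipeline-level and artifact-level security can be sketched in a few lines of Python. This is an illustrative model only, not how Datameer implements it: the `Artifact` class, the `can_read` function, and the role names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Artifact:
    """A pipeline artifact with its own access list (hypothetical model)."""
    name: str
    allowed_roles: set = field(default_factory=set)


def can_read(user_roles, artifact):
    """A user may read an artifact if they hold at least one allowed role."""
    return bool(set(user_roles) & artifact.allowed_roles)


# Each stage carries its own policy: tight at the source, looser downstream.
connection = Artifact("crm_connection", {"data_engineer"})
raw_import = Artifact("crm_raw_import", {"data_engineer"})
summary = Artifact("sales_by_region", {"data_engineer", "analyst"})

analyst_roles = ["analyst"]
for artifact in (connection, raw_import, summary):
    print(artifact.name, can_read(analyst_roles, artifact))
```

Because the check is made per artifact, an analyst can be granted the aggregated output without also gaining access to the connection or the raw import that feeds it.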
Finally, the big data platform must integrate security with the data lake’s underlying Hadoop security. Secure impersonation enables secure processing of the data in conjunction with the Hadoop cluster resources. Integration with Hadoop data source security such as Sentry and Ranger, in conjunction with data links, enables seamlessly managed security for data in the data lake.
3. Encrypt Where You Need To
Data encryption should be a first-class citizen in your big data platform, not an add-on or afterthought. You should be able to encrypt data directly and mask or obfuscate specific fields inside of your data.
To properly secure your data, you need both encryption-at-rest and encryption-in-transit. Datameer works seamlessly with HDFS and YARN’s built-in capabilities to encrypt all data in Hadoop and adds wire-level SSL encryption of all data transmitted to the user’s browser.
Administrators also need the ability to obfuscate columns upon ingestion to anonymize this data from analysts or users consuming the data downstream. This helps ensure certain fields representing private or personal data cannot be seen by unauthorized users. Administrators can also create different data connections representing the same core data set where various fields are obfuscated or not, depending upon the users’ needs.
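One common way to obfuscate a column at ingestion is to replace each value with a keyed hash. The Python sketch below shows that technique using the standard `hmac` and `hashlib` modules. It is an assumption about how such a step could work, not a description of Datameer’s obfuscation feature. The function names, the field names, and the placeholder key are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(value, key):
    """Replace a value with a keyed hash so it stays joinable but unreadable."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def obfuscate_record(record, sensitive_fields, key):
    """Return a copy of record with the sensitive fields pseudonymized."""
    return {
        name: pseudonymize(str(value), key) if name in sensitive_fields else value
        for name, value in record.items()
    }


# Placeholder key for illustration; a real key belongs in a secrets manager.
SECRET_KEY = b"example-key-for-illustration-only"
row = {"email": "jane@example.com", "region": "EMEA", "order_total": 129.5}
safe_row = obfuscate_record(row, {"email"}, SECRET_KEY)
print(safe_row["region"], safe_row["email"][:12])
```

The same input always yields the same hash, so downstream analysts can still join and count on the obfuscated column. Note that a keyed hash is pseudonymization rather than full anonymization, since anyone holding the key can recompute the mapping.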
4. Track Lineage
As mentioned earlier, copying and moving data creates a problem in tracking where the data came from, where it ended up, and everything in between. Tracking end-to-end lineage is critical to many of the regulatory controls that govern how organizations can use PII, including GDPR.
Processing your data directly on the data lake provides the key underpinning to supporting and tracking end-to-end lineage. But your data preparation and exploration platform also needs to fully understand how data moves through a pipeline, how it is transformed, and how different assets are consumed.
Datameer’s lineage capabilities provide full tracking of dependencies across all artifacts inside the big data analytics platform from the time data enters, through each transformation, calculation, and analytic function, to when data is consumed. Every aspect of an artifact is tracked as jobs are run and security policies are applied.
Datameer also tracks when artifacts within a pipeline change and the details of how various calculations are adjusted. This information is critical for compliance and proving to regulatory bodies how various datasets were used.
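Lineage tracking of this kind is usually modeled as a dependency graph. The Python sketch below shows the general idea and does not represent Datameer’s internals: the `LineageGraph` class, its methods, and the artifact names are hypothetical. Each processing step records which inputs produced which output, and a traversal answers the compliance question “which sources fed this report?”

```python
from collections import defaultdict


class LineageGraph:
    """Records which artifacts each artifact was derived from (hypothetical)."""

    def __init__(self):
        self._parents = defaultdict(set)

    def record(self, output, inputs, operation):
        """Note that output was produced from inputs by operation."""
        for source in inputs:
            self._parents[output].add((source, operation))

    def upstream(self, artifact):
        """Return every artifact that the given artifact depends on."""
        seen = set()
        stack = [artifact]
        while stack:
            current = stack.pop()
            for source, _operation in self._parents.get(current, ()):
                if source not in seen:
                    seen.add(source)
                    stack.append(source)
        return seen


lineage = LineageGraph()
lineage.record("customers_clean", ["crm_raw"], "drop_duplicates")
lineage.record("orders_joined", ["customers_clean", "orders_raw"], "join")
lineage.record("sales_report", ["orders_joined"], "aggregate")
print(sorted(lineage.upstream("sales_report")))
# prints ['crm_raw', 'customers_clean', 'orders_joined', 'orders_raw']
```

This only works when every step runs where the graph can observe it. Once data is copied to a desktop tool, the chain of recorded steps ends at the export.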
5. Keep Usage Audit Trails
The last item in ensuring proper security and governance is keeping detailed audit trails that show exactly how data was accessed and used. This is important not only for internal tracking purposes but also for the last step in the regulatory compliance process: showing how data was used.
Datameer’s auditing capabilities cover all relevant user and system events, including data creation and modification, job executions, authentication and authorization actions, and data downloads. Audit analysis should be available within the platform, and the audit data should be exportable to an external system for additional analysis.
In addition, Datameer auditing contains important data about users and their interaction with the system, not just the data. This includes information about groups and roles, their assignments, artifact sharing, logins and failed login attempts, password updates, enabling and disabling of specific users, and more.
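As a rough illustration of what such an audit trail involves, the Python sketch below records structured events, filters them inside the application, and exports them as JSON Lines for an external system. It is a simplified, hypothetical model rather than Datameer’s audit format: the `AuditLog` class, its field names, and the sample users and actions are assumptions.

```python
import json
from datetime import datetime, timezone


class AuditLog:
    """Append-only list of audit events, exportable as JSON Lines."""

    def __init__(self):
        self._events = []

    def record(self, user, action, target, success=True):
        """Append one event with a UTC timestamp."""
        self._events.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "action": action,
            "target": target,
            "success": success,
        })

    def query(self, **filters):
        """Return events whose fields match all given filters."""
        return [e for e in self._events
                if all(e.get(k) == v for k, v in filters.items())]

    def export_jsonl(self):
        """Serialize all events, one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self._events)


audit = AuditLog()
audit.record("jsmith", "login", "platform", success=False)
audit.record("jsmith", "login", "platform")
audit.record("jsmith", "download", "sales_report")
failed = audit.query(action="login", success=False)
print(len(failed), len(audit.export_jsonl().splitlines()))
# prints 1 3
```

Recording user events such as failed logins next to data events such as downloads lets a single query reconstruct who did what, which is the evidence regulators typically ask for.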
Because many big data use cases work with sensitive and private data – both internal data and information about customers – security and governance become extra critical to managing your big data pipelines. In the case of security and governance, ARCHITECTURE MATTERS, as well as features!
Ensure your platform has the best possible security and governance features by looking at each of these five critical aspects of big data security and governance. With increased regulation and the risk of substantial fines, your company cannot afford breaches or misuse of data. This is why some of the largest organizations in regulated industries trust Datameer.
To learn more about big data security and governance, please visit our website at https://www.datameer.com/data-governance/.