Tutorial07 - Building Data Obfuscation on Import

Introduction

This tutorial shows how to build a tool that obfuscates data while it is being imported into Datameer. Obfuscating data in a workbook is sometimes not sufficient, because some information should never reach the cluster in clear text.

Datameer does not currently ship this as a feature, but customers can provide the functionality by adding a plug-in with custom code that performs the obfuscation.

How-to

  1. Create a new plug-in project.
  2. Extend the class datameer.dap.sdk.importjob.extensions.ImportFilterExtension to provide the obfuscation functionality.

Example

This example obfuscates the values of the columns whose names the user enters by applying a SHA1 hash.

ObfuscatingImportFilterExtension.java
/**
 * Adds another property "Obfuscated Columns" to the wizard page and obfuscates all configured
 * columns using SHA1 hash.
 */
public class ObfuscatingImportFilterExtension extends ImportFilterExtension {

    private static final long serialVersionUID = ManifestMetaData.SERIAL_VERSION_UID;
    private static final String KEY = "ObfuscatedColumns";

    @Override
    public String getId() {
        return "EncryptingImportFilterExtension";
    }

    @Override
    public RawRecordCollector decorateRawRecordCollector(Field[] fields, ReadableGenericConfiguration configuration, RawRecordCollector recordCollector) {
        return DecoratingRawRecordCollector.decorate(recordCollector, new ObfuscatingRawRecordDecorator(fields, configuration.getStringProperty(KEY, "")));
    }

    @Override
    public void populateWizardPageImpl(WizardPageDefinition page) {
        PropertyGroupDefinition encryption = page.addGroup("Encryption");
        encryption.addPropertyDefinition(new PropertyDefinition(KEY, "Obfuscated Columns", PropertyType.STRING));
    }
}

The actual obfuscation logic looks like this:

ObfuscatingRawRecordDecorator.java
public class ObfuscatingRawRecordDecorator implements Consumer<RawRecord> {

    private final ImmutableSet<Integer> _columnsToEncrypt;

    public ObfuscatingRawRecordDecorator(Field[] fields, String columnsToEncryptString) {
        // The configured value is a space-separated list of column names.
        String[] columnNames = columnsToEncryptString.split(" ");
        ImmutableSet.Builder<Integer> columnsToEncrypt = ImmutableSet.builder();
        for (String columnName : columnNames) {
            int indexByName = Field.getIndexByName(Field.filterIncludedFields(fields), columnName.trim(), -1);
            // Only names that resolve to an included field are kept.
            if (indexByName != -1) {
                columnsToEncrypt.add(indexByName);
            }
        }
        _columnsToEncrypt = columnsToEncrypt.build();
    }

    static String obfuscate(String string) {
        if (string == null) {
            return null;
        }

        return IoUtil.serializeBase64(Hashing.sha1().newHasher().putString(string, Charsets.UTF_8).hash().asBytes());
    }

    @Override
    public void accept(RawRecord rawRecord) {
        for (Integer columnToEncrypt : _columnsToEncrypt) {
            rawRecord.setValue(columnToEncrypt, obfuscate(StringUtil.toString(rawRecord.getValue(columnToEncrypt), null)));
        }
    }
}

ObfuscatingRawRecordDecorator receives every raw record that is read from the input stream and obfuscates all columns that have been configured on the data details page of the wizard. SHA1 hashing, as used in this example implementation, is only one possible approach.
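
To try the hashing step outside of Datameer, the following is a minimal sketch that uses only JDK classes. The class name Sha1ObfuscationSketch is hypothetical and not part of the SDK, and java.util.Base64 stands in for IoUtil.serializeBase64 as an assumption, so the encoded output may not match the plug-in's output exactly.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

// Hypothetical standalone class for illustration; not part of the Datameer SDK.
class Sha1ObfuscationSketch {

    // Hashes the value with SHA-1 and Base64-encodes the 20-byte digest.
    static String obfuscate(String value) {
        if (value == null) {
            return null; // null cells stay null, as in the tutorial code
        }
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            byte[] hash = digest.digest(value.getBytes(StandardCharsets.UTF_8));
            // Assumption: plain Base64 in place of IoUtil.serializeBase64.
            return Base64.getEncoder().encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The same input always produces the same obfuscated token.
        System.out.println(obfuscate("Alice"));
        System.out.println(obfuscate("Alice").equals(obfuscate("Alice")));
    }
}
```

Because the hash is deterministic, equal clear-text values map to equal obfuscated values, so the obfuscated column can still be used for grouping and joining.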

In the wizard, an additional Encryption section has been added where the columns that should be obfuscated can be configured as a space-separated list of column names.
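
How the configured column list is turned into column indexes can be sketched without the SDK. The example below is an illustration under assumptions: the class ColumnResolutionSketch and its method are hypothetical, and a plain list of column names stands in for the Field array and Field.getIndexByName used by the plug-in.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical standalone class for illustration; not part of the Datameer SDK.
class ColumnResolutionSketch {

    // Splits the configured value on spaces and maps each name to its column index.
    static Set<Integer> resolveColumns(List<String> columnNames, String configured) {
        Set<Integer> indexes = new LinkedHashSet<>();
        for (String name : configured.split(" ")) {
            int index = columnNames.indexOf(name.trim());
            // Unknown names and empty tokens resolve to -1 and are skipped.
            if (index != -1) {
                indexes.add(index);
            }
        }
        return indexes;
    }

    public static void main(String[] args) {
        List<String> columns = List.of("id", "name", "email");
        // "phone" is not a column of this import, so it is ignored.
        System.out.println(resolveColumns(columns, "name email phone"));
    }
}
```

A consequence of splitting on spaces is that a column whose name itself contains a space cannot be matched in this sketch.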

On the next tab, all values of the name column are now obfuscated using the SHA1 hashing algorithm.

  


Our customer services specialists can assist you with more information if required.