Tutorial02 - Building Custom Functions

Introduction

You can add support for custom functions simply by creating a class that implements datameer.dap.sdk.function.FunctionType. Your plug-in is automatically scanned for all classes that implement this interface, and the functions it finds are automatically registered with Datameer.

Function Types

The function type depends on the interface that you implement. Look at the type hierarchy of the FunctionType interface and see the Javadoc for more details.

| Function type | Implemented interface | Description |
| --- | --- | --- |
| Simple function | datameer.dap.sdk.function.SimpleFunctionType | Any simple function that takes a number of arguments and returns a result value. E.g. SUM, MOD, UPPER, etc. |
| Aggregation function | datameer.dap.sdk.function.AggregationFunctionType | Aggregates values of a group that has been created by a "group by" function. E.g. GROUPSUM, GROUPAVERAGE, etc. |
| Group by function | datameer.dap.sdk.function.GroupByFunctionType | Similar to a simple function, this takes a number of arguments and returns a result value. The difference is that the returned value is used as a group that appears only once in the result sheet. Group by functions can only be used together with aggregation functions. E.g. GROUPBY, GROUPBYBIN. |

Implementing a Simple Function

To implement a function, subclass datameer.dap.sdk.function.BaseFunctionType or datameer.dap.sdk.function.BaseSimpleFunctionType. BaseSimpleFunctionType is appropriate when a function always needs all arguments to compute the result value and the result value is of a fixed type.

Here is an example:

Hex2TextFunction.java
package datameer.das.plugin.tutorial02;

import java.nio.charset.StandardCharsets;

import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;

import datameer.dap.sdk.function.ArgumentInfo;
import datameer.dap.sdk.function.BaseSimpleFunctionType;
import datameer.dap.sdk.function.FieldType;
import datameer.dap.sdk.function.FunctionGroup;
import datameer.dap.sdk.function.ReferencesInfo;
import datameer.dap.sdk.function.argument.TextArgumentType;
import datameer.dap.sdk.util.ManifestMetaData;

@SuppressWarnings("deprecation")
public class Hex2TextFunction extends BaseSimpleFunctionType {

    private static final long serialVersionUID = ManifestMetaData.SERIAL_VERSION_UID;

    private static final String NAME = "HEX2TEXT";
    private static final String DESCRIPTION = "Converts a plain hexadecimal encoded string into a human readable UTF-8 string.";

    private static final String EMPTY_STRING = "";

    public Hex2TextFunction() {
        super(FunctionGroup.ENCODING, NAME, DESCRIPTION, new TextArgumentType("String (Hexadecimal)"));
    }

    @Override
    public Object compute(Object... arguments) {
        String hexString = (String) arguments[0];

        // A null, empty, or odd-length input cannot be decoded.
        if (hexString == null || hexString.isEmpty() || hexString.length() % 2 != 0) {
            return EMPTY_STRING;
        }

        try {
            // Two hexadecimal digits become one byte; the bytes are read as UTF-8.
            byte[] bytes = Hex.decodeHex(hexString.toCharArray());
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (DecoderException e) {
            // The input contains characters that are not hexadecimal digits.
            return EMPTY_STRING;
        }
    }

    @Override
    public FieldType computeReturnType(FieldType... argumentTypes) {
        return FieldType.STRING;
    }

    @Override
    public String suggestColumnName(ArgumentInfo argumentInfo, ReferencesInfo referencesInfo) {
        return "Hex2Text";
    }

}
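
To make the expected behavior of HEX2TEXT concrete, the following is a minimal standalone sketch of the decoding step that uses only the JDK, so it can be run without the SDK or commons-codec. The class name HexToTextSketch and the method hexToText are hypothetical and exist only for this illustration; returning an empty string for null input and for non-hexadecimal characters is an assumption, not documented SDK behavior.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone sketch; not part of the Datameer SDK.
class HexToTextSketch {

    // Decodes pairs of hexadecimal digits into bytes and reads them as UTF-8.
    static String hexToText(String hex) {
        // Assumption: null, empty, and odd-length inputs yield an empty string.
        if (hex == null || hex.isEmpty() || hex.length() % 2 != 0) {
            return "";
        }
        byte[] bytes = new byte[hex.length() / 2];
        for (int i = 0; i < bytes.length; i++) {
            int high = Character.digit(hex.charAt(2 * i), 16);
            int low = Character.digit(hex.charAt(2 * i + 1), 16);
            if (high < 0 || low < 0) {
                return ""; // assumption: invalid digits also yield an empty string
            }
            bytes[i] = (byte) ((high << 4) | low);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(hexToText("48656c6c6f"));    // prints "Hello"
        System.out.println(hexToText("c3a4").length()); // two bytes, one character: prints 1
    }
}
```

The second call shows why the byte array is decoded as a whole: a single UTF-8 character can span more than one byte.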

Writing a VALUE function, which converts null values into non-null default values, is more complicated. A null value carries no type information, so you can't inspect it to determine which default value to return. You can solve this problem by extending BaseFunctionType and using the type information that is passed in when creating a ValueComputor.

ValueFunction.java
package datameer.das.plugin.tutorial02;

import java.util.Date;

import datameer.dap.sdk.function.Arguments;
import datameer.dap.sdk.function.BaseFunctionType;
import datameer.dap.sdk.function.DasNumber;
import datameer.dap.sdk.function.FieldType;
import datameer.dap.sdk.function.SimpleFunctionType;
import datameer.dap.sdk.function.ValueComputor;
import datameer.dap.sdk.function.argument.AnyArgumentType;
import datameer.dap.sdk.job.DasJobEnvironment;
import datameer.dap.sdk.schema.ValueType;

@SuppressWarnings("serial")
public class ValueFunction extends BaseFunctionType implements SimpleFunctionType {

    private static class ValueFunctionComputor implements ValueComputor {

        private final Object _defaultValue;

        public ValueFunctionComputor(Object defaultValue) {
            _defaultValue = defaultValue;
        }

        @Override
        public Object compute(Arguments arguments) {
            if (arguments.get(0) == null) {
                return _defaultValue;
            }
            return arguments.get(0);
        }

    }

    public ValueFunction() {
        super("Tutorial02", "_VALUE", "Converts null values to not null default value.", new AnyArgumentType());
    }

    @Override
    public FieldType computeReturnType(FieldType... argumentTypes) {
        return argumentTypes[0];
    }

    @Override
    public ValueComputor createComputor(DasJobEnvironment env, ValueType... argumentTypes) {
        if (argumentTypes[0].isNumeric()) {
            DasNumber defaultValue = argumentTypes[0].createDasNumber();
            defaultValue.set(DasNumber.ZERO);
            return new ValueFunctionComputor(defaultValue.get());
        }
        switch (argumentTypes[0].getValueTypeId()) {
        case BOOLEAN:
            return new ValueFunctionComputor(false);
        case DATE:
            return new ValueFunctionComputor(new Date(0));
        case STRING:
            return new ValueFunctionComputor("");
        default:
            throw new IllegalArgumentException("Unsupported argument type: " + argumentTypes[0]);
        }
    }
}
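
The idea behind this listing, choosing the default once from the declared type and then doing only a null check per record, can be sketched without the SDK. In the sketch below, ColumnType, DefaultValueSketch, and createComputor are hypothetical stand-ins for ValueType and ValueComputor, and the chosen defaults simply mirror the listing above.

```java
import java.util.Date;
import java.util.function.UnaryOperator;

// Hypothetical standalone sketch; not part of the Datameer SDK.
class DefaultValueSketch {

    // Stand-in for the type information the SDK passes to createComputor.
    enum ColumnType { INTEGER, FLOAT, BOOLEAN, DATE, STRING }

    // The type is inspected once here, so the returned computor never has to
    // guess the type of a null value at computation time.
    static UnaryOperator<Object> createComputor(ColumnType type) {
        final Object defaultValue;
        switch (type) {
        case INTEGER:
            defaultValue = 0L;
            break;
        case FLOAT:
            defaultValue = 0.0d;
            break;
        case BOOLEAN:
            defaultValue = Boolean.FALSE;
            break;
        case DATE:
            defaultValue = new Date(0);
            break;
        case STRING:
            defaultValue = "";
            break;
        default:
            throw new IllegalArgumentException("Unsupported type: " + type);
        }
        return value -> value == null ? defaultValue : value;
    }

    public static void main(String[] args) {
        UnaryOperator<Object> integerValue = createComputor(ColumnType.INTEGER);
        System.out.println(integerValue.apply(null)); // prints 0
        System.out.println(integerValue.apply(42L));  // prints 42
    }
}
```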

Another good example of leveraging the flexibility provided by BaseFunctionType is the implementation of the AND function. In the expression =AND(COMPLEX_FUNCTION_1(...); COMPLEX_FUNCTION_2(...)), COMPLEX_FUNCTION_2 doesn't have to be computed when COMPLEX_FUNCTION_1 has returned false. This optimization can easily be implemented by subclassing BaseFunctionType.

AndFunction.java
package datameer.das.plugin.tutorial02;

import datameer.dap.sdk.function.Arguments;
import datameer.dap.sdk.function.BaseFunctionType;
import datameer.dap.sdk.function.FieldType;
import datameer.dap.sdk.function.SimpleFunctionType;
import datameer.dap.sdk.function.ValueComputor;
import datameer.dap.sdk.function.argument.ArgumentType;
import datameer.dap.sdk.function.argument.BooleanArgumentType;
import datameer.dap.sdk.job.DasJobEnvironment;
import datameer.dap.sdk.schema.ValueType;

@SuppressWarnings("serial")
public class AndFunction extends BaseFunctionType implements SimpleFunctionType {

    public AndFunction() {
        super("Tutorial02", "_AND", "Returns TRUE if all of its arguments are TRUE.", new ArgumentType[] { new BooleanArgumentType() }, Integer.MAX_VALUE, new BooleanArgumentType());
    }

    @Override
    public ValueComputor createComputor(DasJobEnvironment env, ValueType... argumentTypes) {
        return new ValueComputor() {

            @Override
            public Object compute(Arguments arguments) {
                for (int i = 0; i < arguments.size(); i++) {
                    if (!(Boolean) arguments.get(i)) {
                        return Boolean.FALSE;
                    }
                }
                return Boolean.TRUE;
            }
        };
    }

    @Override
    public FieldType computeReturnType(FieldType... argumentTypes) {
        return FieldType.BOOLEAN;
    }
}
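
The short-circuit effect can be demonstrated in isolation. The sketch below assumes that an argument is evaluated only when it is requested, which is modelled here with java.util.function.Supplier; LazyAndSketch and its and method are hypothetical names and do not reflect how the SDK's Arguments class is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical standalone sketch; not part of the Datameer SDK.
class LazyAndSketch {

    // Requests arguments one by one and stops at the first FALSE, so later
    // (possibly expensive) arguments are never evaluated.
    static boolean and(List<Supplier<Boolean>> arguments) {
        for (Supplier<Boolean> argument : arguments) {
            if (!argument.get()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        AtomicInteger evaluations = new AtomicInteger();
        Supplier<Boolean> first = () -> {
            evaluations.incrementAndGet();
            return false;
        };
        Supplier<Boolean> second = () -> {
            evaluations.incrementAndGet();
            return true;
        };
        System.out.println(and(Arrays.asList(first, second))); // prints false
        System.out.println(evaluations.get());                 // prints 1
    }
}
```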

Comparison of BaseFunctionType and BaseSimpleFunctionType

Advantages of BaseSimpleFunctionType

  • You only have to implement a compute method to get it running. This is appropriate for a lot of simple functions.

Advantages of BaseFunctionType

This base class provides a lot more flexibility, but you have to write more code (compared to BaseSimpleFunctionType).

  • You can control which of the arguments passed to the function are actually evaluated. If you don't need to evaluate argument 2 to compute the result of the function, then the whole expression for argument 2 isn't evaluated.
  • You can return different value computors depending on the argument types passed to your function. E.g. for SUM you can have different implementations for summing integer or float values. If you do that, you don't have to check the type of the arguments when computing the function results.
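
The second advantage can be illustrated with a small standalone sketch: the implementation is selected once from the argument type, and the per-value code contains no type checks. NumberType, TypedSumSketch, and createSum are hypothetical names used only for this illustration and are not the SDK's SUM implementation.

```java
import java.util.function.BinaryOperator;

// Hypothetical standalone sketch; not part of the Datameer SDK.
class TypedSumSketch {

    // Stand-in for the argument type known when the computor is created.
    enum NumberType { INTEGER, FLOAT }

    // Returns a summing operation specialized for the given type.
    static BinaryOperator<Number> createSum(NumberType type) {
        switch (type) {
        case INTEGER:
            return (a, b) -> a.longValue() + b.longValue();
        case FLOAT:
            return (a, b) -> a.doubleValue() + b.doubleValue();
        default:
            throw new IllegalArgumentException("Unsupported type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(createSum(NumberType.INTEGER).apply(2L, 3L)); // prints 5
        System.out.println(createSum(NumberType.FLOAT).apply(1.5, 2.0)); // prints 3.5
    }
}
```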

Implementing an Aggregation Function

An aggregation function aggregates all values of a group and combines them to a single result value. Here is the example for GROUPCOUNT:

GroupCountFunction.java
package datameer.das.plugin.tutorial02;

import datameer.dap.sdk.function.Arguments;
import datameer.dap.sdk.function.BaseFunctionType;
import datameer.dap.sdk.function.FieldType;
import datameer.dap.sdk.function.IntermediateResultAggregationFunctionType;
import datameer.dap.sdk.function.IntermediateResultAggregator;
import datameer.dap.sdk.job.DasJobEnvironment;
import datameer.dap.sdk.schema.ValueType;

@SuppressWarnings("serial")
public class GroupCountFunction extends BaseFunctionType implements IntermediateResultAggregationFunctionType {

    public GroupCountFunction() {
        super("Tutorial02", "_GROUPCOUNT", "Counts the records of a group.");
    }

    private static class GroupCountAggregator extends IntermediateResultAggregator {

        private long _count = 0;

        @Override
        public void aggregate(Arguments arguments) {
            ++_count;
        }

        @Override
        public Long computeAggregationResult() {
            return _count;
        }

        @Override
        public void aggregateIntermediate(Object intermediateResult) {
            _count += (Long) intermediateResult;
        }

        @Override
        public Object computeIntermediateResult() {
            return _count;
        }

    }

    @Override
    public FieldType computeReturnType(FieldType... argumentTypes) {
        return FieldType.INTEGER;
    }

    @Override
    public ValueType computeIntermediateResultType(ValueType... argumentTypes) {
        return ValueType.INTEGER;
    }

    @Override
    public IntermediateResultAggregator createAggregator(DasJobEnvironment env, ValueType... argumentTypes) {
        return new GroupCountAggregator();
    }
}

Note that the number of values in a group can be very large, so the aggregator must not cache the values in memory. Doing so can cause out-of-memory errors.

Here is the example for GROUPAVERAGE:

ExampleAverageFunction.java
package datameer.das.functions.grouping;

import datameer.dap.sdk.common.Record;
import datameer.dap.sdk.function.ArgumentInfo;
import datameer.dap.sdk.function.Arguments;
import datameer.dap.sdk.function.FunctionGroup;
import datameer.dap.sdk.function.GenericBaseFunctionType;
import datameer.dap.sdk.function.IntermediateResultAggregationFunctionType;
import datameer.dap.sdk.function.IntermediateResultAggregator;
import datameer.dap.sdk.function.argument.FloatArgumentType;
import datameer.dap.sdk.job.DasJobEnvironment;
import datameer.dap.sdk.schema.RecordType;
import datameer.dap.sdk.schema.ValueType;
import datameer.dap.sdk.util.ManifestMetaData;

/**
 * Example {@link IntermediateResultAggregationFunctionType} that implements the
 * {@link IntermediateResultAggregator} protocol properly.
 */
public class ExampleAverageFunction extends GenericBaseFunctionType implements IntermediateResultAggregationFunctionType {

    private static final long serialVersionUID = ManifestMetaData.SERIAL_VERSION_UID;
    private static final RecordType INTERMEDIATE_RESULT_TYPE = RecordType.create(ValueType.FLOAT, ValueType.INTEGER);

    public ExampleAverageFunction() {
        // Only supports values of type FLOAT (i.e., 64-bit Double values).
        super(FunctionGroup.GROUPING, "GROUPAVG", "Returns the average of its arguments.", new FloatArgumentType());
    }

    private static class ExampleAverageAggregator extends IntermediateResultAggregator {
        private double _sum = 0.0D;
        private long _count = 0L;

        @Override
        public void aggregate(Arguments arguments) {
            Double argument = arguments.getFloat(0);
            if (argument != null) {
                _sum += argument;
                _count += 1L;
            }
        }

        @Override
        public Object computeAggregationResult() {
            if (_count == 0) {
                return null;
            }
            return _sum / _count;
        }

        @Override
        public void aggregateIntermediate(Object intermediateResult) {
            Record r = (Record) intermediateResult;
            _sum += r.getDoubleValue(0);
            _count += r.getLongValue(1);
        }

        @Override
        public Object computeIntermediateResult() {
            // Currently we have to create a new record here each time this is called, so the Record
            // instance cannot be cached by the IntermediateResultAggregator
            return new Record(INTERMEDIATE_RESULT_TYPE, _sum, _count);
        }

        @Override
        public void newGroup() {
            _sum = 0.0D;
            _count = 0L;
        }
    }

    @Override
    public ValueType computeReturnSchemaType(ArgumentInfo arguments) {
        return ValueType.FLOAT;
    }

    @Override
    public ValueType computeIntermediateResultType(ArgumentInfo argumentInfo) {
        return INTERMEDIATE_RESULT_TYPE;
    }

    @Override
    public IntermediateResultAggregator createAggregator(DasJobEnvironment env, ArgumentInfo argumentInfo) {
        return new ExampleAverageAggregator();
    }

    @Override
    protected String suggestColumnNameFromOnlyReferenceFromFirstArgument(ArgumentInfo argumentInfo, String referencedColumnName) {
        return "Average_" + referencedColumnName;
    }
}
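
The reason the intermediate result is a (sum, count) pair rather than an average can be shown with a small standalone sketch. It assumes that intermediate results are partial aggregates computed over subsets of a group and merged later; PartialAverageSketch and its methods are hypothetical names that only loosely mirror the aggregator above.

```java
// Hypothetical standalone sketch; not part of the Datameer SDK.
class PartialAverageSketch {

    private double sum = 0.0d;
    private long count = 0L;

    // Adds one raw value to this partial aggregate.
    void aggregate(double value) {
        sum += value;
        count++;
    }

    // Merges a partial (sum, count) pair produced elsewhere.
    void aggregateIntermediate(double partialSum, long partialCount) {
        sum += partialSum;
        count += partialCount;
    }

    double partialSum() {
        return sum;
    }

    long partialCount() {
        return count;
    }

    // Returns the average, or null when nothing was aggregated.
    Double result() {
        if (count == 0) {
            return null;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        PartialAverageSketch a = new PartialAverageSketch();
        a.aggregate(1.0);
        a.aggregate(2.0);
        a.aggregate(3.0);
        PartialAverageSketch b = new PartialAverageSketch();
        b.aggregate(10.0);

        PartialAverageSketch merged = new PartialAverageSketch();
        merged.aggregateIntermediate(a.partialSum(), a.partialCount());
        merged.aggregateIntermediate(b.partialSum(), b.partialCount());

        System.out.println(merged.result());                    // prints 4.0
        System.out.println((a.result() + b.result()) / 2);      // prints 6.0
    }
}
```

Averaging the two partial averages gives 6.0, which is wrong because the partitions have different sizes; merging sums and counts gives the correct 4.0.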

Implementing a Group By Function

Writing a group by function is as simple as writing any other simple function; in addition, you need to implement the GroupByFunctionType marker interface. In most cases it is not necessary to implement a special group by function. For example, if you want to group on text converted to lowercase, you can just combine the GROUPBY and LOWER functions like this: =GROUPBY(LOWER(#sheet!A)).

Here is an example of the GROUPBYBIN function:

GroupByBinFunction.java
package datameer.das.plugin.tutorial02;

import datameer.dap.sdk.function.BaseSimpleFunctionType;
import datameer.dap.sdk.function.FieldType;
import datameer.dap.sdk.function.GroupByFunctionType;
import datameer.dap.sdk.function.argument.ArgumentType;
import datameer.dap.sdk.function.argument.IntegerArgumentType;
import datameer.dap.sdk.function.argument.IntegerAndFloatArgumentType;
import datameer.dap.sdk.util.ManifestMetaData;

public class GroupByBinFunction extends BaseSimpleFunctionType implements GroupByFunctionType {

    private static final long serialVersionUID = ManifestMetaData.SERIAL_VERSION_UID;

    public GroupByBinFunction() {
        super("Tutorial02", "_GROUPBYBIN", "Groups values into bins.", new ArgumentType[] { new IntegerAndFloatArgumentType(), new IntegerArgumentType("Bin size") });
    }

    @Override
    public FieldType computeReturnType(FieldType... argumentTypes) {
        return argumentTypes[0];
    }

    @Override
    public Object compute(Object... arguments) {
        Long value = asLong(arguments[0]);
        Long binSize = (Long) arguments[1];
        return (value / binSize) * binSize;
    }
}
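
The binning arithmetic in compute can be tried out on its own. The sketch below reproduces the expression (value / binSize) * binSize and, as a possible variant that is not part of the tutorial code, shows Math.floorDiv for negative values; BinSketch, bin, and floorBin are hypothetical names.

```java
// Hypothetical standalone sketch; not part of the Datameer SDK.
class BinSketch {

    // Same arithmetic as the compute method above: Java integer division
    // truncates toward zero before the result is scaled back up.
    static long bin(long value, long binSize) {
        return (value / binSize) * binSize;
    }

    // Variant that rounds toward negative infinity, so every bin has the
    // same width even for negative values.
    static long floorBin(long value, long binSize) {
        return Math.floorDiv(value, binSize) * binSize;
    }

    public static void main(String[] args) {
        System.out.println(bin(17, 5));      // prints 15
        System.out.println(bin(-3, 5));      // prints 0
        System.out.println(floorBin(-3, 5)); // prints -5
    }
}
```

With truncating division, the values -3 and 3 both land in bin 0, so that bin is effectively twice as wide as the others. Whether this matters depends on the data; a bin size of 0 throws an ArithmeticException in both variants.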

Source Code

This tutorial can be found in the Datameer plug-in SDK under plugin-tutorials/tutorial02.