In case you haven’t noticed, we are always striving to maximize efficiency and ease for our customers. As such, one of my favorite features in Datameer is the ability to incrementally load data from relational sources. This allows you to keep fresh data in your Hadoop system while also putting minimal stress on the upstream relational system since you’re only adding changed or updated records to your environment.
For example, consider a retailer running a weekly report that analyzes customer purchasing behavior. Instead of re-importing all past customer data, incremental loading adds only the records for customers who have purchased since the last time the analysis was run, so the company can analyze current data and contrast it with all historical purchasing behavior.
To start, there are three options for importing data into Datameer:
For this example, I will run an Import job from a Teradata data store:
Once I've made my selection, Datameer shows me a preview of the data stored in the database, and I can modify the data types if needed.
Note that for both parallel loading and incremental loading, Datameer requires a “split column.” The split column lets Datameer divide the table into ranges, so multiple data nodes in your Hadoop cluster can each pull a portion of the Teradata table in parallel.
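To make the idea of a split column more concrete, here is a minimal sketch of how a numeric column could be carved into ranges, one per parallel task. This is an illustration only, not Datameer's actual implementation: the function names `split_ranges` and `range_query`, and the table and column names in the usage note, are hypothetical.

```python
def split_ranges(min_value, max_value, num_mappers):
    """Divide the inclusive range [min_value, max_value] into at most
    num_mappers contiguous, non-overlapping (low, high) chunks."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_value < min_value:
        return []  # empty table: nothing to split

    total = max_value - min_value + 1
    chunks = min(num_mappers, total)      # never create empty chunks
    base, extra = divmod(total, chunks)   # spread the remainder evenly

    ranges = []
    low = min_value
    for i in range(chunks):
        size = base + (1 if i < extra else 0)
        high = low + size - 1
        ranges.append((low, high))
        low = high + 1
    return ranges


def range_query(table, split_column, low, high):
    """Build the query one parallel task would run for its chunk.
    (Hypothetical helper; real code should validate identifiers.)"""
    return (
        f"SELECT * FROM {table} "
        f"WHERE {split_column} BETWEEN {low} AND {high}"
    )
```

For instance, `split_ranges(1, 10, 3)` yields three chunks covering ids 1 to 4, 5 to 7, and 8 to 10, and each chunk could be fetched with something like `range_query("orders", "order_id", 1, 4)`. A column with evenly distributed values works best here, since skewed values would leave some tasks with far more rows than others.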
The split column also makes incremental loads possible. Datameer detects fresh records in your table by looking at the split column and adding only those records whose value is higher than the highest value seen in the previous import.
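The detection logic described here is often called a high-water mark. The sketch below shows the general technique on plain Python dictionaries; it is an assumption-laden illustration rather than Datameer's code, and the function name `incremental_load` and the sample rows are hypothetical.

```python
def incremental_load(source_rows, split_column, last_max=None):
    """Return (new_rows, new_max): the rows whose split-column value is
    higher than last_max, plus the updated high-water mark.

    last_max=None means no previous import, so every row is loaded.
    """
    new_rows = [
        row for row in source_rows
        if last_max is None or row[split_column] > last_max
    ]
    # If nothing new arrived, keep the previous high-water mark.
    new_max = max((row[split_column] for row in new_rows), default=last_max)
    return new_rows, new_max


if __name__ == "__main__":
    # Hypothetical source table with an increasing id as the split column.
    orders = [{"order_id": 1}, {"order_id": 2}]

    loaded, mark = incremental_load(orders, "order_id")        # first run
    orders.append({"order_id": 3})                             # new purchase
    loaded, mark = incremental_load(orders, "order_id", mark)  # second run
    print(loaded, mark)
```

One consequence of this approach is that a record is only picked up if its split-column value increases. An ever-growing id catches new rows, while catching updates to existing rows would need a column such as a last-modified timestamp.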
You define the data retention policy in the scheduling tab. In append mode, every time an import job is executed, the incoming records are appended to the existing records in your Hadoop file system. Once you select append mode, you can enable incremental mode, which loads only those records in your table that were not available during the last import, based on the split column.
There are also advanced settings that let you define the number of sample records that are available, the number of mappers (the number of nodes running the import in parallel), and more.
The final step is to select how often you would like to have the import conducted (daily, weekly) and then import! See below for the full tutorial.