Datameer Blog

How-to: Incrementally Load Data from Relational Sources in Datameer

By on April 18, 2014

In case you haven’t noticed, we are always striving to maximize efficiency and ease for our customers. As such, one of my favorite features in Datameer is the ability to incrementally load data from relational sources. This allows you to keep fresh data in your Hadoop system while also putting minimal stress on the upstream relational system since you’re only adding changed or updated records to your environment.

For example, this is useful for a retailer running a weekly report on customer purchasing behavior. Instead of re-importing all past customer data, the retailer can add only the records for customers who have purchased since the last time the analysis was run, then contrast that current data with all of the historical purchasing data.

To start, there are three options for importing data into Datameer:

  • File upload: a one-time import of data from your computer
  • Data link: fetches the data for each workbook to run dynamically
  • Import job: brings the entire database set into Datameer

For this example, I will run an Import job from a Teradata data store:

Once selected, it will show me a preview of the data that is stored in the database and I can make modifications to the data types, if needed.

Note that for both parallel loading and incremental loading, Datameer requires a “split column.” The split column lets Datameer divide the work among multiple nodes in your Hadoop cluster, so they can pull the data from the Teradata table in parallel.
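To make the idea concrete, here is a minimal sketch of how a numeric split column could be carved into one range per parallel task. This is not Datameer's actual implementation; the function names, the `purchases` table, and the `purchase_id` column are all hypothetical and only illustrate the general technique.

```python
def split_ranges(min_value, max_value, num_mappers):
    """Divide the inclusive range [min_value, max_value] into contiguous,
    non-overlapping chunks, at most one per mapper (hypothetical helper)."""
    if max_value < min_value or num_mappers < 1:
        return []
    total = max_value - min_value + 1
    num_chunks = min(num_mappers, total)   # never create empty chunks
    base, extra = divmod(total, num_chunks)
    ranges = []
    start = min_value
    for i in range(num_chunks):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def queries_for(table, split_column, ranges):
    """Build one range query per chunk; each could run on a different node."""
    return [
        f"SELECT * FROM {table} WHERE {split_column} BETWEEN {lo} AND {hi}"
        for lo, hi in ranges
    ]


if __name__ == "__main__":
    # Assumed example: purchase_id runs from 1 to 10 and we use 3 mappers.
    for query in queries_for("purchases", "purchase_id", split_ranges(1, 10, 3)):
        print(query)
```

Because the ranges do not overlap and together cover every value, each row is read exactly once no matter how many tasks share the work.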

The split column also makes incremental loads possible. Datameer detects fresh records in your table by looking at the split column and importing only those records whose value is higher than the highest value from the previous import.

You define the data retention policy in the scheduling tab. In append mode, every time the import job is executed, the incoming records are appended to the existing records in your Hadoop file system. Once you select append mode, you can also enable incremental mode, which loads only those records in your table that were not available during the last import, based on the split column.
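The combination of append mode and incremental mode can be sketched as a simple "high-water mark" loop. The example below is an illustration under stated assumptions, not Datameer's code: it uses Python's built-in `sqlite3` module as a stand-in for the relational source, a plain list as a stand-in for the data already stored in Hadoop, and a hypothetical `purchases` table whose `purchase_id` plays the role of the split column.

```python
import sqlite3


def incremental_import(source, target, last_max):
    """Append rows whose split-column value exceeds last_max to target.

    Returns the new high-water mark (unchanged if nothing new arrived).
    Pass last_max=None for the very first run to load everything.
    """
    if last_max is None:
        cursor = source.execute(
            "SELECT purchase_id, customer FROM purchases ORDER BY purchase_id"
        )
    else:
        cursor = source.execute(
            "SELECT purchase_id, customer FROM purchases "
            "WHERE purchase_id > ? ORDER BY purchase_id",
            (last_max,),
        )
    rows = cursor.fetchall()
    target.extend(rows)  # append mode: existing records are kept
    return rows[-1][0] if rows else last_max


# Stand-in source system with two existing rows (assumed sample data).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE purchases (purchase_id INTEGER PRIMARY KEY, customer TEXT)"
)
source.executemany(
    "INSERT INTO purchases VALUES (?, ?)", [(1, "alice"), (2, "bob")]
)

imported = []                                           # stand-in for stored data
mark = incremental_import(source, imported, None)       # first run: full load

source.execute("INSERT INTO purchases VALUES (3, 'carol')")
mark = incremental_import(source, imported, mark)       # second run: only id 3
```

In this sketch a row is picked up only if its split-column value rises above the stored mark, so the column should be one that grows with each new record, such as an ID or a timestamp.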

There are also advanced settings that let you define, among other things, the number of sample records that are available and the number of mappers (the number of tasks that run the import in parallel).

The final step is to select how often you would like the import to run (for example, daily or weekly), and then start the import. See below for the full tutorial.

Tim Bezold