...

  1. Click the + (plus) button and select Import Job, or right-click in the browser and select Create new > Import Job.
  2. Click Select Connection and choose the name of your Hive connection (in this example, Connection hive), then click Next.
     
  3. Select a database.

  4. Choose the desired table.

    Supported Hive File Formats

    TEXTFILE - Plain text.

    SEQUENCEFILE - A flat file consisting of binary key/value pairs. It is used extensively as an input/output format in MapReduce.

    ORC (Optimized Row Columnar) - A file format that provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

    RCFILE (Record Columnar File) - A data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework.

    Note

    Hive views are not supported and have been removed from the Table list. This is because a Hive view is a reference to non-materialized data. Since Datameer doesn't support running a Hive query, importing a Hive view isn't currently possible.

    If the filtering of the views is causing a performance issue, add the property below to the Custom Properties in the Hive connector to remove the filter.

    Code Block
    das.hive.exclude.views=false

    Removing the filter to increase performance only makes views appear in the Table drop-down list; they are still not supported for import.

    As of Datameer v6.3, importing Hive views is supported using the JDBC connector with a HiveServer2 implementation.

  5. When the Hive table being imported is partitioned, additional filtering options are available. These work for both string and date/time partitions.
    The following options are available:

    • Filter by values - Select from all available partition values to apply a value-based filter.
    • Filter by fixed dates - Parse partition values as date and time constants and apply a time-based filter using a fixed start and end date. You have to specify a Java date pattern for each partition that is related to a date.
    • Filter by dynamic dates - Parse partition values as date and time constants and apply a filter based on a sliding time window, defined by a start and end date expression. You have to specify a Java date pattern for each partition that is related to a date.

    This filter feature allows for import of data that has already been partitioned on the Hive server. To view how date/time partitions work in data links, refer to Linking to Data. To create partitions within a Datameer workbook, use the Time-based Partitions feature under the Define Fields section of the import.
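
The three partition filter options above can be sketched in a few lines of Python. This is only an illustration of the idea, not Datameer's implementation: the partition values, the function names, and the inclusive start/end boundaries are assumptions, and Python's `strptime` directive `%Y-%m-%d` stands in for the Java date pattern `yyyy-MM-dd`.

```python
from datetime import datetime, timedelta

# Hypothetical partition values, e.g. from a table partitioned by a date column.
PARTITIONS = ["2024-01-30", "2024-01-31", "2024-02-01", "2024-02-02"]

# Python equivalent of the Java date pattern "yyyy-MM-dd" (assumed for this sketch).
PATTERN = "%Y-%m-%d"


def filter_by_values(partitions, selected):
    """Filter by values: keep only the explicitly selected partition values."""
    wanted = set(selected)
    return [p for p in partitions if p in wanted]


def filter_by_fixed_dates(partitions, pattern, start, end):
    """Filter by fixed dates: keep partitions whose parsed date is in [start, end]."""
    return [p for p in partitions if start <= datetime.strptime(p, pattern) <= end]


def filter_by_dynamic_dates(partitions, pattern, now, days_back):
    """Filter by dynamic dates: the window is computed relative to `now`, so it slides."""
    start = now - timedelta(days=days_back)
    return filter_by_fixed_dates(partitions, pattern, start, now)
```

With a fixed-date filter the same partitions are selected on every run, while the dynamic variant selects a different set each time the job runs because its window is derived from the current date.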

  6. Next, you see a preview of the imported data. 

    From the Define Fields page, you can change the data field types and, if necessary, set up date parse patterns.
    By default, the preview includes the columns within the Hive partition but not the partition values. If needed, add the partition values to the import job by marking the included box under the column name.  

    Complex data field types (e.g., lists, structs, maps, and any nested data types) are represented as JSON and displayed as strings.
    You can extract and use this data with the JSON functions (JSON_ELEMENT, JSON_ELEMENTS, JSON_KEYS, JSON_MAP, or JSON_VALUE) after loading this data into a workbook.
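
To make the JSON representation concrete, here is a minimal Python sketch of what a complex Hive value might look like once it arrives as a string, and how nested parts can be pulled out of it. The sample record is invented, and Python's standard `json` module is used only as a stand-in; it does not reproduce the syntax of the Datameer workbook functions.

```python
import json

# Hypothetical struct column value as it might appear after import:
# one JSON document held in a plain string field.
raw = '{"name": "alice", "tags": ["a", "b"], "address": {"city": "Berlin"}}'

record = json.loads(raw)          # parse the string back into nested data

keys = sorted(record.keys())      # top-level keys of the struct
city = record["address"]["city"]  # value of a nested struct field
first_tag = record["tags"][0]     # first element of a list field
```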

    Note

    Datameer jobs are compiled outside of Hive and don't have the same restrictions as Hive queries do. A workbook in Datameer isn't a direct analog to a Hive query and there are often concepts that don't translate back and forth as one-to-one features.

    A filter is similar to a where clause. It restricts the results to include only those that match the requested search criteria.
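
As a rough illustration of that analogy, the Python sketch below applies a filter to a few invented rows; the column names and the condition are assumptions chosen for the example, and the SQL in the comment is only the approximate where-clause equivalent.

```python
# Hypothetical rows standing in for imported records.
rows = [
    {"country": "DE", "amount": 120},
    {"country": "US", "amount": 80},
    {"country": "DE", "amount": 40},
]

# Roughly equivalent to: SELECT * FROM rows WHERE country = 'DE' AND amount > 50
filtered = [r for r in rows if r["country"] == "DE" and r["amount"] > 50]
```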

    Data type mapping when importing from a Hive table

    Hive type        Datameer type

    STRING           String
    VARCHAR          String
    CHAR             String
    BOOLEAN          Boolean
    FLOAT            Float
    DOUBLE           Float
    DECIMAL          Big_decimal
    TINYINT          Integer
    SMALLINT         Integer
    INT              Integer
    BIGINT           Integer
    DATE             Date
    TIMESTAMP        Date
    COMPLEX TYPES    String
    all others       String
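
The mapping in the table above can also be written as a small lookup, which may help when checking an existing Hive schema before import. This is a sketch under assumptions: the function name `datameer_type` and the handling of type parameters such as `varchar(20)` or `array<int>` are illustrative and not part of Datameer.

```python
# Hive type -> Datameer type, taken from the mapping table.
HIVE_TO_DATAMEER = {
    "STRING": "String",
    "VARCHAR": "String",
    "CHAR": "String",
    "BOOLEAN": "Boolean",
    "FLOAT": "Float",
    "DOUBLE": "Float",
    "DECIMAL": "Big_decimal",
    "TINYINT": "Integer",
    "SMALLINT": "Integer",
    "INT": "Integer",
    "BIGINT": "Integer",
    "DATE": "Date",
    "TIMESTAMP": "Date",
}


def datameer_type(hive_type):
    """Return the Datameer type for a Hive type name.

    Type parameters such as "(10,2)" or "<int>" are stripped first.
    Complex types and all other types fall back to String, as in the table.
    """
    base = hive_type.strip().upper().split("(")[0].split("<")[0]
    return HIVE_TO_DATAMEER.get(base, "String")
```

For example, `datameer_type("decimal(10,2)")` yields `Big_decimal`, while `datameer_type("array<int>")` falls through to `String`.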

    Additional advanced features are available to specify how to handle the data.

    Time-based partitions let workbook users partition data by date. This feature allows calculations to run on all or only specific parts of the imported data. See Partitioning Data in Datameer for more information.

  7. Review the schedule, data retention, and advanced properties for the job.

  8. Add a description, click Save, and name the file.

...