Supported External Data Types and Sources

INFO

Datameer X supports the following structured, semi-structured, and unstructured data types and sources for import and export.

Supported External Databases

INFO

Relational databases include Oracle, DB2, and MySQL.

Database Name | Description
Amazon Athena
  • a query service for running SQL queries against data stored in Amazon S3
Amazon Redshift
  • a quick, scalable data warehouse as a service from the cloud
Azure Cosmos DB
  • a fully managed NoSQL database service
Azure Databricks
  • an Apache Spark based analytics service with an interactive workspace
Azure Synapse
  • an analytics service that lets you query data using either on-demand serverless resources or provisioned resources at scale
DB2
  • IBM's relational database management system
Greenplum
  • an open-source massively parallel processing (MPP) database
HSQL_file
  • a lightweight, 100% Java SQL Database Engine.
MSSQL
  • a relational database based on structured query language
MySQL
  • a relational database based on structured query language
Netezza
  • a column-oriented database management system
Oracle
  • a relational database management system designed for grid computing; CLOB columns are supported when importing data
PostgreSQL
  • an object-relational database management system (ORDBMS)
Sybase IQ
  • a column-based relational database
Teradata Aster
  • a relational database based on structured query language
Vertica 5.1+
  • a grid-based, column-oriented analytic database

Supported External Files

INFO

You can import or upload individual sheets from a spreadsheet by first converting the file to a .CSV file type.

File Type | Description
Apache log files
  • a record of all incoming requests processed by the Apache server
  • requests are written to a log file; the format of the access log is highly configurable
  • the location and content of the access log are controlled by the CustomLog directive
Apache Avro
  • an Avro file contains data serialized in a compact binary format, together with a schema in JSON format that defines the data types
  • an Avro file may also store markers if the datasets are too large and need to be split into subsets when processed
COBOL Copybook
  • is a section of code that defines the data structures of COBOL programs
CSV (comma-delimited text files)
  • stores tabular data (numbers and text) in plain-text form, that is, as a sequence of characters with no data that has to be interpreted as binary numbers
  • consists of any number of records, separated by line breaks of some kind
  • each record consists of fields, separated by some other character or string (most commonly a literal comma or tab)
Excel Workbooks
  • spreadsheet files created with Microsoft Excel
Fixed Width
  • a text file in which each field occupies a fixed number of characters, so columns are identified by position rather than by a delimiter
HTML file
  • a file containing HyperText Markup Language (HTML)
IIS Logs (Internet Information Services)
  • IIS is a web server application and set of feature extension modules for use with Microsoft Windows
  • IIS 7.5 supports HTTP, HTTPS, FTP, FTPS, SMTP, and NNTP
JSON
  • an unordered collection of 'key:value' pairs, comma-separated and enclosed in curly braces
  • keys must be strings and should be distinct from each other
Key/value pair
  • a set of two linked data items: key and value
  • key is the unique identifier for some item of data
  • value is the data that is identified
Log4j log file
  • logging package written in Java
Mbox
  • a generic term for a family of related file formats used for holding collections of electronic mail messages
Netfilter/iptables
  • Netfilter is the packet filtering framework inside the Linux kernel
  • iptables is a user-space application that allows the system administrator to configure the tables provided by the Linux kernel firewall
ORC (Optimized Row Columnar)
  • a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem
Parquet
  • a columnar storage format available to any project in the Apache Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language
Regex Parsable Text Files
  • specify the file or folder, enter a Regex pattern for processing the data, and specify whether the first row contains the column headers
Sequence File with Metadata
  • a flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
Unstructured data
  • such as Twitter data
  • information that doesn't have a predefined data model or doesn't fit well into relational tables
  • unstructured information is typically text-heavy, but might contain data such as dates, numbers, and facts as well
  • this results in irregularities and ambiguities that make it harder for traditional programs to process than data stored in fielded form in databases or annotated in documents
XML data
  • specify the file or folder, the root element, the container element, and the XPath expressions for the fields you want to import into Datameer
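
As a rough illustration of the XML settings in the last row (root element, container element, and XPath expressions for the fields), the following Python sketch pulls records out of a small document using only the standard library. The sample XML, the element names (`orders`, `order`), the field mapping, and the helper `extract_records` are all invented for illustration; this is not a description of how Datameer X implements its importer.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: <orders> is the root element and
# each <order> is a container element holding one record.
SAMPLE_XML = """
<orders>
  <order id="1"><customer><name>Ada</name></customer><total>19.99</total></order>
  <order id="2"><customer><name>Grace</name></customer><total>5.00</total></order>
</orders>
""".strip()

# Column name -> XPath expression, evaluated relative to each container element.
# ElementTree supports only a limited XPath subset, which is enough here.
FIELD_XPATHS = {
    "customer": "customer/name",
    "total": "total",
}


def extract_records(xml_text, root_tag, container_tag, field_xpaths):
    """Return one dict per container element, with one key per configured field."""
    root = ET.fromstring(xml_text)
    if root.tag != root_tag:
        raise ValueError(f"expected root element <{root_tag}>, found <{root.tag}>")
    records = []
    for container in root.iter(container_tag):
        # findtext returns None when the path matches nothing.
        records.append(
            {column: container.findtext(path) for column, path in field_xpaths.items()}
        )
    return records


records = extract_records(SAMPLE_XML, "orders", "order", FIELD_XPATHS)
for record in records:
    print(record)
```

Each container element becomes one row and each XPath expression becomes one column, which is the general shape of the mapping the XML settings describe.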

Supported External File System Connectors

File System | Description
Apache Knox WebHDFS
  • is the REST API and application gateway for the Hadoop ecosystem
Amazon S3 (Simple Storage Service)
  • is a simple web services interface that provides scalable, reliable, secure, fast, and inexpensive infrastructure for backup or storage of data
  • choose this connector if you are using Amazon storage services
Azure Blob Storage
  • a Microsoft storage service for large unstructured binary and text data. Available for Datameer X HDP 2.0+ and CDH 4+ users. Please contact our services department for the connector plug-in.
Custom Protocol
  • can be assigned the same name as a pre-defined protocol, in order to extend the number of IP addresses or ports associated with the original protocol
Datameer X Server Filesystem
  • the local Datameer X filesystem
FTP (File Transfer Protocol)
  • a standard network protocol used to transfer files from one host to another over a TCP-based network, such as the internet
HDFS (Hadoop Distributed File System)
  • a distributed file system used by Hadoop applications that creates multiple replicas of data blocks and distributes them on nodes throughout a cluster to allow extremely rapid computations
OpenStack Swift
  • offers cloud storage software so that you can store and retrieve lots of data with a simple API
  • is built for scale and optimized for durability, availability, and concurrency across the entire data set
  • is ideal for storing unstructured data that can grow without bound
SFTP (SSH File Transfer Protocol)
  • transfers files and encrypts both commands and data, preventing passwords and sensitive information from being transmitted openly over the network
SSH (Secure Shell)
  • is a set of Unix utilities, including SCP and SFTP, that use public-key cryptography and encryption to let you securely transfer files between Unix file systems

INFO: Datameer X supports Bitvise SSH Server/Client for the Windows platform. The root path specified when creating the connection should look something like: /c:/mydata/folder1
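
To make the path format in the note above concrete, here is a minimal Python sketch that rewrites a Windows path into the `/c:/mydata/folder1` shape. The helper name `to_ssh_root_path` is hypothetical, and lower-casing the drive letter is an assumption taken from the single example in the note, not a documented rule.

```python
from pathlib import PureWindowsPath


def to_ssh_root_path(windows_path):
    """Convert a Windows path such as C:\\mydata\\folder1 to /c:/mydata/folder1."""
    path = PureWindowsPath(windows_path)
    # Only plain drive-letter paths are handled; UNC shares are out of scope here.
    if len(path.drive) != 2 or path.drive[1] != ":":
        raise ValueError(f"no drive letter in {windows_path!r}")
    # parts[0] is the anchor (for example 'C:\\'); the rest are folder names.
    folders = path.parts[1:]
    # Lower-casing the drive letter is an assumption based on the example in the note.
    return "/" + "/".join([path.drive.lower(), *folders])


print(to_ssh_root_path(r"C:\mydata\folder1"))
```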

MapR FS
  • a clustered file system that supports both very large-scale and high-performance uses

Datameer X can split large files across multiple mappers, which enables parallel data ingestion. Two requirements must be met for this to be possible.

  1. The file protocol must support splitting. Currently, splitting is supported for all of the protocols above.
  2. The compression type must support splitting. Currently, LZO and Gzip are splittable; zip and Bz2 aren't supported.
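
The two requirements above can be sketched as a simple check. This is only an illustration in Python, not Datameer X code: the function name `can_split`, the mapping from file extensions to compression types, and the treatment of any other extension as uncompressed (and therefore splittable) are all assumptions.

```python
from pathlib import PurePosixPath

# Compression types from the list above, keyed by an assumed file extension.
SPLITTABLE_COMPRESSION = {".lzo", ".gz"}
NON_SPLITTABLE_COMPRESSION = {".zip", ".bz2"}


def can_split(file_name, protocol_supports_splitting=True):
    """Return True if both splitting requirements hold for the given file."""
    # Requirement 1: the file protocol must support splitting.
    if not protocol_supports_splitting:
        return False
    # Requirement 2: the compression type must support splitting.
    suffix = PurePosixPath(file_name).suffix.lower()
    if suffix in SPLITTABLE_COMPRESSION:
        return True
    if suffix in NON_SPLITTABLE_COMPRESSION:
        return False
    # Assumption: any other extension is treated as an uncompressed file.
    return True


for name in ("events.log.gz", "archive.zip", "data.csv.bz2", "table.lzo"):
    print(name, can_split(name))
```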

See Importing Data for more information.

Other Supported External Connectors

External Connector | Description
Google BigQuery
  • is Google's fully managed data warehouse for petabyte analytics
HBase
  • is an open-source, non-relational distributed database
  • is written in Java and runs on top of HDFS
Hive (JDBC)
  • an open source data warehouse system for querying and analysing large data sets stored in Hadoop
Hive Server2 (JDBC)
  • a service that enables clients to execute queries against Hive
  • it supports multi-client concurrency and authentication
  • provides support for open API clients like JDBC and ODBC
IMAP & POP3 (Internet Message Access Protocol & Post Office Protocol 3)
  • IMAP is the internet standard protocol used by email clients to retrieve email messages from a mail server over a TCP/IP connection
  • POP3 is a client/server protocol in which email is received and held on the mail server until the client downloads it
Knox Hive Server2 JDBC
  • connects to a running Hive Server2 JDBC instance through the Apache Knox security gateway
Power BI
  • a business analytics service provided by Microsoft
  • provides interactive visualizations with self-service business intelligence capabilities
Tableau Server
  • a visual analytics platform that hosts Tableau workbooks, data sources, and more