Datameer supports the following types of structured, semi-structured, and unstructured data. See Supported Data Sources for additional details.
Supported databases and data warehouses include:
- Amazon Redshift - A hosted data warehouse product, part of the larger cloud computing platform Amazon Web Services.
- DB2 - The IBM relational database management system.
- Google BigQuery - A serverless SQL data warehouse for the cloud.
- Greenplum - An open-source massively parallel processing (MPP) database.
- MySQL - A relational database based on structured query language. You need to provide the hostname using a syntax such as 18.104.22.168 or anyhost.com. In addition, you need to provide the database name, user name, and password.
- Oracle - A relational database management system designed for grid computing, including CLOB support for importing data.
- Netezza - A column-oriented database management system.
- PostgreSQL - An object-relational database management system (ORDBMS).
- Snowflake - A SQL data warehouse for the cloud.
- Sybase IQ - A column-based, relational database software.
- Teradata 13 - A relational database based on structured query language.
- Vertica - A grid-based and column-oriented analytic database software.
- HSQL (file) - A lightweight, 100% Java SQL database engine. You need to provide the database name you want to use, the username, and the password.
See the documentation on importing data from a database for more information.
Before data can be imported from a database, an administrator needs to Install Database Drivers.
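As an illustration of the connection details a relational source like MySQL needs, here is a minimal sketch that assembles a standard JDBC URL from a hostname and database name. The helper name and sample values are our own, not part of Datameer.

```python
# Hypothetical helper: assemble a standard JDBC URL for a MySQL source
# from the hostname and database name described above.
def mysql_jdbc_url(host: str, database: str, port: int = 3306) -> str:
    """Build a JDBC connection URL; 3306 is MySQL's default port."""
    return f"jdbc:mysql://{host}:{port}/{database}"

print(mysql_jdbc_url("anyhost.com", "sales"))
# jdbc:mysql://anyhost.com:3306/sales
```

The username and password are supplied separately rather than embedded in the URL.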
- Apache log files - Apache server records all incoming requests and all requests processed to a log file. The format of the access log is highly configurable. The location and content of the access log are controlled by the CustomLog directive.
- Apache Avro - A data serialization system that provides rich data structures; a compact, fast binary data format; a container file to store persistent data; remote procedure calls; and simple integration with dynamic languages.
- Cobol Copybook - A COBOL copybook is a section of code that defines the data structures of COBOL programs.
- Comma-delimited text files (.CSV) - This type of file stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, with no data that has to be interpreted as binary numbers instead. A CSV file consists of any number of records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most commonly a literal comma or tab.
- Excel Workbooks - A spreadsheet application developed by Microsoft. Datameer supports Excel 2007 and newer versions and uses the 1900 date system.
- Fixed Width - A file format in which each field occupies a fixed number of character positions, so the columns line up in every record.
- HTML File Type - HyperText Markup Language, the markup language used to display content in a web browser.
- IIS Logs - IIS (Internet Information Services) is a web server application and set of feature extension modules created by Microsoft for use with Microsoft Windows. IIS 7.5 supports HTTP, HTTPS, FTP, FTPS, SMTP and NNTP.
- JSON - An unordered collection of key/value pairs, with the ':' character separating each key from its value; pairs are comma-separated and enclosed in curly braces. The keys must be strings and should be distinct from each other.
- Key/Value Pair - A key-value pair (KVP) is a set of two linked data items: a key, which is a unique identifier for some item of data, and the value, which is either the data that is identified or a pointer to the location of that data.
- Log4j Log File - A log file produced by Log4j, a popular logging package written in Java.
- Mbox - A generic term for a family of related file formats used for holding collections of electronic mail messages.
- Netfilter / IP-Tables - Netfilter is the packet filtering framework inside the Linux kernel. Iptables is a user space application that allows a sys admin to configure tables provided by the Linux kernel firewall.
- ORC (Optimized Row Columnar) - A highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
- Parquet - Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
- Regex Parsable Text Files - Specify the file or folder, enter a Regex pattern for processing the data, and specify whether the first row contains the column headers.
- Sequence File with Metadata - A flat file consisting of binary key/value pairs. It is extensively used in MapReduce as input/output formats.
- Unstructured data (such as Twitter data) - Information that either doesn't have a pre-defined data model or doesn't fit well into relational tables. Unstructured information is typically text-heavy, but might contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional computer programs, compared to data stored in fielded form in databases or annotated in documents.
- XML data - Specify the file or folder, the root element, the container element, and XPath expressions for the fields you want to import into Datameer.
- Azure Blob Storage - A Microsoft storage service for large unstructured binary and text data. Available for Datameer HDP 2.0+ and CDH 4+ users. Please contact our services department for the connector plug-in.
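Several of the formats above, such as Apache access logs and regex-parsable text files, come down to matching a pattern against each line of text. A minimal sketch in Python, where the field names and the sample log line are illustrative rather than taken from Datameer:

```python
import re

# Parse one line of an Apache access log in the default "common" format.
# The group names (host, time, request, status, size) are our own labels.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
m = LOG_PATTERN.match(line)
print(m.group("host"), m.group("status"))  # 127.0.0.1 200
```

Because the access log format is configurable via the CustomLog directive, the pattern has to match whatever format the server was set up to write.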
You can import or upload individual sheets from a spreadsheet by first converting the file to a .CSV file type.
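The CSV structure described above, records separated by line breaks and fields separated by a delimiter, can be seen with Python's standard csv module (the sample data is made up):

```python
import csv
import io

# A CSV file is just records separated by line breaks, with fields
# separated by a delimiter (a comma here); the header row is optional.
data = "name,age\nalice,30\nbob,25\n"
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['name', 'age'], ['alice', '30'], ['bob', '25']]
```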
File System Connectors
- Datameer server filesystem - The local filesystem. Use this choice to set up a local filesystem for use by Datameer.
- FTP - (File Transfer Protocol) A standard network protocol used to transfer files from one host to another over a TCP-based network, such as the Internet.
- HDFS - (Hadoop Distributed File System) A distributed file system used by Hadoop applications that creates multiple replicas of data blocks and distributes them on nodes throughout a cluster to allow extremely rapid computations. You need to provide the location of your HDFS name node, such as hdfs://localhost:9000. In addition, you need to indicate the port used by the job tracker, e.g., localhost:9001. The default value is 9000. Learn more about HDFS.
- S3 – (Amazon Simple Storage Service) is a simple web services interface that provides scalable, reliable, secure, fast, and inexpensive infrastructure for backup or storage of data. Choose this selection if you are using Amazon storage services. You need to provide the S3 Bucket, the Access key, and the access secret. Learn more about S3. Datameer supports the Signature Version 2 signing process.
- SFTP - (SSH File Transfer Protocol) Like FTP, it transfers files and has a similar command set, but unlike FTP, it encrypts both commands and data, preventing passwords and sensitive information from being transmitted openly over the network.
- SSH - (Secure Shell) A set of Unix utilities, including SCP and SFTP, that use public-key cryptography and encryption to let you securely transfer files between Unix file systems. You need to provide the host name, port, username, and password. The default port is 22.
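The name node location mentioned for HDFS is an ordinary URI, so its parts can be inspected with a standard parser. The sample address comes from the text above:

```python
from urllib.parse import urlparse

# Split the hdfs:// address into scheme, host, and port.
uri = urlparse("hdfs://localhost:9000")
print(uri.scheme, uri.hostname, uri.port)  # hdfs localhost 9000
```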
Datameer can split large files across multiple mappers, enabling parallel data ingestion. Two requirements must be fulfilled for this to be possible.
- The file protocol must support splitting. Currently, all of the protocols above are supported.
- The compression type must support splitting. Currently, LZO and Gzip files are splittable; Zip and Bz2 aren't supported.
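The compression rule above can be sketched as a simple lookup. The function and the mapping from file extension to compression type are illustrative, not a Datameer API:

```python
# Illustrative lookup of which compression suffixes can be split for
# parallel ingestion, per the support described above.
SPLITTABLE = {".lzo", ".gz"}
NOT_SUPPORTED = {".zip", ".bz2"}

def is_splittable(filename: str) -> bool:
    """Return True if the file's compression type supports splitting."""
    suffix = filename[filename.rfind("."):].lower()
    return suffix in SPLITTABLE and suffix not in NOT_SUPPORTED

print(is_splittable("events.gz"))   # True
print(is_splittable("events.zip"))  # False
```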
See Importing Data for more information.
Hive - A data warehouse infrastructure built on Hadoop that provides data summarization and ad hoc querying. You need to provide the connection type for the connection where Hive stores its data; this is usually an HDFS or S3 connection. In addition, you need to provide the warehouse location and the metastore URI, in a format such as thrift://host:10000. Learn more about Hive.
HiveServer2 - A server interface that enables remote clients to execute queries against Hive and retrieve the results. The current implementation, based on Thrift RPC, is an improved version of HiveServer and supports multi-client concurrency and authentication. It is designed to provide better support for open API clients like JDBC and ODBC.
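Since HiveServer2 targets JDBC clients, a connection is typically addressed with a jdbc:hive2:// URL. A small sketch assembling one, where the helper name, host, and defaults are placeholders of our own:

```python
# Hypothetical helper: build a HiveServer2 JDBC URL from its parts.
def hive2_jdbc_url(host: str, port: int = 10000, database: str = "default") -> str:
    """Assemble a jdbc:hive2:// connection URL for a HiveServer2 instance."""
    return f"jdbc:hive2://{host}:{port}/{database}"

print(hive2_jdbc_url("hive.example.com"))
# jdbc:hive2://hive.example.com:10000/default
```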