1. Apache Drill – Introduction


drill 来源于 goole的Dremel

巨量数据查询快:Dremel can scan 35 billion rows without an index within ten-seconds

因为使用行存储:Dremel stores data in a columnar storage model, which means that it separates a record into column values and then stores each value on a different storage volume. But at the same time, traditional databases store the whole record on one volume. This columnar approach is the main reason that it makes Dremel drastically fast.


Apache Drill is a low latency schema-free query engine for big data. Drill uses a JSON document model internally which allows it to query data of any structure.

Drill works with a variety of non-relational data stores, including Hadoop, NoSQL databases (MongoDB, HBase) and cloud storage like Amazon S3, Azure Blob Storage, etc. Users can query the data using a standard SQL and BI Tools, which doesn’t require to create and manage schemas.


Benefits

Following are some of the most important benefits of Apache Drill:

 Drill can scale data from a single node to thousands of nodes and query petabytes of data within seconds.

 Drill supports user defined functions.

 Drill's symmetrical architecture and simple installation makes it easy to deploy and operate very large clusters.

 Drill has flexible data model and extensible architecture.

 Drill columnar execution model performs SQL processing on complex data without flattening into rows.

 Supports large datasets

Key Features

Following are some of the most significant key features of Apache Drill:

 Drill’s pluggable architecture enables connectivity to multiple datastores.

 Drill has a distributed execution engine for processing queries. Users can submit requests to any node in the cluster.

 Drill supports complex/multi-structured data types.

 Drill uses self-describing data where a schema is specified as a part of the data itself, so no need for centralized schema definitions or management.

 Flexible deployment options either local node or cluster.

 Specialized memory management that reduces the amount of main memory that a program uses or references while running and eliminates garbage collections.

 Decentralized data management.

Use Cases

Apache Drill can work along with a few other softwares, some of which are:

 Cloud JSON and Sensor Analytics: Drill’s columnar approach leverages to access JSON data and expose those data via REST API to apply sensor analytics information.

 Works well with Hive: Apache Drill serves as a complement to Hive deployments with low latency queries. Drill’s hive metastore integration exposes existing datasets at interactive speeds.

 SQL for NoSQL: Drill’s ODBC driver and powerful parallelization capabilities provide interactive query capabilities.


Need for Drill

Apache Drill comes with a flexible JSON-like data model to natively query and process complex/multi-structured data. The data does not need to be flattened or transformed either at the design time or runtime, which provides high performance for queries.

Drill exposes an easy and high performance Java API to build custom functions.

Apache Drill is built to scale to big data needs and is not restricted by memory available on the cluster nodes.


Drill Integration

Drill has to integrate with a variety of data stores like relational data stores or non-relational data stores. It has the flexibility to add new data stores.

Integration with File Systems

 Traditional file system: Local files and NAS (Network Attached Storage)

 Hadoop: HDFS and MAPR-FS (MAPR-File System)

 Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage

Integration with NoSQL Databases

 MongoDB

 HBase

 HIVE

 MapR-DB


2. Apache Drill – Fundamentals

Drill Nested Data Model

Apache Drill supports various data models.

The initial goal is to support the column-based format used by Dremel,

then it is designed to support schema less models such as JSON, BSON (Binary JSON)

and schema based models like Avro and CSV.


JSON

JSON (JavaScript Object Notation) is a lightweight text-based open standard designed for human-readable data interchange.

JSON format is used for serializing and transmitting structured data over network connection. It is primarily used to transmit data between a server and web applications.

JSON is typically perceived as a format whose main advantage is that it is simple and lean.

It can be used without knowing or caring about any underlying schema.


{ "$schema": "http://json-schema.org/draft-04/schema#", "title": "Product", "description": “Classical product catalog", "type": "object", "properties": { "id": { "description": "The unique identifier for a product", "type": "integer" }, "name": { "description": "Name of the product", "type": "string" }, 9 "price": { "type": "number", "minimum": 0, "exclusiveMinimum": true } }, "required": ["id", "name", "price"] }


The JSON Schema has the capability to express basic definitions and constraints for data types contained in objects,

and it also supports some more advanced features such as properties typed as other objects, inheritance, and links.


Apache Avro

Avro is an Apache open source project that provides data serialization and data exchange services for Hadoop.

These services can be used together or independently.

Avro is a schema-based system. A language-independent schema is associated with its read and write operations. Using Avro, big data can be exchanged between programs written in any language. Avro supports a rich set of primitive data types including numeric, binary data and strings, and a number of complex types including arrays, maps, enumerations and records.

A key feature of Avro is the robust support for data schemas that change over time.


Simple Avro Schema

Avro schema is created in JavaScript Object Notation (JSON) document format, which is a lightweight text-based data interchange format.


{ " type " : "record", " namespace " : "AvroSample", " name " : "Employee", " fields " : [ { "name" : " Name" , "type" : "string" }, { "name" : "age" , "type" : "int" } ] }


The above schema contains four attributes, they have been briefly described here:

 type − Describes document type, in this case a “record"

 namespace − Describes the name of the namespace in which the object resides

 name − Describes the schema name

 fields − This is an attribute array which contains the following

 name − Describes the name of field

 type − Describes data type of field


Nested Query Language

Apache Drill supports various query languages.

The initial goal is to support the SQL-like language used by Dremel and Google BigQuery.

DrQL and Mongo query languages are an examples of Drill nested query languages.


DrQL

The DrQL (Drill Query Language) is a nested query language.

DrQL is SQL like query language for nested data.

It is designed to support efficient column-based processing.


Mongo Query Language

The MongoDB is an open-source document database, and leading NoSQL database.

MongoDB is written in C++ and it is a cross-platform, document-oriented database that provides, high performance, high availability, and easy scalability. MongoDB works on the concept of collection and documenting.

Wherein, collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A collection exists within a single database. A document is a set of key-value pairs.



Drill File Format

Drill supports various file formats such as CSV, TSV, PSV, JSON and Parquet.

Wherein, “Parquet” is the special file format which helps Drill to run faster and its data representation is almost identical to Drill data representation.


Parquet

Parquet is a columnar storage format in the Hadoop ecosystem. Compared to a traditional row-oriented format, it is much more efficient in storage and has better query performance.

Parquet stores binary data in a column-oriented way, where the values of each column are organized so that they are all adjacent, enabling better compression.


It has the following important characteristics:

 Self-describing data format

 Columnar format

 Flexible compression options

 Large file size


Flat Files Format

The Apache Drill allows access to structured file types and plain text files (flat files). It consists of the following types –

 CSV files (comma-separated values)

 TSV files (tab-separated values)

 PSV files (pipe-separated values)


CSV file format: A CSV is a comma separated values file, which allows data to be saved in a table structured format. CSV data fields are often separated or delimited by comma (,). The following example refers to a CSV format.

firstname, age

Alice,21

Peter,34

This CSV format can be defined as follows in a drill configuration.

"formats": {

"csv": {

"type": "text",

"extensions": [

“csv2"

],

"delimiter": “,”

}

}


TSV file format: The TSV data fields are often separated or delimited by a tab and saved with an extension of “.tsv" format. The following example refers to a TSV format.

firstname age

Alice 21

Peter 34

The TSV format can be defined as follows in a drill configuration.

"tsv": {

"type": "text",

"extensions": [

"tsv"

],

"delimiter": “\t"

},


PSV file format: The PSV data fields are separated or delimited by a pipe (|) symbol. The following example refers to a PSV format.

firstname|age

Alice|21

Peter|34

The PSV format can be defined as follows in a drill configuration.

"formats": {

"psv": {

"type": "text",

"extensions": [

"tbl"

],

"delimiter": "|"

}

}

These PSV files are saved with an extension of “.tbl” format.


Scalable Data Sources

Managing millions of data from multiple data sources requires a great deal of planning.

When creating your data model, you need to consider the key goals such as the impact on speed of processing, how you can optimize memory usage and performance, scalability when handling growing volumes of data and requests.

Apache Drill provides the flexibility to immediately query complex data in native formats, such as schema-less data, nested data, and data with rapidly evolving schemas.

Following are its key benefits:

 High-performance analysis of data in its native format including self-describing data such as Parquet, JSON files and HBase tables.

 Direct querying of data in HBase tables without defining and maintaining a schema in the Hive metastore.

 SQL to query and work with semi-structured/nested data, such as data from NoSQL stores like MongoDB and online REST APIs.


Drill Clients

Apache Drill can connect to the following clients –

 Multiple interfaces such as JDBC, ODBC, C++ API, REST using JSON

 Drill shell

 Drill web console (http://localhost:8047)

 BI tools such as Tableau, MicroStrategy, etc.

 Excel