Getting Started with Druid 0.17 - Data Access Guide

In the quick start we demonstrated how to load local sample data, but Druid supports a much richer set of data access methods, such as batch data access and real-time streaming data access. In this article we will introduce these methods.

  • File data access: loading batch data from a file
  • Kafka data access: loading stream data from Kafka
  • Hadoop data access: loading batch data from Hadoop
  • Custom data access: writing your own ingestion spec

This article focuses on the first two, which are the most commonly used.

1. Loading data from a file

Druid provides several ways to load data:

  • Through the web console Data Loader

  • By submitting a JSON task in the console

  • From the command line

  • By calling the API with curl

1.1 Data Loader

Druid ships with a sample data file containing Wikipedia edit events from September 12, 2015.

This sample data is located at quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz
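
If you want to peek at the raw file first, you can print the first record from the command line (a quick optional check, run from the Druid installation directory):

gunzip -c quickstart/tutorial/wikiticker-2015-09-12-sampled.json.gz | head -n 1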

The sample data looks like this:

{
  "timestamp":"2015-09-12T20:03:45.018Z",
  "channel":"#en.wikipedia",
  "namespace":"Main",
  "page":"Spider-Man's powers and equipment",
  "user":"foobar",
  "comment":"/* Artificial web-shooters */",
  "cityName":"New York",
  "regionName":"New York",
  "regionIsoCode":"NY",
  "countryName":"United States",
  "countryIsoCode":"US",
  "isAnonymous":false,
  "isNew":false,
  "isMinor":false,
  "isRobot":false,
  "isUnpatrolled":false,
  "added":99,
  "delta":99,
  "deleted":0,
}

As listed above, Druid supports loading from files, from Kafka, from Hadoop, and through custom methods. Let's demonstrate loading the sample file data through the Data Loader.

1.1.1 Go to localhost:8888 and click Load data

1.1.2 Select Local disk

1.1.3 Click Connect data

1.1.4 Preview the data

Enter quickstart/tutorial/ as the Base directory

Enter wikiticker-2015-09-12-sampled.json.gz as the File filter

Then click Apply to preview the data, and click Next: Parse data to parse it

1.1.5 Parse the data

You can see that the JSON data has been parsed; continue on to parsing the time

1.1.6 Parse the time

After the time is parsed successfully, the next two steps are Transform and Filter; they are not demonstrated here, so just click Next

1.1.7 Confirm the schema

In this step you confirm the schema and can make changes to it

Because the amount of data is small, we turn off Rollup and go straight to the next step

1.1.8 Set the segment granularity

Here you can set how the data is segmented; we choose hour and click Next

1.1.9 Confirm and publish

1.1.10 After publishing succeeds, the ingestion task starts

Wait for the task to succeed

1.1.11 View the data

Select Datasources to see the data we loaded

You can see that the datasource is fully available, along with its size and other information

1.1.12 Query the data

Click the Query button

We can write SQL queries and download the data
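
For example, assuming the datasource was named wikipedia, a simple query that counts edits per channel looks like the sketch below. You can paste just the SQL into the query view, or submit it to Druid's SQL API (the /druid/v2/sql endpoint on the router) with curl:

curl -X 'POST' -H 'Content-Type:application/json' http://localhost:8888/druid/v2/sql -d '{"query":"SELECT channel, COUNT(*) AS edits FROM wikipedia GROUP BY channel ORDER BY edits DESC LIMIT 10"}'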

1.2 Console

In the Tasks view, click Submit JSON task

This opens the spec submission dialog; paste in the following spec

{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "wikipedia",
      "dimensionsSpec" : {
        "dimensions" : [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          { "name": "added", "type": "long" },
          { "name": "deleted", "type": "long" },
          { "name": "delta", "type": "long" }
        ]
      },
      "timestampSpec": {
        "column": "time",
        "format": "iso"
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-12/2015-09-13"],
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat" :  {
        "type": "json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}

You can then view the load task in the Tasks view.

1.3 Command Line

For convenience, Druid provides a script for loading data:

bin/post-index-task

We can run the command:

bin/post-index-task --file quickstart/tutorial/wikipedia-index.json --url http://localhost:8081

You see the following output:

Beginning indexing data for wikipedia
Task started: index_wikipedia_2018-07-27T06:37:44.323Z
Task log:     http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/log
Task status:  http://localhost:8081/druid/indexer/v1/task/index_wikipedia_2018-07-27T06:37:44.323Z/status
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task index_wikipedia_2018-07-27T06:37:44.323Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia loading complete! You may now query your data

You can again view the load task in the Tasks view.

1.4 CURL

We can also load data by calling the API directly with curl

curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-index.json http://localhost:8081/druid/indexer/v1/task

After a successful submission, the task ID is returned

{"task":"index_wikipedia_2018-06-09T21:30:32.802Z"}

2. Loading stream data from Apache Kafka

Apache Kafka is a high-performance messaging system written in Scala, an open-source project developed by the Apache Software Foundation.

Kafka was originally developed by LinkedIn and open-sourced in early 2011; it graduated from the Apache Incubator in October 2012. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data.

For more about Kafka, see the Kafka Getting Started guide (with detailed screenshots).

2.1 Install Kafka

We download and install a fresh copy of Kafka:

curl -O https://archive.apache.org/dist/kafka/2.1.0/kafka_2.12-2.1.0.tgz
tar -xzf kafka_2.12-2.1.0.tgz
cd kafka_2.12-2.1.0
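
Note that Kafka needs a running ZooKeeper. With the default configuration it connects to localhost:2181, which the Druid quickstart already starts, so normally nothing extra is needed; if you are running Kafka on its own, you can first start the ZooKeeper bundled with Kafka (a minimal sketch):

./bin/zookeeper-server-start.sh config/zookeeper.properties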

Start Kafka

./bin/kafka-server-start.sh config/server.properties

Create a topic

./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
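
If you want to confirm that the topic was created, you can list the topics (an optional check):

./bin/kafka-topics.sh --list --zookeeper localhost:2181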

2.2 Writing data to Kafka

Write the data to the Kafka topic wikipedia

cd quickstart/tutorial
gunzip -c wikiticker-2015-09-12-sampled.json.gz > wikiticker-2015-09-12-sampled.json

Run the following command in the Kafka directory, replacing {PATH_TO_DRUID} with the path to your Druid directory

export KAFKA_OPTS="-Dfile.encoding=UTF-8"
./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_DRUID}/quickstart/tutorial/wikiticker-2015-09-12-sampled.json
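
To verify that the messages actually landed in the topic, you can read a few of them back with the console consumer (an optional sanity check):

./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wikipedia --from-beginning --max-messages 5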

2.3 Loading Kafka data into Druid

Druid can also load Kafka data in several ways:

  • Data Loader
  • Console
  • CURL

2.3.1 Data Loader

2.3.1.1 Go to localhost:8888 and click Load data

Select Apache Kafka and click Connect data

2.3.1.2 Enter the Kafka bootstrap server localhost:9092
Enter the topic wikipedia, preview the data, and proceed to the next step

2.3.1.3 Parse the data

2.3.1.4 Parse the timestamp, then set up the Transform and Filter steps

2.3.1.5 Confirm the schema and segment granularity; this step is important because it determines the scope of the statistics

2.3.1.6 Publish

2.3.1.7 Wait for the task to complete

2.3.1.8 Go to the Query page to view the data

2.3.2 Console

In the Tasks view, click Submit JSON supervisor to open the dialog.

Paste in the following supervisor spec

{
  "type": "kafka",
  "spec" : {
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": {
        "column": "time",
        "format": "auto"
      },
      "dimensionsSpec": {
        "dimensions": [
          "channel",
          "cityName",
          "comment",
          "countryIsoCode",
          "countryName",
          "isAnonymous",
          "isMinor",
          "isNew",
          "isRobot",
          "isUnpatrolled",
          "metroCode",
          "namespace",
          "page",
          "regionIsoCode",
          "regionName",
          "user",
          { "name": "added", "type": "long" },
          { "name": "deleted", "type": "long" },
          { "name": "delta", "type": "long" }
        ]
      },
      "metricsSpec" : [],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "rollup": false
      }
    },
    "tuningConfig": {
      "type": "kafka",
      "reportParseExceptions": false
    },
    "ioConfig": {
      "topic": "wikipedia",
      "inputFormat": {
        "type": "json"
      },
      "replicas": 2,
      "taskDuration": "PT10M",
      "completionTimeout": "PT20M",
      "consumerProperties": {
        "bootstrap.servers": "localhost:9092"
      }
    }
  }
}

2.3.3 CURL

We can also submit the Kafka supervisor spec by calling the API directly with curl

curl -XPOST -H'Content-Type: application/json' -d @quickstart/tutorial/wikipedia-kafka-supervisor.json http://localhost:8081/druid/indexer/v1/supervisor
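
After submitting, you can confirm the supervisor is running through the same indexer API, for example by listing supervisors and checking the status of the wikipedia supervisor (a minimal sketch, assuming the supervisor ID matches the datasource name):

curl http://localhost:8081/druid/indexer/v1/supervisor
curl http://localhost:8081/druid/indexer/v1/supervisor/wikipedia/status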

