1. Flume overview
Flume is a distributed system for massive log collection, aggregation, and transmission. Its main function is to read data from a server's local disk in real time and write that data to HDFS.
Agent: sends data from a Source to its destination in the form of events. An agent includes a Source, ...
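The agent components above are wired together in a Flume properties file. A minimal sketch of a local-log-to-HDFS agent (the agent name `a1`, the log path, and the HDFS URL are illustrative assumptions):

```properties
# name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# source: tail a local log file in real time (exec source)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log

# channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# sink: write events to HDFS, bucketed by date
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y%m%d
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# bind source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```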
Posted on Thu, 04 Jun 2020 10:59:09 -0700 by jamz310
This tutorial covers two setups
One uses Hive 1.2.1 with Hadoop 2.6.5
The other covers building Hive on Hadoop 3.x
Starting with the first:
1. Local (embedded Derby)
This storage mode requires running a MySQL server locally and configuring ...
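Pointing the Hive metastore at a local MySQL server is done in `hive-site.xml`. A minimal sketch (the database name `hive_metastore` and the `hive`/`hive` credentials are illustrative assumptions):

```xml
<configuration>
  <!-- JDBC URL of the local MySQL database backing the metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>
```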
Posted on Thu, 28 May 2020 09:18:13 -0700 by adam119
Create a SparkSession
# HiveContext needs Spark 2.x; it is deprecated and removed in Spark 3.x,
# where a SparkSession with Hive support replaces it
from pyspark.sql import SparkSession, HiveContext

spark = SparkSession.builder.enableHiveSupport().appName('test_app').getOrCreate()
sc = spark.sparkContext
hc = HiveContext(sc)
1. Spark creates partition table
# You can change append to overwrite, so that if the table already exists, the previous table is deleted and a ...
Posted on Mon, 11 May 2020 01:18:45 -0700 by [xNet]DrDre
Environment: Dell XPS 15 (32 GB RAM, 1 TB SSD), a Samsung 1 TB portable SSD over Thunderbolt 3, and a WD Elements 4 TB external hard drive. On Win10, three CentOS 7 virtual machines are used to test a CDH 6.3.2 cluster (the latest free community version), with self-compiled Phoenix 5.1.0, Flink 1.10.0, elasticsearch6. ...
Posted on Tue, 05 May 2020 05:12:05 -0700 by mwichmann4
Hive's traditional computation engine is MapReduce. Spark SQL, officially released with Spark 1.3, is largely compatible with Apache Hive; backed by Spark's computing power, processing Hive data with Spark SQL is far faster than traditional Hive. Using Spark SQL in IDEA to read the data in H ...
Posted on Mon, 30 Mar 2020 14:23:09 -0700 by bl00dshooter
1. Sub-account creation and AK information binding
If you are logging in to the Data Plus platform and using DataWorks with a sub-account for the first time, you need to confirm the following information:
• the business alias of the primary account to which the sub-account belongs
• the user name and password of the sub-account
• the AccessKey ID ...
Posted on Tue, 10 Mar 2020 22:54:52 -0700 by physaux
What is Sqoop?
Sqoop (pronounced "scoop") is an open-source tool mainly used for transferring data between Hadoop (Hive) and traditional databases (MySQL, PostgreSQL, etc.). It can import data from a relational database (such as MySQL, Oracle, or Postgres) into Hadoop's HDFS, or import data from ...
Posted on Tue, 10 Mar 2020 03:58:20 -0700 by drath
Recently, I have been following up on Flink SQL to prepare for a deeper understanding. This article mainly records the process of running the SQL client source code.
For the Hadoop, Hive, and other related environments involved in this article, see the previous article, The integration of Flink SQL client 1.10 and hive to read real-ti ...
Posted on Tue, 03 Mar 2020 20:40:29 -0800 by cdhogan
1. What is Sqoop
Apache Sqoop (TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Sqoop converts its statements into MapReduce tasks (map-only tasks).
Advantage: data integration across platforms
Posted on Thu, 27 Feb 2020 22:22:51 -0800 by athyzafiris
Data content analysis
`user_log.csv` file content meaning
Content meaning of `train.csv` and `test.csv`
Upload the data to the Linux system and decompress it
Data set preprocessing
File information interception
Import data into Hive
Confirm that the Had ...
Posted on Tue, 25 Feb 2020 23:24:24 -0800 by jara06