Data content analysis
`user_log.csv` file content meaning
Content meaning of "train.csv" and "test.csv"
Upload the data to the Linux system and decompress it
Data set preprocessing
File information interception
Import data into Hive
Confirm that the Had ...
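The outline ends with importing the preprocessed file into Hive. As a rough sketch of that step only — the HDFS path, table name, and the use of Spark instead of the Hive CLI are assumptions for illustration, not taken from the original post:

```python
from pyspark.sql import SparkSession

# Sketch of "Import data into Hive", assuming the preprocessed CSV has
# already been uploaded to HDFS; the path and table name are illustrative.
spark = SparkSession.builder.appName("user-log-import").enableHiveSupport().getOrCreate()

df = spark.read.option("header", "true").csv("/user/hadoop/user_log.csv")
df.write.mode("overwrite").saveAsTable("user_log")   # lands in the Hive warehouse

spark.sql("SELECT COUNT(*) FROM user_log").show()    # quick check that the import worked
```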
Posted on Tue, 25 Feb 2020 23:24:24 -0800 by jara06
from pyspark import SparkContext
from pyspark import SparkConf
The former function in aggregateByKey is the one applied within each partition, and the latter fun ...
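A minimal sketch of those two roles (the data and variable names below are made up for the illustration):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("aggregateByKey-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 2), ("b", 5)], 2)

def seq_func(acc, value):                    # applied within a partition
    return (acc[0] + value, acc[1] + 1)      # running (sum, count)

def comb_func(acc1, acc2):                   # merges results from different partitions
    return (acc1[0] + acc2[0], acc1[1] + acc2[1])

sums_and_counts = pairs.aggregateByKey((0, 0), seq_func, comb_func)
print(sums_and_counts.collect())             # e.g. [('a', (4, 2)), ('b', (7, 2))]
sc.stop()
```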
Posted on Sat, 22 Feb 2020 01:35:39 -0800 by YOUAREtehSCENE
This article introduces how to build the Hive component of a Hadoop big data platform (MySQL must be set up before building Hive).
Software versions used: apache-hive-1.1.0-bin.tar, mysql-connector-java-5.1.47.jar (Baidu cloud extraction code: vk6v)
Extract hive inst ...
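Once the build is finished, one quick way to confirm the MySQL-backed metastore is usable — assuming Spark on the same node is pointed at the resulting hive-site.xml, which the original post does not state — is a short query from PySpark:

```python
from pyspark.sql import SparkSession

# Smoke test after the Hive build (assumes Spark picks up the new
# hive-site.xml and can reach the MySQL-backed metastore).
spark = SparkSession.builder.appName("hive-smoke-test").enableHiveSupport().getOrCreate()
spark.sql("SHOW DATABASES").show()
```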
Posted on Fri, 21 Feb 2020 06:06:37 -0800 by chrisuk
What is Checkpoint?
In the production environment Spark often faces very long chains of RDD Transformations (for example, a job containing 10,000 RDDs), or the computation of the RDD produced by a particular Transformation is especially complex and time-consuming (for example, the calculation often exce ...
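A minimal sketch of how checkpointing is switched on in PySpark (the checkpoint directory and the artificially long map chain are illustrative):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("checkpoint-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)
sc.setCheckpointDir("/tmp/spark-checkpoints")   # in production this would normally be an HDFS path

rdd = sc.parallelize(range(1000))
for _ in range(100):                # stand-in for a very long Transformation chain
    rdd = rdd.map(lambda x: x + 1)

rdd.cache()        # usually cached first so the checkpoint job can reuse the data
rdd.checkpoint()   # truncates the lineage; data is written out on the next action
print(rdd.count())
sc.stop()
```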
Posted on Sat, 15 Feb 2020 21:22:45 -0800 by jf3000
Hadoop cluster installation
Server Hadoop 101
Server Hadoop 102
Server Hadoop 103
Posted on Sat, 15 Feb 2020 07:15:54 -0800 by patrickcurl
Recently I finished a big data project whose framework includes Spring Boot. Because it is offline data analysis, Hive was chosen in the component selection (Spark or HBase may be used for the real-time part). This blog is about how to configure Hive in the Spring ...
Posted on Wed, 12 Feb 2020 10:19:54 -0800 by mrmom
Cluster technology overview:
LB (load balancing cluster): LVS, HAProxy, Nginx, F5 BIG-IP
HA (high availability cluster): Keepalived, RHCS, Pacemaker, Heartbeat
HP (high performance cluster): Hadoop, Spark
Keepalived:
Keepalived is a lightweight high-availability solution for Linux, similar to the ...
Posted on Wed, 12 Feb 2020 03:48:01 -0800 by Schism
Big data - basic environment construction (1)
This article uses three Linux servers as a unified environment.
IP settings for the three machines
Modify the IP address of the three servers
Posted on Sun, 09 Feb 2020 23:35:47 -0800 by drisate
Premise: so-called Linux stream processing means that the output of the previous command is held in a buffer and consumed, like a stream, by the commands after the pipe.
cut's job is to cut; specifically, it is used to cut data out of files. The cut command can cut by bytes, characters, ...
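As a rough illustration of both ideas driven from Python — the file, delimiter, and field are arbitrary choices; the shell equivalent is `cat /etc/passwd | cut -d: -f1`:

```python
from subprocess import Popen, PIPE

# Feed one command's output into cut through a pipe, mirroring
# `cat /etc/passwd | cut -d: -f1` (prints the user-name field).
p1 = Popen(["cat", "/etc/passwd"], stdout=PIPE)
p2 = Popen(["cut", "-d", ":", "-f", "1"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()                      # let cut see EOF when cat finishes
usernames = p2.communicate()[0].decode().splitlines()
print(usernames[:5])
```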
Posted on Thu, 06 Feb 2020 00:37:03 -0800 by litarena
It's a very practical tool.
First, you need to set up password-free SSH communication between the machines.
There are three files
The three files are all in the /home/hadoop/tools directory;
The first column of the deploy.conf configuration file is the host name of the ser ...
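The original scripts are not reproduced here; purely as a hypothetical sketch of what a helper built on such a deploy.conf might do — only the "first column is the host name" detail comes from the post, everything else is assumed:

```python
import subprocess

def run_on_all(command, conf_path="/home/hadoop/tools/deploy.conf"):
    """Run the same command on every host listed in deploy.conf over
    the password-free ssh set up earlier (hypothetical helper)."""
    with open(conf_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            host = line.split()[0]            # first column: host name
            subprocess.run(["ssh", host, command], check=False)

run_on_all("jps")   # e.g. check the Java processes on every node
```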
Posted on Mon, 03 Feb 2020 08:00:43 -0800 by jd023