Taobao Double 11 big data analysis (data preparation)

Article directory: Preface · Data content analysis · Meaning of the `user_log.csv` file contents · Meaning of the `train.csv` and `test.csv` contents · Upload the data to a Linux system and decompress it · Dataset preprocessing · File information extraction · Import the data into Hive · Confirm that the Had ...
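As a taste of the "import the data into Hive" step, here is a minimal PySpark sketch. It is not from the article itself: the file path, database name, and table name are assumptions for illustration, and it presumes a working Hive metastore.

```python
from pyspark.sql import SparkSession

# Hive support requires a configured metastore; names below are assumptions.
spark = (SparkSession.builder
         .appName("double11-load")
         .enableHiveSupport()
         .getOrCreate())

# Read the preprocessed CSV (hypothetical path) with a header row.
df = spark.read.csv("/data/user_log.csv", header=True, inferSchema=True)

# Persist it as a Hive table (hypothetical database and table name).
df.write.mode("overwrite").saveAsTable("dbtaobao.user_log")
```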

Posted on Tue, 25 Feb 2020 23:24:24 -0800 by jara06

The usage of some *ByKey operators in PySpark

Preparation:
import pyspark
from pyspark import SparkContext
from pyspark import SparkConf
conf = SparkConf().setAppName("lg").setMaster('local[4]')
sc = SparkContext.getOrCreate(conf)
1. aggregateByKey: the former function passed to aggregateByKey is applied within each partition, and the latter fun ...
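To make the per-partition vs. cross-partition distinction concrete, here is a minimal runnable sketch building on the preparation code above; the sample pairs and the sum logic are illustrative assumptions, not from the original post.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("lg").setMaster('local[4]')
sc = SparkContext.getOrCreate(conf)

# Two partitions, so both functions actually run.
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3), ("b", 4)], 2)

summed = pairs.aggregateByKey(
    0,                           # zeroValue: initial accumulator per key
    lambda acc, v: acc + v,      # seqFunc: combines values within a partition
    lambda a, b: a + b,          # combFunc: merges accumulators across partitions
)
print(summed.collect())          # e.g. [('a', 3), ('b', 7)]
```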

Posted on Sat, 22 Feb 2020 01:35:39 -0800 by YOUAREtehSCENE

[Hadoop big data platform component building series] - Hive component configuration

Brief introduction: this article describes how to build the Hive component of the Hadoop big data platform (MySQL must be built before building Hive). Software versions used: apache-hive-1.1.0-bin.tar, mysql-connector-java-5.1.47.jar (Baidu Cloud extraction code: vk6v). Install Hive: extract the Hive inst ...
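Once Hive is built, one hedged way to sanity-check the install from Python is PyHive; this is not part of the original walkthrough, and it assumes HiveServer2 is running on its default port.

```python
from pyhive import hive  # pip install 'pyhive[hive]'

# Assumes HiveServer2 is reachable on the default port 10000.
conn = hive.connect(host="localhost", port=10000)
cursor = conn.cursor()
cursor.execute("SHOW DATABASES")
print(cursor.fetchall())
```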

Posted on Fri, 21 Feb 2020 06:06:37 -0800 by chrisuk

Analysis of Spark Checkpoint principle and source code

I. Overview. What is Checkpoint? In production, Spark often faces long chains of RDD Transformations (for example, a Job containing 10,000 RDDs), or the computation of the RDDs produced by a particular Transformation is especially complex and time-consuming (for example, the computation often exce ...
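A minimal PySpark sketch of the workflow the article analyzes, assuming a local checkpoint directory (a production job would normally point at HDFS):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("checkpoint-demo").setMaster("local[4]")
sc = SparkContext.getOrCreate(conf)

# Must be set before checkpoint() is called; assumption: hdfs://... in a real cluster.
sc.setCheckpointDir("/tmp/spark-checkpoints")

rdd = sc.parallelize(range(10))
for _ in range(5):               # stand-in for a long transformation chain
    rdd = rdd.map(lambda x: x + 1)

rdd.cache()                      # cache so the checkpoint job reuses the computed data
rdd.checkpoint()                 # truncates the lineage once an action materializes the RDD
print(rdd.collect())
```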

Posted on Sat, 15 Feb 2020 21:22:45 -0800 by jf3000

Hadoop cluster installation

Cluster planning (three servers: hadoop101, hadoop102, hadoop103).
HDFS: hadoop101 runs the NameNode and a DataNode; hadoop102 runs a DataNode; hadoop103 runs a DataNode and the SecondaryNameNode.
Yarn: NodeManag ...

Posted on Sat, 15 Feb 2020 07:15:54 -0800 by patrickcurl

How to configure Hive in Spring Boot? This blog may help you!

I recently completed a big data project whose framework includes Spring Boot. Because it does offline data analysis, Hive was chosen as a component (Spark or HBase may be used for real-time processing). This blog is about how to configure Hive in the Spring ...

Posted on Wed, 12 Feb 2020 10:19:54 -0800 by mrmom

Detailed explanation and implementation of LVS + Keepalived to build a highly available load balancing cluster

Cluster technology overview:
LB (load balancing cluster): LVS, HAProxy, Nginx, F5 BIG-IP
HA (high availability cluster): Keepalived, RHCS, Pacemaker, Heartbeat
HP (high performance cluster): Hadoop, Spark
Keepalived: Keepalived is a lightweight high-availability solution for Linux. Similar to the ...

Posted on Wed, 12 Feb 2020 03:48:01 -0800 by Schism

Big data infrastructure

Big data basic environment construction. (1) Server setup. This article uses three Linux servers as a unified environment. IP settings for the three machines: modify the IP address of each of the three servers:
vi /etc/sysconfig/network-scripts/ifcfg-ens33
BOOTPROTO="static"
IPADDR=192.168.52.100
NETMASK=255.255 ...

Posted on Sun, 09 Feb 2020 23:35:47 -0800 by drisate

Stream processing tools such as cut, sed, sort, and awk in Linux

Premise: so-called Linux stream processing means that the output of the previous command is held in a buffer and consumed, like a stream, by the commands after it in the pipeline. cut: cut's job is cutting; specifically, it is used to extract data from files. The cut command cuts by bytes, characters, ...
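For comparison only (not from the original post), a rough Python equivalent of extracting the first field of each line, as `cut -d : -f 1 /etc/passwd` would:

```python
# Split each line on ':' and keep the first field, mimicking cut -d : -f 1.
with open("/etc/passwd") as f:
    for line in f:
        print(line.split(":")[0])
```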

Posted on Thu, 06 Feb 2020 00:37:03 -0800 by litarena

Linux -- Shell scripts for distributing files and executing commands across multiple machines

This is a very practical tool. First, you need to set up passwordless SSH between the machines. There are three files: 1. deploy.conf, 2. deploy.sh, 3. runRemoteCmd.sh. Note: all three files are in the /home/hadoop/tools directory; the first column of the deploy.conf configuration file is the host name of the ser ...
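A minimal Python sketch of the idea behind runRemoteCmd.sh, assuming passwordless SSH is already set up and that deploy.conf has the host name in its first column; the remote command here is an illustrative placeholder.

```python
import subprocess

# Read host names from the first column of deploy.conf, skipping blanks and comments.
with open("/home/hadoop/tools/deploy.conf") as f:
    hosts = [line.split()[0] for line in f
             if line.strip() and not line.startswith("#")]

for host in hosts:
    print(f"*** {host} ***")
    # Run one command on each machine over passwordless ssh; 'jps' is a placeholder.
    subprocess.run(["ssh", host, "jps"], check=False)
```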

Posted on Mon, 03 Feb 2020 08:00:43 -0800 by jd023