Hive environment setup and basic usage

Here is a set of CDH versions of Hadoop, Hive, and ZooKeeper, all mutually compatible.

Link: https://pan.baidu.com/s/1wmyMw9RVNMD4NNOg4u4VZg
Extraction code: m888
Set up the Hadoop runtime environment again; the detailed configuration steps are described at https://blog.csdn.net/kxj19980524/article/details/88954645

core-site.xml:

<configuration>
	<property>
		<name>fs.defaultFS</name>
		<value>hdfs://hadoop-senior01.buba.com:8020</value>
	</property>

	<property>
		<name>hadoop.tmp.dir</name>
		<value>/opt/modules/hadoop-2.5.0-cdh5.3.6/data</value>
	</property>
</configuration>

hdfs-site.xml:

	<!-- Specifies the number of replicated copies of each data block -->
	<property>
		<name>dfs.replication</name>
		<value>3</value>
	</property>

	<!-- Disable permission checking -->
	<property>
		<name>dfs.permissions.enabled</name>
		<value>false</value>
	</property>

	<property>
		<name>dfs.namenode.secondary.http-address</name>
		<value>hadoop-senior03.buba.com:50090</value>
	</property>

	<property>
		<name>dfs.namenode.http-address</name>
		<value>hadoop-senior01.buba.com:50070</value>
	</property>

	<property>
		<name>dfs.webhdfs.enabled</name>
		<value>true</value>
	</property>

yarn-site.xml:

	<property>
		<name>yarn.nodemanager.aux-services</name>
		<value>mapreduce_shuffle</value>
	</property>

	<property>
		<name>yarn.resourcemanager.hostname</name>
		<value>hadoop-senior02.buba.com</value>
	</property>

	<!-- Enable the history service (log aggregation) -->
	<property>
		<name>yarn.log-aggregation-enable</name>
		<value>true</value>
	</property>

	<property>
		<name>yarn.log-aggregation.retain-seconds</name>
		<value>86400</value>
	</property>

	<!-- Job history service -->
	<property>
		<name>yarn.log.server.url</name>
		<value>http://hadoop-senior02.buba.com:19888/jobhistory/logs/</value>
	</property>

Enabling the history service in the configuration above lets you view MapReduce jobs after they have finished running.

 

mapred-site.xml:

	<property>
		<name>mapreduce.framework.name</name>
		<value>yarn</value>
	</property>

	<!-- These two hosts must match the history-service host configured above -->
	<property>
		<name>mapreduce.jobhistory.address</name>
		<value>hadoop-senior02.buba.com:10020</value>
	</property>

	<property>
		<name>mapreduce.jobhistory.webapp.address</name>
		<value>hadoop-senior02.buba.com:19888</value>
	</property>

Distribute the configuration to the other nodes, then initialize.
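
A minimal sketch of the distribution, assuming passwordless ssh and the same install path on every node (hosts are the ones used throughout this cluster):

	scp -r /opt/modules/hadoop-2.5.0-cdh5.3.6/etc/hadoop/ kxj@hadoop-senior02.buba.com:/opt/modules/hadoop-2.5.0-cdh5.3.6/etc/
	scp -r /opt/modules/hadoop-2.5.0-cdh5.3.6/etc/hadoop/ kxj@hadoop-senior03.buba.com:/opt/modules/hadoop-2.5.0-cdh5.3.6/etc/

The initialization (formatting the NameNode):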

bin/hdfs namenode -format

After formatting, write two scripts: one starts every daemon in the cluster, the other stops them all.

#!/bin/bash
echo "---------------Cluster Services Opening------------"
echo "---------------Opening NameNode node------------"
ssh kxj@hadoop-senior01.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/hadoop-daemon.sh start namenode'

echo "---------------Opening SecondaryNamenode node------------"

ssh kxj@hadoop-senior03.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/hadoop-daemon.sh start secondarynamenode'


echo "---------------Opening DataNode node------------"

for i in kxj@hadoop-senior01.buba.com kxj@hadoop-senior02.buba.com kxj@hadoop-senior03.buba.com
do 
        ssh $i '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/hadoop-daemon.sh start datanode'
done


echo "---------------Opening ResourceManager node------------"

ssh kxj@hadoop-senior02.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/yarn-daemon.sh start resourcemanager'

echo "---------------Opening NodeManager node------------"
for i in kxj@hadoop-senior01.buba.com kxj@hadoop-senior02.buba.com kxj@hadoop-senior03.buba.com
do
         ssh $i '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/yarn-daemon.sh start nodemanager'
done

echo "---------------Opening JobHistoryServer node------------"
ssh kxj@hadoop-senior02.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/mr-jobhistory-daemon.sh start historyserver'

The stop script:

#!/bin/bash
echo "---------------Closing Cluster Services------------"
echo "---------------Shutting down JobHistoryServer node------------"
ssh kxj@hadoop-senior02.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/mr-jobhistory-daemon.sh stop historyserver'

echo "---------------Shutting down ResourceManager node------------"

ssh kxj@hadoop-senior02.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/yarn-daemon.sh stop resourcemanager'

echo "---------------Shutting down NodeManager node------------"
for i in kxj@hadoop-senior01.buba.com kxj@hadoop-senior02.buba.com kxj@hadoop-senior03.buba.com
do
         ssh $i '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/yarn-daemon.sh stop nodemanager'
done

echo "---------------Shutting down NameNode node------------"
ssh kxj@hadoop-senior01.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/hadoop-daemon.sh stop namenode'

echo "---------------Shutting down SecondaryNamenode node------------"

ssh kxj@hadoop-senior03.buba.com '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/hadoop-daemon.sh stop secondarynamenode'

echo "---------------Shutting down DataNode node------------"

for i in kxj@hadoop-senior01.buba.com kxj@hadoop-senior02.buba.com kxj@hadoop-senior03.buba.com
do 
        ssh $i '/opt/modules/hadoop-2.5.0-cdh5.3.6/sbin/hadoop-daemon.sh stop datanode'
done

After writing them, make both scripts executable.
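
For example, assuming the scripts are saved as start-cluster.sh and stop-cluster.sh (hypothetical names):

	chmod +x start-cluster.sh stop-cluster.sh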

The single quotes after ssh in the commands above mean that the quoted command is executed on the remote node once ssh connects. Backticks are command substitution: an expression in backticks is replaced by the output of the command inside it, and redirecting that output with > kxj.txt writes the result into kxj.txt.
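
A small illustration of both points (hypothetical command):

	ssh kxj@hadoop-senior02.buba.com 'echo `hostname` > kxj.txt'

Because of the single quotes, the whole command runs on the remote node: the backticks substitute in the remote hostname, and the redirection writes it into kxj.txt there.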

The JobHistoryServer here is the history service mentioned in the configuration above.

The scripts may not run correctly right after being written. This involves a knowledge point: shells with a login and shells without one.

With a login shell: broadly speaking, when you log in to a Linux system manually, for example through CRT, you get a login shell.
Without a login shell: when you run a command on a system over ssh, as these scripts do, no login shell is started.

Without a login shell, the system-wide environment variables cannot be loaded; only the user's own variables are loaded. The system variables live in /etc/profile, and the user variables live in files in the user's home directory.

User variables

cat /etc/profile >> ~/.bashrc    # append the system variables to the user environment variables

(Note the append operator >>; a single > would overwrite ~/.bashrc.)

Do the same on the other two nodes; after that, you can test the scripts.

 

Hive introduction

Characteristics of Hive
1. The interface uses SQL syntax; HQL is very similar to SQL.
2. It avoids the tedious process of writing MapReduce by hand: Hive converts the SQL statement into a MapReduce program, packages it into a jar, and runs it automatically.

Hive architecture
1. Client, of which there are two kinds:
* terminal command line
* JDBC, not used as often since it is more troublesome than the former
2. Metastore
* the one-to-one mapping between the raw data set and the field names and data information
* here we store it in MySQL
3. Server, that is, Hadoop
* while operating Hive, HDFS must be running, YARN must be on, and MAPRED must be configured

Databases:
MySQL, Oracle, SQL Server, DB2, SQLite (a small database used on phones), MDB
Data warehouses:
Hive. A data warehouse is a client of MapReduce, which means Hive does not need to be installed and deployed on every machine.

Suppose there is a txt file in a fixed format, so that each field in each row corresponds to a column of a table. Writing a MapReduce program for this is troublesome, but with Hive you can create a table, run a short SQL statement, and get the desired output.

After Hive is installed, a metastore database is created in MySQL; it stores the mapping of fields between the text files and the tables.

Installation steps

After unpacking Hive, go to the conf directory and rename the template configuration files:

mv hive-default.xml.template hive-site.xml

mv hive-env.sh.template hive-env.sh

In hive-env.sh, set:

JAVA_HOME=/opt/modules/jdk1.7.0_67
HADOOP_HOME=/opt/modules/hadoop-2.5.0-cdh5.3.6
export HIVE_CONF_DIR=/opt/modules/hive-0.13.1-cdh5.3.6/conf

Install MySQL

	$ su - root
	# yum -y install mysql mysql-server mysql-devel
	# wget http://dev.mysql.com/get/mysql-community-release-el7-5.noarch.rpm
	# rpm -ivh mysql-community-release-el7-5.noarch.rpm
	# yum -y install mysql-community-server

Start the MySQL service: systemctl start mysqld.service

Set the MySQL root password: mysqladmin -uroot password '123456'

Here are two ways to log in to MySQL
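
For example (both forms assume the password set above):

	mysql -uroot -p123456
	mysql -uroot -p        # then enter the password at the prompt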

If other nodes need to access this MySQL instance, modify the user privileges and grant the corresponding rights:

grant all on *.* to root@'hadoop-senior01.buba.com' identified by '123456';

flush privileges;    -- reload the privilege settings

grant: grant privileges
all: all privileges
*.*: database name.table name (here, every table in every database)
root: the MySQL user being granted access
@'...': the host name allowed to connect
identified by '123456': the password

In hive-site.xml, configure the metastore connection:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://hadoop-senior01.buba.com:3306/metastore?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
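
The metastore connection also needs the database credentials; a sketch assuming the root account and password configured above:

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>password to use against metastore database</description>
</property>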

Note: around line 2785 of hive-site.xml a <property> tag is missing (this template differs from the one on the official website); remember to add it, or you will get an error.

Rename the log configuration file.
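
This most likely refers to the log4j template in the conf directory; a sketch of the rename:

	mv hive-log4j.properties.template hive-log4j.properties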

MySQL driver package

Put the MySQL JDBC driver jar into Hive's lib directory.

Start the Hadoop cluster before operating Hive; otherwise Hive will not work.

After Hive is started, a metastore database appears in MySQL. In fact, you are not tied to MySQL; any other database will do, as long as the corresponding driver and connection are set in the configuration file.

Hive stores its data in a directory it creates on HDFS (by default /user/hive/warehouse); this location can also be changed in the configuration file.

Two further settings turn on some helpful hints in the Hive CLI.
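
A sketch of the two properties usually meant by this, added to hive-site.xml (treat the exact choice as an assumption): hive.cli.print.header prints column headers in query results, and hive.cli.print.current.db shows the current database in the prompt.

<property>
  <name>hive.cli.print.header</name>
  <value>true</value>
</property>

<property>
  <name>hive.cli.print.current.db</name>
  <value>true</value>
</property>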

Create a table and query it. Note that select * does not launch a MapReduce job; the results come back directly. Also note that Hive uses the string type rather than varchar.

The trailing '\t' means that when mapping each line to fields, Hive splits the line on tab characters. If your text uses commas or some other delimiter instead, Hive will not recognize the fields unless you declare that delimiter.

create table t1(eid int, name string, sex string) row format delimited fields terminated by '\t';

desc formatted t1;    -- view the table's details

Data cleaning: for example, if the company's data is separated by '*' while the table was created with '\t' as the delimiter, you need to write a MapReduce program (or use other tooling) to convert the separators to tabs before importing the data into the table.
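
A minimal sketch of such a conversion without MapReduce, assuming GNU sed and a hypothetical file data.txt delimited by '*':

	sed -i 's/\*/\t/g' data.txt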

load data local inpath '<file path>' into table <table name>; if the data is already on HDFS, remove the local keyword.
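
For example, with the t1 table created above and a hypothetical local file:

	load data local inpath '/opt/datas/t1.txt' into table t1;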

If the executed query has a condition, you can clearly see that it goes through a MapReduce job.
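
For instance, a conditional query on the table above (the column values are assumptions):

	select name from t1 where eid > 1;

launches a MapReduce job, while a plain select * from t1; does not.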

Click History on a finished application in the YARN web UI to see the MapReduce jobs that were executed earlier.

Converting even a simple query into a MapReduce program on every execution is slow. Modifying the following configuration lets simple statements skip MapReduce.
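
The setting usually meant for this is hive.fetch.task.conversion in hive-site.xml (an assumption, since the exact configuration is not shown here):

<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
  <description>run simple queries (select, filter, limit) as a direct fetch instead of MapReduce</description>
</property>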

The field mapping (the one-to-one relationship described earlier) is recorded in the tables of MySQL's metastore database.
