Dalian Hotel Data Analysis

This project comes from Miss_candy, a Phase 6 student of the Laboratory Building course "Building+ Data Analysis and Mining". "Building+ Data Analysis and Mining in Practice" is a course designed by Laboratory Building for junior data analysis and data mining engineer positions. It consists of 35 experiments, 20 challenges, 5 integrated projects and 1 large project, and aims to get you started with data analysis and mining in six weeks.

Data acquisition

The data were obtained on August 27, 2019 and cover hotel prices for the night of August 28 to 29. Hotel prices fluctuate with the peak season. At the time of collection Dalian was between seasons, so prices were moving toward reasonable levels but were still higher than normal.

import pandas as pd
import jieba
from tqdm import tqdm_notebook
from wordcloud import WordCloud
import numpy as np
from gensim.models import Word2Vec
import warnings

warnings.filterwarnings('ignore')
df = pd.read_csv('https://s3.huhuhang.com/temporary/b1vzDs.csv')
df.shape

Output:

(2475, 7)

Data cleaning

# The scraped data contains duplicates. Rows that share a hotel name are dropped, keeping the first occurrence.
df = df.drop_duplicates(['HotelName'])
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2219 entries, 0 to 2474
Data columns (total 7 columns):
Unnamed: 0            2219 non-null int64
index                 2219 non-null int64
HotelName             2219 non-null object
HotelLocation         2219 non-null object
HotelCommentValue     2219 non-null float64
HotelCommentAmount    2219 non-null int64
HotelPrice            2219 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 138.7+ KB
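
As a small illustration of the deduplication step, the sketch below runs `drop_duplicates` on a made-up three-row frame (the hotel names and prices are invented for the example). By default pandas keeps the first row for each repeated name, and the surviving rows keep their original index labels, which is why `df.info()` above reports 2219 entries with an index that still runs from 0 to 2474.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "HotelName": ["A Inn", "B Hotel", "A Inn"],
    "HotelPrice": [120.0, 300.0, 135.0],
})

# Same call as in the article: rows sharing a HotelName are collapsed,
# keeping the first occurrence (keep="first" is the pandas default)
deduped = toy.drop_duplicates(["HotelName"])

print(deduped.shape)        # rows and columns after deduplication
print(list(deduped.index))  # original index labels are preserved
```

If the cheapest or the most recent record should win instead, the frame would need to be sorted accordingly before calling `drop_duplicates`.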

After the duplicates are removed, 2219 hotel records remain. Five of the seven columns carry useful information:

  • "HotelName": hotel name
  • "HotelLocation": district where the hotel is located
  • "HotelCommentValue": hotel rating
  • "HotelCommentAmount": number of reviews the hotel has received
  • "HotelPrice": lowest price of the hotel

Some hotels have no rating because they have only just opened (or for other reasons); during data acquisition a missing rating was recorded as 0. We split these records off as the new_hotel data set for later analysis and prediction.

df_new_hotel = df[df["HotelCommentValue"]==0].drop(['Unnamed: 0'], axis=1).set_index(['index'])
df_new_hotel.head()

Output:

Hotels that already have ratings are likewise separated from the original data set for analysis and modeling.

df_in_ana = df[df["HotelCommentValue"]!=0].drop(["Unnamed: 0", "index"], axis=1)
df_in_ana.shape

Output:

(1669, 5)

Data analysis

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline

plt.rcParams['font.sans-serif'] = ['SimHei']  # Used for normal display of Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # Used for normal negative sign display
sns.distplot(df_in_ana['HotelPrice'].values)

Output:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7353b9c240>

The price distribution shows that most hotels cost less than 500 yuan per night, with the densest concentration at 200-300 yuan per night, and that relatively few cost more than 500 yuan per night. Based on this distribution and on actual price levels, we divide the hotels into the following grades:

  • "Cheap" hotels: 100 yuan per night or less
  • "Economy" hotels: 100-300 yuan per night
  • "Comfortable" hotels: 300-500 yuan per night
  • "High-end" hotels: 500-1000 yuan per night
  • "Luxury" hotels: more than 1000 yuan per night
df_in_ana['HotelLabel'] = df_in_ana["HotelPrice"].apply(lambda x: 'Luxurious' if x > 1000 else \
                                                        ('High-end' if x > 500 else \
                                                        ('Comfortable' if x > 300 else \
                                                        ('Economics' if x > 100 else 'cheap'))))
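
The nested lambda above can also be expressed with `pd.cut`. The following is a minimal sketch on invented prices, not the author's code. The bin edges are chosen so that every interval is closed on the right, which matches the `x > threshold` comparisons in the lambda (a price of exactly 100 is still 'cheap').

```python
import numpy as np
import pandas as pd

# Invented prices for illustration
prices = pd.Series([80, 100, 101, 300, 450, 800, 1500], name="HotelPrice")

# Right-closed bins: (-inf, 100], (100, 300], (300, 500], (500, 1000], (1000, inf)
bins = [-np.inf, 100, 300, 500, 1000, np.inf]
labels = ["cheap", "Economics", "Comfortable", "High-end", "Luxurious"]

hotel_label = pd.cut(prices, bins=bins, labels=labels, right=True)
print(hotel_label.tolist())
```

One practical difference: `pd.cut` returns an ordered categorical, so later grouping and sorting follow the price order of the grades rather than alphabetical order.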

After the division, let's get a general idea of the proportion of different types of hotels:

hotel_label = df_in_ana.groupby('HotelLabel')['HotelName'].count()
plt.pie(hotel_label.values, labels=hotel_label.index, autopct='%.1f%%', explode=[0, 0.1, 0.1, 0.1, 0.1], shadow=True)

Output:

([<matplotlib.patches.Wedge at 0x7f735196bf28>,
  <matplotlib.patches.Wedge at 0x7f7351974978>,
  <matplotlib.patches.Wedge at 0x7f735197d358>,
  <matplotlib.patches.Wedge at 0x7f735197dcf8>,
  <matplotlib.patches.Wedge at 0x7f73519096d8>],
 [Text(1.0995615668223722, 0.0310541586125, 'Luxurious'),
  Text(0.8817809341165916, 0.813917922292212, 'cheap'),
  Text(-1.1653378183544278, -0.28633506441395257, 'Economics'),
  Text(0.9862461234793722, -0.6836070391108557, 'Comfortable'),
  Text(1.1898928807304072, -0.15541857156431768, 'High-end')],
 [Text(0.5997608546303848, 0.016938631970454542, '0.9%'),
  Text(0.5143722115680117, 0.47478545467045696, '21.9%'),
  Text(-0.679780394040083, -0.16702878757480563, '62.0%'),
  Text(0.5753102386963004, -0.3987707728146658, '11.0%'),
  Text(0.6941041804260709, -0.09066083341251863, '4.1%')])

The pie chart shows that economy hotels make up 62.0% of the total and cheap hotels 21.9%, while high-end (4.1%) and luxury (0.9%) hotels account for a small share, which fits the general positioning of a tourist city.

Next, let's look at the geographical distribution of the hotels.

from pyecharts import Map
map_hotel = Map("Regional Distribution Map of Dalian Hotel", width=1000, height=600)

hotel_distribution = df_in_ana.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)
hotel_distribution = hotel_distribution[:8]

h_values = list(hotel_distribution.values)
district = list(hotel_distribution.index)

map_hotel.add("", district, h_values, maptype='Dalian', is_visualmap=True,
                         visual_range=([min(h_values), max(h_values)]), 
                         visual_text_color="#fff", symbol_size=20, is_label_show=True)
map_hotel.render('dalian_hotel.html')

Because some hotels' location information was not entered in a standard form on the website, the scraped locations are inconsistent. These irregular entries are hard to consolidate, and since they account for a small share of the data they fall near the bottom after sorting, so we keep only the top eight areas. Among the collected hotels, most are located in Shahekou District and Jinzhou District, which is directly related to the distribution of Dalian's major attractions, such as the famous Xinghai Square and the Cross-Sea Bridge in Shahekou District, and Jinshitan Beach and Discovery Kingdom in Jinzhou District. (The high-tech park has no corresponding region on the map because it is not an administrative district but a technology development zone at the junction of Ganjingzi and Shahekou. Its share does not affect the figures for Shahekou and Ganjingzi and does not hinder the analysis.)

Regional Distribution Map of Dalian Hotel

Dalian is a tourist city, and the positioning and grade of hotels are likely to differ from one administrative district to another. It is therefore interesting to see how hotels of different grades are distributed across districts.

hotel_distribution = df_in_ana.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)
hotel_distribution = hotel_distribution[:8]
hotel_label_distr = df_in_ana.groupby([ 'HotelLocation','HotelLabel'])['HotelName'].count().sort_values(ascending=False).reset_index()
in_use_district = list(hotel_distribution.index)
hotel_label_distr = hotel_label_distr[hotel_label_distr['HotelLocation'].isin(in_use_district)]

fig, axes = plt.subplots(1, 5, figsize=(17,8))
hotel_label_list = ['High-end', 'Comfortable', 'Economics', 'Luxurious', 'cheap']
for i in range(len(hotel_label_list)):
    current_df = hotel_label_distr[hotel_label_distr['HotelLabel']==hotel_label_list[i]]
    axes[i].set_title('Regional Distribution of {} Hotels'.format(hotel_label_list[i]))
    axes[i].pie(current_df.HotelName, labels=current_df.HotelLocation, autopct='%.1f%%', shadow=True)

The distribution of each grade across districts shows that hotels of every grade are concentrated mainly in Shahekou District, Jinzhou District and Zhongshan District. Interestingly, there are no luxury hotels in Lushunkou District. Luxury hotels are concentrated not only in Shahekou District but also, to a large extent, in Zhongshan District, for historical and geographical reasons. Locals often describe Zhongshan District as Dalian's legendary "wealthy area", and many business travelers choose to stay there, which has encouraged investment in high-end and luxury hotels in the district.

Besides price (grade), we also consider reviews when choosing a hotel for a trip: the higher the rating and the more reviews a hotel has, the more inclined we are to book it. So, for the rated data set, let's look at how these Dalian hotels fare.

First, based on the ratings and on how consumers generally perceive them, the hotels are labeled as follows:

  • Above 4.6 points: "Super stick" (excellent)
  • 4.0-4.6 points: "Not bad"
  • 3.0-4.0 points: "So-so" (average)
  • Below 3.0 points: "Negative comment" (bad reviews)
df_in_ana['HotelCommentLevel'] = df_in_ana["HotelCommentValue"].apply(lambda x: 'Super stick' if x > 4.6 \
                                                                      else ('Not bad' if x > 4.0 \
                                                                      else ('So-so' if x > 3.0 else 'Negative comment' )))

Grouping by rating level and hotel grade, we visualize the data.

hotel_label_level = df_in_ana.groupby(['HotelCommentLevel','HotelLabel'])['HotelName'].count().sort_values(ascending=False).reset_index()
fig, axes = plt.subplots(1, 5, figsize=(17,8))
for i in range(len(hotel_label_list)):
    current_df = hotel_label_level[hotel_label_level['HotelLabel'] == hotel_label_list[i]]
    axes[i].set_title('Ratings of {} Hotels'.format(hotel_label_list[i]))
    axes[i].pie(current_df.HotelName, labels=current_df.HotelCommentLevel, autopct='%.1f%%', shadow=True)

The rating distribution across hotel grades shows that bad reviews appear mainly among cheap and economy hotels, with cheap hotels hit hardest. Comfortable, high-end and luxury hotels, whose lowest price exceeds 300 yuan per night, have almost no bad reviews, which supports the common belief that "you get what you pay for"; luxury hotels have the highest share of top ratings ("Super stick", i.e. excellent). However, the share of top ratings does not rise steadily with hotel grade: for high-end hotels it is lower than for the less expensive comfortable hotels. A possible reason is that the service guests expect at that price exceeds the service the hotels actually provide. This reminds consumers not to assume blindly that expensive means excellent, and it reminds hotels to price according to what they can actually deliver, since an excessive price is inadvisable.
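
The per-grade pie charts amount to computing, for each price grade, the share of each rating level. As a rough sketch on invented rows (the column names follow the article, the data do not), the same table can be produced in one step with `pd.crosstab`:

```python
import pandas as pd

# Invented rows for illustration; column names follow the article
toy = pd.DataFrame({
    "HotelLabel": ["cheap", "cheap", "cheap", "cheap", "Luxurious", "Luxurious"],
    "HotelCommentLevel": ["Super stick", "Not bad", "Negative comment",
                          "Negative comment", "Super stick", "Super stick"],
})

# Rows are price grades, columns are rating levels;
# normalize="index" makes every row sum to 1 (share within each grade)
share = pd.crosstab(toy["HotelLabel"], toy["HotelCommentLevel"], normalize="index")
print((share * 100).round(1))
```

A table like this makes it easier to compare one rating level across grades than reading the same slice off five separate pies.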

Hotel list

Based on the analysis so far, we can compile a "grass planting list" (recommended hotels) and a "minefield list" (hotels to avoid).

The "grass planting list" collects, for each hotel grade, hotels with good ratings, a large number of reviews (vetted by many guests) and reasonable prices, so that friends with different travel needs can choose among them. The "minefield list" collects badly rated hotels, as a reminder not to take chances on trial and error.
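
The selection rule described above can be sketched as a small helper. This is only an illustration: the function name `grass_list` and its parameters are hypothetical, the rows are invented, and the article's own code applies the filters inline with a different review-count threshold for each grade.

```python
import pandas as pd

def grass_list(df, label, min_reviews, min_score=4.6):
    """Hypothetical helper: well-rated, much-reviewed hotels of one grade, cheapest first."""
    mask = (
        (df["HotelLabel"] == label)
        & (df["HotelCommentValue"] > min_score)
        & (df["HotelCommentAmount"] > min_reviews)
    )
    return df[mask].sort_values(by="HotelPrice")

# Invented rows for illustration
toy = pd.DataFrame({
    "HotelName": ["A", "B", "C", "D"],
    "HotelLabel": ["cheap", "cheap", "cheap", "Economics"],
    "HotelCommentValue": [4.8, 4.7, 4.2, 4.9],
    "HotelCommentAmount": [900, 600, 5000, 3000],
    "HotelPrice": [95.0, 60.0, 50.0, 180.0],
})

picked = grass_list(toy, "cheap", min_reviews=500)
print(picked["HotelName"].tolist())
```

Wrapping the conditions in parentheses and combining them with `&` avoids the line-continuation backslashes and keeps the three criteria easy to adjust per grade.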

Grass planting list

# Cheap hotel
df_pos_cheap = df_in_ana[(df_in_ana['HotelLabel']=='cheap') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 500)].sort_values(by=['HotelPrice'], ascending=False)
df_pos_cheap

Output:

# Economy Hotel
df_pos_economy = df_in_ana[(df_in_ana['HotelLabel']=='Economics') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 2000)].sort_values(by=['HotelPrice'])
df_pos_economy

Output:

# Comfortable Hotel
df_pos_comfortable = df_in_ana[(df_in_ana['HotelLabel']=='Comfortable') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 1000)].sort_values(by=['HotelPrice'])
df_pos_comfortable

Output:

# High-end hotel
df_pos_hs = df_in_ana[(df_in_ana['HotelLabel']=='High-end') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 1000)].sort_values(by=['HotelPrice'])
df_pos_hs

Output:

# Luxury Hotel
df_pos_luxury = df_in_ana[(df_in_ana['HotelLabel']=='Luxurious') \
                         & (df_in_ana['HotelCommentValue']> 4.6) \
                         & (df_in_ana['HotelCommentAmount']> 500)].sort_values(by=['HotelPrice'])
df_pos_luxury

Output:

Minefield list

df_neg = df_in_ana[(df_in_ana['HotelCommentValue'] < 3.0) \
                         & (df_in_ana['HotelCommentAmount'] > 50)].sort_values(by=['HotelPrice'], ascending=False)
df_neg

Output:

Science of Hotel Names

Hotels at the extremes tend to be named in characteristic ways. Expensive high-end hotels usually adopt an elegant, business-like style, and their names sound "expensive". Cheaper hotels, which rely on low prices to attract students and guests on a tight budget, tend to have names that are either light and fresh or plain and blunt, and they sound "cost-effective". We use word clouds to check whether this theory holds for hotels in the Dalian area.

# Shell commands; the leading "!" runs them from a notebook cell
!wget -nc "http://labfile.oss.aliyuncs.com/courses/1176/fonts.zip"
!unzip -o fonts.zip
from wordcloud import WordCloud

def get_word_map(hotel_name_list):
    word_dict ={}
    for hotel_name in tqdm_notebook(hotel_name_list):
        hotel_name = hotel_name.replace('(', '')
        hotel_name = hotel_name.replace(')', '')
        word_list = list(jieba.cut(hotel_name, cut_all=False))
        for word in word_list:
            if word == 'Dalian' or len(word) < 2:
                continue
            if word not in word_dict:
                word_dict[word] = 0
            word_dict[word] += 1

    font_path = 'fonts/SourceHanSerifK-Light.otf'
    wc = WordCloud(font_path=font_path, background_color='white', max_words=1000, 
                            max_font_size=120, random_state=42, width=800, height=600, margin=2)
    wc.generate_from_frequencies(word_dict)

    return wc
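
The counting loop inside `get_word_map` builds the frequency dictionary by hand. As a rough sketch, the same step can be written with `collections.Counter`. The helper name `count_words` is hypothetical, the hotel names are invented, and `str.split` stands in for `jieba.lcut` so that the example runs without extra dependencies.

```python
from collections import Counter

def count_words(names, tokenize, stop_words=("Dalian",), min_len=2):
    """Hypothetical helper: word frequencies over a list of hotel names."""
    counts = Counter()
    for name in names:
        # Remove the parentheses that appear in some names
        cleaned = name.replace("(", "").replace(")", "")
        counts.update(
            w for w in tokenize(cleaned)
            if w not in stop_words and len(w) >= min_len
        )
    return counts

# Invented names; str.split stands in for jieba.lcut
names = ["Dalian Star Sea Apartment", "Star Sea View Hotel (Dalian)"]
freq = count_words(names, tokenize=str.split)
print(freq.most_common(3))
```

Since `Counter` is a `dict` subclass, a result like `freq` can be passed wherever a plain word-to-count dictionary is expected.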

To have enough data to render the word clouds, we do not use the original grade boundaries here. Instead, hotels priced at 150 yuan or less and hotels priced above 500 yuan are chosen as two relatively extreme groups, to see whether their names differ in any typical way.

part1 = df_in_ana[df_in_ana['HotelPrice'] <= 150]['HotelName'].values
part2 = df_in_ana[df_in_ana['HotelPrice'] > 500]['HotelName'].values
fig, axes = plt.subplots(1, 2, figsize=(15, 8))
axes[0].set_title('The Name Cloud of Lower Price Hotels')
axes[0].imshow(get_word_map(part1), interpolation='bilinear')
axes[1].set_title('The Name Cloud of Higher Price Hotels')
axes[1].imshow(get_word_map(part2), interpolation='bilinear')

Output:

<matplotlib.image.AxesImage at 0x7f73515c1908>

The results show clear differences between the two groups. Among low-priced hotels, words such as "guest house", "theme", "youth", "express hotel" and "hotel" occur frequently, which fits our understanding of how such hotels are positioned. Among high-priced hotels, "Star Sea", "Sea View", "Hot Springs" and "Square" occur frequently. One reason is that a well-known Dalian landmark is Xinghai ("Star Sea") Square in Shahekou District, and nearby hotels (especially high-end ones) like to include "Star Sea" in their names; besides indicating the location, the word also seems to add some style. In addition, high-end hotels do not seem to like the "xx hotel" style of name, preferring to call themselves "hotel" or "hotel apartment". More surprisingly, both cheaper and more expensive hotels like the word "apartment", which seems to be a trend in the hotel industry.

Judging a Hotel by Its Name

A name is the symbol of a person or thing, and the first impression it creates matters. We have just analyzed how the names of hotels at the two price extremes differ, so to some extent we can judge a hotel's grade from its name. As the saying goes, "you can see the adult in the three-year-old child." For small hotels that have just opened and have no rating yet, we can combine this with the earlier analysis of ratings across hotel grades to get a rough idea of whether a hotel is overpriced, or whether it is worth being a guinea pig once and trying it out. (There is, however, another issue. Because of changes in environment and era, new hotels follow different naming strategies from older ones, and this difference has a significant impact on modeling and prediction. So here we simply use the methods we have learned to run an interesting experiment; the results will not be accurate, but the process is fun.)

df_in_ana['HotelPrice'].median()

Output:

156.0

Based on the word cloud analysis and the median price of the rated hotels, we set 150 yuan as the threshold. Hotels priced at 150 yuan per night or less are labeled 1, and more expensive hotels are labeled 0. This split keeps the two classes roughly balanced and also reflects, to some extent, the difference in hotel names.

df_in_ana['PriceLabel'] = df_in_ana['HotelPrice'].apply(lambda x:1 if x <= 150 else 0)
df_new_hotel['PriceLabel'] = df_new_hotel['HotelPrice'].apply(lambda x:1 if x <= 150 else 0)
# Word segmentation function
def word_cut(x):
    x = x.replace('(', '')  # Remove the parentheses that appear in some names.
    x = x.replace(')', '')
    return jieba.lcut(x)
# Build the training and test sets
x_train = df_in_ana['HotelName'].apply(word_cut).values
y_train = df_in_ana['PriceLabel'].values
x_test = df_new_hotel['HotelName'].apply(word_cut).values
y_test = df_new_hotel['PriceLabel'].values

The training set contains 1669 records, 790 of which are labeled 1; the test set contains 550 records, 195 of which are labeled 1.
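
With 790 of the 1669 training records labeled 1, the positive class makes up about 47.3% of the training set. A quick way to check such a split is sketched below on invented prices; the labeling rule is the one used in the article, but the helper variables are only for illustration.

```python
import pandas as pd

# Invented prices for illustration
prices = pd.Series([60, 120, 150, 151, 300, 900], name="HotelPrice")

# Same rule as the article: 1 if the price is at most 150 yuan, else 0
price_label = (prices <= 150).astype(int)

# Share of each class; values near 0.5 indicate a balanced split
balance = price_label.value_counts(normalize=True).sort_index()
print(balance)
```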

# Train a Word2Vec model (a shallow neural network) on the segmented hotel names
# to obtain a 200-dimensional vector for each word.
from gensim.models import Word2Vec
import warnings

warnings.filterwarnings('ignore')
# Note: gensim 4.x renamed the `size` argument to `vector_size`.
w2v_model = Word2Vec(size=200, min_count=10)
w2v_model.build_vocab(x_train)
w2v_model.train(x_train, total_examples=w2v_model.corpus_count, epochs=5)

def sum_vec(text):
    # Sum the vectors of all in-vocabulary words of one segmented hotel name
    vec = np.zeros(200).reshape((1, 200))
    for word in text:
        try:
            # Look words up through .wv; indexing the model directly is deprecated
            vec += w2v_model.wv[word].reshape((1, 200))
        except KeyError:
            # Words seen fewer than min_count times are not in the vocabulary
            continue
    return vec

train_vec = np.concatenate([sum_vec(text) for text in tqdm_notebook(x_train)])
# Build a neural network classifier and train it on the training data.
from sklearn.neural_network import MLPClassifier
from IPython import display

model = MLPClassifier(hidden_layer_sizes=(100, 50, 20), learning_rate='adaptive')
model.fit(train_vec, y_train)

# Plot the loss curve to monitor how the loss changes during training
display.clear_output(wait=True)
plt.plot(model.loss_curve_)

Output:

[<matplotlib.lines.Line2D at 0x7f73400b8198>]

Note: because the data set is small and hotel names carry limited information, the training results here are not very good.
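
The feature built for each hotel name is the sum of its word vectors, with out-of-vocabulary words skipped. The sketch below illustrates that idea with a plain dictionary standing in for the trained Word2Vec model; the vectors are invented and three-dimensional rather than 200-dimensional. The `average` option is a common variant added here for comparison, not something the article uses.

```python
import numpy as np

# Stand-in for the trained Word2Vec vocabulary: word -> vector (invented values)
toy_vectors = {
    "star": np.array([1.0, 0.0, 2.0]),
    "sea": np.array([0.0, 1.0, 1.0]),
}
DIM = 3

def sum_vec_sketch(words, vectors=toy_vectors, dim=DIM, average=False):
    """Sum (or average) the vectors of known words; unknown words are skipped."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return np.zeros((1, dim))
    total = np.sum(known, axis=0)
    if average:
        total = total / len(known)
    return total.reshape(1, dim)

# "villa" is out of vocabulary and contributes nothing
summed = sum_vec_sketch(["star", "sea", "villa"])
averaged = sum_vec_sketch(["star", "sea", "villa"], average=True)
print(summed, averaged)
```

With a plain sum, longer names tend to produce larger vectors; averaging removes that length effect, at the cost of discarding how many known words a name contains. A name made up entirely of unknown words maps to the zero vector in both cases.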

# Sum the word vectors for each name in the test set.
test_vec = np.concatenate([sum_vec(text) for text in tqdm_notebook(x_test)])
# Predict with the trained model and store the results in the test table.
# Assign the array directly: wrapping it in pd.Series would align on a fresh
# 0..n-1 index, which does not match the index of df_new_hotel.
y_pred = model.predict(test_vec)
df_new_hotel['PredLabel'] = y_pred
# Evaluate the predictions
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

Output:

0.6163636363636363
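
To put this accuracy in context, it helps to compare it with a majority-class baseline. The sketch below uses class counts stated earlier in the text and assumes that the figures 550 and 195 describe the test set (550 new hotels, 195 of them labeled 1). That assignment is an assumption, although 550 is consistent with 2219 − 1669 and with the reported accuracy being a multiple of 1/550.

```python
# Counts taken from the text; their assignment to the test set is an assumption
n_test = 550      # new hotels without a rating
n_label_1 = 195   # of these, priced at 150 yuan or less

# Accuracy of a classifier that always predicts the majority class (label 0)
baseline_acc = (n_test - n_label_1) / n_test

model_acc = 0.6163636363636363  # accuracy reported above
n_correct = round(model_acc * n_test)

print(f"baseline: {baseline_acc:.3f}, model: {model_acc:.3f}, correct: {n_correct}")
```

Under that assumption, always predicting label 0 would be right slightly more often than the trained model, which is worth keeping in mind when reading the results.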

The prediction accuracy is only about 60%, which is rather unsatisfactory. To find the main reason, let's look at the hotels that are priced above the threshold but were predicted to be low-priced:

new_hotel_questionable = df_new_hotel[(df_new_hotel['PriceLabel'] ==0) & (df_new_hotel['PredLabel']==1)]
new_hotel_questionable = new_hotel_questionable.sort_values(by='HotelPrice', ascending=False)
new_hotel_questionable

Output:

The results show that many of the new hotels, especially the expensive ones, are "villa"-type resort hotels. This type is rare in the rated data set, so the trained classifier is insensitive to it and the chance of misclassification rises sharply.

plt.figure(figsize=(15, 7))
plt.imshow(get_word_map(new_hotel_questionable['HotelName'].values), interpolation='bilinear')

Output:

<matplotlib.image.AxesImage at 0x7f7333b06d68>

The word cloud of these new hotels shows that, compared with the data set used for modeling, their names contain words that were rarely used before, such as "number shop", "branch store" and "villa", which lowers the prediction accuracy.

Getting to Know the New Hotels

Beyond names, we can also look at the geographical distribution of the newly opened hotels and at how their average prices have changed.

new_hotel_distri = df_new_hotel.groupby('HotelLocation')['HotelName'].count().sort_values(ascending=False)[:7]
plt.pie(new_hotel_distri.values, labels=new_hotel_distri.index, autopct='%.1f%%', shadow=True)

Output:

([<matplotlib.patches.Wedge at 0x7f7333ae1240>,
  <matplotlib.patches.Wedge at 0x7f7333ae1c50>,
  <matplotlib.patches.Wedge at 0x7f7333ae9630>,
  <matplotlib.patches.Wedge at 0x7f7333ae9fd0>,
  <matplotlib.patches.Wedge at 0x7f7333af29b0>,
  <matplotlib.patches.Wedge at 0x7f7333afd390>,
  <matplotlib.patches.Wedge at 0x7f7333afdd30>],
 [Text(0.4952241217848982, 0.9822184427113841, 'Jinzhou District'),
  Text(-1.0502523061308453, 0.32706282801144143, 'ganjingzi district'),
  Text(-0.7189197652449374, -0.8325589295300148, 'shahekou district'),
  Text(0.10878704263418966, -1.0946074087794706, 'Lushunkou District'),
  Text(0.6457239222346646, -0.8905282793117135, 'Zhongshan District'),
  Text(0.9702662169179598, -0.5182503915171803, 'Xigang District'),
  Text(1.0890040760087287, -0.1551454879665377, 'Pulandian District')],
 [Text(0.2701222482463081, 0.5357555142062095, '35.1%'),
  Text(-0.5728648942531882, 0.17839790618805892, '20.1%'),
  Text(-0.39213805376996586, -0.4541230524709171, '16.8%'),
  Text(0.059338386891376174, -0.597058586606984, '9.0%'),
  Text(0.35221304849163515, -0.4857426978063891, '7.8%'),
  Text(0.5292361183188871, -0.2826820317366438, '6.6%'),
  Text(0.5940022232774883, -0.08462481161811146, '4.5%')])

The pie chart shows that 35.1% of the new hotels are located in Jinzhou District, whereas Shahekou District, the established hotel cluster, accounts for only 16.8% of new openings.

df_new_hotel['HotelLabel'] = df_new_hotel["HotelPrice"].apply(lambda x: 'Luxurious' if x > 1000 \
                                                              else ('High-end' if x > 500 \
                                                              else ('Comfortable' if x > 300 \
                                                              else('Economics' if x > 100 \
                                                              else 'cheap')))) 
new_hotel_label = df_new_hotel.groupby('HotelLabel')['HotelName'].count()
plt.pie(new_hotel_label.values, labels=new_hotel_label.index, autopct='%.1f%%', explode=[0, 0.1, 0.1, 0.1, 0.1], shadow=True)

Output:

([<matplotlib.patches.Wedge at 0x7f7333abbdd8>,
  <matplotlib.patches.Wedge at 0x7f7333a44828>,
  <matplotlib.patches.Wedge at 0x7f7333a4d208>,
  <matplotlib.patches.Wedge at 0x7f7333a4dba8>,
  <matplotlib.patches.Wedge at 0x7f7333a59588>],
 [Text(1.0859612910752763, 0.17518011955161772, 'Luxurious'),
  Text(0.6137971106588083, 1.0311416522218948, 'cheap'),
  Text(-1.1999216970224413, 0.01370842860376746, 'Economics'),
  Text(0.46080283077562195, -1.1079985339111122, 'Comfortable'),
  Text(1.1494416996723409, -0.3446502271207151, 'High-end')],
 [Text(0.5923425224046961, 0.09555279248270056, '5.1%'),
  Text(0.3580483145509714, 0.6014992971294385, '22.7%'),
  Text(-0.6999543232630907, 0.007996583352197684, '44.0%'),
  Text(0.26880165128577943, -0.6463324781148153, '18.9%'),
  Text(0.6705076581421987, -0.20104596582041712, '9.3%')])

Besides the cheap and economy hotels that most travelers choose, the share of high-end and luxury hotels is markedly higher among newly opened hotels (9.3% and 5.1%, against 4.1% and 0.9% among rated hotels). Combined with the earlier word cloud of new hotel names, this suggests that more and more operators are investing in high-end hotels, mainly villa-style resort hotels, reflecting travelers' pursuit of quality and a more comfortable experience.

In terms of price, there are also some interesting results:

df2 = df_new_hotel.groupby('HotelLabel')['HotelPrice'].mean().reset_index()
df1=df_in_ana.groupby('HotelLabel')['HotelPrice'].mean().reset_index()
price_change_percent = (df2['HotelPrice'] -  df1['HotelPrice'])/df1['HotelPrice'] * 100
plt.title('Average Price Change of Newly Opened Hotels')
plt.bar(df1['HotelLabel'] ,price_change_percent, width = 0.35)
plt.ylim(-18, 18)
for x, y in enumerate(price_change_percent):
    if y < 0:
        plt.text(x, y, '{:.1f}%'.format(y), ha='center', fontsize=12, va='top')
    else:
        plt.text(x, y, '{:.1f}%'.format(y), ha='center', fontsize=12, va='bottom')

Compared with the old hotels that have been evaluated, the change of the average price of new hotels is as follows:

  1. The average price of luxury and cheap hotels has fallen.
  2. Intermediate hotels, including economy, comfort and high-end hotels, have increased their average prices.
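
The percentage change above is computed by subtracting two frames row by row after `reset_index`, which relies on both frames listing the same grades in the same order. As a hedged alternative, keeping `HotelLabel` as the index lets pandas align the two sides by label. The sketch below uses invented mean prices, not the article's data.

```python
import pandas as pd

# Invented mean prices per grade, for illustration only
old_mean = pd.Series({"cheap": 80.0, "Economics": 180.0, "Luxurious": 1500.0})
new_mean = pd.Series({"Luxurious": 1350.0, "cheap": 76.0, "Economics": 198.0})

# Series arithmetic aligns on the index labels, so row order does not matter
change_pct = (new_mean - old_mean) / old_mean * 100
print(change_pct.round(1))
```

In the article's terms, `df_new_hotel.groupby('HotelLabel')['HotelPrice'].mean()` without the `reset_index()` call already yields a Series of this shape. A grade missing from one side would then show up as NaN instead of silently shifting the comparison.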

Hotels at both extremes attract attention by "lowering their status" to raise occupancy and thereby grow. Mid-range hotels gain capital by changing their business philosophy and following current trends, but how well they ultimately do still depends on guests' approval.

Summary

This experiment analyzes hotels in the Dalian area. It mines information such as price and regional distribution and provides a "grass planting list" and a "minefield list" of rated hotels. (Mom no longer has to worry about where friends will stay when they come to Dalian!) It also analyzes hotel names with word clouds, explores the relationship between hotel name and hotel grade, and builds a classification model to predict whether the name of a new, unrated hotel matches its price level. In addition, it examines the regional and grade distribution of new hotels, compares their average prices with those of rated hotels, and offers some thoughts on the development of tourism in Dalian. Because the data set is small and hotel naming is strongly tied to region, era and environment, the modeling and prediction results are not good, but learning these techniques and applying them in different ways is still interesting and deepens one's understanding.

Also published on my Zhihu column: https://zhuanlan.zhihu.com/p/85909205

Posted on Thu, 10 Oct 2019 05:58:24 -0700 by kevin777