1. Demand analysis
1. Get the task ID from the spark-submit script submitted by the user, and obtain the task's parameters.
2. Obtain the data within the specified date range for calculation, extract the page slice flow, and compute the conversion ratio of visits between pages.
e.g. targetPageFlow: ...
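The conversion-ratio idea in step 2 can be sketched in plain Python (the data shapes and function name below are hypothetical stand-ins; the post's actual implementation uses Spark):

```python
# Minimal sketch: given per-session page-visit sequences and a target page
# flow such as [1, 2, 3], count how many sessions contain each page-to-page
# hop and derive the step-by-step conversion ratio.
from collections import Counter

def page_convert_rates(sessions, target_flow):
    """sessions: list of page-id sequences; target_flow: e.g. [1, 2, 3]."""
    hops = [(target_flow[i], target_flow[i + 1])
            for i in range(len(target_flow) - 1)]
    counts = Counter()
    for pages in sessions:
        visited_hops = set(zip(pages, pages[1:]))  # adjacent page pairs seen
        for hop in hops:
            if hop in visited_hops:
                counts[hop] += 1
    # Sessions that reached the first page of the flow form the base.
    prev = sum(1 for pages in sessions if target_flow[0] in pages)
    rates = {}
    for hop in hops:
        rates[hop] = counts[hop] / prev if prev else 0.0
        prev = counts[hop]
    return rates
```

In a Spark job the same counting would be expressed over an RDD or DataFrame of sessions, but the ratio arithmetic is identical.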
Posted on Fri, 05 Jun 2020 00:03:52 -0700 by defunct
Spark on MaxCompute can access instances (e.g. ECS, HBase, RDS) within a VPC on Alibaba Cloud. By default, the underlying MaxCompute network is isolated from external networks, and Spark on MaxCompute provides a solution: configure spark.hadoop.odps.cupid.vpc.domain.list to access HBase inside an Alibaba Cloud VPC network e ...
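For illustration, a whitelist entry for this key typically looks like the fragment below; treat the exact JSON schema and all values (region, vpcId, domain, port) as placeholder assumptions to be checked against the official MaxCompute documentation:

```properties
# Hypothetical example — region, vpcId, domain, and port are placeholders.
spark.hadoop.odps.cupid.vpc.domain.list={"regionId":"cn-beijing","vpcs":[{"vpcId":"vpc-xxxxxxxx","zones":[{"urls":[{"domain":"hb-xxxxxxxx.hbase.rds.aliyuncs.com","port":2181}]}]}]}
```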
Posted on Mon, 01 Jun 2020 23:53:10 -0700 by Assorro
Training DeepFM under PAI-Notebook
DeepFM is arguably one of the most common CTR prediction models at present. For a recommendation system based on CTR estimation, the most important task is to learn the feature combinations behind user click behavior. In different recommendation scenarios, low-order or high-order combinatorial fe ...
Posted on Thu, 14 May 2020 20:06:07 -0700 by ZaZall
Create a SparkSession
from pyspark.sql import SparkSession, HiveContext  # HiveContext is deprecated since Spark 2.0
spark = SparkSession.builder.enableHiveSupport().appName('test_app').getOrCreate()
sc = spark.sparkContext
hc = HiveContext(sc)  # kept for legacy code; spark.sql() covers the same use cases
1. Spark creates partition table
# You can change append to overwrite, so that if the table already exists, the previous table will be deleted and a ...
Posted on Mon, 11 May 2020 01:18:45 -0700 by [xNet]DrDre
1. What is fragment (multipart) upload
Fragment upload splits a large file into several blocks and transmits them one by one. The benefit is reduced re-upload overhead: if the file being uploaded is large, the upload takes a long time, and under the influence of various network-instability factors it is easy t ...
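The core idea can be shown locally in a few lines of plain Python (the function names and chunk size are illustrative; a real client would upload each block over the network and retry only the failed ones):

```python
# Minimal local sketch of multipart upload: split a payload into fixed-size
# blocks, transfer each block independently, and reassemble on the far side.
CHUNK_SIZE = 4  # bytes per block here; real uploads use several MB per part

def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    # Slice the payload into consecutive blocks of at most chunk_size bytes.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

def reassemble(chunks):
    # The receiver concatenates the blocks in order to restore the file.
    return b''.join(chunks)

payload = b'hello multipart upload'
chunks = split_chunks(payload)
restored = reassemble(chunks)
```

A failed block can be retried alone, which is exactly the "reduced re-upload overhead" the excerpt describes.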
Posted on Tue, 05 May 2020 15:23:53 -0700 by affordit
Environment: Dell XPS 15 (32 GB RAM, 1 TB SSD, a 1 TB Samsung portable SSD on an external Thunderbolt 3 interface, and a 4 TB WD Elements external mechanical hard disk) running Win10; three CentOS 7 virtual machines are used to test a CDH 6.3.2 cluster (the highest version of the free community edition), with self-compiled Phoenix 5.1.0, Flink 1.10.0, Elasticsearch 6. ...
Posted on Tue, 05 May 2020 05:12:05 -0700 by mwichmann4
The traditional Hive computing engine is MapReduce. After Spark 1.3, Spark SQL was officially released, and it is largely compatible with Apache Hive. Thanks to Spark's powerful computing capability, processing Hive data with Spark is far faster than traditional Hive. Using Spark SQL in IDEA to read the data in H ...
Posted on Mon, 30 Mar 2020 14:23:09 -0700 by bl00dshooter
At present, the MaxCompute platform can run Spark jobs. Spark jobs rely on MaxCompute's Cupid platform and can be submitted to MaxCompute in a community-compatible way. They support reading and writing MaxCompute tables and share project resources with the original SQL/MR jobs on MaxCompute. Please refer to ...
Posted on Tue, 03 Mar 2020 00:44:54 -0800 by edmore
test.csv and train.csv data preprocessing
Processing of test.csv file
Processing of train.csv file
Spark processes data
Upload files to HDFS
Launch Spark Shell
Prediction of repeat customers by SVM classifier
Output results to MySQL ...
Posted on Wed, 26 Feb 2020 22:30:09 -0800 by artic
In this blog post, Alice walks you through Spark commands in more detail.
Previously, we used spark-shell to submit tasks. spark-shell is the interactive shell program that ships with Spark, which makes it easy for users to program interactively; users can write Spark programs with ...
Posted on Thu, 20 Feb 2020 17:41:05 -0800 by Cut