Analysis of the Spark checkpoint principle and source code

I. Overview. What is Checkpoint? In production, a Spark job often chains a large number of RDD transformations (for example, a Job contains 10000 RDDs), or the RDDs generated by a specific transformation are particularly complex and time-consuming to compute (for example, the calculation often exce ...

Posted on Sat, 15 Feb 2020 21:22:45 -0800 by jf3000

Spark source code analysis: when is the Master's onStart() method called?

The life cycle methods of Master are: constructor -> onStart -> receive* -> onStop. Master's main method never calls onStart directly, so when is onStart called? The answer lies in the underlying Netty communication architecture of Spark. In th ...

Posted on Fri, 07 Feb 2020 08:25:08 -0800 by jib

Machine learning feature engineering

Python machine learning 3-day quick start (2018) [dark horse programmer] (2) Feature engineering. 1. Dictionary feature extraction: from sklearn.feature_extraction import DictVectorizer def dict_demo(): ''' Dictionary feature extraction :return: ' ...
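The idea behind dictionary feature extraction can be sketched in plain Python. This is not the post's dict_demo() and not scikit-learn's implementation; it only illustrates the usual behavior of one-hot encoding string values and passing numeric values through. The function name dict_features and the sample records are hypothetical.

```python
def dict_features(records):
    """Turn a list of dicts into (feature_names, rows).

    Assumption: string values become one-hot "key=value" columns,
    numeric values keep their own column and are passed through.
    """
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # one-hot column
            else:
                row[index[key]] = float(value)      # numeric passthrough
        rows.append(row)
    return names, rows


# Hypothetical sample records, for illustration only.
records = [
    {"city": "Beijing", "temperature": 100},
    {"city": "Shanghai", "temperature": 60},
]
names, rows = dict_features(records)
```

Each distinct string value gets its own column, so the number of columns grows with the number of categories rather than with the number of records.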

Posted on Mon, 03 Feb 2020 09:03:03 -0800 by MitchEvans

RDD programming study notes 3: data reading and writing

Local read:
scala> var textFile = sc.textFile("file:///root/1.txt")
textFile: org.apache.spark.rdd.RDD[String] = file:///root/1.txt MapPartitionsRDD[57] at textFile at <console>:24
scala> textFile.saveAsTextFile("file:///root/writeback")
scala> textFile.foreach(println)
hadoop hello bi ...

Posted on Wed, 29 Jan 2020 05:34:55 -0800 by dizel247

Learning Scala: "Class - Basic Concepts 3"

Type checks using pattern matching. In practical development, for example in Spark's source code, pattern matching is used in many places to check types; it is more concise and clear, and the code is easier to maintain and extend. Functionally, pattern matching works just like isInstanceOf: it is sufficient to judge objects that are predomina ...

Posted on Mon, 27 Jan 2020 19:59:45 -0800 by psyion

Forest Rain Case--Analysis of Taobao Fake Data

Data analysis and forecast of Taobao Double 11 (Shuang11). Preparation: software tools. The systems and software involved in this case: Linux (CentOS 7), MySQL, Tomcat (7.0.9), Hadoop (3.2.0), Hive (2.3.5), Sqoop (1.4.6), ECharts (4.5.0), IDEA (2019.1.3), Spark (2.3.1) ...

Posted on Thu, 23 Jan 2020 01:15:49 -0800 by designedfree4u

Spring Cloud Part 9: distributed service tracing with Sleuth

This is the ninth article in the Spring Cloud column. Reading the first eight articles will help you better understand this one: Introduction to Spring Cloud and its common components; Spring Cloud Part 2: using and understanding the Eureka registry; Spring Cloud Part 3: building a highly available Eureka registry; Spring Cloud Par ...

Posted on Thu, 19 Dec 2019 03:08:19 -0800 by cdherold

Use Spark RDDs to calculate how long cell phones stay at each base station

lac_log.txt:
9F36407EAD0629FC166F14DDE7970F68,116.304864,40.050645,6
CC0710CC94ECC657A8561DE549D940E0,116.303955,40.041935,6
16030401EAFB68F1E3CDF819735E1C66,116.296302,40.032296,6
user.log:
18611132889,20160327075000,9F36407EAD0629FC166F14DDE7970F68,1
18688888888,20160327075100,9F36407EAD0629FC166F14DDE7970F68,1
18611132889,2016 ...

Posted on Thu, 12 Dec 2019 09:10:21 -0800 by mcbeckel

RDD learning summary

1. Introduction to Spark. Spark 1.2.0 uses Scala 2.10, so applications must be written with a compatible Scala version (for example, 2.10.X). When writing a Spark application, you need to add a Maven dependency on Spark, which is available from the Maven Central repository: groupId = org.apache.spark, artifactId = spark-core_ ...

Posted on Thu, 12 Dec 2019 06:48:43 -0800 by ch3m1st

Spark: distinguishing isolated networks from complex, unclear networks

The relation data consists of from -> to edges, with the data format set to Long. When computing networks with Spark, some algorithms cannot be implemented, or cost too much, because of the large amount of data. To reduce the computation load or optimize the calculation method, the isolated data relations in the whole relationship n ...
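One way to picture the separation is a connected-components pass over the from -> to edges. The sketch below is plain Python with union-find, not the post's Spark code. It assumes, as a hypothetical rule, that a relation is "isolated" when its component has at most max_size nodes; split_edges, max_size and the sample edges are illustrative names and data only.

```python
from collections import defaultdict


def find(parent, x):
    """Return the root of x, shortening the path as we go."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def split_edges(edges, max_size=2):
    """Split edges into (isolated, connected) by component size.

    Assumption: a component with at most max_size nodes is isolated.
    """
    parent = {}
    for src, dst in edges:
        parent.setdefault(src, src)
        parent.setdefault(dst, dst)
        root_a, root_b = find(parent, src), find(parent, dst)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    # Count the nodes in each component.
    size = defaultdict(int)
    for node in parent:
        size[find(parent, node)] += 1

    isolated, connected = [], []
    for src, dst in edges:
        small = size[find(parent, src)] <= max_size
        (isolated if small else connected).append((src, dst))
    return isolated, connected


# Hypothetical Long-valued edges, for illustration only.
edges = [(1, 2), (2, 3), (3, 1), (10, 11)]
isolated, connected = split_edges(edges)
```

Here the pair (10, 11) forms a two-node component and is set aside, leaving only the three-node component for the more expensive algorithms.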

Posted on Wed, 11 Dec 2019 09:39:39 -0800 by misheck