pyspark implementation of cumulative sum of columns

pyspark implementation of cumulative sum of columns

The cumsum() function of pandas can realize the cumulative sum of columns. Examples of use are as follows:

import pandas as pd
data = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
data = pd.DataFrame(data, columns=['diff'])
data['cumsum_num'] = data['diff'].cumsum()
print(data)

Output results:

    diff  cumsum_num
0      1           1
1      0           1
2      0           1
3      0           1
4      1           2
5      0           2
6      0           2
7      1           3
8      0           3
9      0           3
10     0           3

For pyspark, there is no cumsum() function to add directly. To achieve the cumulative sum, you can sum the sorted windows by creating an ordered column. The code is as follows:

To create a DataFrame object:

import pyspark
from pyspark.sql import functions as fn
from pyspark.sql import SparkSession
from pyspark.sql import Window
import pandas as pd

conf = pyspark.SparkConf().setAll([])
spark_session = SparkSession.builder.appName('test_app').config(conf=conf).getOrCreate()
sc = spark_session.sparkContext
sc.setLogLevel('WARN')

data = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
data = pd.DataFrame(data, columns=['diff'])
data['number'] = range(len(data))
data = spark_session.createDataFrame(data, schema=['diff', 'number'])
data.show()

Original DataFrame data:

+----+------+
|diff|number|
+----+------+
|   1|     0|
|   0|     1|
|   0|     2|
|   0|     3|
|   1|     4|
|   0|     5|
|   0|     6|
|   1|     7|
|   0|     8|
|   0|     9|
|   0|    10|
+----+------+

The cumulative sum is achieved according to the number sequence:

win = Window.orderBy('number')
data = data.withColumn('cumsum_num', fn.sum(data['diff']).over(win))
data.show()

The result is:

+----+------+----------+
|diff|number|cumsum_num|
+----+------+----------+
|   1|     0|         1|
|   0|     1|         1|
|   0|     2|         1|
|   0|     3|         1|
|   1|     4|         2|
|   0|     5|         2|
|   0|     6|         2|
|   1|     7|         3|
|   0|     8|         3|
|   0|     9|         3|
|   0|    10|         3|
+----+------+----------+

Tags: SQL Windows

Posted on Sat, 02 Nov 2019 14:08:07 -0700 by spheonix