R language introduction: dealing with missing values and data cleaning

R provides several useful functions for cleaning data and handling missing values. Let's first look at what missing data is!

1, Missing value of data

In R, a missing value is represented by NA. When a value in a data set is displayed as NA, that value is missing. Can missing values take part in calculations?

For example, we can create a vector whose first element is the missing value NA, followed by the numbers 1 to 49:

> a<-c(NA,1:49)

a is constructed as follows:

> a
 [1] NA  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
[22] 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
[43] 42 43 44 45 46 47 48 49

If we try to calculate the sum of all the numbers in a, the result is also NA: as soon as one value in the vector is missing, the total is unknown, as the following code shows:

> sum(a)
[1] NA
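The same propagation applies to most arithmetic and comparisons, so it is worth checking for NA before computing. The following is a minimal sketch using base R's is.na(), anyNA() and which(); the variable names other than a are chosen here for illustration only.

```r
# Rebuild the vector from the example above
a <- c(NA, 1:49)

# Arithmetic with NA yields NA
plus_one <- NA + 1            # NA

# Comparing with == does not detect NA: every comparison is itself NA
eq_result <- a == NA          # a vector of 50 NA values

# is.na() is the reliable test
has_missing <- anyNA(a)       # TRUE
n_missing <- sum(is.na(a))    # 1
where_missing <- which(is.na(a))  # 1 (the first position)
```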

So how can we solve this problem? Add the argument na.rm=T when calling sum(). The name stands for "NA remove", and T is short for TRUE, so the missing values are removed before the sum is computed. The code is as follows:

> sum(a,na.rm = T)
[1] 1225

This gives the sum of all the numbers except the missing value. The same argument works for mean():

> mean(a,na.rm = T)
[1] 25
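Several other base R summary functions accept the same argument. The sketch below, which goes beyond the two functions shown in this post, applies it to min(), max() and median(). It spells the value as TRUE, because T is an ordinary variable that can be reassigned while TRUE is a reserved word. The result names are illustrative.

```r
a <- c(NA, 1:49)

# na.rm = TRUE drops the NA before each summary is computed
total  <- sum(a, na.rm = TRUE)     # 1225
avg    <- mean(a, na.rm = TRUE)    # 25
lowest <- min(a, na.rm = TRUE)     # 1
highest <- max(a, na.rm = TRUE)    # 49
middle <- median(a, na.rm = TRUE)  # 25

# Without na.rm the result is NA again
lowest_raw <- min(a)               # NA
```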

Now a question: when the mean is computed this way, is the NA still counted in the number of values that the sum is divided by? The answer is no! The missing value has been removed, so it is counted neither in the sum nor in the number of values. We can verify this with a new sequence that has no missing value:

> mean(1:49)
[1] 25

With only the 49 numbers, the average is also 25, which confirms that na.rm removed the missing value before the calculation.
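The same check can be done by hand. This small sketch divides the sum by the number of non-missing values and, for contrast, by the full length of the vector; the variable names are illustrative.

```r
a <- c(NA, 1:49)

# Number of values that are actually present
n_used <- sum(!is.na(a))                          # 49

# Dividing by the non-missing count reproduces mean(a, na.rm = TRUE)
manual_mean <- sum(a, na.rm = TRUE) / n_used      # 1225 / 49 = 25

# Dividing by length(a) would wrongly count the NA as a value
wrong_mean <- sum(a, na.rm = TRUE) / length(a)    # 1225 / 50 = 24.5
```

Only the first division matches what mean() reports, so the NA is excluded from the denominator as well.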

2, Data cleaning

Let's say we have a data frame that contains several variables, some of whose values are missing and appear as NA. A typical example is the sleep data frame in the VIM package. Load the package as follows:

> library(VIM)
Loading required package: colorspace
Loading required package: grid
Loading required package: data.table
data.table 1.12.8 using 2 threads (see ?getDTthreads).  Latest news: r-datatable.com
VIM is ready to use. 
 Since version 4.0.0 the GUI is in its own package VIMGUI.

          Please use the package to use the new (and old) GUI.

Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues

Attaching package: 'VIM'

The following object is masked from 'package:datasets':

    sleep

Warning messages:
1: package 'VIM' was built under R version 3.6.3 
2: package 'data.table' was built under R version 3.6.3

Then print the sleep data set:

> sleep
    BodyWgt BrainWgt NonD Dream Sleep  Span  Gest Pred Exp Danger
1  6654.000  5712.00   NA    NA   3.3  38.6 645.0    3   5      3
2     1.000     6.60  6.3   2.0   8.3   4.5  42.0    3   1      3
3     3.385    44.50   NA    NA  12.5  14.0  60.0    1   1      1
4     0.920     5.70   NA    NA  16.5    NA  25.0    5   2      3
5  2547.000  4603.00  2.1   1.8   3.9  69.0 624.0    3   5      4
6    10.550   179.50  9.1   0.7   9.8  27.0 180.0    4   4      4
7     0.023     0.30 15.8   3.9  19.7  19.0  35.0    1   1      1
8   160.000   169.00  5.2   1.0   6.2  30.4 392.0    4   5      4
9     3.300    25.60 10.9   3.6  14.5  28.0  63.0    1   2      1
10   52.160   440.00  8.3   1.4   9.7  50.0 230.0    1   1      1
11    0.425     6.40 11.0   1.5  12.5   7.0 112.0    5   4      4
12  465.000   423.00  3.2   0.7   3.9  30.0 281.0    5   5      5
13    0.550     2.40  7.6   2.7  10.3    NA    NA    2   1      2
14  187.100   419.00   NA    NA   3.1  40.0 365.0    5   5      5
15    0.075     1.20  6.3   2.1   8.4   3.5  42.0    1   1      1
16    3.000    25.00  8.6   0.0   8.6  50.0  28.0    2   2      2
17    0.785     3.50  6.6   4.1  10.7   6.0  42.0    2   2      2
18    0.200     5.00  9.5   1.2  10.7  10.4 120.0    2   2      2
19    1.410    17.50  4.8   1.3   6.1  34.0    NA    1   2      1
20   60.000    81.00 12.0   6.1  18.1   7.0    NA    1   1      1
21  529.000   680.00   NA   0.3    NA  28.0 400.0    5   5      5
22   27.660   115.00  3.3   0.5   3.8  20.0 148.0    5   5      5
23    0.120     1.00 11.0   3.4  14.4   3.9  16.0    3   1      2
24  207.000   406.00   NA    NA  12.0  39.3 252.0    1   4      1
25   85.000   325.00  4.7   1.5   6.2  41.0 310.0    1   3      1
26   36.330   119.50   NA    NA  13.0  16.2  63.0    1   1      1
27    0.101     4.00 10.4   3.4  13.8   9.0  28.0    5   1      3
28    1.040     5.50  7.4   0.8   8.2   7.6  68.0    5   3      4
29  521.000   655.00  2.1   0.8   2.9  46.0 336.0    5   5      5
30  100.000   157.00   NA    NA  10.8  22.4 100.0    1   1      1
31   35.000    56.00   NA    NA    NA  16.3  33.0    3   5      4
32    0.005     0.14  7.7   1.4   9.1   2.6  21.5    5   2      4
33    0.010     0.25 17.9   2.0  19.9  24.0  50.0    1   1      1
34   62.000  1320.00  6.1   1.9   8.0 100.0 267.0    1   1      1
35    0.122     3.00  8.2   2.4  10.6    NA  30.0    2   1      1
36    1.350     8.10  8.4   2.8  11.2    NA  45.0    3   1      3
37    0.023     0.40 11.9   1.3  13.2   3.2  19.0    4   1      3
38    0.048     0.33 10.8   2.0  12.8   2.0  30.0    4   1      3
39    1.700     6.30 13.8   5.6  19.4   5.0  12.0    2   1      1
40    3.500    10.80 14.3   3.1  17.4   6.5 120.0    2   1      1
41  250.000   490.00   NA   1.0    NA  23.6 440.0    5   5      5
42    0.480    15.50 15.2   1.8  17.0  12.0 140.0    2   2      2
43   10.000   115.00 10.0   0.9  10.9  20.2 170.0    4   4      4
44    1.620    11.40 11.9   1.8  13.7  13.0  17.0    2   1      2
45  192.000   180.00  6.5   1.9   8.4  27.0 115.0    4   4      4
46    2.500    12.10  7.5   0.9   8.4  18.0  31.0    5   5      5
47    4.288    39.20   NA    NA  12.5  13.7  63.0    2   2      2
48    0.280     1.90 10.6   2.6  13.2   4.7  21.0    3   1      3
49    4.235    50.40  7.4   2.4   9.8   9.8  52.0    1   1      1
50    6.800   179.00  8.4   1.2   9.6  29.0 164.0    2   3      2
51    0.750    12.30  5.7   0.9   6.6   7.0 225.0    2   2      2
52    3.600    21.00  4.9   0.5   5.4   6.0 225.0    3   2      3
53   14.830    98.20   NA    NA   2.6  17.0 150.0    5   5      5
54   55.500   175.00  3.2   0.6   3.8  20.0 151.0    5   5      5
55    1.400    12.50   NA    NA  11.0  12.7  90.0    2   2      2
56    0.060     1.00  8.1   2.2  10.3   3.5    NA    3   1      2
57    0.900     2.60 11.0   2.3  13.3   4.5  60.0    2   1      2
58    2.000    12.30  4.9   0.5   5.4   7.5 200.0    3   1      3
59    0.104     2.50 13.2   2.6  15.8   2.3  46.0    3   2      2
60    4.190    58.00  9.7   0.6  10.3  24.0 210.0    4   3      4
61    3.500     3.90 12.8   6.6  19.4   3.0  14.0    2   1      1
62    4.050    17.00   NA    NA    NA  13.0  38.0    3   1      1
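Before removing anything, it can help to measure how much is missing. The sketch below is an addition to the post: so that it runs without VIM installed, it rebuilds rows 1 to 5 of five columns from the table above by hand as a hypothetical data frame named sleep_small. With VIM loaded, the same calls should work on sleep itself.

```r
# Hypothetical subset: rows 1-5 of BodyWgt, NonD, Dream, Sleep, Span
sleep_small <- data.frame(
  BodyWgt = c(6654, 1, 3.385, 0.92, 2547),
  NonD    = c(NA, 6.3, NA, NA, 2.1),
  Dream   = c(NA, 2.0, NA, NA, 1.8),
  Sleep   = c(3.3, 8.3, 12.5, 16.5, 3.9),
  Span    = c(38.6, 4.5, 14.0, NA, 69.0)
)

# is.na() on a data frame returns a logical matrix of the same shape
na_per_column <- colSums(is.na(sleep_small))  # BodyWgt 0, NonD 3, Dream 3, Sleep 0, Span 1
na_per_row    <- rowSums(is.na(sleep_small))  # 2, 0, 2, 3, 0
na_total      <- sum(is.na(sleep_small))      # 7

# Share of missing values in one column
nond_missing_share <- mean(is.na(sleep_small$NonD))  # 0.6
```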

We can see that many values are missing (NA). How can we clean them up? Use the na.omit() function, which drops every row that contains at least one NA and keeps only the complete rows of the data frame. It is used as follows:

> na.omit(sleep) # Delete all the rows that contain the missing value NA
    BodyWgt BrainWgt NonD Dream Sleep  Span  Gest Pred Exp Danger
2     1.000     6.60  6.3   2.0   8.3   4.5  42.0    3   1      3
5  2547.000  4603.00  2.1   1.8   3.9  69.0 624.0    3   5      4
6    10.550   179.50  9.1   0.7   9.8  27.0 180.0    4   4      4
7     0.023     0.30 15.8   3.9  19.7  19.0  35.0    1   1      1
8   160.000   169.00  5.2   1.0   6.2  30.4 392.0    4   5      4
9     3.300    25.60 10.9   3.6  14.5  28.0  63.0    1   2      1
10   52.160   440.00  8.3   1.4   9.7  50.0 230.0    1   1      1
11    0.425     6.40 11.0   1.5  12.5   7.0 112.0    5   4      4
12  465.000   423.00  3.2   0.7   3.9  30.0 281.0    5   5      5
15    0.075     1.20  6.3   2.1   8.4   3.5  42.0    1   1      1
16    3.000    25.00  8.6   0.0   8.6  50.0  28.0    2   2      2
17    0.785     3.50  6.6   4.1  10.7   6.0  42.0    2   2      2
18    0.200     5.00  9.5   1.2  10.7  10.4 120.0    2   2      2
22   27.660   115.00  3.3   0.5   3.8  20.0 148.0    5   5      5
23    0.120     1.00 11.0   3.4  14.4   3.9  16.0    3   1      2
25   85.000   325.00  4.7   1.5   6.2  41.0 310.0    1   3      1
27    0.101     4.00 10.4   3.4  13.8   9.0  28.0    5   1      3
28    1.040     5.50  7.4   0.8   8.2   7.6  68.0    5   3      4
29  521.000   655.00  2.1   0.8   2.9  46.0 336.0    5   5      5
32    0.005     0.14  7.7   1.4   9.1   2.6  21.5    5   2      4
33    0.010     0.25 17.9   2.0  19.9  24.0  50.0    1   1      1
34   62.000  1320.00  6.1   1.9   8.0 100.0 267.0    1   1      1
37    0.023     0.40 11.9   1.3  13.2   3.2  19.0    4   1      3
38    0.048     0.33 10.8   2.0  12.8   2.0  30.0    4   1      3
39    1.700     6.30 13.8   5.6  19.4   5.0  12.0    2   1      1
40    3.500    10.80 14.3   3.1  17.4   6.5 120.0    2   1      1
42    0.480    15.50 15.2   1.8  17.0  12.0 140.0    2   2      2
43   10.000   115.00 10.0   0.9  10.9  20.2 170.0    4   4      4
44    1.620    11.40 11.9   1.8  13.7  13.0  17.0    2   1      2
45  192.000   180.00  6.5   1.9   8.4  27.0 115.0    4   4      4
46    2.500    12.10  7.5   0.9   8.4  18.0  31.0    5   5      5
48    0.280     1.90 10.6   2.6  13.2   4.7  21.0    3   1      3
49    4.235    50.40  7.4   2.4   9.8   9.8  52.0    1   1      1
50    6.800   179.00  8.4   1.2   9.6  29.0 164.0    2   3      2
51    0.750    12.30  5.7   0.9   6.6   7.0 225.0    2   2      2
52    3.600    21.00  4.9   0.5   5.4   6.0 225.0    3   2      3
54   55.500   175.00  3.2   0.6   3.8  20.0 151.0    5   5      5
57    0.900     2.60 11.0   2.3  13.3   4.5  60.0    2   1      2
58    2.000    12.30  4.9   0.5   5.4   7.5 200.0    3   1      3
59    0.104     2.50 13.2   2.6  15.8   2.3  46.0    3   2      2
60    4.190    58.00  9.7   0.6  10.3  24.0 210.0    4   3      4
61    3.500     3.90 12.8   6.6  19.4   3.0  14.0    2   1      1

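As we can see, the processed data contains no NA values: 42 of the original 62 rows remain. Note that na.omit() returns a new data frame and leaves sleep itself unchanged, so assign the result to a variable if you want to keep it. Data cleaning is complete.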
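To keep the cleaned data, the result of na.omit() has to be stored. The following sketch is not part of the original session: it uses a hypothetical five-row data frame, sleep_small, copied by hand from rows 1 to 5 of the table above, and also shows base R's complete.cases() as an equivalent way to select the same rows.

```r
# Hypothetical subset: rows 1-5 of BodyWgt, NonD, Dream, Sleep, Span
sleep_small <- data.frame(
  BodyWgt = c(6654, 1, 3.385, 0.92, 2547),
  NonD    = c(NA, 6.3, NA, NA, 2.1),
  Dream   = c(NA, 2.0, NA, NA, 1.8),
  Sleep   = c(3.3, 8.3, 12.5, 16.5, 3.9),
  Span    = c(38.6, 4.5, 14.0, NA, 69.0)
)

# Store the cleaned copy; sleep_small itself is not modified
cleaned <- na.omit(sleep_small)

rows_before <- nrow(sleep_small)   # 5
rows_after  <- nrow(cleaned)       # 2 (original rows 2 and 5)
still_has_na <- anyNA(cleaned)     # FALSE

# complete.cases() marks rows with no NA; subsetting keeps the same rows
keep <- complete.cases(sleep_small)   # FALSE TRUE FALSE FALSE TRUE
cleaned_alt <- sleep_small[keep, ]
```

Both approaches keep the original row names, which is why the cleaned output above is still numbered 2, 5, 6 and so on rather than 1, 2, 3.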

Posted on Mon, 16 Mar 2020 00:55:14 -0700 by toxictoad