We can recode the missing values in vector x with the mean of x by first subsetting the vector to identify the NAs and then assigning those elements a value.
# recode missing values with the mean
# vector with missing data
x <- c(1:4, NA, 6:7, NA)
x
## [1] 1 2 3 4 NA 6 7 NA
x[is.na(x)] <- mean(x, na.rm = TRUE)
round(x, 2)
## [1] 1.00 2.00 3.00 4.00 3.83 6.00 7.00 3.83
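The assignment above overwrites x in place. If the original vector should stay untouched, or if the median is a better fit than the mean (for example with skewed data), one possible variation is base R's replace(), which returns a modified copy. This is a minimal sketch; the name x_median is only illustrative.

```r
# same vector as above
x <- c(1:4, NA, 6:7, NA)

# replace() returns a modified copy, so x itself keeps its NAs
x_median <- replace(x, is.na(x), median(x, na.rm = TRUE))

x_median        # NAs filled with the median of the observed values
sum(is.na(x))   # x still contains its two NAs
```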
# data frame that codes missing values as 99
df <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2))
# change 99s to NAs
df[df == 99] <- NA
df
## col1 col2
## 1 1 2.5
## 2 2 4.2
## 3 3 NA
## 4 NA 3.2
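Note that df[df == 99] <- NA replaces every 99 in the data frame. If 99 could be a legitimate value in some columns, it may be safer to recode one column at a time. The sketch below assumes, purely for illustration, that 99 marks a missing value in col2 only.

```r
# data frame where 99 marks a missing value in col2 only (assumed for illustration)
df <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2))

# recode 99 to NA in col2 and leave col1 untouched
df$col2[df$col2 == 99] <- NA

df$col1   # the 99 in col1 is kept as a real value
df$col2   # the 99 in col2 is now NA
```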
Recode missing values in a data frame column with that column's mean:
# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
col2 = c("this", NA,"is", "text"),
col3 = c(TRUE, FALSE, TRUE, TRUE),
col4 = c(2.5, 4.2, 3.2, NA),
stringsAsFactors = FALSE)
df$col4[is.na(df$col4)] <- mean(df$col4, na.rm = TRUE)
df
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 2 2 <NA> FALSE 4.2
## 3 3 is TRUE 3.2
## 4 NA text TRUE 3.3
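The same idea can be extended to every numeric column at once. The following is a sketch, not the only way to do it: it assumes mean imputation is appropriate for all numeric columns, which may not hold for identifiers or counts. The name num_cols is introduced here for illustration.

```r
# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# flag the numeric columns (col1 and col4)
num_cols <- vapply(df, is.numeric, logical(1))

# fill the NAs in each numeric column with that column's mean
df[num_cols] <- lapply(df[num_cols], function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})

df   # col2 keeps its NA because it is not numeric
```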
#Exclude missing values
There are several ways to exclude missing values.
*1. Use the na.rm argument of a summary function.
# A vector with missing values
x <- c(1:4, NA, 6:7, NA)
# including NA values will produce an NA output
mean(x)
## [1] NA
# excluding NA values computes the result
# from the non-missing values only
mean(x, na.rm = TRUE)
## [1] 3.833333
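The na.rm argument is not specific to mean(); several other base R summary functions accept it as well. A short sketch on the same vector, with the expected values noted in the comments:

```r
# a vector with missing values
x <- c(1:4, NA, 6:7, NA)

sum(is.na(x))             # number of missing values: 2
sum(x, na.rm = TRUE)      # 23
median(x, na.rm = TRUE)   # 3.5
range(x, na.rm = TRUE)    # 1 7
```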
*2. Create a data frame with missing values to work with.
# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
col2 = c("this", NA,"is", "text"),
col3 = c(TRUE, FALSE, TRUE, TRUE),
col4 = c(2.5, 4.2, 3.2, NA),
stringsAsFactors = FALSE)
df
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 2 2 <NA> FALSE 4.2
## 3 3 is TRUE 3.2
## 4 NA text TRUE NA
*3. Use complete.cases() to flag the rows that have no missing values.
complete.cases(df)
## [1] TRUE FALSE TRUE FALSE
# subset with complete.cases to get complete cases
df[complete.cases(df), ]
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 3 3 is TRUE 3.2
# or subset with `!` operator to get incomplete cases
df[!complete.cases(df), ]
## col1 col2 col3 col4
## 2 2 <NA> FALSE 4.2
## 4 NA text TRUE NA
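complete.cases() can also be applied to a subset of columns, which is useful when missing values matter only in some variables. In this sketch we assume only col1 and col4 matter; the name kept is illustrative.

```r
# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# keep rows that are complete in col1 and col4, ignoring NAs elsewhere
kept <- df[complete.cases(df[, c("col1", "col4")]), ]

kept   # row 2 is retained even though col2 is NA; row 4 is dropped
```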
*4. Use na.omit() to drop the rows that contain missing values.
# na.omit() returns the same complete cases as above
na.omit(df)
## col1 col2 col3 col4
## 1 1 this TRUE 2.5
## 3 3 is TRUE 3.2
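The result of na.omit() also records which rows were removed, in an attribute called "na.action". A brief sketch of how this could be inspected; the names cleaned and dropped are illustrative.

```r
# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

cleaned <- na.omit(df)

# how many rows were dropped
nrow(df) - nrow(cleaned)

# indices of the dropped rows (rows 2 and 4 here)
dropped <- attr(cleaned, "na.action")
dropped
```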
#Exercises
How many missing values are in the built-in data set airquality?
Which variables are the missing values concentrated in?
How would you impute the mean or median for these values?
How would you omit all rows containing missing values?