-Dealing with NA's

#Test for missing values

# vector with missing data
x <- c(1:4, NA, 6:7, NA)
x
## [1]  1  2  3  4 NA  6  7 NA

is.na(x)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# identify NAs in full data frame
is.na(df)
##       col1  col2  col3  col4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE  TRUE

# identify NAs in specific data frame column
is.na(df$col4)
## [1] FALSE FALSE FALSE  TRUE

To identify the location or the number of NAs, we can combine is.na() with the which() and sum() functions.
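A minimal sketch of this, reusing the vector x defined earlier. The names na_positions and na_count are only illustrative:

```r
# same vector as in the earlier example
x <- c(1:4, NA, 6:7, NA)

# which() returns the positions where is.na(x) is TRUE
na_positions <- which(is.na(x))
na_positions
## [1] 5 8

# sum() counts the TRUE values, since TRUE is treated as 1
na_count <- sum(is.na(x))
na_count
## [1] 2
```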

For data frames, a convenient shortcut for counting the missing values in each column is to apply colSums() to the result of is.na().
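As a sketch, using the data frame df from the earlier example (na_per_column is an illustrative name):

```r
# same data frame as in the earlier example
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# is.na(df) is a logical matrix; colSums() adds up the TRUEs per column
na_per_column <- colSums(is.na(df))
na_per_column
## col1 col2 col3 col4 
##    1    1    0    1
```

The same idea with rowSums() would count the missing values in each row instead.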

#Recode missing values

We can recode the missing values in vector x with the mean of the non-missing values in x by first subsetting the vector to identify the NAs and then assigning a value to those elements.
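One way this might look, assuming the vector x from earlier. Note that mean() needs na.rm = TRUE here, otherwise it returns NA itself:

```r
# same vector as in the earlier example
x <- c(1:4, NA, 6:7, NA)

# subset the NA elements and assign them the mean of the observed values
x[is.na(x)] <- mean(x, na.rm = TRUE)

round(x, 2)
## [1] 1.00 2.00 3.00 4.00 3.83 6.00 7.00 3.83
```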

The same approach recodes values in a data frame:
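As an illustration, the sketch below fills the missing value in col4 of the earlier data frame df with that column's mean. Choosing col4 and the mean is an assumption made for the example; any column and replacement value could be used the same way.

```r
# same data frame as in the earlier example
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# replace NAs in one column with the mean of its observed values
df$col4[is.na(df$col4)] <- mean(df$col4, na.rm = TRUE)

# col4 no longer contains NA (row 4 is now roughly 3.3);
# the NAs in col1 and col2 are left untouched
df$col4
```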

#Exclude missing values

There are several ways to exclude missing values.

*1.

*2.

*3.

*4.
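Commonly used base R options for excluding missing values include the na.rm argument, logical subsetting with !is.na(), complete.cases() and na.omit(). The sketch below applies them to the x and df objects from earlier. These are offered as examples rather than an exhaustive list, and the result names (x_complete, df_complete, df_omitted) are illustrative.

```r
x <- c(1:4, NA, 6:7, NA)
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA, "is", "text"),
                 col3 = c(TRUE, FALSE, TRUE, TRUE),
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# ignore NAs inside a single calculation
mean(x)                  # NA, because x contains missing values
mean(x, na.rm = TRUE)    # mean of the six observed values

# drop NAs from a vector
x_complete <- x[!is.na(x)]

# keep only the rows of a data frame that have no NA in any column
df_complete <- df[complete.cases(df), ]

# na.omit() drops the same rows and records which ones were removed
df_omitted <- na.omit(df)
```

Here only rows 1 and 3 of df are complete, so both df_complete and df_omitted keep two rows.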

#Exercises

  1. How many missing values are in the built-in data set airquality?

  2. Which variables are the missing values concentrated in?

  3. How would you impute the mean or median for these values?

  4. How would you omit all rows containing missing values?
