R Programming
  • The wikipedia of R by me
  • Hello R
    • -What is R & RStudio
    • -Learning sources
    • -R online editor
    • -R environment
  • Data types
    • -Dealing with Number
    • -Dealing with String
    • -Dealing with Dates
    • -Dealing with NA's
    • -Dealing with Logicals
    • -Dealing with Factors
  • R data
    • -R object
    • -Data structures
      • --Basics
      • --Managing Vectors
      • --Managing Matrices
      • --Managing Data Frames
    • -Functions
    • -Importing/exporting data
    • -Shape&Transform data
    • -R management
  • Visualizations
  • Intro to R Bootcamp
    • -01-introduction
    • -02-data preparation
    • -03-data transformation
    • -04-visualization
  • R programming track
    • -a-Introduction to R
      • --1-Intro to basics
      • --2-Vectors
      • --3-Matrices
      • --4-Factors
      • --5-Data frames
      • --6-Lists
    • -b-Intermediate R
      • --1-Conditionals and Control Flow
      • --2-Loops
      • --3-Functions
      • --4-The apply family
      • --5-Utilities
    • -d-Writing Functions in R
      • --1-A quick refresher
      • --2-When and how you should write a function
      • --3-Functional programming
      • --4-Advanced inputs and outputs
      • --5-Robust functions
  • Data Wrangling with R
  • R-tutor
    • #R introduction
    • #Elementary Statistics with R
  • Hands-On Programming with R
  • R for Data Science
  • Advanced R
  • ggplot2
  • R packages
  • Statistik-1
  • Statistik-2
  • Statistik-3
  • Zeitreihen & Prognosen
  • Descriptive Analytics
  • Predictive Analytics
  • Prescriptive Analytics
  • R Graphics Cookbook
    • ggplot2 intro
    • ggplot2 custome
    • ggplot top-50
  • #Exploratory Data Analysis
    • -Data Summary
    • -Checklist Solution
  • #Data Mining
    • Untitled
    • Untitled
  • #Machine Learning
    • Intro to ML
    • Intro alghorithms
    • 1. Supervised Learning
  • Master R for Data Science
    • Learning R
    • Untitled
    • Untitled
  • Data Science Projects
    • Simple linear regression:
Powered by GitBook
On this page
  • #Test for missing values
  • #Recode missing values
  • #Exclude missing values
  • #Exercises
  1. Data types

-Dealing with NA's

#Test for missing values

# vector with missing data
x <- c(1:4, NA, 6:7, NA)
x
## [1]  1  2  3  4 NA  6  7 NA

is.na(x)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

# identify NAs in full data frame
is.na(df)
##       col1  col2  col3  col4
## [1,] FALSE FALSE FALSE FALSE
## [2,] FALSE  TRUE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,]  TRUE FALSE FALSE  TRUE

# identify NAs in specific data frame column
is.na(df$col4)
## [1] FALSE FALSE FALSE  TRUE

To identify the location or the number of NAs we can leverage the which() and sum() functions:

# identify location of NAs in vector
which(is.na(x))
## [1] 5 8

# identify count of NAs in data frame
sum(is.na(df))
## [1] 3

For data frames, a convenient shortcut to compute the total missing values in each column is to use colSums():

colSums(is.na(df))
## col1 col2 col3 col4
##    1    1    0    1

#Recode missing values

we can recode missing values in vector x with the mean values in x by first subsetting the vector to identify NAs and then assign these elements a value.

# recode missing values with the mean
# vector with missing data
x <- c(1:4, NA, 6:7, NA)
x
## [1]  1  2  3  4 NA  6  7 NA

x[is.na(x)] <- mean(x, na.rm = TRUE)

round(x, 2)
## [1] 1.00 2.00 3.00 4.00 3.83 6.00 7.00 3.83

# data frame that codes missing values as 99
df <- data.frame(col1 = c(1:3, 99), col2 = c(2.5, 4.2, 99, 3.2))

# change 99s to NAs
df[df == 99] <- NA
df
##   col1 col2
## 1    1  2.5
## 2    2  4.2
## 3    3   NA
## 4   NA  3.2

Recode value in data frame:

# data frame with missing data
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)
                 
df$col4[is.na(df$col4)] <- mean(df$col4, na.rm = TRUE)
df
##   col1 col2  col3 col4
## 1    1 this  TRUE  2.5
## 2    2 <NA> FALSE  4.2
## 3    3   is  TRUE  3.2
## 4   NA text  TRUE  3.3

#Exclude missing values

There are many ways excluding value.

*1.

# A vector with missing values
x <- c(1:4, NA, 6:7, NA)

# including NA values will produce an NA output
mean(x)
## [1] NA

# excluding NA values will calculate the mathematical operation 
# for all non-missing values
mean(x, na.rm = TRUE)
## [1] 3.833333

*2.

# data frame with missing values
df <- data.frame(col1 = c(1:3, NA),
                 col2 = c("this", NA,"is", "text"), 
                 col3 = c(TRUE, FALSE, TRUE, TRUE), 
                 col4 = c(2.5, 4.2, 3.2, NA),
                 stringsAsFactors = FALSE)

df
##   col1 col2  col3 col4
## 1    1 this  TRUE  2.5
## 2    2 <NA> FALSE  4.2
## 3    3   is  TRUE  3.2
## 4   NA text  TRUE   NA

*3.

complete.cases(df)
## [1]  TRUE FALSE  TRUE FALSE

# subset with complete.cases to get complete cases
df[complete.cases(df), ]
##   col1 col2 col3 col4
## 1    1 this TRUE  2.5
## 3    3   is TRUE  3.2

# or subset with `!` operator to get incomplete cases
df[!complete.cases(df), ]
##   col1 col2  col3 col4
## 2    2 <NA> FALSE  4.2
## 4   NA text  TRUE   NA

*4.

# or use na.omit() to get same as above
na.omit(df)
##   col1 col2 col3 col4
## 1    1 this TRUE  2.5
## 3    3   is TRUE  3.2

#Exercises

  1. How many missing values are in the built-in data set airquality?

  2. Which variables are the missing values concentrated in?

  3. How would you impute the mean or median for these values?

  4. How would you omit all rows containing missing values?

Previous-Dealing with DatesNext-Dealing with Logicals

Last updated 6 years ago