Descriptive Analytics

Descriptive methodologies focus on analyzing historical data to identify patterns or trends.

#Classical Analyses

-Numerical Data

Their goal, in essence, is to describe the main features of numerical and categorical information with simple summaries.

We use the Baseball Salaries 2011.xlsx dataset, which contains data on 843 MLB players in the 2011 season and has 4 variables. For more information on installing, loading, and getting help with packages, see this tutorial.
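
As a quick sketch of loading the data with the readxl package (the file path and sheet are assumptions; adjust them to wherever the file is saved):

library(readxl)  # read in Excel data

# file path and sheet are placeholders - point them to your local copy
salaries <- read_excel("Baseball Salaries 2011.xlsx", sheet = 1)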

  • Central Tendency (the "central value"): What are the most typical values?

There are three common measures of central tendency: the mean (the average of all observations), the median (the middle observation), and the mode (the value that appears most often).

mean(salaries$Salary, na.rm = TRUE)
## [1] 3305055

median(salaries$Salary, na.rm = TRUE)
## [1] 1175000

# There is no built-in function to compute the mode of a variable.
# This helper function returns the modal value
get_mode <- function(v) {
        unique_value <- unique(v)
        unique_value[which.max(tabulate(match(v, unique_value)))]
}

get_mode(salaries$Salary)
## [1] 414000

_____

  • Variability: How do the values vary?

Measures of variability provide an understanding of how the values are spread out.

#Range: The minimum and maximum values, and the spread between them

# get the minimum value
min(salaries$Salary, na.rm = FALSE)
## [1] 414000

# get the maximum value
max(salaries$Salary, na.rm = FALSE)
## [1] 3.2e+07

# get both the min and max values
range(salaries$Salary, na.rm = FALSE)
## [1]   414000 32000000

# compute the spread between min & max values
max(salaries$Salary, na.rm = FALSE) - min(salaries$Salary, na.rm = FALSE)
## [1] 31586000

#Percentiles

Percentiles indicate the value below which a given percentage of the observations fall. The first, second, and third quartiles are the percentiles corresponding to p=25%, p=50%, and p=75%. These three values divide the data into four groups, each with (approximately) a quarter of all observations.

# fivenum() function provides min, 25%, 50% (median), 75%, and max
fivenum(salaries$Salary)
## [1]   414000   430325  1175000  4306250 32000000

# default quantile() percentiles are 0%, 25%, 50%, 75%, and 100% 
# provides same output as fivenum()
quantile(salaries$Salary, na.rm = TRUE)
##       0%      25%      50%      75%     100% 
##   414000   430325  1175000  4306250 32000000

# we can customize quantile() for specific percentiles
quantile(salaries$Salary, probs = seq(from = 0, to = 1, by = .1), 
na.rm = TRUE)

# we can quickly compute the difference between the 1st and 3rd quartiles
IQR(salaries$Salary)
## [1] 3875925

An alternative approach is to use the summary() function, which is a generic R function used to produce min, 1st quartile, median, mean, 3rd quartile, and max summary measures. However, note that the 1st and 3rd quartiles produced by summary() differ slightly from those produced by fivenum() and the default quantile(). The reason is that there is no universal agreement on how the 1st and 3rd quartiles should be calculated. Eric Cai provided a good blog post that discusses this difference in the R functions.

summary(salaries$Salary)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   414000   430300  1175000  3305000  4306000 32000000

#Variance

The most common measures to summarize variability are variance and its derivatives (standard deviation and mean/median absolute deviation).

# variance
var(salaries$Salary)
## [1] 2.056389e+13

# standard deviation
sd(salaries$Salary)
## [1] 4534742

# mean absolute deviation
mad(salaries$Salary, center = mean(salaries$Salary))
## [1] 4229672

# median absolute deviation - note that the center argument defaults to
# the median, so it does not need to be specified; I do so just to be clear
mad(salaries$Salary, center = median(salaries$Salary))
## [1] 1126776

_____

  • Shape: Are the values symmetrically or asymmetrically distributed?

Two additional measures of a distribution that you will hear occasionally include skewness and kurtosis.

source: https://bit.ly/2xkqOlj

=Skewness: Measure of symmetry

Negative values -> left-skewed -> mean < median

Positive values -> right-skewed -> mean > median

Near-zero value -> Normal distribution

=Kurtosis: Measure of peakedness

Negative values -> platykurtic -> flatter than a normal distribution

Positive values -> leptokurtic -> more peaked than a normal distribution

Near-zero value -> Normal distribution
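
As a sketch, both statistics can be computed with the moments package (one option among several; the describe() function from the psych package, used later in this section, also reports them):

install.packages("moments")
library(moments)

skewness(salaries$Salary)   # a positive value indicates right skew
kurtosis(salaries$Salary)   # note: moments reports ordinary (not excess) kurtosis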

_____

  • Outliers: Values that represent abnormalities in the data.

Outliers in data can distort predictions and affect their accuracy. The outliers package provides a number of useful functions to systematically extract outliers. The functions of most use are outlier() and scores(). The outlier() function gets the most extreme observation from the mean. The scores() function computes normalized (z, t, chisq, etc.) scores, which you can use to find observations that lie beyond a given value.

install.packages("outliers")
library(outliers)
# gets most extreme right-tail observation
outlier(salaries$Salary)
## [1] 3.2e+07

# gets most extreme left-tail observation
outlier(salaries$Salary, opposite = TRUE)
## [1] 414000

# observations that are outliers based on z-scores
z_scores <- scores(salaries$Salary, type = "z")
which(abs(z_scores) > 1.96)
##  [1]   1  11  22  24  33  34  38  54  64 132 134 138 139 146 151 161 164
## [18] 168 211 236 242 248 249 348 355 370 384 441 452 460 484 496 517 520
## [35] 535 538 576 577 585 601 627 629 649 692 695 728 729 733 737 790 794
## [52] 801 803 816 843

# outliers based on values less than or greater than the "whiskers" on a 
# boxplot (1.5 x IQR or more below 1st quartile or above 3rd quartile)
which(scores(salaries$Salary, type = "iqr", lim = 1.5))
##  [1]   1  11  12  22  24  33  34  38  54  64  83  97 132 134 138 139 146
## [18] 151 161 164 168 189 208 211 236 242 248 249 293 298 300 329 336 348
## [35] 355 370 384 408 440 441 442 452 460 475 484 493 496 517 520 535 538
## [52] 547 576 577 585 601 620 627 629 649 664 679 692 695 707 728 729 733
## [69] 737 760 790 794 801 803 816 818 843

How you deal with outliers is a topic worthy of its own tutorial; however, if you want to simply remove an outlier or replace it with the sample mean or median, then I recommend the rm.outlier() function, also provided by the outliers package.

More about dealing with outliers -> here

_____

  • Visualization: Histograms and boxplots are graphical representations to illustrate these summary measures for numerical variables.

=Using Histograms

hist(salaries$Salary)

Use ggplot to customize the graphic and create a more presentable product:

install.packages("ggplot2")
library(ggplot2)
ggplot(salaries, aes(Salary)) + 
geom_histogram(colour = "black", fill = "white") +
scale_x_log10(labels = scales::dollar) + # x axis log & $$ labels
geom_vline(aes(xintercept = mean(Salary)), 
color = "red", linetype = "dashed") +  # add line for mean
annotate("text", x = mean(salaries$Salary) * 2, y = 155,
label = paste0("Avg: $", round(mean(salaries$Salary)/1000000, 1),"M"))

An alternative, and highly effective way to visualize data is with dotplots.

ggplot(salaries, aes(x = Salary)) + 
  geom_dotplot() +
  scale_x_continuous(labels = scales::dollar)

=Using Boxplots

Boxplots are an alternative way to illustrate the distribution of a variable and are a concise way to illustrate the standard quantiles, shape, and outliers of the data.

# I use a log scale to spread out the data
boxplot(salaries$Salary, horizontal = TRUE, log = "x")

Use ggplot2 to refine the boxplot and add additional features, such as a point illustrating the mean, and also show the actual distribution of observations:
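
A sketch of what this could look like (the specific geoms and formatting choices below are illustrative assumptions, not from the original):

ggplot(salaries, aes(x = "", y = Salary)) +
        geom_boxplot(width = 0.3, outlier.shape = NA) +
        geom_jitter(width = 0.1, alpha = 0.2) +           # show the individual observations
        stat_summary(fun = mean, geom = "point",
                     colour = "red", size = 3) +          # red point marks the mean
        scale_y_log10(labels = scales::dollar) +          # log scale to spread out the data
        xlab(NULL)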

However, boxplots are most useful when comparing distributions. For instance, if you wanted to compare the distributions of salaries across the different positions, boxplots provide a quick comparative assessment:
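
For example (assuming the position variable in the salaries data is named Position):

# salary distribution by position (the column name `Position` is an assumption)
ggplot(salaries, aes(x = Position, y = Salary)) +
        geom_boxplot() +
        scale_y_log10(labels = scales::dollar) +
        coord_flip()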

-Categorical Data

Descriptive statistics are the first pieces of information used to understand and represent a dataset. Their goal, in essence, is to describe the main features of numerical and categorical information with simple summaries.

We use the Supermarket Transactions.xlsx dataset, an artificial set of supermarket transaction data with 16 variables. For more information on installing, loading, and getting help with packages, see this tutorial.

_____

  • Frequencies: The number of observations for a particular category

To produce contingency tables which calculate counts for each combination of categorical variables we can use R’s table() function.

# counts for gender categories
table(supermarket$Gender)
## 
##    F    M 
## 7170 6889

Produce a cross classification table for the number of married and single, female and male customers.

# cross classification counts for gender by marital status
table(supermarket$`Marital Status`, supermarket$Gender)
##    
##        F    M
##   M 3602 3264
##   S 3568 3625

Produce multidimensional tables based on three or more categorical variables. For this, we leverage the ftable() function.

# customer counts across location by gender and marital status
table1 <- table(supermarket$`Marital Status`, supermarket$Gender, 
                supermarket$`State or Province`)
ftable(table1)
##        BC   CA   DF Guerrero Jalisco   OR Veracruz   WA Yucatan Zacatecas
##                                                                          
## M F   190  638  188       77      15  510      142 1166     200       476
##   M   197  692  210       94       5  514      108 1160     129       155
## S F   183  686  175      107      30  607      125 1134     164       357
##   M   239  717  242      105      25  631       89 1107     161       309

_____

  • Proportions: Percentage of each category or combination of categories.

Use the table() and prop.table() functions.

# percentages of gender categories
table2 <- table(supermarket$Gender)
prop.table(table2)
## 
##         F         M 
## 0.5099936 0.4900064

# percentages for gender by marital status
table3 <- table(supermarket$`Marital Status`, supermarket$Gender)
prop.table(table3)
##    
##             F         M
##   M 0.2562060 0.2321644
##   S 0.2537876 0.2578420

# customer percentages across location by gender and marital status
# using table1 from previous code chunk
ftable(round(prop.table(table1), 3))
##         BC    CA    DF Guerrero Jalisco    OR Veracruz    WA Yucatan Zacatecas
##                                                                               
## M F  0.014 0.045 0.013    0.005   0.001 0.036    0.010 0.083   0.014     0.034
##   M  0.014 0.049 0.015    0.007   0.000 0.037    0.008 0.083   0.009     0.011
## S F  0.013 0.049 0.012    0.008   0.002 0.043    0.009 0.081   0.012     0.025
##   M  0.017 0.051 0.017    0.007   0.002 0.045    0.006 0.079   0.011     0.022

_____

  • Marginals: Show the total counts or percentages across columns or rows in a contingency table.

We can compute the marginal frequencies with margin.table() and the percentages for these marginal frequencies with prop.table() using the margin argument

table3
##    
##        F    M
##   M 3602 3264
##   S 3568 3625

# FREQUENCY MARGINALS
# row marginals - totals for each marital status across gender
margin.table(table3, 1)
## 
##    M    S 
## 6866 7193

# column marginals - totals for each gender across marital status
margin.table(table3, 2)
## 
##    F    M 
## 7170 6889

# PERCENTAGE MARGINALS
# row marginals - row percentages across gender
prop.table(table3, margin = 1)
##    
##             F         M
##   M 0.5246140 0.4753860
##   S 0.4960378 0.5039622

# column marginals - column percentages across marital status
prop.table(table3, margin = 2)
##    
##             F         M
##   M 0.5023710 0.4737988
##   S 0.4976290 0.5262012

_____

  • Visualization: Bar charts are most often used to visualize categorical variables. Using ggplot2 makes the graph more presentable.

ggplot(supermarket, aes(x = `State or Province`)) +
        geom_bar() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
# re-order levels
reorder_size <- function(x) {
        factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}

ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
        geom_bar() +
        xlab("State or Province") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Proportions bar charts
ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
        geom_bar(aes(y = (..count..)/sum(..count..))) +
        xlab("State or Province") +
        scale_y_continuous(labels = scales::percent, name = "Proportion") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

We can also create contingency table-like bar charts by using the facet_grid() function to produce small multiples. Plot customer proportions across location and by Gender.

ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
        geom_bar(aes(y = (..count..)/sum(..count..))) +
        xlab("State or Province") +
        scale_y_continuous(labels = scales::percent, name = "Proportion") +
        facet_grid(~ Gender) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

Do the same plot by Gender and by Marital status.

ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
        geom_bar(aes(y = (..count..)/sum(..count..))) +
        xlab("State or Province") +
        scale_y_continuous(labels = scales::percent, name = "Proportion") +
        facet_grid(`Marital Status` ~ Gender) +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

-Assumption of Normality

When assumptions are broken we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. One of the first parametric assumptions most people think of is the assumption of normality. The assumption of normality is important for hypothesis testing and in regression models. In general linear models, the assumption comes into play with regard to residuals (aka errors).

  • What is normality: The sampling distribution of the mean is normal

  • Dataset and packages: What we need

  • Visualizing Normality: The first step in evaluating normality.

  • Descriptive Statistics for Normality:

  • Shapiro-Wilk Test for Normality:

_____

  • What is normality

Central Limit Theorem: given random and independent samples of N observations each, the distribution of sample means approaches normality as the size of N increases, regardless of the shape of the population distribution.

It follows from the CLT that in large samples the sampling distribution of the mean tends to be normal regardless of the shape of the original data.
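
A small simulation sketch (not from the original text) illustrates this: even when the underlying data are strongly right-skewed, the distribution of sample means is approximately normal.

# draw 1,000 samples of size n = 50 from a right-skewed (exponential) distribution
set.seed(123)
sample_means <- replicate(1000, mean(rexp(n = 50, rate = 1)))

# the histogram of sample means is roughly bell-shaped around the true mean of 1
hist(sample_means, breaks = 30, main = "Sampling distribution of the mean (n = 50)")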

_____

  • Dataset and packages

We use golf data provided by ESPN. In addition, we need the following packages:

library(readxl)         # read in Excel data
library(ggplot2)        # create visualizations
library(psych)          # provide summary statistics
library(pastecs)        # provide summary statistics

For more information on installing, loading, and getting help with packages, see this tutorial. The golf dataset is a data frame which consists of 18 variables.

_____

  • Visualizing Normality

Histograms are the first way to look at the shape of a distribution. By overlaying a normal density curve with stat_function(), we can see how closely the data follow a normal distribution.

library(readxl)
library(ggplot2)
golf <- read_excel("File Path", sheet = "2011")

ggplot(golf, aes(`Driving Accuracy`)) +
        geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
        stat_function(fun = dnorm, 
                      args = list(mean = mean(golf$`Driving Accuracy`, na.rm = TRUE), 
                                  sd = sd(golf$`Driving Accuracy`, na.rm = TRUE))) +
        xlab("Driving Accuracy (%)")

Compare to the Earnings variable.

ggplot(golf, aes(Earnings)) +
        geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
        stat_function(fun = dnorm, 
                      args = list(mean = mean(golf$Earnings, na.rm = TRUE), 
                                  sd = sd(golf$Earnings, na.rm = TRUE))) +
        scale_x_continuous(labels = scales::dollar)

Another useful graph that we can inspect to see if a distribution is normal is the Q-Q plot. If the data are normally distributed the plot will display a straight (or nearly straight) line.

The Q-Q plot for Driving Accuracy:

qqnorm(golf$`Driving Accuracy`, 
main = "Normal Q-Q Plot for Driving Accuracy")

Compare this to the Q-Q plot for Earnings:

qqnorm(golf$Earnings, main = "Normal Q-Q Plot for Earnings")
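
To make the straight-line comparison explicit, a reference line can be overlaid with qqline() (a small addition, not shown in the original output):

qqnorm(golf$Earnings, main = "Normal Q-Q Plot for Earnings")
qqline(golf$Earnings, col = "red")   # line a normally distributed sample would follow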

_____

  • Descriptive Statistics for Normality

It's important to support the visual findings with objective quantifications that describe the shape of the distribution and to look for outliers.

We can do this with the describe() function from the psych package or the stat.desc() function from the pastecs package.

-With the describe() function

install.packages("psych")
library(psych)
# assess one variable
describe(golf$`Driving Accuracy`)
##   vars   n  mean   sd median trimmed  mad min max range skew kurtosis   se
## 1    1 197 61.79 5.46   61.7   61.71 5.78  47  77    30 0.09    -0.04 0.39

# assess multiple variables
describe(golf[, c("Driving Accuracy", "Earnings")])
##                  vars   n       mean         sd   median    trimmed
## Driving Accuracy    1 197      61.79       5.46     61.7      61.71
## Earnings            2 200 1336649.41 1155157.90 941255.0 1120407.53
##                        mad    min     max   range skew kurtosis       se
## Driving Accuracy      5.78     47      77      30 0.09    -0.04     0.39
## Earnings         761431.93 250333 6683214 6432882 1.85     3.88 81682.00

-With stat.desc()

To reduce the amount of statistics we get back, we can set the argument basic = FALSE; to get statistics relating to the distribution, we set the argument norm = TRUE.

install.packages("pastecs")
library(pastecs)
# assess one variable
stat.desc(golf$`Driving Accuracy`, basic = FALSE, norm = TRUE)
##       median         mean      SE.mean CI.mean.0.95          var 
##  61.70000000  61.78578680   0.38927466   0.76770460  29.85234797 
##      std.dev     coef.var     skewness     skew.2SE     kurtosis 
##   5.46373023   0.08843021   0.09349445   0.26988823  -0.03637605 
##     kurt.2SE   normtest.W   normtest.p 
##  -0.05275937   0.99626396   0.91530114

# assess multiple variables
stat.desc(golf[, c("Driving Accuracy", "Earnings")], basic = FALSE, norm = TRUE)
##              Driving Accuracy     Earnings
## median            61.70000000 9.412550e+05
## mean              61.78578680 1.336649e+06
## SE.mean            0.38927466 8.168200e+04
## CI.mean.0.95       0.76770460 1.610734e+05
## var               29.85234797 1.334390e+12
## std.dev            5.46373023 1.155158e+06
## coef.var           0.08843021 8.642191e-01
## skewness           0.09349445 1.846104e+00
## skew.2SE           0.26988823 5.368929e+00
## kurtosis          -0.03637605 3.876503e+00
## kurt.2SE          -0.05275937 5.664051e+00
## normtest.W         0.99626396 8.004944e-01
## normtest.p         0.91530114 2.918239e-15

_____

  • Shapiro-Wilk Test for Normality

-Non-significant (p > .05) -> the sample is not significantly different from a normal distribution.

-Significant (p < .05) -> the distribution in question is significantly different from a normal distribution.

shapiro.test(golf$`Driving Accuracy`)

## 	Shapiro-Wilk normality test
## data:  golf$`Driving Accuracy`
## W = 0.99626, p-value = 0.9153
shapiro.test(golf$Earnings)

## 	Shapiro-Wilk normality test
## data:  golf$Earnings
## W = 0.80049, p-value = 2.918e-15

It's important to note that there are limitations to the Shapiro-Wilk test. As the dataset being evaluated gets larger, the Shapiro-Wilk test becomes more sensitive to small deviations, which leads to a greater probability of rejecting the null hypothesis (the null hypothesis being that the values come from a normal distribution).

The principal message is that we should not rely on only one approach to assess the normality of our data. Rather, we should examine the distribution visually, through descriptive statistics, and with formal testing procedures to come to a conclusion about whether our data meet the normality assumption or not.

-Assumptions of Homogeneity

-Correlations

Correlation is a bivariate analysis that measures the association between two variables. The value of the correlation coefficient varies between +1 and -1: a value near +1 implies a strong positive association, a value near -1 implies a strong negative association, and a value near 0 implies no association between the two variables.

  • Dataset and packages: What we need

  • Visualizing relationships: A picture is worth a thousand words

  • Pearson's correlation: The most widely used approach.

  • Spearman’s Rank correlation: Great when variables are measured on an ordinal scale. A non-parametric approach

  • Kendall’s tau: Another non-parametric approach. Less sensitive to outliers and more accurate with smaller sample sizes.

  • Partial correlation: Measuring the relationship between two variables while controlling for the effect of one or more covariates.

_____

  • Dataset and packages

Install (if necessary) and load the following packages:

library(readxl) # reading in data
library(ggplot2)# visualizing data
library(gridExtra) # combining multiple plots
library(corrgram)  # visualizing data       
library(corrplot)  # visualizing data       
library(Hmisc)  # produces correlation matrices with p-values
library(ppcor)  # assesses partial correlations

Download the datasets here and here: the Golf dataset and the Survey dataset. The golf data has 18 variables and the survey data has 11 variables.

_____

  • Visualizing bivariate relationships

A scatterplot provides a vivid illustration of the relationship between two variables.

qplot(x = Events, y = Rounds, data = golf) +
        geom_smooth(method = "lm", se = FALSE) +
        ggtitle("Fig. A: Strong Positive Association")

Contrast this to the following two plots, which show that as driving accuracy increases, the distance of the player's drive tends to decrease (Fig. B). In addition, we can easily see that as a player's age increases, their greens in regulation percentage does not appear to change (Fig. C).

library(gridExtra)

p1 <- qplot(x = `Driving Accuracy`, y = `Yards/Drive`, data = golf) +
        geom_smooth(method = "lm", se = FALSE) +
        ggtitle("Fig. B: Moderate Negative Association")

p2 <- qplot(x = Age, y = `Greens in Regulation`, data = golf) +
        geom_smooth(method = "lm", se = FALSE) +
        ggtitle("Fig. C: Weak/No Association")

grid.arrange(p1, p2, ncol = 2)

In addition, scatter plots illustrate the linearity of the relationship, which can influence how you approach assessing correlations (i.e. data transformation, using a parametric vs non-parametric test, removing outliers).

Francis Anscombe illustrated this in 1973 when he constructed four data sets that have the same mean, variance, and correlation; however, there are significant differences in the variable relationships. Using the anscombe data, which R has as a built-in data set, the plots below demonstrate the importance of graphing data rather than just relying on correlation coefficients. Each x-y combination in the plot below has a correlation of .82 (strong positive), but there are definite differences in the association between these variables.

library(gridExtra)
library(grid)

p1 <- qplot(x = x1, y = y1, data = anscombe)
p2 <- qplot(x = x2, y = y2, data = anscombe)
p3 <- qplot(x = x3, y = y3, data = anscombe)
p4 <- qplot(x = x4, y = y4, data = anscombe)

grid.arrange(p1, p2, p3, p4, ncol = 2, 
top = textGrob("Anscombe's Quartet"))

Visualization can also give you a quick approach to assessing multiple relationships. We can produce scatter plot matrices multiple ways to visualize and compare relationships across the entire data set we are analyzing. With base R plotting we can use the pairs() function.

Let's look at the first 10 variables of the golf data set (minus the player name variable).

pairs(golf[, c(1, 3:10)])

We can also use the corrgram and corrplot packages. Note that multiple options exist with both of these visualizations (i.e. formatting, correlation method applied, illustrating significance and confidence intervals, etc.).

library(corrgram)

par(bg = "#fdfdfd")
corrgram(golf[, c(1, 3:10)], lower.panel = panel.shade, upper.panel = panel.pts)

With corrplot

library(corrplot)
cor_matrix <- cor(golf[, c(1, 3:10)], use = 'complete.obs')
corrplot.mixed(cor_matrix, lower = "circle", upper = "number", tl.pos = "lt", diag = "u")

Once you’ve visualized the data and understand the associations that appear to be present and their attributes (strength, outliers, linearity) you can begin assessing the statistical relationship by applying the appropriate correlation method.

_____

  • Pearson’s Correlation: best for interval scale

The Pearson product-moment correlation coefficient measures the strength of the linear relationship between two variables and is represented by r when referring to a sample or ρ when referring to the population. Unfortunately, the assumptions for Pearson’s correlation are often overlooked.

  • Level of measurement: The variables should be continuous. If one or both of the variables are ordinal in measurement, then a Spearman rank correlation should be conducted.

  • Linear relationship: The variables need to be linearly related. If they are not, the data could be transformed (i.e. logarithmic transformation) or a non-parametric approach such as the Spearman’s or Kendall’s rank correlation tests could be used.

  • Homoscedasticity: If the variance between the two variables is not constant then r will not provide a good measure of association.

  • Bivariate Normality: Technically, Pearson's r does not require normality when the sample size is fairly large; however, when the variables consist of high levels of skewness or contain significant outliers it is recommended to use Spearman's rank correlation or, at a minimum, compare Pearson's and Spearman's coefficients.


To calculate the correlation between two variables we use cor(). When using cor() there are two arguments (other than the variables) that need to be considered. The first is use =, which allows us to decide how to handle missing data. The default is use = "everything", but if there is missing data in your data set this will cause the output to be NA unless we explicitly state to only use complete observations with use = "complete.obs". The second argument is method =, which allows us to specify whether we want to use "pearson", "kendall", or "spearman". Pearson is the default method, so we do not need to specify it when using that option.

# If we don't filter out NAs we get NA in return
cor(golf$Age, golf$`Yards/Drive`)
## [1] NA

# Filter NAs to get correlation for complete observations
cor(golf$Age, golf$`Yards/Drive`, use = 'complete.obs')
## [1] -0.3960891

# We can also get the correlation matrix for multiple variables but we need
# to exclude any non-numeric values
cor(golf[, c(1, 3:10)], use = 'complete.obs')
##                   Rank        Age      Events      Rounds  Cuts Made
## Rank         1.0000000  0.2178987 -0.25336860 -0.39896409 -0.6340634
## Age          0.2178987  1.0000000 -0.11910802 -0.14056653 -0.1950816
## Events      -0.2533686 -0.1191080  1.00000000  0.96635795  0.7614576
## Rounds      -0.3989641 -0.1405665  0.96635795  1.00000000  0.8913155
## Cuts Made   -0.6340634 -0.1950816  0.76145761  0.89131554  1.0000000
## Top 10s     -0.8033446 -0.2052073  0.17710952  0.30790212  0.5255005
## Wins        -0.5765412 -0.1753386  0.04591368  0.12215216  0.2460359
## Earnings    -0.8585013 -0.2175656  0.15208235  0.29048622  0.5289126
## Yards/Drive -0.3008188 -0.3960891 -0.02052854  0.03470326  0.1465976
##                Top 10s        Wins   Earnings Yards/Drive
## Rank        -0.8033446 -0.57654121 -0.8585013 -0.30081875
## Age         -0.2052073 -0.17533858 -0.2175656 -0.39608914
## Events       0.1771095  0.04591368  0.1520823 -0.02052854
## Rounds       0.3079021  0.12215216  0.2904862  0.03470326
## Cuts Made    0.5255005  0.24603589  0.5289126  0.14659757
## Top 10s      1.0000000  0.49398282  0.8957970  0.19397586
## Wins         0.4939828  1.00000000  0.7313615  0.21563889
## Earnings     0.8957970  0.73136149  1.0000000  0.25041021
## Yards/Drive  0.1939759  0.21563889  0.2504102  1.00000000

Unfortunately cor() only provides the r coefficient(s) and does not test for significance nor provide confidence intervals. To get these parameters for a simple two variable analysis I use cor.test(). In our example we see that the p-value is significant and the 95% confidence interval confirms this as the range does not contain zero. This suggests the correlation between age and yards per drive is r = -0.396 with 95% confidence of being between -0.27 and -0.51.

cor.test(golf$Age, golf$`Yards/Drive`, use = 'complete.obs')
## 
## 	Pearson's product-moment correlation
## 
## data:  golf$Age and golf$`Yards/Drive`
## t = -5.8355, df = 183, p-value = 2.394e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5111490 -0.2670825
## sample estimates:
##cor 
## -0.3960891

We can also get the correlation matrix and the p-values across all variables by using the rcorr() function in the Hmisc package.

library(Hmisc)
rcorr(as.matrix(golf[, c(1, 3:9)]))
##            Rank   Age Events Rounds Cuts Made Top 10s  Wins Earnings
## Rank       1.00  0.21  -0.23  -0.38     -0.63   -0.80 -0.58    -0.86
## Age        0.21  1.00  -0.09  -0.12     -0.17   -0.20 -0.17    -0.21
## Events    -0.23 -0.09   1.00   0.97      0.75    0.16  0.04     0.14
## Rounds    -0.38 -0.12   0.97   1.00      0.88    0.29  0.12     0.28
## Cuts Made -0.63 -0.17   0.75   0.88      1.00    0.52  0.25     0.53
## Top 10s   -0.80 -0.20   0.16   0.29      0.52    1.00  0.50     0.89
## Wins      -0.58 -0.17   0.04   0.12      0.25    0.50  1.00     0.73
## Earnings  -0.86 -0.21   0.14   0.28      0.53    0.89  0.73     1.00
## 
## n
##           Rank Age Events Rounds Cuts Made Top 10s Wins Earnings
## Rank       200 188    200    200       200     200  200      200
## Age        188 188    188    188       188     188  188      188
## Events     200 188    200    200       200     200  200      200
## Rounds     200 188    200    200       200     200  200      200
## Cuts Made  200 188    200    200       200     200  200      200
## Top 10s    200 188    200    200       200     200  200      200
## Wins       200 188    200    200       200     200  200      200
## Earnings   200 188    200    200       200     200  200      200
## 
## P
##           Rank   Age    Events Rounds Cuts Made Top 10s Wins   Earnings
## Rank             0.0045 0.0011 0.0000 0.0000    0.0000  0.0000 0.0000  
## Age       0.0045        0.1978 0.1084 0.0164    0.0056  0.0197 0.0041  
## Events    0.0011 0.1978        0.0000 0.0000    0.0238  0.5953 0.0499  
## Rounds    0.0000 0.1084 0.0000        0.0000    0.0000  0.0953 0.0000  
## Cuts Made 0.0000 0.0164 0.0000 0.0000           0.0000  0.0003 0.0000  
## Top 10s   0.0000 0.0056 0.0238 0.0000 0.0000            0.0000 0.0000  
## Wins      0.0000 0.0197 0.5953 0.0953 0.0003    0.0000         0.0000  
## Earnings  0.0000 0.0041 0.0499 0.0000 0.0000    0.0000  0.0000

_____

  • Spearman’s Rank Correlation: best for ordinal scale

Common questions that a Spearman correlation answers include: Is there a statistically significant relationship between the age of a golf player and the number of wins (0, 1, 2)?

The assumptions for Spearman’s correlation include:

  • Level of measurement: The usual assumption is that one or both of the variables are ordinal in measurement; however, Spearman's is also appropriate to use if both variables are continuous but are heavily skewed or contain sizable outliers.

  • Monotonically related: A linear relationship is not necessary; the only requirement is that one variable is monotonically related to the other variable.

To assess correlations with Spearman’s rank we can use the same functions introduced for the Pearson correlations and simply change the correlation method.
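
For example, a minimal sketch (the choice of variables is illustrative, not from the original text):

# Spearman's rank correlation between age and number of wins
cor(golf$Age, golf$Wins, method = "spearman", use = "complete.obs")

# significance test; Kendall's tau works the same way with method = "kendall"
cor.test(golf$Age, golf$Wins, method = "spearman")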

-Univariate Statistical Inference

After receiving data and completing some initial data exploration, we typically move on to performing some univariate estimation and prediction tasks. Statistical inference helps you estimate parameters of a larger population when the observed data you are working with is a subset of that population.

  • Dataset & packages: What we need

  • Confidence intervals of the mean: Using our sample to create an expectation of the population mean.

  • Reducing margin of error: How can we create greater confidence in our estimates?

  • Confidence intervals of the proportion: Confidence intervals for categorical variables.

  • Hypothesis testing for the mean: Assessing evidence for claims of the mean estimate.

  • Hypothesis testing for the proportion: Assessing evidence for claims of proportions.
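
As a minimal, generic sketch of the estimation and testing tasks listed above (using simulated data, since this section's dataset is not included here):

# simulated numeric sample standing in for observed data
set.seed(42)
x <- rnorm(100, mean = 50, sd = 10)

# confidence interval and hypothesis test for the mean (H0: mu = 48)
t.test(x, mu = 48)

# confidence interval and hypothesis test for a proportion
# (e.g., 120 "successes" out of 200 trials, H0: p = 0.5)
prop.test(x = 120, n = 200, p = 0.5)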

_____

  • Dataset & packages

#Text Mining

One of the most complete books on this topic: https://www.tidytextmining.com/

-Tidying Text & Word Frequency

A fundamental requirement to perform text mining is to get your text in a tidy format and perform word frequency analysis.

  • Dataset and packages

  • Text Tidying

  • Word Frequency

_____

  • Dataset and packages

We need the data provided in the harrypotter package, which is installed from GitHub below. The other packages we need are:

if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}

devtools::install_github("bradleyboehmke/harrypotter")
install.packages("tidyverse")
install.packages("stringr")
install.packages("tidytext")

library(tidyverse)   # data manipulation & plotting
library(stringr)     # text cleaning and regular expressions
library(tidytext)    # provides additional text mining functions
library(harrypotter) # provides the first seven novels of the Harry Potter series

The seven novels are

  • philosophers_stone: Harry Potter and the Philosopher's Stone (1997)

  • chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)

  • prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)

  • goblet_of_fire: Harry Potter and the Goblet of Fire (2000)

  • order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)

  • half_blood_prince: Harry Potter and the Half-Blood Prince (2005)

  • deathly_hallows: Harry Potter and the Deathly Hallows (2007)

_____

  • Text Tidying
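
As a minimal sketch of this step (assuming, as in the harrypotter package, that philosophers_stone is a character vector with one chapter per element), we can convert the text to a tidy one-word-per-row format with unnest_tokens() and then count word frequencies:

# tidy the first novel: one row per word, with common stop words removed
ps_tidy <- tibble(chapter = seq_along(philosophers_stone),
                  text    = philosophers_stone) %>%
        unnest_tokens(word, text) %>%
        anti_join(stop_words, by = "word")

# most frequent words in Philosopher's Stone
ps_tidy %>%
        count(word, sort = TRUE)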
