Descriptive Analytics
Descriptive methodologies focus on analyzing historical data to identify patterns or trends.
#Classical Analyses
-Numerical Data
Descriptive statistics are the first pieces of information used to understand and represent a dataset. Their goal, in essence, is to describe the main features of numerical and categorical information with simple summaries.
We will use the Baseball Salaries 2011.xlsx dataset, which contains data on 843 MLB players in the 2011 season and has 4 variables. For more information on installing, loading, and getting help with packages, see this tutorial.
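Before computing any summaries we need the data in R. Here is a minimal import sketch, assuming the workbook sits in your working directory and the data are on the first sheet (the path and sheet are assumptions, so adjust as needed):
library(readxl)                                        # for reading Excel files
salaries <- read_excel("Baseball Salaries 2011.xlsx")  # assumed path; adjust to your file location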
Central Tendency: What are the most typical values?
There are three common measures of central tendency: the mean (the average of all observations), the median (the middle observation), and the mode (the value that appears most often).
mean(salaries$Salary, na.rm = TRUE)
## [1] 3305055
median(salaries$Salary, na.rm = TRUE)
## [1] 1175000
# There is no built-in function to compute the mode of a variable.
# This helper function returns the value that appears most often.
get_mode <- function(v) {
  unique_value <- unique(v)
  unique_value[which.max(tabulate(match(v, unique_value)))]
}
get_mode(salaries$Salary)
## [1] 414000
_____
Variability: How do the values vary?
Measures of variability provide an understanding of how the values are spread out.
#Range: Defining the maximum and minimum values
# get the minimum value
min(salaries$Salary, na.rm = TRUE)
## [1] 414000
# get the maximum value
max(salaries$Salary, na.rm = TRUE)
## [1] 3.2e+07
# get both the min and max values
range(salaries$Salary, na.rm = TRUE)
## [1] 414000 32000000
# compute the spread between min & max values
max(salaries$Salary, na.rm = TRUE) - min(salaries$Salary, na.rm = TRUE)
## [1] 31586000
#Percentiles
Percentiles tell us the value below which a given percentage of the observations fall. The first, second, and third quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%. These three values divide the data into four groups, each with (approximately) a quarter of all observations.
# fivenum() function provides min, 25%, 50% (median), 75%, and max
fivenum(salaries$Salary)
## [1] 414000 430325 1175000 4306250 32000000
# default quantile() percentiles are 0%, 25%, 50%, 75%, and 100%
# provides same output as fivenum()
quantile(salaries$Salary, na.rm = TRUE)
## 0% 25% 50% 75% 100%
## 414000 430325 1175000 4306250 32000000
# we can customize quantile() for specific percentiles
quantile(salaries$Salary, probs = seq(from = 0, to = 1, by = .1),
na.rm = TRUE)
# we can quickly compute the difference between the 1st and 3rd quartiles (the IQR)
IQR(salaries$Salary)
## [1] 3875925
An alternative approach is to use the summary() function, which is a generic R function used to produce the min, 1st quartile, median, mean, 3rd quartile, and max summary measures. However, note that the 1st and 3rd quartiles produced by summary() differ from the 1st and 3rd quartiles produced by fivenum() and the default quantile(). The reason for this is the lack of universal agreement on how the 1st and 3rd quartiles should be calculated. Eric Cai provided a good blog post that discusses this difference in the R functions.
summary(salaries$Salary)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 414000 430300 1175000 3305000 4306000 32000000
#Variance
The most common measures to summarize variability are variance and its derivatives (standard deviation and mean/median absolute deviation).
# variance
var(salaries$Salary)
## [1] 2.056389e+13
# standard deviation
sd(salaries$Salary)
## [1] 4534742
# mean absolute deviation
mad(salaries$Salary, center = mean(salaries$Salary))
## [1] 4229672
# median absolute deviation - note that the center argument defaults to
# the median, so it does not need to be specified; I do so here just to be clear
mad(salaries$Salary, center = median(salaries$Salary))
## [1] 1126776
_____
Shape: Are the values symmetrically or asymmetrically distributed?
Two additional measures of a distribution that you will hear occasionally are skewness and kurtosis; a short code sketch for computing both follows the list below.

source: https://bit.ly/2xkqOlj
=Skewness: Measure of symmetry
Negative values -> left-skewed -> mean < median
Positive values -> right-skewed -> mean > median
Near-zero value -> symmetric, as in a normal distribution
=Kurtosis: Measure of peakedness
Negative values -> platykurtic (flatter than normal; see the picture above)
Positive values -> leptokurtic (more peaked than normal; see the picture above)
Near-zero value -> mesokurtic, as in a normal distribution
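Base R has no built-in skewness or kurtosis functions, but several packages provide them. Here is a minimal sketch using the psych package (used again later in this tutorial): skew() returns the skewness and kurtosi() the excess kurtosis, so both should be near zero for roughly normal data.
library(psych)
# skewness: a large positive value reflects the right skew we expect from salary data
skew(salaries$Salary, na.rm = TRUE)
# excess kurtosis: positive values indicate a leptokurtic (heavy-tailed, peaked) shape
kurtosi(salaries$Salary, na.rm = TRUE)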
_____
Outliers: Values that represent abnormalities in the data.
Outliers in data can distort predictions and affect their accuracy. The outliers package provides a number of useful functions to systematically extract outliers. The functions of most use are outlier() and scores(). The outlier() function gets the most extreme observation from the mean. The scores() function computes the normalized (z, t, chisq, etc.) score, which you can use to find observation(s) that lie beyond a given value.
install.packages("outliers")
library(outliers)
# gets most extreme right-tail observation
outlier(salaries$Salary)
## [1] 3.2e+07
# gets most extreme left-tail observation
outlier(salaries$Salary, opposite = TRUE)
## [1] 414000
# observations that are outliers based on z-scores
z_scores <- scores(salaries$Salary, type = "z")
which(abs(z_scores) > 1.96)
## [1] 1 11 22 24 33 34 38 54 64 132 134 138 139 146 151 161 164
## [18] 168 211 236 242 248 249 348 355 370 384 441 452 460 484 496 517 520
## [35] 535 538 576 577 585 601 627 629 649 692 695 728 729 733 737 790 794
## [52] 801 803 816 843
# outliers based on values less than or greater than the "whiskers" on a
# boxplot (1.5 x IQR or more below 1st quartile or above 3rd quartile)
which(scores(salaries$Salary, type = "iqr", lim = 1.5))
## [1] 1 11 12 22 24 33 34 38 54 64 83 97 132 134 138 139 146
## [18] 151 161 164 168 189 208 211 236 242 248 249 293 298 300 329 336 348
## [35] 355 370 384 408 440 441 442 452 460 475 484 493 496 517 520 535 538
## [52] 547 576 577 585 601 620 627 629 649 664 679 692 695 707 728 729 733
## [69] 737 760 790 794 801 803 816 818 843
How you deal with outliers is a topic worthy of its own tutorial; however, if you want to simply remove an outlier or replace it with the sample mean or median, then I recommend the rm.outlier() function, also provided by the outliers package and sketched below.
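A minimal sketch of that approach (note that rm.outlier() handles only the single most extreme value per call, so it would need to be applied repeatedly for multiple outliers):
# remove the single most extreme observation
salary_trimmed <- rm.outlier(salaries$Salary, fill = FALSE)
# or replace it with the sample median instead of dropping it
salary_filled <- rm.outlier(salaries$Salary, fill = TRUE, median = TRUE)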
More about dealing with outliers -> here
_____
Visualization: Histograms and boxplots are graphical representations that illustrate these summary measures for numerical variables.
=Using Histograms
hist(salaries$Salary)

Use ggplot to customize the graphic and create a more presentable product:
install.packages("ggplot2")
library(ggplot2)
ggplot(salaries, aes(Salary)) +
geom_histogram(colour = "black", fill = "white") +
scale_x_log10(labels = scales::dollar) + # x axis log & $$ labels
geom_vline(aes(xintercept = mean(Salary)),
color = "red", linetype = "dashed") + # add line for mean
annotate("text", x = mean(salaries$Salary) * 2, y = 155,
label = paste0("Avg: $", round(mean(salaries$Salary)/1000000, 1),"M"))

ggplot(salaries, aes(x = Salary)) +
geom_dotplot() +
scale_x_continuous(labels = scales::dollar)

=Using Boxplots
Boxplots are an alternative way to illustrate the distribution of a variable and are a concise way to illustrate the standard quantiles, shape, and outliers of data.
# I use a log scale to spread out the data
boxplot(salaries$Salary, horizontal = TRUE, log = "x")

Use ggplot to refine the boxplot and add additional features, such as a point illustrating the mean, and also show the actual distribution of observations:
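One way to build such a plot (a minimal sketch; the dummy x aesthetic and the log dollar scale are my own choices):
ggplot(salaries, aes(x = "", y = Salary)) +
  geom_jitter(width = 0.1, alpha = 0.3) +          # show the individual observations
  geom_boxplot(outlier.shape = NA, alpha = 0.5) +  # boxplot without re-plotting the outliers
  stat_summary(fun = mean, geom = "point",
               colour = "red", size = 3) +         # red point marking the mean
  scale_y_log10(labels = scales::dollar) +         # log scale to spread out the data
  xlab(NULL)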

However, boxplots are more useful when comparing distributions. For instance, if you wanted to compare the distributions of salaries across the different positions, boxplots provide a quick comparative assessment:
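A hedged sketch of that comparison, assuming the position variable in the workbook is named Position:
ggplot(salaries, aes(x = Position, y = Salary)) +
  geom_boxplot() +
  scale_y_log10(labels = scales::dollar) +  # log scale keeps the largest salaries from dominating
  coord_flip()                              # horizontal boxes make the position labels readable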

-Categorical Data
Descriptive statistics are the first pieces of information used to understand and represent a dataset. Their goal, in essence, is to describe the main features of numerical and categorical information with simple summaries.
We will use the Supermarket Transactions.xlsx dataset, an artificial set of supermarket transaction data with 16 variables. For more information on installing, loading, and getting help with packages, see this tutorial.
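As before, a minimal import sketch, assuming the workbook is in your working directory and the transactions are on the first sheet (adjust the path and sheet as needed):
library(readxl)
supermarket <- read_excel("Supermarket Transactions.xlsx")  # assumed path; adjust to your file location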
_____
Frequencies: The number of observations for a particular category
To produce contingency tables, which calculate counts for each combination of categorical variables, we can use R's table() function.
# counts for gender categories
table(supermarket$Gender)
##
## F M
## 7170 6889
Produce a cross-classification table for the number of married and single female and male customers.
# cross classification counts for gender by marital status
table(supermarket$`Marital Status`, supermarket$Gender)
##
## F M
## M 3602 3264
## S 3568 3625
We can produce multidimensional tables based on three or more categorical variables by leveraging the ftable() function.
# customer counts across location by gender and marital status
table1 <- table(supermarket$`Marital Status`, supermarket$Gender,
supermarket$`State or Province`)
ftable(table1)
## BC CA DF Guerrero Jalisco OR Veracruz WA Yucatan Zacatecas
##
## M F 190 638 188 77 15 510 142 1166 200 476
## M 197 692 210 94 5 514 108 1160 129 155
## S F 183 686 175 107 30 607 125 1134 164 357
## M 239 717 242 105 25 631 89 1107 161 309
_____
Proportions: Percentage of each category or combination of categories.
We can use the table() function together with the prop.table() function.
# percentages of gender categories
table2 <- table(supermarket$Gender)
prop.table(table2)
##
## F M
## 0.5099936 0.4900064
# percentages for gender by marital status
table3 <- table(supermarket$`Marital Status`, supermarket$Gender)
prop.table(table3)
##
## F M
## M 0.2562060 0.2321644
## S 0.2537876 0.2578420
# customer percentages across location by gender and marital status
# using table1 from previous code chunk
ftable(round(prop.table(table1), 3))
## BC CA DF Guerrero Jalisco OR Veracruz WA Yucatan Zacatecas
##
## M F 0.014 0.045 0.013 0.005 0.001 0.036 0.010 0.083 0.014 0.034
## M 0.014 0.049 0.015 0.007 0.000 0.037 0.008 0.083 0.009 0.011
## S F 0.013 0.049 0.012 0.008 0.002 0.043 0.009 0.081 0.012 0.025
## M 0.017 0.051 0.017 0.007 0.002 0.045 0.006 0.079 0.011 0.022
_____
Marginals: Show the total counts or percentages across columns or rows in a contingency table.
We can compute the marginal frequencies with margin.table() and the percentages for these marginal frequencies with prop.table() using the margin argument.
table3
##
## F M
## M 3602 3264
## S 3568 3625
# FREQUENCY MARGINALS
# row marginals - totals for each marital status across gender
margin.table(table3, 1)
##
## M S
## 6866 7193
# column marginals - totals for each gender across marital status
margin.table(table3, 2)
##
## F M
## 7170 6889
# PERCENTAGE MARGINALS
# row marginals - row percentages across gender
prop.table(table3, margin = 1)
##
## F M
## M 0.5246140 0.4753860
## S 0.4960378 0.5039622
# column marginals - column percentages across marital status
prop.table(table3, margin = 2)
##
## F M
## M 0.5023710 0.4737988
## S 0.4976290 0.5262012
_____
Visualization: Bar charts / histograms are most often used to visualize categorical variables. Using ggplot2 makes the graph more presentable.
ggplot(supermarket, aes(x = `State or Province`)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# re-order levels
reorder_size <- function(x) {
factor(x, levels = names(sort(table(x), decreasing = TRUE)))
}
ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
geom_bar() +
xlab("State or Province") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Proportions bar charts
ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
xlab("State or Province") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

We can also create contingency-table-like bar charts by using the facet_grid() function to produce small multiples. Plot customer proportions across location and by Gender.
ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
xlab("State or Province") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
facet_grid(~ Gender) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Do the same plot by Gender and by Marital status.
ggplot(supermarket, aes(x = reorder_size(`State or Province`))) +
geom_bar(aes(y = (..count..)/sum(..count..))) +
xlab("State or Province") +
scale_y_continuous(labels = scales::percent, name = "Proportion") +
facet_grid(`Marital Status` ~ Gender) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

-Assumption of Normality
When assumptions are broken we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. One of the first parametric assumptions most people think of is the assumption of normality. The assumption of normality is important for hypothesis testing and in regression models. In general linear models, the assumption comes into play with regard to the residuals (aka errors).
What is normality: The sampling distribution of the mean is normal
Dataset and packages: What we need
Visualizing Normality: The first step in evaluating normality.
Descriptive Statistics for Normality: Quantifying the shape of the distribution.
Shapiro-Wilk Test for Normality: A formal statistical test of the normality assumption.
_____
What is normality
Central Limit Theorem: given random and independent samples of N observations each, the distribution of sample means approaches normality as the size of N increases, regardless of the shape of the population distribution.
It follows from the CLT that in large samples the sampling distribution tends to be normal anyway.
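A quick simulation sketch of this idea (the exponential population and the sample size of 30 are arbitrary illustrative choices): even though the population is strongly right-skewed, the distribution of the sample means already looks roughly bell shaped.
set.seed(123)
pop <- rexp(100000, rate = 1)                                  # a clearly non-normal population
sample_means <- replicate(1000, mean(sample(pop, size = 30)))  # 1,000 sample means with n = 30
hist(sample_means, main = "Sampling distribution of the mean (n = 30)")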
_____
Dataset and packages
We use golf data provided by ESPN. In addition, we need the following packages:
library(readxl) # read in Excel data
library(ggplot2) # create visualizations
library(psych) # provide summary statistics
library(pastecs) # provide summary statistics
For more information on installing, loading, and getting help with packages, see this tutorial. The Golf dataset is a data frame consisting of 18 variables.
_____
Visualizing Normality
Histograms are the first way to look at the shape of a distribution. By overlaying a normal density curve with stat_function() we can see how closely the data follow a normal distribution.
library(readxl)
library(ggplot2)
golf <- read_excel("File Path", sheet = "2011")
ggplot(golf, aes(`Driving Accuracy`)) +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(golf$`Driving Accuracy`, na.rm = T),
sd = sd(golf$`Driving Accuracy`, na.rm = T))) +
xlab("Driving Accuracy (%)")
Compare this to the Earnings variable.
ggplot(golf, aes(Earnings)) +
geom_histogram(aes(y = ..density..), colour = "black", fill = "white") +
stat_function(fun = dnorm, args = list(mean = mean(golf$Earnings, na.rm = T),
sd = sd(golf$Earnings, na.rm = T))) +
scale_x_continuous(label = scales::dollar)

Another useful graph that we can inspect to see if a distribution is normal is the Q-Q plot. If the data are normally distributed the plot will display a straight (or nearly straight) line.
Q-Q plot for Driving Accuracy:
qqnorm(golf$`Driving Accuracy`,
main = "Normal Q-Q Plot for Driving Accuracy")

But the Q-Q plot for Earnings:
qqnorm(golf$Earnings, main = "Normal Q-Q Plot for Earnings")
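To make the straightness easier to judge, base R's qqline() can add a reference line through the first and third quartiles; repeating it for Earnings makes the departure from normality obvious.
qqnorm(golf$`Driving Accuracy`, main = "Normal Q-Q Plot for Driving Accuracy")
qqline(golf$`Driving Accuracy`)   # normally distributed data hug this line
qqnorm(golf$Earnings, main = "Normal Q-Q Plot for Earnings")
qqline(golf$Earnings)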

_____
Descriptive Statistics for Normality
It's important to support the visual findings with objective quantifications that describe the shape of the distribution and to look for outliers.
We can do this by using the describe() function from the psych package or the stat.desc() function from the pastecs package.
-With the describe() function
install.packages("psych")
library(psych)
# assess one variable
describe(golf$`Driving Accuracy`)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## 1 1 197 61.79 5.46 61.7 61.71 5.78 47 77 30 0.09 -0.04 0.39
# assess multiple variables
describe(golf[, c("Driving Accuracy", "Earnings")])
## vars n mean sd median trimmed
## Driving Accuracy 1 197 61.79 5.46 61.7 61.71
## Earnings 2 200 1336649.41 1155157.90 941255.0 1120407.53
## mad min max range skew kurtosis se
## Driving Accuracy 5.78 47 77 30 0.09 -0.04 0.39
## Earnings 761431.93 250333 6683214 6432882 1.85 3.88 81682.00
-With the stat.desc() function
To reduce the amount of statistics we get back we can set the argument basic = FALSE, and to get statistics relating to the distribution we set the argument norm = TRUE.
install.packages("pastecs")
library(pastecs)
# assess one variable
stat.desc(golf$`Driving Accuracy`, basic = FALSE, norm = TRUE)
## median mean SE.mean CI.mean.0.95 var
## 61.70000000 61.78578680 0.38927466 0.76770460 29.85234797
## std.dev coef.var skewness skew.2SE kurtosis
## 5.46373023 0.08843021 0.09349445 0.26988823 -0.03637605
## kurt.2SE normtest.W normtest.p
## -0.05275937 0.99626396 0.91530114
# assess multiple variables
stat.desc(golf[, c("Driving Accuracy", "Earnings")], basic = FALSE, norm = TRUE)
## Driving Accuracy Earnings
## median 61.70000000 9.412550e+05
## mean 61.78578680 1.336649e+06
## SE.mean 0.38927466 8.168200e+04
## CI.mean.0.95 0.76770460 1.610734e+05
## var 29.85234797 1.334390e+12
## std.dev 5.46373023 1.155158e+06
## coef.var 0.08843021 8.642191e-01
## skewness 0.09349445 1.846104e+00
## skew.2SE 0.26988823 5.368929e+00
## kurtosis -0.03637605 3.876503e+00
## kurt.2SE -0.05275937 5.664051e+00
## normtest.W 0.99626396 8.004944e-01
## normtest.p 0.91530114 2.918239e-15
_____
Shapiro-Wilk Test for Normality
-Non-significant (p > .05) -> the sample is not significantly different from a normal distribution.
-Significant (p < .05) -> the distribution in question is significantly different from a normal distribution.
shapiro.test(golf$`Driving Accuracy`)
## Shapiro-Wilk normality test
## data: golf$`Driving Accuracy`
## W = 0.99626, p-value = 0.9153
shapiro.test(golf$Earnings)
## Shapiro-Wilk normality test
## data: golf$Earnings
## W = 0.80049, p-value = 2.918e-15
It's important to note that there are limitations to the Shapiro-Wilk test. As the dataset being evaluated gets larger, the Shapiro-Wilk test becomes more sensitive to small deviations, which leads to a greater probability of rejecting the null hypothesis (the null hypothesis being that the values come from a normal distribution).
The principal message is that we should not rely on only one approach to assess normality. Rather, we should examine the distribution visually, through descriptive statistics, and with formal testing procedures to conclude whether our data meet the normality assumption or not.
- Assumptions of Homogeneity
-Correlations
Correlation is a bivariate analysis of two variables. The value of the correlation coefficient varies between +1 and -1. A value near +1 implies a strong positive association, a value near -1 implies a strong negative association, and a value near 0 implies no association between the two variables.
Dataset and packages: What we need
Visualizing relationships: A picture is worth a thousand words
Pearson's correlation: The most widely used approach.
Spearman’s Rank correlation: Great when variables are measured on an ordinal scale. A non-parametric approach
Kendall’s tau: Another non-parametric approach. Less sensitive to outliers and more accurate with smaller sample sizes.
Partial correlation: Measuring the relationship between two variables while controlling for the effect of one or more covariates.
_____
Dataset and packages
We need the following packages:
library(readxl) # reading in data
library(ggplot2)# visualizing data
library(gridExtra) # combining multiple plots
library(corrgram) # visualizing data
library(corrplot) # visualizing data
library(Hmisc) # produces correlation matrices with p-values
library(ppcor) # assesses partial correlations
Download the datasets here and here: the Golf dataset and the Survey dataset. The golf data has 18 variables and the survey data has 11 variables.
_____
Visualizing bivariate relationships
A scatterplot provides a vivid illustration of the relationship between two variables.
qplot(x = Events, y = Rounds, data = golf) +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Fig. A: Strong Positive Association")

Contrast this to the following two plots, which show that as driving accuracy increases the distance of the player's drive tends to decrease (Fig. B). In addition, we can easily see that as a player's age increases their greens in regulation percentage does not appear to change (Fig. C).
library(gridExtra)
p1 <- qplot(x = `Driving Accuracy`, y = `Yards/Drive`, data = golf) +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Fig. B: Moderate Negative Association")
p2 <- qplot(x = Age, y = `Greens in Regulation`, data = golf) +
geom_smooth(method = "lm", se = FALSE) +
ggtitle("Fig. C: Weak/No Association")
grid.arrange(p1, p2, ncol = 2)

In addition, scatter plots illustrate the linearity of the relationship, which can influence how you approach assessing correlations (i.e. data transformation, using a parametric vs non-parametric test, removing outliers).
Francis Anscombe illustrated this in 1973 when he constructed four data sets that have the same mean, variance, and correlation; however, there are significant differences in the variable relationships. Using the anscombe data, which R includes as a built-in data set, the plots below demonstrate the importance of graphing data rather than just relying on correlation coefficients. Each x-y combination in the plot below has a correlation of 0.82 (strong positive), but there are definite differences in the association between these variables.
library(gridExtra)
library(grid)
p1 <- qplot(x = x1, y = y1, data = anscombe)
p2 <- qplot(x = x2, y = y2, data = anscombe)
p3 <- qplot(x = x3, y = y3, data = anscombe)
p4 <- qplot(x = x4, y = y4, data = anscombe)
grid.arrange(p1, p2, p3, p4, ncol = 2,
top = textGrob("Anscombe's Quartet"))

Visualization can also give you a quick approach to assessing multiple relationships. We can produce scatter plot matrices in multiple ways to visualize and compare relationships across the entire data set we are analyzing. With base R plotting we can use the pairs() function. Let's look at the first 10 variables of the golf data set (minus the player name variable).
pairs(golf[, c(1, 3:10)])

We can also use the corrgram and corrplot packages. Note that multiple options exist with both of these visualizations (i.e. formatting, correlation method applied, illustrating significance and confidence intervals, etc.).
library(corrgram)
par(bg = "#fdfdfd")
corrgram(golf[, c(1, 3:10)], lower.panel = panel.shade, upper.panel = panel.pts)

With corrplot
library(corrplot)
cor_matrix <- cor(golf[, c(1, 3:10)], use = 'complete.obs')
corrplot.mixed(cor_matrix, lower = "circle", upper = "number", tl.pos = "lt", diag = "u")

Once you’ve visualized the data and understand the associations that appear to be present and their attributes (strength, outliers, linearity) you can begin assessing the statistical relationship by applying the appropriate correlation method.
_____
Pearson’s Correlation: best for interval scale
The Pearson product-moment correlation coefficient measures the strength of the linear relationship between two variables and is represented by r when referring to a sample or ρ when referring to the population. Unfortunately, the assumptions for Pearson’s correlation are often overlooked.
Level of measurement: The variables should be continuous. If one or both of the variables are ordinal in measurement, then a Spearman rank correlation should be conducted.
Linear relationship: The variables need to be linearly related. If they are not, the data could be transformed (i.e. logarithmic transformation) or a non-parametric approach such as the Spearman’s or Kendall’s rank correlation tests could be used.
Homoscedasticity: If the variance between the two variables is not constant then r will not provide a good measure of association.
Bivariate Normality: Technically, Pearson’s $r$ does not require normality when the sample size is fairly large; however, when the variables consist of high levels of skewness or contain significant outliers it is recommended to use Spearman’s rank correlation or, at a minimum, compare Pearson’s and Spearman’s coefficients.
To calculate the correlation between two variables we use cor(). When using cor() there are two arguments (other than the variables) that need to be considered. The first is use =, which allows us to decide how to handle missing data. The default is use = "everything", but if there is missing data in your data set this will cause the output to be NA unless we explicitly state to only use complete observations with use = "complete.obs". The second argument is method =, which allows us to specify whether we want to use "pearson", "kendall", or "spearman". Pearson is the default method, so we do not need to specify it.
# If we don't filter out NAs we get NA in return
cor(golf$Age, golf$`Yards/Drive`)
## [1] NA
# Filter NAs to get correlation for complete observations
cor(golf$Age, golf$`Yards/Drive`, use = 'complete.obs')
## [1] -0.3960891
# We can also get the correlation matrix for multiple variables but we need
# to exclude any non-numeric values
cor(golf[, c(1, 3:10)], use = 'complete.obs')
## Rank Age Events Rounds Cuts Made
## Rank 1.0000000 0.2178987 -0.25336860 -0.39896409 -0.6340634
## Age 0.2178987 1.0000000 -0.11910802 -0.14056653 -0.1950816
## Events -0.2533686 -0.1191080 1.00000000 0.96635795 0.7614576
## Rounds -0.3989641 -0.1405665 0.96635795 1.00000000 0.8913155
## Cuts Made -0.6340634 -0.1950816 0.76145761 0.89131554 1.0000000
## Top 10s -0.8033446 -0.2052073 0.17710952 0.30790212 0.5255005
## Wins -0.5765412 -0.1753386 0.04591368 0.12215216 0.2460359
## Earnings -0.8585013 -0.2175656 0.15208235 0.29048622 0.5289126
## Yards/Drive -0.3008188 -0.3960891 -0.02052854 0.03470326 0.1465976
## Top 10s Wins Earnings Yards/Drive
## Rank -0.8033446 -0.57654121 -0.8585013 -0.30081875
## Age -0.2052073 -0.17533858 -0.2175656 -0.39608914
## Events 0.1771095 0.04591368 0.1520823 -0.02052854
## Rounds 0.3079021 0.12215216 0.2904862 0.03470326
## Cuts Made 0.5255005 0.24603589 0.5289126 0.14659757
## Top 10s 1.0000000 0.49398282 0.8957970 0.19397586
## Wins 0.4939828 1.00000000 0.7313615 0.21563889
## Earnings 0.8957970 0.73136149 1.0000000 0.25041021
## Yards/Drive 0.1939759 0.21563889 0.2504102 1.00000000
Unfortunately cor() only provides the r coefficient(s) and does not test for significance nor provide confidence intervals. To get these parameters for a simple two-variable analysis I use cor.test(). In our example we see that the p-value is significant and the 95% confidence interval confirms this, as the range does not contain zero. This suggests the correlation between age and yards per drive is r = -0.396, with 95% confidence of being between -0.27 and -0.51.
cor.test(golf$Age, golf$`Yards/Drive`, use = 'complete.obs')
##
## Pearson's product-moment correlation
##
## data: golf$Age and golf$`Yards/Drive`
## t = -5.8355, df = 183, p-value = 2.394e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5111490 -0.2670825
## sample estimates:
##cor
## -0.3960891
We can also get the correlation matrix and the p-values across all variables by using the rcorr() function in the Hmisc package.
library(Hmisc)
rcorr(as.matrix(golf[, c(1, 3:9)]))
## Rank Age Events Rounds Cuts Made Top 10s Wins Earnings
## Rank 1.00 0.21 -0.23 -0.38 -0.63 -0.80 -0.58 -0.86
## Age 0.21 1.00 -0.09 -0.12 -0.17 -0.20 -0.17 -0.21
## Events -0.23 -0.09 1.00 0.97 0.75 0.16 0.04 0.14
## Rounds -0.38 -0.12 0.97 1.00 0.88 0.29 0.12 0.28
## Cuts Made -0.63 -0.17 0.75 0.88 1.00 0.52 0.25 0.53
## Top 10s -0.80 -0.20 0.16 0.29 0.52 1.00 0.50 0.89
## Wins -0.58 -0.17 0.04 0.12 0.25 0.50 1.00 0.73
## Earnings -0.86 -0.21 0.14 0.28 0.53 0.89 0.73 1.00
##
## n
## Rank Age Events Rounds Cuts Made Top 10s Wins Earnings
## Rank 200 188 200 200 200 200 200 200
## Age 188 188 188 188 188 188 188 188
## Events 200 188 200 200 200 200 200 200
## Rounds 200 188 200 200 200 200 200 200
## Cuts Made 200 188 200 200 200 200 200 200
## Top 10s 200 188 200 200 200 200 200 200
## Wins 200 188 200 200 200 200 200 200
## Earnings 200 188 200 200 200 200 200 200
##
## P
## Rank Age Events Rounds Cuts Made Top 10s Wins Earnings
## Rank 0.0045 0.0011 0.0000 0.0000 0.0000 0.0000 0.0000
## Age 0.0045 0.1978 0.1084 0.0164 0.0056 0.0197 0.0041
## Events 0.0011 0.1978 0.0000 0.0000 0.0238 0.5953 0.0499
## Rounds 0.0000 0.1084 0.0000 0.0000 0.0000 0.0953 0.0000
## Cuts Made 0.0000 0.0164 0.0000 0.0000 0.0000 0.0003 0.0000
## Top 10s 0.0000 0.0056 0.0238 0.0000 0.0000 0.0000 0.0000
## Wins 0.0000 0.0197 0.5953 0.0953 0.0003 0.0000 0.0000
## Earnings 0.0000 0.0041 0.0499 0.0000 0.0000 0.0000 0.0000
_____
Spearman’s Rank Correlation: best for ordinal scale
Common questions that a Spearman correlation answers include: Is there a statistically significant relationship between the age of a golf player and the number of wins (0, 1, 2)?
The assumptions for Spearman’s correlation include:
Level of measurement: The typical assumption is that one or both of the variables are ordinal in measurement; however, Spearman's is also appropriate to use if both variables are continuous but are heavily skewed or contain sizable outliers.
Monotonically related: A linear relationship is not necessary; the only requirement is that one variable is monotonically related to the other variable.
To assess correlations with Spearman's rank we can use the same functions introduced for the Pearson correlations and simply change the correlation method, as the sketch below shows.
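For example, a minimal sketch using the golf data from above (Age and Wins are columns in that data set):
# Spearman correlation between age and number of wins
cor(golf$Age, golf$Wins, method = "spearman", use = "complete.obs")
# significance test; exact = FALSE sidesteps the exact p-value computation, which warns when ties are present
cor.test(golf$Age, golf$Wins, method = "spearman", exact = FALSE)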
-Univariate Statistical Inference
After receiving data and completing some initial data exploration, we typically move on to performing some univariate estimation and prediction tasks. Statistical inference helps you estimate parameters of a larger population when the observed data you are working with is a subset of that population.
Dataset & packages: What we need
Confidence intervals of the mean: Using our sample to create an expectation of the population mean.
Reducing margin of error: How can we create greater confidence in our estimates?
Confidence intervals of the proportion: Confidence intervals for categorical variables.
Hypothesis testing for the mean: Assessing evidence for claims of the mean estimate.
Hypothesis testing for the proportion: Assessing evidence for claims of proportions. (A brief sketch of these tools follows this list.)
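As a hedged sketch of the two base R workhorses for these tasks (the salary variable and the hypothesized values below are purely illustrative, not this section's dataset): t.test() returns a confidence interval and hypothesis test for a mean, and prop.test() does the same for a proportion.
# 95% confidence interval for the mean salary plus a test of H0: mu = 3,000,000 (illustrative value)
t.test(salaries$Salary, mu = 3000000)
# confidence interval and test for a proportion, e.g. 7170 females out of 14059 supermarket customers
prop.test(x = 7170, n = 14059, p = 0.5)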
_____
Dataset & packages
#Text Mining
One of the most complete books on this topic: https://www.tidytextmining.com/
-Tidying Text & Word Frequency
A fundamental first step in text mining is to get your text into a tidy format so that you can perform word frequency analysis.
Dataset and packages
Text Tidying
Word Frequency
_____
Dataset and packages
We need the data provided in the harrypotter package (installed from GitHub below). The other packages we need are:
if (packageVersion("devtools") < 1.6) {
  install.packages("devtools")
}
devtools::install_github("bradleyboehmke/harrypotter")  # the harrypotter data package comes from GitHub
install.packages("tidyverse")
install.packages("stringr")
install.packages("tidytext")
library(tidyverse) # data manipulation & plotting
library(stringr) # text cleaning and regular expressions
library(tidytext) # provides additional text mining functions
library(harrypotter)# provides the first seven novels of the Harry Potter series
The seven novels are:
philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)
_____
Text Tidying
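As a starting point, here is a minimal sketch of tidying one novel and counting word frequencies with tidytext, assuming each element of philosophers_stone holds one chapter of text:
library(tidyverse)
library(tidytext)
library(harrypotter)

# one row per chapter, then one row per word (the tidy text format)
ps_tidy <- tibble(chapter = seq_along(philosophers_stone),
                  text    = philosophers_stone) %>%
  unnest_tokens(word, text) %>%   # tokenize into individual words
  anti_join(stop_words)           # drop common stop words ("the", "of", ...)

# word frequency, most common words first
ps_tidy %>%
  count(word, sort = TRUE)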