Descriptive Analytics
Descriptive methodologies focus on analyzing historical data to identify patterns or trends.
#Classical Analyses
-Numerical Data
Their goal, in essence, is to describe the main features of numerical and categorical information with simple summaries.
We use the dataset Baseball Salaries 2011.xlsx, which contains data on 843 MLB players in the 2011 season and has 4 variables. For more information on installing, loading, and getting help with packages, see this tutorial.
Central Tendency (or "central value"): What are the most typical values?
There are three common measures of central tendency: the mean (the average of all observations), the median (the middle observation), and the mode (the value that appears most often).
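A minimal sketch in R (the data frame name salaries and the column Salary are assumptions about how the spreadsheet was imported):

```r
mean(salaries$Salary)    # arithmetic mean
median(salaries$Salary)  # middle observation

# R has no built-in mode function for data, so we define a simple one
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
get_mode(salaries$Salary)
```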
_____
Variability: How do the values vary?
Measures of variability provide insight into how the values are spread out.
#Range: The maximum and minimum values, and the difference between them
#Percentiles
For a given salary value, a percentile tells us the percentage of observations that fall below it. The first, second, and third quartiles are the percentiles corresponding to p=25%, p=50%, and p=75%. These three values divide the data into four groups, each with (approximately) a quarter of all observations.
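For example, assuming the same salaries data frame as above:

```r
range(salaries$Salary)                                  # minimum and maximum
diff(range(salaries$Salary))                            # the range as a single number
quantile(salaries$Salary, probs = c(0.25, 0.5, 0.75))   # 1st, 2nd, and 3rd quartiles
```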
An alternative approach is to use the summary() function, a generic R function that produces the min, 1st quartile, median, mean, 3rd quartile, and max summary measures. However, note that the 1st and 3rd quartiles produced by summary() can differ from those produced by fivenum() and the default quantile(). The reason is that there is no universal agreement on how the 1st and 3rd quartiles should be calculated. Eric Cai provides a good blog post that discusses this difference in the R functions.
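To see the difference, compare the three functions side by side (same assumed salaries data):

```r
summary(salaries$Salary)   # min, 1st Qu., median, mean, 3rd Qu., max
fivenum(salaries$Salary)   # Tukey's five-number summary
quantile(salaries$Salary)  # 0%, 25%, 50%, 75%, 100% (default type = 7)
```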
#Variance
The most common measures to summarize variability are variance and its derivatives (standard deviation and mean/median absolute deviation).
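All three are one-liners in base R (illustrative column name):

```r
var(salaries$Salary)   # variance
sd(salaries$Salary)    # standard deviation
mad(salaries$Salary)   # median absolute deviation (scaled by 1.4826 by default)
```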
_____
Shape: Are the values symmetrically or asymmetrically distributed?
Two additional measures of a distribution that you will occasionally hear about are skewness and kurtosis.
source: https://bit.ly/2xkqOlj
=Skewness: Measure of symmetry
Negative values -> left-skewed -> mean < median
Positive values -> right-skewed -> mean > median
Near-zero values -> approximately symmetric (as in a normal distribution)
=Kurtosis: Measure of peakedness
Negative values -> platykurtic (flatter than a normal distribution; see the picture above)
Positive values -> leptokurtic (more peaked than a normal distribution; see the picture above)
Near-zero values -> mesokurtic (as in a normal distribution)
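A sketch using the moments package (one option among several; psych and e1071 provide similar functions):

```r
library(moments)

skewness(salaries$Salary)      # > 0 would indicate a right-skewed distribution
kurtosis(salaries$Salary) - 3  # moments::kurtosis() returns raw kurtosis
                               # (about 3 for a normal), so subtract 3 for excess
```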
_____
Outliers: Values that represent abnormalities in the data.
Outliers in data can distort predictions and affect their accuracy. The outliers package provides a number of useful functions to systematically extract outliers. The functions of most use are outlier() and scores(). The outlier() function returns the observation that is farthest from the mean. The scores() function computes normalized (z, t, chisq, etc.) scores, which you can use to find observations that lie beyond a given value. How you deal with outliers is a topic worthy of its own tutorial; however, if you simply want to remove an outlier or replace it with the sample mean or median, then I recommend the rm.outlier() function, also provided by the outliers package.
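A minimal sketch of these functions, again assuming a salaries data frame with a Salary column:

```r
library(outliers)

outlier(salaries$Salary)                  # the single most extreme value from the mean
z <- scores(salaries$Salary, type = "z")  # z-scores for every observation
salaries$Salary[abs(z) > 3]               # observations more than 3 SDs from the mean

# replace the most extreme value with the sample median (use median = FALSE for the mean)
rm.outlier(salaries$Salary, fill = TRUE, median = TRUE)
```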
More about dealing with outliers -> here
_____
Visualization: Histograms and boxplots are graphical representations that illustrate these summary measures for numerical variables.
=Using Histograms
Use ggplot to customize the graphic and create a more presentable product:
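A sketch of such a ggplot2 histogram (data frame and column names are assumptions):

```r
library(ggplot2)

ggplot(salaries, aes(x = Salary)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of 2011 MLB salaries", x = "Salary", y = "Count")
```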
An alternative, and highly effective, way to visualize data is with dotplots.
=Using Boxplots
Boxplots are an alternative way to illustrate the distribution of a variable and are a concise way to show the standard quantiles, shape, and outliers of the data. Use ggplot to refine the boxplot and add additional features, such as a point illustrating the mean, and also show the actual distribution of observations:
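One possible sketch (illustrative names):

```r
library(ggplot2)

ggplot(salaries, aes(x = "", y = Salary)) +
  geom_boxplot(outlier.colour = "red") +
  geom_jitter(width = 0.1, alpha = 0.2) +               # show the raw observations
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 3, colour = "blue") + # mark the mean
  labs(x = NULL, y = "Salary")
```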
However, boxplots are more useful when comparing distributions. For instance, if you want to compare the distributions of salaries across the different positions, boxplots provide a quick comparative assessment:
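For example, assuming the data contain a Position column:

```r
library(ggplot2)

ggplot(salaries, aes(x = Position, y = Salary)) +
  geom_boxplot() +
  coord_flip()   # flip the axes so the position labels are easy to read
```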
-Categorical Data
Descriptive statistics are the first pieces of information used to understand and represent a dataset. Their goal, in essence, is to describe the main features of numerical and categorical information with simple summaries.
We use the dataset Supermarket Transactions.xlsx, an artificial supermarket transaction dataset with 16 variables. For more information on installing, loading, and getting help with packages, see this tutorial.
_____
Frequencies: The number of observations for a particular category
To produce contingency tables, which calculate counts for each combination of categorical variables, we can use R's table() function.
For example, produce a cross-classification table of the number of married and single female and male customers.
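A sketch, assuming the transactions were read into a data frame called supermarket with Gender and Marital.Status columns (illustrative names):

```r
table(supermarket$Marital.Status, supermarket$Gender)
```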
Produce multidimensional tables based on three or more categorical variables. For this, we leverage the ftable() function.
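For example, adding a third variable such as Country (illustrative name):

```r
ftable(supermarket$Country, supermarket$Gender, supermarket$Marital.Status)
```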
_____
Proportions: Percentage of each category or combination of categories.
Use the table() and prop.table() functions.
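A sketch with the same assumed columns as above:

```r
tbl <- table(supermarket$Marital.Status, supermarket$Gender)
prop.table(tbl)                   # proportion of all observations in each cell
round(prop.table(tbl) * 100, 1)   # the same values as percentages
```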
_____
Marginals: Show the total counts or percentages across columns or rows in a contingency table.
We can compute the marginal frequencies with margin.table() and the percentages for these marginal frequencies with prop.table() using the margin argument.
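For example:

```r
tbl <- table(supermarket$Marital.Status, supermarket$Gender)
margin.table(tbl, margin = 1)   # row totals (marital status)
margin.table(tbl, margin = 2)   # column totals (gender)
prop.table(tbl, margin = 1)     # proportions within each row
prop.table(tbl, margin = 2)     # proportions within each column
```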
_____
Visualization: Bar charts are most often used to visualize categorical variables (histograms serve this role for numerical variables). Using ggplot2 makes the graph more presentable.
We can also create contingency-table-like bar charts by using the facet_grid() function to produce small multiples. Plot customer proportions across location and by Gender. Do the same plot by Gender and by Marital Status.
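A sketch of the first of these plots, assuming City, Gender, and Marital.Status columns (illustrative names); swapping City for Marital.Status gives the second:

```r
library(ggplot2)

ggplot(supermarket, aes(x = Gender)) +
  geom_bar(aes(y = after_stat(prop), group = 1)) +  # bar heights as proportions
  facet_grid(. ~ City) +                            # one panel per location
  labs(y = "Proportion of customers")
```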
-Assumption of Normality
When assumptions are broken we stop being able to draw accurate conclusions about reality. Different statistical models assume different things, and if these models are going to reflect reality accurately then these assumptions need to be true. One of the first parametric assumptions most people think of is the assumption of normality. The assumption of normality is important for hypothesis testing and in regression models. In general linear models, the assumption comes into play with regard to residuals (aka errors).
What is normality: The sampling distribution of the mean is normal
Dataset and packages: What we need
Visualizing Normality: The first step in evaluating normality.
Descriptive Statistics for Normality: Supporting the visual findings with objective quantification of the distribution's shape.
Shapiro-Wilk Test for Normality: A formal statistical test of the normality assumption.
_____
What is normality
Central Limit Theorem: given random and independent samples of N observations each, the distribution of sample means approaches normality as the size of N increases, regardless of the shape of the population distribution.
It follows from the CLT that in large samples the sampling distribution of the mean tends to be normal anyway.
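A quick simulation illustrates this: sample means drawn from a heavily skewed (exponential) population are themselves approximately normally distributed.

```r
set.seed(123)
sample_means <- replicate(10000, mean(rexp(n = 50, rate = 1)))
hist(sample_means, breaks = 50,
     main = "Sampling distribution of the mean (n = 50)", xlab = "Sample mean")
```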
_____
Dataset and packages
We use golf data provided by ESPN. In addition, we need several other packages, such as the following:
For more information on installing, loading, and getting help with packages, see this tutorial. The Golf dataset is a data frame consisting of 18 variables.
_____
Visualizing Normality
Histograms are the first way to look at the shape of a distribution. By overlaying a normal curve with the stat_function() layer, we can see whether the data are approximately normally distributed. Compare this to the Earnings variable.
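A sketch of such a histogram with a normal curve overlaid (the golf data frame and DrivingAccuracy column are illustrative names):

```r
library(ggplot2)

ggplot(golf, aes(x = DrivingAccuracy)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30,
                 fill = "grey80", colour = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(golf$DrivingAccuracy, na.rm = TRUE),
                            sd   = sd(golf$DrivingAccuracy, na.rm = TRUE)),
                colour = "red")
```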
Another useful graph that we can inspect to see if a distribution is normal is the Q-Q plot. If the data are normally distributed the plot will display a straight (or nearly straight) line.
Q-Q plot for driving accuracy:
But the Q-Q plot for Earnings:
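In base R the two plots can be produced as follows (illustrative column names):

```r
qqnorm(golf$DrivingAccuracy); qqline(golf$DrivingAccuracy)  # roughly a straight line
qqnorm(golf$Earnings);        qqline(golf$Earnings)         # departs from the line
```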
_____
Descriptive Statistics for Normality
It's important to support the visual findings with objective quantifications that describe the shape of the distribution and to look for outliers. We can do this by using the describe() function from the psych package or the stat.desc() function from the pastecs package.
-With the describe() function
-With stat.desc(): to reduce the amount of statistics we get back we can set the argument basic = FALSE, and to get statistics relating to the distribution we set the argument norm = TRUE.
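A sketch of both calls (illustrative column name):

```r
library(psych)
describe(golf$DrivingAccuracy)   # n, mean, sd, skew, kurtosis, se, ...

library(pastecs)
stat.desc(golf$DrivingAccuracy, basic = FALSE, norm = TRUE)  # adds skewness, kurtosis,
                                                             # and the Shapiro-Wilk test
```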
_____
Shapiro-Wilk Test for Normality
-Non-significant (p > .05) -> the sample is not significantly different from a normal distribution.
-Significant (p < .05) -> the distribution in question is significantly different from a normal distribution.
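shapiro.test() ships with base R's stats package; a sketch with illustrative column names:

```r
shapiro.test(golf$DrivingAccuracy)  # p > .05 would be consistent with normality
shapiro.test(golf$Earnings)         # p < .05 would indicate a non-normal distribution
```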
It's important to note that there are limitations to the Shapiro-Wilk test. As the dataset being evaluated gets larger, the Shapiro-Wilk test becomes more sensitive to small deviations, which leads to a greater probability of rejecting the null hypothesis (the null hypothesis being that the values come from a normal distribution).
The principal message is that we should not rely on only one approach to assess normality. Rather, we should examine the distribution visually, through descriptive statistics, and with formal testing procedures to conclude whether our data meet the normality assumption or not.
-Assumptions of Homogeneity
-Correlations
A bivariate analysis that measures the association between two variables. The value of the correlation coefficient varies between +1 and -1. A value near +1 implies a strong positive association, a value near -1 implies a strong negative association, and a value near 0 implies no association between the two variables.
Dataset and packages: What we need
Visualizing relationships: A picture is worth a thousand words
Pearson's correlation: The most widely used approach.
Spearman’s Rank correlation: Great when variables are measured on an ordinal scale. A non-parametric approach
Kendall’s tau: Another non-parametric approach. Less sensitive to outliers and more accurate with smaller sample sizes.
Partial correlation: Measuring the relationship between two variables while controlling for the effect of one or more covariates.
_____
Dataset and packages
Install the following packages:
Download the datasets here and here: the Golf dataset and the Survey dataset. The golf data has 18 variables and the survey data has 11 variables.
_____
Visualizing bivariate relationships
A scatterplot of two variables provides a vivid illustration of the relationship between two variables.
Contrast this with the following two plots, which show that as driving accuracy increases, the distance of the player's drive tends to decrease (Fig. B). In addition, we can easily see that as a player's age increases, their greens in regulation percentage does not appear to change (Fig. C).
In addition, scatter plots illustrate the linearity of the relationship, which can influence how you approach assessing correlations (e.g., data transformation, using a parametric vs. non-parametric test, removing outliers).
Francis Anscombe illustrated this in 1973 when he constructed four data sets that have the same mean, variance, and correlation; however, there are significant differences in the variable relationships. Using the anscombe data, which R has as a built-in data set, the plots below demonstrate the importance of graphing data rather than just relying on correlation coefficients. Each x-y combination in the plot below has a correlation of .82 (strong positive), but there are definite differences in the association between these variables.
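The quartet can be reproduced directly from the built-in data:

```r
# each x-y pair has essentially the same correlation (~0.82)
sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(op)
```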
Visualization can also give you a quick approach to assessing multiple relationships. We can produce scatter plot matrices in multiple ways to visualize and compare relationships across the entire data set we are analyzing. With base R plotting we can use the pairs() function. Let's look at the first 10 variables of the golf data set (minus the player name variable).
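For example (the column positions are illustrative; the idea is to drop the player name column and keep the numeric variables):

```r
pairs(golf[, 2:10])
```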
We can also use the corrgram and corrplot packages. Note that multiple options exist with both of these visualizations (e.g., formatting, correlation method applied, illustrating significance and confidence intervals, etc.).
With corrplot
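A minimal corrplot sketch (illustrative column selection):

```r
library(corrplot)

M <- cor(golf[, 2:10], use = "complete.obs")
corrplot(M, method = "circle")
```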
Once you’ve visualized the data and understand the associations that appear to be present and their attributes (strength, outliers, linearity) you can begin assessing the statistical relationship by applying the appropriate correlation method.
_____
Pearson’s Correlation: best for interval scale
The Pearson product-moment correlation coefficient measures the strength of the linear relationship between two variables and is represented by r when referring to a sample or ρ when referring to the population. Unfortunately, the assumptions for Pearson’s correlation are often overlooked.
Level of measurement: The variables should be continuous. If one or both of the variables are ordinal in measurement, then a Spearman rank correlation should be conducted.
Linear relationship: The variables need to be linearly related. If they are not, the data could be transformed (i.e. logarithmic transformation) or a non-parametric approach such as the Spearman’s or Kendall’s rank correlation tests could be used.
Homoscedasticity: If the variance between the two variables is not constant then r will not provide a good measure of association.
Bivariate Normality: Technically, Pearson’s $r$ does not require normality when the sample size is fairly large; however, when the variables consist of high levels of skewness or contain significant outliers it is recommended to use Spearman’s rank correlation or, at a minimum, compare Pearson’s and Spearman’s coefficients.
To calculate the correlation between two variables we use cor(). When using cor() there are two arguments (other than the variables) that need to be considered. The first is use =, which allows us to decide how to handle missing data. The default is use = "everything", but if there is missing data in your data set this will cause the output to be NA unless we explicitly state to only use complete observations with use = "complete.obs". The second argument is method =, which allows us to specify whether we want to use "pearson", "kendall", or "spearman". Pearson is the default method, so we do not need to specify it.
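For example (Age and YardsPerDrive are illustrative column names):

```r
cor(golf$Age, golf$YardsPerDrive, use = "complete.obs")                       # Pearson (default)
cor(golf$Age, golf$YardsPerDrive, use = "complete.obs", method = "spearman")  # Spearman
```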
Unfortunately, cor() only provides the r coefficient(s); it does not test for significance or provide confidence intervals. To get these parameters for a simple two-variable analysis I use cor.test(). In our example we see that the p-value is significant, and the 95% confidence interval confirms this, as the range does not contain zero. This suggests the correlation between age and yards per drive is r = -0.396, with 95% confidence that the true value lies between -0.27 and -0.51.
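The call itself is simple (illustrative column names):

```r
cor.test(golf$Age, golf$YardsPerDrive)  # r, p-value, and 95% confidence interval
```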
We can also get the correlation matrix and the p-values across all variables by using the rcorr() function in the Hmisc package.
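For example:

```r
library(Hmisc)
rcorr(as.matrix(golf[, 2:10]))  # correlation matrix together with n and p-values
```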
_____
Spearman’s Rank Correlation: best for ordinal scale
Common questions that a Spearman correlation answers include: Is there a statistically significant relationship between the age of a golf player and the number of wins (0, 1, 2)?
The assumptions for Spearman’s correlation include:
Level of measurement: The usual assumption is that one or both of the variables are ordinal in measurement; however, Spearman's is also appropriate when both variables are continuous but heavily skewed or contain sizable outliers.
Monotonically related: A linear relationship is not necessary; the only requirement is that one variable is monotonically related to the other variable.
To assess correlations with Spearman’s rank we can use the same functions introduced for the Pearson correlations and simply change the correlation method.
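For example, using cor.test() again (Wins is an illustrative column name):

```r
cor.test(golf$Age, golf$Wins, method = "spearman")
```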
-Univariate Statistical Inference
After receiving data and completing some initial data exploration, we typically move on to performing some univariate estimation and prediction tasks. Statistical inference helps you estimate parameters of a larger population when the observed data you are working with is a subset of that population.
Dataset & packages: What we need
Confidence intervals of the mean: Using our sample to create an expectation of the population mean.
Reducing margin of error: How can we create greater confidence in our estimates?
Confidence intervals of the proportion: Confidence intervals for categorical variables.
Hypothesis testing for the mean: Assessing evidence for claims of the mean estimate.
Hypothesis testing for the proportion: Assessing evidence for claims of proportions.
_____
Dataset & packages
#Text Mining
One of the most complete books on this topic: https://www.tidytextmining.com/
-Tidying Text & Word Frequency
A fundamental requirement to perform text mining is to get your text in a tidy format and perform word frequency analysis.
Dataset and packages
Text Tidying
Word Frequency
_____
Dataset and packages
We need the data provided in the harrypotter package. The other packages we need are:
The seven novels are:
philosophers_stone: Harry Potter and the Philosopher's Stone (1997)
chamber_of_secrets: Harry Potter and the Chamber of Secrets (1998)
prisoner_of_azkaban: Harry Potter and the Prisoner of Azkaban (1999)
goblet_of_fire: Harry Potter and the Goblet of Fire (2000)
order_of_the_phoenix: Harry Potter and the Order of the Phoenix (2003)
half_blood_prince: Harry Potter and the Half-Blood Prince (2005)
deathly_hallows: Harry Potter and the Deathly Hallows (2007)
_____
Text Tidying
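A minimal sketch of the tidying step with tidytext, assuming the harrypotter package stores each novel as a character vector with one element per chapter (an assumption about the package's data format):

```r
library(harrypotter)
library(tidytext)
library(dplyr)

ps_tidy <- tibble(chapter = seq_along(philosophers_stone),
                  text    = philosophers_stone) %>%
  unnest_tokens(word, text) %>%       # one row per word
  anti_join(stop_words, by = "word")  # drop common stop words

ps_tidy %>% count(word, sort = TRUE)  # word frequency
```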