Data Summary

Before doing anything else, it is important to understand the structure of the data:

•missing data

•cleaning / tidying

•plotting

•correlations

•outliers

•summary stats

Everything listed below is an R function. Some of them require additional packages.

General identification

  • View(data)

  • glimpse(data)

  • spec(data), for data read from a CSV file with readr

  • attributes(data)

  • class(data)
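
The calls above can be tried on any data frame. The following is a minimal sketch that sticks to base R and uses the built-in iris data set as a stand-in (an assumption for illustration, not the data this page was written for). str() is used as a base R counterpart to glimpse().

```r
# Built-in data set, used purely for illustration.
data <- iris

data_class <- class(data)               # class of the object
data_attrs <- names(attributes(data))   # which attributes are set
data_dim   <- dim(data)                 # number of rows and columns

# Compact overview of column types and first values (base R).
str(data)

# View(data) opens a spreadsheet-style viewer in an interactive session.
# glimpse(data) needs dplyr (or tibble); spec(data) needs readr and a
# data frame that was read with one of readr's read_*() functions.
```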

In-depth summarizing

1-summary(), from base R

2-skim(), from the skimr package

3-describe(), from the Hmisc package

4-stat.desc(), from the pastecs package

5-describe() and describeBy(), from the psych package

6-descr() and dfSummary(), from the summarytools package

7-CreateTableOne(), from the tableone package

8-desctable(), from the desctable package

9-ggpairs(), from the GGally package

10-ds_summary_stats(), from the descriptr package

11-With dlookr: an automated report (as PDF or HTML)

12-With the DataExplorer package
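
As a minimal sketch of the first entry in the list above, summary() from base R can be called on a whole data frame or on a single column. The built-in mtcars data set and the grouped mean via aggregate() are assumptions chosen for illustration. The other entries in the list need their respective packages installed.

```r
# Built-in data set, used purely for illustration.
data <- mtcars

# Per-column summaries for the whole data frame.
overall <- summary(data)
print(overall)

# Summary of a single numeric column: min, quartiles, mean, max.
mpg_summary <- summary(data$mpg)

# A simple grouped summary in base R: mean mpg per number of cylinders.
mpg_by_cyl <- aggregate(mpg ~ cyl, data = data, FUN = mean)
print(mpg_by_cyl)
```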

Specific identification

1-Identify duplicate values:

Find only the duplicated values in the primary key
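
One possible way to do this in base R is sketched below. The data frame and the column name id (standing in for the primary key) are hypothetical.

```r
# Hypothetical table; `id` stands in for the primary key.
data <- data.frame(
  id    = c(1, 2, 2, 3, 4, 4, 4),
  value = c("a", "b", "c", "d", "e", "f", "g")
)

# duplicated() flags the second and later occurrences of each key,
# so unique() is needed to list each duplicated key once.
dup_keys <- unique(data$id[duplicated(data$id)])

# All rows whose key occurs more than once.
dup_rows <- data[data$id %in% dup_keys, ]

# anyDuplicated() returns 0 when every key is unique.
key_is_unique <- anyDuplicated(data$id) == 0
```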

2-Identify NA values (Not Available):

http://naniar.njtierney.com/
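
Before reaching for a dedicated package, missing values can be counted with base R alone. The sketch below uses a small hypothetical data frame and does not use naniar.

```r
# Hypothetical data with missing values in both columns.
data <- data.frame(
  age  = c(25, NA, 31, 40, NA),
  city = c("Oslo", "Rome", NA, "Lima", "Kyiv")
)

# Number and share of missing values per column.
na_per_column <- colSums(is.na(data))
na_share      <- colMeans(is.na(data))

# Number of rows without any missing value.
complete_rows <- sum(complete.cases(data))

# Quick yes/no check for the whole data frame.
any_missing <- anyNA(data)
```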

3-Identify outliers:
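
A common rule of thumb flags values that lie more than 1.5 times the interquartile range outside the quartiles. The sketch below applies that rule to a hypothetical vector. It is one convention among several, not necessarily the method originally intended here.

```r
# Hypothetical measurements with one obviously extreme value.
x <- c(10, 12, 11, 13, 12, 11, 14, 95)

# 1.5 * IQR rule computed by hand.
q     <- quantile(x, c(0.25, 0.75))
iqr   <- q[[2]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[2]] + 1.5 * iqr
outliers <- x[x < lower | x > upper]

# boxplot.stats() applies a very similar rule (based on hinges).
box_out <- boxplot.stats(x)$out
```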

4-Plausibility check: numeric & non-numeric

Plausibility checks can include checking orders of magnitude and looking for implausible values (such as a negative body weight), among others. A good starting point is to separate numeric from non-numeric variables.
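
A minimal sketch of such a check in base R follows. The data frame, the plausible weight range (above 0 and at most 500 kg) and the allowed category codes are all assumptions made for illustration.

```r
# Hypothetical data with some implausible entries.
data <- data.frame(
  weight_kg = c(72.5, -4, 68, 7100),
  sex       = c("f", "m", "x", "f"),
  stringsAsFactors = FALSE
)

# Separate numeric from non-numeric variables.
is_num       <- vapply(data, is.numeric, logical(1))
numeric_vars <- names(data)[is_num]
other_vars   <- names(data)[!is_num]

# Numeric check: rows outside an assumed plausible range.
implausible_weight <- which(data$weight_kg <= 0 | data$weight_kg > 500)

# Non-numeric check: rows with a value outside the assumed allowed set.
unexpected_sex <- which(!data$sex %in% c("f", "m"))
```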

5-Highly correlated variables & covariance:
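
The sketch below computes correlation and covariance matrices with base R and lists variable pairs whose absolute correlation exceeds a threshold. The choice of mtcars columns and the threshold of 0.8 are assumptions for illustration.

```r
# A few numeric columns from a built-in data set.
num <- mtcars[, c("mpg", "disp", "hp", "wt")]

cor_mat <- cor(num)   # Pearson correlation matrix
cov_mat <- cov(num)   # covariance matrix

# Assumed cut-off for "highly correlated".
threshold <- 0.8

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```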

6-Mode: Unimodal or Bimodal distribution:
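
One rough way to judge modality is to count the peaks of a kernel density estimate. The helper count_modes() below is hypothetical, and its result depends on the bandwidth and on the assumed minimum relative peak height, so it is a heuristic sketch rather than a formal test.

```r
# Hypothetical helper: count local maxima of a kernel density estimate,
# ignoring peaks lower than `min_rel_height` times the highest peak.
count_modes <- function(x, min_rel_height = 0.1) {
  d <- density(x)
  # A local maximum is where the slope changes from positive to negative.
  peak_idx <- which(diff(sign(diff(d$y))) == -2) + 1
  sum(d$y[peak_idx] >= min_rel_height * max(d$y))
}

# Deterministic illustration data built from normal quantiles.
unimodal <- qnorm(ppoints(400))
bimodal  <- c(qnorm(ppoints(200), mean = 0), qnorm(ppoints(200), mean = 10))

n_modes_uni <- count_modes(unimodal)
n_modes_bi  <- count_modes(bimodal)

# A histogram or density plot gives a visual check, for example:
# plot(density(bimodal))
```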

7-Principal Components Analysis:
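
A minimal sketch with prcomp() from the stats package follows. The built-in USArrests data set and the decision to center and scale the variables are assumptions for illustration.

```r
# PCA on a built-in numeric data set, with standardized variables.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

summary(pca)               # standard deviations and variance proportions

pca_loadings <- pca$rotation   # variable loadings per component
pca_scores   <- pca$x          # observations in component space
```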

8-Factor Analysis:
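
The sketch below runs a maximum-likelihood factor analysis with factanal() from the stats package. The data are simulated with two known latent factors, and the number of factors and the varimax rotation are assumptions chosen so that the result is easy to read.

```r
# Simulated data: x1-x3 are driven by one latent factor, x4-x6 by another.
set.seed(1)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
data <- data.frame(
  x1 = f1 + rnorm(n, sd = 0.5),
  x2 = f1 + rnorm(n, sd = 0.5),
  x3 = f1 + rnorm(n, sd = 0.5),
  x4 = f2 + rnorm(n, sd = 0.5),
  x5 = f2 + rnorm(n, sd = 0.5),
  x6 = f2 + rnorm(n, sd = 0.5)
)

# Two-factor model with varimax rotation.
fa <- factanal(data, factors = 2, rotation = "varimax")
print(fa)

# Loadings as a plain matrix, and the dominant factor for each variable.
fa_loadings     <- unclass(loadings(fa))
dominant_factor <- apply(abs(fa_loadings), 1, which.max)
```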

9-Bootstrap Resampling:
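
A minimal sketch of a nonparametric bootstrap written by hand in base R follows. The statistic (the mean of mtcars$mpg), the number of resamples and the percentile interval are assumptions for illustration.

```r
set.seed(42)                 # for reproducible resamples

x      <- mtcars$mpg         # built-in data, used for illustration
n_boot <- 2000               # assumed number of bootstrap resamples

# Resample with replacement and recompute the statistic each time.
boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))

# Bootstrap standard error and a 95% percentile interval for the mean.
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))
```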

FULL SUMMARY:
