Tidy data practise

2015/05/22

Tidy data

First example shows basic way to clean dataset to tidy data.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# Religion data, make it tidy
pew <- read.delim(
  file = "http://stat405.had.co.nz/data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = FALSE
)
# How many variables there is?
head(pew)
##             religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
## 1           Agnostic    27      34      60      81      76     137      122
## 2            Atheist    12      27      37      52      35      70       73
## 3           Buddhist    27      21      30      34      33      58       62
## 4           Catholic   418     617     732     670     638    1116      949
## 5 Don’t know/refused    15      14      15      11      10      35       21
## 6   Evangelical Prot   575     869    1064     982     881    1486      949
##   $100-150k >150k Don't know/refused
## 1       109    84                 96
## 2        59    74                 76
## 3        39    53                 54
## 4       792   633               1489
## 5        17    18                116
## 6       723   414               1529
# Three variables: religion, income and frequency.

# To make this dataset to tidy, we need to melt or stack it (make wide to long)
pew_tidy <- gather(pew, income, freq, `<$10k`:`Don't know/refused`)
head(pew_tidy)
##             religion income freq
## 1           Agnostic  <$10k   27
## 2            Atheist  <$10k   12
## 3           Buddhist  <$10k   27
## 4           Catholic  <$10k  418
## 5 Don’t know/refused  <$10k   15
## 6   Evangelical Prot  <$10k  575

Second example is a bit more complicated dataset. This data seems to have two problems. First, it has variables in the rows in the column ‘element’ and second, it has a variable ‘d’ in the column header spread across multiple columns.

# This more complicated example.
weather <- read.delim(
  file = "http://stat405.had.co.nz/data/weather.txt",
  strip.white = FALSE
)
# How does dataset look like?
head(weather)
##            id year month element d1  d2  d3 d4  d5 d6 d7 d8 d9 d10 d11 d12 d13
## 1 MX000017004 2010     1    TMAX NA  NA  NA NA  NA NA NA NA NA  NA  NA  NA  NA
## 2 MX000017004 2010     1    TMIN NA  NA  NA NA  NA NA NA NA NA  NA  NA  NA  NA
## 3 MX000017004 2010     2    TMAX NA 273 241 NA  NA NA NA NA NA  NA 297  NA  NA
## 4 MX000017004 2010     2    TMIN NA 144 144 NA  NA NA NA NA NA  NA 134  NA  NA
## 5 MX000017004 2010     3    TMAX NA  NA  NA NA 321 NA NA NA NA 345  NA  NA  NA
## 6 MX000017004 2010     3    TMIN NA  NA  NA NA 142 NA NA NA NA 168  NA  NA  NA
##   d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31
## 1  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 278  NA
## 2  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA 145  NA
## 3  NA  NA  NA  NA  NA  NA  NA  NA  NA 299  NA  NA  NA  NA  NA  NA  NA  NA
## 4  NA  NA  NA  NA  NA  NA  NA  NA  NA 107  NA  NA  NA  NA  NA  NA  NA  NA
## 5  NA  NA 311  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
## 6  NA  NA 176  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA  NA
# First melt, put variables 'd1:d31' name to variable 'day' and value it consist to variable 'value'
weather_tidy <- gather(weather, day, value, d1:d31)
# make date variable
weather_tidy$date <- as.Date( paste0(weather_tidy$year, "-", weather_tidy$month, "-", substr(weather_tidy$day, 2, nchar(weather_tidy$day))) )
# Remove variables year, month, day
weather_tidy <- weather_tidy[ , c("id", "date", "element", "value")]
# Remove NA's
weather_tidy <-  weather_tidy[!is.na(weather_tidy$value),]

head(weather_tidy) # check data
##             id       date element value
## 21 MX000017004 2010-12-01    TMAX   299
## 22 MX000017004 2010-12-01    TMIN   138
## 25 MX000017004 2010-02-02    TMAX   273
## 26 MX000017004 2010-02-02    TMIN   144
## 41 MX000017004 2010-11-02    TMAX   313
## 42 MX000017004 2010-11-02    TMIN   163
# Second melt, variable element is actually two variable TMAX and TMIN. So we need to expand that to own variables.
weather_tidy <- spread(weather_tidy, element, value)

head(weather_tidy) # done!
##            id       date TMAX TMIN
## 1 MX000017004 2010-01-30  278  145
## 2 MX000017004 2010-02-02  273  144
## 3 MX000017004 2010-02-03  241  144
## 4 MX000017004 2010-02-11  297  134
## 5 MX000017004 2010-02-23  299  107
## 6 MX000017004 2010-03-05  321  142

There are vairous features of messy data