Tidy data
First example shows basic way to clean dataset to tidy data.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Religion data, make it tidy
pew <- read.delim(
file = "http://stat405.had.co.nz/data/pew.txt",
header = TRUE,
stringsAsFactors = FALSE,
check.names = FALSE
)
# How many variables there is?
head(pew)
## religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t know/refused 15 14 15 11 10 35 21
## 6 Evangelical Prot 575 869 1064 982 881 1486 949
## $100-150k >150k Don't know/refused
## 1 109 84 96
## 2 59 74 76
## 3 39 53 54
## 4 792 633 1489
## 5 17 18 116
## 6 723 414 1529
# Three variables: religion, income and frequency.
# To make this dataset to tidy, we need to melt or stack it (make wide to long)
pew_tidy <- gather(pew, income, freq, `<$10k`:`Don't know/refused`)
head(pew_tidy)
## religion income freq
## 1 Agnostic <$10k 27
## 2 Atheist <$10k 12
## 3 Buddhist <$10k 27
## 4 Catholic <$10k 418
## 5 Don’t know/refused <$10k 15
## 6 Evangelical Prot <$10k 575
Second example is a bit more complicated dataset. This data seems to have two problems. First, it has variables in the rows in the column ‘element’ and second, it has a variable ‘d’ in the column header spread across multiple columns.
# This more complicated example.
weather <- read.delim(
file = "http://stat405.had.co.nz/data/weather.txt",
strip.white = FALSE
)
# How does dataset look like?
head(weather)
## id year month element d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13
## 1 MX000017004 2010 1 TMAX NA NA NA NA NA NA NA NA NA NA NA NA NA
## 2 MX000017004 2010 1 TMIN NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 MX000017004 2010 2 TMAX NA 273 241 NA NA NA NA NA NA NA 297 NA NA
## 4 MX000017004 2010 2 TMIN NA 144 144 NA NA NA NA NA NA NA 134 NA NA
## 5 MX000017004 2010 3 TMAX NA NA NA NA 321 NA NA NA NA 345 NA NA NA
## 6 MX000017004 2010 3 TMIN NA NA NA NA 142 NA NA NA NA 168 NA NA NA
## d14 d15 d16 d17 d18 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31
## 1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 278 NA
## 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 145 NA
## 3 NA NA NA NA NA NA NA NA NA 299 NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA NA 107 NA NA NA NA NA NA NA NA
## 5 NA NA 311 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 6 NA NA 176 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# First melt, put variables 'd1:d31' name to variable 'day' and value it consist to variable 'value'
weather_tidy <- gather(weather, day, value, d1:d31)
# make date variable
weather_tidy$date <- as.Date( paste0(weather_tidy$year, "-", weather_tidy$month, "-", substr(weather_tidy$day, 2, nchar(weather_tidy$day))) )
# Remove variables year, month, day
weather_tidy <- weather_tidy[ , c("id", "date", "element", "value")]
# Remove NA's
weather_tidy <- weather_tidy[!is.na(weather_tidy$value),]
head(weather_tidy) # check data
## id date element value
## 21 MX000017004 2010-12-01 TMAX 299
## 22 MX000017004 2010-12-01 TMIN 138
## 25 MX000017004 2010-02-02 TMAX 273
## 26 MX000017004 2010-02-02 TMIN 144
## 41 MX000017004 2010-11-02 TMAX 313
## 42 MX000017004 2010-11-02 TMIN 163
# Second melt, variable element is actually two variable TMAX and TMIN. So we need to expand that to own variables.
weather_tidy <- spread(weather_tidy, element, value)
head(weather_tidy) # done!
## id date TMAX TMIN
## 1 MX000017004 2010-01-30 278 145
## 2 MX000017004 2010-02-02 273 144
## 3 MX000017004 2010-02-03 241 144
## 4 MX000017004 2010-02-11 297 134
## 5 MX000017004 2010-02-23 299 107
## 6 MX000017004 2010-03-05 321 142
There are vairous features of messy data