Working with random DATE variable
Intro
Creating a Random date between two date intervals can be a bit challenging in R. Here is my learning process how to do it.
We will be working on creating synthetic dataset, which has date variables such as
- DATE_BIRTH can be a date value between any two dates
- DATE_DEATH is values above the DATE_BIRTH, but can be also NA
- DATE_MIGRATION is values above the DATE_BIRTH and below the DATE_DEATH, but can be also NA
First we need few packages to this scenario:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Easiest is creating a value between two dates:
population_synth <- tibble::tibble(
ID = seq(1,1000,1),
DATE_BIRTH = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000),
DATE_DEATH = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000),
DATE_MIGRATION = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000)
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1932-01-04
## 1st Qu.: 250.8 1st Qu.:1935-09-04 1st Qu.:1935-07-07
## Median : 500.5 Median :1939-05-28 Median :1939-03-18
## Mean : 500.5 Mean :1939-06-08 Mean :1939-05-07
## 3rd Qu.: 750.2 3rd Qu.:1943-02-22 3rd Qu.:1943-05-15
## Max. :1000.0 Max. :1946-12-23 Max. :1946-12-27
## DATE_MIGRATION
## Min. :1932-01-03
## 1st Qu.:1935-08-23
## Median :1939-02-07
## Mean :1939-05-01
## 3rd Qu.:1943-01-26
## Max. :1946-12-31
Creating a random DATE between two dates
Also we want possibility it to have NA
value, in this case 54,7 % of time
## [1] "1946-06-17"
And we can see that it will provide random values when looping it 10 times
for(i in 1:10){
print(sample(c(sample(seq(as.Date("1936/12/21"), as.Date("2023/12/21"), 1), 453), rep(NA, 547)),1))
}
## [1] NA
## [1] "1955-12-12"
## [1] "1967-06-05"
## [1] "1957-10-09"
## [1] "1977-11-15"
## [1] "1936-12-26"
## [1] NA
## [1] "2015-06-01"
## [1] "1963-08-05"
## [1] NA
But if we include this function to data frame mutate()-function we find a problem:
population_synth <- population_synth %>%
mutate(
DATE_DEATH = sample(c(sample(seq(as.Date("1936/12/21"), as.Date("2023/12/21"), 1), 453), rep(NA, 547)),1)
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH DATE_MIGRATION
## Min. : 1.0 Min. :1932-01-02 Min. :NA Min. :1932-01-03
## 1st Qu.: 250.8 1st Qu.:1935-09-04 1st Qu.:NA 1st Qu.:1935-08-23
## Median : 500.5 Median :1939-05-28 Median :NA Median :1939-02-07
## Mean : 500.5 Mean :1939-06-08 Mean :NaN Mean :1939-05-01
## 3rd Qu.: 750.2 3rd Qu.:1943-02-22 3rd Qu.:NA 3rd Qu.:1943-01-26
## Max. :1000.0 Max. :1946-12-23 Max. :NA Max. :1946-12-31
## NA's :1000
In this scenario sample()
provided only ONE and SAME date for each observation. This is due using the same seed number, most likely?
Adding a function
Let’s work with random date on a function way, so that we have a function, which takes few parameters and creates a random date variable. Here variables are:
- start = start point where to pick date value
- end = end point where to pick date value
- na = how likely is to create a NA value
- n = how many dates we are picking
How things change if we pick another seed number for each time the function runs?
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(i) ## SET NEW SEED!
start <- as.Date(start)
end <- as.Date(end)
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
for(i in 1:10){
print(random_date())
}
## [1] "1968-09-06"
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] "2017-07-17"
## [1] "2001-10-03"
## [1] "1956-02-02"
Now we are using different seed number and we can observe how things change when we include this function to data creation process:
population_synth <- population_synth %>%
mutate(
DATE_DEATH = random_date()
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1956-02-02
## 1st Qu.: 250.8 1st Qu.:1935-09-04 1st Qu.:1956-02-02
## Median : 500.5 Median :1939-05-28 Median :1956-02-02
## Mean : 500.5 Mean :1939-06-08 Mean :1956-02-02
## 3rd Qu.: 750.2 3rd Qu.:1943-02-22 3rd Qu.:1956-02-02
## Max. :1000.0 Max. :1946-12-23 Max. :1956-02-02
## DATE_MIGRATION
## Min. :1932-01-03
## 1st Qu.:1935-08-23
## Median :1939-02-07
## Mean :1939-05-01
## 3rd Qu.:1943-01-26
## Max. :1946-12-31
Noup! We cannot get changing date value, because as run-wisely function run each time using seed number 1! Think about what time the function runs when creating a data and what are the values.
Even when making setting seed more sophisticated, creating a random value each time, it works well creating a random value in a loop, but when including the function to data creation, same thing happens.
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(runif(1, 1, 10000)) ## SET SEED number!
start <- as.Date(start)
end <- as.Date(end)
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
## Run once
random_date()
## [1] NA
## Run function 10 times
for(i in 1:10){
print(random_date())
}
## [1] "1967-10-30"
## [1] NA
## [1] "2002-12-21"
## [1] NA
## [1] "2008-06-18"
## [1] NA
## [1] "1962-01-30"
## [1] NA
## [1] "2013-10-18"
## [1] NA
# Run function in data creation
population_synth <- population_synth %>%
mutate(
DATE_DEATH = random_date()
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH DATE_MIGRATION
## Min. : 1.0 Min. :1932-01-02 Min. :NA Min. :1932-01-03
## 1st Qu.: 250.8 1st Qu.:1935-09-04 1st Qu.:NA 1st Qu.:1935-08-23
## Median : 500.5 Median :1939-05-28 Median :NA Median :1939-02-07
## Mean : 500.5 Mean :1939-06-08 Mean :NaN Mean :1939-05-01
## 3rd Qu.: 750.2 3rd Qu.:1943-02-22 3rd Qu.:NA 3rd Qu.:1943-01-26
## Max. :1000.0 Max. :1946-12-23 Max. :NA Max. :1946-12-31
## NA's :1000
Magic to solve the problem
These problems may occur because of underlying way each function runs in tidyverse, so that using a function rowwise()
clears this problem!
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(runif(1, 1, 10000)) ## SET SEED number!
start <- as.Date(start)
end <- as.Date(end)
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
# Run function in data creation
population_synth <- population_synth %>%
rowwise() %>%
mutate(
DATE_DEATH = random_date()
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1957-11-17
## 1st Qu.: 250.8 1st Qu.:1935-09-04 1st Qu.:1978-05-30
## Median : 500.5 Median :1939-05-28 Median :1989-08-07
## Mean : 500.5 Mean :1939-06-08 Mean :1989-02-28
## 3rd Qu.: 750.2 3rd Qu.:1943-02-22 3rd Qu.:1994-05-27
## Max. :1000.0 Max. :1946-12-23 Max. :2021-06-21
## NA's :530
## DATE_MIGRATION
## Min. :1932-01-03
## 1st Qu.:1935-08-23
## Median :1939-02-07
## Mean :1939-05-01
## 3rd Qu.:1943-01-26
## Max. :1946-12-31
##
Clean version
We can see that this works well on creating DATE_DEATH
, but for DATE_MIGRATION
we need to add another rule to the function, when DATE_DEATH is NA. This is covered in by adding a simple rule to the function
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), master_end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(runif(1, 1, 10000)) ## SET SEED number!
start <- as.Date(start)
end <- as.Date(end)
if(is.na(end)){
end <- as.Date(master_end)
}
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
# Run function in data creation
population_synth <- population_synth %>%
rowwise() %>%
mutate(
DATE_DEATH = random_date(start=DATE_BIRTH),
DATE_MIGRATION = random_date(start = DATE_BIRTH, end = DATE_DEATH, na = .97)
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1932-06-28
## 1st Qu.: 250.8 1st Qu.:1935-09-04 1st Qu.:1961-04-25
## Median : 500.5 Median :1939-05-28 Median :1982-03-31
## Mean : 500.5 Mean :1939-06-08 Mean :1981-01-14
## 3rd Qu.: 750.2 3rd Qu.:1943-02-22 3rd Qu.:2002-01-26
## Max. :1000.0 Max. :1946-12-23 Max. :2023-11-18
## NA's :573
## DATE_MIGRATION
## Min. :1934-09-09
## 1st Qu.:1955-01-02
## Median :1973-12-30
## Mean :1973-10-02
## 3rd Qu.:1991-07-28
## Max. :2013-10-22
## NA's :963
Normally distributed random date function
Now if we want to random date from specific normal distribution, we can create a function to provide this. In this example I have provided mean
and sd
from OSTPRE population data as base values.
## Random date by distribution
random_date_dist <- function(n=1, mean = -11643, sd = 1387, origin = "1970-01-01"){
# as.Date(round(rnorm(n = 1, mean = as.integer(mean(population$DATE_BIRTH)), sd = as.integer(sd(population$DATE_BIRTH))), 0), origin = "1970-01-01")
as.Date(round(rnorm(n = n, mean = mean, sd = sd), 0), origin = "1970-01-01")
}
random_date_dist()
## [1] "1936-08-30"