Working with random DATE variable
Intro
Creating a Random date between two date intervals can be a bit challenging in R. Here is my learning process how to do it.
We will be working on creating synthetic dataset, which has date variables such as
- DATE_BIRTH can be a date value between any two dates
- DATE_DEATH is values above the DATE_BIRTH, but can be also NA
- DATE_MIGRATION is values above the DATE_BIRTH and below the DATE_DEATH, but can be also NA
First we need few packages to this scenario:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
Easiest is creating a value between two dates:
population_synth <- tibble::tibble(
ID = seq(1,1000,1),
DATE_BIRTH = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000),
DATE_DEATH = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000),
DATE_MIGRATION = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000)
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1932-01-01
## 1st Qu.: 250.8 1st Qu.:1935-10-22 1st Qu.:1935-09-20
## Median : 500.5 Median :1939-04-01 Median :1939-06-17
## Mean : 500.5 Mean :1939-05-25 Mean :1939-07-23
## 3rd Qu.: 750.2 3rd Qu.:1943-01-31 3rd Qu.:1943-06-01
## Max. :1000.0 Max. :1946-12-30 Max. :1946-12-30
## DATE_MIGRATION
## Min. :1932-01-01
## 1st Qu.:1935-09-30
## Median :1939-08-20
## Mean :1939-07-26
## 3rd Qu.:1943-05-02
## Max. :1946-12-31
Creating a random DATE between two dates
Also we want possibility it to have NA value, in this case 54,7 % of time
## [1] NA
And we can see that it will provide random values when looping it 10 times
for(i in 1:10){
print(sample(c(sample(seq(as.Date("1936/12/21"), as.Date("2023/12/21"), 1), 453), rep(NA, 547)),1))
}
## [1] "1961-03-19"
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] "1943-10-03"
## [1] "1979-09-20"
## [1] "1980-01-01"
But if we include this function to data frame mutate()-function we find a problem:
population_synth <- population_synth %>%
mutate(
DATE_DEATH = sample(c(sample(seq(as.Date("1936/12/21"), as.Date("2023/12/21"), 1), 453), rep(NA, 547)),1)
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1967-03-04
## 1st Qu.: 250.8 1st Qu.:1935-10-22 1st Qu.:1967-03-04
## Median : 500.5 Median :1939-04-01 Median :1967-03-04
## Mean : 500.5 Mean :1939-05-25 Mean :1967-03-04
## 3rd Qu.: 750.2 3rd Qu.:1943-01-31 3rd Qu.:1967-03-04
## Max. :1000.0 Max. :1946-12-30 Max. :1967-03-04
## DATE_MIGRATION
## Min. :1932-01-01
## 1st Qu.:1935-09-30
## Median :1939-08-20
## Mean :1939-07-26
## 3rd Qu.:1943-05-02
## Max. :1946-12-31
In this scenario sample() provided only ONE and SAME date for each observation. This is due using the same seed number, most likely?
Adding a function
Let’s work with random date on a function way, so that we have a function, which takes few parameters and creates a random date variable. Here variables are:
- start = start point where to pick date value
- end = end point where to pick date value
- na = how likely is to create a NA value
- n = how many dates we are picking
How things change if we pick another seed number for each time the function runs?
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(i) ## SET NEW SEED!
start <- as.Date(start)
end <- as.Date(end)
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
for(i in 1:10){
print(random_date())
}
## [1] "1968-09-06"
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] "2017-07-17"
## [1] "2001-10-03"
## [1] "1956-02-02"
Now we are using different seed number and we can observe how things change when we include this function to data creation process:
population_synth <- population_synth %>%
mutate(
DATE_DEATH = random_date()
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1956-02-02
## 1st Qu.: 250.8 1st Qu.:1935-10-22 1st Qu.:1956-02-02
## Median : 500.5 Median :1939-04-01 Median :1956-02-02
## Mean : 500.5 Mean :1939-05-25 Mean :1956-02-02
## 3rd Qu.: 750.2 3rd Qu.:1943-01-31 3rd Qu.:1956-02-02
## Max. :1000.0 Max. :1946-12-30 Max. :1956-02-02
## DATE_MIGRATION
## Min. :1932-01-01
## 1st Qu.:1935-09-30
## Median :1939-08-20
## Mean :1939-07-26
## 3rd Qu.:1943-05-02
## Max. :1946-12-31
Noup! We cannot get changing date value, because as run-wisely function run each time using seed number 1! Think about what time the function runs when creating a data and what are the values.
Even when making setting seed more sophisticated, creating a random value each time, it works well creating a random value in a loop, but when including the function to data creation, same thing happens.
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(runif(1, 1, 10000)) ## SET SEED number!
start <- as.Date(start)
end <- as.Date(end)
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
## Run once
random_date()
## [1] NA
## Run function 10 times
for(i in 1:10){
print(random_date())
}
## [1] "1967-10-30"
## [1] NA
## [1] "2002-12-21"
## [1] NA
## [1] "2008-06-18"
## [1] NA
## [1] "1962-01-30"
## [1] NA
## [1] "2013-10-18"
## [1] NA
# Run function in data creation
population_synth <- population_synth %>%
mutate(
DATE_DEATH = random_date()
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH DATE_MIGRATION
## Min. : 1.0 Min. :1932-01-02 Min. :NA Min. :1932-01-01
## 1st Qu.: 250.8 1st Qu.:1935-10-22 1st Qu.:NA 1st Qu.:1935-09-30
## Median : 500.5 Median :1939-04-01 Median :NA Median :1939-08-20
## Mean : 500.5 Mean :1939-05-25 Mean :NaN Mean :1939-07-26
## 3rd Qu.: 750.2 3rd Qu.:1943-01-31 3rd Qu.:NA 3rd Qu.:1943-05-02
## Max. :1000.0 Max. :1946-12-30 Max. :NA Max. :1946-12-31
## NA's :1000
Magic to solve the problem
These problems may occur because of underlying way each function runs in tidyverse, so that using a function rowwise() clears this problem!
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
set.seed(runif(1, 1, 10000)) ## SET SEED number!
start <- as.Date(start)
end <- as.Date(end)
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
# Run function in data creation
population_synth <- population_synth %>%
rowwise() %>%
mutate(
DATE_DEATH = random_date()
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1957-11-17
## 1st Qu.: 250.8 1st Qu.:1935-10-22 1st Qu.:1978-05-30
## Median : 500.5 Median :1939-04-01 Median :1989-08-07
## Mean : 500.5 Mean :1939-05-25 Mean :1989-02-28
## 3rd Qu.: 750.2 3rd Qu.:1943-01-31 3rd Qu.:1994-05-27
## Max. :1000.0 Max. :1946-12-30 Max. :2021-06-21
## NA's :530
## DATE_MIGRATION
## Min. :1932-01-01
## 1st Qu.:1935-09-30
## Median :1939-08-20
## Mean :1939-07-26
## 3rd Qu.:1943-05-02
## Max. :1946-12-31
##
Clean version
We can see that this works well on creating DATE_DEATH, but for DATE_MIGRATION we need to add another rule to the function, when DATE_DEATH is NA. This is covered in by adding a simple rule to the function
set.seed(1)
random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), master_end = as.Date("2023/12/31"), na=.56 ,n=1){
start <- as.Date(start)
end <- as.Date(end)
if(is.na(end)){
end <- as.Date(master_end)
}
sekvenssi <- seq(start, end, by = "day")
sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
sekvenssi <- c(sekvenssi, rep(NA, 100*na))
sample(sekvenssi, n)
}
# Run function in data creation
population_synth <- population_synth %>%
rowwise() %>%
mutate(
DATE_DEATH = random_date(start=DATE_BIRTH),
DATE_MIGRATION = random_date(start = DATE_BIRTH, end = DATE_DEATH, na = .97)
)
summary(population_synth)
## ID DATE_BIRTH DATE_DEATH
## Min. : 1.0 Min. :1932-01-02 Min. :1934-02-17
## 1st Qu.: 250.8 1st Qu.:1935-10-22 1st Qu.:1958-07-26
## Median : 500.5 Median :1939-04-01 Median :1980-08-23
## Mean : 500.5 Mean :1939-05-25 Mean :1981-07-22
## 3rd Qu.: 750.2 3rd Qu.:1943-01-31 3rd Qu.:2003-07-26
## Max. :1000.0 Max. :1946-12-30 Max. :2023-11-17
## NA's :563
## DATE_MIGRATION
## Min. :1933-11-10
## 1st Qu.:1951-04-05
## Median :1972-05-16
## Mean :1975-06-20
## 3rd Qu.:1994-10-23
## Max. :2023-08-03
## NA's :966
Normally distributed random date function
Now if we want to random date from specific normal distribution, we can create a function to provide this. In this example I have provided mean and sd from OSTPRE population data as base values.
## Random date by distribution
random_date_dist <- function(n=1, mean = -11643, sd = 1387, origin = "1970-01-01"){
# as.Date(round(rnorm(n = 1, mean = as.integer(mean(population$DATE_BIRTH)), sd = as.integer(sd(population$DATE_BIRTH))), 0), origin = "1970-01-01")
as.Date(round(rnorm(n = n, mean = mean, sd = sd), 0), origin = "1970-01-01")
}
random_date_dist()
## [1] "1933-12-13"