Working with random DATE variable

Intro

Creating a Random date between two date intervals can be a bit challenging in R. Here is my learning process how to do it.

We will be working on creating synthetic dataset, which has date variables such as

DATE_BIRTH can be a date value between any two dates
DATE_DEATH is values above the DATE_BIRTH, but can be also NA
DATE_MIGRATION is values above the DATE_BIRTH and below the DATE_DEATH, but can be also NA

First we need few packages to this scenario:

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Easiest is creating a value between two dates:

population_synth <- tibble::tibble(
  ID = seq(1,1000,1),
  DATE_BIRTH = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000),
  DATE_DEATH = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000),
  DATE_MIGRATION = sample(seq(as.Date('1932/01/01'), as.Date('1946/12/31'), by="day"), 1000)
)
summary(population_synth)

##        ID           DATE_BIRTH           DATE_DEATH        
##  Min.   :   1.0   Min.   :1932-01-02   Min.   :1932-01-04  
##  1st Qu.: 250.8   1st Qu.:1935-09-04   1st Qu.:1935-07-07  
##  Median : 500.5   Median :1939-05-28   Median :1939-03-18  
##  Mean   : 500.5   Mean   :1939-06-08   Mean   :1939-05-07  
##  3rd Qu.: 750.2   3rd Qu.:1943-02-22   3rd Qu.:1943-05-15  
##  Max.   :1000.0   Max.   :1946-12-23   Max.   :1946-12-27  
##  DATE_MIGRATION      
##  Min.   :1932-01-03  
##  1st Qu.:1935-08-23  
##  Median :1939-02-07  
##  Mean   :1939-05-01  
##  3rd Qu.:1943-01-26  
##  Max.   :1946-12-31

Creating a random DATE between two dates

Also we want possibility it to have NA value, in this case 54,7 % of time

## [1] "1946-06-17"

And we can see that it will provide random values when looping it 10 times

for(i in 1:10){
  print(sample(c(sample(seq(as.Date("1936/12/21"), as.Date("2023/12/21"), 1), 453), rep(NA, 547)),1))
}

## [1] NA
## [1] "1955-12-12"
## [1] "1967-06-05"
## [1] "1957-10-09"
## [1] "1977-11-15"
## [1] "1936-12-26"
## [1] NA
## [1] "2015-06-01"
## [1] "1963-08-05"
## [1] NA

But if we include this function to data frame mutate()-function we find a problem:

population_synth <- population_synth %>% 
  mutate(
    DATE_DEATH = sample(c(sample(seq(as.Date("1936/12/21"), as.Date("2023/12/21"), 1), 453), rep(NA, 547)),1)
  ) 
summary(population_synth)

##        ID           DATE_BIRTH           DATE_DEATH   DATE_MIGRATION      
##  Min.   :   1.0   Min.   :1932-01-02   Min.   :NA     Min.   :1932-01-03  
##  1st Qu.: 250.8   1st Qu.:1935-09-04   1st Qu.:NA     1st Qu.:1935-08-23  
##  Median : 500.5   Median :1939-05-28   Median :NA     Median :1939-02-07  
##  Mean   : 500.5   Mean   :1939-06-08   Mean   :NaN    Mean   :1939-05-01  
##  3rd Qu.: 750.2   3rd Qu.:1943-02-22   3rd Qu.:NA     3rd Qu.:1943-01-26  
##  Max.   :1000.0   Max.   :1946-12-23   Max.   :NA     Max.   :1946-12-31  
##                                        NA's   :1000

In this scenario sample() provided only ONE and SAME date for each observation. This is due using the same seed number, most likely?

Adding a function

Let’s work with random date on a function way, so that we have a function, which takes few parameters and creates a random date variable. Here variables are:

start = start point where to pick date value
end = end point where to pick date value
na = how likely is to create a NA value
n = how many dates we are picking

How things change if we pick another seed number for each time the function runs?

random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
  set.seed(i) ## SET NEW SEED!
  start <- as.Date(start)
  end <- as.Date(end)
  sekvenssi <-  seq(start, end, by = "day")
  sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
  sekvenssi <- c(sekvenssi, rep(NA, 100*na))
  sample(sekvenssi, n)
}
for(i in 1:10){
  print(random_date())
}

## [1] "1968-09-06"
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] NA
## [1] "2017-07-17"
## [1] "2001-10-03"
## [1] "1956-02-02"

Now we are using different seed number and we can observe how things change when we include this function to data creation process:

population_synth <- population_synth %>% 
  mutate(
    DATE_DEATH = random_date()
  ) 
summary(population_synth)

##        ID           DATE_BIRTH           DATE_DEATH        
##  Min.   :   1.0   Min.   :1932-01-02   Min.   :1956-02-02  
##  1st Qu.: 250.8   1st Qu.:1935-09-04   1st Qu.:1956-02-02  
##  Median : 500.5   Median :1939-05-28   Median :1956-02-02  
##  Mean   : 500.5   Mean   :1939-06-08   Mean   :1956-02-02  
##  3rd Qu.: 750.2   3rd Qu.:1943-02-22   3rd Qu.:1956-02-02  
##  Max.   :1000.0   Max.   :1946-12-23   Max.   :1956-02-02  
##  DATE_MIGRATION      
##  Min.   :1932-01-03  
##  1st Qu.:1935-08-23  
##  Median :1939-02-07  
##  Mean   :1939-05-01  
##  3rd Qu.:1943-01-26  
##  Max.   :1946-12-31

Noup! We cannot get changing date value, because as run-wisely function run each time using seed number 1! Think about what time the function runs when creating a data and what are the values.

Even when making setting seed more sophisticated, creating a random value each time, it works well creating a random value in a loop, but when including the function to data creation, same thing happens.

random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
  set.seed(runif(1, 1, 10000)) ## SET SEED number!
  start <- as.Date(start)
  end <- as.Date(end)
  sekvenssi <-  seq(start, end, by = "day")
  sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
  sekvenssi <- c(sekvenssi, rep(NA, 100*na))
  sample(sekvenssi, n)
}
## Run once
random_date()

## [1] NA

## Run function 10 times
for(i in 1:10){
  print(random_date())
}

## [1] "1967-10-30"
## [1] NA
## [1] "2002-12-21"
## [1] NA
## [1] "2008-06-18"
## [1] NA
## [1] "1962-01-30"
## [1] NA
## [1] "2013-10-18"
## [1] NA

# Run function in data creation
population_synth <- population_synth %>% 
  mutate(
    DATE_DEATH = random_date()
  ) 
summary(population_synth)

##        ID           DATE_BIRTH           DATE_DEATH   DATE_MIGRATION      
##  Min.   :   1.0   Min.   :1932-01-02   Min.   :NA     Min.   :1932-01-03  
##  1st Qu.: 250.8   1st Qu.:1935-09-04   1st Qu.:NA     1st Qu.:1935-08-23  
##  Median : 500.5   Median :1939-05-28   Median :NA     Median :1939-02-07  
##  Mean   : 500.5   Mean   :1939-06-08   Mean   :NaN    Mean   :1939-05-01  
##  3rd Qu.: 750.2   3rd Qu.:1943-02-22   3rd Qu.:NA     3rd Qu.:1943-01-26  
##  Max.   :1000.0   Max.   :1946-12-23   Max.   :NA     Max.   :1946-12-31  
##                                        NA's   :1000

Magic to solve the problem

These problems may occur because of underlying way each function runs in tidyverse, so that using a function rowwise() clears this problem!

random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), na=.56 ,n=1){
  set.seed(runif(1, 1, 10000)) ## SET SEED number!
  start <- as.Date(start)
  end <- as.Date(end)
  sekvenssi <-  seq(start, end, by = "day")
  sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
  sekvenssi <- c(sekvenssi, rep(NA, 100*na))
  sample(sekvenssi, n)
}
# Run function in data creation
population_synth <- population_synth %>% 
  rowwise() %>% 
  mutate(
    DATE_DEATH = random_date()
  ) 
summary(population_synth)

##        ID           DATE_BIRTH           DATE_DEATH        
##  Min.   :   1.0   Min.   :1932-01-02   Min.   :1957-11-17  
##  1st Qu.: 250.8   1st Qu.:1935-09-04   1st Qu.:1978-05-30  
##  Median : 500.5   Median :1939-05-28   Median :1989-08-07  
##  Mean   : 500.5   Mean   :1939-06-08   Mean   :1989-02-28  
##  3rd Qu.: 750.2   3rd Qu.:1943-02-22   3rd Qu.:1994-05-27  
##  Max.   :1000.0   Max.   :1946-12-23   Max.   :2021-06-21  
##                                        NA's   :530         
##  DATE_MIGRATION      
##  Min.   :1932-01-03  
##  1st Qu.:1935-08-23  
##  Median :1939-02-07  
##  Mean   :1939-05-01  
##  3rd Qu.:1943-01-26  
##  Max.   :1946-12-31  
##

Clean version

We can see that this works well on creating DATE_DEATH, but for DATE_MIGRATION we need to add another rule to the function, when DATE_DEATH is NA. This is covered in by adding a simple rule to the function

random_date <- function(start=as.Date("1956/01/01"), end = as.Date("2023/12/31"), master_end = as.Date("2023/12/31"), na=.56 ,n=1){
  set.seed(runif(1, 1, 10000)) ## SET SEED number!
  start <- as.Date(start)
  end <- as.Date(end)
  if(is.na(end)){
    end <- as.Date(master_end)
  }
  sekvenssi <-  seq(start, end, by = "day")
  sekvenssi <- sample(sekvenssi, 100 * (1-na), replace = T)
  sekvenssi <- c(sekvenssi, rep(NA, 100*na))
  sample(sekvenssi, n)
}
# Run function in data creation
population_synth <- population_synth %>% 
  rowwise() %>% 
  mutate(
    DATE_DEATH = random_date(start=DATE_BIRTH),
    DATE_MIGRATION = random_date(start = DATE_BIRTH, end = DATE_DEATH, na = .97)
  ) 
summary(population_synth)

##        ID           DATE_BIRTH           DATE_DEATH        
##  Min.   :   1.0   Min.   :1932-01-02   Min.   :1932-06-28  
##  1st Qu.: 250.8   1st Qu.:1935-09-04   1st Qu.:1961-04-25  
##  Median : 500.5   Median :1939-05-28   Median :1982-03-31  
##  Mean   : 500.5   Mean   :1939-06-08   Mean   :1981-01-14  
##  3rd Qu.: 750.2   3rd Qu.:1943-02-22   3rd Qu.:2002-01-26  
##  Max.   :1000.0   Max.   :1946-12-23   Max.   :2023-11-18  
##                                        NA's   :573         
##  DATE_MIGRATION      
##  Min.   :1934-09-09  
##  1st Qu.:1955-01-02  
##  Median :1973-12-30  
##  Mean   :1973-10-02  
##  3rd Qu.:1991-07-28  
##  Max.   :2013-10-22  
##  NA's   :963

Normally distributed random date function

Now if we want to random date from specific normal distribution, we can create a function to provide this. In this example I have provided mean and sd from OSTPRE population data as base values.

## Random date by distribution
random_date_dist <- function(n=1, mean = -11643, sd = 1387, origin = "1970-01-01"){
  # as.Date(round(rnorm(n = 1, mean = as.integer(mean(population$DATE_BIRTH)), sd = as.integer(sd(population$DATE_BIRTH))), 0), origin = "1970-01-01")
  as.Date(round(rnorm(n = n, mean = mean, sd = sd), 0), origin = "1970-01-01")
}
random_date_dist()

## [1] "1936-08-30"

Random Date Function