How to use dplyr::filter() to find ID’s from list variable?

2020/06/11

This example is useful in Shiny-programming, where we have long timeline datasets what we want to filter by ID’s which are included in dataset as list-variable.

I have come across few times on obstacle, where I want to find ID’s from a list variable in a dataset. List variable includes all ID’s corresponding row, for example a date. Let’s create some test data to demonstrate.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following object is masked from '.env':
## 
##     n
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(purrr)
library(ggplot2)

Create test data

We are going to create simple dataset with two variables id and date

d <- data.frame(
  id = sample(
    seq(1000, 9999, 1),
    20000,
    replace = TRUE
  ),
  date = sample(
    seq(as.Date("2000-01-01"), as.Date("2002-01-01"), by = 1),
    20000,
    replace = TRUE
  )
)
## This is sample data which need to be wrangled
head(d)
##     id       date
## 1 1665 2000-05-26
## 2 3393 2001-10-17
## 3 1266 2001-06-29
## 4 9135 2000-12-26
## 5 1812 2000-09-08
## 6 6942 2000-03-08

Then we need to check data by dates. For example we are interested how many users has been in hospital by date.

date_data <- d %>% 
  group_by(date) %>% 
  summarise(
    n = n(),
    ids = list(unique(id))
  )
## `summarise()` ungrouping output (override with `.groups` argument)
## sample data grouped by date
head(date_data)
## # A tibble: 6 x 3
##   date           n ids       
##   <date>     <int> <list>    
## 1 2000-01-01    24 <dbl [24]>
## 2 2000-01-02    32 <dbl [32]>
## 3 2000-01-03    28 <dbl [28]>
## 4 2000-01-04    36 <dbl [36]>
## 5 2000-01-05    31 <dbl [31]>
## 6 2000-01-06    25 <dbl [24]>

We can plot this data to some cool graphs

ggplot(date_data) + 
  geom_line(aes(x=date, y=n))

Filter function

But if we want to use same dataset, but only check small number of ID’s, getting dplyr verb filter() done can be tricky. Let’s first take random 5 ID’s from our dataset.

## get some id's what we want to filter by
myids <-sample(unlist(date_data$ids), 5)
## these are the id's what we want to find
myids
## [1] 1116 6701 7900 6559 9354

For filtering the date_data we can use purrr::map_lgl() to get inside the list in dataset and find if any of our listed ID’s occur in date row,

# use map_lgl() function from purr package
d_sample <- date_data %>% 
  filter(map_lgl(ids, ~ any(.x %in% myids)))
## Plot filtered dataset
ggplot(d_sample) + 
  geom_line(aes(x=date, y=n))

This function is especially useful in Shiny-programming, where we have calculated predatasets and we want to filter some long timeline data by ID’s.