R - How to use dplyr to find unique entries in the previous rows
I have a long data frame with more or less the following structure:
df <- data.frame(
  dates  = c("2011-10-01", "2011-10-01", "2011-10-01", "2011-10-02",
             "2011-10-03", "2011-10-05", "2011-10-06", "2011-10-06"),
  ids    = c("a", "a", "b", "c", "d", "a", "e", "d"),
  values = c(10, 1, 25, 2, 5, 10, 4, 1))

> df
       dates ids values
1 2011-10-01   a     10
2 2011-10-01   a      1
3 2011-10-01   b     25
4 2011-10-02   c      2
5 2011-10-03   d      5
6 2011-10-05   a     10
7 2011-10-06   e      4
8 2011-10-06   d      1
I would like the following output:
       dates unique_ids sum_values
1 2011-10-01          2         36
2 2011-10-02          3         38
3 2011-10-03          4         43
4 2011-10-04          4         43
5 2011-10-05          4         53
6 2011-10-06          5         58
That is, for each date, unique_ids gives the number of unique ids seen up to and including that date, and sum_values gives the sum of values up to and including that date.
I want to avoid loops because the original df is big, so I am thinking of using dplyr.
I know how to obtain sum_values:
df %>%
  group_by(dates) %>%
  summarize(sum_values_daily = sum(values)) %>%
  mutate(sum_values = cumsum(sum_values_daily)) %>%
  select(dates, sum_values)
I don't know how to obtain the unique_ids column.
Any ideas?
Because you are trying to calculate the number of distinct ids across groups, first we'll need to define a boolean column that will allow us to sum the unique values. Secondly, because you want to include dates that are missing from the original df in the expected output, we'll need to perform a right_join against the full sequence of dates. I assume here that dates is a column of class Date. The join will produce NA values, which we replace with 0. Finally, we calculate the cumsum of both unique_ids and sum_values.
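To see why a boolean column is enough, here is a small base R illustration on the ids from the question. It assumes the rows are already ordered by date, as in the sample data; first_seen and running_unique are names made up for this sketch.

```r
# ids in the order they appear in the question's data frame
ids <- c("a", "a", "b", "c", "d", "a", "e", "d")

# TRUE only on the row where an id shows up for the first time
first_seen <- !duplicated(ids)
# TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE

# Summing the flags cumulatively gives the running count of distinct ids
running_unique <- cumsum(first_seen)
# 1 1 2 3 4 4 5 5
```

Summing first_seen within each date and then taking the cumulative sum over dates gives the same running count at the date level.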
library(dplyr)

# Note: mutate_each() and funs() are deprecated in later dplyr releases
df %>%
  mutate(unique_ids = !duplicated(ids)) %>%   # flag the first occurrence of each id
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids), sum_values = sum(values)) %>%
  right_join(data.frame(dates = seq(min(df$dates), max(df$dates), by = 1))) %>%   # add missing dates
  mutate_each(funs(replace(., is.na(.), 0)), -dates) %>%   # NA -> 0
  mutate_each(funs(cumsum), -dates)                        # running totals

#        dates unique_ids sum_values
#       <date>      <dbl>      <dbl>
# 1 2011-10-01          2         36
# 2 2011-10-02          3         38
# 3 2011-10-03          4         43
# 4 2011-10-04          4         43
# 5 2011-10-05          4         53
# 6 2011-10-06          5         58
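For readers on a recent dplyr, the same idea can be sketched with across(), since mutate_each() and funs() have been deprecated. This is only one possible rewrite, not the answer's original code. It assumes dplyr 1.0.0 or later, and the names all_dates, daily, new_id and result are made up for illustration. It also swaps the right_join for a left_join that starts from the date sequence, so that the rows stay in calendar order before cumsum is applied.

```r
library(dplyr)

# Sample data from the question, with dates converted to class Date
df <- data.frame(
  dates  = as.Date(c("2011-10-01", "2011-10-01", "2011-10-01", "2011-10-02",
                     "2011-10-03", "2011-10-05", "2011-10-06", "2011-10-06")),
  ids    = c("a", "a", "b", "c", "d", "a", "e", "d"),
  values = c(10, 1, 25, 2, 5, 10, 4, 1)
)

# Every calendar day between the first and last date (hypothetical helper table)
all_dates <- data.frame(dates = seq(min(df$dates), max(df$dates), by = "day"))

# Per-day count of ids never seen before, and per-day sum of values
daily <- df %>%
  arrange(dates) %>%
  mutate(new_id = !duplicated(ids)) %>%
  group_by(dates) %>%
  summarise(unique_ids = sum(new_id), sum_values = sum(values))

# Start from the full date sequence so row order is chronological,
# fill the missing day with 0, then accumulate
result <- all_dates %>%
  left_join(daily, by = "dates") %>%
  mutate(across(c(unique_ids, sum_values),
                ~ cumsum(replace(.x, is.na(.x), 0))))

result$unique_ids  # 2 3 4 4 4 5
result$sum_values  # 36 38 43 43 53 58
```

Starting from all_dates is a deliberate choice here: a left_join keeps the row order of its first table, so the cumulative sums are computed in date order without an extra sort.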