r - How to use dplyr to find unique entries in the previous rows
I have a long data frame with more or less the following structure:

    df <- data.frame(
      dates  = c("2011-10-01", "2011-10-01", "2011-10-01", "2011-10-02",
                 "2011-10-03", "2011-10-05", "2011-10-06", "2011-10-06"),
      ids    = c("a", "a", "b", "c", "d", "a", "e", "d"),
      values = c(10, 1, 25, 2, 5, 10, 4, 1))

    > df
           dates ids values
    1 2011-10-01   a     10
    2 2011-10-01   a      1
    3 2011-10-01   b     25
    4 2011-10-02   c      2
    5 2011-10-03   d      5
    6 2011-10-05   a     10
    7 2011-10-06   e      4
    8 2011-10-06   d      1

I would like the following output:
           dates unique_ids sum_values
    1 2011-10-01          2         36
    2 2011-10-02          3         38
    3 2011-10-03          4         43
    4 2011-10-04          4         43
    5 2011-10-05          4         53
    6 2011-10-06          5         58

i.e. for each date, unique_ids gives the number of unique ids seen on that date and all earlier dates, and sum_values gives the sum of values over that date and all earlier dates.
I want to avoid loops because the original df is big. I was thinking of using dplyr.
I know how to obtain sum_values:

    df %>%
      group_by(dates) %>%
      summarize(sum_values_daily = sum(values)) %>%
      mutate(sum_values = cumsum(sum_values_daily)) %>%
      select(dates, sum_values)

but I don't know how to obtain the unique_ids column.
Any ideas?
Because you are trying to calculate the number of distinct ids across groups, we'll first need to define a boolean column that flags the first occurrence of each id, so that summing it counts the unique ids.
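To see why a first-occurrence flag is enough, here is a minimal base R sketch using the ids from the question. It assumes the rows are already sorted by date, which holds for the sample but is an assumption about the real data.

```r
# ids in date order, copied from the question's data frame
ids <- c("a", "a", "b", "c", "d", "a", "e", "d")

# TRUE only the first time each id appears, FALSE for every repeat
first_seen <- !duplicated(ids)
first_seen
# TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE

# Running count of distinct ids seen so far, row by row
running_unique <- cumsum(first_seen)
running_unique
# 1 1 2 3 4 4 5 5
```

Summing first_seen within each date and then taking the cumulative sum gives the same counts at the date level.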
Secondly, since you want the expected output to include dates that are missing from the original df, we'll need to perform a right_join with the full sequence of dates. I assume here that the dates column is of class Date. The join will produce NA values, which we replace with 0.
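The gap-filling step on its own might look like the sketch below. The daily totals are computed by hand from the question's data; the names daily, all_dates and filled are made up for this illustration. The arrange() call is a precaution, in case the join does not return rows in date order.

```r
library(dplyr)

# Daily totals from the question's data; 2011-10-04 is missing
daily <- data.frame(
  dates = as.Date(c("2011-10-01", "2011-10-02", "2011-10-03",
                    "2011-10-05", "2011-10-06")),
  sum_values = c(36, 2, 5, 10, 5)
)

# Every calendar day between the first and last observed date
all_dates <- data.frame(
  dates = seq(min(daily$dates), max(daily$dates), by = "day")
)

# Keep every row of all_dates; days absent from daily get NA,
# which is then replaced with 0
filled <- daily %>%
  right_join(all_dates, by = "dates") %>%
  arrange(dates) %>%
  mutate(sum_values = replace(sum_values, is.na(sum_values), 0))
```

After this, filled has six rows, with a zero total for 2011-10-04.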
Finally, we calculate the cumsum of both unique_ids and sum_values.
    library(dplyr)

    df %>%
      # flag the first occurrence of each id
      mutate(unique_ids = !duplicated(ids)) %>%
      group_by(dates) %>%
      summarise(unique_ids = sum(unique_ids), sum_values = sum(values)) %>%
      # add the dates that are missing from df
      right_join(data.frame(dates = seq(min(df$dates), max(df$dates), by = 1)),
                 by = "dates") %>%
      # make sure rows are in date order before taking cumulative sums
      arrange(dates) %>%
      # replace the NAs introduced by the join with 0
      mutate_each(funs(replace(., is.na(.), 0)), -dates) %>%
      mutate_each(funs(cumsum), -dates)

    #        dates unique_ids sum_values
    #       <date>      <dbl>      <dbl>
    # 1 2011-10-01          2         36
    # 2 2011-10-02          3         38
    # 3 2011-10-03          4         43
    # 4 2011-10-04          4         43
    # 5 2011-10-05          4         53
    # 6 2011-10-06          5         58
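Note that mutate_each() and funs() are deprecated in current dplyr releases. A possible equivalent using across() (available from dplyr 1.0.0) is sketched below. It converts dates to Date up front and sorts before flagging duplicates; all_dates and result are names chosen for this illustration, not part of the original answer.

```r
library(dplyr)

df <- data.frame(
  dates  = as.Date(c("2011-10-01", "2011-10-01", "2011-10-01", "2011-10-02",
                     "2011-10-03", "2011-10-05", "2011-10-06", "2011-10-06")),
  ids    = c("a", "a", "b", "c", "d", "a", "e", "d"),
  values = c(10, 1, 25, 2, 5, 10, 4, 1)
)

# Full daily sequence, so that gaps such as 2011-10-04 appear in the result
all_dates <- data.frame(dates = seq(min(df$dates), max(df$dates), by = "day"))

result <- df %>%
  arrange(dates) %>%                                   # duplicated() depends on row order
  mutate(unique_ids = !duplicated(ids)) %>%            # TRUE at first sight of each id
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids), sum_values = sum(values)) %>%
  right_join(all_dates, by = "dates") %>%              # add the missing dates as NA rows
  arrange(dates) %>%                                   # restore date order after the join
  mutate(across(-dates, ~ replace(.x, is.na(.x), 0))) %>%
  mutate(across(-dates, cumsum))
```

On the sample data this should reproduce the table requested in the question.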