r - How to use dplyr to find unique entries in the previous rows -


I have a long data frame with more or less the following structure:

df <- data.frame(
  dates  = c("2011-10-01", "2011-10-01", "2011-10-01", "2011-10-02",
             "2011-10-03", "2011-10-05", "2011-10-06", "2011-10-06"),
  ids    = c("a", "a", "b", "c", "d", "a", "e", "d"),
  values = c(10, 1, 25, 2, 5, 10, 4, 1))

> df
       dates ids values
1 2011-10-01   a     10
2 2011-10-01   a      1
3 2011-10-01   b     25
4 2011-10-02   c      2
5 2011-10-03   d      5
6 2011-10-05   a     10
7 2011-10-06   e      4
8 2011-10-06   d      1

I would like the following output:

       dates unique_ids sum_values
1 2011-10-01          2         36
2 2011-10-02          3         38
3 2011-10-03          4         43
4 2011-10-04          4         43
5 2011-10-05          4         53
6 2011-10-06          5         58

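That is, for each date, unique_ids gives the number of unique ids seen on that date or any earlier date, and sum_values gives the sum of values over that date and all earlier dates. Dates missing from df (such as 2011-10-04) should still appear and carry the previous totals forward.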

I want to avoid loops because the original df is big, so I am thinking of using dplyr.

I know how to obtain sum_values:

df %>%
  group_by(dates) %>%
  summarize(sum_values_daily = sum(values)) %>%
  mutate(sum_values = cumsum(sum_values_daily)) %>%
  select(dates, sum_values)

But I don't know how to obtain the unique_ids column.

Any ideas?

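Because you are trying to calculate the number of distinct ids across groups, we'll first need to define a boolean column that is TRUE only at the first occurrence of each id. Summing that column then counts the ids that are new on each date.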
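As a small illustration of that flag, here is a base R sketch using the ids from the question. The variable name first_seen is made up for this example.

```r
# ids copied from the question's example data
ids <- c("a", "a", "b", "c", "d", "a", "e", "d")

# duplicated() marks repeats, so its negation is TRUE only the
# first time each id appears
first_seen <- !duplicated(ids)
print(first_seen)
# TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE

# a running total of the flag is the number of distinct ids so far
running_unique <- cumsum(first_seen)
print(running_unique)
# 1 1 2 3 4 4 5 5
```

This only works as intended when the rows are already in chronological order, because "first occurrence" is decided by row position.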

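Secondly, since you want dates that are missing from the original df to appear in your expected output, we'll need to perform a right_join with the full sequence of dates. I assume here that the dates column is of class Date. The join will produce NA values for the added dates, which we replace with 0.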
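To see which dates the join would add, the full sequence can be built on its own. This is a base R sketch that assumes the distinct dates from the question, already converted to class Date. The names all_dates and missing_dates are illustrative.

```r
# distinct dates from the question's data, converted to class Date
dates <- as.Date(c("2011-10-01", "2011-10-02", "2011-10-03",
                   "2011-10-05", "2011-10-06"))

# every calendar day between the first and the last observed date
all_dates <- seq(min(dates), max(dates), by = "day")

# days in the full sequence that have no rows in the data
missing_dates <- all_dates[!(all_dates %in% dates)]
print(format(missing_dates))
# "2011-10-04"
```

If dates is stored as character, as in the question's data.frame() call, seq() fails on it, which is why the conversion with as.Date() comes first.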

Finally, we calculate the cumsum of both unique_ids and sum_values.

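library(dplyr)

# seq() below needs dates of class Date; convert if they are stored as character
df$dates <- as.Date(df$dates)

# note: mutate_each() and funs() are deprecated in current dplyr releases
df %>%
  mutate(unique_ids = !duplicated(ids)) %>%   # TRUE at the first occurrence of each id
  group_by(dates) %>%
  summarise(unique_ids = sum(unique_ids),
            sum_values = sum(values)) %>%
  right_join(data.frame(dates = seq(min(df$dates),
                                    max(df$dates),
                                    by = 1)),
             by = "dates") %>%                # adds the missing dates as NA rows
  arrange(dates) %>%                          # chronological order before cumsum
  mutate_each(funs(replace(., is.na(.), 0)), -dates) %>%
  mutate_each(funs(cumsum), -dates)
#       dates unique_ids sum_values
#      <date>      <dbl>      <dbl>
#1 2011-10-01          2         36
#2 2011-10-02          3         38
#3 2011-10-03          4         43
#4 2011-10-04          4         43
#5 2011-10-05          4         53
#6 2011-10-06          5         58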
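For readers on dplyr 1.0 or later, the same three steps can be sketched with across() in place of the older mutate_each() and funs() helpers. This is a variant under assumptions, not the original answer's code: it swaps the right_join for a left_join that starts from the full date sequence so the rows stay in chronological order, and the names new_id, daily, all_dates and result are made up here.

```r
library(dplyr)

# the question's data, with dates converted to class Date
df <- data.frame(
  dates  = as.Date(c("2011-10-01", "2011-10-01", "2011-10-01", "2011-10-02",
                     "2011-10-03", "2011-10-05", "2011-10-06", "2011-10-06")),
  ids    = c("a", "a", "b", "c", "d", "a", "e", "d"),
  values = c(10, 1, 25, 2, 5, 10, 4, 1)
)

# step 1: per-date count of first-seen ids and per-date sum of values
daily <- df %>%
  mutate(new_id = !duplicated(ids)) %>%
  group_by(dates) %>%
  summarise(unique_ids = sum(new_id),
            sum_values = sum(values))

# step 2: full calendar, so that 2011-10-04 gets a row as well
all_dates <- data.frame(dates = seq(min(df$dates), max(df$dates), by = "day"))

# step 3: join, turn NA into 0, then accumulate every column except dates
result <- all_dates %>%
  left_join(daily, by = "dates") %>%
  mutate(across(-dates, ~ cumsum(replace(.x, is.na(.x), 0))))

print(result)
```

Starting from all_dates means the output follows the calendar regardless of how a particular dplyr version orders the rows of a right_join.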
