R - Adding a column of corresponding seasons to a data frame


Here is an example of my data frame. I am working in R.

    date          name       count
    2016-11-12    joe         5
    2016-11-15    bob         5
    2016-06-15    nick        12
    2016-10-16    cate        6

I would like to add a column to the data frame that tells me which season corresponds to each date, like this:

    date          name       count      season
    2016-11-12    joe         5          winter
    2016-11-15    bob         5          winter
    2017-06-15    nick        12         summer
    2017-10-16    cate        6          fall

I have started with this code:

    startwinter <- c(month.name[1], month.name[12], month.name[11])
    startsummer <- c(month.name[5], month.name[6], month.name[7])
    startspring <- c(month.name[2], month.name[3], month.name[4])

    # Create a function to find the correct season based on the month
    monthseason <- function(month) {
      # !is.na()  # ignores values that are NA
      # match()   # returns a vector of the positions of matches

      # If the starting month matches a spring season, print "spring".
      # If the starting month matches a summer season, print "summer", etc.
      ifelse(!is.na(match(month, startspring)),
             return("spring"),
             return(ifelse(!is.na(match(month, startwinter)),
                           "winter",
                           ifelse(!is.na(match(month, startsummer)),
                                  "summer", "fall"))))
    }

This code gives me the season for a month. I'm not sure if I'm going about the problem in the right way. Can anyone help me out? Thanks!

There are a couple of hacks, and their usability depends on whether you want to use meteorological or astronomical seasons. I'll offer both; I think they offer sufficient flexibility.

I'm going to use the second data set you provided, since it provides more than just "winter".

    txt <- "date          name       count
    2016-11-12    joe         5
    2016-11-15    bob         5
    2017-06-15    nick        12
    2017-10-16    cate        6"
    dat <- read.table(text = txt, header = TRUE, stringsAsFactors = FALSE)
    dat$date <- as.Date(dat$date)

The quickest method works well when seasons are defined strictly by month.

    metseasons <- c(
      "01" = "winter", "02" = "winter",
      "03" = "spring", "04" = "spring", "05" = "spring",
      "06" = "summer", "07" = "summer", "08" = "summer",
      "09" = "fall", "10" = "fall", "11" = "fall",
      "12" = "winter"
    )
    metseasons[format(dat$date, "%m")]
    #       11       11       06       10
    #   "fall"   "fall" "summer"   "fall"
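The lookup above returns a named character vector, while the question asks for a new column. A minimal sketch of that last step, assuming the example data is rebuilt inline and that the column should be called `season` (the name used in the question's expected output):

```r
# Rebuild the example data so this block runs on its own.
dat <- data.frame(
  date  = as.Date(c("2016-11-12", "2016-11-15", "2017-06-15", "2017-10-16")),
  name  = c("joe", "bob", "nick", "cate"),
  count = c(5, 5, 12, 6),
  stringsAsFactors = FALSE
)

# Month-based (meteorological) lookup, keyed by zero-padded month.
metseasons <- c(
  "01" = "winter", "02" = "winter",
  "03" = "spring", "04" = "spring", "05" = "spring",
  "06" = "summer", "07" = "summer", "08" = "summer",
  "09" = "fall", "10" = "fall", "11" = "fall",
  "12" = "winter"
)

# unname() drops the "11", "06", ... names that the lookup carries along.
dat$season <- unname(metseasons[format(dat$date, "%m")])
dat
```

If November should count as winter, as in the question's expected output, changing the "11" entry of the lookup to "winter" should be enough.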

If you choose to use date ranges for seasons that are not defined by month start/stop, such as the astronomical seasons, here's another 'hack':

    astroseasons <- as.integer(c("0000", "0320", "0620", "0922", "1221", "1232"))
    astroseasons_labels <- c("winter", "spring", "summer", "fall", "winter")

If you use proper Date or POSIX types, you are including years, which makes things a little less generic. One might think of using Julian dates, but during leap years this produces anomalies. So, with the assumption that Feb 28 is never a seasonal boundary, I'm "numericizing" the month-day. Even though R does character comparisons just fine, cut expects numbers, so we convert them to integers.
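To illustrate the leap-year anomaly mentioned above, here is a small sketch that takes "Julian date" to mean the day of the year (`%j`), which is an assumption about the intended meaning. The same calendar date gets a different day number in a leap year, while the month-day integer stays stable:

```r
# March 20 in a leap year (2016) and in a common year (2017).
d <- as.Date(c("2016-03-20", "2017-03-20"))

# Day of year shifts by one after Feb 29.
doy <- as.integer(format(d, "%j"))
doy
# [1] 80 79

# Month-day as an integer is identical in both years.
md <- as.integer(format(d, "%m%d"))
md
# [1] 320 320
```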

Two safeguards: because cut is either right-open (and left-closed) or right-closed (and left-open), our two book-ends need to extend beyond the legal dates, ergo "0000" and "1232". There are other techniques that could work equally well here (e.g., using -Inf and Inf, post-integerization).

    astroseasons_labels[ cut(as.integer(format(dat$date, "%m%d")), astroseasons, labels = FALSE) ]
    # [1] "fall"   "fall"   "spring" "fall"

Notice that the third date is in spring when using astronomical seasons and summer otherwise.
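The -Inf/Inf alternative for the book-ends can be sketched as follows. The name `astroseasons_inf` and the sample dates are made up for illustration; the boundaries themselves are the ones used above:

```r
# Same boundaries as before, but with infinite book-ends instead of 0000 and 1232.
astroseasons_inf <- c(-Inf, 320, 620, 922, 1221, Inf)
astroseasons_labels <- c("winter", "spring", "summer", "fall", "winter")

# Illustrative dates, including both ends of the year.
dates <- as.Date(c("2016-11-12", "2016-01-01", "2017-06-15", "2017-12-31"))
md <- as.integer(format(dates, "%m%d"))

# cut() defaults to right-closed intervals, so 0320 itself still counts as winter.
res <- astroseasons_labels[cut(md, astroseasons_inf, labels = FALSE)]
res
# [1] "fall"   "winter" "spring" "winter"
```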

This solution can easily be adjusted to account for the Southern Hemisphere or other seasonal preferences/beliefs.
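One possible way to make the Southern Hemisphere adjustment is to remap the labels after the lookup. This is only a sketch: the `flip` vector is a hypothetical helper, and it assumes the seasons are simply swapped by half a year.

```r
# Northern Hemisphere month-based lookup, as above.
metseasons <- c(
  "01" = "winter", "02" = "winter",
  "03" = "spring", "04" = "spring", "05" = "spring",
  "06" = "summer", "07" = "summer", "08" = "summer",
  "09" = "fall", "10" = "fall", "11" = "fall",
  "12" = "winter"
)

# Hypothetical helper: maps each northern season to its southern counterpart.
flip <- c(winter = "summer", spring = "fall", summer = "winter", fall = "spring")

dates <- as.Date(c("2016-11-12", "2017-06-15"))
north <- unname(metseasons[format(dates, "%m")])
south <- unname(flip[north])
south
# [1] "spring" "winter"
```

Editing the labels in the lookup vector directly would work just as well; the remapping only avoids maintaining two tables.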

Edit: motivated by @kristofersen's answer (thanks), I looked into benchmarks. lubridate::month uses a POSIXct-to-POSIXlt conversion to extract the month, which can be over 10x faster than the format(x, "%m") method. As such:

    metseasons2 <- c(
      "winter", "winter",
      "spring", "spring", "spring",
      "summer", "summer", "summer",
      "fall", "fall", "fall",
      "winter"
    )

Noting that as.POSIXlt returns 0-based months, we add 1:

    metseasons2[ 1 + as.POSIXlt(dat$date)$mon ]
    # [1] "fall"   "fall"   "summer" "fall"
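The same POSIXlt trick can also produce the month-day integer needed for the astronomical version, without going through format(). A small sketch with illustrative dates, checking that both routes give the same numbers:

```r
dates <- as.Date(c("2016-11-12", "2017-06-15", "2017-01-05"))
lt <- as.POSIXlt(dates)

# $mon is 0-based and $mday is 1-based, so month * 100 + day becomes:
md_lt <- 100 * (1 + lt$mon) + lt$mday

# The format()-based version used earlier, for comparison.
md_fmt <- as.integer(format(dates, "%m%d"))

md_lt
# [1] 1112  615  105
all(md_lt == md_fmt)
# [1] TRUE
```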

Comparison:

    library(lubridate)
    library(microbenchmark)
    set.seed(42)
    x <- Sys.Date() + sample(1e3)
    xlt <- as.POSIXlt(x)

    # Note: `seasons` is the function from @kristofersen's answer (not reproduced here).
    microbenchmark(
      metfmt = metseasons[ format(x, "%m") ],
      metlt  = metseasons2[ 1 + xlt$mon ],
      astrofmt = astroseasons_labels[ cut(as.integer(format(x, "%m%d")), astroseasons, labels = FALSE) ],
      astrolt  = astroseasons_labels[ cut(100*(1+xlt$mon) + xlt$mday, astroseasons, labels = FALSE) ],
      lubridate = sapply(month(x), seasons)
    )
    # Unit: microseconds
    #       expr      min       lq       mean    median        uq       max neval
    #     metfmt 1952.091 2135.157 2289.63943 2212.1025 2308.1945  3748.832   100
    #      metlt   14.223   16.411   22.51550   20.0575   24.7980    68.924   100
    #   astrofmt 2240.547 2454.245 2622.73109 2507.8520 2674.5080  3923.874   100
    #    astrolt   42.303   54.702   72.98619   66.1885   89.7095   163.373   100
    #  lubridate 5906.963 6473.298 7018.11535 6783.2700 7508.0565 11474.050   100

So the methods using as.POSIXlt(...)$mon are much faster. (@kristofersen's answer could be improved by vectorizing it, perhaps with ifelse, but it still won't compare with the speed of the vector lookups, with or without cut.)
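For reference, a vectorized ifelse version might look like the sketch below. `seasons_vec` is a hypothetical name and is not the function from @kristofersen's answer; it takes month numbers 1 to 12 and uses the meteorological grouping from earlier.

```r
# Hypothetical vectorized helper: month numbers in, season labels out.
seasons_vec <- function(m) {
  ifelse(m %in% c(12, 1, 2), "winter",
  ifelse(m %in% 3:5,         "spring",
  ifelse(m %in% 6:8,         "summer", "fall")))
}

dates <- as.Date(c("2016-11-12", "2017-06-15", "2017-01-05"))
m <- 1 + as.POSIXlt(dates)$mon   # convert 0-based months to 1..12
out <- seasons_vec(m)
out
# [1] "fall"   "summer" "winter"
```

This handles a whole vector in one call, so no sapply is needed, though each nested ifelse still evaluates over the full input.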

