r - 2-gram and 3-gram instead of 1-gram using RWeka


I am trying to extract 1-grams, 2-grams, and 3-grams from a training corpus, using the RWeka NGramTokenizer function. Unfortunately, I am only getting 1-grams. Here is the code:

train_corpus

# clean-up
cleanset1 <- tm_map(train_corpus, tolower)
cleanset2 <- tm_map(cleanset1, removeNumbers)
cleanset3 <- tm_map(cleanset2, removeWords, stopwords("english"))
cleanset4 <- tm_map(cleanset3, removePunctuation)
cleanset5 <- tm_map(cleanset4, stemDocument, language = "english")
cleanset6 <- tm_map(cleanset5, stripWhitespace)

# 1-gram
ngramTokenizer1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
train_dtm_tf_1g <- DocumentTermMatrix(cleanset6, control = list(tokenize = ngramTokenizer1))
dim(train_dtm_tf_1g)
[1]  5905 15322

# 2-gram
ngramTokenizer2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
train_dtm_tf_2g <- DocumentTermMatrix(cleanset6, control = list(tokenize = ngramTokenizer2))
dim(train_dtm_tf_2g)
[1]  5905 15322

# 3-gram
ngramTokenizer3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
train_dtm_tf_3g <- DocumentTermMatrix(cleanset6, control = list(tokenize = ngramTokenizer3))
dim(train_dtm_tf_3g)
[1]  5905 15322

Every time I get the same result, which is obviously wrong.

# combining 1-gram, 2-gram, and 3-gram corpus
ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
train_dtm_tf_ng <- DocumentTermMatrix(cleanset6, control = list(tokenize = ngramTokenizer))
dim(train_dtm_tf_ng)
[1]  5905 15322

# numeric: maximal allowed sparsity, in the range bigger than 0 and smaller than 1
train_rmspa_m_tf_ng_95 <- removeSparseTerms(train_dtm_tf_ng, 0.95)
dim(train_rmspa_m_tf_ng_95)
[1] 5905  172

# create a bag-of-words (BOW) vector of these terms to use later
train_bow_3g_95 <- findFreqTerms(train_rmspa_m_tf_ng_95)

# have a look at the terms that appear in at least 5% of instances
train_bow_3g_95
 [1] "avg"        "februari"   "januari"    "level"      "nation"     "per"        "price"
 [8] "rate"       "report"     "reserv"     "reuter"     "also"       "board"      "export"
[15] "march"      "may"        "month"      "oil"        "product"    "total"      "annual"
[22] "approv"     "april"      "capit"      "common"     "compani"    "five"       "inc"
[29] "increas"    "meet"       "mln"        "record"     "said"       "share"      "sharehold"
[36] "stock"      "acquir"     "addit"      "buy"        "chang"      "complet"    "continu"
...

These are only 1-grams. I tried to rewrite the command in the following way:

ngramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))

but it did not work out. I also tried adding the line:

options(mc.cores=1) 

before the NGramTokenizer command, but nothing changed. Any help?
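Note that the tokenizer itself seems fine when called directly on a character string (a minimal check, assuming RWeka is attached; the sample sentence is just illustrative):

library(RWeka)

# called directly, NGramTokenizer returns proper 2-grams, which suggests
# the problem is in how DocumentTermMatrix applies the tokenizer
NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
# expected: "the quick"  "quick brown"  "brown fox"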

I came across the same issue today. It seems that "tm_map" is not working on a SimpleCorpus for some reason.

I changed the code from

corpus = Corpus(VectorSource(pd_cmnt$qrating_explaination))

to

corpus = VCorpus(VectorSource(pd_cmnt$qrating_explaination))

and it works and gives the 2-grams properly.
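For reference, here is a minimal end-to-end sketch of the fix (the toy vector `docs` and the wrapper name `bigramTokenizer` are just illustrative):

library(tm)
library(RWeka)

docs <- c("the quick brown fox", "jumps over the lazy dog")

# use VCorpus (not the SimpleCorpus that Corpus() returns in recent tm versions)
# so that the custom tokenizer is actually applied by DocumentTermMatrix
corpus <- VCorpus(VectorSource(docs))

bigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm_2g <- DocumentTermMatrix(corpus, control = list(tokenize = bigramTokenizer))

Terms(dtm_2g)   # now shows bigrams such as "quick brown"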

