r - 2-gram and 3-gram instead of 1-gram using RWeka
I am trying to extract 1-grams, 2-grams, and 3-grams from my train corpus using the RWeka NGramTokenizer function. Unfortunately, I am only getting 1-grams. Here is the code:
    train_corpus

    # clean-up
    cleanset1 <- tm_map(train_corpus, tolower)
    cleanset2 <- tm_map(cleanset1, removeNumbers)
    cleanset3 <- tm_map(cleanset2, removeWords, stopwords("english"))
    cleanset4 <- tm_map(cleanset3, removePunctuation)
    cleanset5 <- tm_map(cleanset4, stemDocument, language = "english")
    cleanset6 <- tm_map(cleanset5, stripWhitespace)

    # 1-gram
    NGramTokenizer1 <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
    train_dtm_tf_1g <- DocumentTermMatrix(cleanset6, control = list(tokenize = NGramTokenizer1))
    dim(train_dtm_tf_1g)
    [1]  5905 15322

    # 2-gram
    NGramTokenizer2 <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    train_dtm_tf_2g <- DocumentTermMatrix(cleanset6, control = list(tokenize = NGramTokenizer2))
    dim(train_dtm_tf_2g)
    [1]  5905 15322

    # 3-gram
    NGramTokenizer3 <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    train_dtm_tf_3g <- DocumentTermMatrix(cleanset6, control = list(tokenize = NGramTokenizer3))
    dim(train_dtm_tf_3g)
    [1]  5905 15322
Every time I get the same result, which is obviously wrong.
    # combining 1-gram, 2-gram and 3-gram
    NGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
    train_dtm_tf_ng <- DocumentTermMatrix(cleanset6, control = list(tokenize = NGramTokenizer))
    dim(train_dtm_tf_ng)
    [1]  5905 15322

    # numeric; maximal allowed sparsity in the range bigger than 0 and smaller than 1
    train_rmspa_m_tf_ng_95 <- removeSparseTerms(train_dtm_tf_ng, 0.95)
    dim(train_rmspa_m_tf_ng_95)
    [1] 5905  172

    # create a bag-of-words (BoW) vector of these terms to use later
    train_bow_3g_95 <- findFreqTerms(train_rmspa_m_tf_3g_95) # take all terms that appear in at least 5% of the instances
    train_bow_3g_95
     [1] "avg"       "februari"  "januari"   "level"     "nation"    "per"       "price"
     [8] "rate"      "report"    "reserv"    "reuter"    "also"      "board"     "export"
    [15] "march"     "may"       "month"     "oil"       "product"   "total"     "annual"
    [22] "approv"    "april"     "capit"     "common"    "compani"   "five"      "inc"
    [29] "increas"   "meet"      "mln"       "record"    "said"      "share"     "sharehold"
    [36] "stock"     "acquir"    "addit"     "buy"       "chang"     "complet"   "continu"
    ...
These are only 1-grams. I tried to rewrite the command in the following way:
    NGramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
but it did not work out. I also tried to add the line:
    options(mc.cores = 1)
before the NGramTokenizer command, but nothing changed. Any help?
I came across the same issue today. It seems the custom tokenizer is silently ignored when the corpus is a SimpleCorpus, which is what Corpus(VectorSource(...)) returns in recent versions of tm.
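You can check which corpus class you are actually working with; a quick sketch with made-up example text:

    library(tm)
    x <- c("some text", "more text")
    class(Corpus(VectorSource(x)))   # "SimpleCorpus" "Corpus" in recent tm versions
    class(VCorpus(VectorSource(x)))  # "VCorpus" "Corpus"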
So I changed the code from:
    corpus = Corpus(VectorSource(pd_cmnt$qrating_explaination))
to
    corpus = VCorpus(VectorSource(pd_cmnt$qrating_explaination))
and now it works, giving 2-grams properly.
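For completeness, here is a minimal, self-contained sketch of the fix; the data frame df and its text column are invented for illustration:

    library(tm)
    library(RWeka)

    # made-up example data
    df <- data.frame(text = c("the quick brown fox jumps over the lazy dog",
                              "the lazy dog sleeps in the warm sun"),
                     stringsAsFactors = FALSE)

    # use VCorpus, not Corpus: Corpus() builds a SimpleCorpus,
    # which ignores the custom tokenize control
    corpus <- VCorpus(VectorSource(df$text))

    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    dtm_2g <- DocumentTermMatrix(corpus, control = list(tokenize = BigramTokenizer))
    Terms(dtm_2g)   # genuine 2-grams such as "quick brown" and "brown fox"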