regex - Extract before and after lines based on keyword in PDF using R programming
I want to extract information related to the keyword "cancer" from a list of PDFs using R.
I want to extract the lines (or the paragraph) before and after each line containing the word "cancer" in a text file.
abstracts <- lapply(mytxtfiles, function(i) {
    j <- paste0(scan(i, what = character()), collapse = " ")
    regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\r+){4}[cancer][^\\r\\n]*\\r+(^[^\\r\\n]*\\r+){4}", j, perl = TRUE))
})
The above regex is not working.
Here's one approach:
library(textreadr)
library(tidyverse)

# Return the indices of lines matching `regex`, plus the line directly
# before and after each match, clipped to the valid range.
# Note: `n` is accepted but not used; the window is fixed at one line.
loc <- function(var, regex, n = 1, ignore.case = TRUE){
    locs <- grep(regex, var, ignore.case = ignore.case)
    out <- sort(unique(c(locs - 1, locs, locs + 1)))
    out <- out[out > 0]
    out[out <= length(var)]
}

doc <- 'https://www.in.kpmg.com/pdf/indian%20pharma%20outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

doc

##    page_id element_id text
## 1       24         28 ranjit shahani applauds national pharmaceuticals policy's proposal of public/private
## 2       24         29 partnerships (ppps) tackle life-threatening diseases such cancer , hiv/aids,
## 3       24         30 stresses that, in order them work, should voluntary, , government
## 4       25          8 availability of medicines treat life-threatening diseases. notes, example,
## 5       25          9 while average estimate of value of drugs treat country's cancer patients
## 6       25         10 $1.11 billion, market in fact worth $33.5 million. “the big gap indicates
## 7       25         12 because of high cost of these medicines,” says policy, calls tax ,
## 8       25         13 excise exemptions anti-cancer drugs.
## 9       25         14 area ppps proposed drugs treat hiv/aids, india's biggest health
## 10      32         19 variegate trading, ub subsidiary. firm's major products in anti-infective,
## 11      32         20 anti-inflammatory, cancer, diabetes , allergy market segments and, year ended
## 12      32         21 december 31, 2005, reported net sales (excluding excise duty) 9.9 percent $181.1
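As for why the original attempt returns nothing: scan() splits the file on whitespace and paste0(..., collapse = " ") joins everything into one string, so no line breaks are left for the \r parts of the pattern to match. Also, [cancer] is a character class that matches a single character (any one of c, a, n, e, r), not the word. If the text is already in plain .txt files, a base-R version of the same idea might look like the sketch below. It is only an illustration: context_lines is a hypothetical helper name, the sample vector is made up, and in practice the lines would presumably come from readLines() on each file.

```r
# Hypothetical helper (not from any package): indices of lines matching
# `pattern`, plus `n` lines of context on each side, clipped to the
# valid range of `lines`.
context_lines <- function(lines, pattern, n = 1, ignore.case = TRUE) {
    hits <- grep(pattern, lines, ignore.case = ignore.case)
    # outer() builds every hit + offset combination for offsets -n..n
    idx <- sort(unique(as.vector(outer(hits, -n:n, "+"))))
    idx[idx >= 1 & idx <= length(lines)]
}

# Made-up sample standing in for readLines("some_file.txt")
lines <- c("intro", "background", "rates of cancer rose", "methods",
           "results", "Cancer screening", "end")

# Matching lines with one line of context before and after
lines[context_lines(lines, "cancer", n = 1)]
```

Working on a vector of lines rather than one collapsed string keeps the "before" and "after" logic as plain index arithmetic, and n can be raised to widen the window (for example n = 4 for four lines on each side).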