regex - Extract before and after lines based on keyword in PDF using R programming

I want to extract information related to the keyword "cancer" from a list of PDFs using R.

I want to extract the lines or paragraph before and after each line containing the word "cancer" in a text file.

abstracts <- lapply(mytxtfiles, function(i) {
    j <- paste0(scan(i, what = character()), collapse = " ")
    regmatches(j, gregexpr("(?m)(^[^\\r\\n]*\\r+){4}[cancer][^\\r\\n]*\\r+(^[^\\r\\n]*\\r+){4}", j, perl = TRUE))
})

The regex above is not working.

Here's one approach:

library(textreadr)
library(tidyverse)

# Return the indices of elements matching `regex`, plus the element
# immediately before and after each match.
loc <- function(var, regex, n = 1, ignore.case = TRUE){
    locs <- grep(regex, var, ignore.case = ignore.case)
    out <- sort(unique(c(locs - 1, locs, locs + 1)))
    # Drop indices that fall outside the vector
    out <- out[out > 0]
    out[out <= length(var)]
}

doc <- 'https://www.in.kpmg.com/pdf/indian%20pharma%20outlook.pdf' %>%
    read_pdf() %>%
    slice(loc(text, 'cancer'))

doc

##    page_id element_id text
## 1       24         28 ranjit shahani applauds national pharmaceuticals policy's proposal of public/private
## 2       24         29 partnerships (ppps) tackle life-threatening diseases such cancer , hiv/aids,
## 3       24         30 stresses that, in order them work, should voluntary, , government
## 4       25          8 availability of medicines treat life-threatening diseases. notes, example,
## 5       25          9 while average estimate of value of drugs treat country's cancer patients
## 6       25         10 $1.11 billion, market in fact worth $33.5 million. “the big gap indicates
## 7       25         12 because of high cost of these medicines,” says policy, calls tax ,
## 8       25         13 excise exemptions anti-cancer drugs.
## 9       25         14 area ppps proposed drugs treat hiv/aids, india's biggest health
## 10      32         19 variegate trading, ub subsidiary. firm's major products in anti-infective,
## 11      32         20 anti-inflammatory, cancer, diabetes , allergy market segments and, year ended
## 12      32         21 december 31, 2005, reported net sales (excluding excise duty) 9.9 percent $181.1
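
Note that loc() above accepts an n argument but always takes exactly one line on each side. If you need a wider window (the question's regex asks for four lines before and after), something like the following base R sketch might do. The name context_lines and the sample text are made up for illustration and are not part of textreadr or of the answer above; it assumes the text is already a character vector with one line per element, for example from readLines().

```r
# Hypothetical helper (not from any package): return every line matching
# `pattern`, together with up to `n` lines of context on each side.
context_lines <- function(lines, pattern, n = 1, ignore.case = TRUE) {
    hits <- grep(pattern, lines, ignore.case = ignore.case)
    # Expand each hit into the index window (hit - n):(hit + n)
    idx <- as.integer(unlist(lapply(hits, function(h) seq(h - n, h + n))))
    # Merge overlapping windows and clip to valid line numbers
    idx <- sort(unique(idx))
    idx <- idx[idx >= 1 & idx <= length(lines)]
    lines[idx]
}

# Made-up sample standing in for the lines of a real text file
sample_lines <- c(
    "line one",
    "line two",
    "a note about Cancer research",
    "line four",
    "line five",
    "line six"
)

result <- context_lines(sample_lines, "cancer", n = 1)
print(result)
```

Working on a vector of lines, rather than on one string collapsed with paste0(), keeps the line boundaries that a "before and after" search depends on. It also avoids writing [cancer] in the pattern, which is a character class matching a single one of those letters rather than the word itself.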
