I am using the pdf tool to extract data from a scanned file, converting it to PNG first. Because the tool reads from the PNG, some stray punctuation marks show up in the output. I can remove most of them, but not "|".

Here is my data:

c("| January 2,310,501 2,342,654 + 14%", "| February 2,221,036 2,316,278 + 4.3%")

I want my data to look like this:

c("January 2,310,501 2,342,654 + 14%", "February 2,221,036 2,316,278 + 4.3%")

As you can see from the attached picture, the "|" has changed the structure of my data, so I cannot simply read the data from the second column. I want to remove the "|" element entirely so that the remaining elements shift forward. The file is also attached. Thank you for your help.

  • Please add data using `dput` and show the expected output for the same. Please read the info about [how to ask a good question](http://stackoverflow.com/help/how-to-ask) and how to give a [reproducible example](http://stackoverflow.com/questions/5963269). – Ronak Shah Mar 24 '20 at 00:49

2 Answers

You could use lapply to remove elements which are "|".

lapply(test2, function(x) x[x != '|'])

#[[1]]
#[1] "January" "test"   

#[[2]]
#[1] "February"  "2, 602,33"

Similarly, using map in purrr

purrr::map(test2,  ~.x[.x != '|'])

For the updated data we can use gsub

test <- trimws(gsub('\\|', '', test))
test

# [1] "January 2,310,501 2,342,654 + 14%"        "February 2,221,036 2,316,278 + 4.3%"     
# [3] "March 2,602,503 2,571,661 ( -1.2% )"      "April 2,471,788 2,485,989 i 0.6%"        
# [5] "May 2,418,547 2,512,922 + 3.9%"           "June 2,412,882 2,430,232 + 0.7%"         
# [7] "July 2,462,907 2,535,594 + 3.0%"          "August 2,526,211 2,638,753 + 4.5%"       
# [9] "September 2,434,132 2,480,466 * + 1.9%"   "October 2,552,215 2,642,990 * + 3.6%"    
#[11] "November 2,306,106 2,428,806 + 5.3%"      "December _ 2,283,294 2,250,016 ( -1.5% )"
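
As a variation on the `gsub` call above, here is a minimal sketch of two related options, assuming the strings look like the two samples in the question. `fixed = TRUE` treats "|" as a literal character so no escaping is needed, and an anchored `sub` removes only a leading "|", which may be safer if a "|" could legitimately appear later in a line. The names `cleaned_fixed` and `cleaned_leading` are made up for illustration.

```r
# Sample strings copied from the question
test <- c("| January 2,310,501 2,342,654 + 14%",
          "| February 2,221,036 2,316,278 + 4.3%")

# fixed = TRUE matches "|" literally, so the backslash escape is not needed
cleaned_fixed <- trimws(gsub("|", "", test, fixed = TRUE))

# Anchored alternative: strip only a leading "|" and the whitespace around it
cleaned_leading <- sub("^[[:space:]]*\\|[[:space:]]*", "", test)

print(cleaned_fixed)
print(cleaned_leading)
```

For these two sample strings both approaches should give the same result; they differ only when a "|" appears somewhere other than the start of a line.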

data

test2 <- list(c('|', 'January', 'test'), c('February', '2, 602,33', '|'))
Ronak Shah

We can use setdiff

lapply(test2, setdiff, "|")
#[[1]]
#[1] "January" "test"   

#[[2]]
#[1] "February"  "2, 602,33"
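
One thing to keep in mind with `setdiff`: it returns unique values, so repeated entries in a vector are collapsed. The sketch below uses a hypothetical vector `x` (not from the question) with a repeated number to show the difference from logical subsetting.

```r
# Hypothetical vector with a repeated value and two stray "|" elements
x <- c("|", "2,310,501", "2,310,501", "|")

# setdiff drops "|" but also removes the duplicate entry
via_setdiff <- setdiff(x, "|")

# Logical subsetting drops "|" and keeps every remaining element
via_subset <- x[x != "|"]

print(via_setdiff)
print(via_subset)
```

If the extracted rows can contain the same figure twice, the subsetting form is likely the safer choice.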

data

test2 <- list(c('|', 'January', 'test'), c('February', '2, 602,33', '|'))
akrun