I have a delimited file that has errors when it was dumped to a flat file. Multiple records where wrapped and the overflow was saved as a new line.
Table A
has dimensionsn x 25
- Every record is enclosed by
"
- Some records split into 3 lines of 8, 17 and 3 columns.
Visuals for better understanding what I'm trying to explain:
"|A|B|C|D|E|F|...|X|Y|"
[1] "|1|2|3|2|1|1|...|4|1|"
[2] "|2|3|4|4|5|1|...|0|0|"
[3] "|8|7|6|7|...
[4] |7|9|1|....
[5] |2|3|7|"
Notice what should've been TableA[3] is now split into rows 3 through 5 and only rows 3 and 5 are quoted.
I've used read_delim from the package readr
with the following parameters
##Attempt 1
read_delim("data/TableA.txt",delim = "|",col_names = T)
## Strangely resulting in an nx1 table, without any parsing done.
##Attempt 2
read_delim("data/TableA.txt",delim = "|",col_names = T,quote="")
## Results in a parsed nx25 tibble like this:
\"A B C D E F ... X Y\"
\"1 2 3 2 1 1 ... 4 1\"
\"2 3 4 4 5 1 ... 0 0\"
\"8 7 6 7
7 9 1
2 3 7\" "
problems(TableA) ##outputs:
# A tibble: 6 x 5
row col expected actual file
<int> <chr> <chr> <chr> <chr>
1 26 NA 26 columns 8 columns 'data/TableA.txt'
2 27 NA 26 columns 17 columns 'data/TableA.txt'
3 28 NA 26 columns 3 columns 'data/TableA.txt'
4 160 NA 26 columns 8 columns 'data/TableA.txt'
5 161 NA 26 columns 17 columns 'data/TableA.txt'
6 162 NA 26 columns 3 columns 'data/TableA.txt'
Note the q-marks at the start and end of the first and last variable names respectively and the same for its values. When i tried quote='"'
or quote="\""
got back the same results as Attempt 1
.
I would like to know if there's a way to solve this by parametrizing
readr::read_delim
correctly
EDIT:
This is the output for dput(readLines("data/TableA.txt"))
I "hashed" some of the data for privacy reasons.
Element 1 and 5 are correct in form. 2 -> 4 present the problem as described.
c("\"8x9|x|x|x|x|x|x|47|SDPA|Colmados|COMERCIAL||||||Unknown|Unknown|Unknown|HAINA|SANTO DOMINGO|||47|809|x\"",
"\"8x9|x|x|x|SECUNDARIO|x|x|15",
"|SDPA|x|||x||0x2xx8|xxx6|Unknown|Unknown|Unknown|xS|SxxGO|||15",
"|8x9|xx4\"", "\"809|3xx00|xx|Sxx|PRINCIPAL|DISTRITO NACIONAL|xxxx|86|SDPA|Bexxs|COMERCIAL||vexsxxcom|www.axx.do|00x54|1xx-7|$0 - $5M|0 - 25|$0 - $500K|LOxS|Santo Domingo|||86|8xx9|33xx0\""
)