Importing first three and last three fields from CSV with variable number of fields

Question

I've got a data set in csv. Unfortunately each line has different amount of "," commas. I am interested in importing only first 3 and last 3 variables from the file in R.

in example:

> line: "A","B","C","D",...,"X",Y","Z"

I want to achieve the following `

> line: "A","B","C","X","Y","Z"

I tried to use grep, to find - by using of regural expressions - first 3 variables:

new_data <- grep("([^,]+)(,[^,]+){2}", dataset, values=TRUE)

After that operation it shows me all lines in which that expression exists.

How can I remove the following variables in the line using grep, if it is possible, how can I remove the whole interval (each variable from <3;n-3>).

Do you now other method to solve that problem?

"Unfortunately each line has different amount of commas/fields". This can be called a "ragged" file or "variable number of fields". — smci, Oct 27 '16 at 10:52
Do you want a solution inside R, or a command-line one, as @VarunM gives? — smci, Oct 27 '16 at 10:53
Inside R solution will be cool, I am trying now to implement "awk" solution provide by Varun. — Meyk, Oct 27 '16 at 10:54
Sorry, I saw grep and assumed that command line solutions would be ok. I don't even follow `r` on StackOverflow. I hope my answer helps you though! — Chem-man17, Oct 27 '16 at 10:58
Thank you for your solution ;) I tried to implement it in R, but i received the following error: "Error in system(awk_call, intern = intern) : 'awk' not found" For now I don't know why, but when i will figure it out I will share that knowledge — Meyk, Oct 27 '16 at 11:00
The R approach is `textConnection -> string operations -> read.csv(data=...)` — smci, Oct 27 '16 at 11:04
Thanks both of you. I would like to give you +rep, but I cannot, cause of insufficient amount of mine. — Meyk, Oct 27 '16 at 11:08
@Meyk, your problem is already solved I'm sure but I thought it might help you to know how to use the `awk` solution using the `system` command as well. Check out the edit in my question. — Chem-man17, Oct 30 '16 at 12:53

h3rm4n · Accepted Answer · 2016-10-27T11:33:57.477

4

Using a combination of apply and head and tail:

d2 <- data.frame(t(apply(d1, 1, function(x) c(head(x[x != ''],3), tail(x[x != ''],3)))))

resulting in:

> d2
  X1 X2 X3 X4 X5 X6
1  a  b  c  x  y  z
2  a  b  c  g  h  i
3  a  b  c  t  u  v

Using the data of @VarunM:

d1 <- read.csv(text='a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
a, b, c, d, e, f, g, h, i
a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v', header = FALSE, fill = TRUE)

edited Oct 27 '16 at 11:33

answered Oct 27 '16 at 11:11

h3rm4n

4,126
15
21

@VarunM used your data for conveniance ;-) – h3rm4n Oct 27 '16 at 11:35

Chem-man17 · Answer 2 · 2016-10-30T12:51:58.687

Try awk-

awk -F, '{print $1, $2, $3, $(NF-2), $(NF-1), $(NF)}' file

-F, changes the field separator to a comma.

NF is the last field in the dataset. NF-1 and NF-2 are obvious.

I made a sample file-

$cat file.csv
a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z 
a, b, c, d, e, f, g, h, i
a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v

On doing-

$awk -F, '{print $1, $2, $3, $(NF-2), $(NF-1), $(NF)}' file.csv

Output-

a  b  c  x  y  z 
a  b  c  g  h  i
a  b  c  t  u  v

Edit

This solution works perfectly well if you do the following-

> system('awk -F, \'{print $1, $2, $3, $(NF-2), $(NF-1), $(NF)}\' file.csv')

Where file.csv is the file containing the data.

score 1 · Answer 3 · answered Oct 27 '16 at 11:01

Command-line solutions will be easier, but if you want a solution inside R, in R code, then filter the incoming CSV through a textConnection and use string operations or regex to extract the first and last three fields:

csvConn <- textConnection('your.csv')
<use string operations or regex to extract the first and last three fields>
read.csv(data = csvFixed, ...)

This works, I've done it before. See other similar solutions involving textConnection and read.csv(data=...). I couldn't find a very clean example though.

Importing first three and last three fields from CSV with variable number of fields

3 Answers3

Edit