-1

I've encountered a simple issue on a dataset i'm working on right now that uses written text a you would see much of social media where people sensibly use commas on their writing process. The whole text is in column 1 on the dataset followed by a date column and so on. The data are in .xls format, separated by commas, and each cell is then placed inside parantheses. It would look like this:

"Come and get around, we have ice cream!", "2021-02-02", "lorem ipsum"

Using comma as the separator yields one extra column than it should have.

I used the normal read table function and couldn't wrap my head if i needed to use a regex or where would i put it.

Any tips are apreciated!

EDIT:

Here's an example of the dataset and the simple code I ran

These are the first two lines of the raw xls:

"Text","Time of posting","Reach","Comments"
"Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur?","2020-11-15T18:23:32","28360","5689"

Using the import tool on Rstudio for xls gave me no options for separators, so I used read.table and got the same dataset on .csv, the code was as follows:

                 header = TRUE,
                 sep=',',
                 skip= 5)´´´

It resulted in every single comma generating a new a new column, when what i actually want is just for commas outside the parentheses to generat new columns.
Hayguneys
  • 23
  • 4
  • Is there a way for you to share a small sample file with us ? – Ronak Shah Sep 18 '21 at 02:48
  • Yes! I'm working on it right now, I rushed this post out the door, I'll edit a table example and some simple code I used trying to get it read. – Hayguneys Sep 18 '21 at 04:19
  • added a sample of the first two lines with some ipsum lorem to protect ppls ID :) – Hayguneys Sep 18 '21 at 06:08
  • you should also include sample lines that have commas inside brackets. also share the syntax you are using to read the files, respond to answers with screenshots of the error if their solution does not solve the issue. – Macosso Sep 18 '21 at 07:32

1 Answers1

2

if you have the data in excel format

you can use the read_excel() function from readxl package by default it treats everything inside the parentheses as string

library(readxl)
read_excel("C:/Users/User/Google Drive/Trading/Test.xls") # do not use ```sep``` argument

# A tibble: 4 x 4
  A             B                   C     D       
  <chr>         <chr>               <chr> <chr>   
1 awsdf         (Alternativa, hoje) Tod   XLLLsss 
2 hoj           as                  aqwe  was     
3 hey           hello               world hurry up
4 (trust, code) check               hoj   hun  

then you can use gsub function to remove commas in the text

if you have the data in .csv format you need to use read.csv instead of read.table and do not specify the sep argument

read.table("C:/Users/User/Google Drive/Trading/Test.csv")
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  : 
  line 2 did not have 2 elements
read.csv("C:/Users/User/Google Drive/Trading/Test.csv")
              A                   B     C        D
1         awsdf (Alternativa, hoje)   Tod  XLLLsss
2           hoj                  as  aqwe      was
3           hey               hello world hurry up
4 (trust, code)               check   hoj      hun
Macosso
  • 1,352
  • 5
  • 22