I have a Spark DataFrame TABLE1 with a single column of 100,000 rows, each containing a string of identical length:
AA105LONDEN 03162017045262017 16953563ABCDEF
and I would like to split each row into multiple columns at the positions marked below, with the whitespace characters kept as their own columns:
AA|105|LONDEN| |0316|2017|04526|2017| |1695|3563|ABCDEF
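In other words, the fixed-width layout I am after (character positions inferred from the sample row above) is:

    positions  1-2    AA
    positions  3-5    105
    positions  6-11   LONDEN
    position   12     (space)
    positions 13-16   0316
    positions 17-20   2017
    positions 21-25   04526
    positions 26-29   2017
    position   30     (space)
    positions 31-34   1695
    positions 35-38   3563
    positions 39-44   ABCDEF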
Currently I am trying to use the DBI package with code that looks like the following:
library(DBI)
newdataframe <- dbGetQuery(sc,"INSERT INTO TABLE2
(COLUMN1, COLUMN2, COLUMN3, COLUMN4)
SELECT SUBSTR(TRIM(COLUMN_NAME),1,3),
SUBSTR(TRIM(COLUMN_NAME),4,8),
SUBSTR(TRIM(COLUMN_NAME),9,12),
SUBSTR(TRIM(COLUMN_NAME),12,15)
FROM TABLE1
WHERE COLUMN_NAME IS NOT NULL")
This, however, does not seem to work. In addition, even if it did, dbGetQuery just returns an R data frame, and I need the result to remain a Spark DataFrame. I am looking for either an alternative suggestion other than SQL, or the proper syntax that will work with Spark and return a parsed Spark DataFrame that I can then do ML on. Any ideas?
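For context, the direction I was hoping for is something like the sparklyr/dplyr sketch below (only the first few fields, with positions taken from the sample row above and placeholder column names; sc is my existing Spark connection). My understanding is that substr() inside mutate() on a Spark table gets translated to Spark SQL, so the parsing would run in Spark and the result would stay a Spark DataFrame, but I am not sure this is the right approach:

    library(sparklyr)
    library(dplyr)

    # Reference to the Spark table; nothing is collected into R here
    tbl1 <- tbl(sc, "TABLE1")

    table2 <- tbl1 %>%
      filter(!is.na(COLUMN_NAME)) %>%
      mutate(
        COLUMN1 = substr(COLUMN_NAME, 1, 2),    # "AA"
        COLUMN2 = substr(COLUMN_NAME, 3, 5),    # "105"
        COLUMN3 = substr(COLUMN_NAME, 6, 11),   # "LONDEN"
        COLUMN4 = substr(COLUMN_NAME, 12, 12)   # " "
        # ... remaining fixed-width fields in the same way
      )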
Thanks in advance.