0

I have a spark dataframe TABLE1 with one column with 100000 rows each contains a string of the identical length

 AA105LONDEN    03162017045262017 16953563ABCDEF

and I would like to separate each row into multiple columns based on the lines separated below including the white space as separate columns.

 AA|105|LONDEN|    |0316|2017|04526|2017| |1695|3563|ABCDEF

Currently I am trying to use the DBI package as with code that looks like the following:

 library(DBI)

 newdataframe <- dbGetQuery(sc,"INSERT INTO TABLE2
 (COLUMN1, COLUMN2, COLUMN3, COLUMN4)
  SELECT SUBSTR(TRIM(COLUMN_NAME),1,3),
           SUBSTR(TRIM(COLUMN_NAME),4,8),
           SUBSTR(TRIM(COLUMN_NAME),9,12),
           SUBSTR(TRIM(COLUMN_NAME),12,15)
  FROM TABLE1
  WHERE COLUMN_NAME IS NOT NULL")

I have a spark dataframe TABLE1 with one column with 100000 rows each contains a string of the identical length

 AA105LONDEN    03162017045262017 16953563ABCDEF

and I would like to separate each row into multiple columns based on the lines separated below including the white space as separate columns.

 AA|105|LONDEN|    |0316|2017|04526|2017| |1695|3563|ABCDEF

Currently I am trying to use the DBI package as with code that looks like the following:

 library(DBI)

 newdataframe <- dbGetQuery(sc,"INSERT INTO TABLE2
 (COLUMN1, COLUMN2, COLUMN3, COLUMN4)
  SELECT SUBSTR(TRIM(COLUMN_NAME),1,3),
           SUBSTR(TRIM(COLUMN_NAME),4,8),
           SUBSTR(TRIM(COLUMN_NAME),9,12),
           SUBSTR(TRIM(COLUMN_NAME),12,15)
  FROM TABLE1
  WHERE COLUMN_NAME IS NOT NULL")

This however does not seem to work. In addition even if it did it just returns a R dataframe and I need it to work in a spark dataframe. I am looking for an alternative suggestion other then SQL or for the proper syntax that will work with spark and return a parsed Spark dataframe that I can then do ML on. Any ideas.

Thanks in advance.

Levi Brackman
  • 325
  • 2
  • 17

1 Answers1

1

Us this to create Dataframe. It will give correct result.

    sqlContext.sql("select substr(COLUMN_NAME,1,2),
substr(COLUMN_NAME,3,3),
substr(COLUMN_NAME,6,6),
substr(COLUMN_NAME,12,4),
substr(COLUMN_NAME,16,4),
substr(COLUMN_NAME,20,4), 
substr(COLUMN_NAME,24,5) ,
substr(COLUMN_NAME,29,4) ,
substr(COLUMN_NAME,33,1) ,
substr(COLUMN_NAME,34,4) ,
substr(COLUMN_NAME,38,4) ,
substr(COLUMN_NAME,42,6) 
TABLE1
WHERE COLUMN_NAME IS NOT NULL")
Ashish Singh
  • 523
  • 3
  • 14