You can use a custom reader with the spark_read() function, introduced in sparklyr 1.3.0 (see the API reference). Let's see an example. Suppose you have two files:
sample1.csv contains:
# file 1 skip line 1
# file 1 skip line 2
header1,header2,header3
row1col1,row1col2,1
row2col1,row2col2,1
row3col1,row3col2,1
sample2.csv contains:
# file 2 skip line 1
# file 2 skip line 2
header1,header2,header3
row1col1,row1col2,2
row2col1,row2col2,2
row3col1,row3col2,2
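If you want to follow along, one way to create these two files in your working directory is with base R (a convenience snippet, using the file contents shown above):
# Write the two sample files into the current working directory
sample1 <- c(
  "# file 1 skip line 1",
  "# file 1 skip line 2",
  "header1,header2,header3",
  "row1col1,row1col2,1",
  "row2col1,row2col2,1",
  "row3col1,row3col2,1"
)
sample2 <- c(
  "# file 2 skip line 1",
  "# file 2 skip line 2",
  "header1,header2,header3",
  "row1col1,row1col2,2",
  "row2col1,row2col2,2",
  "row3col1,row3col2,2"
)
writeLines(sample1, "sample1.csv")
writeLines(sample2, "sample2.csv")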
The following code reads the files from the local filesystem, but the same approach can be applied to an HDFS source.
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local")
paths <- paste("file:///",
list.files(getwd(), pattern = "sample\\d", full.names = TRUE),
sep = "")
paths
The paths must be absolute; in my example: "file:///C:/Users/erodriguez/Documents/sample1.csv" ...
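If the paste() call above does not yield absolute URIs on your system, a small sketch (assuming the sample files sit in the current working directory) is to build them with normalizePath(), which also keeps forward slashes on Windows:
# Resolve the files to absolute paths with forward slashes
local_files <- normalizePath(
  list.files(getwd(), pattern = "sample\\d", full.names = TRUE),
  winslash = "/"
)
# normalizePath() yields "C:/..." on Windows and "/..." on Unix;
# prepend the scheme so the result is always "file:///..."
paths <- paste0("file://",
                ifelse(startsWith(local_files, "/"), "", "/"),
                local_files)
paths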
Next, the schema is defined with the column names and data types. custom_csv_reader is the reader function: it receives a file URI and returns a data frame. The reader tasks are distributed across the Spark worker nodes. Note that the read.csv call uses skip = 2 to drop the first two comment lines.
# Column names and types expected in the resulting Spark DataFrame
schema <- list(name1 = "character", name2 = "character", file = "integer")
# Reader function: called on the workers with one file URI at a time,
# must return an R data frame matching the schema above
custom_csv_reader <- function(path) {
  read.csv(path, skip = 2, header = TRUE, stringsAsFactors = FALSE)
}
data <- spark_read(sc, path = paths, reader = custom_csv_reader, columns = schema)
data
Result:
# Source: spark<?> [?? x 3]
name1 name2 file
<chr> <chr> <int>
1 row1col1 row1col2 1
2 row2col1 row2col2 1
3 row3col1 row3col2 1
4 row1col1 row1col2 2
5 row2col1 row2col2 2
6 row3col1 row3col2 2
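Since data is a regular sparklyr table, the usual dplyr verbs work on it. For example, a quick sanity check that each source file contributed three rows, followed by closing the connection when you are done:
# Count rows per source file (executed in Spark via dplyr translation)
data %>%
  group_by(file) %>%
  summarise(rows = n())

spark_disconnect(sc)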