
I have a huge Parquet file that doesn't fit in memory (or on disk once read). Is there a way to use spark_read_parquet to read only the first n rows?

Jader Martins

1 Answer


This might be a hacky way, but

spark_read_parquet(..., memory=FALSE) %>% head(n)

seems to do the job for me.
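For context, here is a slightly fuller sketch of how that fits together. This is only an illustration: the local connection, the dataset name "big_data", and the Parquet path are placeholders, and the final collect() is only safe because it pulls just the first n rows (not the whole file) into the R session.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "local")  # placeholder connection

    n <- 5  # however many rows you need

    first_rows <- spark_read_parquet(sc,
                                     name = "big_data",            # placeholder table name
                                     path = "path/to/file.parquet", # placeholder path
                                     memory = FALSE) %>%            # don't cache the whole table up front
      head(n) %>%    # translated to a LIMIT, so only n rows are computed
      collect()      # brings just those n rows into the R session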

  • With this approach the memory parameter should be set to FALSE, and the result must then be piped to collect() to bring it into memory. This solution works perfectly. My real problem was that Spark was creating its scratch storage in a folder without enough capacity, so I set spark-defaults.conf to store that data in /tmp (see the config sketch after these comments). – Jader Martins Oct 18 '17 at 10:46
  • 1
    Ah, yes, you're right about setting `memory=FALSE`. However, when you pipe the result to `collect()`, you will collect to the driver rather than keep the data distributed in the cluster. Not sure where your data is stored but if it's in AWS S3 buckets, you might also be able to leverage the [Amazon Athena](https://aws.amazon.com/athena/) service. You can query Athena from R using a JDBC connection. – Oldřich Spáčil Oct 18 '17 at 11:11