I have a huge Parquet file that doesn't fit in memory, nor on disk when read. Is there a way to use spark_read_parquet to only read the first n lines?

Jader Martins
1 Answer
This might be a hacky way, but
`spark_read_parquet(..., memory=FALSE) %>% head(n)`
seems to do the job for me.
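For completeness, here is a minimal sketch of the whole pipeline. The master URL, table name, and file path are placeholders to adapt:

```r
library(sparklyr)
library(dplyr)

# Connect to Spark; "local" is a placeholder -- use your cluster's master URL.
sc <- spark_connect(master = "local")

# memory = FALSE registers the Parquet file lazily instead of caching it,
# so Spark does not try to pull the whole dataset into memory.
big_tbl <- spark_read_parquet(sc, name = "big",
                              path = "/path/to/file.parquet",
                              memory = FALSE)

# head() is translated to a LIMIT, so only the first n rows are computed;
# collect() then brings just those rows into the R session.
first_rows <- big_tbl %>% head(1000) %>% collect()
```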

Oldřich Spáčil
With this approach the memory parameter should be set to FALSE, and the result must then be piped to collect() to bring it into memory. This solution works perfectly. My real problem was that Spark was creating its storage in a folder without enough capacity, so I set spark-defaults.conf to save the data in /tmp. – Jader Martins Oct 18 '17 at 10:46
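If scratch space is the issue, as in the comment above, the spill directory can also be redirected from R via the sparklyr config instead of editing spark-defaults.conf; spark.local.dir is the standard Spark property, and the path below is just an example:

```r
library(sparklyr)

config <- spark_config()
# Point Spark's scratch/shuffle storage at a volume with enough capacity.
# Equivalent to "spark.local.dir /tmp" in spark-defaults.conf.
config$spark.local.dir <- "/tmp"

sc <- spark_connect(master = "local", config = config)
```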
Ah, yes, you're right about setting `memory=FALSE`. However, when you pipe the result to `collect()`, you will collect to the driver rather than keep the data distributed in the cluster. Not sure where your data is stored, but if it's in AWS S3 buckets, you might also be able to leverage the [Amazon Athena](https://aws.amazon.com/athena/) service. You can query Athena from R using a JDBC connection. – Oldřich Spáčil Oct 18 '17 at 11:11
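A rough sketch of the Athena-from-R idea, assuming the Athena JDBC driver jar has been downloaded locally; the driver class name, JDBC URL, bucket paths, and table name are all assumptions to adapt to your setup:

```r
library(DBI)
library(RJDBC)

# Driver class and jar path are assumptions -- check the Athena JDBC driver docs.
drv <- JDBC(driverClass = "com.simba.athena.jdbc.Driver",
            classPath = "/path/to/AthenaJDBC.jar")

con <- dbConnect(drv,
                 "jdbc:awsathena://AwsRegion=us-east-1;",        # placeholder region
                 S3OutputLocation = "s3://my-query-results/",    # placeholder bucket
                 user = Sys.getenv("AWS_ACCESS_KEY_ID"),
                 password = Sys.getenv("AWS_SECRET_ACCESS_KEY"))

# Athena scans the Parquet files in place, so only the first n rows are returned.
first_rows <- dbGetQuery(con, "SELECT * FROM my_table LIMIT 1000")
```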