
I set up a standalone Spark cluster (with Cassandra). The cluster has 3 nodes, and each node has 64 GB of RAM and 20 cores. How can I write the data I read from Cassandra with Spark to a CSV file when running in standalone mode?

I am getting this error:

Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.csv/_temporary/0/_temporary/attempt_201603031703_0001_m_000000_5

From my research, I think this is a permissions issue, but I don't know how to verify or fix that.
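
For reference, here is a minimal sketch of the kind of job described above. This is an assumed reconstruction, not the actual code: the Cassandra host, keyspace, table, and output path are placeholders, and the spark-cassandra-connector package is assumed to be on the classpath.

```python
# Minimal sketch (assumed setup): read a Cassandra table through the
# spark-cassandra-connector and write it back out as CSV. The host,
# keyspace, table, and output path below are all placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-to-csv")
    .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder
    .getOrCreate()
)

df = (
    spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")  # placeholders
    .load()
)

# This is the step that raises the Mkdirs error above: every task tries to
# create _temporary subdirectories under the output path before writing.
df.write.mode("overwrite").option("header", True).csv("file:///opt/folder/tmp/file.csv")
```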

  • What is the error you are facing? – s510 Sep 06 '22 at 14:13
  • @the_ordinary_guy Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.csv/_temporary/0/_temporary/attempt_201603031703_0001_m_000000_5 – murzade Sep 06 '22 at 14:16
  • Do you have any HDFS- or S3-based file system set up? – s510 Sep 06 '22 at 14:35
  • Actually, I hadn't looked into that. How can I check the file system? – murzade Sep 06 '22 at 14:39
  • You don't need either. You can use `file:///` based paths, which, from your error, sounds like the default. Please edit the question to include your code – OneCricketeer Sep 06 '22 at 18:05
  • Welcome to Stack Overflow! A friendly note on how to ask good questions. The general guidance is that you (a) provide a good summary of the problem that includes software/component versions, the full error message + full stack trace; (b) describe what you've tried to fix the problem, details of investigation you've done; and (c) minimal sample code that replicates the problem. Cheers! – Erick Ramirez Sep 07 '22 at 00:46
  • @OneCricketeer After reading the data from Cassandra I can display it with `data_frame.show()`, but when I try to write it locally with `data_frame.write.csv("file:///home/pc3/Desktop/try.csv")` I get the same error I mentioned above (Mkdirs failed to create file). I found the same problem in [this post](https://stackoverflow.com/questions/40786093/writing-files-to-local-system-with-spark-in-cluster-mode). The post owner solved his problem, but when I tried his solution I got the same errors. He uses YARN and I don't, but I'm not sure that matters. – murzade Sep 07 '22 at 13:44
  • I don't think YARN is relevant to the answer there. The answer is saying the Unix permissions for the output folder were not allowing the Spark process to write there. So `chmod -R 777 /opt/folder/tmp` could work for you here. Or, if you want to write to your home folder, ensure Spark is running as the pc3 user – OneCricketeer Sep 07 '22 at 14:40
  • @OneCricketeer Thank you for the advice, but it still hasn't solved my problem. For your first suggestion, `chmod -R 777 /opt/folder/tmp`, I set the permissions, but Spark creates a directory (not a file) named file.csv, and it is empty. In the error message, **Caused by: java.io.IOException: Mkdirs failed to create file:/opt/folder/tmp/file.csv/_temporary/0/_temporary/attempt_201603031703_0001_m_000000_5**, the /_temporary/0/_temporary/attempt_201603031703_0001_m_000000_5 part caught my attention; I think there may be a problem creating the temporary directories – murzade Sep 08 '22 at 06:17
  • @OneCricketeer But I don't know how to solve that, if that is the problem. As for your second suggestion, **ensure Spark is running as pc3 user**: I am running Spark on pc3, which is my master node, so I think that's fine. Is there another way to verify it? Am I missing something? – murzade Sep 08 '22 at 06:21
  • 1) `-R` means recursive on all subfolders. 2) You run `spark-submit` as some user account; only the `pc3` user (and root) has access to write to `file:///home/pc3`. – OneCricketeer Sep 08 '22 at 21:50
  • @OneCricketeer Also, there is no write access between the worker nodes (pc1 and pc2). So when I write to pc3, an error occurs because pc1 and pc2 have no write access; when I write to pc1, an error occurs because pc2 has no access. I want to use all of my machines' cores for fast reading (standalone cluster mode). How can I achieve that? – murzade Sep 09 '22 at 06:18
  • @OneCricketeer I found a technique that works for me, but I'm not sure it's really a suitable solution: after reading the data from Cassandra, instead of saving it as CSV (which causes the error), I first convert it to a pandas DataFrame, and then I can save it without error. But this conversion takes extra time. My most important goal in this project is to reduce the time it takes to read data from Cassandra and save it to CSV, which is why I chose Spark, based on suggestions online. But apparently, in terms of time, there is no difference between plain Cassandra and Spark + Cassandra with this technique – murzade Sep 09 '22 at 06:44
  • @OneCricketeer Do you have any suggestions? – murzade Sep 09 '22 at 06:47
  • If you have a cluster, then you cannot easily use `file:///` paths. You need to set up HDFS, for example. Otherwise, your file will only be written on the local driver (the machine where you start `spark-submit --deploy-mode=client`) – OneCricketeer Sep 09 '22 at 16:31
  • But yes, Spark will (probably? I haven't used it with Cassandra) only make one client connection to your database, then return a dataframe from that data. You can call `repartition` on that dataframe to split it between executors, but calling `write` with a `file:///` path, or `.toPandas()`, or simply `.collect()` and pulling all data from all executors to one driver is not how you should be using Spark (see the sketch after this thread). – OneCricketeer Sep 09 '22 at 16:40
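
To make the trade-off in the comments above concrete, here is a minimal sketch of both approaches. It assumes `df` is the DataFrame read from Cassandra; every path, host, and partition count below is a placeholder.

```python
# Minimal sketch of the two approaches discussed above. `df` is assumed to
# be the DataFrame read from Cassandra; all paths and hosts are placeholders.

# 1) Driver-side workaround (the technique murzade describes): collect all
#    rows to the driver and save with pandas. It avoids the Mkdirs error
#    because only one process writes, but it funnels the entire dataset
#    through a single machine.
df.toPandas().to_csv("/home/pc3/Desktop/try.csv", index=False)

# 2) Distributed write (what the comments recommend instead): each executor
#    writes its own partition as a part file, so the output path must live
#    on a filesystem every node can reach, e.g. HDFS or a shared mount.
(
    df.repartition(8)  # placeholder partition count; spreads work across executors
      .write.mode("overwrite")
      .option("header", True)
      .csv("hdfs://namenode:8020/tmp/file.csv")  # placeholder HDFS URI
)
```

The second form is the idiomatic one, but it presupposes a shared filesystem, which is exactly what this standalone cluster lacks.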

0 Answers