
I'm trying to run my Spark program using the spark-submit command (I'm working with Scala). I specified the master address, the class name, the jar file with all dependencies, the input file, and the output file, but I'm getting an error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Multiple sources found for csv (org.apache.spark.sql.execution.datasources.v2.csv.CSVDataSourceV2, org.apache.spark.sql.execution.datasources.csv.CSVFileFormat), please specify the fully qualified class name.;

What is this error about? How can I fix it?


Thank you

  • How did you run your job? Can you share the dependencies or the pom.xml file too? – koiralo Jan 03 '21 at 10:20
  • Are you running a fat jar file? Also mention whether you are running this in a Windows or Linux environment. If so, it will be like this: `./spark-submit your-fat-jarfile.jar`. Also check whether your folder has the appropriate permissions for file read or write. – Kaviranga Jan 03 '21 at 10:31
  • Yes, I'm in the right folder, and yes, I mentioned the jar file in the spark-submit command – amelie Jan 03 '21 at 10:39
  • My pom.xml is too long; I cannot share it in a comment – amelie Jan 03 '21 at 10:43
  • Check the list of jars; you might have different versions of spark-csv jars in the classpath – koiralo Jan 03 '21 at 10:44
  • No, it is only one jar file with all dependencies: target/sample-1.0-SNAPSHOT-jar-with-dependencies.jar. I created it using the mvn package command. I think it is a version problem too – amelie Jan 03 '21 at 10:46
  • See [this question](https://stackoverflow.com/questions/50884599/apache-spark-2-0-pyspark-dataframe-error-multiple-sources-found-for-csv). It's likely that you have multiple versions of Spark in the class path. – mck Jan 03 '21 at 11:02
  • Also try this solution: [DataFrame Error Multiple sources found for csv](https://stackoverflow.com/questions/50884599/apache-spark-2-0-pyspark-dataframe-error-multiple-sources-found-for-csv). This will be helpful – Kaviranga Jan 03 '21 at 11:10
  • Which Spark version do you use? Check the dependencies with `mvn dependency:tree` - as already mentioned, you have some dependency issue. Either you import another Spark lib that comes with its own CSV DataSource, or you have multiple Spark libs, which would be weird. Also, in the fat jar, set the dependency scope of all Spark libs to `provided` - obviously, you don't have to put those into the fat jar, given that your Spark cluster setup already has all of them. – UninformedUser Jan 03 '21 at 11:15
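
As the comments point out, the proper fix is to clean up the dependency conflict (one Spark version on the classpath, Spark libraries marked as `provided` in the fat jar). The error message itself also suggests a workaround: name the data source class explicitly instead of the ambiguous short name `csv`. A minimal Scala sketch of that workaround, assuming the input file has a header row and its path is passed as the first program argument:

```scala
import org.apache.spark.sql.SparkSession

object CsvReadWorkaround {
  def main(args: Array[String]): Unit = {
    // App name is arbitrary; the session configuration comes from spark-submit.
    val spark = SparkSession.builder().appName("csv-read-workaround").getOrCreate()

    // Instead of the ambiguous short name "csv", name one implementation explicitly
    // (the class is the one listed in the error message), so Spark does not have to
    // choose between the two CSV sources it found on the classpath.
    val df = spark.read
      .option("header", "true") // assumption: the input file has a header row
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .load(args(0))            // input path passed as the first program argument

    df.show(5)
    spark.stop()
  }
}
```

This only sidesteps the ambiguity; removing the duplicate Spark jars from the fat jar is still the real fix.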

1 Answer


In your output there are also some warnings.

If you run your fat jar correctly, with the correct permissions, you should get output like this from ./spark-submit:

[Screenshot: output of a successful spark-submit run of a CSV-processing fat jar]

Check whether the environment variables for Spark are set correctly (in ~/.bashrc). Also check the source CSV file permissions; maybe that is the problem.
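
For reference, a minimal sketch of what the Spark entries in ~/.bashrc typically look like; the install path is an assumption and should match your machine:

```bash
# Assumed install location; adjust SPARK_HOME to wherever Spark is unpacked on your machine.
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin"
```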

If you are running in a Linux environment, set the folder permissions for the source CSV folder as follows:

sudo chmod -R 777 /source_folder

After that, try to run ./spark-submit with your fat-jar file again.
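
For completeness, a sketch of such a spark-submit invocation; the master URL, main class, and input/output paths are placeholders, and the jar name is the one mentioned in the comments above:

```bash
# Placeholders: adjust the master URL, main class, and paths to your setup.
./spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MainApp \
  target/sample-1.0-SNAPSHOT-jar-with-dependencies.jar \
  /path/to/input.csv /path/to/output
```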
