
I am trying to use SparkContext.wholeTextFiles to read some log files in a directory, and I get an error with the following configuration:

  • OS: Windows 10
  • Python version: 3.8.8
  • pyspark version: 3.1.2
  • Java JDK: 1.8.0_91
  • Hadoop version: 3.2.2
  • Spark: 3.1.2
  • Jupyter core: 4.7.1
  • Jupyter Notebook: 6.3.0

My simple code:

from pyspark import SparkContext
sc = SparkContext('local', 'Test')
load_files = sc.wholeTextFiles('E:\Sample')
load_files.take(5)

Error:

Py4JJavaError: An error occurred while calling o22.partitions.
: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:645)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:1230)
    at org.apache.hadoop.fs.FileUtil.list(FileUtil.java:1435)
    at org.apache.hadoop.fs.RawLocalFileSystem.listStatus(RawLocalFileSystem.java:493)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
    at org.apache.hadoop.fs.FileSystem$4.<init>(FileSystem.java:2072)
    at org.apache.hadoop.fs.FileSystem.listLocatedStatus(FileSystem.java:2071)
    at org.apache.hadoop.fs.ChecksumFileSystem.listLocatedStatus(ChecksumFileSystem.java:700)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:312)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:274)
    at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:52)
    at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(WholeTextFileRDD.scala:54)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:300)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:296)
    at org.apache.spark.api.java.JavaRDDLike.partitions(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.JavaRDDLike.partitions$(JavaRDDLike.scala:61)
    at org.apache.spark.api.java.AbstractJavaRDDLike.partitions(JavaRDDLike.scala:45)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)

Any suggestions?

Sina

1 Answer


The way you initialize your Spark session is very old; try this instead:

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .master('local[*]')
    .getOrCreate()
)

df = spark.read.text('E:\\Sample') # double backslashes here
df.show()
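
If you still need the (file path, file contents) pairs that wholeTextFiles returns, the same session gives you access to them as well. A minimal sketch, reusing the spark session built above and the E:\Sample folder from the question; wholeText=True tells the DataFrame reader to keep each file as a single row instead of splitting on newlines:

# one (path, content) pair per file, like SparkContext.wholeTextFiles
pairs = spark.sparkContext.wholeTextFiles('E:\\Sample')
print(pairs.take(1))

# or keep each file as a single row in a DataFrame
df = spark.read.text('E:\\Sample', wholeText=True)
df.show(truncate=False)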
pltc