I have a temp directory (created with tempfile.mkdtemp()) in an Azure Synapse notebook, where I make edits to a db file using sqlite3. When I try to copy the finished db file to mounted data lake storage like so:

    mssparkutils.fs.cp('file:' + dirpath + '/example_database.db', 'synfs:/' + x + f'/container/example_directory/example_database.db')

where x = mssparkutils.env.getJobId(), I receive this error:

Py4JJavaError: An error occurred while calling z:mssparkutils.fs.cp.
: org.apache.hadoop.fs.ChecksumException: Checksum error: file:/tmp/tmpcem75tu1/InventoryForecasting.db at 0
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:264)
    at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:300)
    at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:252)
    at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:197)
    at java.io.DataInputStream.read(DataInputStream.java:100)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:94)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
    at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:415)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:387)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at com.microsoft.spark.notebook.msutils.impl.MSFsUtilsImpl.cp(MSFsUtilsImpl.scala:247)
    at mssparkutils.fs$.cp(fs.scala:17)
    at mssparkutils.fs.cp(fs.scala)
    at sun.reflect.GeneratedMethodAccessor162.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
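
The trace passes through Hadoop's ChecksumFileSystem, which for file: paths verifies the data against a hidden sidecar named .<filename>.crc in the same directory when one exists. One thing that may be worth checking (an assumption, not a confirmed cause) is whether the temp directory holds such a sidecar left over from an earlier copy, since it would no longer match the db file after the sqlite3 edits. A minimal sketch, where find_crc_sidecars is a hypothetical helper and the demo file names are placeholders:

```python
import os
import tempfile


def find_crc_sidecars(directory):
    """Return the hidden '.<name>.crc' files found directly in `directory`."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.startswith(".") and name.endswith(".crc")
    )


# Demonstration on a throwaway directory with a fake sidecar file.
# In the notebook, the directory to inspect would be `dirpath` instead.
demo_dir = tempfile.mkdtemp()
open(os.path.join(demo_dir, "example_database.db"), "wb").close()
open(os.path.join(demo_dir, ".example_database.db.crc"), "wb").close()

print(find_crc_sidecars(demo_dir))
```

If the list is empty for the real temp directory, a stale sidecar is not the explanation and the cause lies elsewhere.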

I expected the file to copy, and the same method worked fine with both a .txt and an .xlsx file. I can also get the db file to copy as expected using adlfs with fsspec, though that requires put() rather than copy() because I'm copying from local storage to remote storage. There's also no problem copying a db file from remote storage to either local or remote storage, so I think the issue is specific to using mssparkutils to copy a db file out of this temp directory.
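
For reference, the adlfs route mentioned above can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code used: upload_db and make_adls_filesystem are hypothetical helpers, the table contents are made up, and the account name, credential, and remote path are placeholders.

```python
import os
import sqlite3
import tempfile


def upload_db(fs, local_path, remote_path):
    """Upload a local file with an fsspec-style filesystem (put, not copy)."""
    fs.put(local_path, remote_path)


def make_adls_filesystem(account_name, credential):
    """Build an adlfs filesystem; both arguments are placeholders."""
    import adlfs  # third-party; imported lazily so the rest runs without it

    return adlfs.AzureBlobFileSystem(account_name=account_name, credential=credential)


# Build a small SQLite file in a temp directory, as in the question.
dirpath = tempfile.mkdtemp()
local_db = os.path.join(dirpath, "example_database.db")
conn = sqlite3.connect(local_db)
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO t (name) VALUES ('a')")
conn.commit()
conn.close()  # close before copying so all changes are written to disk

# In the notebook (assumed names and paths):
# fs = make_adls_filesystem("<storage-account>", "<credential>")
# upload_db(fs, local_db, "container/example_directory/example_database.db")
```

put() is the local-to-remote call in fsspec, while copy() operates between two paths on the same filesystem, which is why put() is needed here.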