I have a 6 GB CSV file. So far I have been copying it to DBFS with the line below (java.nio). When I check the file's size on DBFS after the copy, it still shows 6 GB, so I assumed the copy was correct. But when I do spark.read.csv(samplePath), it reads only 18 million rows instead of 66 million.
Files.copy(Paths.get(_outputFile), Paths.get("/dbfs" + _outputFile))
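For context, the full copy code is roughly the sketch below (the value of _outputFile is an illustrative assumption; it is the CSV written on the driver's local disk):

import java.nio.file.{Files, Paths, StandardCopyOption}

// Assumed for illustration: the CSV sits on the driver's local filesystem
val _outputFile = "/databricks/driver/sampleData.csv"

// Copy from the driver's local disk to DBFS through the /dbfs FUSE mount
Files.copy(
  Paths.get(_outputFile),
  Paths.get("/dbfs" + _outputFile),
  StandardCopyOption.REPLACE_EXISTING
)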
So I tried dbutils to do the copy, as shown below, but it gives an error. I have added the dbutils Maven dependency and imported it in the object where I call this line. Is there anything else I need to change to use dbutils from Scala code that runs on Databricks?
dbutils.fs.cp("file:" + _outputFile, _outputFile)
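For reference, the wiring I currently have in that Scala object is roughly the following sketch (the dbutils-api coordinates and version are my assumption of what the Maven dependency should look like):

// Maven dependency (scope "provided"), roughly: com.databricks:dbutils-api_2.12:0.0.5
import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

// Copy the CSV from the driver's local filesystem to DBFS
dbutils.fs.cp("file:" + _outputFile, _outputFile)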
Databricks assumes that when you call spark.read.csv(path), the path is resolved on DBFS by default. How can I make it read the file from the driver node's local filesystem instead of DBFS? I suspect the copy is not actually transferring all rows, possibly because of a 2 GB size limit when using Java file I/O against /dbfs on Databricks.
Can I use this:
spark.read.csv("file:/databricks/driver/sampleData.csv")
Any suggestions on how to handle this?
Thanks.