
I am running the following code:

list_of_paths is a list of paths, each pointing to an Avro file. For example:

['folder_1/folder_2/0/2020/05/15/10/41/08.avro', 'folder_1/folder_2/0/2020/05/15/11/41/08.avro', 'folder_1/folder_2/0/2020/05/15/12/41/08.avro']

Note: the above paths are stored in Azure Data Lake Storage, and the process below is executed in Databricks.

spark.conf.set("fs.azure.account.key.{0}.dfs.core.windows.net".format(storage_account_name), storage_account_key)
spark.conf.set("spark.sql.execution.arrow.enabled", "false")
begin_time = time.time()

for i in range(len(list_of_paths)):

    try:
      read_avro_data,avro_decoded=None,None

      #Read paths from Azure Data Lake "abfss"
      read_avro_data=spark.read.format("avro").load("abfss://{0}@{1}.dfs.core.windows.net/{2}".format(storage_container_name, storage_account_name, list_of_paths[i]))

    except Exception as e:
      custom_log(e)

Schema

read_avro_data.printSchema()

root
 |-- SequenceNumber: long (nullable = true)
 |-- Offset: string (nullable = true)
 |-- EnqueuedTimeUtc: string (nullable = true)
 |-- SystemProperties: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- member0: long (nullable = true)
 |    |    |-- member1: double (nullable = true)
 |    |    |-- member2: string (nullable = true)
 |    |    |-- member3: binary (nullable = true)
 |-- Properties: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- member0: long (nullable = true)
 |    |    |-- member1: double (nullable = true)
 |    |    |-- member2: string (nullable = true)
 |    |    |-- member3: binary (nullable = true)
 |-- Body: binary (nullable = true)

Body is the actual content of the Avro file.

Number of rows and columns

print("ROWS: ", read_avro_data.count(), ", NUMBER OF COLUMNS: ", len(read_avro_data.columns))

ROWS:  2 , NUMBER OF COLUMNS:  6

What I want is to avoid reading one Avro file per iteration (2 rows of content per iteration). Instead, I want to read all the Avro files at once, so that my final Spark DataFrame has 2x3 = 6 rows of content.

Is this feasible with spark.read()? Something like the following:

spark.read.format("avro").load("abfss://{0}@{1}.dfs.core.windows.net/folder_1/folder_2/0/2020/05/15/*")

[Update] Sorry for the confusion about the wildcard (*). It assumes that all Avro files are in the same folder, but instead I have one folder per Avro file: 3 Avro files, 3 folders. In this case the wildcard won't work. The solution, as answered below, is to pass a list of path names.
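
A minimal sketch of what that looks like (assuming the same storage_account_name, storage_container_name, and list_of_paths variables as in the code above):

# Build fully qualified abfss URIs from the relative paths
full_paths = [
    "abfss://{0}@{1}.dfs.core.windows.net/{2}".format(
        storage_container_name, storage_account_name, path
    )
    for path in list_of_paths
]

# One load() call reads every file into a single DataFrame: 3 files x 2 rows = 6 rows
read_avro_data = spark.read.format("avro").load(full_paths)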

Thank you in advance for your help and advice.

NikSp

1 Answer


load(path=None, format=None, schema=None, **options) — this method accepts a single path or a list of paths.

For example, you can pass a list of paths directly, like below:

spark.read.format("avro").load(["/tmp/dataa/userdata1.avro","/tmp/dataa/userdata2.avro"]).count()

1998
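
Applied to the question's setup, a sketch (assuming list_of_paths already holds the fully qualified abfss URIs, as in the question's update above):

# All files are read in one job instead of one job per file
read_avro_data = spark.read.format("avro").load(list_of_paths)

# 3 files x 2 rows each = 6 rows in the combined DataFrame
print("ROWS: ", read_avro_data.count(), ", NUMBER OF COLUMNS: ", len(read_avro_data.columns))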

Srinivas
  • You can use a wildcard '*', which will automatically read all the Avro files in parallel. There won't be any performance issue: `spark.read.format('avro').load('python/test_support/sql/*')` – Rohith Rangaraju May 16 '20 at 07:53
  • Better to pass all paths to Spark; it will load the files in parallel. If you use a foreach loop, it will load the files in sequence. – Srinivas May 16 '20 at 07:57
  • Sorry, my bad. It worked perfectly, and it is actually much faster than before (in a loop). – NikSp May 16 '20 at 08:09
  • I have tested and updated the answer. Please accept the answer or upvote if it helps solve the problem. – Srinivas May 16 '20 at 08:13
  • Look at my code: `read_avro_data = spark.read.format("avro").load(list_of_paths)`, with `list_of_paths = ['abfss://name@name.dfs.core.windows.net/folder_1/folder_2/0/2020/05/16/00/00/08.avro', 'abfss://name@name.dfs.core.windows.net/folder_1/folder_2/0/2020/05/16/00/01/08.avro', 'abfss://name@name.dfs.core.windows.net/folder_1/folder_2/0/2020/05/16/00/02/08.avro']` – NikSp May 16 '20 at 08:13
  • @RohithRangaraju `spark.read.format('avro').load('python/test_support/sql/*')` does not fit my approach, since I need to explicitly read specific path names at once. The Avro files are not all in the same folder, but rather in different folders. That's why I believe your approach won't work here, even though it is totally correct! – NikSp May 16 '20 at 08:17
  • @NikSp Why don't you proceed with wildcard characters? You don't need to list all the paths at all. Since you are already connected to ADLS, you can simply hardcode up to the root folder and pass `*`; the root path is `abfss://{0}@{1}.dfs.core.windows.net/rootpath/*`. – Rohith Rangaraju May 16 '20 at 08:21
  • @RohithRangaraju I have one Avro file per folder, so 3 path names = 3 different folders. If you look closely, my 3 path names in the list are not the same but totally different. Look how many custom folders come before an Avro file: `'abfss://name@name.dfs.core.windows.net/folder_1/folder_2/0/2020/05/16/00/00/08.avro'` vs `'abfss://name@name.dfs.core.windows.net/folder_1/folder_2/0/2020/05/16/00/01/08.avro'`. Those two path names are different, as you can see. Although I will try your approach and update you. – NikSp May 16 '20 at 08:26
  • @RohithRangaraju Yeah, actually I tried a wildcard (*), but since I have one folder per Avro file and I want data from all the different folders (which are separated by year/month/day/hour/minute), I could use a wildcard only if I made a for loop over the possible combinations of month, day, hour, and minute. Since I want to avoid a for loop and optimize the reading of the Avro files, I used a list of multiple paths instead :) (see the sketch after these comments). – NikSp May 16 '20 at 20:35
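
For completeness, a sketch of how the list of paths over a time window could be generated without any per-file read loop. The paths_for_window helper and the fixed "08.avro" file name are hypothetical, modeled on the year/month/day/hour/minute folder layout shown in the question:

from datetime import datetime, timedelta

# Hypothetical helper: one candidate path per minute, following the
# folder_1/folder_2/0/<year>/<month>/<day>/<hour>/<minute>/08.avro layout
def paths_for_window(start, end, base):
    paths, t = [], start
    while t <= end:
        paths.append("{0}/{1:%Y/%m/%d/%H/%M}/08.avro".format(base, t))
        t += timedelta(minutes=1)
    return paths

base = "abfss://{0}@{1}.dfs.core.windows.net/folder_1/folder_2/0".format(
    storage_container_name, storage_account_name
)
list_of_paths = paths_for_window(datetime(2020, 5, 16, 0, 0), datetime(2020, 5, 16, 0, 2), base)

# Single parallel read over all generated paths; no per-file loop
read_avro_data = spark.read.format("avro").load(list_of_paths)

Note that load() fails if any generated path does not exist, so this assumes every minute in the window actually has a file.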