
I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error:

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;
AttachDistributedSequence[__index_level_0__#767L, _c0#734, carat#735, cut#736, color#737, clarity#738, depth#739, table#740, price#741, x#742, y#743, z#744] Index: __index_level_0__#767L
+- SubqueryAlias spark_catalog.default.diamonds
   +- Relation hive_metastore.default.diamonds[_c0#734,carat#735,cut#736,color#737,clarity#738,depth#739,table#740,price#741,x#742,y#743,z#744] csv

The table was created following the Databricks Quick Start notebook:

DROP TABLE IF EXISTS diamonds;
 
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

I'm trying to read the table with

import pyspark.pandas as ps
psdf = ps.read_table("hive_metastore.default.diamonds")

and get the error above.

Reading the table into spark.sql.DataFrame works fine with

df = spark.read.table("hive_metastore.default.diamonds")

The cluster versions are

Databricks Runtime Version 11.2
Apache Spark 3.3.0
Scala 2.12

I'm familiar with pandas already and would like to use pyspark.pandas.DataFrame since I assume it will have a familiar API and be quick for me to learn and use.

The questions I have:

  • What does the error mean?
  • What can I do to read the tables to pyspark.pandas.DataFrame?
  • Alternatively, should I just learn pyspark.sql.DataFrame and use that? If so, why?
Toivo Mattila
  • are you using shared cluster? – Alex Ott Oct 25 '22 at 07:20
  • I'm new to Databricks and don't know what that means. There's one cluster and more than 1 person in the company use that one, is that a shared cluster? – Toivo Mattila Oct 26 '22 at 08:42
  • there are different policies: https://docs.databricks.com/clusters/create-cluster.html#what-is-cluster-access-mode – Alex Ott Oct 26 '22 at 09:01
  • @AlexOtt I just ran into this, too. If I use a cluster in Single User mode, ps.read_table works. If I use a shared cluster, I get the "UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;" error. But I'm still unclear as to why. Could you elaborate? Thanks! – mmarie Jan 13 '23 at 21:28
  • For now I'm just going to work around and load a spark data frame and and then use .toPandas(). I understand that loads it to memory, but that is ok in this case as it's fairly small. Is there a better way to handle this? – mmarie Jan 13 '23 at 21:43

1 Answer


AttachDistributedSequence is a special plan extension used by the pandas API on Spark to create a distributed default index. Right now it's not supported on Shared clusters enabled for Unity Catalog because of the restricted set of operations allowed on such clusters. The workarounds are:

  • Use single-user Unity Catalog enabled cluster
  • Read the table using the Spark API, then use the pandas_api function (doc) to convert it into a pandas-on-Spark DataFrame (in Spark 3.2.x/3.3.x it's called to_pandas_on_spark (doc)):
pdf = spark.read.table("abc").pandas_api()

P.S. It's not recommended to use .toPandas as it will pull all data to the driver node.

Alex Ott