
I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error:

AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;
AttachDistributedSequence[__index_level_0__#767L, _c0#734, carat#735, cut#736, color#737, clarity#738, depth#739, table#740, price#741, x#742, y#743, z#744] Index: __index_level_0__#767L
+- SubqueryAlias spark_catalog.default.diamonds
   +- Relation hive_metastore.default.diamonds[_c0#734,carat#735,cut#736,color#737,clarity#738,depth#739,table#740,price#741,x#742,y#743,z#744] csv

The table was created following the Databricks Quick Start notebook:

DROP TABLE IF EXISTS diamonds;
 
CREATE TABLE diamonds
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", header "true")

I'm trying to read the table with

import pyspark.pandas as ps
psdf = ps.read_table("hive_metastore.default.diamonds")

and get the error above.

Reading the table into spark.sql.DataFrame works fine with

df = spark.read.table("hive_metastore.default.diamonds")

The cluster versions are

Databricks Runtime Version 11.2
Apache Spark 3.3.0
Scala 2.12

I'm familiar with pandas already and would like to use pyspark.pandas.DataFrame since I assume it will have a familiar API and be quick for me to learn and use.

The questions I have:

  • What does the error mean?
  • What can I do to read the tables to pyspark.pandas.DataFrame?
  • Alternatively, should I just learn pyspark.sql.DataFrame and use that? If so, why?
Toivo Mattila
  • are you using shared cluster? – Alex Ott Oct 25 '22 at 07:20
  • I'm new to Databricks and don't know what that means. There's one cluster and more than 1 person in the company use that one, is that a shared cluster? – Toivo Mattila Oct 26 '22 at 08:42
  • there are different policies: https://docs.databricks.com/clusters/create-cluster.html#what-is-cluster-access-mode – Alex Ott Oct 26 '22 at 09:01
  • @AlexOtt I just ran into this, too. If I use a cluster in Single User mode, ps.read_table works. If I use a shared cluster, I get the "UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity Catalog.;" error. But I'm still unclear as to why. Could you elaborate? Thanks! – mmarie Jan 13 '23 at 21:28
  • For now I'm just going to work around and load a spark data frame and and then use .toPandas(). I understand that loads it to memory, but that is ok in this case as it's fairly small. Is there a better way to handle this? – mmarie Jan 13 '23 at 21:43

1 Answer


AttachDistributedSequence is a special plan extension used by the pandas API on Spark to create a distributed default index. Right now it's not supported on Shared clusters enabled for Unity Catalog because of the restricted set of operations allowed on such clusters. The workarounds are:

  • Use single-user Unity Catalog enabled cluster
  • Read the table using the Spark API, then use the pandas_api function (doc) to convert it into a pandas-on-Spark DataFrame (in Spark 3.2.x/3.3.x it's called to_pandas_on_spark (doc)):
pdf = spark.read.table("abc").pandas_api()

P.S. It's not recommended to use .toPandas as it will pull all data to the driver node.

Alex Ott