
I'm practicing how to perform join operations on pydatatable's dataframes.

The first DT is created as follows:

import datatable as dt
import numpy as np
DT_1=dt.Frame({"title": np.array(['stat','math','stat','math','esp']),
               "score": np.array([23,43,21,50,16])})

The second DT is created as follows:

DT_2=dt.Frame({"title": np.array(['stat','esp','math','stat']),
               "price": np.array([350,450,530,430])})

I then try to set a key with DT_2.key = "title". Because the title column contains duplicate values, this raises: ValueError: Cannot set a key: the values are not unique.

Is uniqueness enforced on key columns in Python datatable? In R's data.table, uniqueness is not enforced and duplicate key values are allowed.

Is there any reference documentation for it?

Pasha
myamulla_ciencia

1 Answer


Values in key columns must be unique, see the documentation here: https://datatable.readthedocs.io/en/latest/api/frame.html#datatable.Frame.key.

You can think of a key column as turning the Frame into a row-wise dictionary, where the "key" part of the dictionary is in the key column(s), and the "value" part is in all other columns. The "key" may consist of multiple columns, in which case the key value for each row is the tuple of values from each of the key columns.

Thus, datatable's key is equivalent to pandas' index (via .set_index()), or to a SQL primary key.
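The dictionary analogy can be made concrete with a small plain-Python sketch. This is not how datatable is implemented; the helpers build_keyed_lookup and left_join are hypothetical names introduced only for illustration, and keeping the first price per title is an arbitrary way to make the titles unique. It shows why duplicate key values cannot be stored and how a join then becomes one lookup per row.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]


def build_keyed_lookup(rows: List[Row], key_cols: List[str]) -> Dict[Tuple, Row]:
    """Map each key tuple to the remaining columns; reject duplicate keys."""
    lookup: Dict[Tuple, Row] = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key in lookup:
            # A dictionary holds one value per key, so a repeated key is an error.
            raise ValueError(f"Cannot set a key: duplicate key value {key!r}")
        lookup[key] = {c: v for c, v in row.items() if c not in key_cols}
    return lookup


def left_join(left_rows: List[Row], lookup: Dict[Tuple, Row], key_cols: List[str]) -> List[Row]:
    """Attach the matching keyed row (if any) to every row on the left side."""
    joined = []
    for row in left_rows:
        key = tuple(row[c] for c in key_cols)
        joined.append({**row, **lookup.get(key, {})})
    return joined


# Same values as DT_2 in the question; "stat" appears twice.
dt2_rows = [
    {"title": "stat", "price": 350},
    {"title": "esp", "price": 450},
    {"title": "math", "price": 530},
    {"title": "stat", "price": 430},
]

try:
    build_keyed_lookup(dt2_rows, ["title"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

# Assumption for illustration: keep only the first row seen for each title.
seen = set()
dt2_unique = []
for row in dt2_rows:
    if row["title"] not in seen:
        seen.add(row["title"])
        dt2_unique.append(row)

lookup = build_keyed_lookup(dt2_unique, ["title"])

# Same values as DT_1 in the question; the left side may repeat titles freely.
dt1_rows = [
    {"title": "stat", "score": 23},
    {"title": "math", "score": 43},
    {"title": "stat", "score": 21},
    {"title": "math", "score": 50},
    {"title": "esp", "score": 16},
]
joined = left_join(dt1_rows, lookup, ["title"])
```

Only the keyed side has to be unique. The left side keeps all five rows, and each one picks up exactly one price.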

Pasha
  • But Pandas indexes don't have to be unique, i.e. `df = pd.DataFrame({'a':[0]*5, 'b':range(5)})`, `df=df.set_index('a')` – wst Jul 12 '20 at 18:58