I would like to create a DataFrame that has an "index" (integer) from a number of (sparse) Series, where the index (or primary key) is NOT necessarily consecutive integers. Each Series is like a vector of (index, value) tuples, or an {index: value} mapping.
(1) A small example
In Pandas this is very easy, as we can create the whole DataFrame in a single call, like so:
>>> pd.DataFrame({
...     "A": {0: 'a', 20: 'b', 40: 'c'},
...     "B": {10: 'd', 20: 'e', 30: 'f'},
...     "C": {20: 'g', 30: 'h'},
... }).sort_index()
      A    B    C
0     a  NaN  NaN
10  NaN    d  NaN
20    b    e    g
30  NaN    f    h
40    c  NaN  NaN
but I can't find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars, unlike Pandas, does not use an index: each row is indexed by its integer position in the table. So I might need to represent an "indexed" Series as a 2-column DataFrame:
A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })
I tried to combine these multiple DataFrames by joining on the index column:
>>> A.join(B, on='index', how='outer').join(C, on='index', how='outer').sort(by='index')
shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A    ┆ B    ┆ C    │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ str  ┆ str  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ 0     ┆ a    ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10    ┆ null ┆ d    ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20    ┆ b    ┆ e    ┆ g    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30    ┆ null ┆ f    ┆ h    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 40    ┆ c    ┆ null ┆ null │
└───────┴──────┴──────┴──────┘
This gives the result I want, but I wonder:
- (i) whether there is a more concise way to do this over many columns (one candidate is sketched below), and
- (ii) how to make this operation as efficient as possible.
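For (i), one option I came across, though I am not sure which Polars versions ship it: pl.concat has an 'align' strategy that auto-detects the common column(s) and combines the frames much like the chained outer joins above. A minimal sketch, assuming a recent-enough Polars:

>>> pl.concat([A, B, C], how='align')  # auto-detects "index" as the shared key column

If I read the documentation correctly, this should yield the same 5-row table as the chained joins, already sorted by the key.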
Alternatives?
Outer joins, as shown above, are one way to combine DataFrames with different numbers of columns and rows.
Another alternative I tried is diagonal concatenation, but this does not deduplicate or join on the index (though see the group-by sketch after the output):
>>> pl.concat([A, B, C], how='diagonal')
shape: (8, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A    ┆ B    ┆ C    │
│ ---   ┆ ---  ┆ ---  ┆ ---  │
│ i64   ┆ str  ┆ str  ┆ str  │
╞═══════╪══════╪══════╪══════╡
│ 0     ┆ a    ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20    ┆ b    ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 40    ┆ c    ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10    ┆ null ┆ d    ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20    ┆ null ┆ e    ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30    ┆ null ┆ f    ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20    ┆ null ┆ null ┆ g    │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30    ┆ null ┆ null ┆ h    │
└───────┴──────┴──────┴──────┘
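It also occurs to me that the diagonal result could be collapsed afterwards with a group-by that keeps the first non-null value per column. A sketch, which I have not benchmarked against the chained joins (in older Polars releases group_by is spelled groupby):

>>> (
...     pl.concat([A, B, C], how='diagonal')
...     .group_by('index')                    # collapse rows that share an index
...     .agg(pl.all().drop_nulls().first())   # keep the first non-null per column
...     .sort('index')
... )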
(2) Efficiently Building a Large Table
The approach I found above gives the desired result, but I feel there must be a better way in terms of performance. Consider a case with larger tables, say 300,000 rows and 20 columns:
import functools

import numpy as np
import pandas as pd
import polars as pl

N, C = 300000, 20
pls = []  # Polars DataFrames, each with its own "index" column
pds = []  # the same data as index-labelled Pandas DataFrames
for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    })
    pls.append(A)
    B = A.to_pandas().set_index("index")
    pds.append(B)
Joining the tables two at a time like this is slower than I expected:
%%time
F = functools.reduce(lambda a, b: a.join(b, on='index', how='outer'), pls)
F.sort(by='index')
CPU times: user 1.49 s, sys: 97.8 ms, total: 1.59 s
Wall time: 611 ms
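One variation I have considered but not measured here: running the same reduce over LazyFrames, on the (unverified) assumption that Polars' query optimizer can schedule the joins better when it sees the whole plan at once:

lazy = functools.reduce(
    lambda a, b: a.join(b, on='index', how='outer'),
    [df.lazy() for df in pls],
)
F = lazy.sort(by='index').collect()  # optimizer sees the full join tree before executing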
It is also slower than one-pass creation with pd.DataFrame:
%%time
pd.DataFrame({
    f"A{i}": pds[i][f'A{i}'] for i in range(C)
}).sort_index()
CPU times: user 230 ms, sys: 50.7 ms, total: 281 ms
Wall time: 281 ms
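For completeness, the other one-shot candidate I would want to benchmark against these, again assuming a Polars version whose pl.concat supports the 'align' strategy:

F2 = pl.concat(pls, how='align')  # single aligned concat over all C frames, joined on "index"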