14

I have a non-technical client who has some hierarchical product data that I'll be loading into a tree structure with Python. The tree has a variable number of levels, and a variable number nodes and leaf nodes at each level.

The client already knows the hierarchy of products and would like to put everything into an Excel spreadsheet for me to parse.

What format can we use that allows the client to easily input and maintain data, and that I can easily parse into a tree with Python's CSV? Going with a column for each level isn't without its hiccups (especially if we introduce multiple node types)

Erich
  • 2,509
  • 2
  • 21
  • 24
  • Not sure I've understood "Going with a column for each level isn't without its hiccups". Hierarchical information would be required in any case. – Roman Susi Jul 27 '13 at 17:26

2 Answers2

15

For future readers, I ended up using a column-based hierarchy where each row is the complete traversal to a leaf. So you end up with as many rows as there are leafs.

Electronics | Computers    | Laptops
Electronics | Computers    | Desktop
Electronics | Game Systems | Xbox
Electronics | Game Systems | PS3
Electronics | Game Systems | Wii
Electronics | MP3 Players  | iPod Shuffle
Clothing    | Menswear     | Pants         | Shorts
Clothing    | Menswear     | Pants         | Pajamas

In the script, Python traverses row-by-row, cell-by-cell, keeping track of both the current row and the previous row. Since you traverse from left-to-right you go from root to leaf. If the current column in current row is ever different than the current column in the previous row, then we must have gone down a new branch, and we'll add a new node to our tree.

Erich
  • 2,509
  • 2
  • 21
  • 24
  • I also recommend this solution. IMO, using an empty row indent is not good, especially to use data filters in excel, all the cells should have a value. So Erich solution is clean on both excel and python sides. – DevLounge Jul 27 '13 at 21:26
  • Then, on the python side, you can just use a nested structure of defaultdict(dict) – DevLounge Jul 27 '13 at 21:27
0

If spreadsheet is a must in this solution, hierarchy can be represented by indents on the Excel side (empty cells at the beginnings of rows), one row per node/leaf. On the Python side, one can parse them to tree structure (of course, one needs to filter out empty rows and some other exceptions). Node type can be specified on it's own column. For example, it could even be the first non-empty cell.

I guess, hierarchy level is limited (say, max 8 levels), otherwise Excel is not good idea at all.

Also, there is a library called openpyxl, which can help reading Excel files directly, without user needing to convert them to CSV (it adds usability to the overall approach).

Another approach is to put a level number in the first cell. The number should never be incremented by 2 or more.

Yet another approach is to use some IDs for each node and each node leaf would need to specify parent's id. But this is not very user-friendly.

Roman Susi
  • 4,135
  • 2
  • 32
  • 47