0

I have this file for work (and 7000 others of the same format) that is very messy and not tidy in any way. I've been reading about tidying data using Pandas but feel I'm spinning my wheels at this point...

Here is the raw data viewed in Excel:

enter image description here

Here is some example text from the CSV:

Section 6. Reserve Summary
Ten Minute Reserve Requirement:, 1801
Ten Minute Reserve Estimate:, 1801
Thirty Minute Reserve Requirement:, 626
Thirty Minute Reserve Estimate:, 1926
Expected Actions of OP 4:, 0
Additional Capacity Available from OP 4 Actions:, 0
Section 7. Interchange Summary
Description, Import Limit MW, Export Limit MW, Scheduled, Contract
Highgate, -225, 0, -225
NB, -550, 200, -432
NYISO AC, -1400, 1200, 0
NYISO CSC, -346, 330, 330 NYISO NNC, -200, 200, 194 Phase 2 -2000 1200 -1501
Section 8. Weather Forecast Summary for the Peak Hour
City, Conditions, Wind, High Temperature (F)
Boston, Partly Cloudy, NE-10, 66
Hartford, Mostly Clear, N-12, 77

You can see column A is useless so I can remove. Column B mostly has variable names but also has Section names (rows 7, 9, 11...). Sometime column B has the value, but most of the time the value is listed in Column C-- also sometimes listed in Column D. Lines 44- 54 have some extra formatting going on where there are is a table of variable names and values...

Anyway, I absolutely do not have the skills to turn this into a tidy dataframe and will need to throw this to someone else. However, I'm hoping anyone can give advice on what to do. Is this even called 'data cleaning' or 'data structuring'?

I dropped Col A, then transposed the data, but that is far from setting this dataframe up correctly. What are other techniques to move data into the tidy structure needed?

Any resources shared would be great! I searched for too long on 'tidy data', 'data cleaning', 'data structuring' but all were too simplistic compared to this application.

Cœur
  • 37,241
  • 25
  • 195
  • 267
  • 2
    Separate by section. Use nested dicts. Then you would access with tree-like hierarchy: `data['Section 9']['NYISO']` which would be `Yes` in this case. The multiple data fields, such as `data['Section 7']['Highgate']` would also return a dict with the key from the fields above and the value from the column. You will likely need to write this in native Python; not sure Pandas will do that straight. It is not terribly hard -- just tedious. – dawg Jun 26 '18 at 00:35
  • 2
    Add some example text CSV. No one is going to type all that to help you... – dawg Jun 26 '18 at 00:35
  • 1
    Column A doesn't seem useless. C = comment, H = header, D = data, T = total. H lines declare the column names for the following D lines. – AKX Aug 05 '18 at 11:14
  • 1
    Either way, like dawg said, you definitely do need to first read this using regular native Python before thinking about using Pandas. – AKX Aug 05 '18 at 11:16

0 Answers0