
I know this is possible. I know there is a simple solution, but everything I've tried has failed.

Here's the deal:

I have a dataset in Excel format containing 939,019 weather station records (rows). The date/time interval is every 10 minutes starting from 1/29/1993 16:30 to 6/30/2013 24:00. If I do the math, it is clear that there are missing rows.

I need to know the missing dates/times. It would be cool if I could have some little program/script that returned the start date/time and end date/time of the missing intervals. But I'll just be happy with a list of the missing dates/times.

To figure it out, I thought, oh, all I need is a reference list to compare the list with missing dates and have some way of flagging or returning the gaps.

So, in Excel, I created a column adjacent to the weather station data and populated the first row with the start date. Each subsequent row just adds 10 minutes to the cell above it. Unfortunately, the number of 10-minute intervals in that 20-year span is more rows than Excel can handle. No worries. It gets close enough (1/6/2013 10:50).
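The arithmetic behind that claim can be checked in a few lines of Python (dates taken from the post; "6/30/2013 24:00" is read as midnight of 7/1/2013):

```python
from datetime import datetime

start = datetime(1993, 1, 29, 16, 30)
end = datetime(2013, 7, 1, 0, 0)  # "6/30/2013 24:00"

# Count 10-minute readings from start to end, inclusive of both endpoints.
intervals = int((end - start).total_seconds() // 600) + 1

print(intervals)            # 1073854 expected rows
print(intervals > 1048576)  # True: exceeds Excel's 1,048,576-row limit
```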

Anyway, I tried the MATCH function in Excel, but that is taking way too long. In the time it has taken me to type this, it has reached 3% (using 12 processors). I have 30 weather stations (with the same date range) to do, so I'm hoping I can find a faster way.

So, I next tried Access. I imported the files (the weather station data and a separate reference date list) as tables in Access and thought I'd just do an unmatched query, but no matter how I format the date column (date/time, serial number), the query returns just about all the rows as unmatched. It does run quickly, but it is obviously wrong, and I'm not sure why.

I then thought - Python! That'd do it, right? But I'm a GIS person. I've only ever used Python sample scripts to run geoprocessing tools (or used Esri's ModelBuilder). I don't really have a clue where to start. Any pointers?

PKumar
mandeno
  • 30 stations * 20 years of 10 minute readings is around 32e6 transactions. I would expect Excel to be rather slow at this. I would also think a database would do the job nicely, especially if you index the time stamps. Your problem with unmatched time stamps might be resolved by scaling up the time stamps to the minute level. This should get around differences in how floating point dates are stored. – andy holaday Dec 20 '13 at 04:43
  • If you save the page as `csv` and show the format of the first ten rows or so, it'll be easier to give a worked example. – DSM Dec 20 '13 at 04:57
  • Each station is in its own .xlsx file and is about 70 to 80MB. I'm used to dealing with much larger files in ArcGIS, so I hadn't realized how much it would bog down Excel. Here's a sample of the data: oops! hit return when I shouldn't have...here's a sample, just copied/pasted a few of the first rows: DATETIME DATE TIME DAYTIME WIND_MSEC WINDGUST WINDDEG AIRTEMP1 HUMID SOLAR AIRTEMP2 – mandeno Dec 20 '13 at 14:40
  • Ug - these comments are finicky! Here's a link: ftp://digitalmappingsolutions.com/public_ftp/sample.csv (might be a bit long) – mandeno Dec 20 '13 at 14:47
  • @mandeno - your link requires a login. Can you post a gist on github? – MattDMo Dec 20 '13 at 14:55
  • @MattDMo - not sure what GitHub is, sorry, but does this work? https://www.dropbox.com/s/cm0hcd1qb8ges19/sample.csv – mandeno Dec 20 '13 at 18:55
  • @MattDMo - I did download the xlutils executables and will try that route, but now I am, of course, on to another project. I'm shelving this until after the holidays since there is no hurry - plus I want to do it right. After perusing the data a bit more, it appears that the weather station (at least the first one) switched to hourly intervals at some point during that 20 year interval - of course. – mandeno Dec 20 '13 at 18:58
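The timestamp-scaling idea from andy holaday's comment can be sketched like this (a minimal illustration, assuming Excel-style day-fraction serial dates; the serial number 34363 is just an arbitrary example day):

```python
MINUTES_PER_DAY = 24 * 60

def serial_to_minutes(serial):
    """Round an Excel-style day-fraction serial to whole minutes,
    so two timestamps that differ only by float noise compare equal."""
    return round(serial * MINUTES_PER_DAY)

# The same instant built two ways: directly, and by repeatedly
# adding 10-minute increments (which accumulates floating-point error).
direct = 34363 + 1000 / MINUTES_PER_DAY
summed = 34363.0
for _ in range(100):
    summed += 10 / MINUTES_PER_DAY

print(direct == summed)  # may well be False
print(serial_to_minutes(direct) == serial_to_minutes(summed))  # True
```

Joining on the rounded integer minutes instead of the raw serials is one way to make an unmatched query behave.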

1 Answer


First, check out python-excel.org for the xlrd, xlwt, and xlutils modules and documentation (I'm assuming you're working with .xls files and not .xlsx; if they're .xlsx, check out openpyxl instead). Once you've got them installed, read through the docs to familiarize yourself with them; they're not too long or overly complicated. The actual comparison shouldn't be too hard: all you need to do is read cell N, compare its value to cell N+1, and see if the difference is 10 minutes. If it is, great, go on to the next value. If not, write the value to a new workbook (or whatever you want to do: insert a blank row with the missing time and continue, or what have you).
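A minimal sketch of that loop in plain Python, assuming you've already pulled the timestamps out of the worksheet into `datetime` objects (e.g. via xlrd's `xldate_as_tuple`; that reading step is omitted here, and `find_gaps` is just an illustrative name):

```python
from datetime import datetime, timedelta

def find_gaps(timestamps, step=timedelta(minutes=10)):
    """Return (start, end) pairs bracketing each run of missing readings.

    `timestamps` is the sorted list of datetimes actually present; a gap
    is reported whenever consecutive readings differ by more than `step`.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev > step:
            # first missing reading .. last missing reading
            gaps.append((prev + step, curr - step))
    return gaps

readings = [
    datetime(1993, 1, 29, 16, 30),
    datetime(1993, 1, 29, 16, 40),
    # 16:50 and 17:00 are missing
    datetime(1993, 1, 29, 17, 10),
]
print(find_gaps(readings))
```

The `step` parameter also makes it easy to rerun the check with `timedelta(hours=1)` for the stretch where your station switched to hourly readings.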

I don't know how long this will take to run through ~30 million records, but I'm willing to bet it'll be faster than doing it via Excel itself :)

Good luck!

MattDMo