
I hope this is an appropriate question for here. If not, let me know, and I will remove it immediately.

Question:

How can I use Python to inspect (visually?) a large dataset for errors that arise when combining datasets?

Background:

I am working with several large (but not, you know, "Big") datasets that I combine to form one larger dataset. This new set is ~2.5 GB in size, so it does not fit in most spreadsheet programs, or at least not in the ones I've tried (MS Excel, OpenOffice).

The process to create the final dataset uses fuzzy matching (via fuzzywuzzy), and I want to inspect the results of the matching to see if there are any errors introduced.
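
To give an idea of the kind of matching involved, here is a minimal, made-up sketch using fuzzywuzzy (the strings are placeholders, not my actual data):

from fuzzywuzzy import process

# Made-up candidate values from one dataset
choices = ['Acme Corp', 'Acme Corporation', 'Apex Ltd']

# Match a couple of (also made-up) query strings against them;
# extractOne returns the best match and its similarity score
for query in ['ACME corp.', 'Apex Limited']:
    match, score = process.extractOne(query, choices)
    print(query, '->', match, '(score:', score, ')')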

So far, I have tried importing the entire set into a pandas dataframe. The dataframe has 64 columns, so a simple df.head() does not display all of them; I have therefore ruled out just iterating through multiple .head() calls.

There is a similar question about visualizing specific aspects of a dataframe here. My question is different, I think, because I don't need to visualize anything about the underlying structure or types. I just want to visually inspect areas I suspect might have errors.

  • How about setting the display properties so that you can display all rows and columns? Would that be acceptable? – Julien Marrec Jul 21 '15 at 15:03
  • Thanks for the thought! I have tried that, but on a standard screen inside an IDE (I use PyCharm) there are wrapping issues, and I'd like to inspect 10-12 of those columns at a time. Another approach that I can't seem to get to work right now is to make each row a list, then print the list out in a descending fashion on screen, so I can at least read down the "row" to see if things are looking good. I think this might work because each item of the list would be printed on its own line on-screen, so I'd have a lot of screen space. – Savage Henry Jul 21 '15 at 16:07
  • I think you probably just need to spend some time with the indexing/selection docs: http://pandas.pydata.org/pandas-docs/version/0.16.2/indexing.html Then you can, for example, look at the first five columns of data where a certain column begins with the letter "W" (see the sketch after these comments). If there are specific things you have trouble doing, then post a new followup question about how to do it. – JohnE Jul 21 '15 at 17:09
  • Henry, if you're looking at only 10-12 rows at a time... I suggest transposing so that your 64 columns end up as the index and vice versa. That should work nicely. – Julien Marrec Jul 21 '15 at 17:32
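
A minimal sketch of the selection JohnE describes (the column names and toy data are invented for illustration):

import pandas as pd

# Toy frame with an invented string column plus a few numeric ones
df = pd.DataFrame({'name': ['Walter', 'Alice', 'Wendy', 'Bob'],
                   'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8],
                   'c': [9, 10, 11, 12], 'd': [13, 14, 15, 16]})

# First five columns of the rows where 'name' starts with "W"
subset = df.loc[df['name'].str.startswith('W'), df.columns[:5]]
print(subset)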

1 Answer


How about slicing out your 10-12 rows and then transposing, so that you have a 64 rows x 12 columns dataframe? This should be readable provided you don't have very long index names.

import pandas as pd
import numpy as np

# Raise the row-display limit; 64 rows would be enough here,
# but 500 leaves a comfortable margin
pd.set_option('display.max_rows', 500)

# Example data: 1000 rows x 64 columns
df = pd.DataFrame(np.random.randn(1000, 64))

nstart = 100
# Slice 12 rows starting at nstart, then transpose, giving a
# 64-row x 12-column view
print(df.iloc[nstart:nstart + 12].T)

I'm sparing you the output here, but try running the above code.
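
If you want to page through the whole dataframe this way, a simple loop over the start index should do (just a sketch building on the code above; adjust the step to taste):

# Page through the dataframe 12 rows at a time,
# waiting for a keypress between pages
for nstart in range(0, len(df), 12):
    print(df.iloc[nstart:nstart + 12].T)
    input('Press Enter for the next 12 rows...')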

  • Much appreciated. I think this is the way to go. Wasn't aware of the `.iloc` feature. – Savage Henry Jul 21 '15 at 17:53
  • `.ix` accepts mixed integer/label based indexing should you need it. If my answer solved your question, please mark the answer as accepted in order to close the question. – Julien Marrec Jul 21 '15 at 19:24