
I'm currently parsing CSV tables and need to discover the "data types" of the columns. I don't know the exact format of the values. Obviously, everything that the CSV parser outputs is a string. The data types I am currently interested in are:

  1. integer
  2. floating point
  3. date
  4. boolean
  5. string

My current thoughts are to test a sample of rows (maybe several hundred?) in order to determine the types of data present through pattern matching.

I am particularly concerned about the date data type - is there a Python module for parsing common date idioms (obviously I will not be able to detect them all)?

What about integers and floats?

martineau
fmark

5 Answers


ast.literal_eval() can get the easy ones.
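For illustration, a minimal sketch (the helper name `literal_type` is my own, not from the answer) of classifying a field with `ast.literal_eval()`:

```python
import ast

def literal_type(s):
    """Classify a CSV field by evaluating it as a Python literal.
    Falls back to str when the field is not a valid literal.
    Note: this also evaluates list/dict literals, which you may not want."""
    try:
        return type(ast.literal_eval(s))
    except (ValueError, SyntaxError):
        return str

print(literal_type("123"))    # <class 'int'>
print(literal_type("2.5"))    # <class 'float'>
print(literal_type("hello"))  # <class 'str'>
```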

Ignacio Vazquez-Abrams
  • Would *never* have found this one on my own! Thanks. – fmark Jun 23 '10 at 02:16
  • -1 ast.literal_eval() as one might imagine is more suitable for CODE than data ... for example, ast.literal_eval(" 123 ") raises IndentationError, while int(" 123 ") isn't spooked by whitespace. Look at the result of `field = "some text"; ast.literal_eval(field)` ... `ValueError: malformed string`. No use on booleans: works with True, but TRUE (usual Excel CSV output) -> malformed string. Useless with dates. – John Machin Jun 23 '10 at 03:23
  • I was happy when I read this ast recommend... then realized it also evaluates dicts and lists (that is, strings looking like dicts or lists) and I don't want that. Back to the brutal method. – Jürgen A. Erhard Feb 18 '17 at 23:03
3

Dateutil comes to mind for parsing dates.
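dateutil is a third-party package (`dateutil.parser.parse` handles many common idioms). As a stdlib-only sketch, you can probe a handful of known formats with `datetime.strptime`; the format list below is an assumption you would extend for your own data:

```python
import datetime

# Common date idioms to probe; this list is an assumption, not exhaustive.
# Note the US/European ambiguity: "%d/%m/%Y" is tried before "%m/%d/%Y".
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d %b %Y"]

def try_parse_date(s):
    """Return a datetime.date if s matches a known format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None

print(try_parse_date("2010-06-23"))  # 2010-06-23
```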

For integers and floats you could always try a cast in a try/except block:

>>> f = "2.5"
>>> i = "9"
>>> ci = int(i)
>>> ci
9
>>> cf = float(f)
>>> cf
2.5
>>> g = "dsa"
>>> cg = float(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'dsa'
>>> try:
...     cg = float(g)
... except ValueError:
...     print("g is not a float")
...
g is not a float
>>>
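The transcript above can be wrapped into a small helper (the name `to_number` is my own); trying `int` before `float` keeps whole numbers as integers:

```python
def to_number(s):
    """Try int first (so "9" stays an int), then float ("2.5");
    return None when neither cast succeeds."""
    for cast in (int, float):
        try:
            return cast(s)
        except ValueError:
            continue
    return None
```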
Vinko Vrsalovic

You may be interested in messytables, a Python library which does exactly this kind of type guessing on general Python data as well as CSV and XLS files.

It happily scales to very large files and to streaming data off the internet.

There is also an even simpler wrapper library that includes a command line tool named dataconverters: http://okfnlabs.org/dataconverters/ (and an online service: https://github.com/okfn/dataproxy)

The core algorithm that does the type guessing is here: https://github.com/okfn/messytables/blob/7e4f12abef257a4d70a8020e0d024df6fbb02976/messytables/types.py#L164

Rufus Pollock

The data types I am currently interested in are...

These do not exist in a CSV file. The data is only strings. Only. Nothing more.

test a sample of rows

Tells you nothing except what you saw in the sample. The next row after your sample can be a string which looks entirely different from the sampled strings.

The only way you can process CSV files is to write CSV-processing applications that assume specific data types and attempt conversion. You cannot "discover" much about a CSV file.

If column 1 is supposed to be a date, you'll have to look at the string and work out the format. It could be anything: a number, or a typical Gregorian date in US or European format (there's no way to know whether 1/1/10 is US or European).

try:
    x = datetime.datetime.strptime(row[0], date_format)  # e.g. "%m/%d/%Y"
except ValueError:
    pass  # column is not a valid date

If column 2 is supposed to be a float, you can only do this.

try:
    y = float(row[1])
except ValueError:
    pass  # column is not a valid float

If column 3 is supposed to be an int, you can only do this.

try:
    z = int(row[2])
except ValueError:
    pass  # column is not a valid int

There is no way to "discover" if the CSV has floating-point digit strings except by doing float on each row. If a row fails, then someone prepared the file improperly.

Since you have to do the conversion to see if the conversion is possible, you might as well simply process the row. It's simpler and gets you the results in one pass.
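The one-pass approach above can be sketched as a list of per-column converters applied while reading; the schema here (date, float, int) and the helper names are assumptions for illustration:

```python
import csv
import datetime
import io

# Hypothetical schema for a file whose columns you already expect to be
# date, float, int (this column order is an assumption).
def parse_date(s):
    return datetime.datetime.strptime(s, "%Y-%m-%d").date()

converters = [parse_date, float, int]

def process(stream):
    """Convert every row in one pass; a ValueError here means
    someone prepared the file improperly."""
    for row in csv.reader(stream):
        yield [convert(field) for convert, field in zip(converters, row)]

sample = io.StringIO("2010-06-23,2.5,9\n2010-06-24,0.5,10\n")
rows = list(process(sample))
```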

Don't waste time analyzing the data. Ask the folks who created it what's supposed to be there.

S.Lott

We tested ast.literal_eval(), but recovering from the exception is pretty slow. If you want to cast data that you receive entirely as strings, I think a regex would be faster.

Something like the following worked very well for us.

import datetime
import re


def guess_type(s):
    """Detect the appropriate type for a given string."""
    if s == "":
        return None
    elif re.match(r"\A[0-9]+\.[0-9]+\Z", s):
        return float
    elif re.match(r"\A[0-9]+\Z", s):
        return int
    # 2019-01-01 or 01/01/2019 or 01/01/19
    elif re.match(r"\A[0-9]{4}-[0-9]{2}-[0-9]{2}\Z", s) or \
         re.match(r"\A[0-9]{2}/[0-9]{2}/([0-9]{2}|[0-9]{4})\Z", s):
        return datetime.date
    elif re.match(r"\A(true|false)\Z", s):
        return bool
    else:
        return str

Tests:

assert guess_type("") is None
assert guess_type("this is a string") == str
assert guess_type("0.1") == float
assert guess_type("true") == bool
assert guess_type("1") == int
assert guess_type("2019-01-01") == datetime.date
assert guess_type("01/01/2019") == datetime.date
assert guess_type("01/01/19") == datetime.date
Capripot
  • At the beginning, one could return NoneType for an empty string and, in the datetime check, one could split by 'T' and take the first part, to include the case of a timestamp... Anyway the point is that this is not only fast but also flexible and easily customizable. – Oct 22 '20 at 19:56
  • guess_type("1.3E-9") → str – leonard vertighel Sep 16 '21 at 13:05
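As the last comment notes, `guess_type("1.3E-9")` falls through to str because the regexes above handle neither signs nor scientific notation. A hedged extension for the numeric cases (these patterns are my own sketch, not from the answer):

```python
import re

def guess_numeric(s):
    """Extended numeric check: optional sign, decimal point,
    and scientific notation; everything else is str."""
    if re.match(r"\A[+-]?[0-9]+\Z", s):
        return int
    if re.match(r"\A[+-]?([0-9]*\.[0-9]+|[0-9]+\.?)([eE][+-]?[0-9]+)?\Z", s):
        return float
    return str
```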