1

I'm pulling in a data file with key value pairs where the data is raw and starts as a string. I created this function to pass the value of the key value pair to check what data type it is.

I created this function to tag that value and convert it to the appropriate datatype as needed.

Is this the best way to handle this, or is there a library or function already included with python that's faster, or more efficient?

import dateparser
def dataType(value):
    try: 
        int(value)
        return 'INTEGER'
    except ValueError:
        try:
            float(value)
            return 'DOUBLE'
        except ValueError:
            try:
                if value and value[0].isdigit():
                    dateparser.parse(value, settings={'STRICT_PARSING': True})
                    return 'DATETIME'
                else: 
                    return 'VARCHAR'
            except ValueError:
                return 'VARCHAR'
Dom DaFonte
  • 1,619
  • 14
  • 31
  • maybe you are looking for `type` or `isinstance` bif. – frederick99 Apr 26 '17 at 12:19
  • Are you trying to build a sql statement? – OneCricketeer Apr 26 '17 at 12:21
  • My ultimate goal is to take this data that's passed in a raw string format and then convert it as needed. I'll also use it to algorithmically create create table statement with the appropriate datatype (which is why you see text referenced as varchar). – Dom DaFonte Apr 26 '17 at 12:27

4 Answers4

1

You cloud do something like:

import re
from collections import OrderedDict
import datetime

register_type = OrderedDict()


register_type["INTEGER"] = {"handle":int, "args": [], "kw": {}}
register_type["DOUBLE"] = {"handle":float, "args": [], "kw": {}}
register_type["DATE"] = {"handle":lambda x: datetime.datetime.strptime(x, "%Y-%m-%d"), "args": [], "kw": {}}
register_type["ALPHA"] = {"handle":lambda x: re.match("\w+", x), "args": [], "kw": {}}


def get_type(value):
    type_ = "UNKNOWN"
    for k, v in register_type.items():
        try:
            parsed = v["handle"](value, *v["args"], **v["kw"])
            type_ = k
            break
        except ValueError as E:
            continue
    return  type_




# print(get_type("2017-01-26"))
# "DATE"
# print(get_type("ali"))
# "alpha"
# print(get_type("10"))
# "INTEGER"
# print(get_type("10.0"))
# "DOUBLE"

In that way you can easly add parser, thus you can decouple your code.

Ali SAID OMAR
  • 6,404
  • 8
  • 39
  • 56
  • 1
    I just benchmark tested your function compared to mine and yours performed the task in 28 seconds compared to 9 minutes 22 seconds mine executes the same file in. Thanks! – Dom DaFonte Apr 27 '17 at 01:20
1

Using arrow library to easily detect datetime type.

I got more than inspired by @thefourtheye's answer using literal_eval to convert any data to the appropriate type.

Then I use a map to convert the type found by the literal string wanted.

from ast import literal_eval
import arrow

map_type_to_string = {int: 'INTEGER',
                      bool: 'BOOLEAN',
                      str: 'VARCHAR',
                      float: 'DOUBLE',
                      arrow.arrow.Arrow: 'DATETIME'}

def get_type(input_data):
  try:
      return map_type_to_string[type(literal_eval(input_data))]
  except (ValueError, SyntaxError):
      try:
        return map_type_to_string[type(arrow.get(input_data))]
      except (arrow.parser.ParserError, ValueError):
        return map_type_to_string[str]

print(get_type("1"))                              # INTEGER
print(get_type("1.2354"))                         # DOUBLE
print(get_type("True"))                           # BOOLEAN
print(get_type("2002-12-25 00:00:00-06:39"))      # DATETIME
print(get_type("abcd"))                           # VARCHAR

Hope it helps a bit.

Community
  • 1
  • 1
Kruupös
  • 5,097
  • 3
  • 27
  • 43
1

If you want something that's built-in to Python (no modules required) then this could be of some use.

get_type = lambda value: str(type(value)).split("'")[1]

What's going on here is that we're taking the result from the built-in type function and converting it into a string in order to split it so that we can return the data type that is reported back from that function.

Here's what I've gathered from some tests in Python 3:

>>> get_type(10)
'int'
>>> get_type('10')
'str'
>>> get_type(b'10')
'bytes'
>>> get_type(10.0)
'float'
>>> get_type(True)
'bool'

I wish I could go into further detail on this but this just happens to work for my needs so hopefully this may be of use to someone else.

0

If your code waits for inputs to be typed one by one then my answer should be fine. If you have to analyse a data set then it is quite a different job.

Please note this is a naive answer which does not use any extra libraries. You can also specify additional date, time and datetime formats to match your requirements.

from re import match

# Does not consider that 0 and 1 can be used to describe a BOOLEAN SQL value
# Use True, true and FALSE, false to describe SQL BOOLEAN values
PY_2_SQL_TYPE_PATTERNS = (
    ((r'^TRUE|true|FALSE|false$', ), 'BOOLEAN'),
    ((r'^\d+$', ), 'INTEGER'),
    ((r'^\d+\.\d+$', ), 'DOUBLE'),
    ((r'^\d{4}/\d{2}/\d{2}$', ), 'DATE'),
    ((r'^\d{2}:\d{2}:\d{2}.\d{1,5}$', ), 'TIME'),
    ((r'^\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}.\d{1,5}$', ), 'DATETIME'),
    ((r'.*', ), 'VARCHAR'),  # Falls back to VARCHAR as .* will match anything
)


def get_sql_type(value):
    '''
    Find the corresponding SQL TYPE according to value
    :param value: A value typed by the user
    :return: The corresponding SQL TYPE from PY_2_SQL_TYPE_PATTERNS as a string
    '''
    for patterns, sql_type in PY_2_SQL_TYPE_PATTERNS:
        for p in patterns:
            if match(p, value):
                return sql_type


v = input('Type a value: ').strip()
print(get_sql_type(v))

You will probably want to test this to make sure there are no special cases to take into account.

Q Caron
  • 952
  • 13
  • 26