
I am new to Python and can use any help I can get. I am on a Win7 machine and am using Python 3.5 (Anaconda).

I am trying to iterate through multiple CSV files (10k +) within a folder, checking for any value within that file that exceeds a predefined threshold.

I would like to build a dictionary, or list/tuple (basically whatever most resembles an sql table) using a sub-string of the filename as a unique identifier for a name field and have another column with total count of files that had values that exceeded the given threshold.

I don't expect any of you to do this for me as it is great practice, but I would appreciate any recommendations for modules that may make this a bit easier.

I've been able to check a file for a value, but this is only about 10 minutes into this quest and I am not sure how I'd iterate through multiple files, build the table, etc. Thanks!

import numpy as np

path = 'C:\\path\\'
file = 'file.csv'
# genfromtxt takes the path directly; no need to open() the file first
my_data = np.genfromtxt(path + file, delimiter=",")
# check the whole array once instead of re-checking it inside a loop
if -1 in my_data:
    print("it sure is")
StelioK
    Start with a small set of the files; get the load and checking working well for one file. Then just iterate over the files and collect the data. A dictionary, maybe a `defaultdict` is as good a place to start collecting. But with only a few K files, the data structure isn't that big of an issue. At this point your description is too vague to suggest anything more concrete. – hpaulj Nov 04 '16 at 23:51
  • Thank @hpaulj I appreciate your input. – StelioK Nov 07 '16 at 17:09
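A minimal sketch of hpaulj's suggestion: get the check working for one file, then iterate and collect results in a plain dict keyed by filename. The folder path and threshold here are hypothetical placeholders.

```python
import glob
import os

import numpy as np

def count_exceeding(path, threshold):
    """Count how many values in one CSV file exceed the threshold."""
    data = np.genfromtxt(path, delimiter=',')
    return int((data > threshold).sum())

# collect per-file results into a dict keyed by filename
counts = {os.path.basename(fn): count_exceeding(fn, 100)
          for fn in glob.glob('C:\\path\\*.csv')}
```

From `counts` it is a short step to any table-like structure you prefer.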

3 Answers


Since you asked for modules and possible usage, you might consider something like this (pseudocode):

    import os
    import sqlite3

    for root, dirs, files in os.walk(path):      # using the os module
        for file in files:
            if file.endswith('.csv'):            # i.e. *.csv
                with open(os.path.join(root, file)) as f:
                    for line in f:
                        if <line data is what you want to save>:
                            # insert data into a sqlite table (sqlite3 module)
https://docs.python.org/3/library/os.html
https://docs.python.org/3.5/library/sqlite3.html

I tend to use actual SQL databases when possible.
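A runnable sketch of the outline above, using an in-memory SQLite database. The folder, threshold, and table name are hypothetical, and it assumes the CSV cells are purely numeric:

```python
import csv
import os
import sqlite3

def load_hits(folder, threshold, conn):
    """Walk folder; insert one (filename, count) row per CSV file,
    where count is the number of cells exceeding the threshold."""
    conn.execute('CREATE TABLE IF NOT EXISTS hits (name TEXT, count INTEGER)')
    for root, dirs, files in os.walk(folder):
        for name in files:
            if not name.endswith('.csv'):
                continue
            with open(os.path.join(root, name), newline='') as f:
                n = sum(1 for row in csv.reader(f)
                        for cell in row if float(cell) > threshold)
            conn.execute('INSERT INTO hits VALUES (?, ?)', (name, n))
    conn.commit()

conn = sqlite3.connect(':memory:')
load_hits('C:\\path', 100, conn)
```

Swap `:memory:` for a filename to get a persistent database you can query later with plain SQL.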

vsmayberry

Here is a working Pandas solution:

import glob
import os
import pandas as pd

all_files = glob.glob(r'd:/temp/csv/*.csv')

threshold = 100

data = []

for f in all_files:
    data.append([os.path.basename(f),
                (pd.read_csv(f, header=None) > threshold).sum().sum()])

df = pd.DataFrame(data, columns=['file','count'])

print(df)

# optionally save DataFrame to SQL table (`conn` - is a SQLAlchemy connection)
#df.to_sql('table_name', conn)

Output:

    file  count
0  1.csv      2
1  2.csv      3

Test Data:

1.csv:

1,2,3,400
10,111,45,67

2.csv:

1,200,300,4
10,222,45,67

UPDATE:

you can parse the first number from the filename this way:

In [87]: import re

In [88]: f
Out[88]: '/path/to/touchscreen_data_123456_1456789456_178.16.66.3'

In [89]: re.sub(r'.*_\D+_(\d+)_\d+.*', r'\1', f)
Out[89]: '123456'
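Applying the regex idea above to the `file` column, the per-ID totals can then be aggregated with `groupby`. A hedged sketch, using made-up filenames and counts in place of the real DataFrame:

```python
import pandas as pd

# hypothetical stand-in for the (file, count) DataFrame built above
df = pd.DataFrame({
    'file': ['touchscreen_data_123456_1456789456_178.16.66.3',
             'touchscreen_data_123456_1456789999_178.16.66.4',
             'touchscreen_data_777_1456790000_178.16.66.5'],
    'count': [2, 3, 1],
})

# pull the ID (first run of digits after the non-digit prefix) into its own column
df['id'] = df['file'].str.extract(r'_\D+_(\d+)_', expand=False)

# sum the count field per unique ID
totals = df.groupby('id', as_index=False)['count'].sum()
```

This does in pandas what the SQL `GROUP BY` in the comments would do.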
MaxU - stand with Ukraine
  • Thanks @MaxU, this is definitely something I can work with. Beautiful! – StelioK Nov 07 '16 at 18:01
  • Thank you; the only issue I am having is this: the filenames are of the form touchscreen_data_123456_1456789456_178.16.66.3, and the third set of characters, i.e. "123456" in this case, changes. There may be 10k files for one set and maybe 10 for another. What would be the best way of grabbing each of those unique IDs and summing the count fields for each? I guess I could import this into an SQL table and do it in SQL fairly easily, but it would be nice to learn how to do so in Python. – StelioK Nov 07 '16 at 18:21
  • I was able to insert the df into SQL Server, so I can do the munging in there. Thanks again; if you have tips regarding my previous comment I'd appreciate them. – StelioK Nov 07 '16 at 19:36

If all the files are in a single folder, you would use something like glob to step through them all, then the csv module to test for the presence of your target value:

import csv
import glob

found = []
for fn in glob.glob('c:\\path\\*.csv'):
    with open(fn) as f:
        for row in csv.reader(f):
            if tgt_value_as_string in row:
                found.append(fn)
                break

Something like that...
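Since the question asks for a threshold test rather than an exact string match, the inner check could parse each cell as a float. A sketch with a hypothetical threshold, assuming numeric cells:

```python
import csv
import glob

def exceeds(path, threshold):
    """True if any numeric cell in the CSV at path exceeds the threshold."""
    with open(path, newline='') as f:
        return any(float(cell) > threshold
                   for row in csv.reader(f) for cell in row)

found = [fn for fn in glob.glob('c:\\path\\*.csv') if exceeds(fn, 100)]
```

`any()` short-circuits on the first hit, so it stops reading a file as soon as one value qualifies.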

dawg