
I have been trying, unsuccessfully, to use numpy.genfromtxt() to import the data from a text file into an array.

The problem is that these data files should have five columns, but from time to time a data entry is missing on a line and I end up with only four (or fewer) columns.

I've read through the numpy documentation for genfromtxt() and eventually found the comment "When spaces are used as delimiters, or when no delimiter has been given as input, there should not be any missing data between two fields". Unfortunately this is exactly the situation I am in.

Can somebody suggest or show me another function/module that I can use to handle this kind of data?

Thanks

Update with an example of what I tried:

data = np.genfromtxt(matches[0], skip_header = 6, usecols = (0,1,2,3,4), dtype=['S15','f8','f8','f8','i8'])

The error I get is:

ValueError: Some errors were detected !
    Line #7 (got 4 columns instead of 5)
    Line #17 (got 4 columns instead of 5)
    Line #27 (got 4 columns instead of 5)
    Line #78 (got 4 columns instead of 5)

As expected, when I eyeball the data file the fourth data point on those lines is missing (so only 4 columns are seen). Looking at the many data files I have to import this way, data in the fourth column is missing at random.

For completeness, here is an excerpt of the data file:

Start voltage = 0.000000V
Final voltage = 30.000000V
Voltage step = 5.000000V
Acquisition time = 10s
Post Irradiation 1

20180214_162747  -6.07967e-07 7.24649e-10  00000000000
20180214_162748  -3.69549e-07 6.10220e-10 +0.52310E-10 00000009504
20180214_162749  -6.19888e-07 5.97525e-10 +0.61081E-10 00000009239
20180214_162750  -1.27554e-06 6.65617e-10 +0.63719E-10 00000009053
20180214_162751  4.42266e-06 6.88171e-10 +0.70692E-10 00000009188
20180214_162752  1.99080e-06 6.10995e-10 +0.67934E-10 00000009321
20180214_162753  5.60284e-07 7.29239e-10 +0.71260E-10 00000009007
20180214_162754  1.04904e-06 6.29222e-10 +0.72195E-10 00000009386
20180214_162755  -1.84774e-06 6.12736e-10 +0.67136E-10 00000009403
20180214_162756  -4.76837e-08 6.86717e-10 +0.62982E-10 00000009379
20180214_162757  2.80142e-06 6.87110e-10  00000009417
20180214_162758  5.00005e+00 1.70809e-08 +1.61506E-09 00000006002
20180214_162759  5.00004e+00 1.07430e-08 +1.67208E-09 00000011408
20180214_162800  5.00003e+00 9.07902e-09 +1.75613E-09 00000011277
20180214_162801  5.00002e+00 8.52853e-09 +1.80156E-09 00000011702
20180214_162802  5.00002e+00 8.42900e-09 +1.86753E-09 00000011736
Richard
  • Can you show us an example of what you tried? You could also try pandas: `df = pd.read_csv('your_file.txt')` – DimKoim Feb 28 '18 at 10:14
  • Your question might benefit from posting an example of your CSV file structure. Headers, lines to skip, unused columns etc. influence what the best strategy is for reading a CSV file. Python, numpy and pandas have different keywords for reading a CSV file, leading to slight differences in functionality. – Mr. T Feb 28 '18 at 10:49
  • Thanks, I shall have another look at pandas and report back if this module is useful. – Richard Feb 28 '18 at 11:01

2 Answers


I had the same issue and resolved it by using the csv library:

# Call the csv library
import csv

# Open your text file (raw string so the backslashes are not treated as escapes)
text_file = open(r"C:\DataSet\your-file.txt", "r")

# Read each line of the text file and save it in lines.
lines = text_file.readlines()

# Print lines and you are good to go.
print(lines)
text_file.close()

# In case you want to export it as a csv file (in Python 3 use 'w' with newline='').
mycsv = csv.writer(open(r'C:\DataSet\OutPut.csv', 'w', newline=''))

# Write a header row for the csv file.
mycsv.writerow(['h1', 'h2', 'hn'])  # ..., fill in as many header names as you have columns
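
Note that readlines() only gives you the raw lines as strings; to actually cope with the randomly missing fourth column you still need to split each line and pad it yourself before building an array. A minimal sketch along those lines, assuming the layout from the question (six header lines, whitespace-separated fields, and the missing value always being the fourth of five) and a made-up file path:

import numpy as np

rows = []
with open(r"C:\DataSet\your-file.txt") as text_file:    # hypothetical path
    for line in text_file.readlines()[6:]:              # skip the six header lines
        fields = line.split()                           # split on any whitespace
        if not fields:                                  # ignore blank lines
            continue
        if len(fields) == 4:                            # fourth column is missing
            fields.insert(3, "nan")                     # pad with a NaN placeholder
        rows.append(fields)

data = np.array(rows)                  # string array with 5 columns
values = data[:, 1:4].astype(float)    # the three float columns (nan where padded)
counts = data[:, 4].astype("int64")    # the trailing counter column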
Tarek

As indicated in the comments, you could utilise pandas' read_csv, which has a different set of keywords:

import pandas as pd

arr = pd.read_csv("test.txt", delim_whitespace=True, header=None).fillna(0).values
print(arr)

From the code you added I assume that you also want to skip lines, so you might want to use

arr = pd.read_csv("test.txt", delim_whitespace=True, skiprows=2).fillna(0).values

#Sample input:
#unused row
#another unused row
#i     j          k       l
#0   38.52200   5.600  129.203995  
#1   23.85499  
#2    4.41700  40.182  121.309998  
#3   65.76199  27.550  

#Sample output:
#[[  0.        38.522      5.6      129.203995]
# [  1.        23.85499    0.         0.      ]
# [  2.         4.417     40.182    121.309998]
# [  3.        65.76199   27.55       0.      ]]
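
Applied to the file shown in the question (six header lines, a timestamp string in the first column and a counter in the last), a sketch might look like the one below; the column names are invented for illustration. One caveat: with whitespace splitting pandas can only pad a short row at its end, so on the lines where the fourth value is missing the counter shifts left into that column and has to be moved back:

import pandas as pd

cols = ["timestamp", "current", "charge", "error", "counts"]   # made-up names
df = pd.read_csv("your-file.txt", delim_whitespace=True, header=None,
                 skiprows=6, names=cols)

# Short rows end up with NaN in the last column; move the counter back and
# fill the genuinely missing fourth value with 0.
short = df["counts"].isna()
df.loc[short, "counts"] = df.loc[short, "error"]
df.loc[short, "error"] = 0
df["counts"] = df["counts"].astype("int64")

arr = df.values   # if you still want a plain (object) array afterwards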
Mr. T