
Usually I'm able to find the answers to my dilemmas pretty quickly on this site, but perhaps this problem requires a more specific touch.

I have a ~50-million-character Unicode string that I download from a Tektronix oscilloscope. Just getting it assigned is a pain for memory (sys.getsizeof() reports ~100 MB).

The problem is that I need to break this into its comma-separated values so that I can work through the 10 million values in 10,000-sample blocks (the block size is fixed). I have tried two approaches:

1) The split(",") method. With this, RAM usage on the Python kernel spikes another ~300 MB, but the processing itself is very fast. The trouble is that when I loop this ~100 times in one routine, somewhere between iterations 40 and 50 the kernel spits back a memory error.

2) My own script: after downloading the absurdly long string, it scans for commas until it has seen 10,000, stops, converts the values between the commas to floats, and populates an np array. This is pretty efficient memory-wise (from before the download to after running the script, memory usage only changes by ~150 MB), but it is MUCH slower and usually results in a kernel crash shortly after the 100x loop completes.

Below is the code I use to process this string. If you PM me I can send you a copy of the string for experimenting, though I'm sure it may be easier to just generate one.

Code 1 (using the split() method):

import numpy as np

# PPSinst (the scope session), samples, yoff, ymult and yzero are defined earlier
PPStrace = PPSinst.query('CURV?')   # ~100 MB comma-separated string
PPStrace = PPStrace.split(',')      # RAM spikes another ~300 MB here
PPSvals = []
for iii in range(len(PPStrace)):    # does some algebra to the values (scales raw counts)
    PPStrace[iii] = ((float(PPStrace[iii])) - yoff)*ymult + yzero

maxes = np.empty(shape=(0, 0))
iters = int(samples/1000)           # samples is 10 million, so iters is 10,000
for i in range(1000):               # looks for max value in 10,000-sample increments, adds to "maxes"
    print i
    maxes = np.append(maxes, max(PPStrace[i*iters:(i+1)*iters]))
PPS = 100*np.std(maxes)/np.mean(maxes)
print PPS, " % PPS Noise"

Code 2 (self-written script):

import numpy as np

# PPSinst, samples, yoff, ymult and yzero are defined earlier, as in Code 1
PPStrace = PPSinst.query('CURV?')
walkerR = 1                        # right index, walks forward to the next comma
walkerL = 0                        # left index, start of the current value
length = len(PPStrace)
maxes = np.empty(shape=(0, 0))
iters = int(samples/1000)          # samples is 10 million, iters is then 10,000

for i in range(1000):
    sample = []                    # initialize 10k sample list
    commas = 0
    while commas < iters:          # keep adding values to sample until 10,000 commas have been passed
        while PPStrace[walkerR] != unicode(","):  # walk right to the next comma (or the end of the string)
            walkerR += 1
            if walkerR == length:
                break
        # add the value between the commas to the sample list, applying the scope scaling
        sample.append((float(PPStrace[walkerL:walkerR]) - yoff)*ymult + yzero)
        walkerL = walkerR + 1
        walkerR += 1
        commas += 1
    maxes = np.append(maxes, max(sample))
PPS = 100*np.std(maxes)/np.mean(maxes)
print PPS, "% PPS Noise"

I also tried a pandas DataFrame with StringIO for the CSV conversion. It gets a memory error just trying to read the string into a frame.

I am thinking the solution would be to load this into a SQL table and then pull the values back out in 10,000-sample chunks (which is the intended purpose of the script), but I would love not to have to do this!
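For reference, the SQL route I'm hoping to avoid would look roughly like this (just a sketch using sqlite3; the file and table names are made up and I haven't tested it):

import sqlite3

# Rough sketch only: dump the parsed values into a throwaway SQLite file,
# then pull them back out 10,000 rows at a time.  File/table names are made up.
conn = sqlite3.connect('pps_trace.db')
conn.execute('CREATE TABLE samples (idx INTEGER PRIMARY KEY, val REAL)')
conn.executemany('INSERT INTO samples (val) VALUES (?)',
                 ((float(v),) for v in PPStrace.split(',')))  # still needs the big split
conn.commit()

cur = conn.cursor()
cur.execute('SELECT val FROM samples ORDER BY idx')
chunk = cur.fetchmany(10000)
while chunk:
    # process this 10k block (e.g. take its max), then fetch the next one
    chunk = cur.fetchmany(10000)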

Thanks for all your help guys!

  • Hi Prune, thanks for your input. I have not tried cStringIO, will have to try it. If by the read/parse method you mean reading one 10k block from the scope instead of the 10m block, that I cannot do: the idea behind what I'm doing specifically requires that large a pull (10m sequential points; blocks of 10k will not be sequential). Although now that I think about it, I can store the 10m as a saved waveform and read in chunks from that! Thanks for the idea!! – Steven Yampolsky Jan 27 '16 at 22:46
  • I turned my previous comment into an answer for more effective handling. /// I meant for you to read in whatever block size is effective for your application. You can often speed things up by having the (slower) read fetch the next block while you're processing the present one. This is simple buffering from the old days. – Prune Jan 27 '16 at 22:49
  • Makes sense. For my application I need to grab all 10m samples at once. I forgot the scope can actually STORE these 10m samples, so I can read chunk by chunk from the stored waveform instead of downloading the whole live feed at once! Is there a way to parallelize the process? It seems that Python will not move on to the next command until the current one (PPSinst.query('CURV?')) is finished. – Steven Yampolsky Jan 28 '16 at 17:08

2 Answers


Take a look at numpy.frombuffer (http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.frombuffer.html). This lets you specify a count and an offset. You should be able to put the big string into a buffer and then process it in chunks to avoid huge memory spikes.
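For illustration, here's a minimal sketch of the count/offset idea. It assumes the waveform were transferred as fixed-width binary (one signed byte per sample) instead of the ASCII string; raw_bytes is a placeholder name, and yoff/ymult/yzero are the scaling factors from your question:

import numpy as np

# Sketch only: frombuffer needs fixed-width elements, so this assumes the trace
# was transferred as signed 8-bit binary samples (raw_bytes is a placeholder).
block = 10000
for i in range(1000):
    # count/offset pull out one 10,000-sample window without copying the whole trace
    chunk = np.frombuffer(raw_bytes, dtype=np.int8, count=block, offset=i*block)
    scaled = (chunk.astype(np.float64) - yoff)*ymult + yzero   # same scaling as in the question
    # ...process 'scaled' here (e.g. take its max)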


EDIT 2016-02-01

Since frombuffer needs a fixed byte width, I tried numpy.fromregex and it seems to be able to parse the string quickly. It has to process the whole thing at once, though, which might cause some memory issues. http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.fromregex.html

Something like this:

import StringIO   # Python 2; on Python 3 use io.StringIO
import numpy as np

buf = StringIO.StringIO(big_string)   # big_string is the raw CURV? string
output = np.fromregex(buf, r'(-?\d+),?', dtype=[('val', np.int64)])  # ',?' keeps the last (comma-less) value
# output['val'] is the array of values
crdavis
  • Will give it a shot if time permits, have not tried this approach yet! – Steven Yampolsky Jan 28 '16 at 17:09
  • frombuffer is a good suggestion, but unfortunately there is no way to know a priori how many characters make up 10k comma-separated values. Let me explain: the string consists of a feed like ...127,23,1,100,-3,24..., and because frombuffer needs to know exactly how many characters to pull out, I can't know in advance the parameters needed to extract 10k comma-separated values. That would require pulling out some arbitrary number of characters, scanning what was pulled to count the commas, and repeating (maybe with a bisection search given an upper and lower bound?). Unless I misunderstand how frombuffer works? – Steven Yampolsky Jan 31 '16 at 02:04
  • Understood. You need to have a fixed number of bytes for frombuffer to work right. – crdavis Feb 01 '16 at 12:23

Have you tried the cStringIO class? It works just like file I/O, but uses a string as the buffer instead of a file on disk. Frankly, I suspect you're suffering from a chronic speed problem. Your self-generated script is the right approach. You might get some speed-up if you read a block at a time and then parse it while the next block is being read.
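In case it helps, a minimal sketch of the block-at-a-time reading (big_string stands in for your CURV? string; the 1 MB block size is arbitrary):

import cStringIO   # Python 2; on Python 3 use io.StringIO

buf = cStringIO.StringIO(big_string)   # big_string = the raw CURV? string
leftover = ''
while True:
    block = buf.read(1 << 20)          # read ~1 MB of text at a time
    if not block:
        break
    block = leftover + block
    cut = block.rfind(',')             # only parse up to the last complete value
    if cut == -1:                      # no complete value yet; keep reading
        leftover = block
        continue
    leftover = block[cut + 1:]         # partial value carried into the next block
    values = [float(v) for v in block[:cut].split(',')]
    # ...process 'values' here
if leftover:
    last_value = float(leftover)       # the final value has no trailing comma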


For parallel processing, use the multiprocessing package. See the official documentation or this tutorial for details and examples.

Briefly, you create a function that embodies the work you want to run in parallel. You then create a Process with that function as its target parameter and start the process. When you want to merge it back into the main program, use join.
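A minimal sketch of that pattern (process_block and the toy string are made up for illustration):

from multiprocessing import Process, Queue

def process_block(q, block_id, text_block):
    # parse one block of comma-separated text and put its max on the queue
    vals = [float(v) for v in text_block.split(',')]
    q.put((block_id, max(vals)))

if __name__ == '__main__':
    q = Queue()
    p = Process(target=process_block, args=(q, 0, "1,2,3,42,5"))
    p.start()      # runs process_block in a separate process
    p.join()       # wait for it and merge back into the main program
    print q.get()  # -> (0, 42.0)

In your case the worker could be parsing block i while the main program fetches block i+1 from the scope.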

Prune