Python pandas finding the middle 50%

Question

I'm using Python with pandas to process stock tick data, and I would like to compress each day down to the total volume, the high, the low, the average, and the prices at 25% and 75% of trade volume. I am unsure how to find where the 25% and 75% levels lie.

# References
from time import time
import urllib.request as web
import pandas as pd
import os

dateToday = "2014-10-31"

def pullData(exchange,stock,date):
    baseUrl='http://netfonds.no/quotes/tradedump.php?csv_format=csv'
    fullUrl=baseUrl+'&date='+date.replace("-","")+'&paper='+stock+'.'+exchange
    fileName=('netfonds/trades/'+stock+'.txt')
    try:
        if not os.path.isdir(os.path.dirname(fileName)):
            os.makedirs(os.path.dirname(fileName))
    except OSError:
        print("Directory Error")
    #print(fullUrl)    
    webBuffer=web.urlopen(fullUrl)
    webData=pd.read_csv(webBuffer,usecols=['price','quantity'])
    low = webData['price'].min()
    high = webData['price'].max()
    print(low,high)


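def getList(fileName):
    stockList = []
    # Each non-comment line is expected to look like EXCHANGE.TICKER
    with open(fileName+'.txt', 'r') as file:
        fileByLines = file.read().split('\n')
    for eachLine in fileByLines:
        # Skip blank lines and comment lines
        if eachLine.strip() and '#' not in eachLine:
            lineByValues = eachLine.split('.')
            stockList.append(lineByValues)
    return stockList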

def fromList():
    print("Parsing stock tickers...")
    stockList = getList('stocks')
    print("Found "+str(len(stockList))+" stocks")

    for eachEntry in stockList:
        start_time = time()
        try:
            print("Attempting to pull data for "+eachEntry[1])
            pullData(eachEntry[0],eachEntry[1],dateToday)
            print("Pulled successfully in "+str(round(time()-start_time))+" seconds")
        except Exception:
            print("Unable to pull data... "+eachEntry[1])

first_time = time()
fromList()
print("Program Finished! Took "+str(round((time()-first_time)/60))+' minutes')

Welcome to Stack Overflow! Can you post some code that you've written? — tsnorri, Nov 02 '14 at 18:05
[`describe`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html#pandas.DataFrame.describe) does this already — EdChum, Nov 02 '14 at 18:10
If interested, I had to write my own helper functions to do something like this (in an older version of pandas) because I needed *weighted* percentiles, and not just regular percentiles. The implementation I hacked up and then fixed is here: < http://stackoverflow.com/questions/11585564/definitive-way-to-match-stata-weighted-xtile-command-using-python >. Most of this is directly available in pandas now. — ely, Nov 02 '14 at 18:16
Yeah, It looks like I need weighted quartiles, but I don't quite understand the code in that post. — IDon'tUnderstandOOP, Nov 02 '14 at 19:22

score 2 · Answer 1 · answered Nov 02 '14 at 18:12

pandas Series and DataFrame have a describe method, which is similar to R's summary:

In [3]: import numpy as np

In [4]: import pandas as pd

In [5]: s = pd.Series(np.random.rand(100))

In [6]: s.describe()
Out[6]: 
count    100.000000
mean       0.540376
std        0.296250
min        0.002514
25%        0.268722
50%        0.593436
75%        0.831067
max        0.991971
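
If only the quartiles are needed rather than the full summary, `Series.quantile` returns them directly. Below is a minimal sketch, assuming a small made-up price series (the values are illustrative and not taken from the question's data). Note that these quartiles are unweighted: every row counts once, regardless of the traded quantity.

```python
import pandas as pd

# Hypothetical prices, for illustration only
prices = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0])

# Unweighted 25th and 75th percentiles (linear interpolation by default)
quartiles = prices.quantile([0.25, 0.75])
print(quartiles)
```

The result is a Series indexed by the requested fractions, so `quartiles.loc[0.25]` and `quartiles.loc[0.75]` give the two levels.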

hd1


Umm, it's computing the statistics for each column independently. I need the quantity column to be the frequency list for the price column. – IDon'tUnderstandOOP Nov 02 '14 at 18:17
Then, put the quantity column as the frequency list for the price column. – hd1 Nov 02 '14 at 18:19

score 0 · Accepted Answer · answered Nov 02 '14 at 20:01

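I found what I needed by simply using numpy.repeat(). It repeats each price once per unit of quantity, so percentiles of the resulting column (for example via describe()) are weighted by trade volume. This needs import numpy as np alongside the imports in the question.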

inflated=pd.DataFrame(np.repeat(webData['price'].values,webData['quantity'].values))
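
A minimal sketch of this approach on toy data, assuming `price` and `quantity` columns as in the question and integer quantities (`np.repeat` needs integer repeat counts). The trades below are made up, and the function names are hypothetical. The second function is an alternative that is not part of the original answer: it returns the lowest price at which cumulative volume reaches the requested fraction, which avoids building a very large array on heavy-volume days.

```python
import numpy as np
import pandas as pd

# Made-up trades for illustration; real data would come from read_csv
trades = pd.DataFrame({
    "price": [10.0, 11.0, 12.0, 13.0],
    "quantity": [1, 2, 5, 2],
})

def quartiles_by_repeat(df):
    """Volume-weighted quartiles via np.repeat (the approach above)."""
    # Repeat each price once per unit traded, then take ordinary quantiles
    inflated = pd.Series(np.repeat(df["price"].values, df["quantity"].values))
    return inflated.quantile([0.25, 0.75])

def quantile_by_cumsum(df, q):
    """Alternative sketch: lowest price at which cumulative volume reaches q."""
    ordered = df.sort_values("price")
    cum_share = ordered["quantity"].cumsum() / ordered["quantity"].sum()
    # Position of the first row whose cumulative share of volume is at least q
    idx = np.searchsorted(cum_share.values, q, side="left")
    return ordered["price"].iloc[idx]

print(quartiles_by_repeat(trades))
print(quantile_by_cumsum(trades, 0.25), quantile_by_cumsum(trades, 0.75))
```

The two can differ slightly: `quantile` interpolates between neighbouring values by default, while the cumulative version always returns a price that actually traded.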

IDon'tUnderstandOOP

