
This might seem like a broad question, but what I want to achieve is quite specific, so please read on.

I am building an application that connects to a market broadcast and subscribes to nearly 200 scrips. A broadcast tick is received for every scrip every second.

Now in the application I want to achieve two things:

  1. Collect the broadcast and build 1-min, 5-min, 15-min candles and so on.
  2. Compare the live broadcast values for certain scrips against predefined values.

I have tried different approaches to the problem, and only one of them showed reasonable results, even though it is still not the desired result.

My current approach:

main.py----->Process----
                       |
                       |
                  ------------
                  |           |
              broadcast    candle 

In broadcast I receive the ticks and build 1-min candles, which I forward through a Pipe() connection to candle. In candle I try to build candles at the required intervals.
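
In code, the layout is roughly this (a simplified, self-contained sketch; the synthetic one-tick-per-second generator only stands in for my actual broker callback, and the candle worker just prints what it receives):

import random
import time
from multiprocessing import Pipe, Process

def broadcast_worker(conn):
    # Stand-in for the broker feed: emits one synthetic tick per symbol
    # per second and pushes it down the pipe to the candle process.
    symbols = ['ACC', 'SBIN', 'TCS']
    while True:
        for sym in symbols:
            tick = {'symbol': sym,
                    'ltp': round(random.uniform(100, 2000), 2),
                    'ts': time.time()}
            conn.send(tick)
        time.sleep(1)

def candle_worker(conn):
    # Stand-in for the candle builder: the real version would update the
    # in-progress candles here; the sketch just prints the ticks.
    while True:
        print('tick:', conn.recv())

if __name__ == '__main__':
    broadcast_end, candle_end = Pipe()
    Process(target=broadcast_worker, args=(broadcast_end,), daemon=True).start()
    candle_proc = Process(target=candle_worker, args=(candle_end,))
    candle_proc.start()
    candle_proc.join()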

Problem:

  1. It takes nearly one second per symbol for each candle to form. I store the data in a pandas DataFrame and then use a groupby() operation to form a candle.
  2. The 1-min candles take nearly 6-10 seconds to form; the desired time is about 2 seconds or less.
  3. This leaves no room for the broadcast comparison.

What I want to know is: what should my approach be, in terms of data structures and data processing, so that there is enough headroom for all the processes to run, and as fast as possible?

I was thinking of HDF5 for data storage and dask for processing, but is that worth it? (I have never worked with HDF5/dask.)
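
From the pandas docs, I gather that appending finished candles to an HDF5 store would look roughly like this (untested on my side; needs PyTables installed, and the file name is just an example):

import pandas as pd  # HDF5 support requires PyTables (`pip install tables`)

# A couple of finished 1-min candles in the same shape as my output below.
candles = pd.DataFrame(
    {'symbol': ['ACC', 'ADANIENT'],
     'open': [1431.50, 131.80], 'high': [1432.15, 131.90],
     'low': [1431.25, 131.75], 'close': [1431.85, 131.80],
     'volume': [1091.0, 4516.0]},
    index=pd.DatetimeIndex(['2019-02-27 12:29:00', '2019-02-27 12:29:00'],
                           name='timestamp'))

with pd.HDFStore('candles_1min.h5') as store:
    # 'table' format supports appending and querying by column.
    store.append('candles', candles, format='table', data_columns=['symbol'])
    acc = store.select('candles', where="symbol == 'ACC'")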

Any suggestion is welcome. Thanks!

EDIT

This is the code where I build candles at the required frequency. Note that it is called for every scrip, after I slice that scrip's data from a master DataFrame.

# Methods of my candle class; `import pandas as pd` and
# `from time import time` are at module level.
def __create_minutes_candles(self, data=None, freq=None, latest=True):
    start_time = time()  # used for the timings reported above
    data = data if isinstance(data, pd.DataFrame) else self.__data
    if freq and not data.empty:
        for _symbol in self.scrips:
            symbol_df = data[data['symbol'] == _symbol]
            symbol_candle = self.__create_candle(_symbol, symbol_df, freq)
            # Following code assembles the candles for all symbols
            # and then appends them to a file.

def __create_candle(self, symbol, symbol_df, freq):
    # symbol_df is expected to carry a DatetimeIndex (timestamp as index).
    if not symbol_df.empty:
        # Python-level apply per time bucket -- this is the slow part.
        candle = symbol_df.groupby(pd.Grouper(freq=freq, base=self.base)).apply(self.__candle_calc)
        candle.dropna(how='all', inplace=True)
        volume = symbol_df['volume'].resample(rule=freq, base=self.base).sum()
        volume = volume[volume != 0]
        symbol_candle = pd.concat([candle, volume], axis=1)
        symbol_candle['symbol'] = symbol  # broadcasts to every row
        symbol_candle = symbol_candle[self.file_columns]
        return symbol_candle
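
For comparison, here is a rough, untested sketch of the kind of vectorized replacement I am experimenting with: it uses pandas' built-in ohlc() resampler instead of groupby().apply(), processes all symbols in one pass, and assumes volume is per-tick and timestamp is a column of the master DataFrame:

import pandas as pd

def make_candles(master_df, freq, base=0):
    # All symbols in one pass; no Python-level apply per group.
    # `base` is passed through as in my self.base above.
    df = master_df.set_index('timestamp')
    grouped = df.groupby('symbol')
    ohlc = grouped['ltp'].resample(freq, base=base).ohlc()      # open/high/low/close
    volume = grouped['volume'].resample(freq, base=base).sum()
    candles = pd.concat([ohlc, volume], axis=1)
    # Drop empty bins (intervals in which a symbol had no ticks).
    return candles.dropna(subset=['open']).reset_index()

# usage: candles = make_candles(master_df, '1T')  # '1T' = 1-minute bins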

This is a view of the master DataFrame from which I generate the candles:

               timestamp      symbol       ltp      volume
0    2019-02-27 12:34:38    NIFTY_50  10778.70         NaN
1    2019-02-27 12:34:36         ACC   1430.00   1154730.0
2    2019-02-27 12:34:37    ADANIENT    131.30   3687673.0
3    2019-02-27 12:34:37  ADANIPORTS    325.60   4091014.0
4    2019-02-27 12:34:37  ADANIPOWER     47.85  28333489.0
5    2019-02-27 12:34:35  AJANTPHARM    997.00    397915.0
6    2019-02-27 12:34:37        ALBK     46.30  17753857.0
7    2019-02-27 12:34:33  AMARAJABAT    720.90    317973.0
8    2019-02-27 12:34:37   AMBUJACEM    214.95   1648760.0
9    2019-02-27 12:34:35  APOLLOHOSP   1152.15    738995.0
10   2019-02-27 12:34:37  APOLLOTYRE    212.65   1722708.0
11   2019-02-27 12:34:37      ARVIND     75.45   2225521.0
12   2019-02-27 12:34:37    ASHOKLEY     82.95  21954785.0
13   2019-02-27 12:34:37  ASIANPAINT   1405.35    755940.0
14   2019-02-27 12:34:37  AUROPHARMA    717.70   1248541.0
15   2019-02-27 12:34:37    AXISBANK    715.25   8247772.0
16   2019-02-27 12:34:37  BAJAJ-AUTO   2900.95    541710.0
17   2019-02-27 12:34:33  BAJAJFINSV   6398.05    182373.0
18   2019-02-27 12:34:37  BAJFINANCE   2637.80    759410.0
19   2019-02-27 12:34:37  BALKRISIND    890.90    659789.0
20   2019-02-27 12:34:37  BANKBARODA    100.60   6876303.0
21   2019-02-27 12:34:37   BANKINDIA     81.15   6643359.0
22   2019-02-27 12:34:36   BATAINDIA   1298.00    492525.0
23   2019-02-27 12:34:37         BEL     80.40   8375060.0
24   2019-02-27 12:34:37        BEML    837.90   2072199.0
25   2019-02-27 12:34:36  BERGEPAINT    297.00    564332.0
26   2019-02-27 12:34:36   BHARATFIN    921.05    271384.0
27   2019-02-27 12:34:36  BHARATFORG    510.85   1321287.0
28   2019-02-27 12:34:37  BHARTIARTL    316.25   2660313.0
29   2019-02-27 12:34:37        BHEL     65.05   3227136.0
...                  ...         ...       ...         ...
8410 2019-02-27 12:35:56      RPOWER     10.50  30267776.0
8411 2019-02-27 12:35:56        SAIL     48.45  12045549.0
8412 2019-02-27 12:35:56        SBIN    268.10  13249778.0
8413 2019-02-27 12:35:56    SHREECEM  16861.30     24079.0
8414 2019-02-27 12:35:56     SIEMENS    986.80    168896.0
8415 2019-02-27 12:35:56  SRTRANSFIN   1132.40    813839.0
8416 2019-02-27 12:35:56        STAR    412.65    710360.0
8417 2019-02-27 12:35:56   SUNPHARMA    441.75   9114905.0
8418 2019-02-27 12:35:56       SUNTV    604.80   2645366.0
8419 2019-02-27 12:35:56    TATACOMM    582.05    745842.0
8420 2019-02-27 12:35:56  TATAGLOBAL    191.55   1080853.0
8421 2019-02-27 12:35:56  TATAMOTORS    177.80  12885577.0
8422 2019-02-27 12:35:56  TATAMTRDVR     87.95   2138988.0
8423 2019-02-27 12:35:56   TATAPOWER     65.80   3032921.0
8424 2019-02-27 12:35:56   TATASTEEL    502.50   5517049.0
8425 2019-02-27 12:35:56         TCS   2046.40   2315885.0
8426 2019-02-27 12:35:56       TECHM    823.75   2428864.0
8427 2019-02-27 12:35:56       TITAN   1025.30    960611.0
8428 2019-02-27 12:35:56  TORNTPHARM   1823.75    134955.0
8429 2019-02-27 12:35:56  TORNTPOWER    238.90    901773.0
8430 2019-02-27 12:35:56  TV18BRDCST     33.05   2314156.0
8431 2019-02-27 12:35:56    TVSMOTOR    457.40    993427.0
8432 2019-02-27 12:35:56         UBL   1338.50    203388.0
8433 2019-02-27 12:35:56  ULTRACEMCO   3863.20    401531.0
8434 2019-02-27 12:35:56         UPL    864.00   1799297.0
8435 2019-02-27 12:35:56        VEDL    164.60   6466586.0
8436 2019-02-27 12:35:56      VOLTAS    543.25   1388822.0
8437 2019-02-27 12:35:56       WIPRO    376.90   4978421.0
8438 2019-02-27 12:35:56     YESBANK    224.45  42431769.0
8439 2019-02-27 12:35:56        ZEEL    462.50   7847558.0

This is a view of the candle DataFrame that I generate:

                         symbol     open     high      low    close    volume
timestamp                                                                    
2019-02-27 12:29:00         ACC  1431.50  1432.15  1431.25  1431.85    1091.0
2019-02-27 12:29:00    ADANIENT   131.80   131.90   131.75   131.80    4516.0
2019-02-27 12:29:00  ADANIPORTS   326.15   326.55   326.00   326.40    6476.0
2019-02-27 12:29:00  ADANIPOWER    47.80    47.80    47.50    47.70   71156.0
2019-02-27 12:29:00  AJANTPHARM   999.00   999.00   998.10   998.10     245.0
2019-02-27 12:29:00        ALBK    46.55    46.70    46.55    46.65   11483.0

  • Can you give us an example of the undesired output you wanted to receive from the 'working solution'? – Daniel Scott Feb 27 '19 at 06:34
  • Also, in my related field I understand the value of what you are trying to do. Groupby will be very very slow for you. Are you sure that you need pandas for that? Why not access the underlying numpy arrays? It's about 500x faster for each operation. Please post an example of a slow groupby operation and I can show you a potential equivalent in numpy. – Daniel Scott Feb 27 '19 at 06:35
  • I have added the code. I can use anything that gets me to the desired result. If you have another approach, I can work on it. – ap14 Feb 27 '19 at 06:44
  • How is this dataframe being generated? Are you using something like a MT5 -> DLL -> mySQL -> pd.read_sql pipeline? – Daniel Scott Feb 27 '19 at 06:53
  • It would be great to see a slice of the original df that you are generating so I can fit the solution better. But I understand if you can't. – Daniel Scott Feb 27 '19 at 06:54
  • I have made the edits, @DanielScott. Now I am getting to the root of it, I think. Maybe the concat I use is slow in pandas. Actually, I have worked more with pandas than numpy, so I can't tell the difference. – ap14 Feb 27 '19 at 07:07
