4

I have two lists of files that I'm pulling from an FTP folder using:

sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')

This results in two lists looking something like:

sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']

oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

With my real data set, something like:

for sales, orders in zip(sorted(sFiles), sorted(oFiles)): 
     df = pd.concat(...)

gets my desired result, but there will be times when something goes wrong and both files do not make it into the proper FTP folder. I'd like some code that creates an iterable object from which I can extract the matched orders and sales file names by date.

The following works. I'm not sure what "pythonic" score I'd give it: poor readability, but it is a comprehension, so I'd imagine there are performance gains?

[(sales, orders) for sales in sFiles for orders in oFiles if re.search(r'\d+',sales).group(0) == re.search(r'\d+',orders).group(0)]
Yale Newman
  • Your example is O(n²) so won't be efficient for large data sets. Regex is overkill: `sales[:8]==orders[:8]` would work if the naming is consistent. – Mark Tolonen Sep 19 '17 at 01:30
  • yea i get that nested loops aren't ideal. was hoping there was some way to have a lambda function in the zip function. also the naming conventions of the files will always be consistent. – Yale Newman Sep 19 '17 at 01:40
  • 1
    Interestingly, here's some timings on your original data and a list of 1000 pairs of dated files: listcomp with re: 117us (original data only), listcomp with [:8] slice: 10.2us/249ms, my solution: 13.5us/1.63ms, pandas solution: 2.41ms/50.2ms. So listcomp with [:8] slice was the fastest with small data, but scaled badly. pandas was actually the worst but was only 20x slower with large data, while my Python solution was 120x slower for large data, so pandas with **very** large datasets would likely be faster. Moral of the story, measure! – Mark Tolonen Sep 19 '17 at 02:42

4 Answers

3

Taking advantage of the index of the pandas DataFrame:

import pandas as pd
sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

# pd.to_datetime parses the YYYYMMDD prefix into a Timestamp
# (pd.Timestamp.strptime is not usable for this in pandas)
s_dates = [pd.to_datetime(file[:8], format='%Y%m%d') for file in sFiles]
s_df = pd.DataFrame({'sFiles': sFiles}, index=s_dates)

o_dates = [pd.to_datetime(file[:8], format='%Y%m%d') for file in oFiles]
o_df = pd.DataFrame({'oFiles': oFiles}, index=o_dates)

df = s_df.join(o_df, how='outer')

and so:

>>> print(df)
                        sFiles               oFiles
2017-08-22  20170822_sales.csv  20170822_orders.csv
2017-08-23                 NaN  20170823_orders.csv
2017-08-24  20170824_sales.csv  20170824_orders.csv
2017-08-25  20170825_sales.csv  20170825_orders.csv
2017-08-26  20170826_sales.csv  20170826_orders.csv
2017-08-27  20170827_sales.csv  20170827_orders.csv
2017-08-28  20170828_sales.csv                  NaN
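Once the joined frame exists, here is a hedged sketch of splitting it into matched pairs and missing dates (`dropna`/`isna`/`itertuples` are standard pandas; the `matched`/`missing`/`pairs` names are my own):

```python
import pandas as pd

sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

# Index each list by its parsed date prefix, then outer-join on date.
s_df = pd.DataFrame({'sFiles': sFiles},
                    index=pd.to_datetime([f[:8] for f in sFiles], format='%Y%m%d'))
o_df = pd.DataFrame({'oFiles': oFiles},
                    index=pd.to_datetime([f[:8] for f in oFiles], format='%Y%m%d'))
df = s_df.join(o_df, how='outer')

# Rows where both files arrived -- safe to pair up and concat.
matched = df.dropna()
pairs = list(matched.itertuples(index=False, name=None))

# Rows where one side is missing -- candidates for logging/alerting.
missing = df[df.isna().any(axis=1)]
```

`pairs` is then an ordinary list of `(sales, orders)` tuples you can loop over, while `missing` tells you which dates to investigate.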
Hazzles
  • `20170823_orders.csv` is missing. – Mark Tolonen Sep 19 '17 at 01:14
  • 1
    so it is - fixed now! – Hazzles Sep 19 '17 at 01:31
  • @Hazzles I'd imagine that this solution would likely be the fastest just because of how pandas leverages C or whatever it is which is typically faster than python? – Yale Newman Sep 19 '17 at 01:48
  • Definitely faster than a nested for-loop approach for large numbers of items. I *suspect* most of the speed up will be because we've converted the problem to a set-like operation of taking the union of the two indexes, and the C-under-the-hood speedup is a second-order effect. – Hazzles Sep 19 '17 at 01:55
  • @Hazzles have any solid reading material on Big O / how to think about making programs more efficient? – Yale Newman Sep 19 '17 at 02:19
  • Afraid I'm not familiar with any good resources. Off the top of my head: try to vectorize instead of looping (e.g. `np.arange(10**6) * 2` is much quicker than `[i * 2 for i in range(10**6)]`); and in a very practical sense, don't spend too much time prematurely optimizing, since it can be quite non-obvious where the slow-down in a program is coming from. `cProfile` can be really useful https://docs.python.org/3.6/library/profile.html – Hazzles Sep 19 '17 at 02:55
2

You can use a dictionary:

import collections
d = collections.defaultdict(dict)

sFiles = ftp.nlst(date+'sales.csv')
oFiles = ftp.nlst(date+'orders.csv')
for sale, order in zip(sFiles, oFiles):
    sale_date, _ = sale.split("_")
    order_date, _ = order.split("_")
    d[sale_date]["sales"] = sale
    d[order_date]["orders"] = order
print(dict(d))

Now, your data is structured in the format: {"date":{"sales":"sales filename", "orders":"orders filename"}}

Output:

{'20170828': {'sales': '20170828_sales.csv'}, '20170822': {'sales': '20170822_sales.csv', 'orders': '20170822_orders.csv'}, '20170823': {'orders': '20170823_orders.csv'}, '20170824': {'sales': '20170824_sales.csv', 'orders': '20170824_orders.csv'}, '20170825': {'sales': '20170825_sales.csv', 'orders': '20170825_orders.csv'}, '20170826': {'sales': '20170826_sales.csv', 'orders': '20170826_orders.csv'}, '20170827': {'sales': '20170827_sales.csv', 'orders': '20170827_orders.csv'}}

Edit:

Building off of your proposed list-comprehension solution, you can build a list of dicts like this:

import re
final_data = [{"sold": sold, "order": order} for sold in sFiles for order in oFiles if re.findall(r"\d+", sold)[0] == re.findall(r"\d+", order)[0]]

Output:

[{'sold': '20170822_sales.csv', 'order': '20170822_orders.csv'}, {'sold': '20170824_sales.csv', 'order': '20170824_orders.csv'}, {'sold': '20170825_sales.csv', 'order': '20170825_orders.csv'}, {'sold': '20170826_sales.csv', 'order': '20170826_orders.csv'}, {'sold': '20170827_sales.csv', 'order': '20170827_orders.csv'}]
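For what it's worth, the O(n²) nested comprehension above can be avoided by first indexing the orders by their date prefix. A hedged sketch, assuming the fixed `YYYYMMDD_` naming convention (the `orders_by_date` name is my own):

```python
sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

# One O(n) pass to index orders by date, then one O(n) pass to match sales.
orders_by_date = {o[:8]: o for o in oFiles}
final_data = [{"sold": s, "order": orders_by_date[s[:8]]}
              for s in sFiles if s[:8] in orders_by_date]
```

Same output shape as above, but each file is looked at only once.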
Ajax1234
  • What is collections doing here? Is it possible to do this with a dictionary comprehension? Check my desired solution in the edited version of my question – Yale Newman Sep 19 '17 at 01:04
  • @YaleNewman Please see my recent edit and let me know if my solution is what you had in mind. – Ajax1234 Sep 19 '17 at 02:12
  • yes this is what I initially had in mind. however, i'm fairly confident leveraging pandas indices yields the most efficient solution. i could be wrong. – Yale Newman Sep 19 '17 at 02:21
  • @YaleNewman regarding efficiency, the first `zip` is O(n); however, the list-comprehension is O(n^2). In this case, with smaller input sizes, a dictionary is more efficient. However, when accessing data by column, Pandas is faster. – Ajax1234 Sep 19 '17 at 02:27
  • @Ajax, actually, pandas isn't until the data is *much* larger. I put some timings in a comment under the question. – Mark Tolonen Sep 19 '17 at 02:45
  • @MarkTolonen Interesting timing data and observations. Thank you for posting! – Ajax1234 Sep 19 '17 at 02:52
  • `zip` will truncate the longer of the two lists to the length of the shorter, which could mean you don't process the last pair if one of the files is missing. Just iterate over them separately. – trent Sep 19 '17 at 11:38
1

Just because comprehensions exist doesn't mean you should use them for everything. This works fine:

import re

date = re.compile(r'\d+')
for sales in sFiles:
    salesDate = date.search(sales).group(0)
    for orders in oFiles:
        orderDate = date.search(orders).group(0)
        if salesDate == orderDate:
            print(sales, orders)

Is it possible to make it faster? Yes. But you don't need to force it into a list comprehension just because you can. Sometimes writing more code is better, just because it spreads the complexity out a little.

Here's an incremental improvement that makes the algorithm O(n):

date = re.compile(r'\d+')
orders_dict = dict((date.search(file).group(0), file) for file in oFiles)

for sales in sFiles:
    salesDate = date.search(sales).group(0)
    if salesDate in orders_dict:
        orders = orders_dict[salesDate]
        print(sales, orders)
    else:
        # What do you do if it doesn't exist? You can't put handling code
        # here if you try to write this as a comprehension.
        pass
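If you would rather surface the missing dates than silently skip them, one hedged option is to pair each sales file with `orders_dict.get(...)` and let `None` mark the gap (a sketch; the `paired` name is my own):

```python
import re

sFiles = ['20170822_sales.csv', '20170824_sales.csv', '20170825_sales.csv',
          '20170826_sales.csv', '20170827_sales.csv', '20170828_sales.csv']
oFiles = ['20170822_orders.csv', '20170823_orders.csv', '20170824_orders.csv',
          '20170825_orders.csv', '20170826_orders.csv', '20170827_orders.csv']

date = re.compile(r'\d+')
orders_dict = {date.search(f).group(0): f for f in oFiles}

# None in the second slot marks a sales file whose orders file never arrived.
paired = [(s, orders_dict.get(date.search(s).group(0))) for s in sFiles]
```

Downstream code can then branch on `orders is None` instead of needing an `else` inside the comprehension.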
trent
1

This creates a generator that returns matched pairs in date order:

from collections import defaultdict

def match(sFiles, oFiles):
    # When a key is referenced for the first time, the value
    # will default to the result of the lambda.
    d = defaultdict(lambda: [None, None])

    # Set sales files on the first entry in the value.
    for sale in sFiles:
        d[sale[:8]][0] = sale
    # Set orders files on the second entry.
    for order in oFiles:
        d[order[:8]][1] = order

    for k in sorted(d):
        # Both values need to exist.
        # If you want the singles, remove the if.
        if all(v for v in d[k]):
            yield d[k]

sFiles = ['20170822_sales.csv','20170824_sales.csv','20170825_sales.csv','20170826_sales.csv','20170827_sales.csv','20170828_sales.csv']
oFiles = ['20170822_orders.csv','20170823_orders.csv','20170824_orders.csv','20170825_orders.csv','20170826_orders.csv','20170827_orders.csv']

for s,o in match(sFiles,oFiles):
    print(s,o)

Output:

20170822_sales.csv 20170822_orders.csv
20170824_sales.csv 20170824_orders.csv
20170825_sales.csv 20170825_orders.csv
20170826_sales.csv 20170826_orders.csv
20170827_sales.csv 20170827_orders.csv
Mark Tolonen