7

I'm an R user and I cannot figure out the pandas equivalent of match(). I need to use this function to iterate over a bunch of files, grab a key piece of info, and merge it back into the current data structure on 'url'. In R I'd do something like this:

logActions <- read.csv("data/logactions.csv")
logActions$class <- NA

files = dir("data/textContentClassified/")
for( i in 1:length(files)){
    tmp <- read.csv(files[i])
    logActions$class[match(logActions$url, tmp$url)] <- 
            tmp$class[match(tmp$url, logActions$url)]
}

I don't think I can use merge() or join(), as each will overwrite logActions$class each time. I can't use update() or combine_first() either, as neither have the necessary indexing capabilities. I also tried making a match() function based on this SO post, but cannot figure out how to get it to work with DataFrame objects. Apologies if I'm missing something obvious.

Here's some Python code summarizing my unsuccessful attempts to do something like match() in pandas:

import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [1]})

# Doesn't work:
left.join(right1, on='url')
pd.merge(left, right1, on='url')

# Also doesn't work the way I need it to:
left = left.combine_first(right1)
left = left.combine_first(right2)
left 

# Also does something funky and doesn't really work the way match() does:
left = left.set_index('url', drop=False)
right1 = right1.set_index('url', drop=False)
right2 = right2.set_index('url', drop=False)

left = left.combine_first(right1)
left = left.combine_first(right2)
left

The desired output is:

       url  action  class
0  foo.com       0      0
1  foo.com       1      0
2  bar.com       0      1

BUT, I need to be able to call this over and over again so I can iterate over each file.
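For reference, here's a sketch of the kind of repeatable per-file update I'm after, using Series.map plus fillna on the toy frames above (I'm not sure this is the idiomatic way):

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left['class'] = np.nan

right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [1]})

# For each "right" file, build a url -> class lookup Series and fill only
# the rows of left that are still missing a class. Safe to call repeatedly.
for right in (right1, right2):
    lookup = right.set_index('url')['class']   # Series indexed by url
    left['class'] = left['class'].fillna(left['url'].map(lookup))

print(left)
```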

Solomon
  • [This topic](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_r.html#match) is probably the most relevant. – Roman Luštrik Jan 27 '21 at 16:44

3 Answers

10

Note the existence of pandas.match, which does precisely what R's match does.

Wes McKinney
  • Hey Wes, can you point me to the doc for pandas.match? I didn't see it here http://pandas.pydata.org/pandas-docs/stable/api.html or here http://pandas.pydata.org/pandas-docs/stable/genindex.html#M? Thanks! – Solomon Apr 08 '13 at 19:17
  • `In: pd.match([1,2,3],[1,2]) Out: array([0, 1, -1], dtype=int64)` it appears to give 0 for matching the first value in the second array and 1 for matching the second value and so on and -1 when not matched – seumas Apr 20 '13 at 22:37
  • Be careful with this solution. pd.match([1,2,3,5,8,2],[1,2,4,5,9,2]) produces array([ 0, 5, -1, 3, -1, 5]) It will only return a single match not 1 and 5 like you might hope. – mike Apr 26 '17 at 07:05
  • Note that `pandas.match` was deprecated several years ago, but [Index.get_indexer](https://pandas.pydata.org/docs/reference/api/pandas.Index.get_indexer.html) works great if your data are unique, and otherwise there's [this answer](https://stackoverflow.com/a/61388325/1519199) – Jthorpe Nov 07 '21 at 02:00
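A minimal sketch of the Index.get_indexer replacement mentioned in the last comment (it assumes the values being matched against are unique):

```python
import pandas as pd

# pd.match was removed from modern pandas; Index.get_indexer gives the same
# positional-match semantics when the index values are unique.
idx = pd.Index([1, 2, 4, 5])
positions = idx.get_indexer([1, 2, 3, 5])   # -1 marks "no match", like match()'s NA
print(positions)   # [ 0  1 -1  3]
```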
1

Edit:

If the urls in each right dataframe are unique, you can turn each right dataframe into a Series of class values indexed by url; then you can look up the class of every url in left by indexing into it.

import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'bar.com', 'foo.com', 'tmp', 'foo.com'], 'action': [0, 1, 0, 2, 4]})
left["klass"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com', 'tmp'], 'klass': [10, 20]})
right2 = pd.DataFrame({'url': ['bar.com'], 'klass': [30]})

# reindex (rather than label indexing) returns NaN for urls missing from the right frame
left["klass"] = left.klass.combine_first(right1.set_index('url').klass.reindex(left.url).reset_index(drop=True))
left["klass"] = left.klass.combine_first(right2.set_index('url').klass.reindex(left.url).reset_index(drop=True))

print(left)

Is this what you want?

import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'], 'action': [0, 1, 0]})
left["class"] = np.nan
right1 = pd.DataFrame({'url': ['foo.com'], 'class': [0]})
right2 = pd.DataFrame({'url': ['bar.com'], 'class': [1]})

pd.merge(left.drop("class", axis=1), pd.concat([right1, right2]), on="url")

output:

   action      url  class
0       0  foo.com      0
1       1  foo.com      0
2       0  bar.com      1

If the class column in left is not all NaN, you can combine_first it with the result.
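A sketch of that combine_first step with the same toy frames, assuming one class value in left is already filled in and should be kept:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'url': ['foo.com', 'foo.com', 'bar.com'],
                     'action': [0, 1, 0],
                     'class': [np.nan, 5, np.nan]})   # one class already known
rights = pd.concat([pd.DataFrame({'url': ['foo.com'], 'class': [0]}),
                    pd.DataFrame({'url': ['bar.com'], 'class': [1]})])

merged = pd.merge(left.drop('class', axis=1), rights, on='url')
# keep the pre-existing class values, fill the rest from the merge result
left['class'] = left['class'].combine_first(merged['class'])
print(left)
```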

HYRY
  • thanks but I need to call the match() equivalent once for each file so this approach won't work :( – Solomon Apr 07 '13 at 02:21
  • The urls are unique in each right dataframe (but can be repeated across dfs). Ideally the solution would be agnostic to this question. But I think class indexed by url is the right way to go. – Solomon Apr 07 '13 at 15:31
0

Here's the full code I ended up going with:

import csv
from datetime import date, timedelta

import numpy as np
import pandas as pd

# read in df containing actions in chunks:
tp = pd.read_csv('/data/logactions.csv',
  quoting=csv.QUOTE_NONNUMERIC,
  iterator=True, chunksize=1000,
  encoding='utf-8', skipinitialspace=True,
  error_bad_lines=False)
df = pd.concat([chunk for chunk in tp], ignore_index=True)

# set classes to NaN
df["klass"] = np.nan
df = df[pd.notnull(df['url'])]
df = df.reset_index(drop=True)

# iterate over text files, match, grab klass
startdate = date(2013, 1, 1)
enddate = date(2013, 1, 26)
d = startdate

while d <= enddate:
    dstring = d.isoformat()
    print(dstring)

    # Read in each file w/ classifications in chunks
    tp = pd.read_csv('/data/textContentClassified/content{dstring}classfied.tsv'.format(**locals()),
        sep=',', quoting=csv.QUOTE_NONNUMERIC,
        iterator=True, chunksize=1000,
        encoding='utf-8', skipinitialspace=True,
        error_bad_lines=False)
    thisdatedf = pd.concat([chunk for chunk in tp], ignore_index=True)
    thisdatedf = thisdatedf.drop_duplicates(['url'])
    thisdatedf = thisdatedf.reset_index(drop=True)

    # reindex returns NaN for urls missing from this date's file
    thisdatedf = thisdatedf[pd.notnull(thisdatedf['url'])]
    df["klass"] = df.klass.combine_first(thisdatedf.set_index('url').klass.reindex(df.url).reset_index(drop=True))

    # Now iterate
    d = d + timedelta(days=1)
Solomon