
I'm trying to do something simple, but I don't know how to read the actual rows from the dataframe. I want to run some regex on each string.

The .csv file has no header; it's just one column full of a bunch of strings.

csv_data = pd.read_csv('list.csv', sep=',', header=None)

pattern = re.compile(r'(.*\/)(?!\/)(.*)', flags=re.DOTALL)

url_file = {
        pattern.findall(row)[0]:
        pattern.findall(row)[1]
        for index, row in csv_data.iterrows()
    }

But I just get

TypeError: expected string or bytes-like object


Edit 1

I do not believe this to be a duplicate; the other suggested SO question/solution is a different context and has headers and multiple columns.


Edit 2

print(csv_data.dtypes)

0    object
dtype: object

print( csv_data.head())

0  https://...
1  https://...
2  https://...
3  https://...
4  https://...

Edit 3

Doing this:

for row in csv_data.iterrows():
    print(row.dtypes)

gave the error AttributeError: 'tuple' object has no attribute 'dtypes'

So, it seems the items yielded are tuples; I just need to figure out how to get the string out of them.
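For reference, `iterrows()` yields `(index, row)` pairs where `row` is a pandas `Series`, so the string itself sits at position 0 of each row. A minimal sketch (the sample URL is invented, standing in for `pd.read_csv('list.csv', header=None)`):

```python
import pandas as pd

# iterrows() yields (index, Series) pairs, not bare strings;
# with a single header-less column, the string sits at position 0.
csv_data = pd.DataFrame(['https://example.com/uploads/pic.jpg'])

for index, row in csv_data.iterrows():
    url = row.iloc[0]   # the actual string for this row
    print(type(url))    # <class 'str'>
```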

Kenny
  • Possible duplicate of [pandas.read\_csv from string or package data](https://stackoverflow.com/questions/20696479/pandas-read-csv-from-string-or-package-data) – R4444 Apr 02 '19 at 15:08
  • Can you `print(csv_data.dtypes)` for us? `csv_data.head()` might help as well. – nick Apr 02 '19 at 15:17
  • 1
    @nick I added to the original question those prints, thanks! – Kenny Apr 02 '19 at 15:30

3 Answers


You can use a lambda function on this single column: keep the regex operations in a function and call it like this. Suppose `data` is the data frame and `string` is the column name:

data = pd.read_csv('list.csv', sep=',', header=None)
data.columns = ['string']
data['string'] = data['string'].apply(lambda x: regex_function(x))
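A minimal sketch of what such a `regex_function` could look like, reusing the question's pattern (the sample URL and the choice to return the matched groups as a tuple are assumptions, not part of this answer):

```python
import re
import pandas as pd

pattern = re.compile(r'(.*\/)(?!\/)(.*)', flags=re.DOTALL)

def regex_function(x):
    # Return the (path, filename) tuple for a matching URL, else None.
    m = pattern.match(x)
    return m.groups() if m else None

data = pd.DataFrame(['https://example.com/uploads/pic.jpg'])
data.columns = ['string']
data['string'] = data['string'].apply(lambda x: regex_function(x))
print(data['string'].iloc[0])   # ('https://example.com/uploads/', 'pic.jpg')
```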
  • Thanks for your answer, Yoshitha! Sadly, I don't understand lambda functions very will. I was hoping to get a dictionary of data back from the original .csv file. – Kenny Apr 02 '19 at 15:39

Major edit. You were right: Yoshitha's solution is not ideal as you specifically want the two elements from that regex match.

However, Pandas does have a nice regex handling solution to help you. Something like this is a lot neater:

matches = csv_data.iloc[:,0].str.extract(r'(.*\/)(?!\/)(.*)', expand=True)

And then to get your dictionary representation, we can run: matches.set_index(0, drop=True).to_dict()[1]

This might still have issues if there is a URL string in the input that does not fully match this regex, though.

Simple example:

l = ['https://example.s3.amazonaws.com/uploads/full/68518-5df5b5e5t5b.jpg', 'test_with_bad_url']
matches = pd.DataFrame(l).iloc[:,0].str.extract(r'(.*\/)(?!\/)(.*)', expand=True)
your_dict = matches.set_index(0, drop=True).to_dict()[1]
print(your_dict)
{'https://example.s3.amazonaws.com/uploads/full/': '68518-5df5b5e5t5b.jpg',
 nan: nan}
nick
  • Hi @nick, thanks for the response! I did get `for index, row in csv_data[0].iteritems() IndexError: list index out of range` . In regards to Yoshitha, I would like to implement that, but I don't know how: `x:regex_function(x)` – Kenny Apr 02 '19 at 16:31
  • So it seems then that the regex is not matching on one row in the file. I’ll reply fully shortly. – nick Apr 02 '19 at 16:35
  • Hi @Kenny. `.apply(lambda x: regex_function(x))` is just short hand for saying for each element `x` in the dataframe call `regex_function` with `x` as the input. – nick Apr 02 '19 at 16:42
  • and I make that `regex_function` and just return the output? It's tough to visualize because I'm getting two pieces of data from the regex – Kenny Apr 02 '19 at 16:45
  • Can you just post an example of what a matched regex should look like? – nick Apr 02 '19 at 16:53
  • Yes: `https://example.s3.amazonaws.com/uploads/full/68518-5df5b5e5t5b.jpg` – Kenny Apr 02 '19 at 16:57
  • Also, I did get this just now: `TypeError: tuple expected at most 1 arguments, got 2` – Kenny Apr 02 '19 at 16:58
  • 1
    @Kenny, you were right. Sorry, I was overlooking the need for returning that tuple from that function. I have changed the way I suggest you tackle this problem. Pandas has a neat way of dealing with regex matches on the columns. See if the above solution works for you. – nick Apr 02 '19 at 17:04
  • 1
    This is great, thank you very much for taking the time to help me with this! It seems to be working, I just need to now figure out how to go through my entire .csv file and add it to the dictionary. – Kenny Apr 02 '19 at 17:16
  • Something weird I'm experiencing, just to add, is that I have 200 lines in my .csv, but only line 177 and 200 are being added to the dictionary. It's skipping the rest. If anything comes to mind, that would be great. Otherwise, thanks again. – Kenny Apr 02 '19 at 19:18
  • 1
    The dictionary has a unique key constraint. This means one of two things. If the regex does not match the string input (i.e., you have some input that does not have the required '/' characters that it is searching for) then the tuple (nan,nan) is added to the dictionary. If there is more than one example then this just overwrites the previous entries. The other reason is you have duplicate entries for the path in the URL. – nick Apr 03 '19 at 03:24
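To illustrate the overwriting nick describes: dropping the non-matching rows and checking for duplicated keys before building the dict makes the missing entries visible. A sketch with invented URLs that share one path (the `dropna`/`duplicated` steps are additions to nick's answer, not part of it):

```python
import pandas as pd

l = ['https://example.com/uploads/full/a.jpg',
     'https://example.com/uploads/full/b.jpg',
     'not_a_url']
matches = pd.DataFrame(l)[0].str.extract(r'(.*\/)(?!\/)(.*)', expand=True)
matches = matches.dropna()                           # drop rows the regex did not match
dupes = matches[matches.duplicated(0, keep=False)]   # rows whose path (the dict key) repeats
your_dict = matches.set_index(0, drop=True).to_dict()[1]
print(your_dict)   # {'https://example.com/uploads/full/': 'b.jpg'} -- a.jpg was overwritten
```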

Or you can try this code:

csv_data = pd.read_csv('list.csv', sep=',', header=None, dtype=str)
csv_data = csv_data.fillna("")

pattern = re.compile(r'(.*\/)(?!\/)(.*)', flags=re.DOTALL)

url_file = {
        pattern.findall(str(row))[0]:
        pattern.findall(str(row))[1]
        for index, row in csv_data.iterrows()
    }
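Note that `str(row)` renders the whole `Series`, index and dtype labels included, so the regex would see extra text. A variation that iterates the column's values directly and skips non-matching rows avoids both problems (the sample URLs are invented):

```python
import re
import pandas as pd

pattern = re.compile(r'(.*\/)(?!\/)(.*)', flags=re.DOTALL)
csv_data = pd.DataFrame(['https://example.com/uploads/pic.jpg', 'not_a_url'])

url_file = {}
for value in csv_data[0]:      # iterate the column's strings directly
    m = pattern.match(value)
    if m:                      # skip rows the regex does not match
        url_file[m.group(1)] = m.group(2)
print(url_file)   # {'https://example.com/uploads/': 'pic.jpg'}
```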

Jay
  • I tried, I do still have the same error: `TypeError: expected string or bytes-like object`, thanks for your answer though! – Kenny Apr 02 '19 at 15:24