I have imported a dataset, and need to find how many names in a given column ("name") begin with a vowel and have more than 5 letters.

This is what I have so far, but it does not return the expected values:

x = re.findall(r'[aeiouAEIOU]\w{5,}',str(name))

Full code for reference:

import re
import pandas as pd


url = "https://data.nasa.gov/resource/y77d-th95.json"

df = pd.read_json(url, orient='columns')
name = df["name"]


x = re.findall(r'[aeiouAEIOU]\w{5,}',str(name))

print(x)
  • Could you give a [mre]? Rather than the whole dataset, for example, a subset that illustrates the specific problem. – jonrsharpe Mar 30 '20 at 11:51
  • `r'[aeiouAEIOU]\w{5,}'` => `r'\b[aeiouAEIOU]\w{5,}'` – Wiktor Stribiżew Mar 30 '20 at 11:52
  • Shouldn't re.findall(r'[aeiouAEIOU]\w{5,}',str(name)) => re.findall(r'[aeiouAEIOU]\w{4,}',str(name)) since the vowel would count as one of the letters. I get 10 with this change vs. 9 with the original. – DarrylG Mar 30 '20 at 11:58
  • Thanks for your help, all. @jonrsharpe a possible reproducible example would be the below. It seems to work on this shorter list, however when I apply it to the dataset (more than 1000 rows), it is only returning 5 results. I know from manually checking the rows that this is not correct. – Aine Mar 30 '20 at 12:25
  • EXAMPLE: `list = {"Aachen", "Aarhus", "Aarhu", "Abee", "Abeen-Aarhus", "Acapulco", "Achiras"}; x = re.findall(r'\b[aeiouAEIOU]\w{5,}', str(list)); print(x)` – Aine Mar 30 '20 at 12:27
  • That's a *set*, not a list. Also please give an example for which it *doesn't* work. – jonrsharpe Mar 30 '20 at 12:28
  • Apologies, I think the example below will be clearer. The error occurs where a name fits the criteria BUT contains a special character such as "-" or a space. For example, "Adzhii-Bogdo" and "Aguila Blanca" are not returned despite beginning with a vowel and having more than 5 letters. – Aine Mar 30 '20 at 12:44
  • `import pandas as pd; data = {'Name': ['Aachen', 'Aarhus', 'Abee', 'Acapulco', "Achiras", "Adzhii-Bogdo", "Aguila Blanca", "Tirupati", "Tjabe"]}; df = pd.DataFrame(data); print(df); x = re.findall(r'\b[aeiouAEIOU]\w{5,}', str(df)); print(x)` – Aine Mar 30 '20 at 12:46
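
A possible direction, offered as a sketch and not a definitive answer: `str(name)` and `str(df)` search the printed representation of the object, not the individual values. pandas shortens that representation for long Series, which would be consistent with getting only a handful of matches on 1000+ rows. Also, `\w` does not match "-" or a space, so the regex only ever sees the first word of names such as "Adzhii-Bogdo". The snippet below instead tests each name separately with the pandas string methods `Series.str.match` and `Series.str.count`. It reuses the sample names from the comments in place of the NASA data, and it assumes that "more than 5 letters" means letters anywhere in the whole name, with hyphens and spaces not counted. The variable names are illustrative only.

```python
import pandas as pd

# Sample names from the comment thread; in the real script this would be df["name"].
names = pd.Series(
    ["Aachen", "Aarhus", "Abee", "Acapulco", "Achiras",
     "Adzhii-Bogdo", "Aguila Blanca", "Tirupati", "Tjabe"],
    name="name",
)

# str.match anchors the pattern at the start of each value,
# so this is True only when the name itself begins with a vowel.
# na=False treats missing names as non-matching.
starts_with_vowel = names.str.match(r"[aeiouAEIOU]", na=False)

# Assumption: only ASCII letters count toward the length;
# hyphens, spaces and digits are ignored.
letter_count = names.str.count(r"[A-Za-z]")

mask = starts_with_vowel & (letter_count > 5)

matching = names[mask].tolist()
count = int(mask.sum())

print(matching)
print(count)
```

If only the first word should be measured, or if non-ASCII letters need to count, the character class in `str.count` would have to be adjusted accordingly.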
