
I just started coding with Python and I have a dataset where two of my columns are giving me problems. One of them holds the country of origin of an artist, and some entries have dual nationalities, like so: France/America. I am trying to get the first country only, in this case France. The second column has the name of the artist, but some entries contain strange characters, for example: Gy̦rgy Kepes. What would be the best way to clean those elements? In case it helps, I am opening my file the following way:

 import pandas as pd

 data = pd.read_csv(fpn_csv, encoding='ISO-8859-1')

I don't know if this is affecting my process in any way, but I cannot open the file if I use UTF-8.

The names of the columns are country_of_origin and artist.

Here is a sample of my file:

+------+-------------------------------+-----------------------------+-------------------+-------------------------+------------+-----------------+
| ID   |         artist_title          |        art_movement         |   museum_venue    |    country_of_origin    |  has_text  |  primary_medium |
+------+-------------------------------+-----------------------------+-------------------+-------------------------+------------+-----------------+
| 361  |  LÌÁszlÌ_ Moholy-Nagy         |  Vertical Black, Red, Blue  |  LACMA also MoMA  |  Hungary                |  FALSE     |  sculpture      |
| 362  |  BrassaÌø (Gyula HalÌÁsz)     |  Buttress of the Elevated   |  MoMA             |  Transylvania / France  |  FALSE     |  photography    |
| 363  |  M. C. Escher                 |  Relativity                 |  MoMA             |  Denmark                |  FALSE     |  print          |
| 364  |  Clyfford Still 1944-N No. 2  |  abstract expressionism     |  MoMA             |  America                |  FALSE     |  painting       |
| 365  |  Harold E. Edgerton           |  Milk Drop                  |  MoMA             |  America                |  FALSE     |  photography    |
| 366  |  Meret Oppenheim Object       |  surrealism                 |  MoMA             |  Germany / Switzerland  |  FALSE     |  sculpture      |
+------+-------------------------------+-----------------------------+-------------------+-------------------------+------------+-----------------+
  • What is your desired output, `Gyrgy Kepes`? – user3483203 Apr 23 '18 at 16:04
  • The name of the artist in this case is György Kepes; my guess is that my file is not reading the special characters, so that's why I'm getting Gy̦rgy Kepes. I have other examples like that across my dataset, like LÌÁszlÌ_ Moholy-Nagy instead of László Moholy-Nagy. I am not worried about the correct spelling since I am going to transform the names to categorical values. So yes, the suggestion you gave me works perfectly! – Alonso Ag Apr 23 '18 at 16:13
  • Can you post a small sample of the file? – user3483203 Apr 23 '18 at 16:13
  • To get the first country you can use some simple string methods: `'France/America'.split('/')[0]` – Jeyekomon Apr 23 '18 at 16:16
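Applied to the whole column with pandas string methods, the suggestion from the last comment might look like the following sketch (assuming the `data` DataFrame from the question; the `.str.strip()` handles the spaces around the slash in the sample data):

    # Keep only the first country from entries like 'Transylvania / France'
    data['country_of_origin'] = (
        data['country_of_origin']
        .str.split('/')   # 'Transylvania / France' -> ['Transylvania ', ' France']
        .str[0]           # take the first piece
        .str.strip()      # drop the surrounding spaces
    )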

1 Answer


If you want to remove the bad characters, you can simply encode to ASCII:

>>> s = 'Gy̦rgy Kepes'
>>> s.encode('ascii', errors='ignore').decode()
'Gyrgy Kepes'

The decode is not needed if you don't mind the output being of type bytes.
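The same idea extends to the whole artist column with pandas' vectorized string methods. A sketch, assuming the `data` DataFrame from the question:

    # Encode every name to ASCII bytes, dropping the bad characters,
    # then decode back to str
    data['artist'] = (
        data['artist']
        .str.encode('ascii', errors='ignore')
        .str.decode('ascii')
    )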

A different approach might be to use filter:

>>> import string
>>> good = set(string.printable)  # digits, ASCII letters, punctuation, and whitespace
>>> s = 'Gy̦rgy Kepes'
>>> ''.join(filter(lambda x: x in good, s))
'Gyrgy Kepes'
  • How would it work for the entire column, though? I tried something like this: `import string; clean_name = set(string.printable); data['artist'].join(filter(lambda x: x in clean_name, data['artist']))` and it gave the following error: `AttributeError: 'Series' object has no attribute 'join'` – Alonso Ag Apr 23 '18 at 16:33
  • @AlonsoAg You can use a `for` loop to iterate over the entire column. – Jeyekomon Apr 23 '18 at 16:49
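Rather than a bare loop, the filter approach above can also be applied to each value with Series.apply; the AttributeError happens because join was called on the Series instead of on a string. A sketch, assuming the names from the comments:

    import string

    good = set(string.printable)

    # Run the character filter once per artist name
    data['artist'] = data['artist'].apply(
        lambda name: ''.join(ch for ch in name if ch in good)
    )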