I just started coding in Python, and two columns in my dataset are giving me problems. The first holds each artist's country of origin, and some artists have dual nationalities, written like so: France/America. I want to keep only the first country, in this case France. The second holds the artist's name, but some names contain strange characters, for example: Gy̦rgy Kepes. What would be the best way to clean these values? In case it helps, I am opening my file the following way:
data = pd.read_csv(fpn_csv, encoding='ISO-8859-1')
I don't know whether this affects my process in any way, but I cannot open the file if I use UTF-8.
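On the encoding point, here is a minimal sketch of why a successful read with ISO-8859-1 does not by itself confirm that the encoding is correct: ISO-8859-1 assigns a character to every possible byte value, so decoding with it never raises an error. The sample bytes below are made up for illustration and are not taken from the actual file.

```python
# Hypothetical sample: the name "György Kepes" as ISO-8859-1 bytes.
raw = b"Gy\xf6rgy Kepes"

results = {}
for encoding in ("UTF-8", "ISO-8859-1", "mac_roman"):
    try:
        results[encoding] = raw.decode(encoding)
    except UnicodeDecodeError:
        # These bytes are not valid in this encoding.
        results[encoding] = None

# Single-byte encodings decode without error but can disagree on the text.
for encoding, text in results.items():
    print(encoding, "->", repr(text))
```

If the decoded names still look wrong under every candidate encoding, the characters may have been damaged before the file was saved, in which case choosing a different encoding alone would not restore them.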
The names of the columns are country_of_origin and artist.
Here is a sample of my file:
+------+-------------------------------+-----------------------------+-------------------+-------------------------+------------+-----------------+
| ID | artist_title | art_movement | museum_venue | country_of_origin | has_text | primary_medium |
+------+-------------------------------+-----------------------------+-------------------+-------------------------+------------+-----------------+
| 361 | LÌÁszlÌ_ Moholy-Nagy | Vertical Black, Red, Blue | LACMA also MoMA | Hungary | FALSE | sculpture |
| 362 | BrassaÌø (Gyula HalÌÁsz) | Buttress of the Elevated | MoMA | Transylvania / France | FALSE | photography |
| 363 | M. C. Escher | Relativity | MoMA | Denmark | FALSE | print |
| 364 | Clyfford Still 1944-N No. 2 | abstract expressionism | MoMA | America | FALSE | painting |
| 365 | Harold E. Edgerton | Milk Drop | MoMA | America | FALSE | photography |
| 366 | Meret Oppenheim Object | surrealism | MoMA | Germany / Switzerland | FALSE | sculpture |
+------+-------------------------------+-----------------------------+-------------------+-------------------------+------------+-----------------+
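For the country column, a minimal sketch of the kind of result I am after, assuming the separator is always a slash with optional spaces around it. The small DataFrame stands in for the real file, and first_country is a hypothetical name for the new column.

```python
import pandas as pd

# Stand-in for the real file, using values copied from the sample above.
data = pd.DataFrame({
    "country_of_origin": [
        "Hungary",
        "Transylvania / France",
        "America",
        "Germany / Switzerland",
    ],
})

# Split on the first "/" only, keep the part before it, and trim whitespace.
data["first_country"] = (
    data["country_of_origin"].str.split("/", n=1).str[0].str.strip()
)

print(data["first_country"].tolist())
# ['Hungary', 'Transylvania', 'America', 'Germany']
```

Rows without a slash are left unchanged, because splitting a string that has no separator returns the whole string as the first element.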