The dataset has about 14k rows, and the names are mixed in with many titles and other noise.

I am a beginner in Pandas and Python and I'd like to know how to proceed with getting the output of first name and last name from this dataset.

Dataset:

0 Pr.Doz.Dr. Klaus Semmler Facharzt für Frauenhe...

1 Dr. univ. (Budapest) Dalia Lax

2 Dr. med. Jovan Stojilkovic

3 Dr. med. Dirk Schneider

4 Marc Scheuermann

14083 Bag Kinderarztpraxis

14084 Herr Ulrich Bromig

14085 Sohn Heinrich

14086 Herr Dr. sc. med. Amadeus Hartwig

14087 Jasmin Rieche

jerof
  • How is the data formatted? How do you determine whether something is a name or a location/title/conjunction/etc? Is the name entered manually, or is there a systematic structure to it? – Ted Klein Bergman May 19 '20 at 13:51
  • Does it contain Chinese names? Chinese names start with the family name. – Ted Klein Bergman May 19 '20 at 13:55
  • The dataset I shared has two columns: the index and a column "title". The "title" is the string that I'd like to clean. This string contains first names, last names, titles (Dr., Mr., Ms., etc.) and characters like "/", "-", ";". Some rows have just the first and last names, but it is mostly noise. Not sure if that answers your question – jerof May 19 '20 at 13:57
  • No, it is mainly German names, but your point is still valid here because I've seen the first and last names in reversed order in some rows. – jerof May 19 '20 at 13:58
  • Okay, so the "title" doesn't contain any structure? If the structure is arbitrary, then you cannot extract their names, unless you have another list of valid names. – Ted Klein Bergman May 19 '20 at 14:00

2 Answers

    for name in dataset:
        parts = name.split()
        if len(parts) < 2:
            continue  # not enough words for a first and a last name
        first, last = parts[-2], parts[-1]
        # save here

This will work for most names, but not all: it assumes the first and last name are the last two words in the row. To make it more reliable, you may need a list of titles (such as dr., md., univ.) to skip over.
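The idea of skipping known titles could be sketched roughly as follows. The `TITLES` set and the function name `last_two_tokens` are made up for illustration, and the list would need to be extended with whatever actually occurs in the data.

```python
# Hypothetical title list (an assumption); extend it from the real data.
TITLES = {"dr.", "med.", "univ.", "herr", "sc.", "(budapest)"}

def last_two_tokens(name):
    """Return (first, last) from the last two non-title words, or None."""
    # Drop every word that matches a known title, ignoring case.
    parts = [p for p in name.split() if p.lower() not in TITLES]
    if len(parts) < 2:
        return None  # nothing usable left in this row
    return parts[-2], parts[-1]

print(last_two_tokens("Herr Dr. sc. med. Amadeus Hartwig"))
print(last_two_tokens("Dr. univ. (Budapest) Dalia Lax"))
```

Rows with trailing text after the name, such as row 0 in the question, would still come out wrong with this approach.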

Blake

As it doesn't contain any structure, you're out of luck. An ad-hoc solution could be to just write down a list of all locations/titles/conjunctions and other noise you've identified and then strip those from the rows. Then, if you notice some other things you'd like to exclude, just add them to your list.

This will not solve the issue of certain rows having their name in reverse order. So it'll require you to manually go over everything and check if the row is valid, but it might be quicker than editing each row by hand.

A simple, brute-force example would be:

excludes = {'dr.', 'herr', 'budapest', 'med.', 'für', ... }

new_entries = []

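for title in all_entries:
    cleaned_result = []
    # split() without arguments also handles repeated whitespace
    parts = title.split()
    for part in parts:
        # str.lower() gives a case-insensitive comparison
        if part.lower() not in excludes:
            cleaned_result.append(part)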

    new_entries.append(' '.join(cleaned_result))
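Since the rows still have to be checked by hand, it may help to flag the ones that do not reduce to exactly two words. Below is a minimal sketch of that idea; `EXCLUDES`, `clean_title` and `needs_review` are hypothetical names, and the "exactly two words" rule is only an assumed heuristic, not something the data guarantees.

```python
import re

# Hypothetical exclusion list (an assumption); grow it by hand from the data.
EXCLUDES = {"dr.", "herr", "med.", "univ.", "sc.", "budapest"}

def clean_title(title):
    """Drop excluded words and stray punctuation, keep the remaining words."""
    kept = []
    # Split on whitespace and on the separators "/", ";" and ",".
    for part in re.split(r"[\s/;,]+", title):
        # Remove brackets and dashes at the ends of a word only,
        # so hyphenated names stay intact.
        token = part.strip("()-")
        if token and token.lower() not in EXCLUDES:
            kept.append(token)
    return " ".join(kept)

def needs_review(cleaned):
    """Flag rows that do not reduce to exactly two words."""
    return len(cleaned.split()) != 2

row = "Dr. univ. (Budapest) Dalia Lax"
print(clean_title(row), needs_review(clean_title(row)))
```

If the data is in a pandas DataFrame `df` with a "title" column, as described in the comments, the function could be applied with `df["title"].apply(clean_title)`. A row like "Bag Kinderarztpraxis" would still pass the two-word check, so the flag narrows the manual work rather than replacing it.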
Ted Klein Bergman