0

I am trying to work on the adult dataset, available at this link.

At the moment I'm stuck because the files I can download are in formats I'm not familiar with, so I can't correctly load them into a pandas dataframe.

I am able to download 3 files from UCI using the following links:

data = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data'  
names = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names'
test = 'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test'

Their extensions are .data, .names and .test respectively. I have only ever worked with .csv files, so I am a little confused by these.

How can I get a pandas dataframe with the train data (= data + names) and a pandas dataframe with the test data (= test + names)?

This code only partly works:

train_df = pd.read_csv(r'./adult.data', header=None)
train_df.head()  # WORKING (without column names)

df_names = df = pd.read_csv(r'./adult.names')
df_names.head()  # ERROR

test_df = pd.read_csv(r'./adult.test')
test_df.head()  # ERROR
E_net4
hellomynameisA
  • Have you opened the files to look at the data? adult.names is not in CSV format; it is a human-readable description of the column names, so there is no reason why read_csv should work on it. In adult.test you probably want to skip the first line – blurry Mar 15 '22 at 09:24
  • Hi @blurry, yes, I opened them. You're right about .names: it is human-readable text. The other two look like CSV files – hellomynameisA Mar 15 '22 at 16:15
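
Following blurry's suggestion, a quick way to see what each file actually contains is to print its first few lines before handing it to read_csv. This is only a minimal sketch; `peek` is a hypothetical helper name, not something pandas provides:

```python
from itertools import islice


def peek(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    # errors="replace" keeps the preview from failing on odd bytes
    with open(path, encoding="utf-8", errors="replace") as fp:
        return [line.rstrip("\n") for line in islice(fp, n)]
```

Calling `peek('adult.test')` and printing the result should make it clear whether a file is comma-separated and whether its first line is data or something to skip.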

2 Answers

3

Use:

import pandas as pd
import re

# adult.names
with open('adult.names') as fp:
    cols = []
    for line in fp:
        sre = re.match(r'(?P<colname>[a-z\-]+):.*\.', line)
        if sre:
            cols.append(sre.group('colname'))
    cols.append('label')

# Python >= 3.8, walrus operator
# with open('adult.names') as fp:
#     cols = [sre.group('colname') for line in fp
#                 if (sre := re.match(r'(?P<colname>[a-z\-]+):.*\.', line))]
#     cols.append('label')

options = {'header': None, 'names': cols, 'skipinitialspace': True}

# adult.data
train_df = pd.read_csv('adult.data', **options)

# adult.test
test_df = pd.read_csv('adult.test', skiprows=1, **options)
test_df['label'] = test_df['label'].str.rstrip('.')
Corralien
  • Thank you very much, this is exactly what I needed – hellomynameisA Mar 15 '22 at 16:17
  • For me this produces `cols = ['label']`, so train_df and test_df have no column labels except for the last column. I suspect we want `cols = ['age', 'workclass', 'fnlwgt', "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "label"]` – travelingbones Dec 19 '22 at 21:10
  • I recommend skipping the loop under `with open('adult.names') as fp:` and simply hardcoding `cols=['age', 'workclass', 'fnlwgt', "education", "education_num", "marital_status", "occupation", "relationship", "race", "sex", "capital_gain", "capital_loss", "hours_per_week", "native_country", "label"]`. Then use the code in this answer from `options...` onward. – travelingbones Dec 19 '22 at 21:14
  • @travelingbones. I tried again and it works perfectly. The `cols` variable is set to `['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'label']` after the loop – Corralien Dec 20 '22 at 06:00
  • hmm, ok. didn't work on mine, but hopefully it does for most! perhaps there are some formatting differences in the .names file. In any case, you solved it and these comments may help those w/ my outcome. – travelingbones Dec 22 '22 at 17:26
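
For anyone who, like travelingbones, ends up with `cols = ['label']`, one option is to check how many names the regex found and fall back to the hardcoded list from the comments. This is a minimal sketch under assumptions: `parse_cols` and `FALLBACK_COLS` are hypothetical names, and the inline text only stands in for adult.test (a non-data first line, then one illustrative comma-separated row with a trailing period on the label) so the snippet runs without the downloaded files:

```python
import io
import re

import pandas as pd

# Column names as reported in the comments above (hyphenated form)
FALLBACK_COLS = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
                 'marital-status', 'occupation', 'relationship', 'race', 'sex',
                 'capital-gain', 'capital-loss', 'hours-per-week',
                 'native-country', 'label']

PATTERN = re.compile(r'(?P<colname>[a-z\-]+):.*\.')


def parse_cols(lines):
    """Collect attribute names from adult.names-style lines, then add 'label'.

    Falls back to FALLBACK_COLS when the number of names found is unexpected.
    """
    cols = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            cols.append(match.group('colname'))
    cols.append('label')
    return cols if len(cols) == len(FALLBACK_COLS) else list(FALLBACK_COLS)


# Nothing matches here, so the hardcoded list is used
cols = parse_cols(["| a comment line with no attribute definition\n"])

# Stand-in for adult.test: first line is not data, label ends with '.'
sample_test = io.StringIO(
    "|1x3 Cross validator\n"
    "25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, "
    "Own-child, Black, Male, 0, 0, 40, United-States, <=50K.\n"
)

options = {'header': None, 'names': cols, 'skipinitialspace': True}
test_df = pd.read_csv(sample_test, skiprows=1, **options)
test_df['label'] = test_df['label'].str.rstrip('.')
```

With the real files, `parse_cols(open('adult.names'))` and `pd.read_csv('adult.test', skiprows=1, **options)` would take the place of the inline stand-ins.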
0

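You can achieve that using pandas like this:

import pandas as pd

# adult.data is comma-separated and has no header row
data = pd.read_csv('adult.data', sep=",", header=None, skipinitialspace=True)
print(data)

# adult.names is a plain-text description, not a table, so read it as text
with open('adult.names') as fp:
    names = fp.read()
print(names)

# adult.test is also comma-separated; its first line is not data, so skip it
test = pd.read_csv('adult.test', sep=",", header=None, skiprows=1, skipinitialspace=True)
print(test)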
Gaston Alex