
I am trying to create a dataframe of states and cities.

Each state name in the table I am reading from ends with the string [edit]; each city name, on the other hand, ends with either (text) or [number].

I used regex to remove the text within the parentheses and square brackets, and saved the states in one list and the cities in another.

I then converted these two lists into a dictionary with the state as the key and city as the value.

However, there are 517 cities and when I do this I lose 467 of them. I'm guessing this is because, as it currently stands, I am not allowing my dictionary to handle multiple values per key. My goal is to create a dataframe of 517x2 dimensions, with a state column and a city column (each city matched to its state). If I create a dataframe from this dictionary I therefore only get 50x2 instead of 517x2 dimensions.
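
To illustrate what I think is happening, here is a toy example with made-up names: zip truncates to the shorter list, and even with repeated keys a dict would keep only one city per state.

states = ["Alabama", "Alaska"]
cities = ["Auburn", "Florence", "Fairbanks", "Juneau"]
# zip stops at the shorter list, so "Fairbanks" and "Juneau" are dropped
# and each state is paired with whichever city shares its position
print(dict(zip(states, cities)))
# {'Alabama': 'Auburn', 'Alaska': 'Florence'}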

My questions are: i) is my reasoning correct, ii) how should I think about solving this problem / how should I solve it, and iii) is the code that I have written the most efficient way of reaching my end goal?

import pandas as pd
import numpy as np
import re

state = []
city = []
with open("university_towns.txt", "r") as i:
    uni = i.readlines()
for st in uni:
    if "[edit]" in st:
        # state lines end with "[edit]"
        state.append(re.sub(r"[\[].*?[\]]\s", "", st))
    else:
        # city lines end with "(text)" or "[number]"
        city.append(re.sub(r"[\(\[].*?[\)\]]\s", "", st))
city_st = dict(zip(state, city))
# need to take the key-value pairs/items from the dictionary
s = pd.Series(city_st, name='RegionName')
s.index.name = 'State'
s = s.reset_index()
s

ADD: I'm not quite sure how to add the relevant data for this question.

Emm
  • I'd say `table[state] -> set(cities)`; it could be e.g. https://stackoverflow.com/a/47548471/705086 (see the sketch after these comments) – Dima Tisnek Dec 11 '18 at 08:35
  • "However there are 517 cities and when I do this I lose 467 cities" - this is logical since dictionary holds only unique keys (i.e. 50 unique states). Why not read the table with states and cities directly with `pandas`? You could use something like `pandas.read_table` or `pandas.read_csv` and then format the strings. – Julia Dec 11 '18 at 09:12
  • @Julia but that will be read into a dataframe with one column consisting of both states and cities – Emm Dec 11 '18 at 09:45
  • @Emm: you could also do it like this (if index does not need to be states): `s=pd.DataFrame(); s['state'] = state; s['city'] = city`. To perform operations based on a state, you could group data like `cities_count = s.groupby('state').size()` – Julia Dec 11 '18 at 10:10
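
A minimal sketch along the lines of Dima Tisnek's suggestion (a mapping from each state to its cities, using a list instead of a set to keep the file order), with a hypothetical stand-in for a few lines of the input file:

import re
from collections import defaultdict
import pandas as pd

# Hypothetical stand-in for a few lines of university_towns.txt:
# state lines end with "[edit]"; the city lines that follow belong to that state
lines = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)\n",
    "Florence (University of North Alabama)\n",
    "Alaska[edit]\n",
    "Fairbanks (University of Alaska Fairbanks)\n",
]

state_to_cities = defaultdict(list)
current_state = None
for line in lines:
    if "[edit]" in line:
        current_state = re.sub(r"\[.*?\]", "", line).strip()
    else:
        city_name = re.sub(r"[\(\[].*?[\)\]]", "", line).strip()
        state_to_cities[current_state].append(city_name)

# expand the mapping into one row per (state, city) pair
rows = [(st, c) for st, city_list in state_to_cities.items() for c in city_list]
df = pd.DataFrame(rows, columns=["State", "RegionName"])

Run on the full file, this should give the 517x2 frame the question asks for, one row per city with its state repeated.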

1 Answer


Note that:

city_st = dict(zip(state,city))

this operation reduces the number of results, because a dict cannot keep multiple values (cities) under the same key (state).

You can just use:

aa = pd.DataFrame({'state': state, 'city': city})
aa['State'] = range(aa.shape[0])

Then use pivot_table (or melt) to reshape your data as needed.
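
For the `ValueError: arrays must all be same length` raised in the comments below, the two lists have to be aligned first. A minimal sketch, assuming the same input format as the question (state lines end with "[edit]" and the city lines that follow belong to that state), which carries the current state forward so each city gets its own state entry:

import re
import pandas as pd

state = []
city = []
current_state = None
with open("university_towns.txt", "r") as f:
    for line in f:
        if "[edit]" in line:
            current_state = re.sub(r"\[.*?\]", "", line).strip()
        else:
            state.append(current_state)  # repeat the state for each of its cities
            city.append(re.sub(r"[\(\[].*?[\)\]]", "", line).strip())

aa = pd.DataFrame({'State': state, 'RegionName': city})  # one row per city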

KyChen
  • Thanks. Getting a traceback with this 'ValueError: arrays must all be same length' – Emm Dec 11 '18 at 09:46
  • @Emm: the number of elements in `state` and `city` must be the same. Are you trying to match a state name to a city name? – Julia Dec 11 '18 at 10:17
  • @Julia Hmmm... when I check the length of states after running my regex, the length of states is 50, and there are about 517 cities. So in the initial table, the cities are listed under their state – Emm Dec 11 '18 at 12:31