I am trying to create a dataframe of states and cities.
Each state name in the table I am reading from ends with the letters [edit],city on the other hand either end with (text)[number]
I have used regex to remove the text within the parentheses and square brackets, saved states in a list for states and cities in another list for cities.
I then converted these two lists into a dictionary with the state as the key and city as the value.
However there are 517 cities and when I do this I lose 467 cities. I'm guessing because as it currently stands I am not allowing my dictionary to handle multiple values. My goal is to create a dataframe of 517x2 dimensions with a state column and city column (city matching it's state). If I create a dataframe from this dictionary I would therefore only get 50x2 as opposed to 512x2 dimensions.
My question is; i.) is my reasoning correct, ii.) how should i think about solving this problem/how should I solve it, iii.) is the code that I have written the most efficient way of reaching my end goal
import pandas as pd
import numpy as np
import re
state = []
city = []
with open("university_towns.txt","r") as i:
uni = i.readlines()
for st in uni:
if "[edit]"in st:
state.append(re.sub("[\\[].*?[\\]]\s", "", st))
else:
city.append(re.sub("[\(\[].*?[\)\]]\s", "", st))
city_st = dict(zip(state,city))
#need to take the key-value pairs/items from the dictionary
s = pd.Series(city_st, name ='RegionName')
s.index.name = 'State'
s = s.reset_index()
s
ADD: not quite sure how to add the relevant data for this question