extract some data from a huge dataset

Question

I have a huge dataset about Airbnb listings around the world. It contains 5500 cities. I want to work only on 'London', 'Paris' and 'Berlin', so from my original dataset named 'df' I want to create a new dataset 'filtered_df' containing only the data for these three cities. I have a variable 'City', so I tried the code below, but it doesn't work as I want.

df_berlin = df['City']== 'Berlin'

df_paris = df['City']== 'Paris'

df_london = df['City']== 'London'

filtered_df = [df_berlin + df_paris + df_london]

It's not surprising that this does not do anything. It just assigns `False` to all three of your filter variables and adds them up, resulting in `[0]`. What format is your dataset in, and what framework do you use to work with it? There may be a built-in subset method you can use. – Martin Wettstein, Jun 10 '21 at 08:52
Maybe you can try something like `df_paris = df["Paris"]` – Milos Stojanovic, Jun 10 '21 at 08:52
My dataset is in CSV format, and I'm working in Jupyter with Python. I get a `KeyError` when using `df_paris = df["Paris"]`, because Paris, London and Berlin are values in the column 'City' – pandawan, Jun 10 '21 at 08:55
I'm just beginning with Pandas, too, so I'm not sure if it's right, but does `df[df.City=='Berlin']` do it? – fsimonjetz, Jun 10 '21 at 09:04
No :/ I just get all the column names, but not the data about Berlin – pandawan, Jun 10 '21 at 09:11
I think we need to know exactly what format the data is in, otherwise it's just guesswork ;/ – fsimonjetz, Jun 10 '21 at 09:15
@pandawan run `df.head()` and add the output to the question. It will give some insight into your dataset. I assume that you have converted your CSV dataset into a pandas `DataFrame` using `read_csv()`. – nobleknight, Jun 10 '21 at 09:19
Does this answer your question? [How do I select rows from a DataFrame based on column values?](https://stackoverflow.com/questions/17071871/how-do-i-select-rows-from-a-dataframe-based-on-column-values) – SunilG, Jun 10 '21 at 09:43
How about `filtered_df = df.loc[df['City'].isin(['Berlin', 'Paris', 'London'])]`? – 0x5453, Jun 10 '21 at 13:17

Sandeep Kumar · Accepted Answer · 2021-06-10T13:13:29.240

1

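import pandas as pd

# Boolean-mask selection: keep only the rows of each city
df_berlin = df[df.City == 'Berlin']
df_paris = df[df.City == 'Paris']
df_london = df[df.City == 'London']
# pd.concat replaces the chained .append calls used originally (DataFrame.append was removed in pandas 2.0)
filtered_df = pd.concat([df_berlin, df_paris, df_london])
# Restore the original row order; mergesort is a stable sort
filtered_df.sort_index(inplace=True, kind='mergesort')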

I tested this on a small dataset and it worked. Since your dataset is huge, you can sort the result with mergesort, and it should work, I guess.
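As an alternative to filtering each city separately, the `isin` suggestion from the comments can be sketched as below. This is only a minimal illustration, not a tested solution for the real data: the small `df` and its `Price` column are made up, and only the `City` column name comes from the question. It also shows why the original attempt failed: a comparison such as `df['City'] == 'Berlin'` produces a boolean mask, not the matching rows.

```python
import pandas as pd

# Hypothetical stand-in for the real Airbnb data; only the 'City' column
# name comes from the question, everything else is made up for illustration.
df = pd.DataFrame({
    "City": ["Berlin", "Rome", "Paris", "London", "Madrid", "Paris"],
    "Price": [80, 95, 120, 150, 70, 110],
})

# A comparison yields a boolean Series (a mask), not the matching rows.
mask_berlin = df["City"] == "Berlin"

# isin builds a single mask that covers all three cities at once.
mask = df["City"].isin(["Berlin", "Paris", "London"])

# Indexing with the mask keeps the matching rows in their original order,
# so no concatenation or re-sorting step is needed afterwards.
filtered_df = df.loc[mask]

print(filtered_df)
```

Because the rows are selected in a single pass and keep their original index, this should give the same rows as the per-city approach above without the extra sort.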

edited Jun 10 '21 at 13:13

answered Jun 10 '21 at 09:36


Perfect, your code works perfectly, thanks a lot. You just made a small mistake: it should be `.append(df_london)`, not `.append(london)` – pandawan Jun 10 '21 at 10:38
Corrected it :) – Sandeep Kumar Jun 10 '21 at 13:14
