1

My dataset is huge. I am using Azure ML notebooks with azureml.core to read the dataset as an azureml.data.tabular_dataset.TabularDataset. Is there any way to filter the data in the TabularDataset without converting it to a pandas DataFrame? I am using the code below to read the data. Because the data is huge, the pandas DataFrame runs out of memory. I don't need to load the complete data into the program; only a subset is required. Is there any way to filter the records before converting to a pandas DataFrame?

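from azureml.core import Workspace

def read_Dataset(dataset):
    # Connect to the workspace described by the local config file
    ws = Workspace.from_config()
    # Dictionary of registered datasets, keyed by name
    ds = ws.datasets
    tab_dataset = ds.get(dataset)
    # Loads the whole dataset into memory; this is the step that runs out of memory
    dataframe = tab_dataset.to_pandas_dataframe()
    return dataframe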
karas27

2 Answers

1

At this time, we only support simple sampling, filtering by column name, and filtering by datetime (reference here). Full filtering capability (e.g. by column value) on TabularDataset is an upcoming feature planned for the next couple of months. We will update our public documentation once the feature is ready.
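The three supported operations can already shrink the data before it reaches pandas. Below is a minimal sketch, assuming `tab_dataset` is a TabularDataset obtained as in the question. The helper name `load_reduced` and its parameters are illustrative and not part of the SDK; the calls it relies on are the TabularDataset methods `with_timestamp_columns`, `time_after`, `keep_columns`, `take_sample`, and `to_pandas_dataframe`.

```python
def load_reduced(tab_dataset, columns=None, sample_fraction=None,
                 timestamp_column=None, start_time=None, seed=0):
    """Shrink a TabularDataset step by step, then load the result into pandas.

    Illustrative helper (not part of azureml); every argument is optional.
    """
    reduced = tab_dataset

    # Datetime filtering: mark the timestamp column first, then keep only
    # rows at or after start_time. Done before keep_columns so the
    # timestamp column is still present.
    if timestamp_column is not None and start_time is not None:
        reduced = reduced.with_timestamp_columns(timestamp=timestamp_column)
        reduced = reduced.time_after(start_time)

    # Filtering by column name: keep only the columns that are needed.
    if columns is not None:
        reduced = reduced.keep_columns(columns)

    # Simple sampling: keep each row with the given probability.
    if sample_fraction is not None:
        reduced = reduced.take_sample(probability=sample_fraction, seed=seed)

    # Only the reduced dataset is materialized in memory.
    return reduced.to_pandas_dataframe()
```

For example, `load_reduced(tab_dataset, columns=["id", "value"], sample_fraction=0.05)` would load roughly 5% of the rows and two columns; the column names here are placeholders.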

May Hu
  • Is there any significant documentation on this preview feature? I'm trying to build an expression to filter on a column using 'contains' but am having some issues, e.g. dataset['col'].str.contains(['a','b','c']). It seems a pandas-style filter like this one isn't valid to feed into filter(), as in dataset.filter(dataset['col'].str.contains(['a','b','c'])). – geominded Jun 21 '22 at 20:01
  • I'm also interested in further documentation to see which filter expressions are supported. I'm trying to do a simple filter using isin(), but the azureml SDK throws an error, e.g. `dataset.filter(dataset['column'].isin(['a','b'])).to_pandas_dataframe()` throws AttributeError: 'RecordFieldExpression' object has no attribute 'isin'. – geominded Jul 26 '22 at 19:06
  • This does not give enough documentation to figure out which filter expressions work: https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.tabulardataset?view=azure-ml-py#azureml-data-tabulardataset-filter – geominded Jul 26 '22 at 21:31
0

You can subset your data in two ways:

  1. Row-wise: use the filter method of the TabularDataset class.
  2. Column-wise: use the keep_columns or drop_columns method of the TabularDataset class.

Hope this helps you tackle the out-of-memory error.
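A rough sketch of combining both approaches, assuming `tab_dataset` is a TabularDataset. The helper name `filter_by_values` is illustrative and not part of the SDK. Because the comments above report that `isin` is not available on filter expressions, this sketch builds the condition from `==` comparisons joined with `|`; that these operators are accepted by `filter` in a given SDK version is an assumption based on the documentation's statement that expressions can be combined with logical operators.

```python
from functools import reduce
from operator import or_


def filter_by_values(tab_dataset, column, values, keep=None):
    """Keep rows whose `column` equals any of `values`, then load into pandas.

    Illustrative helper (not part of azureml).
    """
    values = list(values)
    if not values:
        raise ValueError("values must contain at least one item")

    # Row-wise: build (ds[col] == v1) | (ds[col] == v2) | ... as a
    # stand-in for isin(), which the expression object does not provide.
    conditions = [tab_dataset[column] == value for value in values]
    expression = reduce(or_, conditions)
    subset = tab_dataset.filter(expression)

    # Column-wise: optionally keep only the columns that are needed.
    if keep is not None:
        subset = subset.keep_columns(keep)

    # Only the filtered subset is materialized in memory.
    return subset.to_pandas_dataframe()
```

For example, `filter_by_values(tab_dataset, "column", ["a", "b"], keep=["column"])` would request only the matching rows and a single column before the conversion to pandas.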

Anitha