
I have a BigQuery table that serves as the input table for an ML model. The model is constantly being changed, since the data scientists keep coming up with new features for it. My current approach is to alter the table's schema every time a new feature is added. I don't like this approach: the model needs to generate predictions every 5 minutes, and each time I recreate the table with the new column, the historic data, and the partitions, the process takes longer.

I feel that there must be some feature-store best practice for this type of problem that I am missing.

How would you approach this problem?

PS: the table currently has 100M+ rows, is updated with ~300 new rows (or more, depending on delays in the ETL process) every 5 minutes, and currently has 80+ features (many of which we are no longer using).

Amira Bedhiafi
ThiagoSC
  • May I ask what kind of model is it? If it's a linear model, you might want to try long table format instead of wide format. However, long format comes with low maintenance, but high monitoring needs. Also, I'd expect the queries to be slower since there will be more rows. (If most of the columns are sparse, row number might not be too big, though.) – Sabri Karagönen Jun 21 '23 at 09:56
  • It is an XGBoost used to detect failures. – ThiagoSC Jun 21 '23 at 12:52
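Following up on the long-format suggestion in the comments: the reshape itself is straightforward, and skipping NULLs is what keeps the row count manageable when most columns are sparse. A minimal sketch in Python (the column and entity names are made up for illustration):

```python
def wide_to_long(rows, id_cols):
    """Reshape wide feature rows into long (ids..., feature, value) records.

    None/NULL values are skipped, so sparse feature columns do not
    inflate the row count as much as a naive rows * features estimate.
    """
    long_rows = []
    for row in rows:
        ids = {k: row[k] for k in id_cols}
        for col, val in row.items():
            if col in id_cols or val is None:
                continue
            long_rows.append({**ids, "feature": col, "value": val})
    return long_rows

wide = [
    {"machine_id": "m1", "ts": "2023-06-21T09:00", "temp": 71.2, "vibration": None},
    {"machine_id": "m2", "ts": "2023-06-21T09:00", "temp": 68.0, "vibration": 0.4},
]
long_rows = wide_to_long(wide, id_cols={"machine_id", "ts"})
# m1 yields one record (vibration is NULL), m2 yields two
```

Note that XGBoost still expects a wide feature matrix at training/inference time, so long format only changes how you *store* features; you would pivot back to wide for the rows you actually score.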

1 Answer


In your case, it would be beneficial to adopt a feature store approach to manage the evolving features of your ML model. A feature store is a centralized repository that serves as a single source of truth for all the features used in machine learning models. It enables efficient feature management, versioning, and retrieval for training and inference.

Instead of altering the schema of your existing BigQuery table every time a new feature is added, consider separating the feature extraction step from the input table. You can create a separate pipeline or process that extracts the features from your input data and stores them in a dedicated feature store.

Then set up a feature store to store and manage your extracted features. This can be implemented using a dedicated database or a specialized feature store tool. The feature store should allow you to version and organize features efficiently. It should also support easy retrieval of features based on timestamps or other relevant criteria.
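To make the retrieval semantics concrete, here is a deliberately tiny in-memory sketch of point-in-time reads, the core operation a feature store must support. A real implementation would live in BigQuery itself or in a tool such as Feast or Vertex AI Feature Store; only the interface is the point here:

```python
import bisect
from collections import defaultdict

class ToyFeatureStore:
    """Minimal in-memory feature store illustrating point-in-time reads."""

    def __init__(self):
        # (entity_id, feature_name) -> list of (timestamp, value), kept sorted
        self._history = defaultdict(list)

    def write(self, entity_id, feature_name, ts, value):
        key = (entity_id, feature_name)
        self._history[key].append((ts, value))
        self._history[key].sort(key=lambda p: p[0])

    def read(self, entity_id, feature_name, as_of):
        """Latest value written at or before `as_of`, or None."""
        history = self._history.get((entity_id, feature_name), [])
        timestamps = [ts for ts, _ in history]
        idx = bisect.bisect_right(timestamps, as_of)
        return history[idx - 1][1] if idx else None

store = ToyFeatureStore()
store.write("m1", "temp", ts=100, value=71.2)
store.write("m1", "temp", ts=200, value=73.5)
store.read("m1", "temp", as_of=150)  # returns 71.2, the value known at that time
```

Point-in-time reads are what prevent training/serving skew: a training set built "as of" a past timestamp sees exactly the feature values that would have been available then.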

Create a feature engineering pipeline that takes the raw input data, applies the necessary transformations, and extracts the features. This pipeline should populate the feature store with the latest features and ensure that they are properly versioned.
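A cheap way to get versioning "for free" from such a pipeline is to tag every output batch with a hash of the feature-definition set. A hedged sketch (the transform names and fields are invented; hashing the transform *source code* rather than just the names would be stricter):

```python
import hashlib
import json

def run_feature_pipeline(raw_rows, transforms):
    """Apply named transforms to raw rows; tag each output row with a
    short hash of the sorted feature names, so every batch records
    which feature-set version produced it."""
    version = hashlib.sha256(
        json.dumps(sorted(transforms)).encode()
    ).hexdigest()[:8]
    out = []
    for row in raw_rows:
        features = {name: fn(row) for name, fn in transforms.items()}
        features["_feature_version"] = version
        out.append(features)
    return out

transforms = {
    "temp_delta": lambda r: r["temp"] - r["temp_prev"],
    "is_overheating": lambda r: r["temp"] > 90.0,
}
rows = run_feature_pipeline([{"temp": 95.0, "temp_prev": 80.0}], transforms)
```

Adding a new feature changes the version tag automatically, so old and new batches remain distinguishable in the store.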

When training your ML model, you can fetch the required features from the feature store based on the desired version. This allows you to consistently use the same set of features for model training, regardless of the changes made to the input data schema.

Instead of recreating the entire table every time, design your feature engineering pipeline to handle incremental updates. This means that for each new batch of data, the pipeline should update only the necessary features in the feature store, rather than rebuilding the entire feature set. This can significantly reduce the processing time and allow for more frequent model predictions.
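In BigQuery specifically, the incremental path is a `MERGE` from a staging table into the feature table: matched rows are updated, new rows inserted, nothing recreated. A small helper to build such a statement (the table and column names are hypothetical):

```python
def build_merge_sql(target, staging, key_cols, feature_cols):
    """Build a BigQuery MERGE that upserts a staged batch into the
    feature table: matched rows are updated, new rows are inserted,
    and the table is never dropped or recreated."""
    on = " AND ".join(f"t.{k} = s.{k}" for k in key_cols)
    updates = ", ".join(f"t.{c} = s.{c}" for c in feature_cols)
    cols = list(key_cols) + list(feature_cols)
    return (
        f"MERGE `{target}` t\n"
        f"USING `{staging}` s\n"
        f"ON {on}\n"
        f"WHEN MATCHED THEN UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN INSERT ({', '.join(cols)}) "
        f"VALUES ({', '.join(f's.{c}' for c in cols)})"
    )

sql = build_merge_sql(
    "project.dataset.features",          # hypothetical table names
    "project.dataset.features_staging",
    key_cols=["machine_id", "ts"],
    feature_cols=["temp_delta", "is_overheating"],
)
```

With the 300-rows-per-5-minutes batch size you describe, each MERGE touches a trivial amount of data compared with rebuilding a 100M+ row partitioned table.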

Finally, implement a mechanism to track feature versions and retire deprecated features. This ensures you can trace which features were used for each training run and each inference call, and that you can remove unused features from the feature store to keep it lean.
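Given that you mention 80+ features with many no longer used, the deprecation check can be as simple as diffing the store against a registry of what each deployed model version actually consumes. A sketch (model and feature names are invented):

```python
def unused_features(all_feature_names, model_registry):
    """Given a registry mapping model version -> features it consumes,
    return the features no registered model uses anymore: candidates
    for deprecation and removal from the feature store."""
    used = set()
    for features in model_registry.values():
        used.update(features)
    return sorted(set(all_feature_names) - used)

registry = {
    "failure_detector_v3": ["temp_delta", "vibration_rms"],
    "failure_detector_v4": ["temp_delta", "is_overheating"],
}
stale = unused_features(
    ["temp_delta", "vibration_rms", "is_overheating", "old_ratio"],
    registry,
)
# stale == ["old_ratio"]
```

Running this as a periodic job gives you an auditable list of columns to drop, instead of letting dead features accumulate in the schema.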

Amira Bedhiafi