The meaningfulness of any comparison of summary statistics depends entirely on your inference goals. Are you using the model to make forward-looking forecasts, where you care about predictive accuracy? Or are you analyzing a study's data, trying to quantify the regression coefficients in order to characterize relationships and effect sizes? There is no single, goal-free way to compare the models without knowing your specific inference goals.
If your goals are focused more on the prediction side, you have some options. You can use the fitted parameters from your OLS model to make predictions on a hold-out or test set, and then look at the root mean squared error, or the correlation or R-squared between the test-set target values and the predictions (but generally not the R-squared on the data used to fit the model).
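For concreteness, here is a minimal sketch of that hold-out evaluation using scikit-learn; the synthetic `X` and `y` are placeholders for your actual design matrix and target.

```python
# Hold-out evaluation of an OLS fit; X and y below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                               # placeholder predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)   # placeholder target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
preds = ols.predict(X_test)

print("test RMSE:", np.sqrt(mean_squared_error(y_test, preds)))
print("test R^2 :", r2_score(y_test, preds))
```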
For the Bayesian model, you can choose a single point estimate of the parameters, such as the MAP estimate, and repeat the same test-set evaluation as for OLS. This is conceptually simpler, but it fails to take advantage of the fact that the Bayesian model gives you a full posterior distribution over the parameters.
To utilize the posterior, you could use a tool like pymc or stan to draw many samples of the fitted coefficient vector from the posterior distribution. Then, for each sample, calculate your performance metric on the test data as you did for OLS. This gives you a distribution over the test statistic, so you can report the mean and standard deviation of performance on the test set -- something you can't as easily get from a standard OLS model.
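Here is a sketch of that workflow with pymc (v5-style API), reusing the train/test split from the snippet above; the priors and sampler settings are arbitrary illustrative choices, not recommendations.

```python
# Per-draw test-set RMSE from the posterior of a Bayesian linear regression.
import numpy as np
import pymc as pm

with pm.Model() as model:
    alpha = pm.Normal("alpha", 0.0, 10.0)
    beta = pm.Normal("beta", 0.0, 10.0, shape=X_train.shape[1])
    sigma = pm.HalfNormal("sigma", 5.0)
    mu = alpha + pm.math.dot(X_train, beta)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_train)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

# Flatten (chain, draw) and compute the test RMSE for every posterior draw.
beta_draws = idata.posterior["beta"].values.reshape(-1, X_train.shape[1])
alpha_draws = idata.posterior["alpha"].values.reshape(-1)

test_preds = alpha_draws[:, None] + beta_draws @ X_test.T      # (n_draws, n_test)
rmse_draws = np.sqrt(((test_preds - y_test) ** 2).mean(axis=1))

print("test RMSE: mean %.3f, sd %.3f" % (rmse_draws.mean(), rmse_draws.std()))
```

An alternative is `pm.sample_posterior_predictive`, which draws full predictive samples (including observation noise) rather than the posterior mean predictions used here.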
If you are more concerned with effect sizes or with explanatory analysis of the coefficients, then you can separately look at the frequentist (NHST) p-values for the OLS parameters and the standard error of the regression (the standard error of the residuals). If you care about frequentist notions of statistical significance, these are the quantities that give you that information.
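In statsmodels those quantities come straight out of the fit; a sketch, reusing the placeholder `X` and `y` from the first snippet:

```python
# Frequentist summary quantities from an OLS fit.
import numpy as np
import statsmodels.api as sm

fit = sm.OLS(y, sm.add_constant(X)).fit()

print(fit.summary())                              # coefficients, standard errors, p-values
print("p-values:", fit.pvalues.round(4))
print("residual std. error:", np.sqrt(fit.mse_resid))
```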
While you can calculate Bayesian p-values and credible (highest-density) intervals for the coefficients in the Bayesian model, you have to be careful not to compare them directly to their frequentist analogues. A Bayesian p-value tells you something about the relative extremity of an outcome under the posterior distribution, which implicitly incorporates your assumptions about the priors and model structure. A frequentist p-value tells you something about the relative extremity of an outcome assuming the null hypothesis -- which is a different thing and is not equivalent to a posterior probability.
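If you do want the Bayesian quantities, ArviZ can compute them from the `idata` object produced by the pymc snippet above; the 95% level and the sign probability for the first coefficient are just illustrative choices.

```python
import arviz as az

# 95% highest-density intervals for each regression coefficient.
print(az.hdi(idata, var_names=["beta"], hdi_prob=0.95))

# Posterior probability that the first coefficient is negative --
# a statement about the posterior, not a frequentist p-value.
print(float((idata.posterior["beta"].values[..., 0] < 0).mean()))
```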
You likely also want to make sure your variables have been standardized in a way that makes the coefficients comparable across the two models. For example, you may want to z-score your input predictors (so each coefficient is interpretable as the effect of a one-standard-deviation change), or even divide by twice the standard deviation if you also have binary or categorical predictors (this puts the continuous coefficients on a scale comparable to that of a 0/1 indicator).
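A sketch of both scalings with pandas; the column names are placeholders:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_pred = pd.DataFrame({"x1": rng.normal(size=100),
                        "x2": rng.normal(10.0, 3.0, size=100)})

z_scored = (df_pred - df_pred.mean()) / df_pred.std()         # coefficient = effect of a 1-SD change
two_sd   = (df_pred - df_pred.mean()) / (2 * df_pred.std())   # scale comparable to a 0/1 indicator
```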
Finally, you mention a hierarchical model in the Bayesian case, which suggests you are modeling different possible treatment effects for different groups of observations.
It is very difficult to create a direct equivalent for this in the frequentist setting. You can add indicator variables for group membership, then interpret the other coefficients as the baseline effect in the reference ("default") group and the indicator coefficients as the marginal additional effect when the baseline predictor is at its mean value. But this interpretation becomes convoluted quickly, especially as the number of groups grows.
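A sketch of the indicator-variable version with statsmodels formulas; the data frame, predictor, and group labels are all made up for illustration.

```python
# Group indicators plus group-by-predictor interactions in plain OLS.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x": rng.normal(size=300),
    "group": rng.choice(["a", "b", "c"], size=300),
})
df["y"] = df["x"] + (df["group"] == "b") * 0.5 + rng.normal(size=300)

# C(group) expands into dummy columns against a reference level;
# x * C(group) adds both intercept shifts and per-group slope shifts.
fit = smf.ols("y ~ x * C(group)", data=df).fit()
print(fit.params)
```

Every extra group adds another intercept shift and slope shift relative to the reference level, which is exactly where the interpretation starts to sprawl.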
You can also try the machinery developed for random-effects and mixed-effects modeling in econometrics and biostatistics, but much of it boils down to the same indicator-variable-style design matrices, and it becomes exceedingly tricky to interpret and to ensure you are modeling the correlation of errors correctly (which can require clustered standard errors).
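For completeness, here is a random-intercept/random-slope version with statsmodels' MixedLM, reusing the toy `df` from the previous snippet; with only three groups it is purely illustrative, and real use wants many more groups.

```python
import statsmodels.formula.api as smf

# Random intercept and random slope for x, grouped by "group".
mixed = smf.mixedlm("y ~ x", df, groups=df["group"], re_formula="~x").fit()
print(mixed.summary())
```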
Frankly, if you have reason to suspect that a hierarchical model is appropriate for the Bayesian approach, I would skip the OLS approaches entirely. Perhaps choose "uninformative" (or weakly informative) priors for your hyperparameters, unless those priors can be grounded in previous research. Then focus on the standard methods for interpreting Bayesian models: Bayesian p-values, posterior predictive checks, test-sample accuracy metrics, and credible intervals. Your effort will probably be better spent that way than on mental gymnastics arguing for some interpretive connection to mixed-effects models.
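To make that concrete, here is a minimal partial-pooling sketch in pymc with a posterior predictive check, reusing the toy `df` from the indicator-variable snippet; the weakly informative priors are placeholders for whatever prior knowledge you actually have.

```python
# Hierarchical (partial-pooling) regression with per-group intercepts.
import arviz as az
import pandas as pd
import pymc as pm

group_idx, groups = pd.factorize(df["group"])

with pm.Model() as hier:
    mu_a = pm.Normal("mu_a", 0.0, 5.0)                     # population-level intercept
    sigma_a = pm.HalfNormal("sigma_a", 2.0)                # between-group spread
    a = pm.Normal("a", mu_a, sigma_a, shape=len(groups))   # per-group intercepts

    beta = pm.Normal("beta", 0.0, 5.0)
    sigma = pm.HalfNormal("sigma", 2.0)

    mu = a[group_idx] + beta * df["x"].values
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=df["y"].values)

    idata_h = pm.sample(1000, tune=1000, chains=2, random_seed=0)
    idata_h.extend(pm.sample_posterior_predictive(idata_h, random_seed=0))

az.plot_ppc(idata_h)                                       # posterior predictive check
print(az.hdi(idata_h, var_names=["beta"], hdi_prob=0.95))  # credible interval for the slope
```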