How to structure Machine Learning projects using Object Oriented programming in Python?

Question

I have observed that statisticians and machine learning scientists generally don't follow OOP for ML/data science projects when using Python (or other languages).

This is probably due to a lack of understanding of software engineering best practices in OOP when developing ML code for production, because they mostly come from a math and statistics background rather than a computer science one.

The days when ML scientists developed ad hoc prototype code and another software team made it production-ready are over in the industry.

Questions

1. How do we structure code using OOP for an ML project?
2. Should every major task (from picture above) like data cleaning, feature transformation, grid search, model validation etc. be an individual class? What are the recommended code design practices for ML?
3. Are there any good GitHub links with well-structured code for reference (maybe a well-written Kaggle solution)?
4. Should every class, such as data cleaning, have fit(), transform() and fit_transform() functions for every process like remove_missing() or outlier_removal()? When this is done, why is scikit-learn's BaseEstimator usually inherited?
5. What should be the structure of a typical config file for ML projects in production?

I think it is debatable whether OOP is such a clever choice for data science and languages like Python. Personally, I am in favour of a functional style, especially when we're dealing with math. The fact that it is opinion-based makes this question perhaps not best suited for SO (although I certainly agree it is interesting). — Lukasz Tracewski, Oct 28 '17 at 18:19
Most production-quality Python code is written in OOP, as far as I have heard and seen. Why is a functional style more favorable than OOP for math? — GeorgeOfTheRF, Oct 28 '17 at 18:24
The very essence of functional programming is treating code as the evaluation of mathematical functions. By avoiding mutable data structures and changes of state, one can produce code that is more robust and certainly easier to test. I think it is hard to deny that having unit tests around an ML project makes for smoother iterations in the cycle you depicted. — Lukasz Tracewski, Oct 28 '17 at 18:43
This is an interesting question that should, IMHO, be moved to https://softwareengineering.stackexchange.com/. The difference in the tools and libraries should also be considered. For example, a Pandas DataFrame (a very powerful and versatile tool in the hands of a data scientist) feels like putting a SQL table right in the middle of the code. It is very hard to work with when it is surrounded by OOP code. — zardosht, Aug 19 '22 at 16:04

BartoszKP · Answer 1 · 2020-01-08T22:10:39.383

You are right that one thing is special about ML: data scientists are generally clever people, so they have no problem expressing their ideas in code. The problem is that they tend to create fire-and-forget code because they lack software development craftsmanship, but ideally this shouldn't be the case.

When writing code it shouldn't make any difference what the code is for¹. ML is just another domain like anything else, and should follow clean code principles.

The most important aspect should always be the SOLID principles. Many important qualities follow directly from them: maintainability, readability, flexibility, testability, reliability etc. What you can add to this mix is the risk of change. It doesn't matter whether a piece of code is pure ML, banking business logic, or an audiological algorithm for a hearing instrument. In every case the implementation will be read by other developers, will contain bugs to fix, will be tested (hopefully) and possibly refactored and extended.

Let me try to explain this in more detail while addressing some of your questions:

1,2) You shouldn't think of OOP as a goal in itself. If there is a concept that can be modeled as a class, and doing so makes it easier for other developers to use, read, extend and test, and harder to introduce bugs, then of course make it a class. But unless it's needed, you shouldn't fall into the BDUF (Big Design Up Front) antipattern. Start with free functions and evolve into a better interface if needed.
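
As a rough illustration of "start with free functions and evolve", here is a minimal sketch for a missing-value step. All names in it (median_of_present, fill_missing, MedianImputer) are hypothetical and not taken from any library; the assumption is that a class only becomes worthwhile once a value learned from training data has to be reused on new data.

```python
from statistics import median


def median_of_present(values):
    """Median of the entries that are not None."""
    return median(v for v in values if v is not None)


def fill_missing(values, fill_value):
    """Return a new list with every None replaced by fill_value."""
    return [fill_value if v is None else v for v in values]


class MedianImputer:
    """Hypothetical later step: a class that remembers the training median."""

    def fit(self, values):
        self.fill_value_ = median_of_present(values)
        return self

    def transform(self, values):
        return fill_missing(values, self.fill_value_)


train = [1.0, None, 3.0, 5.0]
test = [None, 2.0]

# Step 1: free functions are enough for a one-off script.
fill = median_of_present(train)
clean_train = fill_missing(train, fill)

# Step 2: the class reuses the same functions once the learned value
# has to travel from training data to unseen data.
imputer = MedianImputer().fit(train)
clean_test = imputer.transform(test)
```

The free functions stay pure and easy to unit test, and the class is only a thin layer of state on top of them.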

4) Such complex inheritance hierarchies are typically created to make an implementation extensible (see the "O" in SOLID, the open/closed principle). In this case, library users can inherit from BaseEstimator, and it's easy to see which methods they can override and how this will fit into scikit-learn's existing structure.
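
To make this concrete, here is a small sketch of a custom transformer that plugs into scikit-learn by inheriting BaseEstimator and TransformerMixin. The class QuantileClipper and its parameters lower and upper are made up for illustration; only the two base classes and the methods they provide (get_params from BaseEstimator, fit_transform from TransformerMixin) come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class QuantileClipper(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to quantiles learned in fit()."""

    def __init__(self, lower=0.05, upper=0.95):
        # By scikit-learn convention, __init__ only stores its keyword arguments.
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state gets a trailing underscore, also by convention.
        self.low_ = np.quantile(X, self.lower, axis=0)
        self.high_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.low_, self.high_)


X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
clipper = QuantileClipper(lower=0.0, upper=0.75)

# fit_transform is inherited from TransformerMixin, not written by hand.
X_clipped = clipper.fit_transform(X)

# get_params is inherited from BaseEstimator; tools such as grid search rely on it.
params = clipper.get_params()
```

Only fit() and transform() had to be written here; the rest of the interface comes from the base classes, which is the extensibility the hierarchy is meant to provide.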

5) Almost the same principles apply as for code, but with the people who will create and edit these config files in mind. What will be the easiest format for them? How should you choose parameter names so that their meaning is obvious, even to a beginner who is just starting to use your product?
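
One possible way to apply this, sketched under assumptions: the config is a small JSON file with self-explanatory keys, loaded into a frozen dataclass that rejects bad values early. The field names, defaults and the path "data/train.csv" are invented for the example and are not a recommended standard.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical config: descriptive names, safe defaults, early validation."""

    training_data_path: str
    validation_fraction: float = 0.2
    max_training_epochs: int = 10
    random_seed: int = 42

    def __post_init__(self):
        if not 0.0 < self.validation_fraction < 1.0:
            raise ValueError(
                "validation_fraction must be between 0 and 1, "
                f"got {self.validation_fraction}"
            )
        if self.max_training_epochs < 1:
            raise ValueError("max_training_epochs must be at least 1")


def load_config(text):
    """Parse JSON text into a TrainingConfig; unknown keys raise TypeError."""
    return TrainingConfig(**json.loads(text))


# What a user-edited config file might contain (illustrative values only).
example = '{"training_data_path": "data/train.csv", "validation_fraction": 0.25}'
config = load_config(example)
```

A misspelled key or an out-of-range value fails at load time with a readable message, which matters most to the beginner editing the file.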

All these things should be combined with an estimate of how likely it is that a given piece of code will be changed or extended in the future. If you are sure something is set in stone, don't worry about every aspect too much (e.g. focus only on readability), and direct your efforts to more critical parts of the implementation.

To sum up: think about the people who will interact with what you create in the future. In the case of products, config files and user interfaces, it should always be "user first". In the case of code, try to put yourself in the shoes of a future developer who will want to fix, extend or understand your code.

¹ There are of course some special cases, like code that needs to be formally proven correct or extensively documented because of formal regulations, where this main goal imposes particular constructs or practices.
