For the companies I've worked for data for a development environment is either (1) rsynced from a central server or (2) have a minimal data set which is added to the git repository.
In case of option 1 there's usually an automated process which pulls data from production servers and cleans it up (removing cache/logs, anonymize any sensitive data, etc). The advantages of this option are you have (mostly) real data for your development environment and there's no need to manually manage a separate data set. The disadvantages are you might not have data to test all situations, data can get large and there's a chance you might miss sensitive data which could lead to data leaks.
In case of option 2 there's usually a way to generate random data to get a more filled development environment. The advantages to this option are you have a clean development environment, the data set is as small as it can be and there's no chance of leaking sensitive data. The disadvantages are you need to maintain a separate minimal data set, problems related to specific data might be harder to debug.
Personally I think 2 is the better option. You should not need production data for development as long as you have a good way to create realistic random data. Production data might actually miss a lot of situations you do need for development. Some content elements might not always be used, things like empty news lists might not happen (often) in production, etc. I also don't want to have to download several Gb of data if I have to change a small thing in a project I don't have locally yet.