Games like Diablo or Sims 1, 2, SimCity 1-3, X-Com 1,2 etc. were actually just 2D games. The 2.5D effect requires that tiles further away are exactly the same size as tiles nearby. Your rotation around these games are restricted to 90 degrees.
How they draw is basically painters algorithm. Drawing what is furthest away first and overdrawing things that are nearer. Diablo is actually pretty simple, it didn't introduce layers or height differences as far as I remember. Just a flat map. So you draw the floor tiles first (in this case back to front isn't too necessary since they are all on the same elevation.) Then drawing back to front the walls, characters effects etc.
Everything in these games were rendered to bitmaps and rendered as bitmaps. Even though their source may have been a 3D textured model.
If you want to add perspective or free rotation then you need everything to be a 3D model. Your rendering will be simpler because depth or render order isn't as critical as you would use z-buffering to solve your issues. The only main issue is to properly render transparent bits in the right order or else you may end up with some odd results. However even if your rendering is simpler, your animation or in memory storage is a bit more difficult. You need to animate 3D models instead of just having an array of bitmaps to do the animation. Selection of items on the screen requires a little more work since position and size of the elements are no longer consistent or easily predictable.
So it depends on which features you want that will dictate which sort of solution you can use. Either way has it's plusses and minuses.