Not quite an answer.
I've seen similar effects, and I think all of the parameters, and how you train the model, matter. For example, with more layers (resnet34 vs. resnet18 for the backbone) you need more data to train the bigger network, and that is where augmentations become useful.
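For instance, here is a minimal sketch of how I'd swap the backbone depth with torchvision's detection builders (the exact arguments vary between torchvision versions, e.g. newer releases use weights= instead of pretrained=; num_classes=2 is just a placeholder):

```python
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Smaller backbone: fewer parameters, converges with less data/augmentation.
backbone = resnet_fpn_backbone('resnet18', pretrained=True)

# Bigger backbone: more capacity, but usually wants more data or heavier augmentation.
# backbone = resnet_fpn_backbone('resnet34', pretrained=True)

model = MaskRCNN(backbone, num_classes=2)  # placeholder: one object class + background
```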
Another example is network resolution. I trained it with the defaults (min_size=800, max_size=1333) at some learning rate; with a higher resolution, the network AP has more room to grow aggressively at a higher LR. Yet another related example is how many "levels" you have in your FPN and what the grid settings are for the AnchorGenerator. If your augmentations generate objects smaller than the anchors on a particular FPN level, they will probably cause more problems than they solve. And if your augmentations generate samples so small that the details of your object are no longer visible, that is again not very useful, especially with small networks.
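As a rough sketch (torchvision again, with hypothetical values), the resolution and the anchor grid are both exposed on the model constructor; the key point is that you need one anchor size tuple per FPN level, and those sizes should stay smaller than the objects your augmentations actually produce:

```python
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone
from torchvision.models.detection.rpn import AnchorGenerator

backbone = resnet_fpn_backbone('resnet18', pretrained=True)  # returns 5 feature maps by default

# One size tuple per FPN level; objects smaller than the smallest anchor
# on a level are effectively wasted on that level.
anchor_generator = AnchorGenerator(
    sizes=((32,), (64,), (128,), (256,), (512,)),
    aspect_ratios=((0.5, 1.0, 2.0),) * 5,
)

model = MaskRCNN(
    backbone,
    num_classes=2,                  # placeholder: one object class + background
    min_size=1024, max_size=1700,   # raised from the 800/1333 defaults
    rpn_anchor_generator=anchor_generator,
)
```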
There are tons of similar small issues that matter. I had a situation where rotations made the result worse: at some rotation angles the rotated sample started to look like part of the background, and the Mask R-CNN-based detector failed on it. Cubic interpolation fixed it a little, but eventually I came up with the idea of limiting the rotation angle.
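If it helps, here is roughly how that limited-angle rotation could look with albumentations (hypothetical values; any augmentation library with an angle limit and interpolation option would work, and masks passed alongside the image get the same transform):

```python
import cv2
import albumentations as A

# Keep rotations small so the rotated object doesn't start to look like background,
# and use cubic interpolation to preserve fine details of the object.
train_transform = A.Compose(
    [
        A.Rotate(limit=15, interpolation=cv2.INTER_CUBIC, p=0.5),
        A.HorizontalFlip(p=0.5),
    ],
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['labels']),
)
```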
Just experiment and find the hyperparameters that work well for your particular task.