"Obviously!", you might say... But there's one significant difference that I have trouble explaining by the difference in random initialization.
Take the two pre-trained basenets (before the average pooling layer) and feed them the same image: you will notice that the output features do not follow the same distribution. Specifically, TensorFlow's backbone has more features inhibited (zeroed out) by the ReLU than PyTorch's backbone. Additionally, as shown in the third figure, the dynamic range differs between the two frameworks.
Of course, this difference is absorbed by the dense layer handling the classification task, but: can that difference be explained by randomness in the training process? By training time? Or is there something else that would explain it?
Code to reproduce:
import imageio
import numpy as np
import tensorflow as tf
import torch
import torchvision

# Load the image and scale it to [0, 1]; the same raw tensor is fed to both
# backbones, without any framework-specific preprocessing.
image = imageio.imread("/tmp/image.png").astype(np.float32) / 255

# TensorFlow/Keras backbone (no classification head, no pooling)
inputs = image[np.newaxis]
model = tf.keras.applications.ResNet50(include_top=False, input_shape=(None, None, 3))
output = model(inputs).numpy()
print(f"TensorFlow features range: [{np.min(output):.02f};{np.max(output):.02f}]")

# PyTorch backbone: keep everything up to and including the last residual stage,
# i.e. drop the average pooling and the fully connected layer
model = torch.nn.Sequential(*list(torchvision.models.resnet50(pretrained=True).children())[0:8])
inputs = torch.tensor(image).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
output = model(inputs).detach().permute(0, 2, 3, 1).numpy()  # back to NHWC for comparison
print(f"PyTorch features range: [{np.min(output):.02f};{np.max(output):.02f}]")
Output:
TensorFlow features range: [0.00;25.98]
PyTorch features range: [0.00;12.00]
Note: the behavior is similar with any input image.
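
For reference, here is a minimal sketch of how the "more features inhibited by the ReLU" claim can be quantified. It assumes the two feature maps from the script above have been kept under the hypothetical names output_tf and output_pt (in the script, both are stored in the reused variable output).

# Summarize the post-ReLU feature distribution of a backbone's output.
# output_tf / output_pt are assumed to be the NHWC feature maps computed above.
def summarize(name, features):
    zero_fraction = np.mean(features == 0)  # share of activations zeroed by the ReLU
    print(f"{name}: {100 * zero_fraction:.1f}% zeros, "
          f"mean={features.mean():.2f}, max={features.max():.2f}")

summarize("TensorFlow", output_tf)
summarize("PyTorch", output_pt)

This prints, for each framework, the fraction of exactly-zero activations and the mean/max of the feature map, which is the comparison behind the figures mentioned above.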