There are 13x13 grid cells, true, but P(object) is calculated for each of the 5x13x13 = 845 anchor boxes. From the YOLO9000 paper:
When we move to anchor boxes we also decouple the class prediction mechanism from the spatial location and instead predict class and objectness for every anchor box.
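To make the counting concrete, here is a small sketch of the arithmetic. The 20-class figure is an assumption (the PASCAL VOC setup); the variable names are just for illustration.

```python
# Assumed settings: 13x13 grid, 5 anchors per cell, 20 classes (PASCAL VOC).
GRID_H, GRID_W = 13, 13
NUM_ANCHORS = 5
NUM_CLASSES = 20

# Every anchor box gets its own 4 box coordinates, 1 objectness score,
# and NUM_CLASSES class scores.
values_per_anchor = 4 + 1 + NUM_CLASSES

num_anchor_boxes = GRID_H * GRID_W * NUM_ANCHORS    # one P(object) per anchor box
output_channels = NUM_ANCHORS * values_per_anchor   # depth of the final feature map

print(num_anchor_boxes)  # 845
print(output_channels)   # 125
```

So the last layer outputs a 13x13x125 tensor under these assumptions, and objectness is read out 845 times, not 169.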
I can't comment yet because I'm new here, but if you're wondering about test time, it works kind of like an RPN. At each grid cell, each of the 5 anchor boxes predicts a bounding box, which can be larger than the grid cell, together with an objectness score and class scores. Low-confidence boxes are then discarded, and non-maximum suppression removes overlapping duplicates so that only the top few boxes remain as detections.
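A rough sketch of that test-time step, in plain Python. The box decoding follows the formulas in the YOLO9000 paper (sigmoid offsets inside the cell, exponential scaling of the anchor size); the function names, the example anchor size, and the 0.5 IoU threshold are my own assumptions, not darknet's code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(tx, ty, tw, th, cell_x, cell_y, anchor_w, anchor_h):
    """Turn raw outputs into (x_center, y_center, w, h) in grid-cell units."""
    bx = sigmoid(tx) + cell_x        # center stays inside its grid cell
    by = sigmoid(ty) + cell_y
    bw = anchor_w * math.exp(tw)     # width/height scale the anchor, so the
    bh = anchor_h * math.exp(th)     # box can span many grid cells
    return bx, by, bw, bh

def iou(a, b):
    """Intersection over union of two (x_center, y_center, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop anything overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# A 3x4 anchor (hypothetical size) at cell (6, 6) with all-zero raw outputs.
box = decode_box(0.0, 0.0, 0.0, 0.0, 6, 6, 3.0, 4.0)
print(box)  # (6.5, 6.5, 3.0, 4.0) -> much larger than one grid cell

boxes = [box, (6.6, 6.5, 3.0, 4.0), (2.0, 2.0, 1.0, 1.0)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the near-duplicate box 1 is suppressed
```

In a real pipeline the scores fed to NMS would come from the network's objectness and class outputs; here they are made-up numbers.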
P(object) is just a predicted probability; the network doesn't "know" whether there is really an object in there or not.
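In other words, it is just a number squashed into (0, 1) that you compare against a threshold. A minimal illustration, where the raw logits and the 0.5 cutoff are made-up values for the example:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw objectness outputs for three anchor boxes.
raw_objectness = [2.0, 0.0, -3.0]
THRESHOLD = 0.5  # assumed cutoff, chosen only for this example

p_object = [sigmoid(t) for t in raw_objectness]

# Keep the anchor boxes whose predicted P(object) clears the threshold.
kept = [i for i, p in enumerate(p_object) if p > THRESHOLD]
print(kept)  # [0]
```

Nothing here checks whether an object is actually present; the score is only as good as what the network learned during training.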
You can also look at the source code for the forward_region_layer method in region_layer.c and trace how the losses are calculated, if you're interested.