It might be a little easier to understand if you think about the process of building an Ethernet frame containing an IP packet that is destined for something off the local network:
Take a computer with IP 10.10.10.1, a mask of 255.255.255.0, and a default gateway of 10.10.10.254. Let's say its MAC address (fake) is 12:34:56:78:90:12.
Let's say the default gateway, 10.10.10.254, has a MAC (also fake) of 22:33:44:22:33:44.
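Now the computer wants to send traffic to the IP 50.50.50.50. It knows from its own IP address and mask that 50.50.50.50 is outside its subnet (10.10.10.0/24), so it needs the MAC address of its default gateway. Assuming that MAC is not already in its ARP cache:
- It sends out an ARP request for IP 10.10.10.254
- It gets back a response with the MAC 22:33:44:22:33:44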
Now it builds an Ethernet frame:
From MAC: 12:34:56:78:90:12
To MAC: 22:33:44:22:33:44 (this MAC is local: it is the gateway's)
With an IP packet encapsulated inside:
From IP: 10.10.10.1
To IP: 50.50.50.50 (this IP is remote)
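
To make the layering concrete, here is a minimal Python sketch of those bytes. The function names (`mac_bytes`, `ipv4_checksum`, `build_frame`) are made up for illustration, and the TTL of 64 and protocol number 17 (UDP) are placeholder assumptions, since the example above does not say what the packet carries. A real network stack and NIC would also add the payload, padding, and the frame check sequence.

```python
import socket
import struct

def mac_bytes(mac):
    # "12:34:56:78:90:12" -> 6 raw bytes
    return bytes.fromhex(mac.replace(":", ""))

def ipv4_checksum(header):
    # Ones' complement sum over the ten 16-bit words of a 20-byte header
    total = sum(struct.unpack("!10H", header))
    total = (total & 0xFFFF) + (total >> 16)
    total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_frame(src_mac, dst_mac, src_ip, dst_ip, payload=b""):
    # Layer 2 header: To MAC, From MAC, EtherType 0x0800 (IPv4)
    eth_header = mac_bytes(dst_mac) + mac_bytes(src_mac) + struct.pack("!H", 0x0800)
    # Layer 3 header: version/IHL, TOS, total length, ID, flags/fragment,
    # TTL (assumed 64), protocol (assumed 17), checksum, From IP, To IP
    fields = [0x45, 0, 20 + len(payload), 0, 0, 64, 17, 0,
              socket.inet_aton(src_ip), socket.inet_aton(dst_ip)]
    header = struct.pack("!BBHHHBBH4s4s", *fields)
    fields[7] = ipv4_checksum(header)  # fill in the checksum field
    ip_header = struct.pack("!BBHHHBBH4s4s", *fields)
    return eth_header + ip_header + payload

frame = build_frame("12:34:56:78:90:12", "22:33:44:22:33:44",
                    "10.10.10.1", "50.50.50.50")
print(frame.hex())
```

Note that the gateway's IP (10.10.10.254) appears nowhere in the frame. Only its MAC does.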
So the frame can't be sent until every field in it is filled in, and ARP supplies the one missing piece: the destination MAC. If the destination IP had been local (like 10.10.10.50), the same process would have taken place, except the computer would ARP for 10.10.10.50 itself, and the To MAC would be whatever came back in that response. Either way, the To IP is always the final destination.
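
The local-versus-remote decision can be sketched in a few lines of Python using the standard `ipaddress` module. The function name `arp_target` is hypothetical, and this only models the simple case of a single interface with one default gateway.

```python
import ipaddress

def arp_target(own_ip, netmask, gateway, dst_ip):
    # Work out the local subnet from the host's own IP and mask
    local_net = ipaddress.ip_network(f"{own_ip}/{netmask}", strict=False)
    if ipaddress.ip_address(dst_ip) in local_net:
        return dst_ip   # on the local network: ARP for the destination itself
    return gateway      # off the local network: ARP for the default gateway

print(arp_target("10.10.10.1", "255.255.255.0", "10.10.10.254", "50.50.50.50"))
print(arp_target("10.10.10.1", "255.255.255.0", "10.10.10.254", "10.10.10.50"))
```

The first call returns the gateway's IP and the second returns the destination's own IP, matching the two cases described above.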
All of the communication between devices on the same network really happens at Layer 2, via MAC addresses: either a specific device's MAC or the broadcast MAC address (as in the case of an ARP request). The IP packets are carried inside those Layer 2 frames.
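
For completeness, here is a rough Python sketch of what an ARP request looks like as a Layer 2 broadcast frame. The helper names (`mac_bytes`, `build_arp_request`) are made up for illustration; the field layout follows the standard ARP format for Ethernet and IPv4, and padding and the frame check sequence are left out.

```python
import socket
import struct

BROADCAST_MAC = "ff:ff:ff:ff:ff:ff"

def mac_bytes(mac):
    # "12:34:56:78:90:12" -> 6 raw bytes
    return bytes.fromhex(mac.replace(":", ""))

def build_arp_request(sender_mac, sender_ip, target_ip):
    # Layer 2 header: broadcast To MAC, From MAC, EtherType 0x0806 (ARP)
    eth_header = (mac_bytes(BROADCAST_MAC) + mac_bytes(sender_mac)
                  + struct.pack("!H", 0x0806))
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,        # hardware type: Ethernet
        0x0800,   # protocol type: IPv4
        6, 4,     # MAC length, IP length
        1,        # operation: request
        mac_bytes(sender_mac), socket.inet_aton(sender_ip),
        bytes(6),                     # target MAC: unknown, all zeros
        socket.inet_aton(target_ip),  # the IP whose MAC is being asked for
    )
    return eth_header + arp

request = build_arp_request("12:34:56:78:90:12", "10.10.10.1", "10.10.10.254")
print(request.hex())
```

The reply comes back as a unicast frame addressed to the requester's MAC, with the answering device's MAC in the sender field.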