In general, the FFT of the complete file will not equal the FFT of an excerpt. Consider a 40-second file that contains four 10-second segments of sine waves at 20 Hz, 40 Hz, 60 Hz and 80 Hz, respectively.
The spectrum of the whole file would show peaks at all four frequencies, but any 10-second excerpt can overlap at most two segments and would therefore show at most two of them. Hence, the two spectra do not match.
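To make that concrete, here is a small sketch of the example using NumPy. The sample rate, the 0.5 threshold and the helper names `make_signal` and `peak_frequencies` are my own choices for illustration, not anything from your setup.

```python
import numpy as np

FS = 1000                  # sample rate in Hz (assumed for this illustration)
SEG_SECONDS = 10           # length of each sine segment
FREQS = [20, 40, 60, 80]   # one frequency per segment, in Hz


def make_signal():
    """Build the 40-second test file: four consecutive 10-second sine waves."""
    t = np.arange(SEG_SECONDS * FS) / FS
    return np.concatenate([np.sin(2 * np.pi * f * t) for f in FREQS])


def peak_frequencies(x, threshold=0.5):
    """Return the frequencies (rounded to whole Hz) whose FFT magnitude
    exceeds `threshold` times the largest magnitude in the spectrum."""
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1 / FS)
    strong = freqs[mag > threshold * mag.max()]
    return sorted(set(np.round(strong).astype(int).tolist()))


signal = make_signal()
whole = peak_frequencies(signal)                       # spectrum of all 40 s
excerpt = peak_frequencies(signal[5 * FS:15 * FS])     # seconds 5 through 15
print(whole, excerpt)
```

The whole file yields four peaks, while the excerpt from second 5 to second 15 only contains the 20 Hz and 40 Hz tones, so a direct comparison of the two spectra cannot succeed.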
Now, what you're trying to do sounds a bit like Shazam, and luckily, they've released a research paper on how it works. Maybe that will solve your problem.
For another approach (albeit one that might not be able to deal with pitch and speed changes), consider the implications of my example above: you shouldn't try to match a spectrum that was computed over 40 seconds to one that represents only 10 seconds. So you'll have to find which 10-second segment of the original file the second file is taken from.
To achieve this, you could use a simple sliding window (start with the data from seconds 0 through 10, then 1 through 11, and so on, comparing each window to the second file), or you could chop the second file into even smaller chunks and combine the initial sliding window with techniques from string searching.
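A rough sketch of the sliding-window variant could look like the following. It assumes NumPy, compares normalised magnitude spectra with a cosine similarity (my choice of metric, other measures would work too), and uses hypothetical names (`magnitude_spectrum`, `best_offset`) plus the synthetic four-tone signal from my example rather than real audio.

```python
import numpy as np


def magnitude_spectrum(x):
    """Magnitude of the real FFT, scaled to unit Euclidean length."""
    mag = np.abs(np.fft.rfft(x))
    norm = np.linalg.norm(mag)
    return mag / norm if norm > 0 else mag


def best_offset(long_signal, excerpt, fs, hop_seconds=1.0):
    """Slide a window of len(excerpt) over long_signal in steps of
    hop_seconds and return (offset_in_seconds, similarity) of the best match."""
    target = magnitude_spectrum(excerpt)
    win = len(excerpt)
    hop = int(hop_seconds * fs)
    best = (0.0, -1.0)
    for start in range(0, len(long_signal) - win + 1, hop):
        candidate = magnitude_spectrum(long_signal[start:start + win])
        similarity = float(np.dot(target, candidate))  # cosine similarity
        if similarity > best[1]:
            best = (start / fs, similarity)
    return best


# Synthetic test data: four 10-second sines at 20, 40, 60 and 80 Hz.
fs = 1000  # assumed sample rate in Hz
t = np.arange(10 * fs) / fs
long_signal = np.concatenate([np.sin(2 * np.pi * f * t) for f in (20, 40, 60, 80)])

# Pretend the second file is the stretch from second 12 to second 22.
excerpt = long_signal[12 * fs:22 * fs]
offset, similarity = best_offset(long_signal, excerpt, fs)
print(offset, similarity)
```

On this clean synthetic data the search lands on the 12-second offset. With real recordings (noise, different encodings, an excerpt that doesn't start on a whole second) you would probably need a smaller hop and a similarity threshold instead of an exact match, and the cost grows with one FFT per window position.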