The character codes used for text in a PDF file need not have any direct relationship to any standard character encoding. Here's what the PDF contains for the bit of text you are pointing at:
/F1.0 1 Tf (these houses ) Tj ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 235.8 375
Tm /F2.1 1 Tf (7) Tj ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 242.2346 375 Tm
/F1.0 1 Tf ( ) Tj ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 244.9846 375 Tm /F2.1
1 Tf [ (!) 0.2 ("#) -0.3 ($) ] TJ ET Q q 1 0 0 -1 0 792 cm BT 11 0 0 -11 235.8 406
Now Tf selects a font (and point size), and Tj draws text. BT and ET mean Begin Text Block and End Text Block, q and Q mean gsave and grestore, cm is concatmatrix, Tm is set text matrix, and TJ is another way to draw text.
You can ignore most of these.
Looking at just the important bits we have:
/F1.0 1 Tf (these houses ) Tj
/F2.1 1 Tf (7) Tj
/F1.0 1 Tf ( ) Tj
/F2.1 1 Tf [ (!) 0.2 ("#) -0.3 ($) ] TJ
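If you want to pull those font/string pairs out yourself, here is a rough Python sketch. The function name `strings_by_font` and the regular expression are my own illustration, and it only copes with simple fragments like this one: it assumes literal strings with no escapes, nested parentheses or hex strings, which a real content stream parser would have to handle.

```python
import re

# Content stream fragment copied from above (graphics-state operators removed).
content = r'''
/F1.0 1 Tf (these houses ) Tj
/F2.1 1 Tf (7) Tj
/F1.0 1 Tf ( ) Tj
/F2.1 1 Tf [ (!) 0.2 ("#) -0.3 ($) ] TJ
'''

# Matches a font selection (/Name size Tf), a single string shown with Tj,
# or an array of strings and kerning numbers shown with TJ.
TOKEN = re.compile(r'/(\S+)\s+[\d.]+\s+Tf|\(([^()]*)\)\s*Tj|\[(.*?)\]\s*TJ', re.S)

def strings_by_font(stream):
    """Return a list of (font name, raw string) pairs in drawing order."""
    runs = []
    font = None
    for m in TOKEN.finditer(stream):
        if m.group(1) is not None:
            font = m.group(1)                      # Tf: remember the current font
        elif m.group(2) is not None:
            runs.append((font, m.group(2)))        # Tj: one string
        else:
            # TJ: keep the string pieces, drop the kerning adjustments.
            pieces = re.findall(r'\(([^()]*)\)', m.group(3))
            runs.append((font, ''.join(pieces)))
    return runs

for font, text in strings_by_font(content):
    print(font, repr(text))
```

The strings it reports for F2.1 are just character codes; they only become Devanagari once they are run through that font's mapping.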
Now you can see that the text in the font named 'F1.0' is encoded using ASCII (more or less). This font is AGaramondPro-Regular, using MacRomanEncoding:
8 0 obj
<<
/Type /Font
/Subtype /Type1
/BaseFont /GFJJBF+AGaramondPro-Regular
/FontDescriptor 54 0 R
/Widths 55 0 R
/FirstChar 32
/LastChar 169
/Encoding /MacRomanEncoding
>>
endobj
The text using font 'F2.1' is your Devanagari font, defined as:
10 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /MWSGSJ+DevanagariMT
/FontDescriptor 48 0 R
/Widths 49 0 R
/FirstChar 33
/LastChar 105
/ToUnicode 50 0 R
>>
endobj
Notice this has no Encoding, but it does have a ToUnicode entry. Essentially this means the font has a non-standard custom Encoding. The subset font is defined in such a way that the character code maps directly to a specific glyph in the font's 'glyf' table (it's a TrueType font). Because it's not a standard Encoding, there's no way to know what the character codes 'mean'. However, the ToUnicode CMap is intended to give you a mapping from character code to Unicode code point.
The ToUnicode CMap is Acrobat's (and other viewers') first and best way to extract text. A properly constructed ToUnicode CMap should give you a direct Unicode code point from a given character code. The CMap in the file is:
50 0 obj
<<
/Length 913
>>
stream
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo <<
/Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00><FF>
endcodespacerange
39 beginbfrange
<21><21><092e>
<22><22><0915>
<23><23><093e>
<24><24><0928>
<25><25><092c>
<26><26><095c>
<27><27><0938>
<2a><2a><0926>
<2b><2b><0930>
<2c><2c><091b>
<2d><2d><094b>
<2e><2e><091f>
<2f><2f><090f>
<32><32><0924>
<33><33><0940>
<34><34><092f>
<35><35><0939>
<36><36><0935>
<39><39><0906>
<3a><3a><0932>
<3e><3e><092a>
<46><46><0905>
<49><49><095b>
<4a><4a><095a>
<4b><4b><091a>
<51><51><0917>
<52><52><091c>
<58><58><0920>
<5a><5b><095d>
<5c><5c><0959>
<5d><5d><0914>
<60><60><0921>
<61><61><094c>
<62><62><092d>
<63><63><0936>
<64><64><093f>
<65><65><0916>
<66><66><0907>
<68><68><0927>
endbfrange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
endstream
endobj
Taking the first line:
<21><21><092e>
That means the character codes from 0x21 to 0x21 map to Unicode code points starting at 0x092e. Obviously that's a single character code, but it could be a range.
Now you'll note that the CMap has 'holes' in the ranges. For instance, there are no entries for 0x28 and 0x29.
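To make the range semantics concrete, here is a minimal Python sketch that expands bfrange lines into a code-to-Unicode table and lists the holes. The helper name `parse_bfranges` is mine, and it assumes the simple `<lo><hi><dst>` form with a single code point as destination, which is all this CMap uses; the array form and bfchar sections are not handled.

```python
import re

# A few bfrange lines copied from the CMap above.
cmap_text = """
<21><21><092e>
<22><22><0915>
<23><23><093e>
<24><24><0928>
<25><25><092c>
<26><26><095c>
<27><27><0938>
<2a><2a><0926>
<5a><5b><095d>
"""

TRIPLE = re.compile(r'<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>')

def parse_bfranges(text):
    """Return {character code: Unicode code point} for <lo><hi><dst> lines."""
    mapping = {}
    for lo, hi, dst in TRIPLE.findall(text):
        start, end, base = int(lo, 16), int(hi, 16), int(dst, 16)
        # Each code in lo..hi maps to dst, dst+1, ... in order.
        for offset in range(end - start + 1):
            mapping[start + offset] = base + offset
    return mapping

to_unicode = parse_bfranges(cmap_text)

# Codes between 0x21 and 0x2a that have no entry at all.
holes = [hex(code) for code in range(0x21, 0x2b) if code not in to_unicode]
print(holes)
```

Note how the one genuine range, `<5a><5b><095d>`, yields two entries: 0x5a maps to 0x095d and 0x5b to 0x095e.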
So taking your text, the characters are 7, !, ", #, $. Or, in hex, 0x37, 0x21, 0x22, 0x23, 0x24. (You can see how the codes have been chosen: they are handed out sequentially as characters are first used, so the code to glyph mapping depends on the order the characters appear. In this font the codes start at 0x21, which is the FirstChar value of 33.)
So we run those numbers through the ToUnicode CMap, 0x37 maps to... Oops! There is no entry in the CMap for character code 0x37! 0x21 maps to 0x092e, 0x22 to 0x0915, 0x23 to 0x093e and 0x24 maps to 0x0928.
So the latter four characters copy and paste correctly. Acrobat (and any other viewer) doesn't know what to do with character code 0x37, so it does the best it can and falls back to good old ASCII in the hope that it might be right. That is why the initial pasted character is a 7: that's 0x37 in ASCII.
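The lookup just described can be sketched in a few lines of Python. This is only an illustration of the behaviour as I've described it, not how Acrobat is actually implemented; `extract_text` is a made-up name and the fallback rule (treat an unmapped code as its own code point) is my assumption about what the viewer is doing.

```python
# The four relevant ToUnicode entries: character code -> Unicode code point.
TO_UNICODE = {0x21: 0x092E, 0x22: 0x0915, 0x23: 0x093E, 0x24: 0x0928}

def extract_text(codes, to_unicode):
    """Map character codes to text; return (text, list of unmapped codes)."""
    out = []
    missing = []
    for code in codes:
        if code in to_unicode:
            out.append(chr(to_unicode[code]))
        else:
            # No ToUnicode entry: guess that the code is plain ASCII.
            missing.append(code)
            out.append(chr(code))
    return ''.join(out), missing

# The five character codes used with font F2.1 in the fragment.
text, missing = extract_text(b'7!"#$', TO_UNICODE)
print(text)
print([hex(code) for code in missing])
```

The `missing` list is exactly the set of codes that would need new entries in the ToUnicode CMap.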
So that's your problem: the ToUnicode CMap does not contain a mapping to Unicode code points for all the character codes which are used in the PDF file. This is a fault of the PDF creation tool, Mac OS X 10.6 Quartz PDFContext, or (since the file has been modified) the editing application, 'Pages'.
How can you fix this? Well, you could hand edit the ToUnicode CMap and add entries for each missing character code. That would be a laborious process, because first you'd have to identify each character code in the text and figure out what its Unicode code point is. Also, PDF is a binary format with a cross-reference table. If you make any insertions in the file then the xref table will be invalid and the PDF file effectively corrupted. Some viewers will be able to fix it, some won't.
As I hinted above, a custom-encoded subset font is normally created so that the first character used in the document is given the character code 1, the second is 2 and so on. So for each document the actual mapping will be unique, and it's not going to be possible to write some code to reliably do this for you, because there is no 'one size fits all' mapping.
Basically you need to remake the PDF file using software which embeds a correct ToUnicode CMap in the PDF file.