
I am using a library called EXIFextractor to extract metadata from images. This library relies in part on System.Drawing.Imaging.PropertyItem to do the hard work. According to the Microsoft documentation, some of the data in PropertyItem, such as the image description, is fetched as an ASCII string stored in a byte[].

My problem is that international characters (å, ä, ö, etcetera) are dropped and replaced by question marks. When I debug the code it is apparent that the byte[] actually holds UTF-8-encoded data.

I'd like to parse the byte[] as a UTF-8 string. How can I do this without losing any information in the process?

Thanks in advance!


Update:

I have been asked to provide a snippet from my code:

The first snippet is from the class I use, namely EXIFextractor.cs, written by Asim Goheer:

foreach (System.Drawing.Imaging.PropertyItem p in parr)
{
    string v = "";

    // ...

    else if (p.Type == 0x2)
    {
        // EXIF type 2: ASCII string; `ascii` is a System.Text.ASCIIEncoding instance
        v = ascii.GetString(p.Value);
    }

And this is my code, where I try my best to handle the results of the above:

try
{
    EXIFextractor exif = new EXIFextractor(ref bmp, "");
    object o;
    if ((o = exif["Image Description"]) != null)
        MediaFile.Description = Tools.UTF8Encode(o.ToString());

I have also tried a couple of other ways of getting my precious å, ä and ö from the data, but nothing seems to do the trick. I am starting to think Hans Passant is right in the conclusions he draws in his answer below.

dotmartin
  • If the information is read using ASCII encoding, any non-ASCII characters will not be read correctly as a consequence. This reading of characters with an encoding and then writing to a byte array doesn't sound right. Can you link to the documentation that states this is the case? – Paul Turner Aug 04 '10 at 14:50
  • Here it is, if I am allowed to post another hyperlink :) http://msdn.microsoft.com/en-us/library/system.drawing.imaging.propertyitem.type.aspx – dotmartin Aug 05 '10 at 09:30

4 Answers

string yourText = System.Text.Encoding.UTF8.GetString(yourByteArray);
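To see this in action, here is a tiny self-contained version; the byte values are illustrative (UTF-8 bytes for "åäö"), not taken from the question:

```csharp
using System;
using System.Text;

class Utf8Decode
{
    static void Main()
    {
        // UTF-8 bytes for "åäö": each of these characters encodes to two bytes.
        byte[] yourByteArray = { 0xC3, 0xA5, 0xC3, 0xA4, 0xC3, 0xB6 };
        string yourText = Encoding.UTF8.GetString(yourByteArray);
        Console.WriteLine(yourText); // åäö
    }
}
```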
teedyay
Scoregraphic
  • Thanks for the swift answer. However, I have already tried this. No luck. I am starting to wonder if the sources (image files) are correctly encoded in the first place. – dotmartin Aug 04 '10 at 14:17
  • If you can share an example, we may check or try on our own. – Scoregraphic Aug 05 '10 at 05:09
  • Of course. Since I am new at this, shall I provide it as an answer or in a comment, or what is the preferred way of doing this? – dotmartin Aug 05 '10 at 09:06
  • You should edit and update your question. A bold "Update" label in the text with the "new" stuff should do. – Scoregraphic Aug 05 '10 at 09:21
  • Please see my comment in Hans Passant's answer – Scoregraphic Aug 05 '10 at 09:44
  • Alright, this seems to be the solution after all. Sort of, at least. I was just a bit off regarding the encoding: the metadata seems to be encoded using ISO-8859-1, which makes sense since we are using Windows across all our sites. So I simply create an encoder: Encoding enc = Encoding.GetEncoding("ISO-8859-1"); Then I use it to decode the byte array: v = enc.GetString(p.Value, 0, p.Len - 1); where p is the PropertyItem. This seems to work! Thanks for all your help! I am impressed by your enthusiasm and your helpfulness. I sure hope I can contribute in the same way! Again, thanks! – dotmartin Aug 05 '10 at 11:50

Use the GetString method on the Encoding.UTF8 object.

Tim Robinson

Yes, this is a problem with the app or camera that originated the image. The EXIF standard has horrible support for text: it has to be encoded in ASCII, which only ever works out well when the photographer speaks English. No doubt the software that encoded the image ignored this requirement. The PropertyItem class ignores it as well: it encodes a string to byte[] with Marshal.StringToHGlobalAnsi(), which assumes the system's default code page.

There's no obvious fix for this, you'll get mojibake when the photo was made too far away from your machine.
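That said, if you do know (or can guess) which legacy code page the writer used, decoding the raw bytes with that code page instead of ASCII recovers the text. A minimal sketch, with illustrative byte values rather than anything from the answer:

```csharp
using System;
using System.Text;

class ExifTextDecode
{
    static void Main()
    {
        // Bytes for "Räksmörgås" as an ISO-8859-1 (Latin-1) application would write them.
        byte[] raw = { 0x52, 0xE4, 0x6B, 0x73, 0x6D, 0xF6, 0x72, 0x67, 0xE5, 0x73 };

        // Decoding as ASCII replaces every byte >= 0x80 with '?'.
        string asAscii = Encoding.ASCII.GetString(raw);

        // Decoding with the code page the writer actually used recovers the text.
        string asLatin1 = Encoding.GetEncoding("ISO-8859-1").GetString(raw);

        Console.WriteLine(asAscii);   // R?ksm?rg?s
        Console.WriteLine(asLatin1);  // Räksmörgås
    }
}
```

The hard part, as the answer notes, is knowing which code page that was; nothing in the file records it.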

Hans Passant
  • This was what I expected. However, I was still hoping that Photoshop and the built-in XMP tool would be able to get things straight. Are there any suggestions on what one could do to resolve the issue? My company has a lot of files with bad encoding, so a batch processor would be preferred. – dotmartin Aug 05 '10 at 06:23
  • Is it still true that in the byte-array all bytes are correct according to your locale? If it is, you may try encoding/decoding using your locale instead of UTF8 / ascii. See http://msdn.microsoft.com/en-us/library/system.text.encoding.getencoding.aspx – Scoregraphic Aug 05 '10 at 09:44
  • I downloaded an application called GeoSetter, which is used to geotag photos, but it also has the capability to read and write EXIF and IPTC metadata. It tells me that the metadata is UTF-8 encoded and displays the Swedish characters correctly. – dotmartin Aug 05 '10 at 10:54
  • I wonder if you could add an example of such a picture (if allowed). You may edit the picture as well, as long as the EXIF data is still written. – Scoregraphic Aug 05 '10 at 11:04
  • I might be on the right course towards a solution. I have managed to edit the EXIFextractor class to translate the byte-array to a correctly encoded string right away. I will conduct some more research and soon be able to tell if my theories holds up! – dotmartin Aug 05 '10 at 11:43

Maybe you could try another encoding? UTF-16 (Encoding.Unicode)? If you aren't sure it was encoded correctly in the first place, try viewing the EXIF metadata with another EXIF reader.
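A quick diagnostic along these lines (a hypothetical helper with illustrative byte values, not part of the answer) is to decode the same bytes with several candidate encodings and eyeball which one produces sensible text:

```csharp
using System;
using System.Text;

class EncodingProbe
{
    // Decode the same raw EXIF bytes with several candidate encodings
    // so you can inspect which one yields readable text.
    static void Probe(byte[] raw)
    {
        string[] candidates = { "us-ascii", "utf-8", "utf-16", "ISO-8859-1" };
        foreach (string name in candidates)
        {
            Encoding enc = Encoding.GetEncoding(name);
            Console.WriteLine("{0,-12} -> {1}", name, enc.GetString(raw));
        }
    }

    static void Main()
    {
        // "åäö" as written by an ISO-8859-1 application.
        Probe(new byte[] { 0xE5, 0xE4, 0xF6 });
    }
}
```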

codymanix