Get char encoding information in scala

Question

import scala.io.Source

def checkCodec(filename: String): String = {
  val bufferedSource = Source.fromFile(filename)
  val codec: String = bufferedSource.codec.toString
  println("bufferedSource.codec - " + bufferedSource.codec)
  bufferedSource.close()
  if (codec.equalsIgnoreCase("UTF-8")) {
    return filename + " " + codec
  } else {
    return "CodecErrorDetected"
  }
}

val validFile = checkCodec(fileName)

println("The file is - " + validFile)

This function runs fine and returns "UTF-8" even when the file is a .zip, has an incorrect file format, or is corrupted (I used https://pinetools.com/corrupt-file-generator). How can I at least distinguish a corrupted file? For example, I changed a PDF file to a .pddssee format, which doesn't even exist, and it is still recognized as a UTF-8 file. I need help understanding how to detect a corrupted file using Scala. Is this the correct way to check for a corrupt file?

Will appreciate your valuable input.

A [codec](https://en.wikipedia.org/wiki/Codec) ("coder-decoder") is not the same thing as a file format. A file's format is the set of rules by which the bits and bytes are organized. There are hundreds (thousands?) of different file formats. Changing a file's name won't change its format. If I have a file `cat.png` and I rename it to `dog.jpg` it is still a picture of a cat in the PNG format. To identify a corrupted file you probably need to read it via a library that understands/expects the intended format. Something like [PDFBox](http://www.pdfbox.org) for example. — jwvh, Mar 11 '21 at 09:42
@jwvh Thanks for your reply. I was trying to change the file format (not the file name) from .pdf to .pddssee (to make it an unreadable or corrupt file). But the above function returns UTF-8 every time, even for the corrupt file. — divyank khandelwal, Mar 11 '21 at 15:52
How do you change a file from a known format (PDF) to a made-up format? How do you play a game with no rules and a made-up name? How would you know if you were cheating? The corrupt-file-generator offers different file extensions (`.pdf`, `.mp3`) but that's just part of the file name. It doesn't alter the random-byte generation. [UTF-8](https://en.wikipedia.org/wiki/UTF-8) is the character set used by the `BufferedSource`. It's an enhancement from the ASCII character set and is unrelated to the file format being read from or written to. — jwvh, Mar 11 '21 at 21:56
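To illustrate the point that the codec is something you choose rather than something detected from the file: a rough sketch, assuming the goal is only to check whether a file's bytes are well-formed UTF-8. The helper names `isValidUtf8` and `fileIsValidUtf8` are made up for this example. It uses the JDK's `CharsetDecoder` configured to report errors instead of silently replacing bad bytes. Note that passing this check does not prove a file is intact, only that its bytes decode as UTF-8.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}
import java.nio.file.{Files, Paths}

// Hypothetical helper: true only if the bytes form well-formed UTF-8.
def isValidUtf8(bytes: Array[Byte]): Boolean = {
  // REPORT makes the decoder throw on bad input instead of substituting U+FFFD.
  val decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPORT)
    .onUnmappableCharacter(CodingErrorAction.REPORT)
  try {
    decoder.decode(ByteBuffer.wrap(bytes))
    true
  } catch {
    case _: CharacterCodingException => false
  }
}

// Hypothetical convenience wrapper: reads the whole file into memory,
// so it is only suitable for files of modest size.
def fileIsValidUtf8(filename: String): Boolean =
  isValidUtf8(Files.readAllBytes(Paths.get(filename)))
```

Binary formats such as ZIP will usually fail this check because they contain byte sequences that are not legal UTF-8, but a corrupted file made of plain ASCII bytes would still pass.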
The biggest issue with the sample code is the use of `return`. Insert smiley. Detecting file formats is always tricky. You check the name and then probe the data. To answer the question narrowly, the codec just says how you expect to convert bytes to chars. Also thanks @jwvh for supporting folks with questions. — som-snytt, Mar 15 '21 at 00:01
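The "probe the data" step mentioned above might look something like the following sketch. It assumes that checking a file's leading signature bytes is enough for a first pass; the names `startsWithSignature`, `readHeader`, `looksLikePdf`, `PdfSignature` and `ZipSignature` are invented for illustration. A matching signature only suggests the format. Confirming that a PDF is not corrupt would still need a format-aware library such as PDFBox.

```scala
import java.io.FileInputStream
import java.nio.charset.StandardCharsets

// PDF files begin with the ASCII text "%PDF-".
val PdfSignature: Array[Byte] = "%PDF-".getBytes(StandardCharsets.US_ASCII)
// Non-empty ZIP archives normally begin with "PK" followed by 0x03 0x04.
val ZipSignature: Array[Byte] = Array[Byte](0x50, 0x4B, 0x03, 0x04)

// Hypothetical helper: does the header begin with the given signature?
def startsWithSignature(header: Array[Byte], signature: Array[Byte]): Boolean =
  header.length >= signature.length &&
    header.take(signature.length).sameElements(signature)

// Hypothetical helper: read up to n leading bytes of a file.
def readHeader(filename: String, n: Int = 8): Array[Byte] = {
  val in = new FileInputStream(filename)
  try {
    val buf = new Array[Byte](n)
    val read = in.read(buf)
    if (read <= 0) Array.emptyByteArray else buf.take(read)
  } finally in.close()
}

// The file extension is ignored on purpose: only the content is probed.
def looksLikePdf(filename: String): Boolean =
  startsWithSignature(readHeader(filename), PdfSignature)
```

With this approach, a PDF renamed to `.pddssee` would still be reported as a PDF, while a file of random bytes carrying a `.pdf` extension most likely would not.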
