
I'm parsing CSV files, and I might sometimes run into invalid files, such as JPEG or PDF files.

So when I parse the file content, I want to determine whether each char is legal (something that could have been typed on a keyboard), like a 5 & % ! and so on.

But not chars like � ַ and other odd chars that show up inside images, PDFs, and other binary files.

I don't want to check the MIME type of the file, and I'd prefer not to add several third-party JARs to solve this problem. I want to decide whether the file being parsed is valid by looking at its chars.

Is there something similar to Character.isLetterOrDigit that can tell whether a char was typed on a keyboard or is some odd char like � ַ?

One more thing: I need to accept chars from various languages (not only English), so I want to avoid plain char comparisons like c < 32 || c > 126.


By the way, more generally I'm looking for an answer to the problem described in this question: CSV file validation with Java

Daniel

1 Answer


If you're looking for a built-in function, I don't know of one. You can, however, look at the char's ASCII value and filter to your liking. Check out this ASCII table for the values.

For example, you can reject any char whose ASCII value is < 32 or > 126 and accept everything else:

public boolean isValid(char c) {
    // Printable ASCII runs from 32 (space) to 126 (~); anything outside that range is rejected.
    return c >= 32 && c <= 126;
}
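
If other languages need to pass, a fixed ASCII range is too narrow. A rough sketch of a Unicode-aware variant follows, assuming the file has already been decoded into a String (for example as UTF-8). The class and method names are made up for illustration. It relies only on Character.getType and rejects control, unassigned, surrogate, and private-use code points, plus U+FFFD, the replacement character a decoder emits for invalid bytes. It is a heuristic, not proof: binary data can still decode to valid letters or combining marks by chance.

```java
// Hypothetical helper, not a JDK class: flags code points that are unlikely in typed text.
final class PrintableChars {
    private PrintableChars() {}

    /** Returns true if the code point looks like ordinary text rather than binary debris. */
    static boolean isTextualCodePoint(int codePoint) {
        if (codePoint == '\t' || codePoint == '\n' || codePoint == '\r') {
            return true; // whitespace controls that legitimately occur in CSV files
        }
        if (codePoint == 0xFFFD) {
            return false; // replacement character: the decoder hit invalid bytes
        }
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:     // e.g. NUL, DEL
            case Character.UNASSIGNED:  // no character defined at this code point
            case Character.SURROGATE:   // unpaired surrogate
            case Character.PRIVATE_USE:
                return false;
            default:
                return true;            // letters, marks, digits, punctuation, symbols, spaces
        }
    }

    /** Returns true if every code point in the string passes the check above. */
    static boolean isTextual(String s) {
        return s.codePoints().allMatch(PrintableChars::isTextualCodePoint);
    }
}
```

Calling PrintableChars.isTextual(line) on each line as it is read lets you stop parsing at the first line that fails.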

If you operate on an entire line/String, you might be able to use this to strip away your valid characters and determine if any invalid characters remain:

public boolean isValid(String s) {
    return s.replaceAll("\\w|\\p{Punct}", "").length() == 0;
}
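
Note that \w and \p{Punct} match only ASCII by default in Java, and neither matches a space, so the pattern above rejects lines containing spaces or non-English letters. A possible Unicode-aware variant is sketched below, using the general-category classes that java.util.regex supports. The class name is hypothetical, and the choice of allowed categories is an assumption you may want to tune. U+FFFD is excluded explicitly because it falls into the symbol category.

```java
import java.util.regex.Pattern;

// Hypothetical helper: whole-line check based on Unicode general categories.
final class UnicodeLineCheck {
    private UnicodeLineCheck() {}

    // Letters, combining marks, numbers, punctuation, symbols, space separators, and tab.
    private static final Pattern ALLOWED =
            Pattern.compile("[\\p{L}\\p{M}\\p{N}\\p{P}\\p{S}\\p{Zs}\\t]*");

    /** Returns true if the line contains only allowed characters and no U+FFFD. */
    static boolean isValid(String line) {
        return !line.contains("\uFFFD") && ALLOWED.matcher(line).matches();
    }
}
```

The line is expected without its trailing line terminator, which is how BufferedReader.readLine returns it.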
cklab
  • Thanks, I'm aware of this approach, but I'd prefer a built-in method (if one exists). Also, I might run into chars from different languages, so their ASCII codes may vary... – Daniel Jul 18 '12 at 19:26
  • @Daniel You should have no problems with different language chars using this method (I'm assuming you mean Chinese characters and the like?). You are specifically defining your _accepted_ range of values. Values not in that range will be "invalid", different languages included! Furthermore, if you operate on `String`s as opposed to `char`s, a regex might help. I've added another function, though I don't really know the corner cases for this one, if any. – cklab Jul 18 '12 at 19:44
  • One example: ASCII codes ...125 and 126 are valid chars and 127 is DEL (not a real char), while 128, 129... are valid letters. For this reason, and because I don't want to mess with ASCII codes, I'd prefer a ready-made, proven API. – Daniel Jul 18 '12 at 19:52