I want to search for specific data in a text file that contains accented letters. I used this code:

<?php
    $file = 'textfile.txt';
    $searchfor = 'key';


    // get the file contents, assuming the file to be readable (and exist)
    $contents = file_get_contents($file);
    // escape special characters in the query
    $pattern = preg_quote($searchfor, '/');
    // finalise the regular expression, matching the whole line
    $pattern = "/^.*$pattern.*\$/m";
    // search, and store all matching occurrences in $matches
    if(preg_match_all($pattern, $contents, $matches))
    {
       echo utf8_encode(implode("\n", $matches[0]));
    }
    else
    {
       echo utf8_encode("No matches found");
    }
?>

But it's case-sensitive and doesn't work with accented letters.

Can somebody help me please?

Thanks :)

Hiroo17

3 Answers

Add the i modifier to your current pattern to make it case-insensitive:

$pattern = "/^.*$pattern.*\$/mi";
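
One caveat, offered as a hedged sketch rather than a definitive fix: as far as I know, the i modifier on its own only folds ASCII letters when the subject is UTF-8, so a lowercase é in the query would not match an uppercase É. Adding the u modifier tells PCRE to treat the pattern and the subject as UTF-8. The sample text and the query below are made up for illustration, and the sketch assumes the data really is UTF-8:

```php
<?php
// Assumption: the text is UTF-8; in the question it would come from file_get_contents().
$contents = "Éric Cantona\nplain ascii line\ncafé ouvert\n";
$searchfor = 'é';   // lowercase query, should also match the uppercase É

// m: ^ and $ match at line boundaries, i: ignore case, u: treat pattern and subject as UTF-8
$pattern = '/^.*' . preg_quote($searchfor, '/') . '.*$/miu';

if (preg_match_all($pattern, $contents, $matches)) {
    echo implode("\n", $matches[0]), "\n";
} else {
    echo "No matches found\n";
}
```

This only makes é and É equivalent; it does not make a plain e match é. For that, the accents have to be stripped or transliterated first.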
Syed mohamed aladeen

You can use this to get all the strings that contain accented letters.

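preg_match_all("/\s+(.*?[Ç-Ü]+.*?)\s+/iu", $str, $matches);

[Ç-Ü] is the range of characters between Ç (U+00C7) and Ü (U+00DC). The hyphen is what makes it a range; without it, [ÇÜ] matches only those two letters. The u modifier makes PCRE treat the pattern and the subject as UTF-8, which is needed because these characters are multi-byte in UTF-8.

For more details about that range, check the Latin-1 Supplement block of the Unicode table (these characters are not part of ASCII).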
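
As a rough sketch of the same idea with an explicit range: the pattern below picks out every whitespace-delimited word that contains at least one accented Latin-1 letter. The sample sentence and the exact ranges (À to ÿ, skipping × and ÷) are my own assumptions, and the input is assumed to be UTF-8:

```php
<?php
// Assumption: $str is UTF-8 encoded.
$str = "Le garçon mange une crêpe à Zürich today";

// \S* grabs the rest of the word on either side of one accented letter.
// The three ranges cover U+00C0..U+00FF without the × and ÷ signs.
// u: treat pattern and subject as UTF-8, otherwise the ranges are read byte by byte.
preg_match_all('/\S*[À-ÖØ-öø-ÿ]\S*/u', $str, $matches);

print_r($matches[0]);
```

Letters outside Latin-1 (for example ł or ș) are not covered by these ranges; a broader class would be needed for them.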

zakaria35

@Hiroo17, here is one way to do this.

Suppose textfile.txt contains accented letters, like this:

Éric Cantona kÉy.

The script below handles the accented letters.

$file = 'textfile.txt';
$searchfor = 'key';
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    return mb_convert_encoding($content, 'UTF-8',
    mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}

function normalizeChars($s) {
    $replace = array(
    'ъ'=>'-', 'Ь'=>'-', 'Ъ'=>'-', 'ь'=>'-',
    'Ă'=>'A', 'Ą'=>'A', 'À'=>'A', 'Ã'=>'A', 'Á'=>'A', 'Æ'=>'A', 'Â'=>'A', 'Å'=>'A', 'Ä'=>'Ae',
    'Þ'=>'B',
    'Ć'=>'C', 'ץ'=>'C', 'Ç'=>'C',
    'È'=>'E', 'Ę'=>'E', 'É'=>'E', 'Ë'=>'E', 'Ê'=>'E',
    'Ğ'=>'G',
    'İ'=>'I', 'Ï'=>'I', 'Î'=>'I', 'Í'=>'I', 'Ì'=>'I',
    'Ł'=>'L',
    'Ñ'=>'N', 'Ń'=>'N',
    'Ø'=>'O', 'Ó'=>'O', 'Ò'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'Oe',
    'Ş'=>'S', 'Ś'=>'S', 'Ș'=>'S', 'Š'=>'S',
    'Ț'=>'T',
    'Ù'=>'U', 'Û'=>'U', 'Ú'=>'U', 'Ü'=>'Ue',
    'Ý'=>'Y',
    'Ź'=>'Z', 'Ž'=>'Z', 'Ż'=>'Z',
    'â'=>'a', 'ǎ'=>'a', 'ą'=>'a', 'á'=>'a', 'ă'=>'a', 'ã'=>'a', 'Ǎ'=>'a', 'а'=>'a', 'А'=>'a', 'å'=>'a', 'à'=>'a', 'א'=>'a', 'Ǻ'=>'a', 'Ā'=>'a', 'ǻ'=>'a', 'ā'=>'a', 'ä'=>'ae', 'æ'=>'ae', 'Ǽ'=>'ae', 'ǽ'=>'ae',
    'б'=>'b', 'ב'=>'b', 'Б'=>'b', 'þ'=>'b',
    'ĉ'=>'c', 'Ĉ'=>'c', 'Ċ'=>'c', 'ć'=>'c', 'ç'=>'c', 'ц'=>'c', 'צ'=>'c', 'ċ'=>'c', 'Ц'=>'c', 'Č'=>'c', 'č'=>'c', 'Ч'=>'ch', 'ч'=>'ch',
    'ד'=>'d', 'ď'=>'d', 'Đ'=>'d', 'Ď'=>'d', 'đ'=>'d', 'д'=>'d', 'Д'=>'D', 'ð'=>'d',
    'є'=>'e', 'ע'=>'e', 'е'=>'e', 'Е'=>'e', 'Ə'=>'e', 'ę'=>'e', 'ĕ'=>'e', 'ē'=>'e', 'Ē'=>'e', 'Ė'=>'e', 'ė'=>'e', 'ě'=>'e', 'Ě'=>'e', 'Є'=>'e', 'Ĕ'=>'e', 'ê'=>'e', 'ə'=>'e', 'è'=>'e', 'ë'=>'e', 'é'=>'e',
    'ф'=>'f', 'ƒ'=>'f', 'Ф'=>'f',
    'ġ'=>'g', 'Ģ'=>'g', 'Ġ'=>'g', 'Ĝ'=>'g', 'Г'=>'g', 'г'=>'g', 'ĝ'=>'g', 'ğ'=>'g', 'ג'=>'g', 'Ґ'=>'g', 'ґ'=>'g', 'ģ'=>'g',
    'ח'=>'h', 'ħ'=>'h', 'Х'=>'h', 'Ħ'=>'h', 'Ĥ'=>'h', 'ĥ'=>'h', 'х'=>'h', 'ה'=>'h',
    'î'=>'i', 'ï'=>'i', 'í'=>'i', 'ì'=>'i', 'į'=>'i', 'ĭ'=>'i', 'ı'=>'i', 'Ĭ'=>'i', 'И'=>'i', 'ĩ'=>'i', 'ǐ'=>'i', 'Ĩ'=>'i', 'Ǐ'=>'i', 'и'=>'i', 'Į'=>'i', 'י'=>'i', 'Ї'=>'i', 'Ī'=>'i', 'І'=>'i', 'ї'=>'i', 'і'=>'i', 'ī'=>'i', 'ĳ'=>'ij', 'Ĳ'=>'ij',
    'й'=>'j', 'Й'=>'j', 'Ĵ'=>'j', 'ĵ'=>'j', 'я'=>'ja', 'Я'=>'ja', 'Э'=>'je', 'э'=>'je', 'ё'=>'jo', 'Ё'=>'jo', 'ю'=>'ju', 'Ю'=>'ju',
    'ĸ'=>'k', 'כ'=>'k', 'Ķ'=>'k', 'К'=>'k', 'к'=>'k', 'ķ'=>'k', 'ך'=>'k',
    'Ŀ'=>'l', 'ŀ'=>'l', 'Л'=>'l', 'ł'=>'l', 'ļ'=>'l', 'ĺ'=>'l', 'Ĺ'=>'l', 'Ļ'=>'l', 'л'=>'l', 'Ľ'=>'l', 'ľ'=>'l', 'ל'=>'l',
    'מ'=>'m', 'М'=>'m', 'ם'=>'m', 'м'=>'m',
    'ñ'=>'n', 'н'=>'n', 'Ņ'=>'n', 'ן'=>'n', 'ŋ'=>'n', 'נ'=>'n', 'Н'=>'n', 'ń'=>'n', 'Ŋ'=>'n', 'ņ'=>'n', 'ŉ'=>'n', 'Ň'=>'n', 'ň'=>'n',
    'о'=>'o', 'О'=>'o', 'ő'=>'o', 'õ'=>'o', 'ô'=>'o', 'Ő'=>'o', 'ŏ'=>'o', 'Ŏ'=>'o', 'Ō'=>'o', 'ō'=>'o', 'ø'=>'o', 'ǿ'=>'o', 'ǒ'=>'o', 'ò'=>'o', 'Ǿ'=>'o', 'Ǒ'=>'o', 'ơ'=>'o', 'ó'=>'o', 'Ơ'=>'o', 'œ'=>'oe', 'Œ'=>'oe', 'ö'=>'oe',
    'פ'=>'p', 'ף'=>'p', 'п'=>'p', 'П'=>'p',
    'ק'=>'q',
    'ŕ'=>'r', 'ř'=>'r', 'Ř'=>'r', 'ŗ'=>'r', 'Ŗ'=>'r', 'ר'=>'r', 'Ŕ'=>'r', 'Р'=>'r', 'р'=>'r',
    'ș'=>'s', 'с'=>'s', 'Ŝ'=>'s', 'š'=>'s', 'ś'=>'s', 'ס'=>'s', 'ş'=>'s', 'С'=>'s', 'ŝ'=>'s', 'Щ'=>'sch', 'щ'=>'sch', 'ш'=>'sh', 'Ш'=>'sh', 'ß'=>'ss',
    'т'=>'t', 'ט'=>'t', 'ŧ'=>'t', 'ת'=>'t', 'ť'=>'t', 'ţ'=>'t', 'Ţ'=>'t', 'Т'=>'t', 'ț'=>'t', 'Ŧ'=>'t', 'Ť'=>'t', '™'=>'tm',
    'ū'=>'u', 'у'=>'u', 'Ũ'=>'u', 'ũ'=>'u', 'Ư'=>'u', 'ư'=>'u', 'Ū'=>'u', 'Ǔ'=>'u', 'ų'=>'u', 'Ų'=>'u', 'ŭ'=>'u', 'Ŭ'=>'u', 'Ů'=>'u', 'ů'=>'u', 'ű'=>'u', 'Ű'=>'u', 'Ǖ'=>'u', 'ǔ'=>'u', 'Ǜ'=>'u', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'У'=>'u', 'ǚ'=>'u', 'ǜ'=>'u', 'Ǚ'=>'u', 'Ǘ'=>'u', 'ǖ'=>'u', 'ǘ'=>'u', 'ü'=>'ue',
    'в'=>'v', 'ו'=>'v', 'В'=>'v',
    'ש'=>'w', 'ŵ'=>'w', 'Ŵ'=>'w',
    'ы'=>'y', 'ŷ'=>'y', 'ý'=>'y', 'ÿ'=>'y', 'Ÿ'=>'y', 'Ŷ'=>'y',
    'Ы'=>'y', 'ž'=>'z', 'З'=>'z', 'з'=>'z', 'ź'=>'z', 'ז'=>'z', 'ż'=>'z', 'ſ'=>'z', 'Ж'=>'zh', 'ж'=>'zh'
    );
    return  strtr($s, $replace);
}

$contents = file_get_contents_utf8($file);
$contents = normalizeChars($contents);

// escape special characters in the query
$pattern = preg_quote($searchfor, '/');
// finalise the regular expression, matching the whole line (m: multiline, i: case-insensitive)
$pattern = "/^.*$pattern.*\$/mi";

// search, and store all matching occurrences in $matches
if (preg_match_all($pattern, $contents, $matches))
{
    // $contents is already UTF-8 here, so calling utf8_encode again would double-encode it
    echo implode("\n", $matches[0]);
}
else
{
    echo "No matches found";
}

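Here is why I use this method. file_get_contents returns the raw bytes of the file and knows nothing about their encoding, so a file saved as ISO-8859-1 does not come back as valid UTF-8. To handle this, you can either use mb_convert_encoding, as in the file_get_contents_utf8 function above, or, if you know the file is ISO-8859-1, call utf8_encode directly: $contents = utf8_encode(file_get_contents($file)); (note that utf8_encode is deprecated as of PHP 8.2).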
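
To illustrate the encoding step in isolation, here is a minimal sketch. It assumes the mbstring extension is loaded and uses a hand-built ISO-8859-1 byte string in place of a real file, so the variable names and sample bytes are illustrative only:

```php
<?php
// "kÉy" as it would be stored in an ISO-8859-1 file: É is the single byte 0xC9.
$latin1 = "k\xC9y";

// The raw bytes are not valid UTF-8.
$isUtf8 = mb_check_encoding($latin1, 'UTF-8');

// Convert explicitly, naming the source encoding instead of guessing it.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

var_dump($isUtf8);                   // bool(false)
var_dump($utf8 === "k\xC3\x89y");    // bool(true): É is now the two bytes C3 89
```

If the source encoding is known, naming it explicitly like this avoids relying on detection, which can only guess.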

I then use the normalizeChars function to strip the accents. It relies on strtr, which translates characters or replaces substrings according to a replacement map.
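
A small sketch of how strtr behaves with such a map. The three-entry map and the sample lines are hypothetical stand-ins for the full table and the file contents. As a variation on the script above, this version compares against the accent-free copy but keeps the original line for output:

```php
<?php
// Hypothetical three-entry map; the full table in normalizeChars() works the same way.
$map = ['É' => 'E', 'é' => 'e', 'ü' => 'ue'];

$lines = ['Éric Cantona kÉy.', 'für Elise', 'no match here'];
$searchfor = 'key';

$found = [];
foreach ($lines as $line) {
    // strtr() with an array replaces every key it finds with the mapped value
    $plain = strtr($line, $map);
    // compare on the accent-free copy, but keep the original line for output
    if (stripos($plain, $searchfor) !== false) {
        $found[] = $line;
    }
}

print_r($found);
```

Keeping the original line means the accents are still there when the result is displayed.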

I hope this will resolve your issue.

Thanks again, @Barmar, for reopening this question. I hope my answer does not disappoint.

Manish