Problem using parse_url() on extracted url from a csv through fgetcsv()

Question

Something quite strange is happening and I can't figure out where the problem is. I have a CSV file that I use to export data. It's filled with URLs and other stuff. I have extracted the URLs from it into the array $urlsOfCsv.

I extract the CSV lines into an array this way:

        $request->file('file')->move(public_path('uploads/temp/'),'tempcsv.csv');

        $file = fopen(public_path('uploads/temp/').'tempcsv.csv',"r");

        $lines = [];

        fgetcsv($file, 10000, ",");
        $o=0;
        while (($data = fgetcsv($file, 0, "\t")) !== FALSE) {
            $lines[$o]= $data;
            $o++;
        }
        fclose($file);
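        // delete by path: $file is an already-closed handle, not a filename
        File::delete(public_path('uploads/temp/').'tempcsv.csv');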


        $urlsOfCsv = array_column($lines,0);

But I can't extract the domain with parse_url(), because I'm getting this strange result:

        foreach ($urlsOfCsv as $url) {
            var_dump($url);
            var_dump(parse_url($url));
        }

This gives me results like these:

string(41) "https://www.h4d.com/" array(1) { ["path"]=> string(41) "_h_t_t_p_s_:_/_/_w_w_w_._h_4_d_._c_o_m_/_" }
string(73) "https://www.campussuddesmetiers.com/" array(1) { ["path"]=> string(73) "_h_t_t_p_s_:_/_/_w_w_w_._c_a_m_p_u_s_s_u_d_d_e_s_m_e_t_i_e_r_s_._c_o_m_/_" }
string(69) "http://altitoy-ternua.com/?lang=es" array(2) { ["path"]=> string(53) "_h_t_t_p_:_/_/_a_l_t_i_t_o_y_-_t_e_r_n_u_a_._c_o_m_/_" ["query"]=> string(15) "_l_a_n_g_=_e_s_" }
string(81) "https://www.opquast.com/communaute/jobs/" array(1) { ["path"]=> string(81) "_h_t_t_p_s_:_/_/_w_w_w_._o_p_q_u_a_s_t_._c_o_m_/_c_o_m_m_u_n_a_u_t_e_/_j_o_b_s_/_" }

I don't even get the 'host' key in the array.

Any idea why I get this result?

I tried a lot of things with regex and some other functions, but I get either empty results or nothing useful.

I suppose this has something to do with how the CSV is read, but I can't find where.

`string(41) "https://www.h4d.com/"` - the value `https://www.h4d.com/` does not look like it should consist of 41 bytes. So I am guessing you have some sort of character encoding problem here. — CBroe, Feb 01 '23 at 09:38
`["path"]=> string(41) "_h_t_t_p_s_:_/_/_w_w_w_._h_4_d_._c_o_m_/_"` - in that debug output it is even more obvious, that there must be extra bytes. Check the encoding of your CSV file. If that is not in UTF-8, you'll probably have to convert the character encoding first. — CBroe, Feb 01 '23 at 09:40
Yes, that was obviously it! I managed to find something that works just fine. Thank you very much! — Thomas Locatelli, Feb 01 '23 at 11:10
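
A quick way to confirm an encoding problem like the one described in the comments is to look at the raw bytes of the file. The sketch below is only illustrative: `guessCsvEncoding()` is a hypothetical helper (not part of PHP or Laravel), and its NUL-byte fallback assumes the content is mostly ASCII text such as URLs.

```php
<?php
// Hypothetical helper: guess the encoding of raw bytes from the byte-order mark,
// falling back to a NUL-byte heuristic when there is no BOM.
function guessCsvEncoding(string $bytes): string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    // No BOM: ASCII text stored as UTF-16 has a NUL in every other byte.
    if (strpos($bytes, "\0") !== false) {
        return $bytes[0] === "\0" ? 'UTF-16BE' : 'UTF-16LE';
    }
    return 'UTF-8'; // assumption: anything else is treated as UTF-8
}

// Build a small UTF-16LE sample with a BOM and inspect it.
$sample = "\xFF\xFE" . mb_convert_encoding('https://www.h4d.com/', 'UTF-16LE', 'UTF-8');
echo guessCsvEncoding($sample), PHP_EOL;       // UTF-16LE
echo bin2hex(substr($sample, 0, 8)), PHP_EOL;  // hex dump of the first 8 bytes
```

In the hex dump, a `00` after every character byte is the same pattern that shows up as the extra characters in the var_dump() output above.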

score 0 · Answer 1 · answered Feb 01 '23 at 11:08

Thanks to CBroe I managed to find the solution, which was indeed pretty obvious: my CSV had the wrong encoding. After a little research I found that the file was encoded in UTF-16. I converted the encoding this way (which is probably not optimal, given the nested loop):

while (($data = fgetcsv($file, 0, "\t")) !== FALSE) {
   for($i=0;$i<count($data);$i++){
      $data[$i] = mb_convert_encoding( $data[$i],'UTF-8','UTF-16');
   }
   $lines[$o]= $data;
   $o++;
}

And now it works just fine. parse_url() gives me the expected result (UrlParser::getDomain() also works for me).
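
A possible alternative to converting each field inside the loop is to convert the whole file to UTF-8 before parsing it. fgetcsv() works on bytes, so on UTF-16 input it splits in the middle of the two-byte tab and newline characters, and the per-field conversion then relies on how the leftover bytes happen to fall. The sketch below is a minimal illustration under assumptions: `readUtf16Tsv()` is a hypothetical helper name, the file is assumed to be small enough to load into memory, and the source encoding is passed in by the caller rather than detected.

```php
<?php
// Hypothetical helper: read a tab-separated file in a UTF-16 encoding
// and return its rows as arrays of UTF-8 strings.
function readUtf16Tsv(string $path, string $sourceEncoding = 'UTF-16'): array
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }

    // Convert everything first, so tabs and newlines are single bytes again.
    $utf8 = mb_convert_encoding($raw, 'UTF-8', $sourceEncoding);

    // Drop a UTF-8 BOM if the conversion left one at the start.
    if (strncmp($utf8, "\xEF\xBB\xBF", 3) === 0) {
        $utf8 = substr($utf8, 3);
    }

    // Parse the converted text with fgetcsv() through an in-memory stream.
    $stream = fopen('php://temp', 'r+');
    fwrite($stream, $utf8);
    rewind($stream);

    $rows = [];
    while (($row = fgetcsv($stream, 0, "\t", '"', '\\')) !== false) {
        if ($row !== [null]) { // fgetcsv() returns [null] for a blank line
            $rows[] = $row;
        }
    }
    fclose($stream);

    return $rows;
}

// Usage with a throwaway UTF-16LE file (sample data made up for the example).
$tmp = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($tmp, mb_convert_encoding("url\tname\nhttps://www.h4d.com/\tH4D\n", 'UTF-16LE', 'UTF-8'));
$rows = readUtf16Tsv($tmp, 'UTF-16LE');
unlink($tmp);

echo parse_url($rows[1][0], PHP_URL_HOST), PHP_EOL; // www.h4d.com
```

With the rows already in UTF-8, array_column($rows, 0) and parse_url() can be used exactly as in the question.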
