0

I'm trying to load pre-trained word embeddings for the Arabic language (Mazajak embeddings: http://mazajak.inf.ed.ac.uk:8000/). The embeddings file does not have a particular extension and I'm struggling to get it to load. What's the usual process to load these embeddings?

I've tried doing with open("get_sg250", encoding = encoding) as file: file.readlines() for different encodings but it seems like none of them are the answer (utf-8 does not work at all), if I try windows-1256 I get gibberish:

e.g.

['8917028 300\n',
 '</s> Hل®:0\x16ء:؟X§؛R8ڈ؛\xa0سî9K\u200fƒ::m¤9¼»“8¤p\u200c؛tعA:UU¾؛“_ع9‚Nƒ¹®G§¹قفگ؛ww$؛\u200eba:\x14.„:R¸پ:0–\x0b:–ü\x06:×#¦؛Yٍ²؛m ظ:{\x14¦:µ\x01‡:ه\x17S¹Yr¯:j\x03-¹ff€9×£P¸\n',
 'W‚؛UUه9¼»é¹""§؛\u200c¶د:UU؟:\u200eb؟¹{\x14\u200d¸,ù19ïî\u200d؛ئ\x12¯؛\x00\x00ا:\u200c6°7A§a؛ذé„؛ذi†؛®G\x14:حجŒ8\x03\u200cè9ه\x17¸؛ق]¦؛ڈآ5¸قفا9حج^:\x00€ٹ؛q=²:\x00\x00¢9\x14®أ9×£T¹لz‚:\x1bèG؛®G7؛ڑ™<:m\xa0ƒ¹""´9\x14®\x1d:"¢²؛®G-؛ڑ™~:±ن¸:\x18ث«:¸\x1e…؛`,8؛Hل\u200d¹±ن.:\x1f…¥؛لْ‚:ڑ™s:R¸\x0b؛ئ’\x07؛0–C؛ڈآ¸:ذéھ:ة/خ¹A\'¸:ڑ™ز:m\xa0\x1e:è´ظ::ي‡؛\n',
 '×\x05؛Œ%8؛ش\x06~؛أُu:\x00\x00\n',
 ":‰ˆ\x149\x14®?؛ِ(\x05:«ھ…:)\\‡833G:Haط؛\x1f…¼:¼»'9\x00\x00 ؛=\n",
 '6؛R¸‚¹¼;€؛\x1bè¾؛\x1bèw؛قف؛:A§\x1a؛""j؛K~J:Hل\x14؛ىرد:\u200c6\x0c؛–|ب؛‚Nm:cةد·:mک؛‰ˆھ9\x00\x00ü9DD(¹ذi\x1f:ذé¬؛,ù™9¼»\x1e:wwƒ؛\x03\u200cF87ذ©·×£Q؛\x1f…w؛ئ\x12ح؛\x00\x00\x007ٍ‹U8\x0etZ6“ك«؛cةط؛Haد؛–ü¼؛33?¹Œ%َ9أُخ9=\n',
 '‹؛ق]ع:ڈآ/؛0–ق¹¤pُ¹Dؤخ:¤p¤؛\x1bèت9\u200ebé¹ùE‹:–üb7=ٹ؛:؟Xv؛×£c؛ِ(·؛è4\xa0؛cة‹؛0\x16ˆ؛ئ’U:""#؛ة/j:R8،:أُى9ذé€:ىQX:\x1f…L:""›؛K\u200f•؛ڈآں؛‰ˆ8¸ww´:""o؛è´…؛\n',
 'W·؛¤pگ:{”¶؛\x0etJ¹\u200eb>:ùإة؛`¬أ؛ِ(ü9K\u200f™:‚N؛:لz;:ِ(ٹ:Œ¥ˆ؛§\n',
 'ں؛ِ¨\xad:ڑ™q؛\u200c6\x19:×£H9¤p\x1c:\x03\u200cخ¹–üٹ8UU\x13؛Hلؤ¹è´ء؛ïnژ؛®Gک:è´¯9\x0etN؛O\x1b\x0b؛\x00\x00Z:\n',
 'Wڑ؛""J؛؟طخ:\x03\u200c¹:لْ¬؛\u200c6ک9ڑ™D؛\x1bèT8ق]ƒ:¼»س:0–-:~±³:,y‰:è´،¸jƒأ:m\xa0]:A\'د:j\x03\x15؛Haد:""½:wwù¹ه\x17ء؛×#س:&؟œ9×£5؛Hلz¹\\ڈ€¹)\\¨؛O\x1bْ¹ه\x17\x1b¹ڈB×؛\x03\u200c™؛ىQز¹لz¤¹ذi\x1c:\\ڈژ9ùإV¹R¸€:ùإü9ww?9‰\x08\u200d:~±ؤ¹‚Nù¹‰ˆ\x10¹UUn؛\x11\x11ƒ؛ٍ‹چ8‰ˆ½:\x1bèî¹O\x1bè¶`¬´؛=\n',
 '¢:\n',

I've also tried using pickle but that also doesn't work.

Any suggestions on what I could try out?

skidjoe
  • 493
  • 7
  • 16
  • All *ready_to_download* files at http://mazajak.inf.ed.ac.uk:8000/#embedding-page are _binary_ ones (`.bin` extension)… See https://fileinfo.com/extension/bin or https://file.org/extension/bin etc… – JosefZ Mar 17 '22 at 22:52
  • @JosefZ I see, I just realized that. Is there a way to download this file straight to a private server through terminal? I've tried ```wget http://mazajak.inf.ed.ac.uk:8000/get_sg_250``` but I don't seem to get the .bin file but instead a 10GB file which is unreadable. – skidjoe Mar 18 '22 at 10:44
  • If I click any *download* (e.g. `get_sg_250`) then my browser offers to save it as `get_sg_250.bin`. Anyway, with or without the `.bin` extension, the _binary_ content is the same - and reading _binary_ data without knowing their _structure_ isn't easy (if possible), imho… They provide the API in python and it can be [downloaded from here](http://mazajak.inf.ed.ac.uk:8000/get_api) - could it help?. – JosefZ Mar 18 '22 at 15:52

0 Answers0