
I need to write a script that converts a UTF-8 file into decimal values. The script will read a file and run it through the conversion. I have a file that is purely UTF-8 (correct me if I am wrong), shown below. My main goal is to produce short integer values from the conversion.

00000000: f712 db12 141a ac1c 2018 ee2e 3b20 7a19  ........ ...; z.
00000010: 8f15 2813 c911 ca10 4710 cd0e 980e 3d11  ..(.....G.....=.
00000020: 6a17 3a21 3b30 ab49 e269 ff7f ff7f ff7f  j.:!;0.I.i......
00000030: ff7f ff7f ff7f 3067 7b19 67c8 2092 998f  ......0g{.g. ...
00000040: 65af edbd 00ad d191 0080 0080 9382 2b91  e.............+.
00000050: 85af 6fba 10b4 1a9f c287 0080 3286 7fa2  ..o.........2...
00000060: 8fd0 5bf9 db15 0228 6433 b33c 3e42 c742  ..[....(d3.<>B.B
00000070: 9f3f 213d d33a bb38 b534 a930 072c 9e26  .?!=.:.8.4.0.,.&
00000080: 8f1e ab14 2609 2aff 44f9 63f7 56f8 65fb  ....&.*.D.c.V.e.
00000090: 9d00 bf0a 1b0f 8f13 7917 551b 351f 4f22  ........y.U.5.O"
000000a0: c524 6f25 bc24 5523 d322 5123 de23 f523  .$o%.$U#."Q#.#.#
000000b0: 1224 d023 2924 2a24 e523 6e23 bb22 3422  .$.#)$*$.#n#."4"
000000c0: 5021 0a21 bd1f b01d 751c 021b eb19 e118  P!.!....u.......
000000d0: ac18 7b18 9118 ec18 fc18 a818 f018 7f18  ..{.............
000000e0: 6a18 2918 8417 5c17 fe16 dc15 cd14 5e14  j.)...\.......^.
000000f0: 0514 c513 5213 5513 b613 ec13 4714 9514  ....R.U.....G...
00000100: ee14 a214 8314 4614 9c13 d512 6512 9611  ......F.....e...
00000110: 6110 d20f 7d0f 800f d10f c20f 1710 8f10  a...}...........
00000120: 3b11 da11 8012 9013 1414 7914 8d14 9514  ;.........y.....
00000130: e613 4a13 de12 4d12 9411 7b11 2a11 9310  ..J...M...{.*...
00000140: 2011 6611 9c11 2112 d012 0713 7e13 c313   .f...!.....~...
00000150: 6214 6414 4f14 ad13 d713 4c13 9e12 db11  b.d.O.....L.....
00000160: f910 f810 0411 1411 0511 fb10 7411 8611  ............t...
00000170: d011 7012 3f12 e712 0413 1913 5613 5f13  ..p.?.......V._.
00000180: fb12 b912 a412 8212 6612 4512 8911 6911  ........f.E...i.
00000190: a611 1212 0212 3012 a012 d912 8813 9f13  ......0.........
000001a0: fb13 8514 1214 3a14 7a14 4314 8a14 3314  ......:.z.C...3.
000001b0: e813 b313 fb12 a212 ca12 3912 9412 6612  ..........9...f.
000001c0: c212 7712 3712 2812 4512 cd12 cd12 e312  ..w.7.(.E.......
000001d0: cf12 fc12 3f13 d512 cc12 9012 4912 4012  ....?.......I.@.
000001e0: 4412 4d12 cc11 a611 cd11 f211 dc11 2a12  D.M...........*.
000001f0: 7112 9812 cd12 4713 8313 a513 7c13 a413  q.....G.....|...
00000200: cf13 db13 e213 2b14 1214 1014 bf13 6413  ......+.......d.

The file in hexl-mode

The problem is that I am not sure what a UTF-8 file actually is. I googled it but had no luck. Without enough information I cannot design an algorithm for the process. Please share your ideas for the code needed to get it working. I need this script reading files as soon as possible. Any contribution will be greatly appreciated.

FYI: Once I have the short integers, I will be able to plot a graph of the signal received from a detector. The original file format was .ht3, and I have extracted a sample into a .txt file. An answer in C would be great too (because that's all I know for now).

Update 1: I was wrong about UTF-8. It is binary data. Below is a sample of the raw data. (I just got this data.)

> ˜€¨ Ó.; zè(… GÕò=j:!;0´I‚iˇˇˇˇˇˇ0g{g»
> íôèeØÌΩ≠—ëÄÄìÇ+ëÖØo∫¥ü¬áÄ2Ü¢è–[˘€(d3≥<>B«Bü?!=”:ª8µ4©0,û&è´&  *ˇD˘c˜V¯e˚ùø
> èyU5O"≈$o%º$U#”"Q#fi#ı#$–#)$*$Â#n#ª"4"P!
> !Ω∞uη¨{ëϸ®j)Ñ\˛‹Õ^≈RU∂ÏGïÓ¢ÉFú’eña“}Ä—¬è;⁄ÄêyçïÊJfiMî{*ì
> fú!–~√bdO≠◊Lû€˘¯˚tÜ–p?ÁV_˚π§ÇfEâi¶0†Ÿàü˚Ö:zCä3Ë≥˚¢ 9îf¬w7(EÕÕ„œ¸?’ÃêI@DMöÕÚ‹*qòÕGÉ•|§œ€‚+ødBÄ/”€?W‚µÿ⁄ŒÆ∏‹∂—|†fiè∑mDtÔÒV€÷ ŸÍ´*Û˜ûÃMÀ$`öÇ~^Ñ2\‘n±≥#,&K#ZNÖt˚∫È    3#≈‡”≥õ!û”≠]„(rŸß∞Ø„uâ¶<#0ˇFïfYˇˇˇˇˇ[gÙE…`óˇí¿≤ǬiØ√ëÄÄZźëüØÇºrµxû^áÄ
> ã.ƃ◊@˛H$\/¬9˙@ÊB@2=x:u7æ3/ü*Ï$èı“˛'˘v˜‚¯/˚°ëSî<T∑"≈$Ø%%$…"s#u# #Ï#e#˜#Ò#˘#~#¢"≥""!®ä¸∑6b”}®.T8–?rÛ)íËJ¢~¯∞o§d∆Ä+f-flxLܰ€S4D›£W´≥º(áNŒ‰Y«“i~ev0ó¨6¯s—@Ïô&~•뛿ì§◊Íyîܶ[0∆ø‹ä€MEf±ÿÙù7*ü·Ÿß¶LP∞/%,†VT+ó≠ÕhÕÃÒîπ¶  ôu4wè˘…SJ~èáM∏È¥í¨µ¬¢ùJÙˇ∞ÜûAÅâÖt4:m
> úx,˚Ÿ´fi≠ˇ˝BA~Ca1Ô}óAÄjm»∑ºö‡ªqwYG¶ùfi◊π{‹≥¬=L=æ
> ıˇ∂—◊flŸ—‚Ë‘¬SA˛˜˛4£*ŸÙ“„x√˘¡7òˆ°è)ª:UXˇˇˇˇˇˇˇ4ñfi”ôtâh≠‹¡Ñ≥úîÄÄÄÇÉ©3º®∏ê•oâÄJÉx§‘–›˜veá',2Á;ZB[CÍAe>Ñ983¨-_(s$è-@Ş,¯€ıI˜H˙˘˛:v Éohj!Ò#%÷$j#ç"7#3#ñ#Ö#6#À#C$$”#›"\""{!
> t=Z|ôNU
> s7.g|A»÷ÁãÇÒ;˛ˇfl„kìúØõxB≈æÈ"®Äô3ùyœS>së   ó—}-vä_jP‡>EŸÌÒ–Gı|⁄≤O!˛ÏeôÌVwóÔÓ¸‹µüÑEtP˜cw£
> %jòï·‚”¿6™.Á‹éAÊ∆⁄TR?Wøƒ≤Çú—R7-_:O9a(1ógüΩ=-úÂæÌ—ºâôt©U.
> XÄ0=YR|tŒ¶<Êõèf •ÕJ|)bJ]pA=‰¸π’‚˙›
> F
> ÙÛtm†œ„¢>®∞¡£∏êj>~˝$˝N=’œı≈í ∑âflIQ¢*√¯˜≈ó¡˛
> û¨|è+B>^ˇˇˇˇˇˇˇAÔÈ    ¢Sô‘´
> ºz∞çÄÄÄtàp´ó∏Ö±rô§ÉÄ∏Ñr•≈Œãı˛'0!9N@ÑBºAô>k8Q1™*&#: áˇ¯Kˆx˜z˚û~µblV9Ò
> o#∫$˙#´""q"`#…#∂#„# $X$<$é#fl"ñ"Ú h
> {44&ë0®5xÇZlíp¬Øzü-9);Óh8%jÆä°à‰|Öß<j ∑ñ Bc±«PìÁœ∑ÅùSüó;)¿!ñ∫©XÒ™>◊¢9µ⁄)-uÏJyƒ
> ⁄ÊÑò⁄íyQ⁄Äpp´çΩ⁄“ 

The dump shown earlier is the same file viewed in hexl-mode. Does that mean it is in hexadecimal?

  • UTF-8 is a Unicode text encoding, used to express the very large number of code points ("characters") that Unicode makes available. Are you sure that's what you have? Edit the question to also show what kind of data you expect to be able to extract from the `raw.txt` file, that can help. It's not at all clear to me what "decimal" means, here. – unwind Sep 28 '17 at 09:31
  • Your goal is unclear. I'm not sure what you mean by "to decimal" because UTF-8 is a format used to store Unicode text, and is not a file format per se. The sample data you're showing seems to be a proprietary binary format that does not contain UTF-8 data, apart from a few readable characters here and there. – SirDarius Sep 28 '17 at 09:32
  • Yeah thats what I have. That file is in utf 8. I need to convert every single one of them into decimal. – fizsics Sep 28 '17 at 09:35
  • Which file is in utf-8? Show the first few lines of that file. And show the corresponding output you want. Otherwise the question is unclear. – Jabberwocky Sep 28 '17 at 09:37
  • This is confusing. What you show here is a hexadecimal dump of binary data... Is your file the binary data, or the hexadecimal dump? – SirDarius Sep 28 '17 at 09:37
  • An UTF-8-encoded file cannot start with `0xf7`. The first four `unsigned short`s in the file are 4855, 4859, 6676, 7340, does that make sense? If so it's just binary data, no UTF-8 involved. – unwind Sep 28 '17 at 09:37
  • The file is in unicode but I need to convert them into integers. And from what I was told, it is short type integers. Im not sure if Im making sense. – fizsics Sep 28 '17 at 09:37
  • @fizsics your question doesn't make sense until you show us some examples of the conversion you want. – Jabberwocky Sep 28 '17 at 09:38
  • When you say you want to convert it to decimal, that suggests a text format. Are you saying you want to convert the unicode values to decimal strings, then place those decimal strings in a text file? Kind of odd. – Tom Karzes Sep 28 '17 at 09:38
  • You do understand that "decimal" means base 10, right? – Tom Karzes Sep 28 '17 at 09:39
  • I get that decimal means base 10. "The first four unsigned shorts in the file are 4855, 4859, 6676, 7340, does that make sense? If so it's just binary data, no UTF-8 involved. " How did you get those values? – fizsics Sep 28 '17 at 09:42
  • @fizsics for clarifications please [edit] your question. Don't describe your files, __show them__. – Jabberwocky Sep 28 '17 at 09:43
  • It should be binary data then. I was told it's in utf-8 so I was confused too. – fizsics Sep 28 '17 at 09:44
  • Sigh.. so you have that binary file starting with `f712 db12 141a ac1c`. So what output do you want? Give an example.... – Jabberwocky Sep 28 '17 at 09:46
  • Read a pair of one-byte values. Then combine them, either little- or big-endian, e.g. `v = (v1 << 8) | v2;` There is nothing "decimal" about this whatsoever. You can store the result in an `int` or a `short` as you prefer, as long as it's at least 16 bits. – Tom Karzes Sep 28 '17 at 09:47
  • @MichaelWalz tbh Im not sure. All I know is I need them in numbers because I need to make a graph out of all this data. – fizsics Sep 28 '17 at 09:54
  • @fizsics If you're not sure what you want, it's somewhat hard to help. But Tom Karzes' comment looks promising. – Jabberwocky Sep 28 '17 at 09:58
  • @TomKarzes why is it odd? Could you elaborate more on the process? Thanks – fizsics Sep 28 '17 at 10:01
  • Google gives a few different uncommon results for "ht3 file format". Unless you can find an actual technical description of the file format, you have essentially no hope of converting it. – aschepler Sep 28 '17 at 10:21
  • It's odd because the only reason to display them in decimal is if you intend for a person to read them, as numbers. Is that your intent? It seems unlikely. I have never just sat down with a list of numbers to read, and I'd forget them as soon as I read them. – Tom Karzes Sep 28 '17 at 10:29
  • @TomKarzes I need them for graphic representation. I think I know where I should be heading from now. Thank you so much. btw j.:!;0.I.i...... and 00000070 are these hexadecimal? – fizsics Sep 28 '17 at 11:23

1 Answer

From all the comments I think I've guessed what you want.

But one thing is certain: your question is totally unrelated to UTF-8.

Your file starts with the following bytes: f7 12 db 12 14 1a ac 1c. As someone commented, this is plain binary data: read as little-endian unsigned shorts, the first four values are 4855, 4827, 6676 and 7340 (the second is 0x12db = 4827, not the 4859 quoted in the comment), which sounds plausible for data you want to plot.

Bytes in file   Hexadecimal   Decimal
-------------------------------------
f7 12           12f7          4855
db 12           12db          4827
14 1a           1a14          6676
ac 1c           1cac          7340

You probably see a pattern here.

So this is what you need to do in pseudocode:

Repeat until end of file:

  • read two bytes from the file and store in b1 and b2
  • the short value you want is (b2 << 8) | b1

Example of what happens with the first 2 bytes:

  • On the first iteration you will get 0xf7 in b1 and 0x12 in b2.
  • b2 << 8 is 0x1200, and 0x1200 | 0xf7 is 0x12f7, which is 4855 in decimal.