Writing to a text file, binary vs ascii

Question

So I am having the hardest time trying to understand this concept. I have a program that reads a text file and writes it to another file, replacing the most common words with unsigned chars. What I cannot for the life of me understand is how to tell the two kinds of bytes apart afterwards.

If I write to the new file the original char I read in or an unsigned char value corresponding to 1-255, how then do I determine the difference when I go back in reverse to the original file contents?

Give an example input and the corresponding output you desire. From your post, it's hard to determine what you're trying to do. An example will provide a reference for answers, as well as help them sort out any confusion you might have. — Thanatos, Apr 26 '14 at 01:15
What does "replaces the most common words with unsigned chars" mean? — ooga, Apr 26 '14 at 01:15
_'with unsigned chars'_ You mean raw binary data vs. human readable text, do you? — πάντα ῥεῖ, Apr 26 '14 at 01:15
What I mean is that I read in a text file a char at a time, and if a word, punctuation mark, or space matches something in a key bank, it is replaced with an unsigned char that indexes a list of the 1-255 most common words. What I am having a hard time with is going back from this compressed file to the original: how do I determine whether I'm reading one of these unsigned chars rather than, say, a char in a word? — , Apr 26 '14 at 01:28

score 1 · Accepted Answer · answered Apr 26 '14 at 01:22

When you write a file as binary, a number such as 1253553 is written as the raw bytes of its integer type rather than as the seven characters "1253553". An int is 2 or 4 bytes depending on the platform, and a value this large needs at least a 4-byte int, so in a binary file you will see a 4-byte sequence representing that number. For chars, it makes no difference, as each char occupies one byte either way.
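
To make the size difference concrete, here is a minimal sketch that produces both forms in memory instead of in a file. The function names `encode_as_text` and `encode_as_binary` are made up for illustration, and the byte order of the binary form depends on the machine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Text form: decimal digits, one byte per digit.
   Returns the number of bytes used, or 0 if the buffer is too small. */
size_t encode_as_text(int value, char *out, size_t cap) {
    assert(out != NULL && cap > 0);
    int n = snprintf(out, cap, "%d", value);
    return (n < 0 || (size_t)n >= cap) ? 0 : (size_t)n;
}

/* Binary form: the raw object bytes of the int, as fwrite would emit them.
   Returns the number of bytes used, or 0 if the buffer is too small. */
size_t encode_as_binary(int value, unsigned char *out, size_t cap) {
    assert(out != NULL);
    if (cap < sizeof value) return 0;
    memcpy(out, &value, sizeof value);
    return sizeof value;
}
```

For 1253553 the text form takes 7 bytes, while the binary form takes `sizeof(int)` bytes regardless of the value.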

score 0 · Answer 2 · answered Apr 26 '14 at 01:34

Usually, you have to have some well known and obvious way to determine the format of your file.

One way to do this is to create your own file extension. You could naively expect that any file with that extension is in your compressed format, but it's actually quite likely other files out there have the same extension (e.g., ".dat" is probably a bad choice). So, you'll want to take further steps, like having the first few bytes of the file be something that is unlikely to be there in any other file (some "magic numbers"). Let's use two bytes, and let's simply choose 0xAB 0xCD as those two bytes.

So, when your program is presented with a file that has the proper extension, open it and read the first two bytes. If they're 0xAB and 0xCD, you can assume you're reading your special format.
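
A rough sketch of that check, using the 0xAB 0xCD bytes chosen above. The helper names `has_magic` and `file_has_magic` are hypothetical, and the file is assumed to have been opened in binary mode ("rb").

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if the buffer starts with the two magic bytes 0xAB 0xCD. */
int has_magic(const unsigned char *data, size_t len) {
    assert(data != NULL || len == 0);
    return len >= 2 && data[0] == 0xAB && data[1] == 0xCD;
}

/* Reads the first two bytes of an already opened file and checks them.
   Assumes the stream is positioned at the start of the file. */
int file_has_magic(FILE *f) {
    unsigned char header[2];
    assert(f != NULL);
    return fread(header, 1, sizeof header, f) == sizeof header
        && has_magic(header, sizeof header);
}
```

A file shorter than two bytes simply fails the check, so it is treated as not being in the special format.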

This isn't a very strong way of accomplishing this task, but it is one way of doing it. You could get more extravagant if you like.

For more information, you might want to read the Wikipedia page on the subject. It's a start.

Steve


I am using a different file type for the compressed version. But the compressed file isn't completely compressed, only partially. The parts that are compressed are written as an unsigned char byte ranging from 1-255, because that's the number of words in the list that are to be compressed. If the char isn't to be compressed, it is just added as the char I read in from the original file. – Apr 26 '14 at 01:49
Yes, and? Like you said in your question, you need a way to identify that a file is in this format. I've given you some options, there are others, but you need to come up with some way. Is there any other software that should be able to read your compressed file? Is a human supposed to be able to read it? Maybe just add a keyword (like the Wikipedia page says .gif files do with `GIF87a` at the beginning) to the beginning of your file. – Steve Apr 26 '14 at 02:06
The issue isn't determining the file I want, I have already built that into it based on the extension of the file. My problem is when I'm reading this file, how can I determine if the char is one of the unsigned ones I added based on my key list or just one that was written to the file directly from the original. It's a compression/decompression application. – Apr 26 '14 at 02:14
Precede your special characters with something like a null character (0) and just make sure that 0 never appears anywhere else. Then when you encounter it, you know the next one is "special". – Steve Apr 26 '14 at 02:31
Is there not another way that wouldn't be adding more memory to the file? I had a friend try to explain to me that comparing the variable to 0 would do the trick but I tried that in a test program and it didn't work. Something about looking at ASCII text or the decimal value. – Apr 26 '14 at 02:37
Nope. Cannot be done without extra markup. ASCII characters are just byte values (0-127), which overlap with the 1-255 range of your unsigned char codes. http://www.asciitable.com – Kevin Lam Apr 26 '14 at 04:18
A byte is a collection of 8 bits, and is the fundamental unit used in computers (this is a simplified statement, but it'll do). It can hold 256 distinct values. That's all it is. One interpretation for the value 65 could be ASCII (which is 'A'), another is as a number (which is 65), and another interpretation of it might mean you look up the value in a table and it equates to "Hello". If you add 1 to the value 127, what is the answer? One interpretation says 128 (unsigned char) and another is -128 (signed char). However, that interpretation, or what it means, is up to *you*. – Steve Apr 26 '14 at 21:25
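
The escape-byte idea from the comments above could look roughly like the sketch below. It assumes byte 0 never occurs in the original text; the `Token` type and the `encode`/`decode` names are invented for this example and are not from the asker's program.

```c
#include <assert.h>
#include <stddef.h>

#define ESC 0x00  /* assumed marker: must never occur in the original text */

/* Hypothetical token: a literal byte, or a word code in 1..255. */
typedef struct { int is_code; unsigned char value; } Token;

/* Writes tokens to out; a word code is preceded by ESC.
   Returns bytes written, or 0 if out is too small. */
size_t encode(const Token *toks, size_t n, unsigned char *out, size_t cap) {
    assert(toks != NULL && out != NULL);
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        if (toks[i].is_code) {
            if (w + 2 > cap) return 0;
            out[w++] = ESC;
            out[w++] = toks[i].value;
        } else {
            assert(toks[i].value != ESC);  /* literals may not be the marker */
            if (w + 1 > cap) return 0;
            out[w++] = toks[i].value;
        }
    }
    return w;
}

/* Reads bytes back into tokens; the byte after ESC is a word code.
   Returns the token count, or 0 on truncated input or a full out array. */
size_t decode(const unsigned char *in, size_t len, Token *out, size_t cap) {
    assert(in != NULL && out != NULL);
    size_t w = 0;
    for (size_t i = 0; i < len; i++) {
        if (w == cap) return 0;
        if (in[i] == ESC) {
            if (++i == len) return 0;  /* marker with nothing after it */
            out[w].is_code = 1;
        } else {
            out[w].is_code = 0;
        }
        out[w++].value = in[i];
    }
    return w;
}
```

Each replaced word costs two bytes instead of one, which is the extra markup the comments refer to. If the input is guaranteed to be 7-bit ASCII, bytes 128-255 never appear in it and could carry word codes with no marker at all, though that leaves only 128 codes.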
