Understanding _character encoding_ requires a firm grasp of [[Bit|bits]] and [[Byte|bytes]]. In this post I will try to make it as clear as possible how [[ASCII]] and [[UTF-8]] work by doing it _by hand_… It can be thought of as a follow-up to the excellent [What every programmer absolutely, positively needs to know about encodings and character sets to work with text](http://kunststube.net/encoding/).

Let’s play around with the letter “a”:

```sh
printf "a" > a.txt
```

Here I assume `$LANG` in your shell is set to a UTF-8 locale, so the [[Byte|bytes]] being written to the file follow the rules of UTF-8. In other words, “a” will be _encoded_ to bytes by following the UTF-8 standard.

When we use `cat`, we will see the bytes again interpreted as UTF-8:

```sh
cat a.txt
```

```fallback
a
```

So that is just a [[UTF-8]] interpretation of the file. But which bytes does the file really contain? With `xxd` we can make a binary dump:

```sh
xxd -b a.txt
```

```fallback
00000000: 01100001 a
```

In UTF-8, “a” is 8 bits (1 byte), the same single byte it has in ASCII. Let’s try another kind of dump - the hexadecimal dump - or hexdump:

```sh
xxd a.txt
```

```fallback
00000000: 61 a
```

Now you are seeing `61` - which is the hexadecimal representation of `01100001`.

You may not know what a hexdump is, how to interpret hexadecimal numbers or how to count with them, but the simplest facts are:

- Hexa means 6 (think _hexagon_)
- Decimal means 10 (think _decilitre_)

In the context of [[Hexadecimal]], [[Decimal]] means we have the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. These are 10 symbols. Hexa tells us we have 6 other symbols - A, B, C, D, E and F. Together that makes 16 symbols, which is why hexadecimal is also called “base 16”.

Hexadecimal is a very nice way of counting computer data, because one hexadecimal digit covers exactly four bits, so a byte is always two hexadecimal digits. I will not explain how it works, because [LiveOverflow](https://www.youtube.com/watch?v=mT1V7IL2FHY) ([[LiveOverflow]]) and [David J. Malan](https://youtu.be/6V1sr0XV%5FNg?t=247) ([[David J. Malan]]) have great videos on it already.

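
If you would rather check the numbers than take the dumps on faith, here is a minimal Python sketch of the same “a”. It only uses built-ins; the variable names (`encoded`, `value`, `high`, `low`) are my own and not part of anything above.

```python
# Encode the letter "a" with UTF-8, like printf did when writing a.txt.
encoded = "a".encode("utf-8")
value = encoded[0]  # indexing a bytes object gives an integer

print(len(encoded))          # how many bytes "a" takes up
print(value)                 # the byte as a decimal number
print(format(value, "02x"))  # the byte as two hexadecimal digits (like xxd)
print(format(value, "08b"))  # the byte as eight bits (like xxd -b)

# One hexadecimal digit covers four bits, so split the byte in two halves.
high, low = value >> 4, value & 0x0F
print(format(high, "x"), format(high, "04b"))
print(format(low, "x"), format(low, "04b"))
```

The two halves printed at the end are the `6` and the `1` of `61`, each next to its own four bits.
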
> [!TIP] December
> **December** was originally the 10th month in the Roman calendar (from Latin _decem_ = ten). October was the 8th (think [[Octave]]).

Let’s try to write a hexdump manually to create a file that will show an “a” if interpreted as UTF-8:

```sh
echo "0: 61" | xxd -r
```

```fallback
a
```

Let’s get an `xxd`-formatted hexdump back from the hexdump we wrote ourselves:

```sh
echo "0: 61" | xxd -r | xxd
```

```fallback
00000000: 61 a
```

We can convert it to binary like so:

```sh
echo "0: 61" | xxd -r | xxd -b
```

```fallback
00000000: 01100001 a
```

Can you write your name using this method? You can use `man ascii` to figure out which hexadecimal values you have to use. Hey, isn’t it fun being 4 years old again?

```sh
printf '41 6e 64 65 72 73' | xxd -revert -plain
```

```fallback
Anders
```

Let’s try some slicing and dicing with offsets and the `dd` utility. First, make a new file:

```sh
printf "cat dog giraffe lion monkey bird" > animals.txt
```

How does it look?

```sh
xxd animals.txt
```

```fallback
00000000: 6361 7420 646f 6720 6769 7261 6666 6520 cat dog giraffe
00000010: 6c69 6f6e 206d 6f6e 6b65 7920 6269 7264 lion monkey bird
```

How can we get lion from this? We know the “l” is at offset `00000010`, right? Let’s use `dd` with a block size of 1 byte. Then we skip the first 16 bytes (`10` in hexadecimal is 16 in decimal).

```sh
dd if=animals.txt bs=1 skip=16
```

```fallback
lion monkey bird
```

Wow! With ASCII, each letter is 1 byte, so we need 4 bytes to catch the lion:

```sh
dd if=animals.txt bs=1 skip=16 count=4
```

```fallback
lion
```

Unstoppable!

![Figure 1: Lions are pretty cool.](https://d33wubrfki0l68.cloudfront.net/e1e28e40cbf3bc850ae7418679b020a96fb4641b/62254/ox-hugo/lion.jpg)

Figure 1: Lions are pretty cool cats and kittens.

We could also use `xxd` in a roundabout way:

```sh
xxd -seek 16 -len 4 -plain animals.txt | xxd -revert -plain
```

```fallback
lion
```

If you don’t understand what is happening here, try removing parts of the pipeline to reveal the intermediate data (like [up](https://github.com/akavel/up)).

Now that you know how to write bytes by hand, I recommend opening a file and activating `hexl-mode` in Emacs.

Fun fact: Some people use `xxd` to get a poor man’s hex editor inside Vim by dumping the whole buffer with `:%!xxd` and reverting it with `:%!xxd -r`.

How about Windows vs. Unix newlines? Those things are annoying. Could you convert them by hand, instead of using `dos2unix` and `unix2dos`? I’ll leave it up to you.

## Python bonus round

- [Docs](https://docs.python.org/3/library/stdtypes.html#bytes)
- [Strings, Unicode, and Bytes in Python 3: Everything You Always Wanted to Know](https://medium.com/better-programming/strings-unicode-and-bytes-in-python-3-everything-you-always-wanted-to-know-27dc02ff2686)

Writing your name:

```python
print(bytes.fromhex('61 6e 64 65 72 73').decode("UTF-8"))
```

```fallback
anders
```

With the `\x` escape sequence:

```python
print(b"\x61\x6e\x64\x65\x72\x73".decode("UTF-8"))
```

```fallback
anders
```

Catching the lion:

```python
with open("animals.txt", "rb") as binary_file:
    binary_file.seek(16)  # skip the first 16 bytes, like dd skip=16
    lion = binary_file.read(4)  # read 4 bytes, like dd count=4
    print(lion.decode("UTF-8"))
```

```fallback
lion
```

The Python `bytes` type can be created from ASCII characters or hex escape sequences, so all of these are the same:

```python
a = b"\x61\x6e\x64\x65\x72\x73"
a2 = b"anders"
a3 = b"a\x6e\x64\x65\x72\x73"
print(a == a2 == a3)
```

```fallback
True
```

The `bytes` type is immutable, so we need to use the `bytearray` class to modify sequences of bytes.

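
A small sketch of what “immutable” means in practice, in case it sounds abstract. The names `name` and `capitalised` are made up for this illustration.

```python
name = b"anders"

try:
    name[0] = 65  # bytes do not support item assignment
except TypeError as error:
    print(type(error).__name__)

# Building a new bytes object is fine, because nothing is changed in place.
capitalised = b"A" + name[1:]
print(capitalised)
print(name)  # the original is untouched
```

Concatenating like this creates a brand new object every time, which is why a mutable type is more convenient when you want to edit bytes in place.
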
```python
a = bytearray(b"anders")
```

A bytearray is a sequence of integers (0-255), so the `bytearray` above looks like this:

| `a[0]` | `a[1]` | `a[2]` | `a[3]` | `a[4]` | `a[5]` |
|--------|--------|--------|--------|--------|--------|
| 97     | 110    | 100    | 101    | 114    | 115    |

To modify a single element in the bytearray we have to assign an integer. We can use `ord()` to convert “A” to its integer value, which is 65 in decimal:

```python
a = bytearray(b"anders")
a[0] = ord(b"A")
print(a)
```

```fallback
bytearray(b'Anders')
```

To go from decimal to hex:

```python
print(hex(65))
```

```fallback
0x41
```

When slicing, we get a bytearray back:

```python
a = bytearray(b'ANDers')
print(a[0:3])
```

```fallback
bytearray(b'AND')
```

So to replace a slice we assign bytes instead of [[Decimal|decimals]]:

```python
a = bytearray(b"anders")
a[0:3] = b"AND"
print(a)
```

```fallback
bytearray(b'ANDers')
```

How do we write these things to files?

```python
with open("anders.txt", "wb") as f:
    # "and" is written as ASCII characters and the rest as hex escape sequences
    f.write(b"and\x65\x72\x73")
```

Let’s read the file again:

```python
with open("anders.txt", "rb") as f:
    print(f.read())
```

```fallback
b'anders'
```

Opening a file in binary mode (`wb`/`rb`) means that we will be reading and writing `bytes` (`b"anders"`) instead of strings.

It’s possible to create a file-like object in memory by using `io.BytesIO` ([[BytesIO]]). You might want to do this when you have some library that wants to write binary data to a file. You could for example generate a plot with `matplotlib` and add it to a PDF.

To “emulate” the `f` variable from above, we could do this:

```python
import io

f_in_memory = io.BytesIO()
f_in_memory.write(b"and\x65\x72\x73")
f_in_memory.seek(0)  # go back to the start before reading
print(f_in_memory.read())
f_in_memory.close()
```

```fallback
b'anders'
```

If you don’t seek to the beginning (0), `read()` starts at the current position, which is the end of what was just written, so you would get an empty value (`b''`) here.

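
Here is a minimal sketch of that position behaviour, using only `io.BytesIO` methods from the standard library (`tell()`, `read()`, `getvalue()` and `seek()`). The variable names are my own.

```python
import io

buffer = io.BytesIO()
buffer.write(b"and\x65\x72\x73")

# After writing, the position sits at the end of the data.
position_after_write = buffer.tell()

# Reading from the end gives nothing back.
read_without_seek = buffer.read()

# getvalue() returns the whole content regardless of the position.
whole_buffer = buffer.getvalue()

# Rewind, then read again.
buffer.seek(0)
read_after_seek = buffer.read()
buffer.close()

print(position_after_write)
print(read_without_seek)
print(whole_buffer)
print(read_after_seek)
```

If all you need is the full content, `getvalue()` saves you from thinking about the position at all. Libraries that expect a file usually call `read()` though, so rewinding with `seek(0)` before handing the object over is the safer habit.
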
Another cool alternative is `SpooledTemporaryFile`, which keeps the data in memory with a [[BytesIO]] or `StringIO` up until a certain size and then rolls over to a real file on disk:

```python
import tempfile

with tempfile.SpooledTemporaryFile(max_size=100, mode="w+t", encoding="utf-8") as temp:
    print("temp: {!r}".format(temp))

    for i in range(3):
        temp.write("This line is repeated over and over.\n")
        # _rolled and _file are private attributes, only peeked at for illustration
        print(temp._rolled, temp._file)
```

```fallback
temp: <tempfile.SpooledTemporaryFile object at 0x1065837f0>
False <_io.TextIOWrapper encoding='utf-8'>
False <_io.TextIOWrapper encoding='utf-8'>
True <_io.TextIOWrapper name=3 mode='w+t' encoding='utf-8'>
```

## Resources

- [Unicode & Character Encodings in Python: A Painless Guide – Real Python](https://realpython.com/python-encodings-guide)
- [Strings, Unicode, and Bytes in Python 3: Everything You Always Wanted to Know](https://medium.com/better-programming/strings-unicode-and-bytes-in-python-3-everything-you-always-wanted-to-know-27dc02ff2686)
- [Python Cookbook, 3rd Edition](https://learning.oreilly.com/library/view/python-cookbook-3rd/9781449357337/)
- [io - Text, Binary, and Raw Stream I/O Tools — PyMOTW 3](https://pymotw.com/3/io/index.html)
- [Working with Binary Data in Python | DevDungeon](https://www.devdungeon.com/content/working-binary-data-python)
- [The deal with numbers: hexadecimal, binary and decimals - bin 0x0A](https://www.youtube.com/watch?v=mT1V7IL2FHY)
- [Abstraction by Professor David J. Malan - YouTube](https://www.youtube.com/watch?v=6V1sr0XV%5FNg)
- [CS50 Lectures 2018](https://www.youtube.com/playlist?list=PLhQjrBD2T382eX9-tF75Wa4lmlC7sxNDH)
- [Fluent Python - Chapter 4. Text versus Bytes](https://learning.oreilly.com/library/view/fluent-python/9781491946237/ch04.html#strings%5Fbytes%5Ffiles)
- [DEFCON 28 Safe Mode - PHV - Take Down The Internet! With Scapy - YouTube](https://www.youtube.com/watch?v=G9mp5jH69Tg)

## Related concepts

[[Endianess]]