Understanding _character encoding_ requires a firm grasp of [[Bit|bits]] and [[Byte|bytes]]. In this post I will try to make it as clear as possible how [[ASCII]] and [[UTF-8]] work by doing it _by hand_…
It can be thought of as a follow-up to the excellent [What every programmer absolutely, positively needs to know about encodings and character sets to work with text](http://kunststube.net/encoding/).
Let’s play around with the letter “a”:
```sh
printf "a" > a.txt
```
Assuming `$LANG` in your shell is set to a UTF-8 locale (for example `en_US.UTF-8`), the [[Byte|bytes]] being written to the file will follow the rules of UTF-8. In other words, “a” will be _encoded_ to bytes by following the UTF-8 standard. When we use `cat`, we will see the bytes again interpreted as UTF-8:
```sh
cat a.txt
```
```fallback
a
```
So that is just a [[UTF-8]] interpretation of the file. But which bytes does the file really contain? With `xxd` we can make a binary dump:
```sh
xxd -b a.txt
```
```fallback
00000000: 01100001 a
```
In UTF-8, “a” is 8 bits (1 byte). Let’s try another kind of dump - the hexadecimal dump - or hexdump:
```sh
xxd a.txt
```
```fallback
00000000: 61 a
```
Now you are seeing `61` - which is the hexadecimal representation of `01100001`.
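The same two views can be cross-checked in Python. This is only a minimal sketch, and the variable names are made up for illustration:

```python
# Encode the letter "a" with UTF-8 and look at the single byte we get back.
encoded = "a".encode("utf-8")

byte_value = encoded[0]                # indexing bytes gives an integer
as_binary = format(byte_value, "08b")  # 8 binary digits, zero-padded
as_hex = format(byte_value, "02x")     # 2 hexadecimal digits, zero-padded

print(len(encoded), as_binary, as_hex)
```

```fallback
1 01100001 61
```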
You may not know what a hexdump is or how to interpret hexadecimal numbers or how to count with them, but the simplest facts are:
- Hexa means 6 (think _hexagon_)
- Decimal means 10 (think _decilitre_)
In the context of [[Hexadecimal]], [[Decimal]] means we have the symbols 0, 1, 2, 3, 4, 5, 6, 7, 8 and 9. These are 10 symbols. Hexa tells us we have 6 other symbols - A, B, C, D, E and F. Hexadecimal is also called “base 16”. Hexadecimal is a very nice way of counting computer data. I will not explain how it works, because [LiveOverflow](https://www.youtube.com/watch?v=mT1V7IL2FHY) ([[LiveOverflow]]) and [David J. Malan](https://youtu.be/6V1sr0XV%5FNg?t=247) ([[David J. Malan]]) have great videos on it already.
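If you just want the gist before watching the videos, the positional idea can be sketched in a few lines of Python. `hex_to_int` is a made-up helper for illustration only; in practice the built-in `int(text, 16)` does the same job:

```python
def hex_to_int(text):
    """Convert a hexadecimal string to an integer, one digit at a time."""
    symbols = "0123456789abcdef"  # the 10 decimal symbols plus 6 extra ones
    value = 0
    for digit in text.lower():
        # Everything read so far moves one position to the left (times 16),
        # then the value of the new digit is added.
        value = value * 16 + symbols.index(digit)
    return value

print(hex_to_int("61"))                   # 6 * 16 + 1
print(hex_to_int("ff"))                   # 15 * 16 + 15
print(hex_to_int("61") == int("61", 16))  # agrees with the built-in
```

```fallback
97
255
True
```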
> [!TIP] December
> **December** was originally the 10th month in the Roman calendar (from Latin decem = ten). October was the 8th (think [[Octave]]).
Let’s try to write a hexdump manually and revert it into a byte that will show an “a” if interpreted as UTF-8:
```sh
echo "0: 61" | xxd -r
```
```fallback
a
```
Let’s get an `xxd`-formatted hexdump by writing a hexdump ourselves:
```sh
echo "0: 61" | xxd -r | xxd
```
```fallback
00000000: 61 a
```
We can convert it to binary like so:
```sh
echo "0: 61" | xxd -r | xxd -b
```
```fallback
00000000: 01100001 a
```
Can you write your name using this method? You can use `man ascii` to figure out what hexadecimals you have to use. Hey, isn’t it fun being 4 years old again?
```sh
printf '41 6e 64 65 72 73' | xxd -revert -plain
```
```fallback
Anders
```
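If looking up every letter in `man ascii` gets tedious, Python can produce the hex digits for you. A small sketch, assuming Python 3.8 or newer (the separator argument to `bytes.hex()` was added in that version):

```python
name = "Anders"

# Encode to bytes, then print each byte as two hex digits separated by spaces.
hex_digits = name.encode("ascii").hex(" ")
print(hex_digits)

# Going back: fromhex() ignores the spaces between the byte pairs.
print(bytes.fromhex(hex_digits).decode("ascii"))
```

```fallback
41 6e 64 65 72 73
Anders
```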
Let’s try some slicing and dicing with offsets and the `dd` utility. First, make a new file:
```sh
printf "cat dog giraffe lion monkey bird" > animals.txt
```
How does it look?
```sh
xxd animals.txt
```
```fallback
00000000: 6361 7420 646f 6720 6769 7261 6666 6520 cat dog giraffe
00000010: 6c69 6f6e 206d 6f6e 6b65 7920 6269 7264 lion monkey bird
```
How can we get lion from this? The “l” is the first byte on the second line, which starts at offset `00000010`. Let’s use `dd` with a block size of 1 byte and skip the first 16 bytes (`10` in hexadecimal is 16 in decimal):
```sh
dd if=animals.txt bs=1 skip=16
```
```fallback
lion monkey bird
```
Wow! With ASCII, each letter is 1 byte, so we need 4 bytes to catch the lion:
```sh
dd if=animals.txt bs=1 skip=16 count=4
```
```fallback
lion
```
Unstoppable!

Figure 1: Lions are pretty cool cats and kittens.
We could also use `xxd` in a roundabout way:
```sh
xxd -seek 16 -len 4 -plain animals.txt | xxd -revert -plain
```
```fallback
lion
```
If you don’t understand what is happening here, you should remove parts of the pipeline to reveal the data (like [up](https://github.com/akavel/up)).
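Both pipelines boil down to "skip 16 bytes, then take 4". As a rough cross-check, here is the same idea as a Python slice. The bytes are typed in directly instead of being read from `animals.txt`, so the snippet runs on its own:

```python
data = b"cat dog giraffe lion monkey bird"  # same bytes as animals.txt

offset = data.find(b"lion")     # byte offset of the first match
lion = data[offset:offset + 4]  # like skip=16 count=4 with bs=1

print(offset, hex(offset), lion)
```

```fallback
16 0x10 b'lion'
```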
Now that you know how to write bytes by hand, I recommend opening a file and activating `hexl-mode` in Emacs.
Fun fact: Some people use `xxd` to get a poor man’s hex editor inside Vim by dumping and reverting the whole buffer by using `%!xxd`.
How about Windows vs. Unix newlines? Those things are annoying. Could you convert them by hand, instead of using `dos2unix` and `unix2dos`? I’ll leave it up to you.
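If you get stuck, here is one possible direction in Python (spoiler alert). It is only a sketch: `dos_to_unix` is a made-up helper, and it assumes that every `0d 0a` pair in the data really is a line ending:

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace Windows line endings (CR LF, 0d 0a) with Unix ones (LF, 0a)."""
    return data.replace(b"\r\n", b"\n")

windows_text = b"cat\r\ndog\r\n"
unix_text = dos_to_unix(windows_text)

print(windows_text.hex(" "))  # bytes.hex() with a separator needs Python 3.8+
print(unix_text.hex(" "))
```

```fallback
63 61 74 0d 0a 64 6f 67 0d 0a
63 61 74 0a 64 6f 67 0a
```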
## Python bonus round
- [Docs](https://docs.python.org/3/library/stdtypes.html#bytes)
- [Strings, Unicode, and Bytes in Python 3: Everything You Always Wanted to Know](https://medium.com/better-programming/strings-unicode-and-bytes-in-python-3-everything-you-always-wanted-to-know-27dc02ff2686)
Writing your name:
```python
print(bytes.fromhex('61 6e 64 65 72 73').decode("UTF-8"))
```
```fallback
anders
```
With the `\x` escape sequence:
```python
print(b"\x61\x6e\x64\x65\x72\x73".decode("UTF-8"))
```
```fallback
anders
```
Catching the lion:
```python
with open("animals.txt", "rb") as binary_file:
    binary_file.seek(16)         # skip the first 16 bytes
    lion = binary_file.read(4)   # read the next 4 bytes
    print(lion.decode("UTF-8"))
```
```fallback
lion
```
The Python `bytes` type can be created from ASCII characters or hex escape sequences, so all of these are the same:
```python
a = b"\x61\x6e\x64\x65\x72\x73"
a2 = b"anders"
a3 = b"a\x6e\x64\x65\x72\x73"
print(a == a2 == a3)
```
```fallback
True
```
The `bytes` type is immutable, so we need to use the `bytearray` class to modify sequences of bytes.
```python
a = bytearray(b"anders")
```
A bytearray is a sequence of integers (0-255), so the `bytearray` above looks like this:
| `a[0]` | `a[1]` | `a[2]` | `a[3]` | `a[4]` | `a[5]` |
|--------|--------|--------|--------|--------|--------|
| 97 | 110 | 100 | 101 | 114 | 115 |
To modify a single element in the bytearray we have to assign an integer (0-255). We can use `ord()` to convert “A” to its integer value:
```python
a = bytearray(b"anders")
a[0] = ord(b"A")
print(a)
```
```fallback
bytearray(b'Anders')
```
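A short sketch of what "a sequence of integers" means in practice: listing the elements gives plain `int` values, a hex literal works just as well as a decimal one, and a value that does not fit in one byte is rejected with a `ValueError`:

```python
a = bytearray(b"anders")
print(list(a))  # every element is a plain int

a[0] = 0x41     # a hex literal is just another way to write 65
print(a)

try:
    a[0] = 256  # too large for a single byte
except ValueError as error:
    print("ValueError:", error)
```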
To go from decimal to hex:
```python
print(hex(65))
```
```fallback
0x41
```
When slicing, we get a bytearray back:
```python
a = bytearray(b'ANDers')
print(a[0:3])
```
```fallback
bytearray(b'AND')
```
So to replace a slice we assign `bytes`, not individual [[Decimal|decimal]] values:
```python
a = bytearray(b"anders")
a[0:3] = b"AND"
print(a)
```
```fallback
bytearray(b'ANDers')
```
How do we write these things to files?
```python
with open("anders.txt", "wb") as f:
    # "and" is written as ASCII characters, the rest as hex escape sequences
    f.write(b"and\x65\x72\x73")
```
Let’s read the file again:
```python
with open("anders.txt", "rb") as f:
    print(f.read())
```
```fallback
b'anders'
```
When we open a file in binary mode (`wb`/`rb`), we read and write `bytes` (`b"anders"`) and no encoding or decoding takes place.
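To see the difference, here is a small sketch that reads the same bytes back in binary mode and in text mode. It writes to a throwaway directory (my choice, not part of the example above) so it can be run anywhere:

```python
import os
import tempfile

# Use a temporary directory so nothing in the current folder is touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "anders.txt")

    with open(path, "wb") as f:
        f.write(b"and\x65\x72\x73")

    with open(path, "rb") as f:
        as_bytes = f.read()  # bytes, exactly as stored on disk

    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()   # str, decoded from UTF-8 while reading

print(repr(as_bytes), repr(as_text))
print(as_bytes.decode("utf-8") == as_text)
```

```fallback
b'anders' 'anders'
True
```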
It’s possible to create a file-like object in memory by using `io.BytesIO` ([[BytesIO]]). You might want to do this when you have some library that wants to write binary data to a file. You could for example generate a plot with `matplotlib` and add it to a PDF.
To “emulate” the `f` variable from above, we could do this:
```python
import io
f_in_memory = io.BytesIO()
f_in_memory.write(b"and\x65\x72\x73")
f_in_memory.seek(0)
print(f_in_memory.read())
f_in_memory.close()
```
```fallback
b'anders'
```
If you don’t seek back to the beginning (0) after writing, `read()` starts at the end of the buffer and returns an empty `b''`.
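A quick sketch of that behaviour, using `tell()` to show where the position is and `getvalue()` as a way to get the whole buffer regardless of the position:

```python
import io

buffer = io.BytesIO()
buffer.write(b"anders")

position_after_write = buffer.tell()  # just past the last byte written
without_seek = buffer.read()          # nothing left to read from here
whole_buffer = buffer.getvalue()      # ignores the current position

buffer.seek(0)
after_seek = buffer.read()
buffer.close()

print(position_after_write, without_seek, whole_buffer, after_seek)
```

```fallback
6 b'' b'anders' b'anders'
```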
Another cool alternative is `tempfile.SpooledTemporaryFile`, which keeps the data in memory (in a [[BytesIO]] or `StringIO`) until it grows past `max_size` and then moves it to a real file on disk:
```python
import tempfile
with tempfile.SpooledTemporaryFile(max_size=100, mode="w+t", encoding="utf-8") as temp:
    print("temp: {!r}".format(temp))
    for i in range(3):
        temp.write("This line is repeated over and over.\n")
        # _rolled and _file are private attributes, used here only to peek inside
        print(temp._rolled, temp._file)
```
```fallback
temp: <tempfile.SpooledTemporaryFile object at 0x1065837f0>
False <_io.TextIOWrapper encoding='utf-8'>
False <_io.TextIOWrapper encoding='utf-8'>
True <_io.TextIOWrapper name=3 mode='w+t' encoding='utf-8'>
```
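The same rollover can be watched in binary mode. This is a rough sketch that, like the example above, peeks at the private `_rolled` and `_file` attributes; they are implementation details of CPython and may change between versions:

```python
import tempfile

with tempfile.SpooledTemporaryFile(max_size=100, mode="w+b") as temp:
    temp.write(b"x" * 50)
    small_state = temp._rolled                 # still below max_size
    small_backend = type(temp._file).__name__  # in-memory buffer class

    temp.write(b"x" * 100)                     # total is now above max_size
    large_state = temp._rolled

    temp.seek(0)
    size = len(temp.read())                    # all bytes survive the move

print(small_state, small_backend, large_state, size)
```

On my reading of the `tempfile` source, the in-memory backend in binary mode is an `io.BytesIO`, and the data moves to a real temporary file once more than `max_size` bytes have been written.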
## Resources
- [Unicode & Character Encodings in Python: A Painless Guide – Real Python](https://realpython.com/python-encodings-guide)
- [Strings, Unicode, and Bytes in Python 3: Everything You Always Wanted to Know](https://medium.com/better-programming/strings-unicode-and-bytes-in-python-3-everything-you-always-wanted-to-know-27dc02ff2686)
- [Python Cookbook, 3rd Edition](https://learning.oreilly.com/library/view/python-cookbook-3rd/9781449357337/)
- [io - Text, Binary, and Raw Stream I/O Tools — PyMOTW 3](https://pymotw.com/3/io/index.html)
- [Working with Binary Data in Python | DevDungeon](https://www.devdungeon.com/content/working-binary-data-python)
- [The deal with numbers: hexadecimal, binary and decimals - bin 0x0A](https://www.youtube.com/watch?v=mT1V7IL2FHY)
- [Abstraction by Professor David J. Malan - YouTube](https://www.youtube.com/watch?v=6V1sr0XV%5FNg)
- [CS50 Lectures 2018](https://www.youtube.com/playlist?list=PLhQjrBD2T382eX9-tF75Wa4lmlC7sxNDH)
- [Fluent Python - Chapter 4. Text versus Bytes](https://learning.oreilly.com/library/view/fluent-python/9781491946237/ch04.html#strings%5Fbytes%5Ffiles)
- [DEFCON 28 Safe Mode - PHV - Take Down The Internet! With Scapy - YouTube](https://www.youtube.com/watch?v=G9mp5jH69Tg)
## Related concepts
[[Endianess]]