Background:
Encoding and decoding became important once information had to be exchanged between devices, programs, and platforms that may represent the same characters differently in memory. In Python the issue became acute with Python 3, which strictly separates strings (str), which are sequences of Unicode code points, from bytes (bytes), which are sequences of 8-bit values.
Problem:
When working with files, networks, and external systems, it is often necessary to convert data between byte representations and strings. Incorrect use of encodings can lead to UnicodeEncodeError and UnicodeDecodeError, data integrity issues, and problems with supporting different languages.
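As a small illustration of the two error types named above, the sketch below forces each one by using the deliberately narrow "ascii" codec. The sample string is an assumption chosen for illustration.

```python
text = "Привет"  # illustrative sample containing non-ASCII characters

# Encoding fails when the target encoding cannot represent a character.
try:
    text.encode("ascii")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
    print("encode failed:", exc.reason)

# Decoding fails when the bytes are not valid in the assumed encoding.
data = text.encode("utf-8")
try:
    data.decode("ascii")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc
    print("decode failed:", exc.reason)
```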
Solution:
In Python, str.encode() converts a string to bytes, and bytes.decode() performs the reverse conversion. The most common encoding is "utf-8":
    text = "Привет, мир!"
    encoded = text.encode('utf-8')     # Encode the string to bytes
    print(encoded)                     # b'\xd0\x9f\xd1\x80...'
    decoded = encoded.decode('utf-8')  # Decode back to a string
    print(decoded)                     # Привет, мир!
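One way to see why the two types are not interchangeable is to compare lengths: the same text has one length in characters and another in bytes. This is a minimal sketch, and the sample string is an assumption chosen for illustration.

```python
text = "Привет, мир!"           # illustrative sample
encoded = text.encode("utf-8")

char_count = len(text)          # counts Unicode code points
byte_count = len(encoded)       # counts bytes; each Cyrillic letter takes 2 bytes in UTF-8

print(char_count, byte_count)   # 12 21
print(encoded.decode("utf-8") == text)  # True
```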
Key features:
Specify the encoding explicitly when working with files (open(..., encoding='utf-8')).
What happens if you try to decode a byte string with an incorrect encoding?
Decoding bytes with the wrong encoding either raises UnicodeDecodeError or silently produces an incorrect string. For example:
    b = b'\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82'  # "Привет" encoded as UTF-8
    wrong = b.decode('latin-1')  # latin-1 accepts any byte, so no error is raised
    print(wrong)                 # Outputs gibberish instead of "Привет"
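Both outcomes can be shown side by side. The sketch below assumes cp1251 as the "other" encoding for the failing case (an illustrative choice), and it relies on latin-1 mapping every byte to exactly one character, which is why that particular mistake happens to be reversible.

```python
utf8_bytes = "Привет".encode("utf-8")
cp1251_bytes = "Привет".encode("cp1251")  # assumed legacy encoding, for illustration

# Outcome 1: the bytes are not valid in the assumed encoding, so decoding raises.
try:
    cp1251_bytes.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# Outcome 2: decoding "succeeds" but yields the wrong text.
wrong = utf8_bytes.decode("latin-1")

# latin-1 is a one-to-one byte mapping, so re-encoding restores the original bytes.
repaired = wrong.encode("latin-1").decode("utf-8")

print(raised)                # True
print(wrong == "Привет")     # False
print(repaired == "Привет")  # True
```

With encodings that are not one-to-one, or when errors were replaced during decoding, this kind of repair is generally not possible.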
Can you concatenate a string and a byte string directly with +?
No. This will raise a TypeError exception.
    s = "abc"
    b = b"def"
    # s + b  # TypeError: can only concatenate str (not "bytes") to str
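To combine the two values, first convert one of them so that both operands have the same type. This is a minimal sketch, assuming the bytes hold UTF-8 text.

```python
s = "abc"
b = b"def"

as_text = s + b.decode("utf-8")    # decode the bytes, result is str
as_bytes = s.encode("utf-8") + b   # encode the string, result is bytes

print(as_text)   # abcdef
print(as_bytes)  # b'abcdef'
```

Which direction to convert depends on what the result is for: text processing calls for str, while writing to a socket or a binary file calls for bytes.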
Can you write a text file without explicitly specifying the encoding?
Yes, but this is considered bad practice: open() then falls back to a default encoding that depends on OS and locale settings, so a file written on one platform may be unreadable on another.
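To see which encoding a given machine picks, the file object can be asked directly. This is a sketch using a throwaway temporary file (the file name is hypothetical). The output differs between systems, which is the point. The locale value is printed only for comparison, and how closely the two agree may depend on the Python version and settings.

```python
import locale
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "probe.txt")  # hypothetical scratch file
    with open(path, "w") as f:             # encoding omitted on purpose
        implicit_encoding = f.encoding     # the encoding open() actually chose

print(implicit_encoding)
print(locale.getpreferredencoding(False))  # locale-derived value, for comparison
```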
A programmer writes a log file on Windows without specifying the encoding. When the log is opened on Linux or macOS, the Cyrillic text displays as gibberish.
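The scenario can be reproduced on a single machine. This sketch assumes the Windows default was cp1251 (a common setting for Russian locales) and that the reader assumes UTF-8. The file name and message are illustrative.

```python
import os
import tempfile

message = "Программа завершена успешно"  # illustrative log line

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.txt")

    # Stand-in for the writer's implicit default (assumed to be cp1251).
    with open(path, "w", encoding="cp1251") as f:
        f.write(message)

    # Stand-in for a reader that assumes UTF-8: strict decoding fails.
    try:
        with open(path, encoding="utf-8") as f:
            f.read()
        strict_read_failed = False
    except UnicodeDecodeError:
        strict_read_failed = True

    # A lenient reader substitutes U+FFFD for undecodable bytes instead of raising.
    with open(path, encoding="utf-8", errors="replace") as f:
        garbled = f.read()

print(strict_read_failed)   # True
print(garbled == message)   # False
```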
Pros:
Cons:
A programmer always specifies encoding='utf-8' when working with files:
    with open('log.txt', 'w', encoding='utf-8') as f:
        f.write('The program completed successfully')
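A quick way to check that an explicit encoding round-trips is to write, then read back both the raw bytes and the decoded text. This is a sketch using a temporary directory and an illustrative Cyrillic message, neither of which comes from the original example.

```python
import os
import tempfile

message = "Программа завершена успешно"  # illustrative log line

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "log.txt")

    with open(path, "w", encoding="utf-8") as f:
        f.write(message)

    with open(path, "rb") as f:          # raw bytes as stored on disk
        raw = f.read()

    with open(path, encoding="utf-8") as f:
        restored = f.read()

print(raw == message.encode("utf-8"))  # True
print(restored == message)             # True
```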
Pros:
Cons: