String interning; discovering how python is saving my RAM.

When it comes to code conundrums, a quick visit to google or chatgpt usually satisfies my curiosity. However, I felt reluctant to read a bunch of blog posts and opted to write some code instead. This led me to discovering string interning in Python.

I have written a fair amount of code and by no means I am an expert. But over the years there has been a natural decline in discovering new things. That's the reason why I am so excited to share this article.

Unpacking the internal workings of CPython by observing a series of code snippets complimented by poring over specifics parts of python docs and aided some excellent articles. The very articles which I expertly avoided earlier, not because I chose not click but because I did not possess the knowledge to type the right search terms at the beginning.

A high level view of the following article: 1. Context and observing the behaviour: I explain my motivation to understand parts of Python behaviour and the onset of my quest. 2. Learning about the internals: I observed enough and was ready for some answers. I dig into python docs, continue observing and finally find out some truths. 3. Where does this come to play: I make my conclusions while carrying on looking at more interesting angles of string interning.

how python saves memory by string interning

Context and observing the behaviour.

I was watching a video on programming basics as I was interested in how people explain these concepts to beginners. At one point there was a discussion on "pass by value" and "pass by reference". I have read about them in my undergraduate course. But I rarely think of them when I code in Python.

Instead of watching a video on the topic, I began to write some code to observe how arguments are passed to a function in Python.

I start with something familiar. Passing a list to a function and mutating the list will mutate the variable outside the function.

pets = ["dogs", "kittens"]
def add_pets(pets):
    pets.append("hamsters")


add_pets(pets)
print(pets)

# output
['dogs', 'kittens', 'hamsters']

To further confirm my understanding, I used built-in id function to verify that pets inside the function and pets outside the function point to the same object.

pets = ["dogs", "kittens"]
def add_pets(pets):
    pets.append("hamsters")
    print(f"ID of pets inside add_pets {id(pets)}")

add_pets(pets)
print(f"ID of pets outside add_pets {id(pets)}")

# output
ID of pets inside add_pets 4302154624
ID of pets outside add_pets 4302154624

Lists are mutable in Python, but strings are not. So I try the following code.

pet = "Panda"
def zoo(pet):
    pet += " Cow"
    print(f"ID of '{pet}': {id(pet)}")

zoo(pet)
print(pet)
print(f"ID of '{pet}': {id(pet)}")

# output
ID of 'Panda Cow': 4304152176
Panda
ID of 'Panda': 4304151280

Both the value and the memory location of the variable pet inside the function are different from those outside.

So does that mean, lists are passed by reference but strings are passed by value? I just wanted to try running one more piece of code before Googling that.

pet = "Panda"
def zoo(pet):
    print(f"ID of '{pet}' inside foo: {id(pet)}")

print(f"ID of '{pet}' outside foo: {id(value)}")
zoo(pet)

The result of the above code was a bit unexpected for me.

# ----output----
ID outside foo: 4343585264
ID inside foo: 4343585264

It seems that inside the function the variable that are passed refer to the same object location of the variables outside the function. When the variable gets reassigned, as in the case of immutable string, the variable starts pointing to a different memory location.

Learning about the internals.

I quickly found the FAQ page on docs.python.org. Anwering a question about call-by-refernce in Python, it reads as following --

"Remember that arguments are passed by assignment in Python. Since assignment just creates references to objects, there’s no alias between an argument name in the caller and callee, and so no call-by-reference per se."

"assignment just creates references to objects", that matches with what I have observed before. When I called function zoo by passing pet, the pet inside the function had the same reference as outside. The quote from the docs talks about assignment in general rather than being limited to passing variables to function. That made me write the following code.

color = "azure"
cloud = color

id(color) == id(cloud)
# ----output----
True

rainforest = "amazon"
cloud = rainforest

id(color) == id(rainforest)
# ----output----
True

colors = ["azure", "black"]
clouds = colors

id(colors) == id(clouds)
# ----output----
True

Since cloud was assigned to color or rainforest it referenced to either of their memory locations, as do colors and clouds. What if the variables are assigned directly to the values instead?

color = "azure"
cloud = "azure"

id(color) == id(cloud)
# ----output----
True

colors = ["azure", "black"]
clouds = ["azure", "black"]

id(colors) == id(clouds)
# ----output----
False

The strings color and cloud both reference the same memory location even though they have not been assigned to each other. However, the lists colors and clouds are pointing to different locations in the memory.

I printed the bytecode of the above and observed the parts relating to strings and list assignment.

  1           2 LOAD_CONST               0 ('azure')
              4 STORE_NAME               0 (color)

  2           6 LOAD_CONST               0 ('azure')
              8 STORE_NAME               1 (cloud)
...
  7         102 LOAD_CONST               0 ('azure')
            104 LOAD_CONST               3 ('black')
            106 BUILD_LIST               2
            108 STORE_NAME               4 (colors)

  8         110 LOAD_CONST               0 ('azure')
            112 LOAD_CONST               3 ('black')
            114 BUILD_LIST               2
            116 STORE_NAME               5 (clouds)

Whenever there is a string literal LOAD_CONST operation is being performed. What exactly is this LOAD_CONST?

Python docs have a minimal definition.

Pushes co_consts[consti] onto the stack.

I find out that co_consts is the list (in tuple form) of all constants in bytecode. Python code is not directly interpreted at the time of execution instead it is first compiled. One of the outputs of compiling is bytecode, which is the set of instructions executed by the python virtual machine.

Everything in python is an object and that applies to the code that is compiled. We can access the code object of a function like this:

def foo():
    a = "jello"

foo.__code__
#---- output ----
<code object foo at 0x100e1c920, file "<stdin>", line 1>

And the code object hosts the co_consts tuple.

foo.__code__.co_consts
#----output----
(None, 'jello')

A fun fact I discovered during this process, I can see None part of co_consts of my foo. But I don't use None anywhere. It turns out Python only ever has one instance of None at all times. All the None objects we use and pass around in our code point to a single instance of None.

Where does this come to play

Once I discovered string interning I quickly imagined how all the memory must have been getting saved, and my code is secretly being optimised without my knowledge. While it is definitely an optimization that speeds up a lot of computation, it's apparently not necessarily in action all the time.

What I found to be puzzling at first is the fact that strings with spaces are not interned.

color = "ms azure"
cloud = "ms azure" 
id(color) == id(cloud)
# ----output----
False

I found this excellent deep dive on Youtube, and its shown the CPython code checks for alphanumeric strings with underscores but no spaces when deciding to intern a string.

From CPython's codeobject.c

/* all_name_chars(s): true iff s matches [a-zA-Z0-9_]* */
static int
all_name_chars(PyObject *o)
{
    const unsigned char *s, *e;

    if (!PyUnicode_IS_ASCII(o))
        return 0;

    s = PyUnicode_1BYTE_DATA(o);
    e = s + PyUnicode_GET_LENGTH(o);
    for (; s != e; s++) {
        if (!Py_ISALNUM(*s) && *s != '_')
            return 0;
    }
    return 1;
}

When we think about it, strings are interned at compile time. So it doesn't come handy in many cases of programming which often deal with string input over HTTP, command line or files. To confirm I ran the following code.

hardcode_animal = "chimpanzee"
input_animal = input("enter animal name: ")

print("Are hardcode_animal and input_animal have same id:", id(hardcode_animal) == id(input_animal))
#---output--- 
enter animal name: chimpanzee
Are hardcode_animal and input_animal have same id: False

This brings us to the question, where is it actually used? In managing (hundreds of) thousands of lines of code that every python program needs. Even a simple print("hello world") requires the several functions and classes to be imported into the interpreter. Imagine a more complex program with a long list of third party dependencies. Names of functions, methods, classes and other variables are the perfect candidates for string interning.

By interning strings, Python is saving memory on any repeated string literals in the code. This further comes into play during dictionary key lookups, in case of string keys. Any string literal comparisons exploit performance upgrade as the interpreter can quickly check if two interned strings share the same memory address, avoiding character-by-character comparisons.

So that's my adventure into CPython's string interning. Understanding how strings are interned at compile time, and the nuances of when and where this optimization comes into play, sheds light on the intricacies of Python's memory management. While string interning may not be evident in everyday code, its impact is profound in managing large codebases and optimizing the performance of Python programs.

Further reading:

If you liked this article, you may like some of my previous articles:

How Django keeps track of your model classes?

System Design Case Study: How do millions of people book meetings with God?

Published: Fri 12 January 2024
By Sri

In Python.