Lessons Learned from Debugging Python

When developing large-scale backend code, the ability to predict how an application will behave on the underlying system is crucial for building a scalable and stable architecture. Unfortunately, high-level scripting languages tend to hide actual system behavior from developers, making troubleshooting, debugging, and scaling difficult, and Python is no exception. We recently had the chance to debug a memory-related problem in Python code. Like many dynamically typed languages, Python comes with many convenient features, including automatic garbage collection, so developers can focus on writing business logic. However, relying solely on the correctness of business logic can obscure problems that happen at lower levels.

In this blog, I explore our team’s debugging efforts that led to the discovery of an unpatched memory problem in the Python 2.7 that shipped with RHEL/CentOS 7.3. While new development employs Python 3, Python 2.7 on RHEL/CentOS 7 remains a popular technology stack widely trusted by many enterprise customers. This case study is also indicative of Endgame’s approach to bug discovery and disclosure, the challenges we encounter, and the solutions we apply when integrating open source libraries into our own software development process, including our migration to Go.

 

Background

As part of the engineering backend team, we constantly process hundreds of thousands of messages, and face operational constraints that demand high precision on scale, speed, and scope. Last year, we received an incident report stating that requests were not being properly processed. Most Linux distros maintain comprehensive logs, usually under the /var/log directory, with different filenames distinguishing different components. This is usually a good place to start debugging, because it provides comprehensive information on the state of the overall system.

From the log files, we found an interesting hint indicating that our Python process had been killed by the kernel’s Out of Memory (OOM) killer. The kernel log showed something similar to the output below:

[230026.780153] [55149]  1000 55149  1155060     1206     111      211             0 postmaster
[230026.780155] [62104]  1000 62104  1155060     1204     111      211             0 postmaster
[230026.780158] [66951]  1000 66951  1155060     1212     111      211             0 postmaster
[230026.780162] Out of memory: Kill process 125331 (python) score 681 or sacrifice child

 

The Out of Memory (OOM) killer is a component of the Linux kernel. It maintains a “badness score” for each process and, when memory runs short, terminates the process whose score is highest. The heuristic behind the score factors in the process’s memory footprint, CPU time, and runtime, along with some kernel-internal adjustments; in the end, the more memory a process uses, the higher its score. This prevents the entire system from crashing when insufficient memory remains for the kernel.
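The kernel exposes each process’s current badness score through procfs, so you can check how attractive a target your process is before the OOM killer ever fires. A minimal sketch (Linux-only; on other systems the file simply doesn’t exist and the function returns None):

```python
def oom_score(pid="self"):
    """Read the kernel's current OOM badness score for a process from
    /proc/<pid>/oom_score (Linux only; returns None if unavailable)."""
    try:
        with open("/proc/%s/oom_score" % pid) as f:
            return int(f.read().strip())
    except (IOError, OSError, ValueError):
        return None

print(oom_score())  # the higher the score, the sooner the process is killed
```

Watching this value grow alongside memory usage is a cheap early warning for the kind of kill we saw in the log above.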

Further tracing of the logs showed that the process ID matched a Python process that ran database queries through an Object-Relational Mapping (ORM) library. Why was our Python code selected to be killed? Did we have a bad query that loaded too much data into process memory? Did one of the libraries we use have a memory leak? We needed to analyze the problem step by step.

 

Validating SQL

ORM frameworks are frequently used because they provide convenient ways to write applications that interact with databases without writing SQL, resulting in less context switching for our engineers, faster development, and more readable code. An ORM does this by generating SQL on behalf of developers, providing an additional abstraction layer. However, an ORM may not always issue the exact query the developer intended: it may fetch more information from the database, or it may run more queries than needed. This can make debugging harder for engineers.

Fortunately, most ORM libraries provide ways to show the actual SQL queries that are run on the relational database management system (RDBMS). Even without such features, most RDBMSs can log statements on the server side. Because what the database sees is what matters, we extracted the SQL statements from the database server and validated that the query was constructed and executed as intended.
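Our production setup used an ORM against a server-side RDBMS, but the same idea can be demonstrated with nothing beyond the standard library: sqlite3’s set_trace_callback hands you every statement the engine actually executes, which stands in here for server-side statement logging. The table and query below are made up for illustration:

```python
import sqlite3

executed = []
conn = sqlite3.connect(":memory:")
conn.set_trace_callback(executed.append)  # called with each statement the engine runs

conn.execute("CREATE TABLE tasks (task_id INTEGER, account_id TEXT, ready INTEGER)")
conn.execute(
    "SELECT task_id FROM tasks WHERE account_id = ? AND ready = 1",
    ("acct-1",),
)

for stmt in executed:
    print(stmt)  # compare what actually ran against what you intended
```

Diffing the recorded statements against your mental model of the query is exactly the validation step described above, just in miniature.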

Python code

result = session.query(Task).filter_by(account_id=account_id, ready=True).all()

 

Actual query observed on the server side

< 2017-12-09 00:22:32.283 UTC >LOG:  statement: SELECT tasks.task_id AS tasks_task_id, tasks.account_id AS tasks_account_id, tasks.user_id AS tasks_user_id
        FROM tasks
        WHERE tasks.account_id = '####' AND tasks.ready = 'true'

 

The generated query looks as expected. If our query matches expectations, then the problem must be located at a lower layer, so we moved on to the next step: tracing our code with the Python Debugger (pdb).

 

Python Debugger

Python comes with a handy debugger. If the source code is accessible, the simplest way to run the debugger is to import the pdb module and then add the line ‘pdb.set_trace()’ above the desired breakpoint. From there, each function call can be stepped through. Even better, one can import additional modules that may be helpful with debugging. Because we are interested in the memory usage of the code, we measured it with top. Among the many metrics top reports, the one we care about is the resident memory column (RES), which reflects the amount of physical memory the process is using. With one terminal running the debugger and another running top, we were ready for memory analysis.
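top itself reads these numbers from procfs, so the same RES value can be sampled programmatically, which is handy from inside a pdb session. A minimal sketch (Linux-only, since it parses /proc/self/status):

```python
def resident_kib():
    """Return this process's resident set size (top's RES column) in KiB,
    parsed from /proc/self/status; returns None on non-Linux systems."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # field is reported in kB
    except (IOError, OSError):
        pass
    return None
```

Calling resident_kib() before and after a suspect statement gives the same signal as watching top, without the second terminal.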

 

 

Stepping through each line of the code while observing the resident memory size, we traced the code to the section where our process allocates space to hold the database query results. Since the Python runtime handles garbage collection, we expected the memory allocated for the ORM query to be returned to the operating system once the variables went out of scope.

 

 

However, even with the variable out of scope, the process still didn’t free the memory; resident usage remained above several GB. Did our library have a memory leak? After quickly Googling for known memory leaks and stepping through the library code, we found no obvious indication of one.

Did our code still hold references to the objects? CPython’s memory management is built on reference counting: the runtime keeps a count of references to each object and frees the object as soon as that count drops to zero, while a cyclic garbage collector periodically cleans up reference cycles that counting alone cannot detect. This is convenient because engineers do not need to oversee memory management themselves; the system takes care of it. However, it also means that space won’t be freed until every reference to an object is discarded. If any code or library silently holds references to the objects, the garbage collector cannot reclaim the space. This can be checked quickly by running the gc module’s get_count() function before and after the code snippet in question. We did not see any evidence of additional references being created.
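Both checks are easy to sketch with the stdlib: sys.getrefcount shows how many references an object has (always at least one more than you’d guess, because the call argument itself is a reference), and gc.get_count reports how many tracked objects have accumulated in each collector generation:

```python
import gc
import sys

obj = {"EndgameRules": 1}
base = sys.getrefcount(obj)   # at least 2: the name 'obj' plus the call argument
alias = obj                   # a second name is one more reference
assert sys.getrefcount(obj) == base + 1

before = gc.get_count()       # (gen0, gen1, gen2) tracked-object counts
junk = [{"x": i} for i in range(1000)]
after = gc.get_count()        # gen0 typically grows as new containers appear
del junk                      # drops the only reference; the dicts are freed
```

Comparing before and after around a suspect snippet is the quick reference-accumulation check described above.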

We next explored the possibility that our debugging session was influencing the timing of garbage collection. Based on how the code read and how Python cleans up memory, the space should have been released. To rule this out, we triggered a collection manually through the Garbage Collector Interface (gc.collect()). It didn’t help: resident memory usage stayed above several GB.

Next, we removed third-party library code from our logic to eliminate variables that might distract from the root cause, and checked whether the symptom persisted. We stripped out third-party libraries as much as possible without changing the original logic of the code. Indeed, we were able to reproduce the same symptom from pure business logic, without importing any third-party libraries.

At this point, we realized that it was not the logic of our application, nor the library code. Something else was happening, so we decided to inspect system calls.

 

Isolating problems

When inspecting system calls, it is important to eliminate as much noise as possible. Because system call tracing of a high-level language generates a lot of noise, we created the minimal code that reproduced the same symptom. This can be achieved by focusing on the suspicious section of code and repeatedly eliminating lines that do not change its core behavior. Here’s a simplified version of the code snippet:

 

Note: Be careful. The following code may eat 1.5 GB of your memory!

1  import gc
2  import pdb
3  d = [ {'EndgameRules': 1} for x in xrange(10000000)]
4  pdb.set_trace()
5  del d
6  gc.collect()
7  # we need to pause here

 

Here, we construct a list of small objects, in this case dictionaries containing a single item. This mimics what the ORM library does: when we select records using the ORM, it constructs a list of small objects containing the data from the table columns. The only reason we generate such a large list is to make the memory growth noticeable, since small changes in memory are difficult to identify. Using this code, we reproduced resident memory usage above 1.5 GB. Most Python developers would agree that the memory allocated in line 3 should be returned to the system when line 5 executes. In reality, it remains allocated, and even after we intentionally trigger garbage collection in line 6, the usage stays high. Let’s take the gc and pdb modules out of our code snippet and observe the system calls it generates.

 

Deeper Dive

We had already checked whether we could reproduce this problem on Python versions newer than 2.7.6 and noticed a difference. The next step was to figure out what caused it. We compared the system calls this snippet makes under Python 2.7.6 on CentOS 7 and Python 2.7.12 on Ubuntu 16.04. strace is an awesome utility for system call analysis. The easiest way to run it is to prefix the actual command with strace on the command line. Since strace writes to stderr and its output is usually very long, it is a good idea to capture it to a file with the -o option. For us, that is strace -o strace_output.txt python python_code.py.

The following sample code creates and deletes a list of Python dictionaries.

a = [ {"EndgameRules": x} for x in xrange(5000000)]
del a

 

Strace output for Python 2.7.6 on CentOS 7.3

munmap(0x7f0b2c39e000, 8192)            = 0
brk(NULL)                               = 0x1f2e000
brk(0x1f4f000)                          = 0x1f4f000
brk(NULL)                               = 0x1f4f000
brk(0x1f70000)                          = 0x1f70000
brk(NULL)                               = 0x1f70000
brk(0x1f91000)                          = 0x1f91000
brk(NULL)                               = 0x1f91000
brk(0x1fb2000)                          = 0x1fb2000
brk(NULL)                               = 0x1fb2000
brk(0x1fd4000)                          = 0x1fd4000
… Truncated …
brk(NULL)                               = 0x5e664000
brk(0x5e685000)                         = 0x5e685000
munmap(0x7f814c1a8000, 40218624)        = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f8156d695e0}, {0x7f815708c540, [], SA_RESTORER, 0x7f8156d695e0}, 8) = 0
brk(NULL)                               = 0x5e685000
brk(NULL)                               = 0x5e685000
brk(0x1404000)                          = 0x1404000
brk(NULL)                               = 0x1404000
exit_group(0)                           = ?
+++ exited with 0 +++

 

Strace output for Python 2.7.12 on Ubuntu 16.04

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac821000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac7e1000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac7a1000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac761000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac721000
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7ac6e1000
… Truncated …
munmap(0x7f913380c000, 262144)          = 0
munmap(0x7f913384c000, 262144)          = 0
munmap(0x7f913388c000, 262144)          = 0
munmap(0x7f91338cc000, 262144)          = 0
munmap(0x7f913390c000, 262144)          = 0
munmap(0x7f913394c000, 262144)          = 0
munmap(0x7f913398c000, 262144)          = 0
munmap(0x7f91339cc000, 262144)          = 0
munmap(0x7f90e0d21000, 40218624)        = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f91349e9390}, {0x534a20, [], SA_RESTORER, 0x7f91349e9390}, 8) = 0
brk(0x162e000)                          = 0x162e000
munmap(0x7f90d9aa1000, 262144)          = 0
exit_group(0)                           = ?
+++ exited with 0 +++

 

Notice that we see more brk() calls in Python 2.7.6 and more mmap() calls in Python 2.7.12. mmap() is a well-known system call that “creates a new mapping in the virtual address space of the calling process.”
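The 262144-byte (256 KiB) anonymous mappings in the trace can be reproduced directly from Python’s standard mmap module, which wraps the same system call; this is a toy sketch, not what the allocator does internally:

```python
import mmap

# An anonymous, private 256 KiB mapping -- the same size as the
# mmap() calls in the Python 2.7.12 trace above.
arena = mmap.mmap(-1, 262144)
arena[:5] = b"hello"          # the mapping is ordinary writable memory
data = bytes(arena[:5])
arena.close()                 # close() performs the matching munmap()
```

The key property, visible in the trace, is that each such mapping can be handed back to the kernel individually with munmap(), independent of the heap.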

But what is brk()? According to the Linux man page, “brk() changes the location of the program break… Increasing the program break has the effect of allocating memory to the process; decreasing the break deallocates memory.” The strace output for Python 2.7.6 shows the process increasing the program break but never decreasing it, even after the ‘del a’ statement executes. We do see a small release of about 40 MB via munmap(), but Python 2.7.6 never returns the big chunk that was allocated through brk().

Python 2.7.12, on the other hand, uses mmap() to allocate memory, and that memory is cleaned up on ‘del a’ with matching munmap() calls, returning most of it to the OS. This is the expected behavior. So why the discrepancy?

 

Why the Difference?

Python 2.7.6 was released in November 2013. This bug is actually not new: it was initially reported upstream, patches followed, and it is referenced in Python’s documentation.

Version 2.7.7 changed the arena allocator to use mmap() if available. Python’s arena allocator serves objects smaller than 512 bytes, carving them out of fixed-size 256 KiB memory regions called arenas.

From this documentation, we also confirmed that this only becomes a big problem when a process expands arena space by generating many small objects under 512 bytes. This happens commonly in code that uses an ORM.
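The 512-byte threshold is easy to check with sys.getsizeof: the kind of one-column dictionary an ORM builds per row falls well under it, so every row lands in arena space, while large buffers bypass the arenas entirely (exact sizes are CPython-specific):

```python
import sys

row = {"task_id": 1}          # shape of a typical per-row ORM object
print(sys.getsizeof(row))     # well under the 512-byte small-object threshold

buf = bytearray(4096)         # a large allocation, served outside the arenas
print(sys.getsizeof(buf))
```

Millions of objects like row are exactly the workload that expanded our arena space.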

This behavior is not a memory leak. Python’s runtime will reuse the expanded space to hold other objects, as long as it isn’t already occupied. However, no other process on the same system can use that memory, even when Python isn’t using it. To use an analogy, the runtime behaves like a kid who won’t share toys after playing with them. That means that in a long-running job, such as a web service that happens to allocate many objects under 512 bytes, the memory won’t be returned to the system until the Python process terminates.

To work around this problem, keep arena space under control by limiting the number of small objects alive at once, for example by fetching fewer rows from the database with stricter query filters. Even with such measures, though, the problem is hard to predict, because this behavior is not visible from Python code that fetches data through an ORM.
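One generic way to bound how many row objects are alive at a time (a sketch of the pattern, not our production code) is to consume a lazy query result in fixed-size batches instead of materializing everything with .all():

```python
from itertools import islice

def in_batches(rows, batch_size=1000):
    """Yield lists of at most batch_size items from any iterable, so only
    one batch of small row objects is alive at a time."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Usage: iterate the (lazy) result batch by batch
for batch in in_batches(range(10), batch_size=4):
    print(batch)
```

Each batch becomes garbage before the next is built, which keeps the peak small-object population, and hence the arena footprint, bounded.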

We only saw the bug in the Python 2.7 that shipped with RHEL/CentOS 7.3; Python 2.7.7 or later is available on Ubuntu 17.04, and Macs do not have this problem because they ship later releases. We disclosed this bug to Red Hat, and the vendor addressed it in the RHEL 7.5 release. We appreciate this patch, and hope this example highlights our approach to debugging, some of the key decisions along the way, and the challenges of writing a scalable application.

 

Conclusion

Thanks to dynamically typed languages and third-party libraries, engineers can build quickly and implement their ideas without many obstacles. However, dynamically typed languages combined with many libraries introduce layers of abstraction under which computers don’t necessarily behave the way the source code suggests. When building an application that processes a lot of data in parallel, the correctness of our mental model is crucial. Application code reflects the engineer’s mental model, and when the system does not abide by that model, the result can be countless hours spent debugging.

Knowing this, we have migrated most of our components that require high performance and scalability away from Python and into Go. Go’s runtime still provides some level of abstraction, but we find it more efficient, more straightforward, and less surprising. For the most part, a process implemented in Go behaves according to how it is written. We still use Python where appropriate: quick prototypes, small tools, or small-scale work. But new projects will most likely be written in Go.

If you find yourself relying on a dynamically typed language and third-party libraries, ask yourself whether any magical behavior or convenience feature could lead the system to behave differently than your mental model expects. This gap may be difficult to identify until there is a problem. In the case of this Python bug, we didn’t recognize the deviation from our mental model until it came to our attention that requests weren’t being processed as intended. To ensure our systems behave as intended, we constantly seek ways to limit abstraction and gain efficiency. While learning a new language such as Go may seem like additional work, we believe it has increased our reliability and productivity while limiting the time we spend debugging through layers of abstraction.