How do I find what is using memory in a Python process in a production system?
My production system occasionally exhibits a memory leak that I have not been able to reproduce in the development environment. I've used a Python memory profiler (specifically, Heapy) in development with some success, but it can't help me with things I can't reproduce, and I'm reluctant to instrument our production system with Heapy because it takes a while to do its thing and its threaded remote interface does not work well in our server.

What I think I want is a way to dump a snapshot of the production Python process (or at least of gc.get_objects), and then analyze it offline to see where it is using memory.

How do I get a core dump of a Python process like this?

Once I have one, how do I do something useful with it?
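To make it concrete: even something as rough as this untested sketch, run inside the live process now and then, would cover most of what I'm after, since the resulting files could be diffed offline:

```python
# Rough sketch of the snapshot I have in mind: counts of live objects per type,
# written to a timestamped file so that several snapshots can be compared offline.
import collections
import gc
import json
import time


def dump_type_counts(path_prefix='/tmp/objcounts'):
    counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
    path = '{}-{}.json'.format(path_prefix, int(time.time()))
    with open(path, 'w') as f:
        json.dump(counts.most_common(), f, indent=2)
    return path
```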
---
I will expand on Brett's answer from my recent experience. The Dozer package is well maintained, and despite advancements, like the addition of tracemalloc to the stdlib in Python 3.4, its gc.get_objects counting chart is my go-to tool for tackling memory leaks. Below I use dozer > 0.7, which has not been released at the time of writing (well, because I recently contributed a couple of fixes there).

Example

Let's look at a non-trivial memory leak. I'll use Celery 4.4 here and will eventually uncover the feature that causes the leak (and because it's a bug/feature kind of thing, it can be called mere misconfiguration caused by ignorance). So there's a Python 3.6 venv where I pip install celery < 4.5, and the following module, demo.py:
```python
import time

import celery

redis_dsn = 'redis://localhost'
app = celery.Celery('demo', broker=redis_dsn, backend=redis_dsn)


@app.task
def subtask():
    pass


@app.task
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)


if __name__ == '__main__':
    task.delay().get()
```
Basically it's a task that schedules a bunch of subtasks. What can go wrong?
I'll use procpath to analyse the Celery node's memory consumption (pip install procpath). I have 4 terminals:

1. procpath record -d celery.sqlite -i1 "$..children[?('celery' in @.cmdline)]" to record the Celery node's process tree stats (a rough psutil-based stand-in is sketched after this list)
2. docker run --rm -it -p 6379:6379 redis to run Redis, which will serve as the Celery broker and result backend
3. celery -A demo worker --concurrency 2 to run the node with 2 workers
4. python demo.py to finally run the example

(4) will finish in under 2 minutes.
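If procpath isn't an option in your environment, terminal (1) can be approximated with psutil; this is a rough, untested stand-in that only records RSS rather than the full process tree stats:

```python
# Sample RSS of every process whose command line mentions 'celery' once per
# second and append it to a CSV file. Stop with Ctrl-C.
import csv
import time

import psutil

with open('celery_rss.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ts', 'pid', 'rss_mib'])
    while True:
        now = time.time()
        for proc in psutil.process_iter(['pid', 'cmdline', 'memory_info']):
            cmdline = ' '.join(proc.info['cmdline'] or [])
            if 'celery' in cmdline and proc.info['memory_info']:
                writer.writerow(
                    [now, proc.info['pid'], proc.info['memory_info'].rss / 2**20]
                )
        f.flush()
        time.sleep(1)
```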
Then I use sqliteviz (pre-built version) to visualise what procpath has recorded. I drop celery.sqlite there and use this query:

```sql
SELECT datetime(ts, 'unixepoch', 'localtime') ts, stat_pid, stat_rss / 256.0 rss
FROM record
```

And in sqliteviz I create a line chart trace with X=ts, Y=rss, and add a split transform By=stat_pid. The resulting chart is:

This shape is likely pretty familiar to anyone who has fought with memory leaks.
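If you prefer to stay in Python rather than use sqliteviz, roughly the same chart can be drawn with sqlite3 and matplotlib (a quick sketch, assuming the same celery.sqlite file and columns as the query above; the / 256.0 converts 4 KiB pages to MiB):

```python
# Plot per-PID RSS (MiB) over time from the procpath recording.
import sqlite3
from collections import defaultdict

import matplotlib.pyplot as plt

conn = sqlite3.connect('celery.sqlite')
rows = conn.execute(
    'SELECT ts, stat_pid, stat_rss / 256.0 FROM record ORDER BY ts'
)

series = defaultdict(lambda: ([], []))
for ts, pid, rss_mib in rows:
    xs, ys = series[pid]
    xs.append(ts)
    ys.append(rss_mib)

for pid, (xs, ys) in series.items():
    plt.plot(xs, ys, label='PID {}'.format(pid))

plt.xlabel('timestamp, epoch seconds')
plt.ylabel('RSS, MiB')
plt.legend()
plt.show()
```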
Finding leaking objects
Now it's time for dozer. I'll show the non-instrumented case (you can instrument your code in a similar way if you are able to). To inject the Dozer server into the target process I'll use Pyrasite. There are two things to know about it:

- To run it, ptrace has to be configured as "classic ptrace permissions": echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, which may be a security risk (a quick check is sketched right after this list)
- There is a non-zero chance that your target Python process will crash
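A trivial way to check the current setting before injecting anything:

```python
# Yama's ptrace scope must be 0 ("classic ptrace permissions") for Pyrasite.
with open('/proc/sys/kernel/yama/ptrace_scope') as f:
    print('ptrace_scope =', f.read().strip())
```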
With that caveat I:

- pip install https://github.com/mgedmin/dozer/archive/3ca74bd8.zip (that's the to-be 0.8 I mentioned above)
- pip install pillow (which dozer uses for charting)
- pip install pyrasite
After that I can get a Python shell in the target process:

```
pyrasite-shell 26572
```

And inject the following, which will run Dozer's WSGI application using the stdlib's wsgiref server:

```python
import threading
import wsgiref.simple_server

import dozer


def run_dozer():
    app = dozer.Dozer(app=None, path='/')
    with wsgiref.simple_server.make_server('', 8000, app) as httpd:
        print('Serving Dozer on port 8000...')
        httpd.serve_forever()


threading.Thread(target=run_dozer, daemon=True).start()
```
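As an aside on the "instrument your code in a similar way" note: if you control the target service and it is a WSGI application, you don't need injection at all, since Dozer is WSGI middleware. A minimal sketch, with application standing in for whatever WSGI callable your service already exposes:

```python
# Instrumented case: wrap an existing WSGI app in Dozer middleware so the
# Dozer UI is served by the app itself (here under /_dozer).
import wsgiref.simple_server

import dozer


def application(environ, start_response):  # stand-in for your real WSGI app
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


application = dozer.Dozer(app=application, path='/_dozer')

if __name__ == '__main__':
    with wsgiref.simple_server.make_server('', 8080, application) as httpd:
        httpd.serve_forever()
```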
Opening http://localhost:8000 in a browser, you should see something like:

After that I run python demo.py from (4) again and wait for it to finish. Then in Dozer I set "Floor" to 5000, and here's what I see:

Two types related to Celery grow as the subtasks are scheduled:

- celery.result.AsyncResult
- vine.promises.promise

weakref.WeakMethod has the same shape and numbers and must be caused by the same thing.

Finding root cause
At this point, from the leaking types and the trends, it may already be clear what's going on in your case. If it's not, Dozer has a "TRACE" link per type, which allows tracing the chosen object's referrers (gc.get_referrers) and referents (gc.get_referents), inspecting each object's attributes along the way, and continuing the traversal further through the graph.

But a picture says a thousand words, right? So I'll show how to use objgraph to render a chosen object's dependency graph (an optional objgraph sanity check follows this list):

- pip install objgraph
- apt-get install graphviz
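The optional sanity check mentioned above: once objgraph is importable inside the target process, it can cross-check what Dozer's chart shows, directly from the Pyrasite shell:

```python
# List the most numerous object types, and the types whose counts grew since
# the previous call to show_growth() (call it again later to see the delta).
import objgraph

objgraph.show_most_common_types(limit=10)
objgraph.show_growth(limit=10)
```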
Then:
- I run python demo.py from (4) again
- in Dozer I set floor=0, filter=AsyncResult
- and click "TRACE", which should yield:
Then in the Pyrasite shell run:
```python
import objgraph  # in the same pyrasite-shell session

objgraph.show_backrefs([objgraph.at(140254427663376)], filename='backref.png')
```
The PNG file should contain:
Basically there's some Context object containing a list called _children that in turn contains many instances of celery.result.AsyncResult, which leak. Changing the filter to Filter=celery.*context in Dozer, here's what I see:

So the culprit is celery.app.task.Context. Searching for that type would certainly lead you to the Celery task page.
Quickly searching for "children" there, here's what it says:

trail = True

If enabled the request will keep track of subtasks started by this task, and this information will be sent with the result (result.children).

So I disable the trail by setting trail=False:

```python
@app.task(trail=False)
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)
```
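As an aside, the trail is exactly what populates result.children on the caller side, which is why the parent keeps an AsyncResult per scheduled subtask while trail=True. Roughly:

```python
# Illustration with the default trail=True: the parent result carries an
# AsyncResult for every subtask scheduled inside the task.
from demo import task

result = task.delay()
result.get()
print(len(result.children or []))  # on the order of 10_000 here
```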
Then restarting the Celery node from (3) and running python demo.py from (4) yet again shows this memory consumption:

Problem solved!
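As a final cross-check, the same Pyrasite shell can confirm that celery.result.AsyncResult no longer piles up inside the worker:

```python
# Count live objects per type in the worker process and print the top 10;
# after the trail=False fix, AsyncResult should no longer dominate the list.
import collections
import gc

counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
print(counts.most_common(10))
```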