How do I find what is using memory in a Python process in a production system?
My production system occasionally exhibits a memory leak that I have not been able to reproduce in the development environment. I've used a Python memory profiler (specifically, Heapy) in development with some success, but it can't help me with things I can't reproduce, and I'm reluctant to instrument our production system with Heapy because it takes a while to do its thing and its threaded remote interface does not work well in our server.

What I think I want is a way to dump a snapshot of the production Python process (or at least of gc.get_objects), and then analyze it offline to see where it is using memory.

How do I get a core dump of a Python process like this?

Once I have one, how do I do something useful with it?
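To make it concrete: even something as rough as this untested sketch, run inside the live process now and then, would cover most of what I'm after, since the resulting files could be diffed offline:

```python
# Rough sketch of the snapshot I have in mind: counts of live objects per type,
# written to a timestamped file so that several snapshots can be compared offline.
import collections
import gc
import json
import time


def dump_type_counts(path_prefix='/tmp/objcounts'):
    counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
    path = '{}-{}.json'.format(path_prefix, int(time.time()))
    with open(path, 'w') as f:
        json.dump(counts.most_common(), f, indent=2)
    return path
```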
---
I will expand on Brett's answer from my recent experience. The Dozer package is well maintained, and despite advancements, like the addition of tracemalloc to the stdlib in Python 3.4, its gc.get_objects counting chart is my go-to tool for tackling memory leaks. Below I use dozer > 0.7, which has not been released at the time of writing (well, because I recently contributed a couple of fixes there).

Example

Let's look at a non-trivial memory leak. I'll use Celery 4.4 here and will eventually uncover the feature that causes the leak (and because it's a bug/feature kind of thing, it can be called mere misconfiguration caused by ignorance). So there's a Python 3.6 venv where I pip install celery < 4.5, and the following module, demo.py:
```python
import time

import celery

redis_dsn = 'redis://localhost'
app = celery.Celery('demo', broker=redis_dsn, backend=redis_dsn)


@app.task
def subtask():
    pass


@app.task
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)


if __name__ == '__main__':
    task.delay().get()
```
Basically it's a task that schedules a bunch of subtasks. What can go wrong?
I'll use procpath to analyse the Celery node's memory consumption (pip install procpath). I have 4 terminals:

1. procpath record -d celery.sqlite -i1 "$..children[?('celery' in @.cmdline)]" to record the Celery node's process tree stats (a rough psutil-based stand-in is sketched after this list)
2. docker run --rm -it -p 6379:6379 redis to run Redis, which will serve as the Celery broker and result backend
3. celery -A demo worker --concurrency 2 to run the node with 2 workers
4. python demo.py to finally run the example

(4) will finish in under 2 minutes.
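If procpath isn't an option in your environment, terminal (1) can be approximated with psutil; this is a rough, untested stand-in that only records RSS rather than the full process tree stats:

```python
# Sample RSS of every process whose command line mentions 'celery' once per
# second and append it to a CSV file. Stop with Ctrl-C.
import csv
import time

import psutil

with open('celery_rss.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['ts', 'pid', 'rss_mib'])
    while True:
        now = time.time()
        for proc in psutil.process_iter(['pid', 'cmdline', 'memory_info']):
            cmdline = ' '.join(proc.info['cmdline'] or [])
            if 'celery' in cmdline and proc.info['memory_info']:
                writer.writerow(
                    [now, proc.info['pid'], proc.info['memory_info'].rss / 2**20]
                )
        f.flush()
        time.sleep(1)
```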
Then I use sqliteviz (pre-built version) to visualise what procpath has recorded. I drop celery.sqlite there and use this query:

```sql
SELECT datetime(ts, 'unixepoch', 'localtime') ts, stat_pid, stat_rss / 256.0 rss
FROM record
```

And in sqliteviz I create a line chart trace with X=ts, Y=rss, and add a split transform By=stat_pid. The resulting chart is:

This shape is likely pretty familiar to anyone who has fought with memory leaks.
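If you prefer to stay in Python rather than use sqliteviz, roughly the same chart can be drawn with sqlite3 and matplotlib (a quick sketch, assuming the same celery.sqlite file and columns as the query above; the / 256.0 converts 4 KiB pages to MiB):

```python
# Plot per-PID RSS (MiB) over time from the procpath recording.
import sqlite3
from collections import defaultdict

import matplotlib.pyplot as plt

conn = sqlite3.connect('celery.sqlite')
rows = conn.execute(
    'SELECT ts, stat_pid, stat_rss / 256.0 FROM record ORDER BY ts'
)

series = defaultdict(lambda: ([], []))
for ts, pid, rss_mib in rows:
    xs, ys = series[pid]
    xs.append(ts)
    ys.append(rss_mib)

for pid, (xs, ys) in series.items():
    plt.plot(xs, ys, label='PID {}'.format(pid))

plt.xlabel('timestamp, epoch seconds')
plt.ylabel('RSS, MiB')
plt.legend()
plt.show()
```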
Finding leaking objects
Now it's time for dozer. I'll show the non-instrumented case (you can instrument your code in a similar way if you are able to). To inject the Dozer server into the target process I'll use Pyrasite. There are two things to know about it:

- To run it, ptrace has to be configured as "classic ptrace permissions": echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope, which may be a security risk (a quick check is sketched right after this list)
- There is a non-zero chance that your target Python process will crash
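A trivial way to check the current setting before injecting anything:

```python
# Yama's ptrace scope must be 0 ("classic ptrace permissions") for Pyrasite.
with open('/proc/sys/kernel/yama/ptrace_scope') as f:
    print('ptrace_scope =', f.read().strip())
```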
With that caveat I:

- pip install https://github.com/mgedmin/dozer/archive/3ca74bd8.zip (that's the to-be 0.8 I mentioned above)
- pip install pillow (which dozer uses for charting)
- pip install pyrasite
After that I can get a Python shell in the target process:

```
pyrasite-shell 26572
```

And inject the following, which will run Dozer's WSGI application using the stdlib's wsgiref server:

```python
import threading
import wsgiref.simple_server

import dozer


def run_dozer():
    app = dozer.Dozer(app=None, path='/')
    with wsgiref.simple_server.make_server('', 8000, app) as httpd:
        print('Serving Dozer on port 8000...')
        httpd.serve_forever()


threading.Thread(target=run_dozer, daemon=True).start()
```
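As an aside on the "instrument your code in a similar way" note: if you control the target service and it is a WSGI application, you don't need injection at all, since Dozer is WSGI middleware. A minimal sketch, with application standing in for whatever WSGI callable your service already exposes:

```python
# Instrumented case: wrap an existing WSGI app in Dozer middleware so the
# Dozer UI is served by the app itself (here under /_dozer).
import wsgiref.simple_server

import dozer


def application(environ, start_response):  # stand-in for your real WSGI app
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return [b'hello']


application = dozer.Dozer(app=application, path='/_dozer')

if __name__ == '__main__':
    with wsgiref.simple_server.make_server('', 8080, application) as httpd:
        httpd.serve_forever()
```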
Opening http://localhost:8000 in a browser, you should see something like:

After that I run python demo.py from (4) again and wait for it to finish. Then in Dozer I set "Floor" to 5000, and here's what I see:

Two types related to Celery grow as the subtasks are scheduled:

- celery.result.AsyncResult
- vine.promises.promise

weakref.WeakMethod has the same shape and numbers and must be caused by the same thing.

Finding root cause
At this point, from the leaking types and the trends, it may already be clear what's going on in your case. If it's not, Dozer has a "TRACE" link per type, which allows tracing the chosen object's referrers (gc.get_referrers) and referents (gc.get_referents), inspecting each object's attributes along the way, and continuing the traversal further through the graph.

But a picture says a thousand words, right? So I'll show how to use objgraph to render a chosen object's dependency graph (an optional objgraph sanity check follows this list):

- pip install objgraph
- apt-get install graphviz
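The optional sanity check mentioned above: once objgraph is importable inside the target process, it can cross-check what Dozer's chart shows, directly from the Pyrasite shell:

```python
# List the most numerous object types, and the types whose counts grew since
# the previous call to show_growth() (call it again later to see the delta).
import objgraph

objgraph.show_most_common_types(limit=10)
objgraph.show_growth(limit=10)
```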
Then:
- I run python demo.py from (4) again
- in Dozer I set floor=0, filter=AsyncResult
- and click "TRACE", which should yield:
Then in the Pyrasite shell run:
```python
import objgraph  # in the same pyrasite-shell session

objgraph.show_backrefs([objgraph.at(140254427663376)], filename='backref.png')
```
The PNG file should contain:
Basically there's some Context object containing a list called _children that in turn contains many instances of celery.result.AsyncResult, which leak. Changing the filter to Filter=celery.*context in Dozer, here's what I see:

So the culprit is celery.app.task.Context. Searching for that type would certainly lead you to the Celery task page.
Quickly searching for "children" there, here's what it says:

trail = True

If enabled the request will keep track of subtasks started by this task, and this information will be sent with the result (result.children).

So I disable the trail by setting trail=False:

```python
@app.task(trail=False)
def task():
    for i in range(10_000):
        subtask.delay()
        time.sleep(0.01)
```
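As an aside, the trail is exactly what populates result.children on the caller side, which is why the parent keeps an AsyncResult per scheduled subtask while trail=True. Roughly:

```python
# Illustration with the default trail=True: the parent result carries an
# AsyncResult for every subtask scheduled inside the task.
from demo import task

result = task.delay()
result.get()
print(len(result.children or []))  # on the order of 10_000 here
```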
Then restarting the Celery node from (3) and running python demo.py from (4) yet again shows this memory consumption:

Problem solved!
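As a final cross-check, the same Pyrasite shell can confirm that celery.result.AsyncResult no longer piles up inside the worker:

```python
# Count live objects per type in the worker process and print the top 10;
# after the trail=False fix, AsyncResult should no longer dominate the list.
import collections
import gc

counts = collections.Counter(type(o).__name__ for o in gc.get_objects())
print(counts.most_common(10))
```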