Debugging memory leaks

In Scrapy, objects such as Requests, Responses and Items have a finite lifetime: they are created, used for a while, and finally destroyed.

From all those objects, the Request is probably the one with the longest lifetime, as it stays waiting in the Scheduler queue until it’s time to process it. For more info see Architecture overview.

As these Scrapy objects have a (rather long) lifetime, there is always the risk of accumulating them in memory without releasing them properly and thus causing what is known as a “memory leak”.

To help debug memory leaks, Scrapy provides a built-in mechanism for tracking object references called trackref, and you can also use a third-party library called Guppy for more advanced memory debugging (see below for more info). Both mechanisms must be used from the Telnet Console.

Common causes of memory leaks

It happens quite often (sometimes by accident, sometimes on purpose) that the Scrapy developer passes objects referenced in Requests (for example, using the cb_kwargs or meta attributes or the request callback function), which effectively bounds the lifetime of those referenced objects to the lifetime of the Request. This is, by far, the most common cause of memory leaks in Scrapy projects, and a quite difficult one for newcomers to debug.
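A minimal sketch of this pattern (the spider and URLs below are purely illustrative; “A real example” later in this section walks through the same problem in more detail):

    import scrapy

    class LeakySpider(scrapy.Spider):
        # Hypothetical spider, used only to illustrate the leaky pattern.
        name = "leaky"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # Storing the whole Response in meta (or cb_kwargs) keeps it alive
            # for as long as this Request sits in the scheduler queue.
            yield scrapy.Request(
                "http://example.com/next",
                callback=self.parse_next,
                meta={"origin_response": response},
            )

        def parse_next(self, response):
            pass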

In big projects, the spiders are typically written by different people, and some of those spiders could be “leaking” and thus affecting the rest of the (well-written) spiders when they run concurrently, which, in turn, affects the whole crawling process.

The leak could also come from a custom middleware, pipeline or extension that you have written, if you are not releasing the (previously allocated) resources properly. For example, allocating resources on spider_opened but not releasing them on spider_closed may cause problems if you’re running multiple spiders per process.
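A rough sketch of the pairing described above (the extension name and its caches dict are hypothetical; the point is that whatever spider_opened allocates, spider_closed releases):

    from scrapy import signals

    class PerSpiderCacheExtension:
        # Hypothetical extension: keeps one cache per running spider.
        def __init__(self):
            self.caches = {}

        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            # Allocate per-spider resources.
            self.caches[spider.name] = {}

        def spider_closed(self, spider):
            # Release them; otherwise they pile up when running many spiders
            # in the same process.
            self.caches.pop(spider.name, None)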

Too Many Requests?

By default Scrapy keeps the request queue in memory; it includes Request objects and all objects referenced in Request attributes (e.g. in cb_kwargs and meta). While not necessarily a leak, this can take a lot of memory. Enabling a persistent job queue could help keep memory usage in control.
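For example, a minimal sketch of a settings.py entry that enables disk-based scheduler queues (the directory name is just illustrative):

    # settings.py
    # With JOBDIR set, pending requests are serialized to disk queues in this
    # directory instead of being kept in memory.
    JOBDIR = "crawls/somespider-1"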

Debugging memory leaks with trackref

trackref is a module provided by Scrapy to debug the most common cases of memory leaks. It basically tracks the references to all live Request, Response, Item and Selector objects.

You can enter the telnet console and inspect how many objects (of the classes mentioned above) are currently alive using the prefs() function, which is an alias to the print_live_refs() function:

    telnet localhost 6023

    >>> prefs()
    Live References

    ExampleSpider                       1   oldest: 15s ago
    HtmlResponse                       10   oldest: 1s ago
    Selector                            2   oldest: 0s ago
    FormRequest                       878   oldest: 7s ago

As you can see, that report also shows the “age” of the oldest object in each class. If you’re running multiple spiders per process, chances are you can figure out which spider is leaking by looking at the oldest request or response. You can get the oldest object of each class using the get_oldest() function (from the telnet console).

Which objects are tracked?

The objects tracked by trackref are all from these classes (and all their subclasses):

  • scrapy.Request
  • scrapy.http.Response
  • scrapy.Item
  • scrapy.Selector
  • scrapy.Spider

A real example

Let’s see a concrete example of a hypothetical case of memory leaks. Suppose we have some spider with a line similar to this one:

    return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
                   callback=self.parse, cb_kwargs={'referer': response})

That line is passing a response reference inside a request, which effectively ties the response lifetime to the request’s, and that would definitely cause memory leaks.

Let’s see how we can discover the cause (without knowing it a priori, of course) by using the trackref tool.

After the crawler has been running for a few minutes and we notice its memory usage has grown a lot, we can enter its telnet console and check the live references:

    >>> prefs()
    Live References

    SomenastySpider                     1   oldest: 15s ago
    HtmlResponse                     3890   oldest: 265s ago
    Selector                            2   oldest: 0s ago
    Request                          3878   oldest: 250s ago

The fact that there are so many live responses (and that they’re so old) is definitely suspicious, as responses should have a relatively short lifetime compared to Requests. The number of responses is similar to the number of requests, so it looks like they are tied in some way. We can now go and check the code of the spider to discover the nasty line that is generating the leaks (passing response references inside requests).
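One possible fix, sketched below under the assumption that only the referring URL is actually needed, is to pass that string instead of the whole Response object, so the response can be garbage-collected once its callback finishes:

    return Request("http://www.somenastyspider.com/product.php?pid=%d" % product_id,
                   callback=self.parse, cb_kwargs={'referer': response.url})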

Sometimes extra information about live objects can be helpful. Let’s check the oldest response:

    >>> from scrapy.utils.trackref import get_oldest
    >>> r = get_oldest('HtmlResponse')
    >>> r.url
    'http://www.somenastyspider.com/product.php?pid=123'

If you want to iterate over all objects, instead of getting the oldest one, you can use the scrapy.utils.trackref.iter_all() function:

    >>> from scrapy.utils.trackref import iter_all
    >>> [r.url for r in iter_all('HtmlResponse')]
    ['http://www.somenastyspider.com/product.php?pid=123',
     'http://www.somenastyspider.com/product.php?pid=584',
     ...]

Too many spiders?

If your project has too many spiders executed in parallel, the output of prefs() can be difficult to read. For this reason, that function has an ignore argument which can be used to ignore a particular class (and all its subclasses). For example, this won’t show any live references to spiders:

    >>> from scrapy.spiders import Spider
    >>> prefs(ignore=Spider)

scrapy.utils.trackref module

Here are the functions available in the trackref module.

  • class scrapy.utils.trackref.object_ref
    Inherit from this class (instead of object) if you want to track live instances with the trackref module (see the sketch after this list).

  • scrapy.utils.trackref.print_live_refs(class_name, ignore=NoneType)
    Print a report of live references, grouped by class name.
    Parameters: ignore (class or tuple of classes) – if given, all objects from the specified class (or tuple of classes) will be ignored.

  • scrapy.utils.trackref.get_oldest(class_name)
    Return the oldest object alive with the given class name, or None if none is found. Use print_live_refs() first to get a list of all tracked live objects per class name.

  • scrapy.utils.trackref.iter_all(class_name)
    Return an iterator over all objects alive with the given class name, or None if none is found. Use print_live_refs() first to get a list of all tracked live objects per class name.
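For example, a minimal sketch of tracking your own objects (the ProductRecord class is hypothetical; any class inheriting from object_ref shows up in the live-references report and can be queried by class name):

    from scrapy.utils.trackref import object_ref, get_oldest, iter_all

    class ProductRecord(object_ref):
        # Hypothetical helper class; inheriting from object_ref makes live
        # instances visible to the trackref module.
        def __init__(self, url):
            self.url = url

    records = [ProductRecord("http://example.com/item/%d" % i) for i in range(3)]

    oldest = get_oldest('ProductRecord')               # oldest instance still alive
    urls = [r.url for r in iter_all('ProductRecord')]  # all live instances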

Debugging memory leaks with Guppy

trackref provides a very convenient mechanism for tracking down memory leaks, but it only keeps track of the objects that are more likely to cause memory leaks (Requests, Responses, Items, and Selectors). However, there are other cases where the memory leaks could come from other (more or less obscure) objects. If this is your case, and you can’t find your leaks using trackref, you still have another resource: the Guppy library. If you’re using Python 3, see Debugging memory leaks with muppy.

If you use pip, you can install Guppy with the following command:

    pip install guppy

The telnet console also comes with a built-in shortcut (hpy) for accessing Guppy heap objects. Here’s an example to view all Python objects available in the heap using Guppy:

    >>> x = hpy.heap()
    >>> x.bytype
    Partition of a set of 297033 objects. Total size = 52587824 bytes.
     Index  Count   %     Size   % Cumulative  % Type
         0  22307   8 16423880  31  16423880  31 dict
         1 122285  41 12441544  24  28865424  55 str
         2  68346  23  5966696  11  34832120  66 tuple
         3    227   0  5836528  11  40668648  77 unicode
         4   2461   1  2222272   4  42890920  82 type
         5  16870   6  2024400   4  44915320  85 function
         6  13949   5  1673880   3  46589200  89 types.CodeType
         7  13422   5  1653104   3  48242304  92 list
         8   3735   1  1173680   2  49415984  94 _sre.SRE_Pattern
         9   1209   0   456936   1  49872920  95 scrapy.http.headers.Headers
    <1676 more rows. Type e.g. '_.more' to view.>

You can see that most space is used by dicts. Then, if you want to see from which attribute those dicts are referenced, you could do:

    >>> x.bytype[0].byvia
    Partition of a set of 22307 objects. Total size = 16423880 bytes.
     Index  Count   %     Size   % Cumulative  % Referred Via:
         0  10982  49  9416336  57   9416336  57 '.__dict__'
         1   1820   8  2681504  16  12097840  74 '.__dict__', '.func_globals'
         2   3097  14  1122904   7  13220744  80
         3    990   4   277200   2  13497944  82 "['cookies']"
         4    987   4   276360   2  13774304  84 "['cache']"
         5    985   4   275800   2  14050104  86 "['meta']"
         6    897   4   251160   2  14301264  87 '[2]'
         7      1   0   196888   1  14498152  88 "['moduleDict']", "['modules']"
         8    672   3   188160   1  14686312  89 "['cb_kwargs']"
         9     27   0   155016   1  14841328  90 '[1]'
    <333 more rows. Type e.g. '_.more' to view.>

As you can see, the Guppy module is very powerful but also requires some deep knowledge about Python internals. For more info about Guppy, refer to the Guppy documentation.

Debugging memory leaks with muppy

You can use muppy from Pympler.

If you use pip, you can install muppy with the following command:

    pip install Pympler

Here’s an example to view all Python objects available in the heap using muppy:

    >>> from pympler import muppy
    >>> all_objects = muppy.get_objects()
    >>> len(all_objects)
    28667
    >>> from pympler import summary
    >>> suml = summary.summarize(all_objects)
    >>> summary.print_(suml)
                                   types |   # objects |   total size
    ==================================== | =========== | ============
                             <class 'str |        9822 |      1.10 MB
                            <class 'dict |        1658 |    856.62 KB
                            <class 'type |         436 |    443.60 KB
                            <class 'code |        2974 |    419.56 KB
              <class '_io.BufferedWriter |           2 |    256.34 KB
                             <class 'set |         420 |    159.88 KB
              <class '_io.BufferedReader |           1 |    128.17 KB
              <class 'wrapper_descriptor |        1130 |     88.28 KB
                           <class 'tuple |        1304 |     86.57 KB
                         <class 'weakref |        1013 |     79.14 KB
      <class 'builtin_function_or_method |         958 |     67.36 KB
               <class 'method_descriptor |         865 |     60.82 KB
                     <class 'abc.ABCMeta |          62 |     59.96 KB
                            <class 'list |         446 |     58.52 KB
                             <class 'int |        1425 |     43.20 KB
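If you want to see what is growing over time rather than a single snapshot, Pympler’s tracker module can print the difference between consecutive summaries; a minimal sketch (call print_diff() periodically, e.g. from the telnet console):

    from pympler import tracker

    tr = tracker.SummaryTracker()
    # ... let the crawl run for a while ...
    tr.print_diff()  # shows only objects created/destroyed since the last call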

For more info about muppy, refer to the muppy documentation.

Leaks without leaks

Sometimes, you may notice that the memory usage of your Scrapy process will only increase, but never decrease. Unfortunately, this can happen even though neither Scrapy nor your project is leaking memory. This is due to a (not so well) known problem of Python, which may not return released memory to the operating system in some cases. For more information on this issue, see the paper by Evan Jones discussed below.

The improvements proposed by Evan Jones, which are detailed in this paper, got merged in Python 2.5, but this only reduces the problem; it doesn’t fix it completely. To quote the paper:

Unfortunately, this patch can only free an arena if there are no more objects allocated in it anymore. This means that fragmentation is a large issue. An application could have many megabytes of free memory, scattered throughout all the arenas, but it will be unable to free any of it. This is a problem experienced by all memory allocators. The only way to solve it is to move to a compacting garbage collector, which is able to move objects in memory. This would require significant changes to the Python interpreter.

To keep memory consumption reasonable you can split the job into several smaller jobs or enable a persistent job queue and stop/start the spider from time to time.