Tutorial

There is also automatically generated documentation.

Installing

What the profiler does

It can record information about all reachable objects at some point in time, and has functions to analyse that information.

It knows about all built-in types, but none that are defined in extension modules.
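To give a feel for what "recording information about reachable objects" means, here is a minimal sketch that uses only the standard library. It is not how the profiler itself works: the snapshot function is a hypothetical stand-in, it only sees objects tracked by the garbage collector, and sys.getsizeof reports shallow sizes only.

```python
import gc
import sys

def snapshot():
    """Return {id(obj): (type name, shallow size)} for GC-tracked objects.

    Hypothetical helper for illustration; not part of the profiler.
    """
    result = {}
    for obj in gc.get_objects():
        # The default of 0 covers objects that cannot report a size.
        result[id(obj)] = (type(obj).__name__, sys.getsizeof(obj, 0))
    return result

foo = {i: list(range(100)) for i in range(1000)}
snap = snapshot()
print(snap[id(foo)])
```

Like the profiler, this takes a snapshot at one point in time; nothing is recorded between calls.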

What the profiler doesn't do

Monitor memory continuously. You have to explicitly tell it to scan memory.
Anything friendly, unfortunately (yet). The interface is quite sparse, and some things are clunky.

A simple example

We'll use this code to go through a few of the features of the profiler:

>>> foo = {}
>>> for i in range(1000):
...   foo[i] = range(100)
...
>>> while True:
...   pass

It's a totally useless piece of code, but never mind. It can go in example.py.

Running the profiler

You need to tell the profiler when to scan memory for objects. A good place in this example is just before the while True loop, so insert the following there:

>>> import code
>>> from sizer import scanner
>>> objs = scanner.Objects()
>>> code.interact(local = {'objs': objs})

That will do the scanning and then bring up an interpreter console to play around in.

Looking at the collected information

Let's find out what the biggest single objects here are:

>>> from sizer import formatting
>>> formatting.printsizes(objs, count = 10)

which should give something like:

    Size    Total Object
   12188    12188 dict at -0x4830bf0c
    3716    15904 dict at -0x48312994
    3410    19314 str at 0x814f898
    2672    21986 dict at -0x482a21c4
    2641    24627 str at 0x8149808
    2514    27141 str at 0x8151280
    2429    29570 str at 0x8168528
    2163    31733 str at 0x81fdf28
    2048    33781 dict at -0x482a20fc
    1796    35577 dict at -0x4828cd7c

printsizes has three useful keyword parameters, sorted, threshold and count - the automatic documentation has details of them (at the link above).
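The layout of that report (a size column, a running total and a label) can be sketched in a few lines. The format_sizes function below is hypothetical, and the meanings it gives to count and threshold are assumptions based on how they are used in this tutorial, not the real printsizes signature.

```python
def format_sizes(sizes, count=None, threshold=0):
    """Build report lines from a {label: size in bytes} dictionary.

    Hypothetical helper; count and threshold semantics are assumed.
    """
    # Biggest entries first, keeping only those above the threshold.
    rows = sorted(sizes.items(), key=lambda item: item[1], reverse=True)
    rows = [(label, size) for label, size in rows if size > threshold]
    if count is not None:
        rows = rows[:count]

    lines = ["%8s %8s %s" % ("Size", "Total", "Object")]
    total = 0
    for label, size in rows:
        total += size  # running total of everything printed so far
        lines.append("%8d %8d %s" % (size, total, label))
    return lines

report = format_sizes(
    {"dict at A": 12188, "dict at B": 3716, "str at C": 3410}, count=2)
print("\n".join(report))
```

The "Total" column is cumulative, which is why it keeps growing down the list.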

You can look at individual objects from this list, by looking them up in objs, which will give you a wrapper for the real object.

For example, objs[-0x4830bf0c] is a wrapper for the dictionary object at the top of the list.

>>> w = objs[-0x4830bf0c]

Now we can look at what object that represents:

>>> w.obj

which prints out a very long dictionary (which is in fact example.foo).

The wrapper for an object obj will be in objs[id(obj)]. For example, objs[id(sys)] is the wrapper for the sys module.

Other interesting fields of wrappers are:

w.size - the number of bytes of memory used by the object, in this case 12188.
w.type - the class of the object.
w.children - a list of wrappers to objects which this object references.

Dictionary wrappers, such as w, have a couple of extra fields, w.keys and w.values, which are lists of wrappers of the dictionary's keys and values, respectively.

The wrappers are of type wrapper.ObjectWrapper.
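To make the shape of a wrapper concrete, here is a rough sketch with the same field names (obj, size, type, children). SimpleWrapper and wrap_reachable are hypothetical and are not the profiler's wrapper.ObjectWrapper; children are taken from whatever gc.get_referents reports, which may differ from what the profiler records.

```python
import gc
import sys

class SimpleWrapper(object):
    """Hypothetical stand-in for a wrapper; not the profiler's class."""
    def __init__(self, obj):
        self.obj = obj                     # the real object
        self.size = sys.getsizeof(obj, 0)  # shallow size in bytes
        self.type = type(obj)              # the class of the object
        self.children = []                 # filled in by wrap_reachable

def wrap_reachable(root):
    """Wrap root and everything reachable from it, keyed by id()."""
    wrappers = {}
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in wrappers:
            continue
        wrappers[id(obj)] = SimpleWrapper(obj)
        stack.extend(gc.get_referents(obj))
    # Second pass: link each wrapper to the wrappers of its referents.
    for w in wrappers.values():
        w.children = [wrappers[id(c)] for c in gc.get_referents(w.obj)]
    return wrappers

data = {"a": [1, 2], "b": [3]}
wrappers = wrap_reachable(data)
w = wrappers[id(data)]
print(w.type, w.size, len(w.children))
```

As with objs, the wrapper for any object is found under its id().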

Adding parent information

At the moment, it's hard to see where a given object can be found. We can fix that like so:

>>> from sizer import annotate
>>> annotate.markparents(objs)

Now each wrapper will have a parents field:

>>> w.parents
[wrap dict at -0x481f14e4]
>>> w.parents[0].parents
[wrap module __main__, wrap frame at 0x8193e9c]

which shows that the dictionary is contained in some other dictionary (which happens to be __main__.__dict__), which is in turn contained in both __main__ and a frame object.
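Parent information is just the child references turned around. A minimal sketch, assuming the references are available as a plain dictionary of names (the mark_parents function and the graph below are made up for illustration and are not the annotate module's code):

```python
def mark_parents(children):
    """Invert {node: [children]} into {node: [parents]}.

    Hypothetical helper for illustration only.
    """
    parents = {node: [] for node in children}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(node)
    return parents

# A toy version of the reference chain shown above.
graph = {
    "module __main__": ["__main__.__dict__"],
    "frame": ["__main__.__dict__"],
    "__main__.__dict__": ["foo"],
    "foo": [],
}
parents = mark_parents(graph)
print(parents["foo"])
print(sorted(parents["__main__.__dict__"]))
```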

Filtering out useless junk

You might not be interested in some objects. For example, many of the biggest objects here are docstrings. You can use the operations module to remove any objects you like:

>>> from sizer import operations
>>> nostr = operations.fix(operations.filterouttype(objs, str))

Now nostr will contain everything in objs that isn't a string. fix is needed because filterouttype doesn't repair the structure of the wrappers after removing some of them, so otherwise some functions (particularly those in module annotate) will fail.

Anything that can be done on objs can now be done on nostr.

There are other filtering operations: operations.filtersize, operations.filtertype and operations.filterd. filterd is the most general.

Again, you must use operations.fix on the returned dictionaries before you call any function from annotate on them.
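Here is one way to picture why a repair step is needed after filtering. This sketch assumes that "repairing the structure" means dropping references to entries that were filtered out; that is a guess at the idea, not a description of what operations.fix really does, and both functions are hypothetical.

```python
def filter_out_type(nodes, unwanted):
    """Drop every entry of the unwanted type (hypothetical helper)."""
    return {key: node for key, node in nodes.items()
            if node["type"] is not unwanted}

def repair(nodes):
    """Remove child references that point at entries no longer present."""
    return {
        key: dict(node, children=[c for c in node["children"] if c in nodes])
        for key, node in nodes.items()
    }

nodes = {
    1: {"type": dict, "children": [2, 3]},
    2: {"type": str, "children": []},
    3: {"type": list, "children": []},
}
filtered = filter_out_type(nodes, str)
dangling = [c for c in filtered[1]["children"] if c not in filtered]
repaired = repair(filtered)
print(dangling, repaired[1]["children"])
```

After filtering alone, entry 1 still points at the removed string, which is the kind of inconsistency that trips up later analysis.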

Sorting and grouping objects

You can sort objects by size using operations.sorted, operations.toplist, operations.top and operations.pos.

You can also find the total memory use for each type, using (for example):

>>> formatting.printsizesop(operations.bytype(objs), threshold = 1000)

Note: you must use printsizesop rather than printsizes here, since bytype doesn't return a dictionary of wrappers.

In this case this prints out:

    Size    Total Object
  423860   423860 <type 'list'>
  216657   640517 <type 'str'>
  130592   771109 <type 'dict'>
   41160   812269 <type 'type'>
   33548   845817 <type 'tuple'>
   19456   865273 <type 'code'>
   14088   879361 <type 'int'>
   13332   892693 <type 'function'>
    4760   897453 <type 'builtin_function_or_method'>
    2080   899533 <type 'classobj'>
    1680   901213 <type 'frame'>

which is a list of all types taking up more than 1000 bytes of memory.
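Totalling by type is a simple accumulation. The sketch below does it with the standard library on a small list of objects; total_by_type is a hypothetical name and the sizes are shallow sys.getsizeof values, so the numbers will not match the profiler's.

```python
import sys
from collections import defaultdict

def total_by_type(objects):
    """Return {type: total shallow size} (hypothetical helper)."""
    totals = defaultdict(int)
    for obj in objects:
        totals[type(obj)] += sys.getsizeof(obj)
    return dict(totals)

sample = [[1, 2, 3], [], "abc", {"k": 1}]
totals = total_by_type(sample)
for kind, size in sorted(totals.items(), key=lambda item: -item[1]):
    print("%8d %s" % (size, kind))
```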

Incidentally, let's find out if most space is taken up by a single big list or lots of smaller ones:

# Get wrappers for only list objects.
>>> lists = operations.filtertype(objs, list)

# Sort and print those lists by size.
>>> formatting.printsizesop(operations.bysize(lists))

This prints:

    Size    Total Object
  420000   420000 420
    1100   421100 1100
     896   421996 896
     756   422752 756
     192   422944 32
     168   423112 24
     164   423276 164
     116   423392 116
      80   423472 40
      76   423548 76
      72   423620 72
      72   423692 36
      60   423752 20
      56   423808 28
      52   423860 52

Although it's not clear, the "Object" column gives the size of each single object, and the "Size" column gives the total of all objects of that size. So you can see that almost all the space used up by lists is used up by lists of 420 bytes. We can get a dictionary containing only 420-byte lists with operations.filtersize(lists, 420).
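The relationship between the two columns can be reproduced with a counter: each row is (size of one object) times (number of objects of that size). A small sketch, with total_by_size as a hypothetical name and object counts inferred from three rows of the table above:

```python
from collections import Counter

def total_by_size(sizes):
    """Return (total bytes, size of one object) rows, biggest total first."""
    counts = Counter(sizes)
    return sorted(((size * n, size) for size, n in counts.items()),
                  reverse=True)

# 1000 lists of 420 bytes, one of 1100 bytes and six of 32 bytes.
sizes = [420] * 1000 + [1100] + [32] * 6
rows = total_by_size(sizes)
for total, single in rows:
    print("%8d %d" % (total, single))
```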

Miscellaneous

Another function is operations.diff, which takes two sets of wrappers and returns a set containing the wrappers found in the second one but not the first one.

For example, if you run the scanner at two different points in time, you can pass the results of the two scans to diff to get the new objects from the second scan.

Note: at the moment, you can't use most functions of annotate on the set that diff returns. Running operations.fix will result in most things in the set being removed.
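The idea behind diff can be sketched with two dictionaries keyed by id, as objs is. This is an illustration only, with made-up scan contents, and not the code of operations.diff.

```python
def diff(first, second):
    """Return the entries of second whose keys are not in first."""
    return {key: value for key, value in second.items() if key not in first}

scan1 = {1: "list", 2: "dict"}
scan2 = {1: "list", 2: "dict", 3: "str", 4: "tuple"}
new = diff(scan1, scan2)
print(sorted(new.items()))
```

One caveat with any id-based comparison in CPython: an id can be reused once its object has been freed, so a new object may occasionally look like an old one.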

Collecting results by the structure of code

You can use the annotate.findcreators function to find out which functions created the largest objects and the most objects.

Note: You must have a patched Python in order to use this function - it will not work at all without it.

We'll use this example, which can go in creatorex.py:

def a():
  for i in range(100):
    c()

def b():
  for i in range(200):
    c()

def c():
  global keep
  keep.append(range(1000))

keep = []
a()
b()

Each call to a() will make 100 lists, and each call to b() will make 200 lists, through a call to c(). They're appended to keep so that there's still a reference to them.

Now, let's get a set of objects scanned and make the profiler put some creation information together:

import creatorex
from sizer import scanner, annotate, formatting
objs = scanner.Objects()
creators = annotate.findcreators(objs)

Now we can print out which lines of code created the most objects:

>>> formatting.printsizes(creators.back, count=9)
    Size    Total Object
 5527200  5527200 creatorex.py:11
   86744  5613944 <interpreter>:0
   41802  5655746 /usr/local/lib/python2.5/linecache.py:101
   34156  5689902 /usr/local/lib/python2.5/encodings/__init__.py:30
   24222  5714124 /usr/local/lib/python2.5/os.py:44
   18756  5732880 /usr/local/lib/python2.5/site.py:61
   15432  5748312 /usr/local/lib/python2.5/site-packages/sizer/scanner.py:38
    7881  5767444 /usr/local/lib/python2.5/os.py:49
    7613  5775057 /usr/local/lib/python2.5/site-packages/sizer/annotate.py:12

At the top you can see line 11 of creatorex.py, which is keep.append(range(1000)). So that line is creating most objects (no surprise there).

Also, <interpreter> is not a real line of code. It represents all the objects that were created when no Python code was running (for example, objects created very early on).

Now, we can find out what functions called c() in order to make those objects. The back field has as its keys (calling file, calling line) tuples:

>>> fromc = creators.back[("creatorex.py", 11)]
>>> formatting.printsizes(fromc.back)
    Size    Total Object
 3684800  3684800 creatorex.py:7
 1842400  5527200 creatorex.py:3
       0  5527200 creatorex.py:11

So b is shown as making most objects.

Now let's look at the lines which called a(), which called c(), which made objects, by going back from fromc:

>>> froma = fromc.back[("creatorex.py", 3)]
>>> formatting.printsizes(froma.back)
    Size    Total Object
 1842400  1842400 creatorex.py:14
       0  1842400 creatorex.py:3

So these were made from the a() call, as you'd expect.

The second line here gives the objects created by a() when it was at the bottom of the stack - i.e. no other function had called it. In this case, there's nothing.

Unfortunately, you can't reference that as froma.back[("creatorex.py", 3)], like any other line, at the moment. You have to use froma.back[None] instead.

As it happens, this problem is also the reason for the dummy <interpreter> file. I'll try to fix it soon.

There are a couple of extra fields in things like creators: size, which gives the total size of objects created by the functions, and members, which gives the objects themselves.

The interface for this might change - it hasn't been here for long.
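The nested back structure is easier to follow with a toy model. The sketch below builds a similar tree from a list of (size, call stack) records, innermost line first. CreatorNode and build_creators are hypothetical, the per-object size of 18424 bytes is simply the total above divided by 300, and the real findcreators gets its information from the patched interpreter rather than from a list like this.

```python
class CreatorNode(object):
    """Hypothetical node with the size and back fields described above."""
    def __init__(self):
        self.size = 0   # total bytes created through this call path
        self.back = {}  # (file, line) of the caller -> CreatorNode

def build_creators(allocations):
    """Build the tree from (size, [innermost frame, ..., outermost]) records."""
    root = CreatorNode()
    for size, stack in allocations:
        node = root
        node.size += size
        for frame in stack:
            node = node.back.setdefault(frame, CreatorNode())
            node.size += size
    return root

allocations = (
    # 100 lists made by c() called from a(), which was called on line 14.
    [(18424, [("creatorex.py", 11), ("creatorex.py", 3),
              ("creatorex.py", 14)])] * 100
    # 200 lists made by c() called from b(), which was called on line 15.
    + [(18424, [("creatorex.py", 11), ("creatorex.py", 7),
                ("creatorex.py", 15)])] * 200
)
creators = build_creators(allocations)
fromc = creators.back[("creatorex.py", 11)]
froma = fromc.back[("creatorex.py", 3)]
print(fromc.size, froma.size)
```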

Collecting results by the structure of data

Looking at single objects is not that useful. Often the individual objects are small and most space is used up by large collections of them.

For this reason there is a function annotate.groupby, which collects objects into groups.

It takes a set of wrappers, and a subset of the wrappers, as a list, dictionary or set, to use as heads of groups. It then assigns each object in the wrappers to a group and returns a set of group wrappers.

You can use the set of wrappers it returns as a normal set of object wrappers. Everything above can be done to it (if something can't, it's a bug) - the groups behave as if they were single objects.
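A very simplified picture of grouping: start from each head, follow references, and give every object that is not a head and not already claimed to that head's group. This is an assumption about the general idea only; the group_by function, the names and the sizes below are made up, and the real annotate.groupby may assign objects quite differently.

```python
def group_by(children, sizes, heads):
    """Assign each node to a head and total the sizes per head.

    Hypothetical sketch: the first head to reach a node claims it.
    """
    owner = {head: head for head in heads}
    for head in heads:
        stack = list(children.get(head, []))
        while stack:
            node = stack.pop()
            if node in owner:
                continue  # another head, or already claimed
            owner[node] = head
            stack.extend(children.get(node, []))
    totals = {head: 0 for head in heads}
    for node, head in owner.items():
        totals[head] += sizes[node]
    return owner, totals

children = {
    "instance": ["instance dict", "class A"],
    "instance dict": ["list"],
    "list": [],
    "class A": [],
}
sizes = {"instance": 16, "instance dict": 124, "list": 40, "class A": 452}
owner, totals = group_by(children, sizes, ["instance", "class A"])
print(totals["instance"])
```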

Note: For best results, your copy of Python should be patched. You will still get results if it's not, but they might not be as accurate.

There is also a function annotate.simplegroupby, which groups objects into modules, threads and class instances, depending on three keyword parameters given to it. It works by finding all modules, threads and classes in the objects given to it and passing those as the set of objects to group by.

Here's an example piece of code:

>>> class A(object):
...   def __init__(self):
...     self.list = [1,2,3,4,5]
...
>>> instances = [ A() for i in range(100) ]

Before calling groupby, the A instances and their lists are treated separately:

>>> objs = scanner.Objects()
>>> a = objs[id(instances[0])]
>>> a
wrap A at -0x4871a034

# The size of the object is apparently 16 bytes
>>> a.size
16

# The first thing here is the instance's __dict__
>>> a.children
(wrap {'list': list at -0x48678c64}, wrap type at 0x81c8e44)

# The second thing here is the list itself
>>> a.children[0].children
(wrap 'list', wrap list at -0x48678c64, wrap None, wrap None)

>>> a.children[0].children[1].obj
[1, 2, 3, 4, 5]
>>> a.children[0].children[1].size
40

Now we'll group the objects into instances:

>>> groups = annotate.simplegroupby(objs, classes=True)

The size is more useful now:

>>> ag = groups[id(a.obj)]
>>> ag
wrap A at -0x4871a034
>>> ag.size
180

A bit of an increase! The members field will tell you which objects were put into this group:

>>> import pprint
>>> pprint.pprint(ag.members)
{-1211011092: wrap list at -0x48678c64,
 -1211002292: wrap {'list': list at -0x48678c64},
 -1210993780: wrap A at -0x4871a034}

There's the instance itself, the list and a dictionary. In fact, the dictionary is taking up most of the space:

>>> d = ag.members[-1211002292]
>>> d
wrap {'list': list at -0x48678c64}
>>> d.size
124

In this case, you could save space (if you were creating lots of instances of A) by using slots.
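As a quick illustration of the slots suggestion, the sketch below defines the same class with and without __slots__. The class names are made up for the comparison. A class that declares __slots__ (and whose base classes add no __dict__) has no per-instance dictionary, which is where the saving comes from.

```python
class WithDict(object):
    def __init__(self):
        self.list = [1, 2, 3, 4, 5]

class WithSlots(object):
    # Only the named attributes are allowed; no __dict__ is created.
    __slots__ = ("list",)

    def __init__(self):
        self.list = [1, 2, 3, 4, 5]

plain = WithDict()
slotted = WithSlots()
plain_has_dict = hasattr(plain, "__dict__")
slotted_has_dict = hasattr(slotted, "__dict__")
print(plain_has_dict, slotted_has_dict)
```

The trade-off is that instances of the slotted class cannot be given attributes that are not listed in __slots__.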

Using simplegroupby with modules = True is a good way of finding out approximately how much memory each module is using.

For example:

>>> import xmlrpclib # Just to make the list of modules a bit more interesting
>>> objs = scanner.Objects()
>>> mods = annotate.simplegroupby(objs, modules = True)
>>> formatting.printsizes(mods)
    Size    Total Object
   69184    69184 module sizer.sizes
   44315   113499 module linecache
   44190   157689 module bisect
   43949   201638 module xmlrpclib
   38795   240433 module sizer.scanner
   25844   266277 module codecs
   20927   287204 module copy
   20878   308082 module encodings
   17474   325556 module base64
   12185   337741 module sys

Unfortunately, the global functions of the profiler are counted here. I'll fix this soon. The results of scanning, grouping etc. are not counted, since the scanner ignores these.

Making graphs

If you have pydot installed, you can make graphs of objects (the kind with nodes and edges, not the y-against-x kind). With a large number of objects, this will take a long time and just produce a mess, so it's sensible to group the objects first.

For example, on the code above, run:

from sizer import graph
graph.makegraph(mods, count = 15, proportional = True)

This will return the name of a PostScript file containing a graph of the biggest 15 modules (count = 15), with a node to represent each module. References between modules will be given as an edge, with the area of each node proportional to the size of the module (proportional = True).