Python descriptors made simple

Descriptors, introduced in Python 2.2, provide a way to add managed attributes to objects. They are not used much in everyday programming, but it’s important to learn them to understand a lot of the “magic” that happens in the standard library and third-party packages.

The problem

Imagine we are running a bookshop with an inventory management system written in Python. The system contains a class called Book that captures the author, title and price of physical books.

Our simple Book class works fine for a while, but eventually bad data starts to creep into the system. The system is full of books with negative prices or prices that are too high because of data entry errors. We decide that we want to limit book prices to values between 0 and 100. In addition, the system contains a Magazine class that suffers from the same problem, so we want our solution to be easily reusable.

The descriptor protocol

The descriptor protocol is simply a set of methods a class must implement to qualify as a descriptor. There are three of them:

  • __get__(self, instance, owner)
  • __set__(self, instance, value)
  • __delete__(self, instance)

__get__ accesses a value stored in the object and returns it.

__set__ sets a value stored in the object and returns nothing.

__delete__ deletes a value stored in the object and returns nothing.

Using these methods, we can write a descriptor called Price that limits the value stored in it to between 0 and 100.
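The listing itself is not reproduced here, so what follows is only a minimal sketch of such a descriptor. The name self.values and the use of WeakKeyDictionary come from the discussion below; the error message and the default of 0 for an unset price are my own assumptions.

```python
from weakref import WeakKeyDictionary


class Price:
    """Descriptor that only accepts values between 0 and 100."""

    def __init__(self):
        # Maps each owning instance to its own price. Weak keys mean an
        # entry disappears once the instance is no longer referenced.
        self.values = WeakKeyDictionary()

    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class itself rather than on an instance.
            return self
        # Assumption: a price that was never set reads as 0.
        return self.values.get(instance, 0)

    def __set__(self, instance, value):
        if value < 0 or value > 100:
            raise ValueError("Price must be between 0 and 100.")
        self.values[instance] = value

    def __delete__(self, instance):
        del self.values[instance]
```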

A few details in the implementation of Price deserve mentioning.

An instance of a descriptor must be added to a class as a class attribute, not as an instance attribute. Therefore, to store different data for each instance, the descriptor needs to maintain a dictionary that maps instances to instance-specific values. In the implementation of Price, that dictionary is self.values.

A normal Python dictionary stores references to objects it uses as keys. Those references by themselves are enough to prevent the object from being garbage collected. To prevent Book instances from hanging around after we are finished with them, we use the WeakKeyDictionary from the weakref standard module. Once the last strong reference to the instance passes away, the associated key-value pair will be discarded.

Using descriptors

As we saw in the last section, descriptors are linked to classes, not to instances, so to add a descriptor to the Book class, we must add it as a class variable.

The price constraint for books is now enforced.
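As a rough illustration, here is how a Book class might attach the descriptor. A compact Price is repeated so the snippet runs on its own; the constructor arguments and the sample data are assumptions, not the original listing.

```python
from weakref import WeakKeyDictionary


class Price:
    """Compact version of the descriptor sketched earlier."""

    def __init__(self):
        self.values = WeakKeyDictionary()

    def __get__(self, instance, owner=None):
        return self if instance is None else self.values.get(instance, 0)

    def __set__(self, instance, value):
        if not 0 <= value <= 100:
            raise ValueError("Price must be between 0 and 100.")
        self.values[instance] = value


class Book:
    # The descriptor is a class attribute: one Price object for all books.
    price = Price()

    def __init__(self, author, title, price):
        self.author = author
        self.title = title
        self.price = price  # routed through Price.__set__


b = Book("A. Author", "Some Title", 12)
print(b.price)  # 12

try:
    Book("A. Author", "Overpriced Title", 101)
except ValueError as e:
    print(e)  # Price must be between 0 and 100.
```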

How descriptors are accessed

So far we’ve managed to implement a working descriptor that manages the price attribute on our Book class, but how it works might not be clear. It all feels a bit too magical, but not to worry. It turns out that descriptor access is quite simple:

  • When we try to evaluate b.price and retrieve the value, Python recognizes that price is a descriptor and calls Book.price.__get__.
  • When we try to change the value of the price attribute, e.g. b.price = 23, Python again recognizes that price is a descriptor and substitutes the assignment with a call to Book.price.__set__.
  • And when we try to delete the price attribute stored against an instance of Book, Python automatically interprets that as a call to Book.price.__delete__.
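To make those three substitutions concrete, here is a sketch with a toy descriptor that records each call. Traced is a hypothetical helper, not part of the bookshop code. The descriptor object is fetched from Book.__dict__ so that we get the object itself without triggering __get__.

```python
class Traced:
    """Toy descriptor that logs every protocol call it receives."""

    def __init__(self):
        self.calls = []

    def __get__(self, instance, owner=None):
        self.calls.append("get")
        return 42

    def __set__(self, instance, value):
        self.calls.append(("set", value))

    def __delete__(self, instance):
        self.calls.append("delete")


class Book:
    price = Traced()


b = Book()
descriptor = Book.__dict__["price"]  # the raw descriptor object

# Ordinary attribute syntax...
b.price
b.price = 23
del b.price

# ...and roughly what Python does on our behalf.
descriptor.__get__(b, Book)
descriptor.__set__(b, 23)
descriptor.__delete__(b)

print(descriptor.calls)
```

Both halves of the script leave the same three entries in descriptor.calls.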

The number 1 descriptor gotcha

Unless we fully understand the fact that descriptors are linked to classes and not to instances, and therefore need to maintain their own mapping of instances to instance-specific values, we might be tempted to write the Price descriptor as follows:

But once we start instantiating multiple Book instances, we’re going to have a problem.

The key is to understand that there is only one instance of Price for Book, so every time the value in the descriptor is changed, it changes for all instances. That behaviour in itself is useful for creating managed class attributes, but it is not what we want in this case. To store separate instance-specific values, we need to use the WeakKeyDictionary.
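Here is a sketch of what the flawed version and its symptom might look like. SharedPrice and the sample titles are names I made up for illustration.

```python
class SharedPrice:
    """Flawed descriptor: stores ONE value on the descriptor itself."""

    def __init__(self):
        self.value = 0

    def __get__(self, instance, owner=None):
        return self.value

    def __set__(self, instance, value):
        if not 0 <= value <= 100:
            raise ValueError("Price must be between 0 and 100.")
        self.value = value  # overwrites the value for every Book


class Book:
    price = SharedPrice()

    def __init__(self, title, price):
        self.title = title
        self.price = price


b1 = Book("First Title", 12)
b2 = Book("Second Title", 13)
print(b1.price)  # 13, because creating b2 overwrote the shared value
```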

The property built-in function

Another way of building descriptors is to use the property built-in function. Its signature is property(fget=None, fset=None, fdel=None, doc=None).

fget, fset and fdel are methods to get, set and delete attributes, respectively. doc is a docstring.

Instead of defining a single class-level descriptor object that manages instance-specific values, property works by combining instance methods from the class. Here is a simple example of a Publisher class from our inventory system with a managed name property. Each method passed into property has a print statement to illustrate when it is called.

If we make an instance of Publisher and access the name attribute, we can see the appropriate methods being called.
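The Publisher listing is missing here, so this is only a sketch of the kind of class being described. The method names, the _name storage attribute and the sample publisher names are assumptions.

```python
class Publisher:
    def __init__(self, name):
        self._name = name

    def get_name(self):
        print("getting name")
        return self._name

    def set_name(self, value):
        print("setting name")
        self._name = value

    def delete_name(self):
        print("deleting name")
        del self._name

    # Combine the three methods into one managed attribute.
    name = property(get_name, set_name, delete_name, "Publisher name")


publisher = Publisher("Hypothetical Press")
publisher.name                    # prints "getting name"
publisher.name = "Another Press"  # prints "setting name"
del publisher.name                # prints "deleting name"
```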

That’s it for this basic introduction to descriptors. If you want a challenge, take what you have learned and try to reimplement the @property decorator. There is enough information in this post to allow you to figure it out.

A quick guide to nonlocal in Python 3

Python 3 introduced the nonlocal keyword that allows you to assign to variables in an outer, but non-global, scope. An example will illustrate what I mean.

msg is declared in the outside function and assigned the value "Outside!". Then, in the inside function, the value "Inside!" is assigned to it. When we run outside, msg has the value "Inside!" in the inside function, but retains the old value in the outside function.

We see this behaviour because Python hasn’t actually assigned to the existing msg variable, but has created a new variable called msg in the local scope of inside that shadows the name of the variable in the outer scope.
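The shadowing described above can be reproduced with a sketch like this. I return the two values instead of printing them inside each function, purely to make the result easy to inspect.

```python
def outside():
    msg = "Outside!"

    def inside():
        msg = "Inside!"  # creates a NEW local variable called msg
        return msg

    inner_value = inside()
    return inner_value, msg


print(outside())  # ('Inside!', 'Outside!')
```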

Preventing that behaviour is where the nonlocal keyword comes in.

Now, by adding nonlocal msg to the top of inside, Python knows that when it sees an assignment to msg, it should assign to the variable from the outer scope instead of declaring a new variable that shadows its name.
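A sketch of the same pair of functions with the nonlocal declaration added (again returning the values so they can be inspected):

```python
def outside():
    msg = "Outside!"

    def inside():
        nonlocal msg     # refer to the msg that belongs to outside
        msg = "Inside!"  # rebinds the outer variable
        return msg

    inner_value = inside()
    return inner_value, msg


print(outside())  # ('Inside!', 'Inside!')
```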

The usage of nonlocal is very similar to that of global, except that the former is used for variables in outer function scopes and the latter is used for variables in the global scope.

Some confusion might arise about when nonlocal should be used. Take the following function, for instance.

It would be reasonable to expect that without using nonlocal the insertion of the "inside": 2 key-value pair in the dictionary would not be reflected in outside. Reasonable, but incorrect, because the dictionary insertion is not an assignment, but a method call. In fact, inserting a key-value pair into a dictionary is equivalent to calling the __setitem__ method on the dictionary object.
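A minimal sketch of that situation, assuming a dictionary called d that starts with a single "outside" key:

```python
def outside():
    d = {"outside": 1}

    def inside():
        # Not an assignment to the name d: this is a method call,
        # equivalent to d.__setitem__("inside", 2).
        d["inside"] = 2

    inside()
    return d


print(outside())  # {'outside': 1, 'inside': 2}
```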

I will leave it there for now. If you want to learn more about the nonlocal keyword, check out PEP 3104.

The two ways to sort a list in Python

Today I’m going to take a look at another element of the language that tends to trip up Python beginners – the difference between sorted(my_list) and my_list.sort().

The built-in function sorted sorts the list that is passed into it, and returns a new list while preserving the old one.

On the other hand, the sort method on list objects sorts the list in place, destroying the original ordering.
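A short sketch of the difference, using a made-up three-element list:

```python
my_list = [3, 1, 2]

# sorted returns a new list and leaves the original untouched.
new_list = sorted(my_list)
print(new_list)  # [1, 2, 3]
print(my_list)   # [3, 1, 2]

# list.sort reorders the list in place and returns None.
result = my_list.sort()
print(result)    # None
print(my_list)   # [1, 2, 3]
```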

Using a list’s sort method is equivalent to assigning the output of sorted back to the original list.

However, that particular way of doing things is frowned upon. Only use sorted when you need to keep the original list intact.

sorted and list.sort both accept the key and reverse parameters. The cmp parameter, which allowed you to pass in a custom comparator function, has been removed in Python 3. key should be used instead.
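For illustration, here is how key and reverse might be used on a made-up list of words. If you still have an old-style comparator lying around, functools.cmp_to_key from the standard library can wrap it; compare_length is a hypothetical comparator.

```python
from functools import cmp_to_key

words = ["kiwi", "Cherry", "apple"]

# key: sort case-insensitively.
print(sorted(words, key=str.lower))          # ['apple', 'Cherry', 'kiwi']

# key and reverse together: longest word first.
print(sorted(words, key=len, reverse=True))  # ['Cherry', 'apple', 'kiwi']


def compare_length(a, b):
    """Old-style comparator: negative, zero or positive."""
    return len(a) - len(b)


# cmp_to_key adapts a comparator for use as a key function.
print(sorted(words, key=cmp_to_key(compare_length)))  # ['kiwi', 'apple', 'Cherry']
```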

The difference between range and xrange in Python

Today I’m going to take a look at another difference between Python 2 and 3 that can trip up people making the switch. Python 2 used to have two functions that could be used to iterate a certain number of times in for loops, range and xrange. In Python 3, there is no xrange, but the range function behaves like xrange in Python 2.

The way things were

You probably remember that in Python 2 you could generate indexes in for loops in two ways:

The difference between these two built in functions is not immediately obvious when used in this way. Let’s take a look at the output of each function in the interactive interpreter.

As you can see, range returns a normal list, but xrange returns an xrange object. An xrange object is similar to a generator: it produces the necessary index on demand instead of producing the entire list up front. Therefore it can be slightly faster and more memory efficient. According to the Python 2 documentation, the xrange type offers the following guarantee:

The advantage of the xrange type is that an xrange object will always take the same amount of memory, no matter the size of the range it represents.

xrange removed in Python 3

In Python 3, xrange has been removed and the only option for generating iterable sequences of consecutive numbers is range. Actually, it is more correct to say that the Python 2 range function has been removed and xrange has been renamed to range.

For the most part, this change is easy to handle: just use range when you would have used either range or xrange in Python 2. The only place you might be tripped up is if you actually need the list that range used to return. Luckily, all you have to do in that case is pass the Python 3 range object to the list built-in function.
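In Python 3 that conversion looks something like this (the variable name is just for illustration):

```python
indexes = range(5)

print(indexes)        # range(0, 5)
print(list(indexes))  # [0, 1, 2, 3, 4]
```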

Pythonic iteration

Before I finish, I’ll just mention a way to make your code more Pythonic. Quite frequently, when people want to use any of the range functions, it is because they want a way to index another sequence type, e.g.

Sometimes people even declare a counter variable outside the loop, just so they have an index.

There is no need to do either of these things. In particular, that range(len(seq)) idiom is one of the classic markers of amateur Python code. What you really need is the enumerate function, which automatically generates an index for whatever sequence you are iterating over.

Ta-da! Once you start using enumerate, you’ll never go back.
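A quick sketch of both styles side by side, using a made-up sequence:

```python
seq = ["a", "b", "c"]

# The idiom to avoid: indexing via range(len(seq)).
for i in range(len(seq)):
    print(i, seq[i])

# The same loop with enumerate, which yields (index, item) pairs.
indexed = []
for i, item in enumerate(seq):
    print(i, item)
    indexed.append((i, item))
```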

Private methods and attributes in Python

Unlike Java, which enforces access restrictions on methods and attributes, Python takes the view that we are all adults and should be allowed to use the code as we see fit. Nevertheless, the language provides a few facilities to indicate which methods and attributes are public and which are private, and some ways to dissuade people from accessing and using private things.

Normal attribute access

Let’s take a look at how normal attribute access works.

As we can see, there are no restrictions on accessing or assigning to the bar attribute of our instance. The attribute is also included in __dict__.

Making it private

Now let’s make bar “private”. We can do that by adding two leading underscores to the name.

What has happened here is that the name of __bar has been changed by the interpreter so that it is not easily accessible outside the class. If we take a look at __dict__ again, we will see that it has been renamed to _Foo__bar, and can be accessed and assigned using that name.

This is called “name mangling”. Attributes whose names start with two underscores are renamed in the format _classname__attrname.

We only have to use the mangled name outside the class. Inside, we access the attribute in the normal way.
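Here is a small sketch of name mangling in action. Foo and __bar are the names used in this post; get_bar is a helper I added to show access from inside the class.

```python
class Foo:
    def __init__(self):
        self.__bar = 1  # stored on the instance as _Foo__bar

    def get_bar(self):
        return self.__bar  # inside the class the plain name works


foo = Foo()
print(foo.__dict__)   # {'_Foo__bar': 1}
print(foo._Foo__bar)  # 1
print(foo.get_bar())  # 1

try:
    foo.__bar  # outside the class the unmangled name does not exist
except AttributeError as e:
    print("AttributeError:", e)
```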

Getters and setters

After learning about “private” attributes, sometimes new Python programmers get the idea that they can use getters and setters to manage accessing and assigning attributes, so they write something like this.

It might work, but it’s not Python. Direct attribute access is the natural and Pythonic way to do things, so any solution to mediated attribute access should maintain that interface. There are a few ways to do it, such as overriding __getattr__ and __setattr__, but the best way is to use managed attributes.

Here we have created a managed bar attribute that stores its data in the private __bar attribute. When getting and setting the value of __bar, we can run whatever code we want for validation, logging, etc., provided we go through the interface provided by the two decorated bar functions. Useful, eh?
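A minimal sketch of such a managed attribute follows. The non-negative check is a made-up validation rule, there only to show where extra code can run.

```python
class Foo:
    def __init__(self):
        self.__bar = 0

    @property
    def bar(self):
        return self.__bar

    @bar.setter
    def bar(self, value):
        # Hypothetical validation: reject negative values.
        if value < 0:
            raise ValueError("bar must not be negative")
        self.__bar = value


foo = Foo()
foo.bar = 5     # looks like plain attribute access, runs the setter
print(foo.bar)  # 5
```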

Private methods

Methods can be made private in the same way, by naming them with two leading underscores and no trailing underscores.

And just like private attributes, they are accessible by name inside the class.

A word about single underscores

So far we have dealt with names that start with two underscores, but it’s quite common to see names that start with a single underscore. They are not private in the same sense. Name mangling does not occur. A single underscore is mostly just a weak indication that the thing in question is meant to be used internally and is not part of the public interface of the class, module, etc., that it is inside.

In classes, attributes and methods that start with a single underscore are treated normally.

However, single underscores are not purely a stylistic thing. They do affect how the import statement works.

PEP 8 says:

_single_leading_underscore: weak “internal use” indicator. E.g. from M import * does not import objects whose name starts with an underscore.

This means that if we have a function called _hello_world in a module called helloworld, and we import * from it, then the _hello_world function will not be pulled into the current scope.

It is possible to override the default hiding of objects with single leading underscores. __all__ is a list of the names of public objects exported by a module. If we add '_hello_world' to the list, then it will be pulled in with the wildcard import.

The single underscore only affects wildcard imports, which we should avoid anyway. We can still grab the function specifically using from helloworld import _hello_world.

And that’s pretty much all you need to know about private attributes in Python!

Multi-line strings in Python

At some point, you will want to define a multi-line string and find that the obvious solutions just don’t feel clean. In this post, I’m going to take a look at three ways of defining them and give you my recommendation.

Concatenation

The first way of doing it, and the way that immediately comes to mind, is to just add the strings to each other, like so:

In my opinion, this looks extremely ugly. You can make it a bit better by omitting the + signs. This also works:

That’s better, but still a bit of an eyesore. Let’s move on.
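For reference, the two concatenation variants might look like this. The text of the string is made up, and I am assuming backslash line continuations.

```python
# Explicit concatenation with + and line continuations.
with_plus = "First line\n" + \
            "second line\n" + \
            "third line"

# Adjacent string literals are joined automatically, so + can go.
without_plus = "First line\n" \
               "second line\n" \
               "third line"

print(with_plus == without_plus)  # True
```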

Multi-line string syntax

Another way is to use the built-in multi-line string syntax, like so:

That’s much better than concatenation, but it has one conspicuous wart. You can’t indent the subsequent lines to be at the same level of indentation as the first one. The space will be interpreted as part of the string. The first line will be flush with the margin and the subsequent lines will be indented.
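The indentation wart is easiest to see inside a function, where the continuation line would naturally be indented. A sketch with a made-up message:

```python
def make_message():
    message = """Dear reader,
    this line keeps its four leading spaces."""
    return message


print(repr(make_message()))
# 'Dear reader,\n    this line keeps its four leading spaces.'
```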

Tuple syntax

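There is another way to do it that doesn’t suffer from the ugliness of concatenation or the indentation wart of the multi-line syntax, and that is to wrap the strings in parentheses, which looks a lot like tuple syntax.

It works because Python joins adjacent string literals into a single string, and the parentheses simply let the expression span several lines. With no commas between the strings, no tuple is actually created.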

Note that you have to add the line breaks into the strings, because they’re not put in automatically. Nonetheless, I think this is by far the nicest, most readable method.
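A sketch of the parenthesized form, with made-up text. Note the explicit \n at the end of each piece.

```python
def make_message():
    message = ("First line\n"
               "second line\n"
               "third line")
    return message


print(type(make_message()))  # <class 'str'>
print(make_message())
```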

The difference between input and raw_input in Python

One of the first things that people notice when they ditch Python 2 and start coding in Python 3 – apart from the fact that print is now a function – is that the raw_input function has disappeared. So this Python 2 code:

must be converted to this in Python 3:

The change comes because the Python developers realized that they had made a dangerous mistake back in the early days. If you recall, the Python 2 version of the input function used to be equivalent to eval(raw_input(prompt)).

This allowed you to easily write programs that take input from the user and evaluate it as an int or a float or whatever type it is. For example:

In Python 3, input behaves like raw_input in Python 2, and the raw_input function has been removed completely, so you have to do something like this (assuming you want to accept integer values):

Effectively, in the Python 2 version of the input function, the string read from the prompt was evaled. To understand the danger of eval, you should take a look at this article by Ned Batchelder.
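Here is a sketch of the explicit conversion. The reader parameter is a hypothetical hook I added so the function can be exercised without a terminal; in normal use you would leave it at its default of input.

```python
def read_int(prompt, reader=input):
    """Read a line of text and convert it to an int explicitly."""
    text = reader(prompt)  # always a str in Python 3
    return int(text)       # raises ValueError on non-numeric input


# Simulate a user typing 42 at the prompt.
age = read_int("Enter your age: ", reader=lambda prompt: "42")
print(age + 1)  # 43
```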

Automatically evaling whatever anybody decides to type at the prompt maybe makes things a little easier in a teaching context, because students don’t have to learn to convert strings to their intended types, but it also leaves the program open to executing arbitrary code that the user types in, revealing private information about your system or damaging it in some way.

Take this for example:

That will print the current working directory of your program.

Or if someone really wants to screw things up for you, they could just execute a recursive delete of your home directory. DO NOT RUN THIS CODE:

There is no need for Python to contain such footguns, regardless of their dubious teaching value.

To get the old behaviour of input (which I hope I have convinced you that you do not want), replace your calls to it with eval(input()). In fact, that is exactly what the automatic porting tool 2to3 does.

How to modify a list in place in Python

If I had a penny for every time somebody asked about this particular problem, I’d have about thirty cents. Not enough to retire on, but enough to justify a blog post, at the very least.

Did you ever see a piece of code that looks like this?

The purpose of it is to remove any even numbers from the list numbers. And it looks like it should work, right?
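The listing is missing here, so this is a sketch of the kind of loop in question. The contents of numbers are an assumption, chosen to match the walkthrough later in the post.

```python
numbers = [1, 2, 2, 3, 4, 4, 5]  # assumed starting list

for elem in numbers:
    if elem % 2 == 0:
        numbers.remove(elem)  # mutates the list we are iterating over

print(numbers)  # [1, 2, 3, 4, 5]: some even numbers survive
```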

The problem, explained

If you’re an experienced programmer the mistake will be obvious, but beginners can expect to scratch their heads when they examine the final list and see that it still has some even numbers in it.

Whereas what we wanted to see is this.

What is going on here? We can illuminate the problem if we write out the code without any of the syntactic niceties of the Python for loop.
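Roughly speaking, the for loop behaves like the following while loop with an explicit counter. Again, the starting list is an assumption.

```python
numbers = [1, 2, 2, 3, 4, 4, 5]  # assumed starting list

i = 0
while i < len(numbers):
    elem = numbers[i]
    if elem % 2 == 0:
        del numbers[i]  # everything after position i shifts left by one
    i += 1              # ...but the counter still moves on

print(numbers)  # [1, 2, 3, 4, 5]
```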

Be your own debugger

At this point, things should be getting clearer. Let’s “step through” the while loop to see exactly how the execution goes.

  • On the first iteration the loop counter i is equal to 0. 1 (the value of the first element in numbers) is assigned to elem. 1 is not divisible by 2 so the if block is not called.
  • On the second iteration i is equal to 1. 2 (the value of the second element in numbers) is assigned to elem. 2 is clearly divisible by 2, so the if block is called and the second element is removed from the list. The length of the list is reduced by 1. The element that was at the third position is now at the second position and so on. The list now looks like this.

  • On the third iteration i is equal to 2. 3 (the value of the third element in numbers) is assigned to elem. The 2 that used to be the third element, but is now the second, has been skipped entirely. It can never be removed from the list.
  • The 4 in the sixth position of the initial list is skipped in the same way.

Now that you understand how the code is going wrong, let’s see how to fix it.

Three solutions

There are several ways to refactor the code so that it gives us the output we want. Let’s examine each one in turn.

First, we could build up a new list that contains only the elements in numbers that are not divisible by 2, either with a loop or a list comprehension. Here’s what that looks like.
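A sketch of the list comprehension version, using the same assumed starting list:

```python
numbers = [1, 2, 2, 3, 4, 4, 5]  # assumed starting list

# Keep only the elements that are not divisible by 2.
numbers = [elem for elem in numbers if elem % 2 != 0]

print(numbers)  # [1, 3, 5]
```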

Or without the list comprehension.
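The explicit loop might look like this (new_numbers is a name I picked for the sketch):

```python
numbers = [1, 2, 2, 3, 4, 4, 5]  # assumed starting list

new_numbers = []
for elem in numbers:
    if elem % 2 != 0:
        new_numbers.append(elem)
numbers = new_numbers

print(numbers)  # [1, 3, 5]
```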

Second, we could change the while loop to not increment the loop counter when we remove an item from the list.
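A sketch of that adjusted loop, with the same assumed list:

```python
numbers = [1, 2, 2, 3, 4, 4, 5]  # assumed starting list

i = 0
while i < len(numbers):
    if numbers[i] % 2 == 0:
        del numbers[i]  # the next element slides into position i
    else:
        i += 1          # only advance when nothing was removed

print(numbers)  # [1, 3, 5]
```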

Third, we could make a copy of the list, iterate over that, and modify the original. The [:] slicing syntax, which copies the entire list, comes in handy here.

So which one of these options should you use? There seems to be some agreement that if you can deal with the memory overhead of copying the list, then the last example is the most Pythonic. Otherwise, the most memory efficient method is to modify the loop counter, as in the second-to-last example.
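And a sketch of the copy-and-iterate approach:

```python
numbers = [1, 2, 2, 3, 4, 4, 5]  # assumed starting list

# Iterate over a shallow copy so removals cannot disturb the loop.
for elem in numbers[:]:
    if elem % 2 == 0:
        numbers.remove(elem)

print(numbers)  # [1, 3, 5]
```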

How to write a Reddit bot in Python

Something I have seen a lot of interest in is writing bots to interact with Reddit and provide useful services to the community. In this post, I’m going to show you how to build one.

Introducing BitesizeNewsBot

The bot we’re going to write is called BitesizeNewsBot. It sits on the “new” queue in the /r/worldnews subreddit and posts summaries of the articles that people link to.

I ran it for a few days last week and, after I worked out some of the kinks, it was ticking along nicely.

The code

Here is the full code for the bot. It’s under 80 lines! Take a minute to read it and then we will step through what it is doing.

Connecting to Reddit using PRAW

Due to some heroic open source contributions, the Python Reddit API Wrapper is a really mature and stable library that gives you access to everything in the Reddit API. The library even has its own subreddit – /r/praw. We’re going to use it to log in using the bot’s account, periodically fetch the new submissions in /r/worldnews, and post comments on the submissions that contain compact summaries of the linked articles.

Here’s how we log in with PRAW:

Then we enter a loop of fetching the new submissions from /r/worldnews, summarizing them, and posting them back as comments. On each iteration, we sleep for ten minutes to be a good citizen of Reddit.

This line fetches the ten newest submissions:

When we get the submissions, we can iterate over them and prepare the summaries.

Summarizing text using PyTLDR

Automatic text summarization is a topic I am really interested in. I’ve implemented several summarization algorithms, but the point of this post is to show you how to make a bot, not how to do advanced natural language processing, so we’re going to use a great library called PyTLDR.

PyTLDR implements several summarization algorithms, but the one we’re going to use is TextRank. The summarization function looks like this:

The summarize_web_page function takes either a string containing the article text, or a URL. If we give it a URL, as we are doing here, it uses Goose extractor behind the scenes to fetch the article text from the web page.

The function also takes a length parameter. If this is a value between zero and one, it represents the summary length as a fraction of the length of the original article. If it is greater than one, it represents a number of sentences. We have picked three as our summary length, which seems to strike the right balance between providing a useful summary and copying large pieces of the article.

The output of the summary function is a list of sentences. Before returning the summary from the function, we join them with newlines.

In the main loop, we call the function as follows:

Commenting on submissions

Once we have got the summary, we can generate the comment and post it on the article submission. Before we do that, though, we have to do a sanity check on the summary. Because article extraction from web pages is inherently unreliable, sometimes the summarize_web_page function will return an empty string. This piece of code in our main loop checks for that case and moves on to the next submission if we can’t generate a sensible summary for the current one:

Posting the comment can fail in many ways, so we need to catch several exceptions. As we want the same handler for each one (just print the exception and move on to the next iteration of the loop), we can catch them all in one line:
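The pattern is ordinary Python: list the exception types in a tuple after except. In this sketch post_comment is a hypothetical stand-in for the real call, and the built-in ConnectionError and TimeoutError are placeholders for whatever exceptions the Reddit client actually raises.

```python
def post_comment(submission_id):
    """Hypothetical stand-in for the call that posts a comment."""
    if submission_id == "bad":
        raise ConnectionError("could not reach the server")
    if submission_id == "limited":
        raise TimeoutError("slow down")


posted = []
for submission_id in ["good", "bad", "limited"]:
    try:
        post_comment(submission_id)
    except (ConnectionError, TimeoutError) as e:
        # Same handler for every listed exception type.
        print(e)
        continue
    posted.append(submission_id)

print(posted)  # ['good']
```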

Keeping track of seen submissions

We don’t want the bot to comment more than once on any article, so we keep track of them in a set that stores the unique identifiers of each post once the comment with the summary has been posted.

To persist the set of posts from one run of the program to the next, we will pickle it and store it in a file on disk.

At startup, we try to restore the set from disk, like so:

At shutdown, we store the set on disk:

We are using the register decorator from the atexit module to make sure that the save_seen_posts function is called whenever the interpreter exits normally. It will be called even if you hit Ctrl-C in the terminal, although not if the process is killed by a signal that Python does not handle or exits via os._exit.

In the main loop, we add the submission ID to the SEEN set right after posting the comment:
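Here is a sketch of the restore-and-save pair using only the standard library. The file location and the load_seen_posts helper are assumptions for this sketch; SEEN and save_seen_posts are the names used in the post.

```python
import atexit
import os
import pickle
import tempfile

# Hypothetical location for the pickled set.
SEEN_FILE = os.path.join(tempfile.gettempdir(),
                         "bitesize_seen_posts_example.pickle")


def load_seen_posts(path=SEEN_FILE):
    """Restore the set from disk, or start fresh if that fails."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        return set()


SEEN = load_seen_posts()


@atexit.register
def save_seen_posts():
    """Persist the set; registered to run at normal interpreter exit."""
    with open(SEEN_FILE, "wb") as f:
        pickle.dump(SEEN, f)
```

Because atexit.register returns the function it is given, save_seen_posts can still be called directly, for example to checkpoint the set in the middle of a long run.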

With the set in place, we check whether the bot has already commented on the submission before trying to generate a summary:

We also check if the submission is a self-post, because by definition they do not link to any news article.

And that’s really all there is to writing a simple Reddit bot. You can make it more complicated if you want, but BitesizeNewsBot demonstrates the basics.