Python for Perl Programmers

There are certain Perl idioms that every Perl programmer uses: "while (<>) { foo; }" and "$foo =~ s/old/new/g" both come to mind.

When I was learning Python I was pretty peeved that certain Python books didn't get to that kind of thing until much later chapters. One didn't cover it until the end! As [a long-time Perl user](https://everythingsysadmin.com/2011/03/overheard-at-the-office-perl-e.html) this annoyed and confused me.

While they might have been trying to send a message that Python has better ways to do those things, I think the real problem was that the audience for a general Python book is a lot bigger than the audience for a book for Perl people learning Python. Imagine how confusing it would be to a person learning their first programming language if their book started out comparing one language you didn't know to a different language you didn't know!

So here are the idioms I wish were in Chapter 1. I'll be updating this document as I think of new ones, but I'm trying to keep this a short list.

Processing every line in a file

Perl:

while (<>) {
    print $_;
}

Python:

for line in open('filename.txt'):
    print line,   # trailing comma: line already ends with a newline

To emulate the Perl <> technique, which reads every file named on the command line, or stdin if there are none:

import fileinput
for line in fileinput.input():
    print line,

If you must access stdin directly, that is in the "sys" module:

import sys
for line in sys.stdin:
    print line,
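Perl people will go looking for chomp next. A minimal sketch of the usual stand-in, rstrip('\n'), is below. It uses io.StringIO as a pretend file so it runs on its own, and print() as a function so it also works under Python 3; the sample text and the names fake_input and cleaned are made up for illustration.

```python
import io

# io.StringIO stands in for an open file or sys.stdin in this sketch.
fake_input = io.StringIO(u"first line\nsecond line\n")

cleaned = []
for line in fake_input:
    # Each line still carries its trailing newline; rstrip('\n') drops it,
    # which is roughly what chomp does in Perl.
    cleaned.append(line.rstrip('\n'))

print(cleaned)
```

The same loop body works unchanged with open('filename.txt'), fileinput.input() or sys.stdin.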

Most Python programmers, however, tend to just read the entire file into one huge string and process it that way. I feel funny doing that. Having used machines with very limited amounts of RAM, I tend to keep my file processing to a single line at a time, though that habit is going the way of the dodo.

contents = open('filename.txt').read()
all_input = sys.stdin.read()

If you want the file to be one string per line, just change read() to readlines(). Note that each string keeps its trailing newline; use read().splitlines() if you want the newlines removed.

list_of_strings = open('filename.txt').readlines()
all_input_as_list = sys.stdin.readlines()
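To see the difference between the whole-file approaches side by side, here is a small sketch. It writes a throwaway temp file first so it is self-contained; the file contents and variable names are only for illustration, and it uses print() as a function so it also runs under Python 3.

```python
import os
import tempfile

# Create a small throwaway file so the sketch does not depend on filename.txt.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w') as handle:
    handle.write('alpha\nbeta\n')

with open(path) as handle:
    contents = handle.read()            # the whole file as one string

with open(path) as handle:
    with_newlines = handle.readlines()  # one string per line, newlines kept

without_newlines = contents.splitlines()  # one string per line, newlines dropped

os.remove(path)

print(with_newlines)
print(without_newlines)
```

The "with" blocks close the file for you, which matters more in long-running programs than in throw-away scripts.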

Regular expressions

Python has a very powerful RE system; you just have to enable it with "import re". Any place you can use a regular expression you can also use a compiled regular expression. Python people tend to always compile their regular expressions; I guess they aren't used to writing throw-away scripts like in Perl:

import re
import sys

RE_DATE = re.compile(r'\d\d\d\d-\d{1,2}-\d{1,2}')
for line in sys.stdin:
    mo = re.search(RE_DATE, line)
    if mo:
        print mo.group(0)

There are two functions, re.search() and re.match(). re.match() only matches if the string starts with the regular expression. It is like putting a "^" at the front of your regex. re.search() is like putting a ".*" at the front of your regex. Since match comes before search alphabetically, most Perl users find "match" in the documentation, try to use it, and get confused that r'foo' does not match 'i foo you'. My advice? Pretend match doesn't exist (just kidding).
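The 'i foo you' surprise is easy to reproduce. A quick sketch, using the string from the paragraph above (the variable names are mine, and print() is used as a function so this also runs under Python 3):

```python
import re

text = 'i foo you'

# match() only tries at the very start of the string, so it finds nothing here.
at_start = re.match(r'foo', text)

# search() scans forward until the pattern matches somewhere.
anywhere = re.search(r'foo', text)

print(at_start)            # None
print(anywhere.group(0))   # foo
print(anywhere.start())    # 2
```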

The big change you'll have to get used to is that the result of a match is an object, and you pull various bits of information from the object. If nothing is found, you don't get an object, you get None, which makes it easy to test for in an if statement. A match object is always true; None is always false. Now that code above makes more sense, right?

Yes, you can put parentheses around parts of the regular expression to extract data. That's where the match object that gets returned is pretty cool:

import re
import sys

for line in sys.stdin:
    mo = re.search(r'(\d\d\d\d)-(\d{1,2})-(\d{1,2})', line)
    if mo:
        print mo.group(0)

The first thing you'll notice is that the "mo =" and the "if" are on separate lines. There is no "if x = re.search() then" idiom in Python like there is in Perl. It is annoying at first, but eventually I got used to it and now I appreciate that I can't accidentally assign a variable that I meant to compare.

Let's look at that match object that we assigned to the variable "mo" earlier:

  • mo.group(0) -- The part of the string that matched the regex.
  • mo.group(1) -- The first ()'ed part
  • mo.group(2) -- The second ()'ed part
  • mo.group(1,3) -- The first and third matched parts (as a tuple)
  • mo.groups() -- A tuple containing all the matched parts.
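Here is a rough sketch of those calls against one sample line. The input string is made up for illustration, and print() is used as a function so this also runs under Python 3.

```python
import re

# Hypothetical input line; any text containing a yyyy-mm-dd date would do.
line = 'released on 2011-03-15, more or less'

mo = re.search(r'(\d\d\d\d)-(\d{1,2})-(\d{1,2})', line)

whole = mo.group(0)          # the full text that matched
year = mo.group(1)           # first parenthesized part
month = mo.group(2)          # second parenthesized part
year_day = mo.group(1, 3)    # first and third parts, as a tuple
parts = mo.groups()          # all parenthesized parts, as a tuple

print(whole)
print(year_day)
print(parts)
```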

Perl's s/// substitutions are easily done with re.sub(), but if you don't require a regular expression, the string method replace() is much faster:

>>> re.sub(r'\d\d+', r'', '1 22 333 4444 55555')
'1    '

>>> re.sub(r'\d+', r'', '9876 and 1234')
' and '

>>> re.sub(r'remove', r'', 'can you remove from')
'can you  from'

>>> 'can you remove from'.replace('remove', '')
'can you  from'

You can even do substitutions with multiple parenthesized groups, as you would expect:

>>> re.sub(r'(\d+) and (\d+)', r'yours=\1 mine=\2', '9876 and 1234')
'yours=9876 mine=1234'
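One thing that trips up Perl users: re.sub() replaces every match by default, like s///g. To get the behavior of a plain s/// (first match only), pass count=1. A small sketch, with a made-up replacement string and variable names, written so it also runs under Python 3:

```python
import re

text = '9876 and 1234'

# Default: every match is replaced, like Perl's s/\d+/N/g
every_one = re.sub(r'\d+', 'N', text)

# count=1 stops after the first match, like Perl's s/\d+/N/
first_only = re.sub(r'\d+', 'N', text, count=1)

# A compiled pattern has the same method.
RE_NUMBER = re.compile(r'\d+')
compiled_result = RE_NUMBER.sub('N', text)

print(every_one)
print(first_only)
```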

After you get used to that, read [the "pydoc re" page](http://docs.python.org/library/re.html) for more information.

String manipulations

I found it odd that Python folks don't use regular expressions as much as Perl people. At first I thought this was due to the fact that Python makes it more cumbersome ('cause I didn't like to have to do 'import re'). It turns out that Python string handling can be more powerful. For example the common Perl idiom "s/foo/bar/" (as long as "foo" is not a regex) is as simple as:

credit = 'i made this'
print credit.replace('made', 'created')

or

print 'i made this'.replace('made', 'created')

It is kind of fun that strings are objects that have methods. It looks funny at first.

Notice that replace returns a string. It doesn't modify the string. In fact, strings cannot be modified, only created. Python cleans up for you automatically, and it can't do that very easily if things change out from under it. This is very Lisp-like. This is odd at first but you get used to it. Wait... by "odd" I mean "totally fucking annoying". However, I assure you that eventually you'll see the benefits of string de-duplication and (I'm told) speed.
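If you want to convince yourself, this short sketch shows both halves: replace() leaves the original alone, and trying to change a character in place raises TypeError. The variable names are just for illustration; print() is used as a function so this also runs under Python 3.

```python
credit = 'i made this'
updated = credit.replace('made', 'created')

# The original string is untouched; replace() handed back a new one.
print(credit)
print(updated)

# Strings do not support item assignment, so this raises TypeError.
try:
    credit[0] = 'I'
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(assignment_failed)
```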

It does mean, however, that accumulating data in a string is painfully slow:

s = 'this is the first part\n'
s += 'i added this.\n'
s += 'and this.\n'
s += 'and then this.\n'

The above code is bad. Each assignment copies all the previous data just to make a new string. The more you accumulate, the more copying is needed. The Pythonic way is to accumulate a list of the strings and join them later.

s = []
s.append('this is the first part\n')
s.append('i added this.\n')
s.append('and this.\n')
s.append('and then this.\n')
print ''.join(s)

It seems slower, but it is actually faster. The strings stay in their place. Each addition to "s" is just adding a pointer to where the strings are in memory. You've essentially built up an array of pointers (a Python list is a growable array, not a linked list), which is much more light-weight and faster to manage than copying those strings around. At the end, you join the strings. Python makes one run through all the strings, copying them to a buffer, a pointer to which is sent to the "print" routine. This is about the same amount of work as Perl, which internally was copying the strings into a buffer along the way. Perl did copy-bytes, copy-bytes, copy-bytes, copy-bytes, pass pointer to print. Python did append-pointer 4 times, then a highly optimized copy-bytes, copy-bytes, copy-bytes, copy-bytes, pass pointer to print.
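If you would rather measure than take my word for it, the standard timeit module makes that easy. This is only a sketch: the line count, the repeat count and the function names are arbitrary choices of mine, and the numbers you get will vary by interpreter and version (some CPython versions can optimize += in simple loops like this one), so treat the output as something to inspect rather than a guaranteed result.

```python
import timeit

# 2000 short lines to accumulate; the count is arbitrary.
LINES = ['line %d\n' % n for n in range(2000)]

def build_by_concat():
    # Accumulate with +=, creating a new string each time (in principle).
    s = ''
    for line in LINES:
        s += line
    return s

def build_by_join():
    # Accumulate references in a list, then copy once at the end.
    parts = []
    for line in LINES:
        parts.append(line)
    return ''.join(parts)

# Both approaches must produce the same text.
same_result = build_by_concat() == build_by_join()

concat_seconds = timeit.timeit(build_by_concat, number=50)
join_seconds = timeit.timeit(build_by_join, number=50)

print('concat: %.4f s' % concat_seconds)
print('join:   %.4f s' % join_seconds)
```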

Joining and splitting

This killed me until I got used to it. The delimiter is not a parameter to join; instead, join is a method of the delimiter string.

Perl:

my $new = join('|', $str1, $str2, $str3);

Python:

new = '|'.join([str1, str2, str3])

Python's join is a method of the delimiter string. It hurt my brain until I got used to it.

Oh, the join() method only takes one argument. What? It's joining a list of things... why does it take only one argument? Well, that one argument is the list (or any other sequence of strings; see the example above). I guess that makes the syntax more uniform.
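Two details are worth a quick sketch: the one argument can be any iterable of strings, not just a list, and every item has to be a string already (Perl would stringify numbers for you; Python raises TypeError). The sample values and names below are made up, and print() is used as a function so this also runs under Python 3.

```python
numbers = [1, 2, 3]

# join() refuses non-strings, so convert each item first.
joined_numbers = '|'.join(str(n) for n in numbers)

# A tuple works just as well as a list.
joined_tuple = ', '.join(('a', 'b', 'c'))

# Passing the numbers directly raises TypeError.
try:
    '|'.join(numbers)
    join_failed = False
except TypeError:
    join_failed = True

print(joined_numbers)
print(joined_tuple)
print(join_failed)
```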

Splitting strings is much more like Perl... kind of. The parameter is what you split on, or leave it blank for "awk-like splitting" (which heathens call "perl-like splitting" but they are forgetting their history).

Perl:

my @values = split(/\|/, $data);

Python:

values = data.split('|')

You can split a string literal too. In this example we don't give split() any parameters so that it does "awk-like splitting".

>>> print 'one two three four'.split()
['one', 'two', 'three', 'four']

If you have a multi-line string that you want to break into its individual lines, bigstring.splitlines() will do that for you.
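A few split variations side by side, as a rough sketch. The sample strings and variable names are invented for illustration, and print() is used as a function so this also runs under Python 3.

```python
record = 'root|x|0|0'

# Split on a literal delimiter (no regex involved, unlike Perl's split).
fields = record.split('|')

# An optional second argument limits the number of splits.
first, rest = record.split('|', 1)

# No argument: "awk-like" splitting on runs of whitespace,
# ignoring leading and trailing whitespace.
words = '  one   two\tthree \n'.split()

# splitlines() drops the newlines and leaves no trailing empty string,
# whereas split('\n') leaves one after the final newline.
by_splitlines = 'a\nb\n'.splitlines()
by_split = 'a\nb\n'.split('\n')

print(fields)
print(rest)
print(words)
print(by_splitlines)
print(by_split)
```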

Getting help

pydoc foo

Except it doesn't work half the time, because you need to know which module something is in. I prefer the "quick search" box on http://docs.python.org or "just use Google".

I have not read [Python for Unix and Linux System Administration](http://www.amazon.com/dp/0596515820/safocus-20), but the table of contents looks excellent. I have read most of Python Cookbook (the first edition; there is a second edition out too) and learned a lot. Both are from O'Reilly and can be read on Safari Books Online.

That's it!

Those few idioms make up most of the Perl code I usually wrote. Learning Python would have been so much easier if someone had shown me the Python equivalents early on.

One last thing... As a sysadmin there are a few modules that I've found useful:

 

14 Comments

Once you get comfortable with Python basics, I highly recommend checking out David Beazley's "Generator Tricks for System Programmers". This is a great compromise between the classic one-line-at-a-time processing you mention in your post, but allows you to treat the source as if all the lines were read into memory already.

Would be nice if you also included an example of doing regex-based search/replace when there's ()'s involved.

You missed the following page that does more of the same for Python 2.X:

http://wiki.python.org/moin/PerlPhrasebook

- Paddy.

Do you know of any good references for pexpect? I use it a lot, but i wouldn't say i've gotten used to it (i'm not really an expect expert either), and i get bitten a lot.

Do you recommend any book of Python for system admins? or any suggestion to start learning Python for system administration.

I am pretty sure that "odd" behavior in Python is also there in Perl. I recall helping debug a problem for some bioinformatics software in Perl about 10 years back and they were appending little pieces to a Perl string about a million times and wondered why the script took forever. Or has Perl changed since then?

Perl will only recopy the string if it needs to because the buffer allocated to it has run out of space. It will allocate a new (larger) buffer, copy the string there, and do the operation in that new buffer. If you keep appending small strings to a string, it may have to recopy the string every time, or every other time, or less depending on how much extra space it allocates. However, you can modify the string in-place without causing it to need to recopy if the change doesn't affect the length of the string. In other words, you can "poke" a replacement character at position 3 to change '123456' to '123x56'.

Python, however, cannot change a string. Ever. They are allocated with the exact amount of space required, no extra, and even then they are immutable. You can't take a string and change the 3rd char in place. Methods like replace() return a new string with the result; the old string is still there. Something like "x += 'foo'" creates a new string and destroys the old one invisibly.

Perl regexes are implicitly compiled unless they are the result of scalar interpolation. That's why you only occasionally see things like qr/RE/ or /RE/o in scripts. In Python you have to be explicit all the time I guess.

argparse >> gflags

On the split part, in python it would be:
values = data.split('|')

Good point. I've updated the doc. Thanks!

Thanks

I'm really having a hard time understanding how to easily and quickly interpolate variables into strings, especially for repetitive stuff like debug messages.

In perl, I surely spend half my day typing

print "at this point in the program, here is foo:$foo bar:$bar" if $debug

In python, there are several ways to do this but it appears that they either require more typing, especially closing the quote on my string and using +, or referring to each substituted string in two places, like

print " foo:%s bar:%s" % (foo, bar)

which is terrible because if I want to add another variable baz, I have to modify two places in the line.

I need this to be really easy because my generate and test cycle frequently includes such print statements.

Do python programmers just do the entire generate and test cycle in some different way?

Thanks for writing this. You wrote "Imagine how confusing it would be to a person learning their first programming language if their book started out comparing one language you didn't know to a different language you didn't know!" I've been learning about Scala lately, and I know the basics of Java, but most tutorials assume that you know Java really well, which is annoying.
