Input and Output

slide 001 Hello, and welcome to the fifth episode of the Software Carpentry lecture on Python. In this episode, we’ll have a look at input and output.
slide 002 In the previous episodes, we’ve used print to see what our programs are doing. This is fine for tutorials…
slide 003 …but in real programs, we’ll want to save data to files.
slide 004 And read data from them.
slide 005 Python’s tools for input and output are pretty simple, and owe a lot to those invented for C. Basically:
slide 006 A file is actually a sequence of bytes.
slide 007 But it’s often more useful to treat it as a sequence of lines.
slide 008 For our examples, we’ll work with a file containing several haikus taken from an online competition run by Salon magazine in 1998.
slide 009 Let’s start by asking, “How many characters are in the file?”
slide 010 What we’ll actually find out is how many bytes are in the file.
slide 011 We’ll assume right now that each character is stored in one byte.
slide 012 But we’ll revisit the subject later.
slide 013 Here’s our program.
slide 014 It starts by opening the file using the built-in function open. This creates a file object that keeps track of the program’s connection to the file.
slide 015 The first argument to open is the name of the file the program wants to work with.
slide 016 The second is the letter 'r', which signals that it wants to read from the file.
slide 017 If all goes well, the program assigns the result of open to the variable reader, which will be the program’s connection to the file.
slide 018 The program can now call the method read to read in the entire content of the file…
slide 019 …and assign it to the variable data.
slide 020 Since the program is done with the file, it calls close.
slide 021 This isn’t strictly necessary in small programs, but it’s a good habit to get into, since most operating systems limit the number of files a program can have open at one time.
slide 022 Finally, we use len to find out how many characters are in the variable data, and print that out.
slide 023 Again, what we’re really reporting is the number of bytes, not the number of characters, but we’ll assume for now that there’s a one-to-one match.
slide 024 When we run our program, it tells us the file is 293 characters long.
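The slide's code is not reproduced in this transcript, so here is a minimal sketch of the program just described, written for Python 3. The file name sample.txt, its contents, and the function name count_bytes are stand-ins of our own, not the haiku file from the lecture. Note that in Python 3 a file opened with 'r' yields decoded characters rather than raw bytes; for plain ASCII text the two counts are the same.

```python
import os
import tempfile

def count_bytes(path):
    """Read the whole file at once and return the length of its content."""
    reader = open(path, 'r')   # 'r' signals that we want to read
    data = reader.read()       # the entire content as one string
    reader.close()             # done with the file, so release it
    return len(data)

# Hypothetical stand-in for the haiku file used in the lecture.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
writer = open(sample_path, 'w')
writer.write('first line\nsecond line\n')
writer.close()

print(count_bytes(sample_path))   # prints 23 for this sample
```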
slide 025 Our first program read the entire file into memory, then calculated its length. If the file might be very large—say, a few terabytes—it would be better to read it in chunks, so that we don’t overflow memory.
slide 026 Here’s a program that does that. After opening the file as before…
slide 027 …we pass the value 64 to the read method to indicate that we only want the next 64 bytes of data.
slide 028 This method will return an empty string if there is nothing left to read.
slide 029 The program then goes into a loop. As long as its last attempt to read from the file returned some data…
slide 030 …it prints out how much data it read…
slide 031 …then tries to read some more data.
slide 032 As a check, we print the length of data after the loop is over. This should be zero, since the program should stay in the loop as long as it’s actually getting data from the file.
slide 033 Sure enough, the output is four full blocks of 64 characters, one partial block of 37, and then 0 at the end of the file.
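A sketch of the "read in a loop" pattern described above, under the same caveats: the 150-character sample file and the name block_lengths are our own assumptions, and the function collects the lengths in a list instead of printing them one by one so the result is easy to inspect.

```python
import os
import tempfile

def block_lengths(path, size=64):
    """Return the length of each block read, plus the final empty read."""
    lengths = []
    reader = open(path, 'r')
    data = reader.read(size)          # ask for at most `size` characters
    while data != '':                 # an empty string means end of file
        lengths.append(len(data))
        data = reader.read(size)      # try to read some more
    lengths.append(len(data))         # the check after the loop: always 0
    reader.close()
    return lengths

# Hypothetical 150-character file: two full blocks and one partial block.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
writer = open(sample_path, 'w')
writer.write('x' * 150)
writer.close()

print(block_lengths(sample_path))   # prints [64, 64, 22, 0]
```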
slide 034 You might think from this example that reading in blocks is the right way to do it, but the extra complexity is really only warranted if…
slide 035 …the file really might be very large (or infinitely long, like a stream of readings from a lab instrument). Remember, premature optimization is the root of much evil.
slide 036 read is the most fundamental way to get data from a file, but it’s more common to read data a line at a time.
slide 037 To show how this works, here’s a program that calculates the average length of the lines in a file.
slide 038 After opening the file, it uses readline to read the next line of text from the file.
slide 039 As with read, this will return an empty string when there’s nothing left in the file, so our program loops until it gets an empty string.
slide 040 And inside the loop, it uses another readline call to try to get that next line.
slide 041 When we run this program on our haiku file, it tells us that the average line is a little over nineteen and a half characters long.
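The readline loop might look roughly like this. The sample file and the name average_line_length are hypothetical. Each line's length includes its trailing newline, because readline keeps it.

```python
import os
import tempfile

def average_line_length(path):
    """Average length of the lines in a file, read one line at a time."""
    reader = open(path, 'r')
    total = 0
    count = 0
    line = reader.readline()
    while line != '':                # empty string: nothing left to read
        count += 1
        total += len(line)           # the length includes the '\n'
        line = reader.readline()     # try to get the next line
    reader.close()
    return float(total) / count      # assumes the file has at least one line

# Hypothetical two-line sample file.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
writer = open(sample_path, 'w')
writer.write('ab\nabcd\n')
writer.close()

print(average_line_length(sample_path))   # prints 4.0
```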
slide 042 Memory permitting, it’s often more convenient to read all the lines in the file at once.
slide 043 Here’s a program that uses readlines (with an ‘s’ on the end) to do that.
slide 044 It returns a list of strings, which the program assigns to the variable contents.
slide 045 The program then loops over that list with for, instead of using a while to read from the file until it’s exhausted.
slide 046 Again, the output is a little over nineteen and a half characters per line.
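A sketch of the readlines version, again with a made-up sample file and function name. The whole file is held in memory as a list of strings, so this only suits files that fit comfortably.

```python
import os
import tempfile

def average_with_readlines(path):
    """Average line length, reading every line into a list first."""
    reader = open(path, 'r')
    contents = reader.readlines()    # a list of strings, one per line
    reader.close()
    total = 0
    count = 0
    for line in contents:            # loop over the list, not the file
        count += 1
        total += len(line)
    return float(total) / count      # assumes the file is not empty

# Hypothetical two-line sample file.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
writer = open(sample_path, 'w')
writer.write('ab\nabcd\n')
writer.close()

print(average_with_readlines(sample_path))   # prints 4.0
```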
slide 047 Reading the file’s contents as a list of lines, then looping over that list, is a very common idiom.
slide 048 So Python allows programs to just loop over a file line by line.
slide 049 Here’s the average line length program done that way.
slide 050 The for loop assigns the lines in the file to the variable line one after another, halting automatically when the file is exhausted.
slide 051 And yes, its output is 19.53333…
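Looping directly over the file object could be sketched as follows. As before, the sample file and the name average_with_for are our own placeholders.

```python
import os
import tempfile

def average_with_for(path):
    """Average line length, looping over the file object itself."""
    reader = open(path, 'r')
    total = 0
    count = 0
    for line in reader:              # stops by itself at end of file
        count += 1
        total += len(line)
    reader.close()
    return float(total) / count      # assumes the file is not empty

# Hypothetical two-line sample file.
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
writer = open(sample_path, 'w')
writer.write('ab\nabcd\n')
writer.close()

print(average_with_for(sample_path))   # prints 4.0
```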
slide 052 Of course, we have to put data in files somehow.
slide 053 In Python, we can do this with two other file methods called write and writelines.
slide 054 Here’s a program that writes to temp.txt using these two methods.
slide 055 As before, we open the file with open.
slide 056 The first argument is the file we want to write to. Its previous content will be overwritten if it already exists…
slide 057 …and it will be created if it doesn’t.
slide 058 The difference between this call and the ones we’ve seen before is that the second argument is the string 'w' instead of the string 'r', which signals that we want to write to the file.
slide 059 The program then uses the file object’s write method to write a string to the file.
slide 060 Alternatively, it can use writelines to write each string in a list.
slide 061 But if we run the program, then look in temp.txt, the output is all crammed together.
slide 062 The reason is that Python only writes what we tell it to, and we didn’t tell it to write any end-of-line characters.
slide 063 We have to modify the program to add a newline '\n' at the end of each line.
slide 064 When we run this program, we get the output we want.
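Here is a sketch of the corrected program with explicit newlines. The strings being written are invented for illustration, and the file is created in a temporary directory rather than the current one. Reading the file back at the end is only there to show what was written.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

writer = open(out_path, 'w')                  # 'w' overwrites or creates the file
writer.write('elements\n')                    # write one string
writer.writelines(['He\n', 'Ne\n', 'Ar\n'])   # write each string in a list
writer.close()

# Read the file back to check the result.
reader = open(out_path, 'r')
written = reader.read()
reader.close()
print(written)
```

Leaving out the '\n' characters would give the crammed-together output 'elementsHeNeAr', since neither write nor writelines adds line endings.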
slide 065 Rather than putting newline characters everywhere, many programmers find it simpler to write to files using print with a redirect, which is written as a double greater-than sign.
slide 066 Here’s our program rewritten to use this idiom.
slide 067 The file we want to print to appears on the right of the >>; other than that, it looks like any other print statement.
slide 068 And just like a regular print, this automatically adds a newline after everything.
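The >> redirect described here is Python 2 syntax. In Python 3, print is a function and the target file is passed with the file keyword argument instead. The sketch below uses that Python 3 form; the strings written are invented for illustration.

```python
import os
import tempfile

out_path = os.path.join(tempfile.mkdtemp(), 'temp.txt')

writer = open(out_path, 'w')
print('elements', file=writer)       # Python 2 spelling: print >> writer, 'elements'
for gas in ['He', 'Ne', 'Ar']:
    print(gas, file=writer)          # a newline is added automatically
writer.close()

# Read the file back to check the result.
reader = open(out_path, 'r')
written = reader.read()
reader.close()
print(written)
```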
slide 069 So let’s use what we’ve learned so far to copy one file’s contents to another.
slide 070 Here’s the first version:
slide 071 The first three lines read in everything from the source file…
slide 072 …while the second three write it all out to the destination file.
slide 073 As before, this probably won’t work with a terabyte of data…
slide 074 …but in almost all cases, that doesn’t matter.
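A sketch of this first copying program, wrapped in a function so it can be tried on a small file. The names copy_whole, source.txt, and copy.txt are our own.

```python
import os
import tempfile

def copy_whole(source, destination):
    """Copy a file by reading all of it, then writing all of it."""
    reader = open(source, 'r')
    data = reader.read()
    reader.close()

    writer = open(destination, 'w')
    writer.write(data)
    writer.close()

# Hypothetical source file in a temporary directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, 'source.txt')
copy_path = os.path.join(workdir, 'copy.txt')
writer = open(source_path, 'w')
writer.write('one\ntwo\nthree\n')
writer.close()

copy_whole(source_path, copy_path)
copied = open(copy_path, 'r').read()
print(copied == 'one\ntwo\nthree\n')   # prints True
```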
slide 075 Here’s another version that will work for a terabyte, provided it’s a terabyte of text.
slide 076 After opening both files, we read the input a line at a time, writing data out as we go, then close both files.
slide 077 This does assume the file is text, though.
slide 078 Or at least that the end-of-line character appears fairly frequently. If it doesn’t, readline may be asked to read an enormous block of data into memory, which gets us back to our “can’t read a terabyte” problem.
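One way the line-at-a-time copy could be written is shown below. It loops over the file object rather than calling readline explicitly, which is an assumption about the slide's code; the effect is the same. File and function names are hypothetical.

```python
import os
import tempfile

def copy_by_line(source, destination):
    """Copy a text file one line at a time."""
    reader = open(source, 'r')
    writer = open(destination, 'w')
    for line in reader:
        writer.write(line)           # each line still ends with its own '\n'
    reader.close()
    writer.close()

# Hypothetical source file in a temporary directory.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, 'source.txt')
copy_path = os.path.join(workdir, 'copy.txt')
writer = open(source_path, 'w')
writer.write('one\ntwo\nthree\n')
writer.close()

copy_by_line(source_path, copy_path)
copied = open(copy_path, 'r').read()
print(copied == 'one\ntwo\nthree\n')   # prints True
```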
slide 079 This version looks similar to the previous one, but doesn’t make an exact copy of the original file.
slide 080 Instead of using write, it prints to the file.
slide 081 The problem is that Python keeps the newline character at the end of the input line when it’s reading.
slide 082 And print automatically adds a newline at the end of what it outputs.
slide 083 So this program actually produces a double-spaced copy of the file.
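The double-spacing is easy to reproduce. This sketch uses the Python 3 form of print (the lecture's >> redirect is Python 2) and made-up file and function names.

```python
import os
import tempfile

def copy_with_print(source, destination):
    """Copy a file with print; this doubles every line ending."""
    reader = open(source, 'r')
    writer = open(destination, 'w')
    for line in reader:
        print(line, file=writer)     # line already ends in '\n'; print adds one more
    reader.close()
    writer.close()

# Hypothetical two-line source file.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, 'source.txt')
copy_path = os.path.join(workdir, 'copy.txt')
writer = open(source_path, 'w')
writer.write('a\nb\n')
writer.close()

copy_with_print(source_path, copy_path)
copied = open(copy_path, 'r').read()
print(repr(copied))   # prints 'a\n\nb\n\n'
```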
slide 084 Here’s a better alternative.
slide 085 It defines BLOCKSIZE to be 1024. (Python doesn’t actually let us define constants, so by convention, any variable that we want to treat as a constant is spelled in all capitals.) We then use the “read in a loop” idiom we saw earlier to read up to 1024 bytes at a time, writing them out as we go.
slide 086 This file copying program is efficient, and will handle files of any size, but it’s harder to understand than our first two: needlessly so, unless we expect very large input files.
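A sketch of the block-at-a-time copy follows. Opening both files in binary mode ('rb' and 'wb') is our own choice rather than something the slides confirm; it makes read return exactly the bytes on disk, so the copy is exact for any kind of file. The function and file names are hypothetical.

```python
import os
import tempfile

BLOCKSIZE = 1024   # a constant by convention only; Python will not enforce it

def copy_in_blocks(source, destination):
    """Copy a file of any size, at most BLOCKSIZE bytes at a time."""
    reader = open(source, 'rb')      # binary mode: an assumption, see above
    writer = open(destination, 'wb')
    data = reader.read(BLOCKSIZE)
    while len(data) > 0:             # an empty result means end of file
        writer.write(data)
        data = reader.read(BLOCKSIZE)
    reader.close()
    writer.close()

# Hypothetical 3000-byte source file: two full blocks and one partial block.
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, 'source.bin')
copy_path = os.path.join(workdir, 'copy.bin')
writer = open(source_path, 'wb')
writer.write(b'x' * 3000)
writer.close()

copy_in_blocks(source_path, copy_path)
print(os.path.getsize(copy_path))   # prints 3000
```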