Commit a5b5ea88 authored by Michiel Cottaar

Bug fixes from running notebooks

parent 72845f6d
%% Cell type:markdown id: tags:
# Basic python
This tutorial is aimed at briefly introducing you to the main language
features of python, with emphasis on some of the difficulties
and pitfalls that are commonly encountered when moving to python.
When going through this make sure that you _run_ each code block and
look at the output, as these are crucial for understanding the
explanations. You can run each block by using _shift + enter_
(including the text blocks, so you can just move down the document
with shift + enter).
It is also possible to _change_ the contents of each code block (these pages
are completely interactive) so do experiment with the code you see and try
some variations!
> **Important**: We are exclusively using Python 3 in FSL - as of FSL 6.0.4 we
> are using Python 3.7. There are some subtle differences between Python 2 and
> Python 3, but instead of learning about these differences, it is easier to
> simply forget that Python 2 exists. When you are googling for Python help,
> make sure that the pages you find are relevant to Python 3 and *not* Python
> 2! The official Python docs can be found at https://docs.python.org/3/ (note
> the _/3/_ at the end!).
## Contents
* [Basic types](#Basic-types)
  - [Strings](#Strings)
    + [Format](#Format)
    + [String manipulation](#String-manipulation)
  - [Tuples and lists](#Tuples-and-lists)
    + [Adding to a list](#Adding-to-a-list)
    + [Indexing](#Indexing)
    + [Slicing](#Slicing)
  - [List operations](#List-operations)
    + [Looping over elements in a list (or tuple)](#Looping)
    + [Getting help](#Getting-help)
  - [Dictionaries](#Dictionaries)
    + [Adding to a dictionary](#Adding-to-a-dictionary)
    + [Removing elements from a dictionary](#Removing-elements-dictionary)
    + [Looping over everything in a dictionary](#Looping-dictionary)
  - [Copying and references](#Copying-and-references)
* [Control flow](#Control-flow)
  - [Boolean operators](#Boolean-operators)
  - [If statements](#If-statements)
  - [For loops](#For-loops)
  - [While loops](#While-loops)
  - [A quick intro to conditional expressions and list comprehensions](#quick-intro)
    + [Conditional expressions](#Conditional-expressions)
    + [List comprehensions](#List-comprehensions)
* [Functions](#functions)
* [Exercise](#exercise)
---
<a class="anchor" id="Basic-types"></a>
# Basic types
Python has many different types and variables are dynamic and can change types (like MATLAB). Some of the most commonly used in-built types are:
* integer and floating point scalars
* strings
* tuples
* lists
* dictionaries
N-dimensional arrays and other types are supported through common modules (e.g., [numpy](https://numpy.org/), [scipy](https://docs.scipy.org/doc/scipy-1.4.1/reference/), [scikit-learn](https://scikit-learn.org/stable/)). These will be covered in subsequent exercises.
%% Cell type:code id: tags:
```
a = 4
b = 3.6
c = 'abc'
d = [10, 20, 30]
e = {'a' : 10, 'b': 20}
print(a)
```
%% Cell type:markdown id: tags:
Any variable or combination of variables can be printed using the function `print()`:
%% Cell type:code id: tags:
```
print(d)
print(e)
print(a, b, c)
```
%% Cell type:markdown id: tags:
> _*Python 3 versus python 2*_:
>
> Print - for the print statement the brackets are compulsory for *python 3*, but are optional in python 2. So you will see plenty of code without the brackets but you should never use `print` without brackets, as this is incompatible with Python 3.
>
> Division - in python 3 division with the `/` operator is always floating point (like in MATLAB), even if the values are integers, but in python 2 integer division works like it does in C.
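
The division behaviour can be checked directly. This is a minimal sketch (the variable names are illustrative only): `/` always gives a float in Python 3, while `//` performs floor division and `%` gives the remainder.

```python
quotient = 7 / 2        # true division: always a float in Python 3
floored = 7 // 2        # floor division: rounds down to a whole number
remainder = 7 % 2       # modulus (remainder after floor division)
print(quotient, floored, remainder)
exact = 6 / 3           # still a float, even though it divides exactly
print(exact, type(exact))
```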
---
<a class="anchor" id="Strings"></a>
## Strings
Strings can be specified using single quotes *or* double quotes - as long as they are matched.
Strings can be indexed like lists (see later).
For example:
%% Cell type:code id: tags:
```
s1 = "test string"
s2 = 'another test string'
print(s1, ' :: ', s2)
```
%% Cell type:markdown id: tags:
You can also use triple quotes to capture multi-line strings. For example:
%% Cell type:code id: tags:
```
s3 = '''This is
a string over
multiple lines
'''
print(s3)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="Format"></a>
### Format
More interesting strings can be created using an [f-string](https://realpython.com/python-f-strings/),
which is very useful in print statements:
%% Cell type:code id: tags:
```
x = 1
y = 'PyTreat'
s = f'The numerical value is {x} and a name is {y}'
print(s)
print(f'A name is {y} and a number is {x}')
```
%% Cell type:markdown id: tags:
Note the `f` before the initial quote. This lets python know to fill in the variables between the curly brackets.
There are also other options along these lines, which will be discussed in the next practical.
This is the more modern version, although you will see plenty of the other alternatives in "old" code
(to python coders this means anything written before last week).
<a class="anchor" id="String-manipulation"></a>
### String manipulation
The methods `lower()` and `upper()` are useful for strings. For example:
%% Cell type:code id: tags:
```
s = 'This is a Test String'
print(s.upper())
print(s.lower())
```
%% Cell type:markdown id: tags:
Another useful method is:
%% Cell type:code id: tags:
```
s = 'This is a Test String'
s2 = s.replace('Test', 'Better')
print(s2)
```
%% Cell type:markdown id: tags:
Strings can be concatenated just by using the `+` operator:
%% Cell type:code id: tags:
```
s3 = s + ' :: ' + s2
print(s3)
```
%% Cell type:markdown id: tags:
If you like regular expressions then you're in luck as these are well supported in python using the `re` module. To use this (like many other "extensions", called _modules_ in Python) you need to `import` it. For example:
%% Cell type:code id: tags:
```
import re
s = 'This is a test of a Test String'
s1 = re.sub(r'a [Tt]est', "an example", s)
print(s1)
```
%% Cell type:markdown id: tags:
where the `r` before the quote is used to force the regular expression specification to be a `raw string` (see [here](https://docs.python.org/3.5/library/re.html) for more info).
For more information on matching and substitutions, look up the regular expression module on the web.
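As a small taste of matching (rather than substitution), the sketch below uses two other functions from the `re` module, `findall` and `search`, on the same test string. The names `matches` and `m` are just illustrative.

```python
import re

s = 'This is a test of a Test String'
# findall returns every non-overlapping match as a list of strings
matches = re.findall(r'[Tt]est', s)
print(matches)
# search returns a match object for the first match (or None if there is none)
m = re.search(r'a ([Tt]est)', s)
if m is not None:
    # group(0) is the whole match, group(1) is the part inside the brackets
    print(m.group(0), '|', m.group(1), '|', m.start())
```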
Two common and convenient string methods are `strip()` and `split()`. The
first will remove any whitespace at the beginning and end of a string:
%% Cell type:code id: tags:
```
s2 = ' A very spacy string '
print('*' + s2 + '*')
print('*' + s2.strip() + '*')
```
%% Cell type:markdown id: tags:
With `split()` we can tokenize a string (to turn it into a list of strings) like this:
%% Cell type:code id: tags:
```
print(s.split())
print(s2.split())
```
%% Cell type:markdown id: tags:
By default it splits at whitespace, but it can also split at a specified delimiter:
%% Cell type:code id: tags:
```
s4 = ' This is, as you can see , a very weirdly spaced and punctuated string ... '
print(s4.split(','))
```
%% Cell type:markdown id: tags:
A neat trick, if you want to change the delimiter in some structured data (e.g.
replace `,` with `\t`), is to use `split()` in combination with another string
method, `join()`:
%% Cell type:code id: tags:
```
csvdata = 'some,comma,separated,data'
tsvdata = '\t'.join(csvdata.split(','))
tsvdata = tsvdata.replace('comma', 'tab')
print('csvdata:', csvdata)
print('tsvdata:', tsvdata)
```
%% Cell type:markdown id: tags:
There are more powerful ways of dealing with things like csv files/strings,
which are covered in later practicals, but even this can get you a long way.
> Note that strings in python 3 are _unicode_, so they can represent Chinese
> characters, etc., and are therefore very flexible. However, in general you
> can just be blissfully ignorant of this fact.
Strings can be converted to integer or floating-point values by using the
`int()` and `float()` calls:
%% Cell type:code id: tags:
```
sint='23'
sfp='2.03'
print(sint + sfp)
print(int(sint) + float(sfp))
print(float(sint) + float(sfp))
```
%% Cell type:markdown id: tags:
> Note that calling `int()` on a string that does not represent an integer (e.g., on `sfp` above) will raise an error.
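
One way to cope with this error is sketched below. It uses a `try`/`except` block, which is not otherwise covered in this tutorial, so treat it as a preview rather than something you need to know now; the name `value` is illustrative.

```python
sfp = '2.03'
try:
    value = int(sfp)            # raises ValueError: '2.03' is not an integer string
except ValueError:
    value = int(float(sfp))     # go via float, which truncates towards zero
print(value)
```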
---
<a class="anchor" id="Tuples-and-lists"></a>
## Tuples and lists
Both tuples and lists are builtin python types and are like vectors,
but for numerical vectors and arrays it is much better to use `numpy`
arrays (or matrices), which are covered in a later tutorial.
A tuple is like a list, but less flexible: tuples are immutable, so they cannot be changed once created. Anything can be stored in either a list or a tuple, and the elements do not need to be of the same type. Tuples are defined using round brackets and lists are defined using square brackets. For example:
%% Cell type:code id: tags:
```
xtuple = (3, 7.6, 'str')
xlist = [1, 'mj', -5.4]
print(xtuple)
print(xlist)
```
%% Cell type:markdown id: tags:
They can also be nested:
%% Cell type:code id: tags:
```
x2 = (xtuple, xlist)
x3 = [xtuple, xlist]
print('x2 is: ', x2)
print('x3 is: ', x3)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="Adding-to-a-list"></a>
### Adding to a list
This is easy:
%% Cell type:code id: tags:
```
a = [10, 20, 30]
a = a + [70]
a += [80]
print(a)
```
%% Cell type:markdown id: tags:
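> Similar things can be done for tuples, but note that a one-element tuple
> needs a trailing comma: `a += (80,)`. Writing `a += (80)` fails because
> `(80)` is just the integer 80. As a tuple is immutable, `+=` creates a new
> tuple rather than changing the existing one.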
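
A short sketch of concatenation with tuples (the names `t` and `original` are illustrative). Because tuples are immutable, each operation builds a new tuple and leaves the old one untouched:

```python
t = (10, 20, 30)
original = t            # a second name for the same tuple object
t = t + (70,)           # note the trailing comma: (70) is just the integer 70
t += (80,)              # builds a new tuple and rebinds the name t
print(t)
print(original)         # the original tuple is unchanged
```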
<a class="anchor" id="Indexing"></a>
### Indexing
Square brackets are used to index tuples, lists, strings, dictionaries, etc. For example:
%% Cell type:code id: tags:
```
d = [10, 20, 30]
print(d[1])
```
%% Cell type:markdown id: tags:
> _*Pitfall:*_
> Python uses zero-based indexing, unlike MATLAB
%% Cell type:code id: tags:
```
a = [10, 20, 30, 40, 50, 60]
print(a[0])
print(a[2])
```
%% Cell type:markdown id: tags:
Indices naturally run from 0 to N-1, _but_ negative numbers can be used to
reference from the end (circular wrap-around).
%% Cell type:code id: tags:
```
print(a[-1])
print(a[-6])
```
%% Cell type:markdown id: tags:
However, this only works for -1 to -N. Any index outside the range -N to N-1 will generate an `index out of range` error.
%% Cell type:code id: tags:
```
print(a[-7])
```
%% Cell type:code id: tags:
```
print(a[6])
```
%% Cell type:markdown id: tags:
Length of a tuple or list is given by the `len()` function:
%% Cell type:code id: tags:
```
print(len(a))
```
%% Cell type:markdown id: tags:
Nested lists can have nested indexing:
%% Cell type:code id: tags:
```
b = [[10, 20, 30], [40, 50, 60]]
print(b[0][1])
print(b[1][0])
```
%% Cell type:markdown id: tags:
but *not* an index like `b[0, 1]`. However, numpy arrays (covered in a later practical) can be indexed like `b[0, 1]` and similarly for higher dimensions.
> Note that `len` will only give the length of the top level.
> In general, numpy arrays should be preferred to nested lists when the contents are numerical.
<a class="anchor" id="Slicing"></a>
### Slicing
A range of values for the indices can be specified to extract values from a list. For example:
%% Cell type:code id: tags:
```
print(a[0:3])
```
%% Cell type:markdown id: tags:
> _*Pitfall:*_
>
> Slicing syntax is different from MATLAB in that the second number is
> exclusive (i.e., one plus final index) - this is in addition to the zero index difference.
%% Cell type:code id: tags:
```
a = [10, 20, 30, 40, 50, 60]
print(a[0:3]) # same as a(1:3) in MATLAB
print(a[1:3]) # same as a(2:3) in MATLAB
```
%% Cell type:markdown id: tags:
> _*Pitfall:*_
>
> Unlike in MATLAB, you cannot use a list as indices instead of an
> integer or a slice (although this can be done in `numpy`).
%% Cell type:code id: tags:
```
b = [3, 4]
print(a[b])
```
%% Cell type:markdown id: tags:
In python you can leave the start and end values implicit, as it will assume these are the beginning and the end. For example:
%% Cell type:code id: tags:
```
print(a[:3])
print(a[1:])
print(a[:-1])
```
%% Cell type:markdown id: tags:
In the last example, remember that negative indices are subject to wrap around so that `a[:-1]` represents all elements up to the penultimate one.
You can also change the step size, which is specified by the third value (not the second one, as in MATLAB). For example:
%% Cell type:code id: tags:
```
print(a[0:4:2])
print(a[::2])
print(a[::-1])
```
%% Cell type:markdown id: tags:
the last example is a simple way to reverse a sequence.
<a class="anchor" id="List-operations"></a>
### List operations
Multiplication can be used with lists, where multiplication implements replication.
%% Cell type:code id: tags:
```
d = [10, 20, 30]
print(d * 4)
```
%% Cell type:markdown id: tags:
There are also other operations such as:
%% Cell type:code id: tags:
```
d.append(40)
print(d)
d.extend([50, 60])
print(d)
d = d + [70, 80]
print(d)
d.remove(20)
print(d)
d.pop(0)
print(d)
```
%% Cell type:markdown id: tags:
> Note that `d.append([50,60])` would run but instead of adding two extra elements it only adds a single element, where this element is a list of length two, making a messy list. Try it and see if this is not clear.
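
The difference between the two methods is sketched here with two fresh lists (`d1` and `d2` are illustrative names, so that `d` above is left alone):

```python
d1 = [10, 20, 30]
d1.append([50, 60])     # adds ONE element, which is itself a list
print(d1, len(d1))
d2 = [10, 20, 30]
d2.extend([50, 60])     # adds each element of the argument separately
print(d2, len(d2))
```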
<a class="anchor" id="Looping"></a>
### Looping over elements in a list (or tuple)
%% Cell type:code id: tags:
```
d = [10, 20, 30]
for x in d:
    print(x)
```
%% Cell type:markdown id: tags:
> Note that the indentation within the loop is _*crucial*_. All python control blocks are delineated purely by indentation. We recommend using **four spaces** and no tabs, as this is a standard practice and will help a lot when collaborating with others.
<a class="anchor" id="Getting-help"></a>
### Getting help
The function `help()` can be used to get information about any variable/object/function in python. It lists the possible operations. In `ipython` you can also just type `?<blah>` or `<blah>?` instead:
%% Cell type:code id: tags:
```
help(d)
```
%% Cell type:markdown id: tags:
There is also a `dir()` function that gives a basic listing of the operations:
%% Cell type:code id: tags:
```
dir(d)
```
%% Cell type:markdown id: tags:
> Note that google is often more helpful! At least, as long as you find pages
> relating to Python 3 - Python 2 is no longer supported, but there is still
> lots of information about it on the internet, so be careful!
---
<a class="anchor" id="Dictionaries"></a>
## Dictionaries
These store key-value pairs. For example:
%% Cell type:code id: tags:
```
e = {'a' : 10, 'b': 20}
print(len(e))
print(e.keys())
print(e.values())
print(e['a'])
```
%% Cell type:markdown id: tags:
The keys and values can take on almost any type, even dictionaries!
Python is nothing if not flexible. However, each key must be unique
and [hashable](https://docs.python.org/3.5/glossary.html#term-hashable).
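As a rough illustration of what hashable means in practice, the sketch below uses a tuple as a key (which works) and then a list (which does not). The dictionary name `lookup` is illustrative, and the `try`/`except` block is only there so that the cell runs to the end.

```python
lookup = {}
lookup[(1, 2)] = 'tuple key'     # a tuple of numbers is immutable, hence hashable
lookup['name'] = {'nested': 1}   # values can be anything, even dictionaries
try:
    lookup[[1, 2]] = 'list key'  # lists are mutable, hence not hashable
except TypeError as err:
    print('Cannot use a list as a key:', err)
print(lookup)
```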
<a class="anchor" id="Adding-to-a-dictionary"></a>
### Adding to a dictionary
This is very easy:
%% Cell type:code id: tags:
```
e['c'] = 555 # just like in Biobank! ;)
print(e)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="Removing-elements-dictionary"></a>
### Removing elements from a dictionary
There are two main approaches - `pop` and `del`:
%% Cell type:code id: tags:
```
e.pop('b')
print(e)
del e['c']
print(e)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="Looping-dictionary"></a>
### Looping over everything in a dictionary
Several variables can jointly work as loop variables in python, which is very convenient. For example:
%% Cell type:code id: tags:
```
e = {'a' : 10, 'b': 20, 'c':555}
for k, v in e.items():
    print((k, v))
```
%% Cell type:markdown id: tags:
The print statement here constructs a tuple, which is often used in python.
Another option is:
%% Cell type:code id: tags:
```
for k in e:
    print((k, e[k]))
```
%% Cell type:markdown id: tags:
> In older versions of Python 3, there was no guarantee of ordering when using dictionaries.
> However, as of Python 3.7, dictionaries will remember the order in which items are inserted,
> and the `keys()`, `values()`, and `items()` methods will return elements in that order.
>
> If you want a dictionary with ordering, *and* you want your code to work with
> Python versions older than 3.7, you can use the
> [`OrderedDict`](https://docs.python.org/3/library/collections.html#collections.OrderedDict)
> class.
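
A brief sketch of insertion ordering, using keys that are deliberately not in alphabetical order (the names `od` and `plain` are illustrative):

```python
from collections import OrderedDict

od = OrderedDict()
od['z'] = 1
od['a'] = 2
od['m'] = 3
print(list(od.keys()))      # insertion order, not alphabetical order
plain = {'z': 1, 'a': 2, 'm': 3}
print(list(plain.keys()))   # the same order on Python 3.7 or later
```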
---
<a class="anchor" id="Copying-and-references"></a>
## Copying and references
In python there are immutable types (e.g. numbers) and mutable types (e.g. lists). The main thing to know is that assignment can sometimes create separate copies and sometimes create references (as in C++). In general, the more complicated types are assigned via references. For example:
%% Cell type:code id: tags:
```
a = 7
b = a
a = 2348
print(b)
```
%% Cell type:markdown id: tags:
As opposed to:
%% Cell type:code id: tags:
```
a = [7]
b = a
a[0] = 8888
print(b)
```
%% Cell type:markdown id: tags:
But if an operation is performed then a copy might be made:
%% Cell type:code id: tags:
```
a = [7]
b = a * 2
a[0] = 8888
print(b)
```
%% Cell type:markdown id: tags:
If an explicit copy is necessary then this can be made using the `list()` constructor:
%% Cell type:code id: tags:
```
a = [7]
b = list(a)
a[0] = 8888
print(b)
```
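One caveat, sketched here under the assumption that you have nested lists: `list()` only copies the outer list (a "shallow" copy), so inner lists are still shared. The standard `copy` module provides `deepcopy` for a fully independent copy. The variable names are illustrative.

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = list(nested)          # new outer list, but the inner lists are shared
deep = copy.deepcopy(nested)    # recursively copies the inner lists as well
nested[0][0] = 8888
print(shallow)                  # sees the change through the shared inner list
print(deep)                     # unaffected
```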
%% Cell type:markdown id: tags:
There is a constructor for each type and this can be useful for converting between types:
%% Cell type:code id: tags:
```
xt = (2, 5, 7)
xl = list(xt)
print(xt)
print(xl)
```
%% Cell type:markdown id: tags:
> _*Pitfall:*_
>
> When writing functions you need to be particularly careful about references and copies.
%% Cell type:code id: tags:
```
def foo1(x):
    x.append(10)
    print('x: ', x)
def foo2(x):
    x = x + [10]
    print('x: ', x)
def foo3(x):
    print('return value: ', x + [10])
    return x + [10]
a = [5]
print('a: ', a)
foo1(a)
print('a: ', a)
foo2(a)
print('a: ', a)
b = foo3(a)
print('a: ', a)
print('b: ', b)
```
%% Cell type:markdown id: tags:
> Note that we have defined some functions here - and the syntax
> should be relatively intuitive. See <a href="#functions">below</a>
> for a bit more detail on function definitions.
---
<a class="anchor" id="Control-flow"></a>
## Control flow
<a class="anchor" id="Boolean-operators"></a>
### Boolean operators
There is a boolean type in python that can be `True` or `False` (note the
capitals). Other values can also be used for True or False (e.g., `1` for
`True`; `0` or `None` or `[]` or `{}` or `""` for `False`) although they are
not considered 'equal' in the sense that the operator `==` would consider them
the same.
Relevant boolean and comparison operators include: `not`, `and`, `or`, `==` and `!=`.
For example:
%% Cell type:code id: tags:
```
a = True
print('Not a is:', not a)
print('Not 1 is:', not 1)
print('Not 0 is:', not 0)
print('Not {} is:', not {})
print('{}==0 is:', {}==0)
```
%% Cell type:markdown id: tags:
There is also the `in` test for strings, lists, etc:
%% Cell type:code id: tags:
```
print('the' in 'a number of words')
print('of' in 'a number of words')
print(3 in [1, 2, 3, 4])
```
%% Cell type:markdown id: tags:
A useful keyword is `None`, which is a bit like "null".
This can be a default value for a variable and should be tested with the `is` operator rather than `==` (for technical reasons that it isn't worth going into here). For example: `a is None` or `a is not None` are the preferred tests.
Do not use the `is` instead of the `==` operator for any other comparisons (unless you know what you are doing).
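The difference between the two operators can be seen in this small sketch: `==` compares contents, whereas `is` checks whether two names refer to the very same object.

```python
x = [1, 2, 3]
y = [1, 2, 3]
z = x
print(x == y)   # True: same contents
print(x is y)   # False: two separate list objects
print(x is z)   # True: z is another name for the same object
value = None
print(value is None)
```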
<a class="anchor" id="If-statements"></a>
### If statements
The basic syntax of `if` statements is fairly standard, though don't forget that you _*must*_ indent the statements within the conditional/loop block as this is the way of delineating blocks of code in python. For example:
%% Cell type:code id: tags:
```
import random
a = random.uniform(-1, 1)
print(a)
if a>0:
    print('Positive')
elif a<0:
    print('Negative')
else:
    print('Zero')
```
%% Cell type:markdown id: tags:
Or more generally:
%% Cell type:code id: tags:
```
a = [] # just one of many examples
if not a:
    print('Variable is false, or at least empty')
```
%% Cell type:markdown id: tags:
This can be useful for functions where a variety of possible input types are being dealt with.
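For instance, a function might treat all "empty" inputs in the same way. The function `describe` below is a hypothetical helper written purely for illustration:

```python
def describe(value=None):
    # hypothetical helper: None, 0, '' and empty containers all count as false
    if not value:
        return 'empty or false'
    return 'non-empty or true'

for candidate in [None, 0, '', [], {}, 'text', [0], 3.5]:
    print(repr(candidate), '->', describe(candidate))
```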
---
<a class="anchor" id="For-loops"></a>
### For loops
The `for` loop works like in bash:
%% Cell type:code id: tags:
```
for x in [2, 'is', 'more', 'than', 1]:
    print(x)
```
%% Cell type:markdown id: tags:
where a list or any other sequence (e.g. tuple) can be used.
If you want a numerical range then use:
%% Cell type:code id: tags:
```
for x in range(2, 9):
    print(x)
```
%% Cell type:markdown id: tags:
Note that, like slicing, the maximum value is one less than the value specified. Also, `range` actually returns an object that can be iterated over but is not just a list of numbers. If you want a list of numbers then `list(range(2, 9))` will give you this.
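A few variations on `range`, converted to lists so that the values can be seen (the variable names are illustrative):

```python
basic = list(range(2, 9))           # 2 up to (but not including) 9
from_zero = list(range(5))          # the start defaults to 0
stepped = list(range(2, 9, 3))      # an optional third argument is the step
backwards = list(range(9, 2, -2))   # negative steps count down
print(basic)
print(from_zero)
print(stepped)
print(backwards)
```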
A very nice feature of python is that multiple variables can be assigned from a tuple or list:
%% Cell type:code id: tags:
```
x, y = [4, 7]
print(x)
print(y)
```
%% Cell type:markdown id: tags:
and this can be combined with a function called `zip` to make very convenient dual variable loops:
%% Cell type:code id: tags:
```
alist = ['Some', 'set', 'of', 'items']
blist = list(range(len(alist)))
print(list(zip(alist, blist)))
for x, y in zip(alist, blist):
    print(y, x)
```
%% Cell type:markdown id: tags:
This type of loop can be used with any two lists (or similar) to iterate over them jointly.
<a class="anchor" id="While-loops"></a>
### While loops
The syntax for this is pretty standard:
%% Cell type:code id: tags:
```
import random
n = 0
x = 0
while n<100:
    x += random.uniform(0, 1)**2   # where ** is a power operation
    if x>50:
        break
    n += 1
    print(x)
```
%% Cell type:markdown id: tags:
You can also use `continue` as in other languages.
> Note that there is no `do ... while` construct.
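
A common way to get the effect of `do ... while` is sketched below: loop forever with `while True` and put the exit test at the end of the body, so that the body always runs at least once. The names `attempts` and `value` are illustrative.

```python
import random

attempts = 0
while True:                     # the body always runs at least once
    attempts += 1
    value = random.uniform(0, 1)
    if value > 0.5:             # the exit test comes at the end of the body
        break
print(attempts, value)
```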
---
<a class="anchor" id="quick-intro"></a>
### A quick intro to conditional expressions and list comprehensions
These are more advanced bits of python but are really useful and common, so worth having a little familiarity with at this stage.
<a class="anchor" id="Conditional-expressions"></a>
#### Conditional expressions
A general expression that can be used in python is: A `if` condition `else` B
For example:
%% Cell type:code id: tags:
```
import random
x = random.uniform(0, 1)
y = x**2 if x<0.5 else (1 - x)**2
print(x, y)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="List-comprehensions"></a>
#### List comprehensions
This is a shorthand syntax for building a list like a for loop but doing it in one line, and is very popular in python. It is quite similar to mathematical set notation. For example:
%% Cell type:code id: tags:
```
v1 = [ x**2 for x in range(10) ]
print(v1)
v2 = [ x**2 for x in range(10) if x!=7 ]
print(v2)
```
%% Cell type:markdown id: tags:
You'll find that python programmers use this kind of construction _*a lot*_.
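To make the correspondence explicit, here is a sketch of the second comprehension above written out as an ordinary `for` loop (`v2_loop` is an illustrative name); both give the same list.

```python
v2_loop = []
for x in range(10):
    if x != 7:
        v2_loop.append(x**2)
v2 = [x**2 for x in range(10) if x != 7]
print(v2_loop == v2)
```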
---
<a class="anchor" id="functions"></a>
## Functions
You will find functions pretty familiar in python to start with,
although they have a few options which are really handy and different
from C++ or matlab (to be covered in a later practical). To start
with we'll look at a simple function but note a few key points:
* you _must_ indent everything inside the function (it is a code
block and indentation is the only way of determining this - just
like for the guts of a loop)
* you can return _whatever you want_ from a python function, but only
a single object - it is usual to package up multiple things in a
tuple or list, which is easily unpacked by the calling invocation:
e.g., `a, b, c = myfunc(x)`
* parameters are passed by reference (see section on <a
href="#Copying-and-references">copying and references</a>)
%% Cell type:code id: tags:
```
def myfunc(x, y, z=0):
    r2 = x*x + y*y + z*z
    r = r2**0.5
    return r, r2
rad = myfunc(10, 20)
print(rad)
rad, dummy = myfunc(10, 20, 30)
print(rad)
rad, _ = myfunc(10,20,30)
print(rad)
```
%% Cell type:markdown id: tags:
> Note that the `_` is used as shorthand here for a dummy variable
> that you want to throw away.
>
> The return statement implicitly creates a tuple to return and is equivalent to `return (r, r2)`
One nice feature of python functions is that you can name the
arguments when you call them, rather than only doing it by position.
For example:
%% Cell type:code id: tags:
```
def myfunc(x, y, z=0, flag=''):
    if flag=='L1':
        r = abs(x) + abs(y) + abs(z)
    else:
        r = (x*x + y*y + z*z)**0.5
    return r
rA = myfunc(10, 20)
rB = myfunc(10, 20, flag='L1')
rC = myfunc(10, 20, flag='L1', z=30)
print(rA, rB, rC)
```
%% Cell type:markdown id: tags:
You will often see python functions called with these named arguments. In fact, for functions with more than 2 or 3 variables this naming of arguments is recommended, because it clarifies what each of the arguments does for anyone reading the code.
---
<a class="anchor" id="exercise"></a>
## Exercise
Let's say you are given a single string with comma separated elements
that represent filenames and ID codes: e.g., `/vols/Data/pytreat/AAC, 165873, /vols/Data/pytreat/AAG, 170285, ...`
Write some code to do the following:
* separate out the filenames and ID codes into separate lists (IDs
should be numerical values, not strings) - you may need several steps for this
* loop over the two and generate a _string_ that could be used to
rename the directories (e.g., `mv /vols/Data/pytreat/AAC /vols/Data/pytreat/S165873`) - we will cover how to actually execute these in a later practical
* convert your dual lists into a dictionary, with ID as the key
* write a small function to determine if an ID is present in this
set or not, and also return the filename if it is
* write a for loop to create a list of all the odd-numbered IDs (you can use the `%` operator for modulus - i.e., `5 % 2` is 1)
* re-write the for loop as a list comprehension
* now generate a list of the filenames corresponding to these odd-numbered IDs
%% Cell type:code id: tags:
```
mstr = '/vols/Data/pytreat/AAC, 165873, /vols/Data/pytreat/AAG, 170285, /vols/Data/pytreat/AAH, 196792, /vols/Data/pytreat/AAK, 212577, /vols/Data/pytreat/AAQ, 385376, /vols/Data/pytreat/AB, 444600, /vols/Data/pytreat/AC6, 454578, /vols/Data/pytreat/V8, 501502, /vols/Data/pytreat/2YK, 667688, /vols/Data/pytreat/C3PO, 821971'
```
%% Cell type:markdown id: tags:
# Text input/output
In this section we will explore how to write and/or retrieve our data from
text files.
Most of the functionality for reading/writing files and manipulating strings
is available without any imports. However, you can find some additional
functionality in the
[`string`](https://docs.python.org/3/library/string.html) module.
Most of the string functions are available as methods on string objects. This
means that you can use the ipython autocomplete to check for them.
%% Cell type:code id: tags:
```
empty_string = ''
```
%% Cell type:code id: tags:
```
# after running the code block above,
# put your cursor after the dot and
# press tab to get a list of methods
empty_string.
```
%% Cell type:markdown id: tags:
* [Reading/writing files](#reading-writing-files)
* [Creating new strings](#creating-new-strings)
* [String syntax](#string-syntax)
* [Unicode versus bytes](#unicode-versus-bytes)
* [Converting objects into strings](#converting-objects-into-strings)
* [Combining strings](#combining-strings)
* [String formatting](#string-formatting)
* [Extracting information from strings](#extracting-information-from-strings)
* [Splitting strings](#splitting-strings)
* [Converting strings to numbers](#converting-strings-to-numbers)
* [Regular expressions](#regular-expressions)
* [Exercises](#exercises)
<a class="anchor" id="reading-writing-files"></a>
## Reading/writing files
The syntax to open a file in python is `with open(<filename>, <mode>) as
<file_object>: <block of code>`, where
* `filename` is a string with the name of the file
* `mode` is one of `'r'` (for read-only access), `'w'` (for writing a file,
this wipes out any existing content), `'a'` (for appending to an existing
file). A `'b'` can be added to any of these to open the file in "byte"-mode,
which prevents python from interpreting non-text (e.g., NIFTI) files as text.
* `file_object` is a variable name which will be used within the `block of
code` to access the opened file.
For example the following will read all the text in `README.md` and print it.
%% Cell type:code id: tags:
```
with open('README.md', 'r') as readme_file:
    print(readme_file.read())
```
%% Cell type:markdown id: tags:
> The `with` statement is an advanced python feature, however you will
> probably only encounter it when opening files. In that context it merely
> ensures that the file will be properly closed as soon as the program leaves
> the `with` statement (even if an error is raised within the `with`
> statement).
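
To make this concrete, the sketch below writes a scratch file (a hypothetical temporary file created only for this illustration) and then reads it back without `with`, using `try`/`finally` to do roughly what the `with` statement does for us.

```python
import os
import tempfile

# hypothetical scratch file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as f:
    f.write('line one\n')
print(f.closed)            # True: closed automatically on leaving the block

# roughly what the with statement does behind the scenes
g = open(path, 'r')
try:
    text = g.read()
finally:
    g.close()              # runs even if read() raised an error
print(text, g.closed)
```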
You could also use the `readlines()` method to get a list of all the lines, or
simply "loop over" the file object to get the lines one by one:
%% Cell type:code id: tags:
```
with open('README.md', 'r') as readme_file:
    print('First five lines...')
    for i, line in enumerate(readme_file):
        # each line is returned with its
        # newline character still intact,
        # so we use rstrip() to remove it.
        print(f'{i}: {line.rstrip()}')
        if i == 4:
            break
```
%% Cell type:markdown id: tags:
> `enumerate` takes any sequence (or other iterable) and yields 2-element tuples containing the index and the corresponding item
A very similar syntax is used to write files:
%% Cell type:code id: tags:
```
with open('02_text_io/my_file', 'w') as my_file:
    my_file.write('This is my first line\n')
    my_file.writelines(['Second line\n', 'and the third\n'])
```
%% Cell type:markdown id: tags:
Note that new line characters do not get added automatically. We can investigate
the resulting file using
%% Cell type:code id: tags:
```
!cat 02_text_io/my_file
```
%% Cell type:markdown id: tags:
> In Jupyter notebook, (and in `ipython`/`fslipython`), any lines starting
> with `!` will be interpreted as shell commands. It is great when playing
> around in a Jupyter notebook or in the `ipython` terminal, however it is an
> ipython-only feature and hence is not available when writing python
> scripts. How to call shell commands from python will be discussed in the
> `scripts` practical.
If we want to add to the existing file we can open it in the append mode:
%% Cell type:code id: tags:
```
with open('02_text_io/my_file', 'a') as my_file:
    my_file.write('More lines is always better\n')
!cat 02_text_io/my_file
```
%% Cell type:markdown id: tags:
Below we will discuss how we can convert python objects to strings to store in
these files and how to extract those python objects from strings again.
<a class="anchor" id="creating-new-strings"></a>
## Creating new strings
<a class="anchor" id="string-syntax"></a>
### String syntax
Single-line strings can be created in python using either single or double
quotes:
%% Cell type:code id: tags:
```
a_string = 'To be or not to be'
same_string = "To be or not to be"
print(a_string == same_string)
```
%% Cell type:markdown id: tags:
The main rationale for choosing between single or double quotes, is whether
the string itself will contain any quotes. You can include a single quote in a
string surrounded by single quotes by escaping it with the `\` character,
however in such a case it would be more convenient to use double quotes:
%% Cell type:code id: tags:
```
a_string = "That's the question"
same_string = 'That\'s the question'
print(a_string == same_string)
```
%% Cell type:markdown id: tags:
New-lines (`\n`), tabs (`\t`) and many other special characters are supported:
%% Cell type:code id: tags:
```
a_string = "This is the first line.\nAnd here is the second.\n\tThe third starts with a tab."
print(a_string)
```
%% Cell type:markdown id: tags:
You can even include unicode characters:
%% Cell type:code id: tags:
```
a_string = "Python = 🐍"
print(a_string)
```
%% Cell type:markdown id: tags:
However, the easiest way to create multi-line strings is to use a triple quote (again single or double quotes can be used). Triple quotes allow your string to span multiple lines:
%% Cell type:code id: tags:
```
multi_line_string = """This is the first line.
And here is the second.
\tThird line starts with a tab."""
print(multi_line_string)
```
%% Cell type:markdown id: tags:
If you don't want python to reinterpret the `\n`, `\t`, etc. in your strings, you can prepend the quotes enclosing the string with an `r`. This will lead to python interpreting the following string as raw text.
%% Cell type:code id: tags:
```
single_line_string = r"This string is not multiline.\nEven though it contains the \n character"
print(single_line_string)
```
%% Cell type:markdown id: tags:
One pitfall when creating a list of strings is that python automatically concatenates string literals that are separated only by white space:
%% Cell type:code id: tags:
```
my_list_of_strings = ['a', 'b', 'c' 'd', 'e']
print("The 'c' and 'd' got concatenated, because we forgot the comma:", my_list_of_strings)
```
%% Cell type:markdown id: tags:
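> Python does not warn about this, so a missing comma between strings can easily go unnoticed (the `SyntaxWarning` added in python 3.8 only covers a missing comma before a tuple or list, e.g. `[(1, 2) (3, 4)]`).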
<a class="anchor" id="unicode-versus-bytes"></a>
#### unicode versus bytes
> **Note**: You can safely skip this section if you do not have any plans to
> work with binary files or non-English text in Python, and you do not want
> to know how to insert poop emojis into your code.
To encourage the spread of python around the world, python 3 switched to using
unicode as the default for strings and code (which is one of the main reasons
for the incompatibility between python 2 and 3). This means that each element
in a string is a unicode character (using [UTF-8
encoding](https://docs.python.org/3/howto/unicode.html)), which can consist of
one or more bytes. The advantage is that any unicode characters can now be
used in strings or in the code itself:
%% Cell type:code id: tags:
```
Δ = "café"
print(Δ)
```
%% Cell type:markdown id: tags:
In python 2 each element in a string was a single byte rather than a
potentially multi-byte character. You can convert between a unicode string
and a byte array using:
* `encode()` called on a string converts it into a bytes array (`bytes` object)
* `decode()` called on a `bytes` array converts it into a unicode string.
%% Cell type:code id: tags:
```
delta = "Δ"
print('The character', delta, 'consists of the following 2 bytes', delta.encode())
```
%% Cell type:markdown id: tags:
These byte arrays can be created directly by prepending the quotes enclosing
the string with a `b`, which tells python 3 to interpret the following as a
byte array:
%% Cell type:code id: tags:
```
a_byte_array = b'\xce\xa9'
print('The two bytes ', a_byte_array, ' become single unicode character (', a_byte_array.decode(), ') with UTF-8 encoding')
```
%% Cell type:markdown id: tags:
Especially in code dealing with strings (e.g., reading/writing of files), many
of the errors that arise when running python 2 code in python 3 come from
mixing unicode strings with byte arrays. Decoding and/or encoding some of
these objects can often fix these issues.
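As a minimal sketch of the kind of error described above (the variable names and values are made up for illustration), combining a `bytes` object with a string raises a `TypeError`, which goes away once both sides have the same type:

```python
# Illustrative values only; 'header' stands in for data read from a binary file
header = b'subject: '
name = 'café'

try:
    combined = header + name  # mixing bytes and str is not allowed
except TypeError as error:
    print('Mixing bytes and str failed:', error)

# Decode the bytes (UTF-8 is the default codec) to get a unicode string
combined = header.decode() + name
print(combined)

# Or encode the string if the result should be a byte array
combined_bytes = header + name.encode()
print(combined_bytes)
```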
By default any file opened in python will be interpreted as unicode. If you
want to treat a file as raw bytes, you have to include a 'b' in the `mode`
when calling the `open()` function:
%% Cell type:code id: tags:
```
import os.path as op
with open(op.expandvars('${FSLDIR}/data/standard/MNI152_T1_1mm.nii.gz'), 'rb') as gzipped_nifti:
    print('First few bytes of gzipped NIFTI file:', gzipped_nifti.read(10))
```
%% Cell type:markdown id: tags:
> We use the `expandvars()` function here to insert the FSLDIR environmental
> variable into our string. This function will be presented in the file
> management practical.
<a class="anchor" id="converting-objects-into-strings"></a>
### Converting objects into strings
There are two functions to convert python objects into strings, `repr()` and
`str()`. All other functions that rely on string-representations of python
objects will use one of these two (for example the `print()` function will
call `str()` on the object).
The goal of the `str()` function is to be readable, while the goal of `repr()`
is to be unambiguous. Compare
%% Cell type:code id: tags:
```
print(str("3"))
print(str(3))
```
%% Cell type:markdown id: tags:
with
%% Cell type:code id: tags:
```
print(repr("3"))
print(repr(3))
```
%% Cell type:markdown id: tags:
In both cases you get the value of the object (3), but only the `repr` returns enough information to actually know the type of the object.
Perhaps the difference is clearer with a more advanced object.
The `datetime` module contains various classes and functions to work with dates (there is also a `time` module).
Here we will look at the alternative string representations of the `datetime` object itself:
%% Cell type:code id: tags:
```
from datetime import datetime
print('str(): ', str(datetime.now()))
print('repr(): ', repr(datetime.now()))
```
%% Cell type:markdown id: tags:
Note that the result from `str()` is human-readable as a date, while the result from `repr()` is more useful if you wanted to recreate the `datetime` object.
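To illustrate what "recreating" the object could look like, here is a small sketch using a fixed, made-up date. It assumes the `datetime` module itself is imported (rather than only the class), because that is the name the `repr()` string refers to. Note that `eval()` executes arbitrary code, so it should only ever be given strings you trust.

```python
import datetime

# A fixed date, so that the result is reproducible
original = datetime.datetime(2020, 1, 2, 3, 4, 5)

text = repr(original)
print(text)

# eval() runs the string as python code, which rebuilds the object
recreated = eval(text)
print(recreated == original)
```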
<a class="anchor" id="combining-strings"></a>
### Combining strings
The simplest way to concatenate strings is to simply add them together:
%% Cell type:code id: tags:
```
a_string = "Part 1"
other_string = "Part 2"
full_string = a_string + ", " + other_string
print(full_string)
```
%% Cell type:markdown id: tags:
Given a whole sequence of strings, you can concatenate them together using the `join()` method:
%% Cell type:code id: tags:
```
list_of_strings = ['first', 'second', 'third', 'fourth']
full_string = ', '.join(list_of_strings)
print(full_string)
```
%% Cell type:markdown id: tags:
Note that the string on which the `join()` method is called (`', '` in this case) is used as a delimiter to separate the different strings. If you just want to concatenate the strings you can call `join()` on the empty string:
%% Cell type:code id: tags:
```
list_of_strings = ['first', 'second', 'third', 'fourth']
full_string = ''.join(list_of_strings)
print(full_string)
```
%% Cell type:markdown id: tags:
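One thing to keep in mind is that `join()` only accepts strings, so any other objects have to be converted first. A short sketch, with a list of numbers made up for illustration:

```python
numbers = [1, 2, 3.5]

try:
    ', '.join(numbers)  # the elements are not strings
except TypeError as error:
    print('join() failed:', error)

# Convert every element with str() before joining
joined = ', '.join(str(number) for number in numbers)
print(joined)
```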
<a class="anchor" id="string-formatting"></a>
### String formatting
Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template string with some placeholders, into which variables are later inserted. Python currently has 4 built-in ways of doing this (with many packages providing similar capabilities):
* [formatted string literals](https://docs.python.org/3/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)
* [new-style formatting](https://docs.python.org/3/library/string.html#format-string-syntax).
* printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)
* bash-like [template-strings](https://docs.python.org/3/library/string.html#template-strings)
Here we provide a single example using the first three methods, so you can recognize them in the future.
First the old printf style. Note that this style is invoked by using the modulo (`%`) operator on the string. Every placeholder (starting with the `%`) is then replaced by one of the values provided.
%% Cell type:code id: tags:
```
a = 3
b = 1 / 3
print('%.3f = %i + %.3f' % (a + b, a, b))
print('%(total).3f = %(a)i + %(b).3f' % {'a': a, 'b': b, 'total': a + b})
```
%% Cell type:markdown id: tags:
Then the recommended new style formatting (You can find a nice tutorial [here](https://www.digitalocean.com/community/tutorials/how-to-use-string-formatters-in-python-3)). Note that this style is invoked by calling the `format()` method on the string and the placeholders are marked by the curly braces `{}`.
%% Cell type:code id: tags:
```
a = 3
b = 1 / 3
print('{:.3f} = {} + {:.3f}'.format(a + b, a, b))
print('{total:.3f} = {a} + {b:.3f}'.format(a=a, b=b, total=a+b))
```
%% Cell type:markdown id: tags:
Note that the `:` delimiter separates the variable identifiers on the left from the formatting rules on the right.
Finally the new, fancy formatted string literals (only available in python 3.6+).
This new format is very similar to the recommended style, except that all placeholders are automatically evaluated in the local environment at the time the template is defined.
This means that we do not have to explicitly provide the parameters (and we can evaluate the sum inside the string!), although it does mean we also can not re-use the template.
%% Cell type:code id: tags:
```
a = 3
b = 1/3
print(f'{a + b:.3f} = {a} + {b:.3f}')
```
%% Cell type:markdown id: tags:
These f-strings are extremely useful when creating print or error messages for debugging,
especially with the new support for self-documenting in python 3.8 (see
[here](https://docs.python.org/3/whatsnew/3.8.html#f-strings-support-for-self-documenting-expressions-and-debugging)):
%% Cell type:code id: tags:
```
a = 3
b = 1/3
print(f'{a + b=}')
```
%% Cell type:markdown id: tags:
Note that this prints both the expression `a + b` and the output (this block will raise an error for python <= 3.7).
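For completeness, the fourth method (template strings) could be used roughly as follows. This is only a sketch: `Template` does not support format specifiers, so the assumption here is that the numbers are formatted before they are substituted.

```python
from string import Template

a = 3
b = 1 / 3

# Placeholders are marked with $name or ${name}, much like shell variables
template = Template('$total = $a + ${b}')

# The values are formatted beforehand, as Template has no format rules
result = template.substitute(total=f'{a + b:.3f}', a=a, b=f'{b:.3f}')
print(result)
```

Unlike an f-string, the `template` object can be re-used with different values.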
<a class="anchor" id="extracting-information-from-strings"></a>
## Extracting information from strings
The techniques shown in this section are useful if you are loading data from a
small text file or user input, or parsing a small amount of output from
e.g. `fslstats`. However, if you are working with large structured text data
(e.g. a big `csv` file), you should use the I/O capabilities of `numpy` or
`pandas` instead of doing things manually - this is covered in separate
practicals.
<a class="anchor" id="splitting-strings"></a>
### Splitting strings
The simplest way to extract a sub-string is to use slicing (see previous practical for more details):
%% Cell type:code id: tags:
```
a_string = 'abcdefghijklmnopqrstuvwxyz'
print(a_string[10]) # create a string containing only the 11th character
print(a_string[20:]) # create a string containing the 21st character onward
print(a_string[::-1]) # creating the reverse string
```
%% Cell type:markdown id: tags:
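If you are not sure where to cut into a string, you can use the `find()` method to find the first occurrence of a sub-string. Strings have no `findall()` method; to find all occurrences you can use `re.findall()` from the [re](https://docs.python.org/3/library/re.html) module.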
%% Cell type:code id: tags:
```
a_string = 'abcdefghijklmnopqrstuvwxyz'
index = a_string.find('fgh')
print(a_string[:index]) # extracts the sub-string up to the first occurrence of 'fgh'
print('index for non-existent sub-string', a_string.find('cats')) # note that find returns -1 when it can not find the sub-string rather than raising an error.
```
%% Cell type:markdown id: tags:
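If you need the position of every occurrence of a sub-string, one possible approach (sketched here with a made-up example string) is to call `find()` repeatedly with its optional start argument. For non-overlapping matches, `re.finditer()` combined with `re.escape()` gives the same positions:

```python
import re

text = 'abcabcabc'
sub = 'bc'

# Repeatedly search from just after the previous match
indices = []
start = text.find(sub)
while start != -1:
    indices.append(start)
    start = text.find(sub, start + 1)
print(indices)

# The same with a regular expression; re.escape() protects special characters
regex_indices = [match.start() for match in re.finditer(re.escape(sub), text)]
print(regex_indices)
```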
You can automate this process of splitting a string at a sub-string using the `split()` method. By default it will split a string at the white space.
%% Cell type:code id: tags:
```
print('The split() method\trecognizes a wide variety\nof white space'.split())
```
%% Cell type:markdown id: tags:
To separate a comma separated list we will need to supply the delimiter to the `split()` method. We can then use the `strip()` method to remove any whitespace at the beginning or end of the string:
%% Cell type:code id: tags:
```
scientific_packages_string = "numpy, scipy, pandas, matplotlib, nibabel"
list_with_whitespace = scientific_packages_string.split(',')
print(list_with_whitespace)
list_without_whitespace = [individual_string.strip() for individual_string in list_with_whitespace]
print(list_without_whitespace)
```
%% Cell type:markdown id: tags:
> We use the syntax `[<expr> for <element> in <sequence>]` here which applies the `expr` to each `element` in the `sequence` and returns the resulting list. This is a list comprehension - a convenient form in python to create a new list from the old one.
<a class="anchor" id="converting-strings-to-numbers"></a>
### Converting strings to numbers
Once you have extracted a number from a string, you can convert it into an
actual integer or float by calling respectively `int()` or `float()` on
it. `float()` understands a wide variety of different ways to write numbers:
%% Cell type:code id: tags:
```
print(int("3"))
print(float("3"))
print(float("3.213"))
print(float("3.213e5"))
print(float("3.213E-25"))
```
%% Cell type:markdown id: tags:
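Both `int()` and `float()` raise a `ValueError` when the string can not be interpreted as a number. The sketch below shows one way of dealing with this; `to_number` is a hypothetical helper written for this example, not a built-in function.

```python
def to_number(text):
    """Hypothetical helper: return an int if possible, else a float, else None."""
    for converter in (int, float):
        try:
            return converter(text)
        except ValueError:
            # this converter could not parse the text, so try the next one
            continue
    return None

print(to_number('3'))
print(to_number('3.213e5'))
print(to_number('three'))
```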
<a class="anchor" id="regular-expressions"></a>
### Regular expressions
Regular expressions are used for looking for specific patterns in a longer string. This can be used to extract specific information from a well-formatted string or to modify a string. In python regular expressions are available in the [re](https://docs.python.org/3/library/re.html#re-syntax) module.
A full discussion of regular expression goes far beyond this practical. If you are interested, have a look [here](https://docs.python.org/3/howto/regex.html).
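To give a flavour of what this looks like in practice, here is a minimal sketch that pulls the numbers out of a line of text and then masks them. The text and the pattern are made up for this example, and the pattern deliberately ignores signs and exponents.

```python
import re

# Made-up text in the style of a line of program output
text = 'Mean intensity: 103.5 in 2048 voxels'

# One or more digits, optionally followed by a dot and more digits
pattern = r'\d+(?:\.\d+)?'

# re.findall() returns every matching sub-string
numbers = [float(match) for match in re.findall(pattern, text)]
print(numbers)

# re.sub() replaces every match, here masking the numbers
masked = re.sub(pattern, '#', text)
print(masked)
```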
<a class="anchor" id="exercises"></a>
## Exercises
### Joining/splitting strings
The file 02_text_io/input.txt contains integers in a 2-column format (separated by spaces). Read in this file and write it back out as 2 rows, with the values in each row separated by commas.
%% Cell type:code id: tags:
```
input_filename = '02_text_io/input.txt'
output_filename = '02_text_io/output.txt'
with open(input_filename, 'r') as input_file:
    ...
with open(output_filename, 'w') as output_file:
    ...
```
%% Cell type:markdown id: tags:
### String formatting and regular expressions
Given a template for MRI files:
`s<subject_id>/<modality>_<res>mm.nii.gz`
where `<subject_id>` is a 6-digit subject-id, `<modality>` is one of T1w, T2w, or PD, and `<res>` is the resolution of the image (with up to one digit after the decimal point, e.g. 1.5).
Write a function that takes the subject_id (as an integer), the modality (as a string), and the resolution (as a float) and returns the complete filename (Hint: use one of the formatting techniques mentioned in [String formatting](#string-formatting)).
%% Cell type:code id: tags:
```
def get_filename(subject_id, modality, resolution):
    ...
```
%% Cell type:markdown id: tags:
For a more difficult exercise, write a function that extracts the subject id, modality, and resolution from a filename (using a regular expression or by using `find` and `split` to access relevant parts of the string).
%% Cell type:code id: tags:
```
def get_parameters(filename):
    ...
    return subject_id, modality, resolution
```
......
......@@ -74,7 +74,7 @@ with open('README.md', 'r') as readme_file:
         # each line is returned with its
         # newline character still intact,
         # so we use rstrip() to remove it.
-        print(f'{i}: {line.rstrip()}'))
+        print(f'{i}: {line.rstrip()}')
         if i == 4:
             break
```
......@@ -418,7 +418,7 @@ The file 02_text_io/input.txt contains integers in a 2-column format (separated
```
 input_filename = '02_text_io/input.txt'
-out_filename = '02_text_io/output.txt'
+output_filename = '02_text_io/output.txt'
 with open(input_filename, 'r') as input_file:
     ...
......
%% Cell type:markdown id: tags:
# File management
In this section we will introduce you to file management - how do we find and
manage files, directories and paths in Python?
Most of Python's built-in functionality for managing files and paths is spread
across the following modules:
- [`os`](https://docs.python.org/3/library/os.html)
- [`shutil`](https://docs.python.org/3/library/shutil.html)
- [`os.path`](https://docs.python.org/3/library/os.path.html)
- [`glob`](https://docs.python.org/3/library/glob.html)
- [`fnmatch`](https://docs.python.org/3/library/fnmatch.html)
The `os` and `shutil` modules have functions allowing you to manage _files and
directories_. The `os.path`, `glob` and `fnmatch` modules have functions for
managing file and directory _paths_.
> Another standard library -
> [`pathlib`](https://docs.python.org/3/library/pathlib.html) - was added in
> Python 3.4, and provides an object-oriented interface to path management. We
> aren't going to cover `pathlib` here, but feel free to take a look at it if
> you are into that sort of thing.
## Contents
If you are impatient, feel free to dive straight in to the exercises, and use the
other sections as a reference. You might miss out on some neat tricks though.
* [Managing files and directories](#managing-files-and-directories)
* [Querying and changing the current directory](#querying-and-changing-the-current-directory)
* [Directory listings](#directory-listings)
* [Creating and removing directories](#creating-and-removing-directories)
* [Moving and removing files](#moving-and-removing-files)
* [Walking a directory tree](#walking-a-directory-tree)
* [Copying, moving, and removing directory trees](#copying-moving-and-removing-directory-trees)
* [Managing file paths](#managing-file-paths)
* [File and directory tests](#file-and-directory-tests)
* [Deconstructing paths](#deconstructing-paths)
* [Absolute and relative paths](#absolute-and-relative-paths)
* [Wildcard matching a.k.a. globbing](#wildcard-matching-aka-globbing)
* [Expanding the home directory and environment variables](#expanding-the-home-directory-and-environment-variables)
* [Cross-platform compatibility](#cross-platform-compatbility)
* [FileTree](#filetree)
* [Exercises](#exercises)
* [Re-name subject directories](#re-name-subject-directories)
* [Re-organise a data set](#re-organise-a-data-set)
* [Solutions](#solutions)
* [Appendix: Exceptions](#appendix-exceptions)
<a class="anchor" id="managing-files-and-directories"></a>
## Managing files and directories
The `os` module contains functions for querying and changing the current
working directory, moving and removing individual files, and for listing,
creating, removing, and traversing directories.
%% Cell type:code id: tags:
```
import os
import os.path as op
from pathlib import Path
```
%% Cell type:markdown id: tags:
> If you are using a library with a long name, you can create an alias for it
> using the `as` keyword, as we have done here for the `os.path` module.
<a class="anchor" id="querying-and-changing-the-current-directory"></a>
### Querying and changing the current directory
You can query and change the current directory with the `os.getcwd` and
`os.chdir` functions.
> Here we're also going to use the `expanduser` function from the `os.path`
> module, which allows us to expand the tilde character to the user's home
> directory. This is [covered in more detail
> below](#expanding-the-home-directory-and-environment-variables).
%% Cell type:code id: tags:
```
cwd = os.getcwd()
print(f'Current directory: {cwd}')
os.chdir(op.expanduser('~'))
print(f'Changed to: {os.getcwd()}')
os.chdir(cwd)
print(f'Changed back to: {cwd}')
```
%% Cell type:markdown id: tags:
For the rest of this practical, we're going to use a little data set that has
been pre-generated, and is located in a sub-directory called
`03_file_management`.
%% Cell type:code id: tags:
```
os.chdir('03_file_management')
```
%% Cell type:markdown id: tags:
<a class="anchor" id="directory-listings"></a>
### Directory listings
Use the `os.listdir` function to get a directory listing (a.k.a. the Unix `ls`
command):
%% Cell type:code id: tags:
```
cwd = os.getcwd()
listing = os.listdir(cwd)
print(f'Directory listing: {cwd}')
print('\n'.join(listing))
print()
datadir = 'raw_mri_data'
listing = os.listdir(datadir)
print(f'Directory listing: {datadir}')
print('\n'.join(listing))
```
%% Cell type:markdown id: tags:
> Check out the `os.scandir` function as an alternative to `os.listdir`, if
> you have performance problems on large data sets.
> In the code above, we used the string `join` method to print each path in
> our directory listing on a new line. If you have a list of strings, the
> `join` method is a handy way to insert a delimiting character or string
> (e.g. newline, space, tab, comma - any string you want), between each string
> in the list.
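As a rough sketch of how `os.scandir` could be used, the example below lists a throw-away directory (created with the `tempfile` module purely for this illustration). Each entry is an object which already knows its name and type, rather than a plain string.

```python
import os
import tempfile

# Work in a temporary directory so that the sketch does not touch real data
with tempfile.TemporaryDirectory() as tmpdir:
    os.mkdir(os.path.join(tmpdir, 'subdir'))
    with open(os.path.join(tmpdir, 'file.txt'), 'w') as f:
        f.write('some text')

    # Each entry can tell us whether it is a file or a directory,
    # often without an extra system call
    with os.scandir(tmpdir) as entries:
        kinds = {entry.name: ('dir' if entry.is_dir() else 'file')
                 for entry in entries}

print(kinds)
```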
<a class="anchor" id="creating-and-removing-directories"></a>
### Creating and removing directories
You can, not surprisingly, use the `os.mkdir` function to make a
directory. The `os.makedirs` function is also handy - it is equivalent to
`mkdir -p` in Unix:
%% Cell type:code id: tags:
```
print(os.listdir('.'))
os.mkdir('onedir')
os.makedirs('a/big/tree/of/directories')
print(os.listdir('.'))
```
%% Cell type:markdown id: tags:
The `os.rmdir` and `os.removedirs` functions perform the reverse
operations. The `os.removedirs` function will only remove empty directories,
and you must pass it the _leaf_ directory, just like `rmdir -p` in Unix:
%% Cell type:code id: tags:
```
os.rmdir('onedir')
os.removedirs('a/big/tree/of/directories')
print(os.listdir('.'))
```
%% Cell type:markdown id: tags:
<a class="anchor" id="moving-and-removing-files"></a>
### Moving and removing files
The `os.remove` and `os.rename` functions perform the equivalent of the Unix
`rm` and `mv` commands for files. Just like in Unix, if the destination file
you pass to `os.rename` already exists, it will be silently overwritten!
%% Cell type:code id: tags:
```
with open('file.txt', 'wt') as f:
    f.write('This file contains nothing of interest')
print(os.listdir())
os.rename('file.txt', 'file2.txt')
print(os.listdir())
os.remove('file2.txt')
print(os.listdir())
```
%% Cell type:markdown id: tags:
The `os.rename` function will also work on directories, but the `shutil.move`
function (covered below) is more flexible.
<a class="anchor" id="walking-a-directory-tree"></a>
### Walking a directory tree
The `os.walk` function is a useful one to know about. It is a bit fiddly to
use, but it is the best option if you need to traverse a directory tree. It
will recursively iterate over all of the directories in a tree - by default
it works top-down, visiting each directory before any of its sub-directories.
%% Cell type:code id: tags:
```
# On each iteration of the loop, we get:
# - root: the current directory
# - dirs: a list of all sub-directories in the root
# - files: a list of all files in the root
for root, dirs, files in os.walk('raw_mri_data'):
    print('Current directory: {}'.format(root))
    print(' Sub-directories:')
    print('\n'.join([' {}'.format(d) for d in dirs]))
    print(' Files:')
    print('\n'.join([' {}'.format(f) for f in files]))
```
%% Cell type:markdown id: tags:
> Note that `os.walk` does not guarantee a specific ordering in the lists of
> files and sub-directories that it returns. However, you can force an
> ordering quite easily - see its
> [documentation](https://docs.python.org/3/library/os.html#os.walk) for
> more details.
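One way to force an ordering, sketched below on a temporary directory with made-up sub-directory names, is to sort the `dirs` and `files` lists _in place_ on each iteration. When walking top-down, `os.walk` uses the modified `dirs` list to decide where to go next.

```python
import os
import tempfile

visited = []
with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical sub-directories, created in a scrambled order
    for name in ['subj_b', 'subj_c', 'subj_a']:
        os.mkdir(os.path.join(tmpdir, name))

    for root, dirs, files in os.walk(tmpdir):
        # Sorting in place (rather than dirs = sorted(dirs)) changes
        # the order in which os.walk visits the sub-directories
        dirs.sort()
        files.sort()
        visited.append(os.path.basename(root))

# The first entry is the temporary directory itself
print(visited[1:])
```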
If you need to visit the sub-directories before their parent directory (a
bottom-up traversal), you can set the `topdown` parameter to `False`:
%% Cell type:code id: tags:
```
for root, dirs, files in os.walk('raw_mri_data', topdown=False):
    print('Current directory: {}'.format(root))
    print(' Sub-directories:')
    print('\n'.join([' {}'.format(d) for d in dirs]))
    print(' Files:')
    print('\n'.join([' {}'.format(f) for f in files]))
```
%% Cell type:markdown id: tags:
> Here we have explicitly named the `topdown` argument when passing it to the
> `os.walk` function. This is referred to as a _keyword argument_ - unnamed
> arguments are referred to as _positional arguments_. We'll give some more
> examples of positional and keyword arguments below.
<a class="anchor" id="copying-moving-and-removing-directory-trees"></a>
### Copying, moving, and removing directory trees
The `shutil` module contains some higher level functions for copying and
moving files and directories.
%% Cell type:code id: tags:
```
import shutil
```
%% Cell type:markdown id: tags:
The `shutil.copy` and `shutil.move` functions work just like the Unix `cp` and
`mv` commands:
%% Cell type:code id: tags:
```
# copy the source file to a destination file
src = 'raw_mri_data/subj_1/t1.nii'
shutil.copy(src, 'subj_1_t1.nii')
print(os.listdir('.'))
# copy the source file to a destination directory
os.mkdir('data_backup')
shutil.copy('subj_1_t1.nii', 'data_backup')
print(os.listdir('.'))
print(os.listdir('data_backup'))
# Move the file copy into that destination directory
shutil.move('subj_1_t1.nii', 'data_backup/subj_1_t1_backup.nii')
print(os.listdir('.'))
print(os.listdir('data_backup'))
# Move that destination directory into another directory
os.mkdir('data_backup_backup')
shutil.move('data_backup', 'data_backup_backup')
print(os.listdir('.'))
print(os.listdir('data_backup_backup'))
```
%% Cell type:markdown id: tags:
The `shutil.copytree` function allows you to copy entire directory trees - it
is the equivalent of the Unix `cp -r` command. The reverse operation is provided
by the `shutil.rmtree` function:
%% Cell type:code id: tags:
```
shutil.copytree('raw_mri_data', 'raw_mri_data_backup')
print(os.listdir('.'))
shutil.rmtree('raw_mri_data_backup')
shutil.rmtree('data_backup_backup')
print(os.listdir('.'))
```
%% Cell type:markdown id: tags:
<a class="anchor" id="managing-file-paths"></a>
## Managing file paths
The `os.path` module contains functions for creating and manipulating file and
directory paths, such as stripping directory prefixes and suffixes, and
joining directory paths in a cross-platform manner. In this code, we are using
`op` to refer to `os.path` - remember that we [created an alias
earlier](#managing-files-and-directories).
> Note that many of the functions in the `os.path` module do not care if your
> path actually refers to a real file or directory - they are just
> manipulating the path string, and will happily generate invalid or
> non-existent paths for you.
<a class="anchor" id="file-and-directory-tests"></a>
### File and directory tests
If you want to know whether a given path is a file, or a directory, or whether
it exists at all, then the `os.path` module has got your back with its
`isfile`, `isdir`, and `exists` functions. Let's define a silly function which
will tell us what a path is:
%% Cell type:code id: tags:
```
def whatisit(path, existonly=False):
    print('Does {} exist? {}'.format(path, op.exists(path)))
    if not existonly:
        print('Is {} a file? {}'.format(path, op.isfile(path)))
        print('Is {} a directory? {}'.format(path, op.isdir(path)))
```
%% Cell type:markdown id: tags:
> This is the first time in a while that we have defined our own function,
> [hooray!](https://www.youtube.com/watch?v=zQiibNVIvK4). Here's a quick
> refresher on how to write functions in Python, in case you have forgotten.
>
> First of all, all function definitions in Python begin with the `def`
> keyword:
>
> ```
> def myfunction():
>     function_body
> ```
>
> Just like with other control flow tools, such as `if`, `for`, and `while`
> statements, the body of a function must be indented (with four spaces
> please!).
>
> Python functions can be written to accept any number of arguments:
>
> ```
> def myfunction(arg1, arg2, arg3):
>     function_body
> ```
>
> Arguments can also be given default values:
>
> ```
> def myfunction(arg1, arg2, arg3=False):
>     function_body
> ```
>
> In our `whatisit` function above, we gave the `existonly` argument (which
> controls whether the path is only tested for existence) a default value.
> This makes the `existonly` argument optional - we can call `whatisit` either
> with or without this argument.
>
> To return a value from a function, use the `return` keyword:
>
> ```
> def add(n1, n2):
>     return n1 + n2
> ```
>
> Take a look at the [official Python
> tutorial](https://docs.python.org/3/tutorial/controlflow.html#defining-functions)
> for more details on defining your own functions.
Now let's use that function to test some paths. Here we are using the
`op.join` function to construct paths - it is [covered
below](#cross-platform-compatbility):
%% Cell type:code id: tags:
```
dirname = op.join('raw_mri_data')
filename = op.join('raw_mri_data', 'subj_1', 't1.nii')
nonexist = op.join('very', 'unlikely', 'to', 'exist')
whatisit(dirname)
whatisit(filename)
whatisit(nonexist)
whatisit(nonexist, existonly=True)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="deconstructing-paths"></a>
### Deconstructing paths
If you are only interested in the directory or file component of a path then
the `os.path` module has the `dirname`, `basename`, and `split` functions.
%% Cell type:code id: tags:
```
path = '/path/to/my/image.nii'
print('Directory name: {}'.format(op.dirname(path)))
print('Base name: {}'.format(op.basename(path)))
print('Directory and base names: {}'.format(op.split(path)))
```
%% Cell type:markdown id: tags:
> Note here that `op.split` returns both the directory and base names - remember
> that it is super easy to define a Python function that returns multiple values,
> simply by having it return a tuple. For example, the implementation of
> `op.split` might look something like this:
>
>
> ```
> def mysplit(path):
>     dirname = op.dirname(path)
>     basename = op.basename(path)
>
>     # It is not necessary to use round brackets here
>     # to denote the tuple - the return values will
>     # be implicitly grouped into a tuple for us.
>     return dirname, basename
> ```
>
>
> When calling a function which returns multiple values, you can _unpack_ those
> values in a single statement like so:
>
>
> ```
> dirname, basename = mysplit(path)
>
> print('Directory name: {}'.format(dirname))
> print('Base name: {}'.format(basename))
> ```
If you want to extract the prefix or suffix of a file, you can use `splitext`:
%% Cell type:code id: tags:
```
prefix, suffix = op.splitext('image.nii')
print('Prefix: {}'.format(prefix))
print('Suffix: {}'.format(suffix))
```
%% Cell type:markdown id: tags:
> Double-barrelled file suffixes (e.g. `.nii.gz`) are the work of the devil.
> Correct handling of them is an open problem in Computer Science, and is
> considered by many to be unsolvable. For `imglob`, `imcp`, and `immv`-like
> functionality, check out the `fsl.utils.path` and `fsl.utils.imcp` modules,
> part of the [`fslpy`
> project](https://users.fmrib.ox.ac.uk/~paulmc/fsleyes/fslpy/latest/). If you
> are using `fslpython`, then you already have access to all of the functions
> in `fslpy`.
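To see why these suffixes are awkward, note that `splitext` only splits off the last suffix. The sketch below shows this, together with one simple workaround. The `split_suffix` helper and its list of known suffixes are hypothetical, written just for this example - this is not how `fslpy` does it.

```python
import os.path as op

# splitext only removes the final suffix
print(op.splitext('image.nii.gz'))

def split_suffix(filename, suffixes=('.nii.gz', '.nii')):
    """Hypothetical helper: split off the first matching known suffix."""
    for suffix in suffixes:
        if filename.endswith(suffix):
            return filename[:-len(suffix)], suffix
    # Fall back to the standard behaviour for unknown suffixes
    return op.splitext(filename)

print(split_suffix('image.nii.gz'))
print(split_suffix('image.nii'))
```

The longer suffix is listed first so that it is tested before the shorter one.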
<a class="anchor" id="absolute-and-relative-paths"></a>
### Absolute and relative paths
The `os.path` module has three useful functions for converting between
absolute and relative paths. The `op.abspath` and `op.relpath` functions will
respectively turn the provided path into an equivalent absolute or relative
path.
%% Cell type:code id: tags:
```
path = op.abspath('relative/path/to/some/file.txt')
print('Absolutised: {}'.format(path))
print('Relativised: {}'.format(op.relpath(path)))
```
%% Cell type:markdown id: tags:
By default, the `op.abspath` and `op.relpath` functions work relative to the
current working directory. The `op.relpath` function allows you to specify a
different directory to work from. To create an absolute path with a different
starting point, you can combine `op.join` with another function,
`op.normpath`, which takes care of collapsing duplicate slashes and
resolving references to `"."` and `".."`:
%% Cell type:code id: tags:
```
path = 'relative/path/to/some/file.txt'
root = '/vols/Data/'
abspath = op.normpath(op.join(root, path))
print('Absolute path: {}'.format(abspath))
print('Relative path: {}'.format(op.relpath(abspath, root)))
```
%% Cell type:markdown id: tags:
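The clean-up performed by `normpath` is easiest to see on a deliberately messy path. This sketch uses a made-up path and the `posixpath` module (the Unix flavour of `os.path`), so that the output is the same on every platform:

```python
import posixpath

# Duplicate slashes, a "." and a ".." all get tidied up
messy = '/vols//Data/./subjects/../relative/path'
clean = posixpath.normpath(messy)
print(clean)

# normpath only edits the string - a relative path stays relative
relative = posixpath.normpath('a/./b/../c')
print(relative)
```

Because `normpath` works purely on the path string, it never checks whether the path exists.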
<a class="anchor" id="wildcard-matching-aka-globbing"></a>
### Wildcard matching a.k.a. globbing
The `glob` module has a function, also called `glob`, which allows you to find
files, based on unix-style wildcard pattern matching.
%% Cell type:code id: tags:
```
from glob import glob
root = 'raw_mri_data'
# find all niftis for subject 1
images = glob(op.join(root, 'subj_1', '*.nii*'))
print('Subject #1 images:')
print('\n'.join([' {}'.format(i) for i in images]))
# find all subject directories
subjdirs = glob(op.join(root, 'subj_*'))
print('Subject directories:')
print('\n'.join([' {}'.format(d) for d in subjdirs]))
```
%% Cell type:markdown id: tags:
As with [`os.walk`](#walking-a-directory-tree), the order of the results
returned by `glob` is arbitrary. Unfortunately the undergraduate who
acquired this specific data set did not think to use zero-padded subject IDs
(you'll be pleased to know that this student was immediately kicked out of his
college and banned from ever returning), so we can't simply sort the paths
alphabetically. Instead, let's use some trickery to sort the subject
directories numerically by ID:
Let's define a function which, given a subject directory, returns the numeric
subject ID:
%% Cell type:code id: tags:
```
def get_subject_id(subjdir):
    # Remove any leading directories (e.g. "raw_mri_data/")
    subjdir = op.basename(subjdir)
    # Split "subj_[id]" into two words
    prefix, sid = subjdir.split('_')
    # return the subject ID as an integer
    return int(sid)
```
%% Cell type:markdown id: tags:
This function works like so: