Commit 7f70f62f authored by Michiel Cottaar

Attempted to clarify unicode versus bytes

parent 5affdf9b
%% Cell type:markdown id: tags:
# Text input/output
In this section we will explore how to write and/or retrieve our data from text files.
Most of the functionality for reading/writing files and manipulating strings is available without any imports. However, you can find some additional functionality in the [`string`](https://docs.python.org/3.6/library/string.html) module.
Most of the string functions are available as methods on string objects. This means that you can use the ipython autocomplete to check for them.
%% Cell type:code id: tags:
``` python
empty_string = ''
```
%% Cell type:code id: tags:
``` python
empty_string. # after running the code block above, put your cursor behind the dot and press tab to get a list of methods
```
%% Cell type:markdown id: tags:
<a class="anchor" id="creating-new-strings"></a>
## Creating new strings
<a class="anchor" id="string-syntax"></a>
### String syntax
Single-line strings can be created in python using either single or double quotes
%% Cell type:code id: tags:
``` python
a_string = 'To be or not to be'
same_string = "To be or not to be"
print(a_string == same_string)
```
%% Cell type:markdown id: tags:
The main rationale for choosing between single or double quotes is whether the string itself will contain any quotes. You can include a single quote in a string surrounded by single quotes by escaping it with the `\` character:
%% Cell type:code id: tags:
``` python
a_string = "That's the question"
same_string = 'That\'s the question'
print(a_string == same_string)
```
%% Cell type:markdown id: tags:
New-lines (`\n`), tabs (`\t`) and many other special characters are supported:
%% Cell type:code id: tags:
``` python
a_string = "This is the first line.\nAnd here is the second.\n\tThe third starts with a tab."
print(a_string)
```
%% Cell type:markdown id: tags:
However, the easiest way to create multi-line strings is to use a triple quote (again single or double quotes can be used):
%% Cell type:code id: tags:
``` python
multi_line_string = """This is the first line.
And here is the second.
\tThird line starts with a tab."""
print(multi_line_string)
```
%% Cell type:markdown id: tags:
If you don't want python to reinterpret your `\n`, `\t`, etc. in your strings, you can prepend the quotes enclosing the string with an `r`. This will lead to python interpreting the following string as raw text.
%% Cell type:code id: tags:
``` python
single_line_string = r"This string is not multiline.\nEven though it contains the \n character"
print(single_line_string)
```
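%% Cell type:markdown id: tags:
To make the effect of the `r` prefix concrete, the short sketch below compares a regular and a raw string with the same characters between the quotes. The variable names are only illustrative.
%% Cell type:code id: tags:
``` python
regular = "first\nsecond"   # \n is interpreted as a single new-line character
raw = r"first\nsecond"      # \n is kept as a backslash followed by the letter n

print(len(regular), len(raw))    # the raw string is one character longer
print(raw)
print(raw == "first\\nsecond")   # same as escaping the backslash by hand
```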
%% Cell type:markdown id: tags:
<a class="anchor" id="unicode-versus-bytes"></a>
#### Unicode versus bytes
To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3). This means that any unicode characters can be used in strings (or in our code):
%% Cell type:code id: tags:
``` python
Δ = "café"
print(Δ)
```
%% Cell type:markdown id: tags:
Python 3 uses UTF-8 encoding by default, although you can change this in any file (see [python documentation on encoding](https://docs.python.org/3/howto/unicode.html) for more details).
In python 2 the string object was a simple array of bytes. You can create such a byte array from your unicode string in python 3 using the `encode()` method:
%% Cell type:code id: tags:
``` python
delta = "Δ"
print(delta, ' in python 2 would be represented as ', delta.encode())
```
%% Cell type:markdown id: tags:
These byte arrays can be created directly by prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:
%% Cell type:code id: tags:
``` python
a_byte_array = b'\xce\xa9'
print('The bytes ', a_byte_array, ' become ', a_byte_array.decode(), ' with UTF-8 encoding')
```
%% Cell type:markdown id: tags:
Especially in code dealing with strings (e.g., reading/writing of files), many of the errors that arise when running python 2 code in python 3 come from mixing unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues.
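As a minimal sketch of such a mixing error and its fix, the cell below tries to concatenate a unicode string with a byte array and then decodes the bytes first. The variable names are made up for this illustration.
%% Cell type:code id: tags:
``` python
text = "Δ"
data = text.encode()   # UTF-8 is the default encoding, so this gives two bytes

try:
    combined = text + data   # mixing str and bytes raises a TypeError
except TypeError as error:
    print('Mixing failed:', error)

combined = text + data.decode()   # decode the bytes first, then concatenate
print(combined)
print(len(text), len(data))       # one character, but two bytes
```
%% Cell type:markdown id: tags: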
<a class="anchor" id="converting-objects-into-strings"></a>
### Converting objects into strings
There are two functions to convert python objects into strings, `repr()` and `str()`.
All other functions that rely on string-representations of python objects will use one of these two (for example the `print()` function will call `str()` on the object).
The goal of the `str()` function is to be readable, while the goal of `repr()` is to be unambiguous. For example
%% Cell type:code id: tags:
``` python
print(str("3"))
print(str(3))
```
%% Cell type:markdown id: tags:
While the output of both `str()` calls is very clear, we can not tell from it whether the input was a string or an actual integer.
%% Cell type:code id: tags:
``` python
print(repr("3"))
print(repr(3))
```
%% Cell type:markdown id: tags:
Note that the output of the `repr()` function can be passed directly back to the python interpreter to recreate our string/integer.
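The round trip can be checked with `eval()`, which evaluates a string as python code. This is only a sketch for a few built-in types: user-defined classes do not necessarily produce a `repr()` that can be evaluated, and `eval()` should never be used on text you do not trust.
%% Cell type:code id: tags:
``` python
for original in ["3", 3, 1.5, ['a', 1]]:
    text = repr(original)
    recreated = eval(text)   # only safe because we created the text ourselves
    print(text, '->', type(recreated).__name__, recreated == original)
```
%% Cell type:markdown id: tags: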
<a class="anchor" id="combining-strings"></a>
### Combining strings
The simplest way to concatenate strings is to simply add them together:
%% Cell type:code id: tags:
``` python
a_string = "Part 1"
other_string = "Part 2"
full_string = a_string + ", " + other_string
print(full_string)
```
%% Cell type:markdown id: tags:
Given a whole sequence of strings, you can concatenate them together using the `join()` method:
%% Cell type:code id: tags:
``` python
list_of_strings = ['first', 'second', 'third', 'fourth']
full_string = ', '.join(list_of_strings)
print(full_string)
```
%% Cell type:markdown id: tags:
Note that the string on which the `join()` method is called (`', '` in this case) is used to glue the different strings together. If you just want to concatenate the strings you can call `join()` on the empty string:
%% Cell type:code id: tags:
``` python
list_of_strings = ['first', 'second', 'third', 'fourth']
full_string = ''.join(list_of_strings)
print(full_string)
```
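%% Cell type:markdown id: tags:
One thing to keep in mind is that `join()` only accepts strings. The sketch below (with a made-up list of numbers) converts each element with `str()` before joining.
%% Cell type:code id: tags:
``` python
numbers = [1, 2, 3]
# ', '.join(numbers) would raise a TypeError, because the elements are not strings
full_string = ', '.join(str(number) for number in numbers)
print(full_string)
```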
%% Cell type:markdown id: tags:
<a class="anchor" id="string-formatting"></a>
### String formatting
Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template string with some placeholders, where variables are later inserted. Built into python are currently 4 different ways of doing this (with many packages providing similar capabilities):
* the recommended [new-style formatting](https://docs.python.org/3.6/library/string.html#format-string-syntax).
* printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)
* [formatted string literals](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)
* bash-like [template-strings](https://docs.python.org/3.6/library/string.html#template-strings)
Here we provide a single example using the first three methods, so you can recognize them in the future.
First the old printf style. Note that this style is invoked by using the modulo (`%`) operator on the string. Every placeholder (starting with the `%`) is then replaced by one of the values provided.
%% Cell type:code id: tags:
``` python
a = 3
b = 1 / 3
print('%.3f = %i + %.3f' % (a + b, a, b))
print('%(total).3f = %(a)i + %(b).3f' % {'a': a, 'b': b, 'total': a + b})
```
%% Cell type:markdown id: tags:
Then the recommended new style formatting (You can find a nice tutorial [here](https://www.digitalocean.com/community/tutorials/how-to-use-string-formatters-in-python-3)). Note that this style is invoked by calling the `format()` method on the string and the placeholders are marked by the curly braces `{}`.
%% Cell type:code id: tags:
``` python
a = 3
b = 1 / 3
print('{:.3f} = {} + {:.3f}'.format(a + b, a, b))
print('{total:.3f} = {a} + {b:.3f}'.format(a=a, b=b, total=a+b))
```
%% Cell type:markdown id: tags:
Finally the new, fancy formatted string literals (only available in python 3.6+). This new format is very similar to the recommended style, except that all placeholders are automatically evaluated in the local environment at the time the template is defined. This means that we do not have to explicitly provide the parameters (and we can evaluate the sum inside the string!), although it does mean we also can not re-use the template.
%% Cell type:code id: tags:
``` python
a = 3
b = 1/3
print(f'{a + b:.3f} = {a} + {b:.3f} = {a + b}')
```
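%% Cell type:markdown id: tags:
For completeness, here is a rough sketch of the fourth option, the bash-like template strings from the `string` module. `Template` placeholders have no format specification, so in this sketch the numbers are rounded with a formatted string literal before they are substituted.
%% Cell type:code id: tags:
``` python
from string import Template

a = 3
b = 1 / 3
template = Template('$total = $a + $b')   # placeholders are marked with $
result = template.substitute(a=a, b=f'{b:.3f}', total=f'{a + b:.3f}')
print(result)
```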
%% Cell type:markdown id: tags:
<a class="anchor" id="reading-writing-files"></a>
## Reading/writing files
## Extracting sub-strings from strings
### Splitting strings
The simplest way to extract a sub-string is to use slicing
%% Cell type:code id: tags:
``` python
a_string = 'abcdefghijklmnopqrstuvwxyz'
print(a_string[10]) # create a string containing only the character at index 10 (the 11th character)
print(a_string[20:]) # create a string containing the characters from index 20 onward
print(a_string[::-1]) # creating the reverse string
```
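%% Cell type:markdown id: tags:
Slicing requires you to know the positions in advance. When a string has clear separators, the `split()` and `splitlines()` methods are often more convenient. The example strings below are invented for illustration.
%% Cell type:code id: tags:
``` python
line = 'first second  third'
parts = line.split()             # split on any run of whitespace
print(parts)

csv_line = 'first,second,third'
csv_parts = csv_line.split(',')  # split on an explicit separator
print(csv_parts)

text = 'line one\nline two'
lines = text.splitlines()        # split a multi-line string into its lines
print(lines)
```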
%% Cell type:markdown id: tags:
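If you are not sure where to cut into a string, you can use the `find()` method to find the first occurrence of a sub-string. Strings have no `findall()` method; to find all occurrences you can call `find()` repeatedly with a start index or use `re.finditer()` from the [re](https://docs.python.org/3/library/re.html) module.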
%% Cell type:code id: tags:
``` python
a_string = 'abcdefghijklmnopqrstuvwxyz'
index = a_string.find('fgh')
print(a_string[:index]) # extracts the sub-string up to the first occurrence of 'fgh'
print('index for non-existent sub-string', a_string.find('cats')) # note that find returns -1 when it can not find the sub-string rather than raising an error.
```
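%% Cell type:markdown id: tags:
Finding every occurrence of a sub-string takes a little more work than a single `find()` call. Two possible approaches are sketched below on a made-up string: calling `find()` in a loop with a start index, or using `re.finditer()`. Note that the loop also finds overlapping occurrences, while `re.finditer()` only returns non-overlapping ones (they coincide in this example).
%% Cell type:code id: tags:
``` python
import re

a_string = 'abcabcabc'

# option 1: call find() repeatedly, starting just after the previous match
indices = []
index = a_string.find('bc')
while index != -1:
    indices.append(index)
    index = a_string.find('bc', index + 1)
print(indices)

# option 2: use the re module (re.escape protects any special characters)
re_indices = [match.start() for match in re.finditer(re.escape('bc'), a_string)]
print(re_indices)
```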
%% Cell type:markdown id: tags:
### Regular expressions
Regular expressions are used for looking for specific patterns in a longer string. This can be used to extract specific information from a well-formatted string or to modify a string. In python regular expressions are available in the [re](https://docs.python.org/3/library/re.html#re-syntax) module.
A full discussion of regular expressions goes far beyond this tutorial. If you are interested, have a look at the [regular expression HOWTO](https://docs.python.org/3/howto/regex.html).
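To give a flavour of what this looks like in practice, here is a small sketch that extracts dates from a sentence. The text and the pattern are invented for this example.
%% Cell type:code id: tags:
``` python
import re

text = 'Scan acquired on 2018-02-14 and repeated on 2018-03-01'
pattern = r'(\d{4})-(\d{2})-(\d{2})'   # year-month-day, each part captured in a group

match = re.search(pattern, text)       # finds the first match only
print(match.group(0))                  # the full matched text
print(match.groups())                  # the captured groups, as strings

all_dates = re.findall(pattern, text)  # all matches, as a list of tuples of groups
print(all_dates)
```
%% Cell type:markdown id: tags: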
## Exercises
### Joining/splitting strings
go from 2 column file to 2 rows
### String formatting and regular expressions
Given a template for MRI files:
s<subject_id>/<modality>_<res>mm.nii.gz
where <subject_id> is a 6-digit subject-id, <modality> is one of T1w, T2w, or PD, and <res> is the resolution of the image (with up to one digit after the decimal point, e.g. 1.5).
Write a function that takes the subject_id (as an integer), the modality (as a string), and the resolution (as a float) and returns the complete filename (Hint: use one of the formatting techniques mentioned in [String formatting](#string-formatting)).
%% Cell type:code id: tags:
``` python
def get_filename(subject_id, modality, resolution):
    ...
```
%% Cell type:markdown id: tags:
For a more difficult exercise, write a function that extracts the subject id, modality, and resolution from a filename (using a regular expression or by using `find` and `split` to access relevant parts of the string).
%% Cell type:code id: tags:
``` python
def get_parameters(filename):
    ...
    return subject_id, modality, resolution
```
@@ -54,15 +54,16 @@ print(single_line_string)
<a class="anchor" id="unicode-versus-bytes"></a>
#### unicode versus bytes
To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3).
This means that each element in a string is a unicode character, which can take one or more bytes when encoded (python 3 uses [UTF-8 encoding](https://docs.python.org/3/howto/unicode.html) by default).
The advantage is that any unicode character can now be used in strings or in the code itself:
```
Δ = "café"
print(Δ)
```
In python 2 each element in a string was a single byte rather than a potentially multi-byte character. You can create such a byte array from your unicode string in python 3 using the `encode()` method and convert it back using the `decode()` method.
```
delta = "Δ"
print(delta, ' in python 2 would be represented as ', delta.encode())
@@ -71,7 +72,7 @@ print(delta, ' in python 2 would be represented as ', delta.encode())
These byte arrays can be created directly by prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:
```
a_byte_array = b'\xce\xa9'
print('The two bytes ', a_byte_array, ' become a single unicode character (', a_byte_array.decode(), ') with UTF-8 encoding')
```
Especially in code dealing with strings (e.g., reading/writing of files), many of the errors that arise when running python 2 code in python 3 come from mixing unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues.
......
%% Cell type:markdown id: tags:
# Callable scripts in python
In this tutorial we will cover how to write simple stand-alone scripts in python that can be used as alternatives to bash scripts.
There are some code blocks within this webpage, but we recommend that you write the code in an IDE or editor instead and then run the scripts from a terminal.
## Basic script
The first line of a python script is usually:
%% Cell type:code id: tags:
``` python
#!/usr/bin/env python
```
%% Cell type:markdown id: tags:
which invokes whichever version of python can be found by `/usr/bin/env` since python can be located in many different places.
For FSL scripts we use an alternative, to ensure that we pick up the version of python (and associated packages) that we ship with FSL. To do this we use the line:
%% Cell type:code id: tags:
``` python
#!/usr/bin/env fslpython
```
%% Cell type:markdown id: tags:
After this line the rest of the file just uses regular python syntax, as in the other tutorials. Make sure you make the file executable - just like a bash script.
## Calling other executables
The most essential call that you need to use to replicate the way a bash script calls executables is `subprocess.run()`. A simple call looks like this:
%% Cell type:code id: tags:
``` python
import subprocess as sp
sp.run(['ls', '-la'])
```
%% Cell type:markdown id: tags:
To suppress the output do this:
%% Cell type:code id: tags:
``` python
spobj = sp.run(['ls'], stdout = sp.PIPE)
```
%% Cell type:markdown id: tags:
To store the output do this:
%% Cell type:code id: tags:
``` python
spobj = sp.run('ls -la'.split(), stdout = sp.PIPE)
sout = spobj.stdout.decode('utf-8')
print(sout)
```
%% Cell type:markdown id: tags:
> Note that the `decode` call in the middle line converts the string from a byte string to a normal string. In Python 3 there is a distinction between strings (sequences of characters, possibly using multiple bytes to store each character) and bytes (sequences of bytes). The world has moved on from ASCII, so in this day and age, this distinction is absolutely necessary, and Python does a fairly good job of it.
If the output is numerical then this can be extracted like this:
%% Cell type:code id: tags:
``` python
import os
fsldir = os.getenv('FSLDIR')
spobj = sp.run([fsldir+'/bin/fslstats', fsldir+'/data/standard/MNI152_T1_1mm_brain', '-V'], stdout = sp.PIPE)
sout = spobj.stdout.decode('utf-8')
vol_vox = float(sout.split()[0])
vol_mm = float(sout.split()[1])
print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
```
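%% Cell type:markdown id: tags:
The explicit `decode` step can also be left to `subprocess.run()` itself. The sketch below assumes Python 3.7 or newer, where the `capture_output` and `text` arguments are available, and runs the current python interpreter (`sys.executable`) as a stand-in command so that it does not depend on FSL being installed.
%% Cell type:code id: tags:
``` python
import subprocess as sp
import sys

# capture_output=True collects stdout and stderr; text=True decodes them to str
spobj = sp.run([sys.executable, '-c', 'print("hello")'], capture_output=True, text=True)
print(spobj.returncode)   # 0 means the command succeeded
print(spobj.stdout)       # already a normal string, no decode() needed
```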
%% Cell type:markdown id: tags:
An alternative way to run a set of commands would be like this:
%% Cell type:code id: tags:
``` python
commands = """
{fsldir}/bin/fslmaths {t1} -bin {t1_mask}
{fsldir}/bin/fslmaths {t2} -mas {t1_mask} {t2_masked}
"""
fsldirpath = os.getenv('FSLDIR')
commands = commands.format(t1 = 't1.nii.gz', t1_mask = 't1_mask', t2 = 't2', t2_masked = 't2_masked', fsldir = fsldirpath)
sout=[]
for cmd in commands.split('\n'):
    if cmd: # avoids empty strings getting passed to sp.run()
        print('Running command: ', cmd)
        spobj = sp.run(cmd.split(), stdout = sp.PIPE)
        sout.append(spobj.stdout.decode('utf-8'))
```
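%% Cell type:markdown id: tags:
None of the calls above check whether the command actually succeeded. One way to do so, sketched here with a deliberately failing stand-in command rather than a real FSL tool, is to pass `check=True`, which makes `subprocess.run()` raise a `CalledProcessError` when the exit code is not zero.
%% Cell type:code id: tags:
``` python
import subprocess as sp
import sys

# a command that deliberately fails with exit code 3 (stand-in for a real executable)
failing_command = [sys.executable, '-c', 'import sys; sys.exit(3)']

try:
    sp.run(failing_command, check=True)   # check=True raises if the exit code is not 0
    exit_code = 0
except sp.CalledProcessError as error:
    exit_code = error.returncode
    print('Command failed with exit code', exit_code)
```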
%% Cell type:markdown id: tags:
## Command line arguments
The simplest way of dealing with command line arguments is to use the module `sys`, which gives access to an `argv` list:
%% Cell type:code id: tags:
``` python
import sys
print(len(sys.argv))
print(sys.argv[0])
```
%% Cell type:markdown id: tags:
For more sophisticated argument parsing you can use `argparse` - good documentation and examples of this can be found on the web.
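As a rough sketch of what `argparse` usage can look like, the cell below defines two positional arguments and one optional flag. The argument names and the description are only chosen for illustration.
%% Cell type:code id: tags:
``` python
import argparse

parser = argparse.ArgumentParser(description='Mask an image and report its volume')
parser.add_argument('infile', help='input image')
parser.add_argument('outfile', help='output image')
parser.add_argument('-v', '--verbose', action='store_true', help='print extra information')

# an explicit list is parsed here so that the example also runs inside a notebook;
# in a script you would call parser.parse_args() without arguments to read sys.argv
args = parser.parse_args(['in.nii.gz', 'out.nii.gz', '--verbose'])
print(args.infile, args.outfile, args.verbose)
```
%% Cell type:markdown id: tags: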
## Example script
Here is a simple bash script (it masks an image and calculates volumes - just as a random example). DO NOT execute the code blocks here within the notebook/webpage:
%% Cell type:code id: tags:
``` python
%%bash
#!/bin/bash
if [ $# -lt 2 ] ; then
  echo "Usage: $0 <input image> <output image>"
  exit 1
fi
infile=$1
outfile=$2
# mask input image with MNI
$FSLDIR/bin/fslmaths $infile -mas $FSLDIR/data/standard/MNI152_T1_1mm_brain $outfile
# calculate volumes of masked image
vv=`$FSLDIR/bin/fslstats $outfile -V`
vol_vox=`echo $vv | awk '{ print $1 }'`
vol_mm=`echo $vv | awk '{ print $2 }'`
echo "Volumes are: $vol_vox in voxels and $vol_mm in mm"
```
%% Output
Usage: bash <input image> <output image>
%% Cell type:markdown id: tags:
And an alternative in python:
%% Cell type:code id: tags:
``` python
#!/usr/bin/env fslpython
import os, sys
import subprocess as sp
fsldir=os.getenv('FSLDIR')
if len(sys.argv)<3: # sys.argv[0] is the script name, so two arguments means a length of 3
    print('Usage: ', sys.argv[0], ' <input image> <output image>')
    sys.exit(1)
infile = sys.argv[1]
outfile = sys.argv[2]
# mask input image with MNI
spobj = sp.run([fsldir+'/bin/fslmaths', infile, '-mas', fsldir+'/data/standard/MNI152_T1_1mm_brain', outfile], stdout = sp.PIPE)
# calculate volumes of masked image
spobj = sp.run([fsldir+'/bin/fslstats', outfile, '-V'], stdout = sp.PIPE)
sout = spobj.stdout.decode('utf-8')
vol_vox = float(sout.split()[0])
vol_mm = float(sout.split()[1])
print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
```
%% Output
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-2-f7378930c369> in <module>()
13 spobj = sp.run([fsldir+'/bin/fslstats', outfile, '-V'], stdout = sp.PIPE)
14 sout = spobj.stdout.decode('utf-8')
---> 15 vol_vox = float(sout.split()[0])
16 vol_mm = float(sout.split()[1])
17 print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
IndexError: list index out of range
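%% Cell type:markdown id: tags:
An `IndexError` like the one above appears when the captured output is empty, for instance because the command failed. A more informative error can be produced by checking the output before indexing into it. The helper `parse_volumes` below is hypothetical (it is not part of FSL or of the script above) and the input string is made up.
%% Cell type:code id: tags:
``` python
def parse_volumes(sout):
    """Parse two whitespace-separated numbers (voxels, mm) from command output.

    Hypothetical helper for illustration only.
    """
    parts = sout.split()
    if len(parts) < 2:
        raise ValueError('expected two numbers, but got: ' + repr(sout))
    return float(parts[0]), float(parts[1])

vol_vox, vol_mm = parse_volumes('10 12.5\n')   # made-up output string
print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
```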
%% Cell type:code id: tags:
``` python
```