Attempted to clarify unicode versus bytes

Commit 7f70f62f authored 7 years ago by Michiel Cottaar
--- a/getting_started/02_text_io.ipynb
+++ b/getting_started/02_text_io.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Text input/output\n",
+    "\n",
+    "In this section we will explore how to write and/or retrieve our data from text files.\n",
+    "\n",
+    "Most of the functionality for reading/writing files and manipulating strings is available without any imports. However, you can find some additional functionality in the [`string`](https://docs.python.org/3.6/library/string.html) module.\n",
+    "\n",
+    "Most of the string functions are available as methods on string objects. This means that you can use the ipython autocomplete to check for them."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "empty_string = ''"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "empty_string.    # after running the code block above, put your cursor behind the dot and press tab to get a list of methods"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"creating-new-strings\"></a>\n",
+    "## Creating new strings\n",
+    "\n",
+    "<a class=\"anchor\" id=\"string-syntax\"></a>\n",
+    "### String syntax\n",
+    "Single-line strings can be created in python using either single or double quotes"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_string = 'To be or not to be'\n",
+    "same_string = \"To be or not to be\"\n",
+    "print(a_string == same_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The main rationale for choosing between single or double quotes, is whether the string itself will contain any quotes. You can include a single quote in a string surrounded by single quotes by escaping it with the `\\` character:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_string = \"That's the question\"\n",
+    "same_string = 'That\\'s the question'\n",
+    "print(a_string == same_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "New-lines (`\\n`), tabs (`\\t`) and many other special characters are supported"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_string = \"This is the first line.\\nAnd here is the second.\\n\\tThe third starts with a tab.\"\n",
+    "print(a_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "However, the easiest way to create multi-line strings is to use triple quotes (again, either single or double quotes can be used):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "multi_line_string = \"\"\"This is the first line.\n",
+    "And here is the second.\n",
+    "\\tThird line starts with a tab.\"\"\"\n",
+    "print(multi_line_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you don't want python to interpret the `\\n`, `\\t`, etc. in your strings, you can put an `r` directly in front of the opening quote. Python will then treat the string as raw text and leave the backslashes untouched."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "single_line_string = r\"This string is not multiline.\\nEven though it contains the \\n character\"\n",
+    "print(single_line_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"unicode-versus-bytes\"></a>\n",
+    "#### unicode versus bytes\n",
+    "To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3). This means that any unicode characters can be used in strings (or in our code):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "Δ = \"café\"\n",
+    "print(Δ)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Python 3 uses UTF-8 encoding by default, although you can change this in any file (see [python documentation on encoding](https://docs.python.org/3/howto/unicode.html) for more details)\n",
+    "\n",
+    "In python 2 the string object was a simple array of bytes. You can create such a byte array from your unicode string in python 3 using the encode method"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "delta = \"Δ\"\n",
+    "print(delta, ' in python 2 would be represented as ', delta.encode())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "These byte arrays can be created directly by prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_byte_array = b'\\xce\\xa9'\n",
+    "print('The bytes ', a_byte_array, ' become ', a_byte_array.decode(), ' with UTF-8 encoding')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Especially in code dealing with strings (e.g., reading/writing of files), many of the errors that appear when running python 2 code in python 3 come from mixing unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues.\n",
+    "\n",
+    "<a class=\"anchor\" id=\"converting-objects-into-strings\"></a>\n",
+    "### converting objects into strings\n",
+    "There are two functions to convert python objects into strings, `repr()` and `str()`.\n",
+    "All other functions that rely on string-representations of python objects will use one of these two (for example the `print()` function will call `str()` on the object).\n",
+    "\n",
+    "The goal of the `str()` function is to be readable, while the goal of `repr()` is to be unambiguous. For example"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "print(str(\"3\"))\n",
+    "print(str(3))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "While the output of both `str()` calls is very clear, we cannot tell whether the input was a string or an actual integer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "print(repr(\"3\"))\n",
+    "print(repr(3))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that the output of the `repr()` function can be passed directly back to the python interpreter to recreate our string/integer.\n",
+    "\n",
+    "<a class=\"anchor\" id=\"combining-strings\"></a>\n",
+    "### Combining strings\n",
+    "The simplest way to concatenate strings is to simply add them together:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_string = \"Part 1\"\n",
+    "other_string = \"Part 2\"\n",
+    "full_string = a_string + \", \" + other_string\n",
+    "print(full_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Given a whole sequence of strings, you can concatenate them together using the `join()` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "list_of_strings = ['first', 'second', 'third', 'fourth']\n",
+    "full_string = ', '.join(list_of_strings)\n",
+    "print(full_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that the string on which the `join()` method is called (`', '` in this case) is used to glue the different strings together. If you just want to concatenate the strings you can call `join()` on the empty string:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "list_of_strings = ['first', 'second', 'third', 'fourth']\n",
+    "full_string = ''.join(list_of_strings)\n",
+    "print(full_string)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"string-formatting\"></a>\n",
+    "### String formatting\n",
+    "Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template string with some placeholders, into which variables are later inserted. Python currently has four built-in ways of doing this (with many packages providing similar capabilities):\n",
+    "* the recommended [new-style formatting](https://docs.python.org/3.6/library/string.html#format-string-syntax).\n",
+    "* printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)\n",
+    "* [formatted string literals](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)\n",
+    "* bash-like [template-strings](https://docs.python.org/3.6/library/string.html#template-strings)\n",
+    "\n",
+    "Here we provide a single example using the first three methods, so you can recognize them in the future.\n",
+    "\n",
+    "First the old printf style. Note that this style is invoked by using the modulo (`%`) operator on the string. Every placeholder (starting with the `%`) is then replaced by one of the values provided."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a = 3\n",
+    "b = 1 / 3\n",
+    "\n",
+    "print('%.3f = %i + %.3f' % (a + b, a, b))\n",
+    "print('%(total).3f = %(a)i + %(b).3f' % {'a': a, 'b': b, 'total': a + b})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then the recommended new style formatting (You can find a nice tutorial [here](https://www.digitalocean.com/community/tutorials/how-to-use-string-formatters-in-python-3)). Note that this style is invoked by calling the `format()` method on the string and the placeholders are marked by the curly braces `{}`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a = 3\n",
+    "b = 1 / 3\n",
+    "\n",
+    "print('{:.3f} = {} + {:.3f}'.format(a + b, a, b))\n",
+    "print('{total:.3f} = {a} + {b:.3f}'.format(a=a, b=b, total=a+b))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Finally the new, fancy formatted string literals (only available in python 3.6+). This new format is very similar to the recommended style, except that all placeholders are automatically evaluated in the local environment at the time the template is defined. This means that we do not have to explicitly provide the parameters (and we can evaluate the sum inside the string!), although it does mean we also can not re-use the template."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a = 3\n",
+    "b = 1/3\n",
+    "\n",
+    "print(f'{a + b:.3f} = {a} + {b:.3f} = {a + b}')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<a class=\"anchor\" id=\"reading-writing-files\"></a>\n",
+    "## Reading/writing files\n",
+    "\n",
+    "## Extracting sub-strings from strings\n",
+    "### Splitting strings\n",
+    "The simplest way to extract a sub-string is to use slicing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_string = 'abcdefghijklmnopqrstuvwxyz'\n",
+    "print(a_string[10])  # the character at index 10 (the 11th character, as indexing starts at 0)\n",
+    "print(a_string[20:])  # the sub-string from index 20 onward\n",
+    "print(a_string[::-1])  # creating the reverse string"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you are not sure where to cut into a string, you can use the `find()` method to find the index of the first occurrence of a sub-string, or `re.findall()` (from the `re` module) to find all occurrences."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "a_string = 'abcdefghijklmnopqrstuvwxyz'\n",
+    "index = a_string.find('fgh')\n",
+    "print(a_string[:index])  # extracts the sub-string up to the first occurrence of 'fgh'\n",
+    "print('index for non-existent sub-string', a_string.find('cats'))  # note that find returns -1 when it can not find the sub-string rather than raising an error."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Regular expressions\n",
+    "Regular expressions are used for looking for specific patterns in a longer string. This can be used to extract specific information from a well-formatted string or to modify a string. In python regular expressions are available in the [re](https://docs.python.org/3/library/re.html#re-syntax) module.\n",
+    "\n",
+    "A full discussion of regular expressions goes far beyond this tutorial. If you are interested, have a look at the [regular expression HOWTO](https://docs.python.org/3/howto/regex.html).\n",
+    "\n",
+    "## Exercises\n",
+    "### Joining/splitting strings\n",
+    "go from 2 column file to 2 rows\n",
+    "### String formatting and regular expressions\n",
+    "Given a template for MRI files:\n",
+    "`s<subject_id>/<modality>_<res>mm.nii.gz`\n",
+    "where `<subject_id>` is a 6-digit subject-id, `<modality>` is one of T1w, T2w, or PD, and `<res>` is the resolution of the image (with at most one digit after the decimal point, e.g. 1.5).\n",
+    "Write a function that takes the subject_id (as an integer), the modality (as a string), and the resolution (as a float) and returns the complete filename (Hint: use one of the formatting techniques mentioned in [String formatting](#string-formatting))."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def get_filename(subject_id, modality, resolution):\n",
+    "    ..."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For a more difficult exercise, write a function that extracts the subject id, modality, and resolution from a filename (using a regular expression or by using `find` and `split` to access the relevant parts of the string)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "def get_parameters(filename):\n",
+    "    ...\n",
+    "    return subject_id, modality, resolution"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.2"
+  },
+  "toc": {
+   "colors": {
+    "hover_highlight": "#DAA520",
+    "running_highlight": "#FF0000",
+    "selected_highlight": "#FFD700"
+   },
+   "moveMenuLeft": true,
+   "nav_menu": {
+    "height": "287px",
+    "width": "252px"
+   },
+   "navigate_menu": true,
+   "number_sections": true,
+   "sideBar": true,
+   "threshold": 4,
+   "toc_cell": false,
+   "toc_section_display": "block",
+   "toc_window_display": false
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+
+# Text input/output
+
+In this section we will explore how to write and/or retrieve our data from text files.
+
+Most of the functionality for reading/writing files and manipulating strings is available without any imports. However, you can find some additional functionality in the [`string`](https://docs.python.org/3.6/library/string.html) module.
+
+Most of the string functions are available as methods on string objects. This means that you can use the ipython autocomplete to check for them.
+
+%% Cell type:code id: tags:
+
+``` python
+empty_string = ''
+```
+
+%% Cell type:code id: tags:
+
+``` python
+empty_string.    # after running the code block above, put your cursor behind the dot and press tab to get a list of methods
+```
+
+%% Cell type:markdown id: tags:
+
+<a class="anchor" id="creating-new-strings"></a>
+## Creating new strings
+
+<a class="anchor" id="string-syntax"></a>
+### String syntax
+Single-line strings can be created in python using either single or double quotes
+
+%% Cell type:code id: tags:
+
+``` python
+a_string = 'To be or not to be'
+same_string = "To be or not to be"
+print(a_string == same_string)
+```
+
+%% Cell type:markdown id: tags:
+
+The main rationale for choosing between single or double quotes, is whether the string itself will contain any quotes. You can include a single quote in a string surrounded by single quotes by escaping it with the `\` character:
+
+%% Cell type:code id: tags:
+
+``` python
+a_string = "That's the question"
+same_string = 'That\'s the question'
+print(a_string == same_string)
+```
+
+%% Cell type:markdown id: tags:
+
+New-lines (`\n`), tabs (`\t`) and many other special characters are supported
+
+%% Cell type:code id: tags:
+
+``` python
+a_string = "This is the first line.\nAnd here is the second.\n\tThe third starts with a tab."
+print(a_string)
+```
+
+%% Cell type:markdown id: tags:
+
+However, the easiest way to create multi-line strings is to use triple quotes (again, either single or double quotes can be used):
+
+%% Cell type:code id: tags:
+
+``` python
+multi_line_string = """This is the first line.
+And here is the second.
+\tThird line starts with a tab."""
+print(multi_line_string)
+```
+
+%% Cell type:markdown id: tags:
+
+If you don't want python to interpret the `\n`, `\t`, etc. in your strings, you can put an `r` directly in front of the opening quote. Python will then treat the string as raw text and leave the backslashes untouched.
+
+%% Cell type:code id: tags:
+
+``` python
+single_line_string = r"This string is not multiline.\nEven though it contains the \n character"
+print(single_line_string)
+```
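As a small sketch of what the `r` prefix changes (the variable names `normal` and `raw` are only illustrative), the same literal written with and without the prefix gives strings of different lengths, because the raw version keeps the backslash and the letter `n` as two separate characters:

```python
normal = "first\nsecond"   # \n is interpreted as one new-line character
raw = r"first\nsecond"     # \n is kept as a backslash followed by the letter n

print(len(normal), len(raw))
print(normal)
print(raw)
```

Raw strings are mostly useful for regular expressions and Windows-style paths, where backslashes are common.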
+
+%% Cell type:markdown id: tags:
+
+<a class="anchor" id="unicode-versus-bytes"></a>
+#### unicode versus bytes
+To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3). This means that any unicode characters can be used in strings (or in our code):
+
+%% Cell type:code id: tags:
+
+``` python
+Δ = "café"
+print(Δ)
+```
+
+%% Cell type:markdown id: tags:
+
+Python 3 uses UTF-8 encoding by default, although you can change this in any file (see [python documentation on encoding](https://docs.python.org/3/howto/unicode.html) for more details)
+
+In python 2 the string object was a simple array of bytes. You can create such a byte array from your unicode string in python 3 using the encode method
+
+%% Cell type:code id: tags:
+
+``` python
+delta = "Δ"
+print(delta, ' in python 2 would be represented as ', delta.encode())
+```
+
+%% Cell type:markdown id: tags:
+
+These byte arrays can be created directly by prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:
+
+%% Cell type:code id: tags:
+
+``` python
+a_byte_array = b'\xce\xa9'
+print('The bytes ', a_byte_array, ' become ', a_byte_array.decode(), ' with UTF-8 encoding')
+```
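The round trip between the two types can be sketched as follows. This is a minimal illustration with made-up variable names; it assumes UTF-8, which is what `encode()` and `decode()` use when no encoding is given:

```python
text = "caf\u00e9"                  # the unicode string "café" (4 characters)
encoded = text.encode("utf-8")      # str -> bytes
decoded = encoded.decode("utf-8")   # bytes -> str

# é needs two bytes in UTF-8, so the byte array is longer than the string
print(len(text), len(encoded))
print(decoded == text)

# python 3 refuses to silently mix the two types
try:
    text + encoded
except TypeError as error:
    print("Cannot mix str and bytes:", error)
```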
+
+%% Cell type:markdown id: tags:
+
+Especially in code dealing with strings (e.g., reading/writing of files), many of the errors that appear when running python 2 code in python 3 come from mixing unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues.
+
+<a class="anchor" id="converting-objects-into-strings"></a>
+### converting objects into strings
+There are two functions to convert python objects into strings, `repr()` and `str()`.
+All other functions that rely on string-representations of python objects will use one of these two (for example the `print()` function will call `str()` on the object).
+
+The goal of the `str()` function is to be readable, while the goal of `repr()` is to be unambiguous. For example
+
+%% Cell type:code id: tags:
+
+``` python
+print(str("3"))
+print(str(3))
+```
+
+%% Cell type:markdown id: tags:
+
+While the output of both `str()` calls is very clear, we cannot tell whether the input was a string or an actual integer.
+
+%% Cell type:code id: tags:
+
+``` python
+print(repr("3"))
+print(repr(3))
+```
+
+%% Cell type:markdown id: tags:
+
+Note that the output of the `repr()` function can be passed directly back to the python interpreter to recreate our string/integer.
+
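One way to check this round trip is sketched below. It uses `ast.literal_eval` from the standard library instead of `eval`, which is a choice made here for safety rather than something the tutorial prescribes. Note that the round trip is only guaranteed for simple built-in types; many other objects have a `repr()` that cannot be evaluated.

```python
import ast

for original in ["3", 3, 1.5, ['a', 1]]:
    text = repr(original)
    recreated = ast.literal_eval(text)  # evaluates python literals only
    print(text, type(recreated).__name__, recreated == original)
```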
+<a class="anchor" id="combining-strings"></a>
+### Combining strings
+The simplest way to concatenate strings is to simply add them together:
+
+%% Cell type:code id: tags:
+
+``` python
+a_string = "Part 1"
+other_string = "Part 2"
+full_string = a_string + ", " + other_string
+print(full_string)
+```
+
+%% Cell type:markdown id: tags:
+
+Given a whole sequence of strings, you can concatenate them together using the `join()` method:
+
+%% Cell type:code id: tags:
+
+``` python
+list_of_strings = ['first', 'second', 'third', 'fourth']
+full_string = ', '.join(list_of_strings)
+print(full_string)
+```
+
+%% Cell type:markdown id: tags:
+
+Note that the string on which the `join()` method is called (`', '` in this case) is used to glue the different strings together. If you just want to concatenate the strings you can call `join()` on the empty string:
+
+%% Cell type:code id: tags:
+
+``` python
+list_of_strings = ['first', 'second', 'third', 'fourth']
+full_string = ''.join(list_of_strings)
+print(full_string)
+```
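The reverse of `join()` is the `split()` method, which cuts a string into a list of sub-strings. A brief sketch, reusing the list from the cells above (the whitespace example at the end is purely illustrative):

```python
full_string = ', '.join(['first', 'second', 'third', 'fourth'])

# split on the same separator that was used to join
parts = full_string.split(', ')
print(parts)

# without an argument, split() cuts on any run of whitespace
words = ' a   b\tc\n'.split()
print(words)
```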
+
+%% Cell type:markdown id: tags:
+
+<a class="anchor" id="string-formatting"></a>
+### String formatting
+Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template string with some placeholders, into which variables are later inserted. Python currently has four built-in ways of doing this (with many packages providing similar capabilities):
+* the recommended [new-style formatting](https://docs.python.org/3.6/library/string.html#format-string-syntax).
+* printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)
+* [formatted string literals](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)
+* bash-like [template-strings](https://docs.python.org/3.6/library/string.html#template-strings)
+
+Here we provide a single example using the first three methods, so you can recognize them in the future.
+
+First the old printf style. Note that this style is invoked by using the modulo (`%`) operator on the string. Every placeholder (starting with the `%`) is then replaced by one of the values provided.
+
+%% Cell type:code id: tags:
+
+``` python
+a = 3
+b = 1 / 3
+
+print('%.3f = %i + %.3f' % (a + b, a, b))
+print('%(total).3f = %(a)i + %(b).3f' % {'a': a, 'b': b, 'total': a + b})
+```
+
+%% Cell type:markdown id: tags:
+
+Then the recommended new style formatting (You can find a nice tutorial [here](https://www.digitalocean.com/community/tutorials/how-to-use-string-formatters-in-python-3)). Note that this style is invoked by calling the `format()` method on the string and the placeholders are marked by the curly braces `{}`.
+
+%% Cell type:code id: tags:
+
+``` python
+a = 3
+b = 1 / 3
+
+print('{:.3f} = {} + {:.3f}'.format(a + b, a, b))
+print('{total:.3f} = {a} + {b:.3f}'.format(a=a, b=b, total=a+b))
+```
+
+%% Cell type:markdown id: tags:
+
+Finally the new, fancy formatted string literals (only available in python 3.6+). This new format is very similar to the recommended style, except that all placeholders are automatically evaluated in the local environment at the time the template is defined. This means that we do not have to explicitly provide the parameters (and we can evaluate the sum inside the string!), although it does mean we also can not re-use the template.
+
+%% Cell type:code id: tags:
+
+``` python
+a = 3
+b = 1/3
+
+print(f'{a + b:.3f} = {a} + {b:.3f} = {a + b}')
+```
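For completeness, here is a rough sketch of the fourth option, the bash-like template strings from the `string` module. `Template` placeholders do not accept format specifications, so in this sketch the numbers are formatted with the built-in `format()` before being substituted:

```python
from string import Template

a = 3
b = 1 / 3

template = Template('$total = $a + $b')
result = template.substitute(a=a, b=format(b, '.3f'), total=format(a + b, '.3f'))
print(result)
```

Unlike a formatted string literal, the `template` object can be stored and reused with different values.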
+
+%% Cell type:markdown id: tags:
+
+<a class="anchor" id="reading-writing-files"></a>
+## Reading/writing files
+
+## Extracting sub-strings from strings
+### Splitting strings
+The simplest way to extract a sub-string is to use slicing
+
+%% Cell type:code id: tags:
+
+``` python
+a_string = 'abcdefghijklmnopqrstuvwxyz'
+print(a_string[10])  # the character at index 10 (the 11th character, as indexing starts at 0)
+print(a_string[20:])  # the sub-string from index 20 onward
+print(a_string[::-1])  # creating the reverse string
+```
+
+%% Cell type:markdown id: tags:
+
+If you are not sure where to cut into a string, you can use the `find()` method to find the index of the first occurrence of a sub-string, or `re.findall()` (from the `re` module) to find all occurrences.
+
+%% Cell type:code id: tags:
+
+``` python
+a_string = 'abcdefghijklmnopqrstuvwxyz'
+index = a_string.find('fgh')
+print(a_string[:index])  # extracts the sub-string up to the first occurrence of 'fgh'
+print('index for non-existent sub-string', a_string.find('cats'))  # note that find returns -1 when it can not find the sub-string rather than raising an error.
+```
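Locating every occurrence of a sub-string takes slightly more work than a single `find()` call. Two possible approaches are sketched below on a made-up example string: a loop that uses the optional start argument of `find()`, and `re.finditer` from the `re` module.

```python
import re

a_string = 'the cat sat on the mat with the hat'

# Option 1: call find() repeatedly, each time starting just past the previous match
indices = []
start = a_string.find('the')
while start != -1:
    indices.append(start)
    start = a_string.find('the', start + 1)
print(indices)

# Option 2: let the re module do the looping (re.escape guards against special characters)
regex_indices = [match.start() for match in re.finditer(re.escape('the'), a_string)]
print(regex_indices)
```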
+
+%% Cell type:markdown id: tags:
+
+### Regular expressions
+Regular expressions are used for looking for specific patterns in a longer string. This can be used to extract specific information from a well-formatted string or to modify a string. In python regular expressions are available in the [re](https://docs.python.org/3/library/re.html#re-syntax) module.
+
+A full discussion of regular expressions goes far beyond this tutorial. If you are interested, have a look at the [regular expression HOWTO](https://docs.python.org/3/howto/regex.html).
+
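To give a flavour of both uses, the sketch below extracts a date from a string and then masks all digits. The string `log_line` and the patterns are invented for this illustration and are not part of the tutorial data.

```python
import re

log_line = 'scan acquired on 2017-11-23 at 14:05'

# extract information: three groups of digits separated by dashes
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', log_line)
if match is not None:
    year, month, day = (int(group) for group in match.groups())
    print(year, month, day)

# modify a string: replace every digit with a '#'
masked = re.sub(r'\d', '#', log_line)
print(masked)
```

`re.search` returns `None` when the pattern is not found, which is why the result is checked before it is used.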
+## Exercises
+### Joining/splitting strings
+go from 2 column file to 2 rows
+### String formatting and regular expressions
+Given a template for MRI files:
+`s<subject_id>/<modality>_<res>mm.nii.gz`
+where `<subject_id>` is a 6-digit subject-id, `<modality>` is one of T1w, T2w, or PD, and `<res>` is the resolution of the image (with at most one digit after the decimal point, e.g. 1.5).
+Write a function that takes the subject_id (as an integer), the modality (as a string), and the resolution (as a float) and returns the complete filename (Hint: use one of the formatting techniques mentioned in [String formatting](#string-formatting)).
+
+%% Cell type:code id: tags:
+
+``` python
+def get_filename(subject_id, modality, resolution):
+    ...
+```
+
+%% Cell type:markdown id: tags:
+
+For a more difficult exercise, write a function that extracts the subject id, modality, and resolution from a filename (using a regular expression or by using `find` and `split` to access the relevant parts of the string).
+
+%% Cell type:code id: tags:
+
+``` python
+def get_parameters(filename):
+    ...
+    return subject_id, modality, resolution
+```
--- a/getting_started/02_text_io.md
+++ b/getting_started/02_text_io.md
@@ -54,15 +54,16 @@ print(single_line_string)

 <a class="anchor" id="unicode-versus-bytes"></a>
 #### unicode versus bytes
-To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3). This means that any unicode characters can be used in strings (or in our code):
+To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3).
+This means that each element in a string is a unicode character (using [UTF-8 encoding](https://docs.python.org/3/howto/unicode.html)), which can consist of one or more bytes.
+The advantage is that any unicode characters can now be used in strings or in the code itself:
 ```
 Δ = "café"
 print(Δ)
 ```

-Python 3 uses UTF-8 encoding by default, although you can change this in any file (see [python documentation on encoding](https://docs.python.org/3/howto/unicode.html) for more details)

-In python 2 the string object was a simple array of bytes. You can create such a byte array from your unicode string in python 3 using the encode method
+In python 2 each element in a string was a single byte rather than a potentially multi-byte character. You can create such a byte array from your unicode string in python 3 using the `encode()` method and convert it back using the `decode()` method.
 ```
 delta = "Δ"
 print(delta, ' in python 2 would be represented as ', delta.encode())
@@ -71,7 +72,7 @@ print(delta, ' in python 2 would be represented as ', delta.encode())
 These byte arrays can be created directly by prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:
 ```
 a_byte_array = b'\xce\xa9'
-print('The bytes ', a_byte_array, ' become ', a_byte_array.decode(), ' with UTF-8 encoding')
+print('The two bytes ', a_byte_array, ' become single unicode character (', a_byte_array.decode(), ') with UTF-8 encoding')
 ```

 Especially in code dealing with strings (e.g., reading/writing of files), many of the errors that appear when running python 2 code in python 3 come from mixing unicode strings with byte arrays. Decoding and/or encoding some of these objects can often fix these issues.

--- a/getting_started/scripts.ipynb
+++ b/getting_started/scripts.ipynb
@@ -18,7 +18,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env python"
@@ -36,7 +38,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "#!/usr/bin/env fslpython"
@@ -56,7 +60,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "import subprocess as sp\n",
@@ -73,7 +79,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "spobj = sp.run(['ls'], stdout = sp.PIPE)"
@@ -89,7 +97,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "spobj = sp.run('ls -la'.split(), stdout = sp.PIPE)\n",
@@ -109,7 +119,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "import os\n",
@@ -131,7 +143,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "commands = \"\"\"\n",
@@ -162,7 +176,9 @@
  {
   "cell_type": "code",
   "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "collapsed": true
+   },
   "outputs": [],
   "source": [
    "import sys\n",
@@ -184,10 +200,19 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 1,
   "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Usage: bash <input image> <output image>\n"
+     ]
+    }
+   ],
   "source": [
+    "%%bash\n",
    "#!/bin/bash\n",
    "if [ $# -lt 2 ] ; then\n",
    "  echo \"Usage: $0 <input image> <output image>\"\n",
@@ -213,9 +238,21 @@
  },
  {
   "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 2,
   "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "IndexError",
+     "evalue": "list index out of range",
+     "traceback": [
+      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
+      "\u001b[0;31mIndexError\u001b[0m                                Traceback (most recent call last)",
+      "\u001b[0;32m<ipython-input-2-f7378930c369>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     13\u001b[0m \u001b[0mspobj\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mfsldir\u001b[0m\u001b[0;34m+\u001b[0m\u001b[0;34m'/bin/fslstats'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0moutfile\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'-V'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mstdout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mPIPE\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     14\u001b[0m \u001b[0msout\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mspobj\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstdout\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'utf-8'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 15\u001b[0;31m \u001b[0mvol_vox\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msout\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     16\u001b[0m \u001b[0mvol_mm\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msout\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msplit\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     17\u001b[0m \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'Volumes are: '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvol_vox\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m' in voxels and '\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mvol_mm\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m' in 
mm'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
+      "\u001b[0;31mIndexError\u001b[0m: list index out of range"
+     ],
+     "output_type": "error"
+    }
+   ],
   "source": [
    "#!/usr/bin/env fslpython\n",
    "import os, sys\n",
@@ -235,9 +272,55 @@
    "vol_mm = float(sout.split()[1])\n",
    "print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')"
   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": []
  }
 ],
- "metadata": {},
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.2"
+  },
+  "toc": {
+   "colors": {
+    "hover_highlight": "#DAA520",
+    "running_highlight": "#FF0000",
+    "selected_highlight": "#FFD700"
+   },
+   "moveMenuLeft": true,
+   "nav_menu": {
+    "height": "105px",
+    "width": "252px"
+   },
+   "navigate_menu": true,
+   "number_sections": true,
+   "sideBar": true,
+   "threshold": 4.0,
+   "toc_cell": false,
+   "toc_section_display": "block",
+   "toc_window_display": false
+  }
+ },
 "nbformat": 4,
 "nbformat_minor": 2
 }
 %% Cell type:markdown id: tags:

 # Callable scripts in python

 In this tutorial we will cover how to write simple stand-alone scripts in python that can be used as alternatives to bash scripts.

 There are some code blocks within this webpage, but we recommend that you write the code in an IDE or editor instead and then run the scripts from a terminal.

 ## Basic script

 The first line of a python script is usually:

 %% Cell type:code id: tags:

-``` 
+``` python
 #!/usr/bin/env python
 ```

 %% Cell type:markdown id: tags:

 which invokes whichever version of python can be found by `/usr/bin/env` since python can be located in many different places.

 For FSL scripts we use an alternative, to ensure that we pick up the version of python (and associated packages) that we ship with FSL.  To do this we use the line:

 %% Cell type:code id: tags:

-``` 
+``` python
 #!/usr/bin/env fslpython
 ```

 %% Cell type:markdown id: tags:

 After this line the rest of the file just uses regular python syntax, as in the other tutorials.  Make sure you make the file executable - just like a bash script.

 ## Calling other executables

 The most essential call that you need to use to replicate the way a bash script calls executables is `subprocess.run()`.  A simple call looks like this:

 %% Cell type:code id: tags:

-``` 
+``` python
 import subprocess as sp
 sp.run(['ls', '-la'])
 ```

 %% Cell type:markdown id: tags:

 To stop the output from being printed (it is captured in `spobj.stdout` instead) do this:

 %% Cell type:code id: tags:

-``` 
+``` python
 spobj = sp.run(['ls'], stdout = sp.PIPE)
 ```

 %% Cell type:markdown id: tags:

 To store the output do this:

 %% Cell type:code id: tags:

-``` 
+``` python
 spobj = sp.run('ls -la'.split(), stdout = sp.PIPE)
 sout = spobj.stdout.decode('utf-8')
 print(sout)
 ```

 %% Cell type:markdown id: tags:

 > Note that the `decode` call in the middle line converts the string from a byte string to a normal string. In Python 3 there is a distinction between strings (sequences of characters, possibly using multiple bytes to store each character) and bytes (sequences of bytes). The world has moved on from ASCII, so in this day and age, this distinction is absolutely necessary, and Python does a fairly good job of it.
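
To make the bytes/string distinction in the note concrete, here is a minimal sketch. It runs the current python interpreter (via `sys.executable`) as a portable stand-in for a real command, so it does not need FSL. It also shows `universal_newlines=True`, an option of `subprocess.run()` that returns text directly. Note that this option decodes with the locale's preferred encoding, which may not be UTF-8 on every system.

```python
import subprocess as sp
import sys

# stand-in command: the running python interpreter prints two numbers
cmd = [sys.executable, '-c', 'print(42.5, 7)']

# by default the captured output is a bytes object
raw = sp.run(cmd, stdout=sp.PIPE)
print(type(raw.stdout))
decoded = raw.stdout.decode('utf-8')
print(type(decoded))

# with universal_newlines=True the output is decoded for us
as_text = sp.run(cmd, stdout=sp.PIPE, universal_newlines=True)
values = [float(v) for v in as_text.stdout.split()]
print(values)
```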

 If the output is numerical then this can be extracted like this:

 %% Cell type:code id: tags:

-``` 
+``` python
 import os
 fsldir = os.getenv('FSLDIR')
 spobj = sp.run([fsldir+'/bin/fslstats', fsldir+'/data/standard/MNI152_T1_1mm_brain', '-V'], stdout = sp.PIPE)
 sout = spobj.stdout.decode('utf-8')
 vol_vox = float(sout.split()[0])
 vol_mm = float(sout.split()[1])
 print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
 ```

 %% Cell type:markdown id: tags:

 An alternative way to run a set of commands would be like this:

 %% Cell type:code id: tags:

-``` 
+``` python
 commands = """
 {fsldir}/bin/fslmaths {t1} -bin {t1_mask}
 {fsldir}/bin/fslmaths {t2} -mas {t1_mask} {t2_masked}
 """

 fsldirpath = os.getenv('FSLDIR')
 commands = commands.format(t1 = 't1.nii.gz', t1_mask = 't1_mask', t2 = 't2', t2_masked = 't2_masked', fsldir = fsldirpath)

 sout=[]
 for cmd in commands.split('\n'):
    if cmd:   # avoids empty strings getting passed to sp.run()
        print('Running command: ', cmd)
        spobj = sp.run(cmd.split(), stdout = sp.PIPE)
        sout.append(spobj.stdout.decode('utf-8'))
 ```
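
The loop above splits each command on whitespace and ignores whether the command succeeded. As a hedged sketch of two possible refinements (not part of the original tutorial code): `shlex.split()` respects shell-style quoting, which matters for arguments containing spaces, and `check=True` makes `subprocess.run()` raise `CalledProcessError` on a non-zero exit code. The failing command here is just the python interpreter exiting with status 3.

```python
import shlex
import subprocess as sp
import sys

cmd = 'echo "hello world"'

# str.split() breaks the quoted argument into two pieces
naive = cmd.split()
print(naive)

# shlex.split() keeps the quoted argument together
quoted = shlex.split(cmd)
print(quoted)

# check=True turns a non-zero exit code into an exception
try:
    sp.run([sys.executable, '-c', 'import sys; sys.exit(3)'], check=True)
    failed_code = 0
except sp.CalledProcessError as err:
    failed_code = err.returncode
    print('Command failed with exit code', failed_code)
```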

 %% Cell type:markdown id: tags:

 ## Command line arguments

 The simplest way of dealing with command line arguments is to use the module `sys`, which gives access to an `argv` list:

 %% Cell type:code id: tags:

-``` 
+``` python
 import sys
 print(len(sys.argv))
 print(sys.argv[0])
 ```

 %% Cell type:markdown id: tags:

 For more sophisticated argument parsing you can use `argparse` -  good documentation and examples of this can be found on the web.
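
As a starting point, a minimal `argparse` sketch might look like the following. The argument names (`infile`, `outfile`, `--verbose`) and the `parse_args` helper are chosen purely for illustration. In a real script you would call `parse_args()` without arguments so that it reads `sys.argv`.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Mask an image and report its volume')
    parser.add_argument('infile', help='input image')
    parser.add_argument('outfile', help='output image')
    parser.add_argument('-v', '--verbose', action='store_true',
                        help='print extra information')
    return parser.parse_args(argv)


# pass an explicit list here so the example runs inside a notebook
args = parse_args(['in.nii.gz', 'out.nii.gz', '--verbose'])
print(args.infile, args.outfile, args.verbose)
```

With this approach `argparse` prints a usage message and exits if a required argument is missing, so no manual check on the length of `sys.argv` is needed.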


 ## Example script

 Here is a simple bash script (it masks an image and calculates volumes - just as a random example). DO NOT execute the code blocks here within the notebook/webpage:

 %% Cell type:code id: tags:

-``` 
+``` python
+%%bash
 #!/bin/bash
 if [ $# -lt 2 ] ; then
  echo "Usage: $0 <input image> <output image>"
  exit 1
 fi
 infile=$1
 outfile=$2
 # mask input image with MNI
 $FSLDIR/bin/fslmaths $infile -mas $FSLDIR/data/standard/MNI152_T1_1mm_brain $outfile
 # calculate volumes of masked image
 vv=`$FSLDIR/bin/fslstats $outfile -V`
 vol_vox=`echo $vv | awk '{ print $1 }'`
 vol_mm=`echo $vv | awk '{ print $2 }'`
 echo "Volumes are: $vol_vox in voxels and $vol_mm in mm"
 ```

+%% Output
+
+    Usage: bash <input image> <output image>
+
 %% Cell type:markdown id: tags:

 And an alternative in python:

 %% Cell type:code id: tags:

-``` 
+``` python
 #!/usr/bin/env fslpython
 import os, sys
 import subprocess as sp
 fsldir=os.getenv('FSLDIR')
 if len(sys.argv)<3:  # sys.argv[0] is the script name, so two arguments means a length of 3
  print('Usage: ', sys.argv[0], ' <input image> <output image>')
  sys.exit(1)
 infile = sys.argv[1]
 outfile = sys.argv[2]
 # mask input image with MNI
 spobj = sp.run([fsldir+'/bin/fslmaths', infile, '-mas', fsldir+'/data/standard/MNI152_T1_1mm_brain', outfile], stdout = sp.PIPE)
 # calculate volumes of masked image
 spobj = sp.run([fsldir+'/bin/fslstats', outfile, '-V'], stdout = sp.PIPE)
 sout = spobj.stdout.decode('utf-8')
 vol_vox = float(sout.split()[0])
 vol_mm = float(sout.split()[1])
 print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
 ```
+
+%% Output
+
+    ---------------------------------------------------------------------------
+    IndexError                                Traceback (most recent call last)
+    <ipython-input-2-f7378930c369> in <module>()
+         13 spobj = sp.run([fsldir+'/bin/fslstats', outfile, '-V'], stdout = sp.PIPE)
+         14 sout = spobj.stdout.decode('utf-8')
+    ---> 15 vol_vox = float(sout.split()[0])
+         16 vol_mm = float(sout.split()[1])
+         17 print('Volumes are: ', vol_vox, ' in voxels and ', vol_mm, ' in mm')
+    IndexError: list index out of range
+
+%% Cell type:code id: tags:
+
+``` python
+```