added string manipulations

still need file input/output

Commit 5affdf9b authored 7 years ago by Michiel Cottaar
--- a/getting_started/02_text_io.md
+++ b/getting_started/02_text_io.md
+# Text input/output
+
+In this section we will explore how to write our data to text files and how to read it back.
+
+Most of the functionality for reading/writing files and manipulating strings is available without any imports. However, you can find some additional functionality in the [`string`](https://docs.python.org/3.6/library/string.html) module.
+
+Most of the string functions are available as methods on string objects. This means that you can use the ipython autocomplete to check for them.
+```
+empty_string = ''
+```
+
+```
+empty_string.    # after running the code block above, put your cursor behind the dot and press tab to get a list of methods
+```
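
Outside of IPython, where tab completion may not be available, the built-in `dir()` function gives a similar overview. The snippet below is a minimal sketch; the variable name `public_methods` is only illustrative.

```python
# List the public methods of a string object without relying on tab completion
empty_string = ''
public_methods = [name for name in dir(empty_string) if not name.startswith('_')]
print(public_methods)
```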
+
+<a class="anchor" id="creating-new-strings"></a>
+## Creating new strings
+
+<a class="anchor" id="string-syntax"></a>
+### String syntax
+Single-line strings can be created in python using either single or double quotes:
+```
+a_string = 'To be or not to be'
+same_string = "To be or not to be"
+print(a_string == same_string)
+```
+
+The main reason to prefer single or double quotes is whether the string itself contains any quotes. You can include a single quote in a string surrounded by single quotes by escaping it with the `\` character:
+```
+a_string = "That's the question"
+same_string = 'That\'s the question'
+print(a_string == same_string)
+```
+
+New-lines (`\n`), tabs (`\t`) and many other special characters are supported
+```
+a_string = "This is the first line.\nAnd here is the second.\n\tThe third starts with a tab."
+print(a_string)
+```
+
+However, the easiest way to create multi-line strings is to use triple quotes (again, either single or double quotes can be used):
+```
+multi_line_string = """This is the first line.
+And here is the second.
+\tThird line starts with a tab."""
+print(multi_line_string)
+```
+
+If you don't want python to interpret the `\n`, `\t`, etc. in your strings as special characters, you can prepend the quotes enclosing the string with an `r`. Python will then treat the following string as raw text, in which a backslash is just a backslash.
+```
+single_line_string = r"This string is not multiline.\nEven though it contains the \n character"
+print(single_line_string)
+```
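
Raw strings are mostly useful for text that contains many backslashes, such as Windows-style paths or regular expressions. The example below is a small sketch with a made-up path; it shows that a raw string is an ordinary string, just written with less escaping.

```python
# The same (hypothetical) path written with escaped backslashes and as a raw string
escaped = "C:\\new_folder\\table.txt"
raw = r"C:\new_folder\table.txt"
print(raw)
print(escaped == raw)  # both spellings create the same string

# In a raw string "\n" is two characters: a backslash followed by the letter n
print(len(r"\n"), len("\n"))
```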
+
+<a class="anchor" id="unicode-versus-bytes"></a>
+#### Unicode versus bytes
+To encourage the spread of python around the world, python 3 switched to using unicode as the default for strings and code (which is one of the main reasons for the incompatibility between python 2 and 3). This means that any unicode characters can be used in strings (or in our code):
+```
+Δ = "café"
+print(Δ)
+```
+
+Python 3 uses UTF-8 encoding by default, although you can change this in any file (see [python documentation on encoding](https://docs.python.org/3/howto/unicode.html) for more details)
+
+In python 2 the string object was a simple array of bytes. You can create such a byte array from your unicode string in python 3 using the encode method
+```
+delta = "Δ"
+print(delta, ' in python 2 would be represented as ', delta.encode())
+```
+
+These byte arrays can also be created directly by prepending the quotes enclosing the string with a `b`, which tells python 3 to interpret the following as a byte array:
+```
+a_byte_array = b'\xce\xa9'
+print('The bytes ', a_byte_array, ' become ', a_byte_array.decode(), ' with UTF-8 encoding')
+```
+
+Especially in code dealing with strings (e.g., reading/writing of files), many of the errors that appear when running python 2 code in python 3 are caused by mixing unicode strings with byte arrays. Decoding the byte arrays (or encoding the strings) so that both objects have the same type often fixes these issues.
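
As an illustration of such a mix-up, the sketch below tries to concatenate a unicode string with a byte array and then repairs it by decoding. The variable names are illustrative and UTF-8 is assumed as the encoding.

```python
text = "Δ = 3"                  # a unicode string (type str)
data = text.encode("utf-8")     # the same text as a byte array (type bytes)
print(type(text), len(text))    # length counted in characters
print(type(data), len(data))    # length counted in bytes (Δ takes two bytes in UTF-8)

try:
    combined = text + data      # mixing str and bytes is not allowed in python 3
except TypeError as error:
    print("Cannot mix str and bytes:", error)

# Decoding the bytes first gives two objects of the same type
combined = text + data.decode("utf-8")
print(combined)
```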
+
+<a class="anchor" id="converting-objects-into-strings"></a>
+### Converting objects into strings
+There are two functions to convert python objects into strings, `repr()` and `str()`.
+All other functions that rely on string-representations of python objects will use one of these two (for example the `print()` function will call `str()` on the object).
+
+The goal of the `str()` function is to be readable, while the goal of `repr()` is to be unambiguous. For example
+```
+print(str("3"))
+print(str(3))
+```
+While the output of both `str()` calls is very readable, we cannot tell from it whether the input was a string or an actual integer.
+
+```
+print(repr("3"))
+print(repr(3))
+```
+Note that the output of the `repr()` function can be passed directly back to the python interpreter to recreate our string/integer.
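
This round trip can be checked with a short sketch. Using `ast.literal_eval()` from the standard library to evaluate the text is a choice made for this example (it only accepts literals, which makes it safer than `eval()`); it is not something the tutorial itself prescribes.

```python
import ast

for value in ["3", 3, 3.0, [1, "two"]]:
    text = repr(value)                 # unambiguous text representation
    restored = ast.literal_eval(text)  # turn the text back into an object
    print(text, '->', type(restored).__name__, restored == value)

# str() loses the distinction: the string "3" comes back as the integer 3
print(type(ast.literal_eval(str("3"))).__name__)
```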
+
+<a class="anchor" id="combining-strings"></a>
+### Combining strings
+The simplest way to concatenate strings is to simply add them together:
+```
+a_string = "Part 1"
+other_string = "Part 2"
+full_string = a_string + ", " + other_string
+print(full_string)
+```
+
+Given a whole sequence of strings, you can concatenate them together using the `join()` method:
+```
+list_of_strings = ['first', 'second', 'third', 'fourth']
+full_string = ', '.join(list_of_strings)
+print(full_string)
+```
+
+Note that the string on which the `join()` method is called (`', '` in this case) is used as a delimiter to separate the different strings. If you just want to concatenate the strings you can call `join()` on the empty string:
+```
+list_of_strings = ['first', 'second', 'third', 'fourth']
+full_string = ''.join(list_of_strings)
+print(full_string)
+```
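
One detail worth knowing is that `join()` only accepts strings. A minimal sketch of how to handle other objects, by converting each item with `str()` first (the list of numbers is made up for this example):

```python
numbers = [1, 2, 3.5]
# ', '.join(numbers) would raise a TypeError, because the items are not strings
full_string = ', '.join(str(number) for number in numbers)
print(full_string)
```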
+
+<a class="anchor" id="string-formatting"></a>
+### String formatting
+Using the techniques in [Combining strings](#combining-strings) we can build simple strings. For longer strings it is often useful to first write a template string with some placeholders, into which variables are later inserted. Python currently has 4 different built-in ways of doing this (with many packages providing similar capabilities):
+* the recommended [new-style formatting](https://docs.python.org/3.6/library/string.html#format-string-syntax).
+* printf-like [old-style formatting](https://docs.python.org/3/library/stdtypes.html#old-string-formatting)
+* [formatted string literals](https://docs.python.org/3.6/reference/lexical_analysis.html#f-strings) (these are only available in python 3.6+)
+* bash-like [template-strings](https://docs.python.org/3.6/library/string.html#template-strings)
+
+Here we provide a single example using the first three methods, so you can recognize them in the future.
+
+First the old printf style. Note that this style is invoked by using the modulo (`%`) operator on the string. Every placeholder (starting with `%`) is then replaced by one of the values provided.
+```
+a = 3
+b = 1 / 3
+
+print('%.3f = %i + %.3f' % (a + b, a, b))
+print('%(total).3f = %(a)i + %(b).3f' % {'a': a, 'b': b, 'total': a + b})
+```
+
+Then the recommended new style formatting (You can find a nice tutorial [here](https://www.digitalocean.com/community/tutorials/how-to-use-string-formatters-in-python-3)). Note that this style is invoked by calling the `format()` method on the string and the placeholders are marked by the curly braces `{}`.
+```
+a = 3
+b = 1 / 3
+
+print('{:.3f} = {} + {:.3f}'.format(a + b, a, b))
+print('{total:.3f} = {a} + {b:.3f}'.format(a=a, b=b, total=a+b))
+```
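
Because the template is an ordinary string, it can be stored in a variable and filled in several times. The sketch below uses made-up names and scores to show this.

```python
# A template defined once and re-used for several sets of values
template = '{name} scored {score:.1f} out of {maximum}'
results = [('Alice', 8.26), ('Bob', 6.5)]

lines = [template.format(name=name, score=score, maximum=10) for name, score in results]
for line in lines:
    print(line)
```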
+
+Finally the new, fancy formatted string literals (only available in python 3.6+). This new format is very similar to the recommended style, except that all placeholders are automatically evaluated in the local environment at the time the template is defined. This means that we do not have to explicitly provide the parameters (and we can evaluate the sum inside the string!), although it does mean we also can not re-use the template.
+```
+a = 3
+b = 1/3
+
+print(f'{a + b:.3f} = {a} + {b:.3f} = {a + b}')
+```
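
For completeness, here is a sketch of the fourth option from the list above, the bash-like template strings from the `string` module. These placeholders do not support format specifications such as `.3f`, so in this example the numbers are formatted with the built-in `format()` function before they are substituted.

```python
from string import Template

a = 3
b = 1 / 3

template = Template('$total = $a + $b')
# Template placeholders are plain substitutions, so we format the floats ourselves
line = template.substitute(a=a, b=format(b, '.3f'), total=format(a + b, '.3f'))
print(line)
```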
+
+
+<a class="anchor" id="reading-writing-files"></a>
+## Reading/writing files
+
+
+
+## Extracting sub-strings from strings
+### Splitting strings
+The simplest way to extract a sub-string is to use slicing
+```
+a_string = 'abcdefghijklmnopqrstuvwxyz'
+print(a_string[10])  # create a string containing only the character at index 10 (the 11th character, as indexing starts at 0)
+print(a_string[20:])  # create a string containing everything from index 20 onward
+print(a_string[::-1])  # creating the reverse string
+```
+
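+If you are not sure where to cut into a string, you can use the `find()` method to find the first occurrence of a sub-string. Strings have no `findall()` method; to find all occurrences you can use `re.findall()` from the `re` module (see [Regular expressions](#regular-expressions)).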
+```
+a_string = 'abcdefghijklmnopqrstuvwxyz'
+index = a_string.find('fgh')
+print(a_string[:index])  # extracts the sub-string up to the first occurrence of 'fgh'
+print('index for non-existent sub-string', a_string.find('cats'))  # note that find returns -1 when it cannot find the sub-string rather than raising an error.
+```
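
Strings can also be cut at every occurrence of a delimiter with the `split()` method, which is the counterpart of `join()`. A short sketch with made-up input:

```python
# Without an argument, split() cuts at any run of whitespace
sentence = 'To be or   not to be'
words = sentence.split()
print(words)

# With an argument, split() cuts at every occurrence of that delimiter
line = 'apples, pears, plums'
fruits = line.split(', ')
print(fruits)

# join() undoes the split
print(', '.join(fruits) == line)
```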
+
+### Regular expressions
+Regular expressions are used for looking for specific patterns in a longer string. This can be used to extract specific information from a well-formatted string or to modify a string. In python regular expressions are available in the [re](https://docs.python.org/3/library/re.html#re-syntax) module.
+
+A full discussion of regular expressions goes far beyond this tutorial. If you are interested, have a look at the [regular expression HOWTO](https://docs.python.org/3/howto/regex.html).
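
To give a flavour of what the `re` module can do, the sketch below extracts and replaces dates in a made-up sentence. The pattern `\d{4}-\d{2}-\d{2}` matches four digits, a dash, two digits, a dash, and two digits; the parentheses in the first pattern mark the groups we want to extract.

```python
import re

text = 'Measured on 2018-01-25 and again on 2018-02-03'

# search() returns the first match (or None if there is no match)
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', text)
if match is not None:
    year, month, day = match.groups()
    print(year, month, day)

# findall() returns all non-overlapping matches as a list of strings
all_dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(all_dates)

# sub() replaces every match with the given replacement
cleaned = re.sub(r'\d{4}-\d{2}-\d{2}', '<date>', text)
print(cleaned)
```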
+
+## Exercises
+### Joining/splitting strings
+go from 2 column file to 2 rows
+### String formatting and regular expressions
+Given a template for MRI files:
+
+`s<subject_id>/<modality>_<res>mm.nii.gz`
+
+where `<subject_id>` is a 6-digit subject ID, `<modality>` is one of T1w, T2w, or PD, and `<res>` is the resolution of the image (with at most one digit after the decimal point, e.g. 1.5).
+
+Write a function that takes the subject_id (as an integer), the modality (as a string), and the resolution (as a float) and returns the complete filename (Hint: use one of the formatting techniques mentioned in [String formatting](#string-formatting)).
+```
+def get_filename(subject_id, modality, resolution):
+    ...
+```
+
+For a more difficult exercise, write a function that extracts the subject id, modality, and resolution from a filename (using a regular expression or by using `find` and `split` to access the relevant parts of the string):
+```
+def get_parameters(filename):
+    ...
+    return subject_id, modality, resolution
+```
+