In [1]:

%matplotlib inline
%pylab inline

Populating the interactive namespace from numpy and matplotlib

NumPy - multidimensional data arrays

NumPy in a nutshell is a multidimensional array library.

Why does one need numpy?

No useful builtin homogenous data container built into Python, lists can contain arbitrary type objects which needs a lot of memory and CPU power to process and store.
Only very limited number functions available to work with containers containing homogenous data.

In [2]:

l = list(range(10000))
a = np.arange(10000)

In [3]:

l[:10], a[:10]

Out[3]:

([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))

In [4]:

%timeit [x + 1 for x in l]

1000 loops, best of 3: 748 µs per loop

In [5]:

%timeit a + 1

10000 loops, best of 3: 15.4 µs per loop

Creating `numpy` arrays

There are a number of ways to initialize new numpy arrays, for example from

a Python list or tuples
using functions that are dedicated to generating numpy arrays, such as arange, linspace, etc.
reading data from files
wrapping existing memory blocks (buffer interface, memoryview)

From lists

For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.

In [6]:

# one dimensional array, the argument to the array function is a Python list
# will cast to 'best' type as required: int -> double -> complex double
v = np.array([1,2,3,4])

v

Out[6]:

array([1, 2, 3, 4])

In [7]:

# a 2d image: the argument to the array function is a nested Python list
img = np.array([[1, 2], [3, 4]])

img

Out[7]:

array([[1, 2],
       [3, 4]])

An array has two basic properties:

dtype: object that fully describes the type of data, for example 32 bit unsigned integer or 128 bit complex floating point number

In [8]:

v.dtype

Out[8]:

dtype('int64')

shape: a tuple containing the size of each dimension, an array can have up to 32 dimensions and each dimension can be as large as the address space allows.

In [9]:

v.shape

Out[9]:

(4,)

In [10]:

img.shape

Out[10]:

(2, 2)

Many functions take arguments named dtype or shape (sometimes size) that define how the output looks like:

In [11]:

img = np.array([[1, 2], [3, 4]], dtype=np.complex64)

img

Out[11]:

array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]], dtype=complex64)

Using array-generating functions

For larger arrays it is inpractical to initialize the data manually, using explicit pythons lists. Instead we can use one of the many functions in numpy that generates arrays of different forms. Some of the more common are:

emtpy, zeros and ones

In [12]:

np.empty((3,3))

Out[12]:

array([[  0.00000000e+000,   0.00000000e+000,   3.16601930e-316],
       [  3.21323221e-316,   3.17501841e-316,   3.21372865e-316],
       [  1.66064396e-316,   0.00000000e+000,   6.90268638e-310]])

In [13]:

np.zeros((3,3))

Out[13]:

array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

In [14]:

np.ones((3,3), dtype=np.int16)

Out[14]:

array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int16)

In [15]:

# get same type of array
np.empty_like(v)
np.ones_like(v)
np.zeros_like(v)
np.full_like(v, 3) #numpy 1.8

Out[15]:

array([3, 3, 3, 3])

In [16]:

# create a range

x = np.arange(0, 10, 1) # arguments: start, stop, step, [dtype]

x

Out[16]:

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [17]:

x = np.arange(-1, 1, 0.1)

x

Out[17]:

array([ -1.00000000e+00,  -9.00000000e-01,  -8.00000000e-01,
        -7.00000000e-01,  -6.00000000e-01,  -5.00000000e-01,
        -4.00000000e-01,  -3.00000000e-01,  -2.00000000e-01,
        -1.00000000e-01,  -2.22044605e-16,   1.00000000e-01,
         2.00000000e-01,   3.00000000e-01,   4.00000000e-01,
         5.00000000e-01,   6.00000000e-01,   7.00000000e-01,
         8.00000000e-01,   9.00000000e-01])

linspace and logspace

In [18]:

# using linspace, end points are included by default and you define the number of points instead of the step
np.linspace(0, 10, num=30, endpoint=True)

Out[18]:

array([  0.        ,   0.34482759,   0.68965517,   1.03448276,
         1.37931034,   1.72413793,   2.06896552,   2.4137931 ,
         2.75862069,   3.10344828,   3.44827586,   3.79310345,
         4.13793103,   4.48275862,   4.82758621,   5.17241379,
         5.51724138,   5.86206897,   6.20689655,   6.55172414,
         6.89655172,   7.24137931,   7.5862069 ,   7.93103448,
         8.27586207,   8.62068966,   8.96551724,   9.31034483,
         9.65517241,  10.        ])

In [19]:

np.logspace(0, 10, num=10, base=np.e)

Out[19]:

array([  1.00000000e+00,   3.03773178e+00,   9.22781435e+00,
         2.80316249e+01,   8.51525577e+01,   2.58670631e+02,
         7.85771994e+02,   2.38696456e+03,   7.25095809e+03,
         2.20264658e+04])

In [20]:

from numpy import random

In [21]:

# standard normal distributed random numbers
random.normal(loc=-5., scale=5, size=(5,5))

Out[21]:

array([[-10.07496   ,  -8.90712815,  -2.59793592,  -8.89634474,
        -10.60483511],
       [ -8.93133842,  -0.4726435 , -10.20600326, -10.44268975,  -0.9836462 ],
       [-11.01432594,  -9.38263595,  -0.3652716 ,  -8.30859508,
          1.38892859],
       [  0.99216126, -12.00298969,  -9.85226298,  -2.32475633,
         -8.16799242],
       [ -7.30244278,  -7.24388802,  -6.26780013,  -7.97688188,
         -4.67301287]])

File I/O

Comma-separated values (CSV)

A very common file format for data files are the comma-separated values (CSV), or related format such as TSV (tab-separated values). To read data from such file into Numpy arrays we can use the numpy.genfromtxt or numpy.loadtxt function. For example:

In [22]:

!head stockholm_td_adj.dat

#year month day min   temperature   max dontknow
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1

In [23]:

data = np.genfromtxt('stockholm_td_adj.dat')
data.shape, data.dtype

Out[23]:

((77431, 7), dtype('float64'))

In [24]:

!wc -l stockholm_td_adj.dat

77432 stockholm_td_adj.dat

In [25]:

data = np.genfromtxt('stockholm_td_adj.dat', names=True)
# structured array
data.shape, data.dtype

Out[25]:

((77431,),
 dtype([('year', '<f8'), ('month', '<f8'), ('day', '<f8'), ('min', '<f8'), ('temperature', '<f8'), ('max', '<f8'), ('dontknow', '<f8')]))

In [26]:

data['year'][:10]

Out[26]:

array([ 1800.,  1800.,  1800.,  1800.,  1800.,  1800.,  1800.,  1800.,
        1800.,  1800.])

In [27]:

fig, ax = subplots(figsize=(14,4))

x = data['year'] + data['month'] / 12.0 + data['day'] / 365.
y = data['temperature']

ax.plot(x, y)
ax.axis('tight')
ax.set_title('temperatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('tempature (C)');

Manipulating arrays

Changing the shape

The shape (dimensionality) of an array can be changed as long as the new shape has the same number of elements

In [28]:

vec = np.arange(9)
# create a view of vec with a different shape
imgview = vec.reshape(3, 3)

imgview

Out[28]:

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [29]:

# the ordering of the elements can be defined (Fortran or C)
vec.reshape(3, 3, order='F')

Out[29]:

array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

In [30]:

# the value of a negative entry will be infered from the remaining free entries
vec.reshape(3, -1).shape

Out[30]:

(3, 3)

Reshaping will create a view/reference of the original data array. Changing the reshaped array will also change the original array.

In [31]:

imgview[0] = 5
vec

Out[31]:

array([5, 5, 5, 3, 4, 5, 6, 7, 8])

Indexing

We can index elements in an array using the square bracket and indices:

In [32]:

# regular python indexing [start:stop:step]
vec[3], vec[3::-1]

Out[32]:

(3, array([3, 5, 5, 5]))

The dimensions are separated by commas in the brackets:

In [33]:

imgview[1,0]

Out[33]:

In [34]:

#row
imgview[0], imgview[0,:]

Out[34]:

(array([5, 5, 5]), array([5, 5, 5]))

In [35]:

#column
imgview[:, 0]
# Ellipsis means all free axis, this selects the zero element of the last dimension, regardless of the dimension of imgview
imgview[..., 0]

Out[35]:

array([5, 3, 6])

In [36]:

# assignment on a indexed array
imgview[-1, :] = 1
imgview

Out[36]:

array([[5, 5, 5],
       [3, 4, 5],
       [1, 1, 1]])

Compared to Python lists, a slice of an array is no copy, it is a view on the original data copies must be explicit, e.g. with

imgview[0, :].copy()

Basic scalar math on arrays

The scalar math operation on arrays work element-wise:

In [37]:

a = np.ones((5,))
b = np.zeros_like(a)
a, b

Out[37]:

(array([ 1.,  1.,  1.,  1.,  1.]), array([ 0.,  0.,  0.,  0.,  0.]))

In [38]:

b = a + a
b

Out[38]:

array([ 2.,  2.,  2.,  2.,  2.])

In [39]:

# in place
a /= b
a

Out[39]:

array([ 0.5,  0.5,  0.5,  0.5,  0.5])

In [40]:

a + 1

Out[40]:

array([ 1.5,  1.5,  1.5,  1.5,  1.5])

In [41]:

a + np.ones((4,))

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-2d6215575c4f> in <module>()
----> 1 a + np.ones((4,))

ValueError: operands could not be broadcast together with shapes (5) (4)

Broadcasting

Broadcasting applies if the shapes of a set of arrays do not match but the shapes can be made to match by repeating certain dimensions.

The basic rules when broadcasting can be applied are:

the size of an dimension is 1
the trailing shape of the arrays match

Broadcasting documentation

The simplest case is adding a zero dimensional array (scalar) to a higher dimensional array:

In [43]:

from IPython.display import Image
Image(filename='images/broadcast_11.png')

Out[43]:

In [44]:

# subtract vector from an image (e.g. remove overscan offset):
img = np.arange(4*3).reshape(4,3)
o = np.array([1, 2, 3])
#dim(4, 3) - dim(3)
img - o

Out[44]:

array([[-1, -1, -1],
       [ 2,  2,  2],
       [ 5,  5,  5],
       [ 8,  8,  8]])

In [45]:

img = img.reshape(3,4)
o = np.array([1, 2, 3])
#dim(3, 4) - dim(3), trailing dimension does not match
img - o

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-5fbf216187b7> in <module>()
      2 o = np.array([1, 2, 3])
      3 #dim(3, 4) - dim(3), trailing dimension does not match
----> 4 img - o

ValueError: operands could not be broadcast together with shapes (3,4) (3)

In [46]:

# solution: add a new empty dimension (size 1)
# dim(3, 4) - dim(3, 1)
img - o[:,np.newaxis]

Out[46]:

array([[-1,  0,  1,  2],
       [ 2,  3,  4,  5],
       [ 5,  6,  7,  8]])

Fancy indexing

Arrays or lists of integers can also be used as indices to other arrays:

In [47]:

v = np.arange(5, 10)
row_indices = [0, 2, 4]
print(v)
print(row_indices, '->', v[row_indices])

[5 6 7 8 9]
([0, 2, 4], '->', array([5, 7, 9]))

In [48]:

img = np.arange(25).reshape(5, 5)
col_indices = [1, 2, -1] # index -1 means the last element
img[row_indices, col_indices]

Out[48]:

array([ 1, 12, 24])

In [49]:

img[:, col_indices]

Out[49]:

array([[ 1,  2,  4],
       [ 6,  7,  9],
       [11, 12, 14],
       [16, 17, 19],
       [21, 22, 24]])

Functions related with positions of data like sorting, partitioning, min/max often have a arg variant which instead of returning the result, return an index array that when applied to the original array return the result:

In [50]:

v = random.randint(0, 5, size=10)
sortindices = v.argsort()
sortindices, v[sortindices]

Out[50]:

(array([3, 4, 0, 5, 6, 8, 9, 1, 2, 7]), array([0, 0, 1, 1, 1, 1, 3, 4, 4, 4]))

We can also index with boolean masks: If the index array is an Numpy array of with data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position each element:

In [51]:

x = arange(0, 10, 0.5)
x

Out[51]:

array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])

In [52]:

# & is a bitwise and, which, for booleans, is equivalent to np.logical_and(a, b)
mask = (x >= 5) & (x < 7.5)

mask

Out[52]:

array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True,  True,  True, False, False, False,
       False, False], dtype=bool)

In [53]:

x[mask]

Out[53]:

array([ 5. ,  5.5,  6. ,  6.5,  7. ])

NumPy also contains a special array class which carries a mask and ignores the masked values for all arithmetic operations:

In [54]:

mx = np.ma.MaskedArray(x, mask=~((5 < x) & (x < 7.5)))
mx

Out[54]:

masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- 5.5 6.0 6.5 7.0 -- -- -- -- --],
             mask = [ True  True  True  True  True  True  True  True  True  True  True False
 False False False  True  True  True  True  True],
       fill_value = 1e+20)

In [55]:

x.sum(), mx.sum()

Out[55]:

(95.0, 25.0)

where

The index mask can be converted to position index using the where function

In [56]:

indices = np.where((x > 5) & (x < 7.5))

indices

Out[56]:

(array([11, 12, 13, 14]),)

It can also be used to conditionally select from two options

In [57]:

a = np.ones_like(x)
np.where((x > 5) & (x < 7.5), a, 5) # note the broadcasting of the second option

Out[57]:

array([ 5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  1.,  1.,
        1.,  1.,  5.,  5.,  5.,  5.,  5.])