In [1]:
%matplotlib inline
%pylab inline
Populating the interactive namespace from numpy and matplotlib

NumPy - multidimensional data arrays

NumPy in a nutshell is a multidimensional array library.

Why does one need numpy?

In [2]:
l = list(range(10000))
a = np.arange(10000)
In [3]:
l[:10], a[:10]
Out[3]:
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]))
In [4]:
%timeit [x + 1 for x in l]
1000 loops, best of 3: 748 µs per loop

In [5]:
%timeit a + 1
10000 loops, best of 3: 15.4 µs per loop

Creating numpy arrays

There are a number of ways to initialize new numpy arrays, for example from

From lists

For example, to create new vector and matrix arrays from Python lists we can use the numpy.array function.

In [6]:
# one dimensional array, the argument to the array function is a Python list
# will cast to 'best' type as required: int -> double -> complex double
v = np.array([1,2,3,4])

v
Out[6]:
array([1, 2, 3, 4])
In [7]:
# a 2d image: the argument to the array function is a nested Python list
img = np.array([[1, 2], [3, 4]])

img
Out[7]:
array([[1, 2],
       [3, 4]])

An array has two basic properties:

In [8]:
v.dtype
Out[8]:
dtype('int64')
In [9]:
v.shape
Out[9]:
(4,)
In [10]:
img.shape
Out[10]:
(2, 2)

Many functions take arguments named dtype or shape (sometimes size) that define how the output looks like:

In [11]:
img = np.array([[1, 2], [3, 4]], dtype=np.complex64)

img
Out[11]:
array([[ 1.+0.j,  2.+0.j],
       [ 3.+0.j,  4.+0.j]], dtype=complex64)

Using array-generating functions

For larger arrays it is inpractical to initialize the data manually, using explicit pythons lists. Instead we can use one of the many functions in numpy that generates arrays of different forms. Some of the more common are:

emtpy, zeros and ones

In [12]:
np.empty((3,3))
Out[12]:
array([[  0.00000000e+000,   0.00000000e+000,   3.16601930e-316],
       [  3.21323221e-316,   3.17501841e-316,   3.21372865e-316],
       [  1.66064396e-316,   0.00000000e+000,   6.90268638e-310]])
In [13]:
np.zeros((3,3))
Out[13]:
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
In [14]:
np.ones((3,3), dtype=np.int16)
Out[14]:
array([[1, 1, 1],
       [1, 1, 1],
       [1, 1, 1]], dtype=int16)
In [15]:
# get same type of array
np.empty_like(v)
np.ones_like(v)
np.zeros_like(v)
np.full_like(v, 3) #numpy 1.8
Out[15]:
array([3, 3, 3, 3])

arange

In [16]:
# create a range

x = np.arange(0, 10, 1) # arguments: start, stop, step, [dtype]

x
Out[16]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [17]:
x = np.arange(-1, 1, 0.1)

x
Out[17]:
array([ -1.00000000e+00,  -9.00000000e-01,  -8.00000000e-01,
        -7.00000000e-01,  -6.00000000e-01,  -5.00000000e-01,
        -4.00000000e-01,  -3.00000000e-01,  -2.00000000e-01,
        -1.00000000e-01,  -2.22044605e-16,   1.00000000e-01,
         2.00000000e-01,   3.00000000e-01,   4.00000000e-01,
         5.00000000e-01,   6.00000000e-01,   7.00000000e-01,
         8.00000000e-01,   9.00000000e-01])

linspace and logspace

In [18]:
# using linspace, end points are included by default and you define the number of points instead of the step
np.linspace(0, 10, num=30, endpoint=True)
Out[18]:
array([  0.        ,   0.34482759,   0.68965517,   1.03448276,
         1.37931034,   1.72413793,   2.06896552,   2.4137931 ,
         2.75862069,   3.10344828,   3.44827586,   3.79310345,
         4.13793103,   4.48275862,   4.82758621,   5.17241379,
         5.51724138,   5.86206897,   6.20689655,   6.55172414,
         6.89655172,   7.24137931,   7.5862069 ,   7.93103448,
         8.27586207,   8.62068966,   8.96551724,   9.31034483,
         9.65517241,  10.        ])
In [19]:
np.logspace(0, 10, num=10, base=np.e)
Out[19]:
array([  1.00000000e+00,   3.03773178e+00,   9.22781435e+00,
         2.80316249e+01,   8.51525577e+01,   2.58670631e+02,
         7.85771994e+02,   2.38696456e+03,   7.25095809e+03,
         2.20264658e+04])

random data

In [20]:
from numpy import random
In [21]:
# standard normal distributed random numbers
random.normal(loc=-5., scale=5, size=(5,5))
Out[21]:
array([[-10.07496   ,  -8.90712815,  -2.59793592,  -8.89634474,
        -10.60483511],
       [ -8.93133842,  -0.4726435 , -10.20600326, -10.44268975,  -0.9836462 ],
       [-11.01432594,  -9.38263595,  -0.3652716 ,  -8.30859508,
          1.38892859],
       [  0.99216126, -12.00298969,  -9.85226298,  -2.32475633,
         -8.16799242],
       [ -7.30244278,  -7.24388802,  -6.26780013,  -7.97688188,
         -4.67301287]])

File I/O

Comma-separated values (CSV)

A very common file format for data files are the comma-separated values (CSV), or related format such as TSV (tab-separated values). To read data from such file into Numpy arrays we can use the numpy.genfromtxt or numpy.loadtxt function. For example:

In [22]:
!head stockholm_td_adj.dat
#year month day min   temperature   max dontknow
1800  1  1    -6.1    -6.1    -6.1 1
1800  1  2   -15.4   -15.4   -15.4 1
1800  1  3   -15.0   -15.0   -15.0 1
1800  1  4   -19.3   -19.3   -19.3 1
1800  1  5   -16.8   -16.8   -16.8 1
1800  1  6   -11.4   -11.4   -11.4 1
1800  1  7    -7.6    -7.6    -7.6 1
1800  1  8    -7.1    -7.1    -7.1 1
1800  1  9   -10.1   -10.1   -10.1 1

In [23]:
data = np.genfromtxt('stockholm_td_adj.dat')
data.shape, data.dtype
Out[23]:
((77431, 7), dtype('float64'))
In [24]:
!wc -l stockholm_td_adj.dat
77432 stockholm_td_adj.dat

In [25]:
data = np.genfromtxt('stockholm_td_adj.dat', names=True)
# structured array
data.shape, data.dtype
Out[25]:
((77431,),
 dtype([('year', '<f8'), ('month', '<f8'), ('day', '<f8'), ('min', '<f8'), ('temperature', '<f8'), ('max', '<f8'), ('dontknow', '<f8')]))
In [26]:
data['year'][:10]
Out[26]:
array([ 1800.,  1800.,  1800.,  1800.,  1800.,  1800.,  1800.,  1800.,
        1800.,  1800.])
In [27]:
fig, ax = subplots(figsize=(14,4))

x = data['year'] + data['month'] / 12.0 + data['day'] / 365.
y = data['temperature']

ax.plot(x, y)
ax.axis('tight')
ax.set_title('temperatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('tempature (C)');

Manipulating arrays

Changing the shape

The shape (dimensionality) of an array can be changed as long as the new shape has the same number of elements

In [28]:
vec = np.arange(9)
# create a view of vec with a different shape
imgview = vec.reshape(3, 3)

imgview
Out[28]:
array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])
In [29]:
# the ordering of the elements can be defined (Fortran or C)
vec.reshape(3, 3, order='F')
Out[29]:
array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])
In [30]:
# the value of a negative entry will be infered from the remaining free entries
vec.reshape(3, -1).shape
Out[30]:
(3, 3)

Reshaping will create a view/reference of the original data array. Changing the reshaped array will also change the original array.

In [31]:
imgview[0] = 5
vec
Out[31]:
array([5, 5, 5, 3, 4, 5, 6, 7, 8])

Indexing

We can index elements in an array using the square bracket and indices:

In [32]:
# regular python indexing [start:stop:step]
vec[3], vec[3::-1]
Out[32]:
(3, array([3, 5, 5, 5]))

The dimensions are separated by commas in the brackets:

In [33]:
imgview[1,0]
Out[33]:
3
In [34]:
#row
imgview[0], imgview[0,:]
Out[34]:
(array([5, 5, 5]), array([5, 5, 5]))
In [35]:
#column
imgview[:, 0]
# Ellipsis means all free axis, this selects the zero element of the last dimension, regardless of the dimension of imgview
imgview[..., 0]
Out[35]:
array([5, 3, 6])
In [36]:
# assignment on a indexed array
imgview[-1, :] = 1
imgview
Out[36]:
array([[5, 5, 5],
       [3, 4, 5],
       [1, 1, 1]])

Compared to Python lists, a slice of an array is no copy, it is a view on the original data copies must be explicit, e.g. with

imgview[0, :].copy()

Basic scalar math on arrays

The scalar math operation on arrays work element-wise:

In [37]:
a = np.ones((5,))
b = np.zeros_like(a)
a, b
Out[37]:
(array([ 1.,  1.,  1.,  1.,  1.]), array([ 0.,  0.,  0.,  0.,  0.]))
In [38]:
b = a + a
b
Out[38]:
array([ 2.,  2.,  2.,  2.,  2.])
In [39]:
# in place
a /= b
a
Out[39]:
array([ 0.5,  0.5,  0.5,  0.5,  0.5])
In [40]:
a + 1
Out[40]:
array([ 1.5,  1.5,  1.5,  1.5,  1.5])
In [41]:
a + np.ones((4,))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-41-2d6215575c4f> in <module>()
----> 1 a + np.ones((4,))

ValueError: operands could not be broadcast together with shapes (5) (4) 

Broadcasting

Broadcasting applies if the shapes of a set of arrays do not match but the shapes can be made to match by repeating certain dimensions.

The basic rules when broadcasting can be applied are:

Broadcasting documentation

The simplest case is adding a zero dimensional array (scalar) to a higher dimensional array:

In [43]:
from IPython.display import Image
Image(filename='images/broadcast_11.png')
Out[43]:
In [44]:
# subtract vector from an image (e.g. remove overscan offset):
img = np.arange(4*3).reshape(4,3)
o = np.array([1, 2, 3])
#dim(4, 3) - dim(3)
img - o
Out[44]:
array([[-1, -1, -1],
       [ 2,  2,  2],
       [ 5,  5,  5],
       [ 8,  8,  8]])
In [45]:
img = img.reshape(3,4)
o = np.array([1, 2, 3])
#dim(3, 4) - dim(3), trailing dimension does not match
img - o
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-5fbf216187b7> in <module>()
      2 o = np.array([1, 2, 3])
      3 #dim(3, 4) - dim(3), trailing dimension does not match
----> 4 img - o

ValueError: operands could not be broadcast together with shapes (3,4) (3) 
In [46]:
# solution: add a new empty dimension (size 1)
# dim(3, 4) - dim(3, 1)
img - o[:,np.newaxis]
Out[46]:
array([[-1,  0,  1,  2],
       [ 2,  3,  4,  5],
       [ 5,  6,  7,  8]])

Fancy indexing

Arrays or lists of integers can also be used as indices to other arrays:

In [47]:
v = np.arange(5, 10)
row_indices = [0, 2, 4]
print(v)
print(row_indices, '->', v[row_indices])
[5 6 7 8 9]
([0, 2, 4], '->', array([5, 7, 9]))

In [48]:
img = np.arange(25).reshape(5, 5)
col_indices = [1, 2, -1] # index -1 means the last element
img[row_indices, col_indices]
Out[48]:
array([ 1, 12, 24])
In [49]:
img[:, col_indices]
Out[49]:
array([[ 1,  2,  4],
       [ 6,  7,  9],
       [11, 12, 14],
       [16, 17, 19],
       [21, 22, 24]])

Functions related with positions of data like sorting, partitioning, min/max often have a arg variant which instead of returning the result, return an index array that when applied to the original array return the result:

In [50]:
v = random.randint(0, 5, size=10)
sortindices = v.argsort()
sortindices, v[sortindices]
Out[50]:
(array([3, 4, 0, 5, 6, 8, 9, 1, 2, 7]), array([0, 0, 1, 1, 1, 1, 3, 4, 4, 4]))

We can also index with boolean masks: If the index array is an Numpy array of with data type bool, then an element is selected (True) or not (False) depending on the value of the index mask at the position each element:

In [51]:
x = arange(0, 10, 0.5)
x
Out[51]:
array([ 0. ,  0.5,  1. ,  1.5,  2. ,  2.5,  3. ,  3.5,  4. ,  4.5,  5. ,
        5.5,  6. ,  6.5,  7. ,  7.5,  8. ,  8.5,  9. ,  9.5])
In [52]:
# & is a bitwise and, which, for booleans, is equivalent to np.logical_and(a, b)
mask = (x >= 5) & (x < 7.5)

mask
Out[52]:
array([False, False, False, False, False, False, False, False, False,
       False,  True,  True,  True,  True,  True, False, False, False,
       False, False], dtype=bool)
In [53]:
x[mask]
Out[53]:
array([ 5. ,  5.5,  6. ,  6.5,  7. ])

NumPy also contains a special array class which carries a mask and ignores the masked values for all arithmetic operations:

In [54]:
mx = np.ma.MaskedArray(x, mask=~((5 < x) & (x < 7.5)))
mx
Out[54]:
masked_array(data = [-- -- -- -- -- -- -- -- -- -- -- 5.5 6.0 6.5 7.0 -- -- -- -- --],
             mask = [ True  True  True  True  True  True  True  True  True  True  True False
 False False False  True  True  True  True  True],
       fill_value = 1e+20)

In [55]:
x.sum(), mx.sum()
Out[55]:
(95.0, 25.0)

where

The index mask can be converted to position index using the where function

In [56]:
indices = np.where((x > 5) & (x < 7.5))

indices
Out[56]:
(array([11, 12, 13, 14]),)

It can also be used to conditionally select from two options

In [57]:
a = np.ones_like(x)
np.where((x > 5) & (x < 7.5), a, 5) # note the broadcasting of the second option
Out[57]:
array([ 5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  5.,  1.,  1.,
        1.,  1.,  5.,  5.,  5.,  5.,  5.])
In [57]: