%matplotlib inline
%pylab inline
NumPy in a nutshell is a multidimensional array library.
Why does one need numpy?
l = list(range(10000))
a = np.arange(10000)
l[:10], a[:10]
%timeit [x + 1 for x in l]
%timeit a + 1
numpy arrays
There are a number of ways to initialize new numpy arrays, for example with arange, linspace, etc. To create new vector and matrix arrays from Python lists we can use the numpy.array function.
# one dimensional array, the argument to the array function is a Python list
# will cast to 'best' type as required: int -> double -> complex double
v = np.array([1,2,3,4])
v
# a 2d image: the argument to the array function is a nested Python list
img = np.array([[1, 2], [3, 4]])
img
An array has two basic properties:
v.dtype
v.shape
img.shape
Many functions take arguments named dtype or shape (sometimes size) that define what the output looks like:
img = np.array([[1, 2], [3, 4]], dtype=np.complex64)
img
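As a minimal sketch of the type inference mentioned above (the variable names are illustrative only), the following shows how numpy.array picks a dtype from its input and how an explicit dtype overrides that choice. The exact integer width depends on the platform, so only the kind of the integer type is printed.

```python
import numpy as np

# Mixed inputs are promoted to a type that can hold every element.
ints = np.array([1, 2, 3])               # platform integer
floats = np.array([1, 2.5, 3])           # one float promotes everything to float64
complexes = np.array([1, 2.5, 3 + 1j])   # one complex promotes everything to complex128

print(ints.dtype.kind, floats.dtype, complexes.dtype)

# An explicit dtype overrides the inferred one.
small = np.array([1, 2, 3], dtype=np.int16)
print(small.dtype, small.shape, small.itemsize)
```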
For larger arrays it is impractical to initialize the data manually using explicit Python lists. Instead we can use one of the many functions in numpy that generate arrays of different forms. Some of the more common are:
np.empty((3,3))
np.zeros((3,3))
np.ones((3,3), dtype=np.int16)
# new arrays with the same shape and dtype as an existing array
np.empty_like(v)
np.ones_like(v)
np.zeros_like(v)
np.full_like(v, 3) #numpy 1.8
# create a range
x = np.arange(0, 10, 1) # arguments: start, stop, step, [dtype]
x
x = np.arange(-1, 1, 0.1)
x
# using linspace, end points are included by default and you define the number of points instead of the step
np.linspace(0, 10, num=30, endpoint=True)
np.logspace(0, 10, num=10, base=np.e)
from numpy import random
# normally distributed random numbers (here: mean -5, standard deviation 5)
random.normal(loc=-5., scale=5, size=(5,5))
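The difference between step-based and count-based ranges is easy to get wrong, so here is a small sketch with illustrative variable names. It assumes nothing beyond the functions already used above: arange excludes the stop value, linspace includes it by default, and logspace uses base 10 unless told otherwise.

```python
import numpy as np

# arange: fixed step, the stop value is excluded
step_based = np.arange(0.0, 1.0, 0.25)

# linspace: fixed number of points, the stop value is included by default
count_based = np.linspace(0.0, 1.0, num=5)

# With endpoint=False, linspace reproduces the arange result
same_as_arange = np.linspace(0.0, 1.0, num=4, endpoint=False)

# logspace: the arguments are exponents, the default base is 10
decades = np.logspace(0, 2, num=3)

print(step_based, count_based, decades)
```

For non-integer steps, linspace is usually the safer choice, because the number of points does not depend on floating-point rounding of the step.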
A very common file format for data files is comma-separated values (CSV), or a related format such as TSV (tab-separated values). To read data from such files into NumPy arrays we can use the numpy.genfromtxt or numpy.loadtxt function. For example:
!head stockholm_td_adj.dat
data = np.genfromtxt('stockholm_td_adj.dat')
data.shape, data.dtype
!wc -l stockholm_td_adj.dat
data = np.genfromtxt('stockholm_td_adj.dat', names=True)
# structured array
data.shape, data.dtype
data['year'][:10]
fig, ax = subplots(figsize=(14,4))
x = data['year'] + data['month'] / 12.0 + data['day'] / 365.
y = data['temperature']
ax.plot(x, y)
ax.axis('tight')
ax.set_title('temperatures in Stockholm')
ax.set_xlabel('year')
ax.set_ylabel('temperature (C)');
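To try genfromtxt without the data file, the sketch below reads a small in-memory table instead. The column names and values are made up for illustration and are not taken from stockholm_td_adj.dat; the point is only to show how names=True turns the header line into field names of a structured array.

```python
import io
import numpy as np

# Hypothetical whitespace-separated table with a header line
text = io.StringIO(
    "year month day temperature\n"
    "1800 1 1 -6.1\n"
    "1800 1 2 -15.4\n"
    "1800 1 3 -15.0\n"
)

# names=True takes the field names from the first line
table = np.genfromtxt(text, names=True)

print(table.shape)                   # one entry per data row
print(table.dtype.names)             # field names from the header
print(table['temperature'].mean())   # columns are accessed by name
```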
The shape (dimensionality) of an array can be changed as long as the new shape has the same number of elements:
vec = np.arange(9)
# create a view of vec with a different shape
imgview = vec.reshape(3, 3)
imgview
# the ordering of the elements can be defined (Fortran or C)
vec.reshape(3, 3, order='F')
# the value of a negative entry will be inferred from the remaining free entries
vec.reshape(3, -1).shape
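Reshaping creates a view of (a reference to) the original data array whenever possible; a copy is made only when the new shape cannot be expressed as a view. When the result is a view, changing the reshaped array also changes the original array.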
imgview[0] = 5
vec
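Whether a reshape shares data with the original can be checked with np.shares_memory. The following sketch (variable names are illustrative) shows the common case where reshape returns a view, and a case where the memory layout forces NumPy to copy instead.

```python
import numpy as np

base = np.arange(6)
view = base.reshape(2, 3)
print(np.shares_memory(base, view))    # contiguous data: reshape returns a view

transposed = view.T                    # still a view, but no longer C-contiguous
flat = transposed.reshape(6)           # cannot be expressed as a view, so NumPy copies
print(np.shares_memory(base, flat))

flat[0] = 99                           # modifies only the copy
print(base)
```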
We can index elements in an array using square brackets and indices:
# regular python indexing [start:stop:step]
vec[3], vec[3::-1]
The dimensions are separated by commas in the brackets:
imgview[1,0]
#row
imgview[0], imgview[0,:]
#column
imgview[:, 0]
# Ellipsis stands for all remaining axes; this selects element zero of the last dimension, regardless of the number of dimensions of imgview
imgview[..., 0]
# assignment to an indexed array
imgview[-1, :] = 1
imgview
Unlike slices of Python lists, a slice of an array is not a copy; it is a view on the original data. Copies must be explicit, e.g. with
imgview[0, :].copy()
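The difference between list slices and array slices can be seen side by side in this short sketch (all names are illustrative):

```python
import numpy as np

# Slicing a list copies the selected elements
py_list = [0, 1, 2, 3]
list_slice = py_list[:2]
list_slice[0] = 99          # py_list is unchanged

# Slicing an array returns a view on the same data
arr = np.arange(4)
arr_slice = arr[:2]
arr_slice[0] = 99           # arr[0] changes as well

# An explicit copy is independent of the original
arr_copy = arr[:2].copy()
arr_copy[1] = -1            # arr is unchanged

print(py_list, arr, arr_copy)
```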
Scalar math operations on arrays work element-wise:
a = np.ones((5,))
b = np.zeros_like(a)
a, b
b = a + a
b
# in place
a /= b
a
a + 1
# shapes (5,) and (4,) do not match: this raises a ValueError
a + np.ones((4,))
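Broadcasting applies if the shapes of a set of arrays do not match but can be made to match by repeating the arrays along certain dimensions.
The basic rules for when broadcasting can be applied are: shapes are compared dimension by dimension, starting from the trailing (last) one; two dimensions are compatible if they are equal or if one of them is 1; and an array with fewer dimensions is treated as if its shape were padded with ones on the left.
The simplest case is adding a zero-dimensional array (scalar) to a higher-dimensional array: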
from IPython.display import Image
Image(filename='images/broadcast_11.png')
# subtract vector from an image (e.g. remove overscan offset):
img = np.arange(4*3).reshape(4,3)
o = np.array([1, 2, 3])
#dim(4, 3) - dim(3)
img - o
img = img.reshape(3,4)
o = np.array([1, 2, 3])
#dim(3, 4) - dim(3), trailing dimension does not match
img - o
# solution: add a new empty dimension (size 1)
# dim(3, 4) - dim(3, 1)
img - o[:,np.newaxis]
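Shape compatibility can also be checked without doing any arithmetic. This is a minimal sketch using np.broadcast, which raises a ValueError for incompatible shapes; the helper name can_broadcast is hypothetical and not part of NumPy.

```python
import numpy as np

def can_broadcast(a, b):
    """Hypothetical helper: True if a and b can be broadcast together."""
    try:
        np.broadcast(a, b)
        return True
    except ValueError:
        return False

image = np.arange(12).reshape(3, 4)
per_row = np.array([1, 2, 3])

print(can_broadcast(image, per_row))          # (3, 4) vs (3,): trailing 4 != 3

column = per_row[:, np.newaxis]               # shape (3, 1)
print(can_broadcast(image, column))           # (3, 4) vs (3, 1): 1 is stretched to 4
print(np.broadcast(image, column).shape)      # resulting shape

result = image - column                       # subtracts 1, 2, 3 from rows 0, 1, 2
print(result)
```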
Arrays or lists of integers can also be used as indices to other arrays:
v = np.arange(5, 10)
row_indices = [0, 2, 4]
print(v)
print(row_indices, '->', v[row_indices])
img = np.arange(25).reshape(5, 5)
col_indices = [1, 2, -1] # index -1 means the last element
img[row_indices, col_indices]
img[:, col_indices]
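Two details of integer-array indexing are worth spelling out: index lists given for several axes are paired element by element, and the result is a copy rather than a view. The sketch below illustrates both, and uses np.ix_ for the case where every row/column combination is wanted. Variable names are illustrative.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)
rows = [0, 2, 4]
cols = [1, 2, -1]

# Index lists are paired: picks elements (0, 1), (2, 2) and (4, -1)
paired = grid[rows, cols]

# np.ix_ builds an open mesh: every selected row with every selected column
block = grid[np.ix_(rows, cols)]

# Integer-array indexing returns a copy, so the original is not modified
paired[0] = -1

print(paired, block, grid[0, 1])
```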
Functions that depend on the position of data, such as sorting, partitioning and min/max, often have an arg variant which, instead of returning the result, returns an index array that yields the result when applied to the original array:
v = random.randint(0, 5, size=10)
sortindices = v.argsort()
sortindices, v[sortindices]
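One practical use of the arg variants is reordering a second array consistently with the first. The following sketch uses fixed, made-up values instead of random ones so that the result is reproducible.

```python
import numpy as np

values = np.array([3, 1, 4, 1, 5])
labels = np.array(['c', 'a', 'd', 'b', 'e'])

# A stable sort keeps equal elements in their original order
order = values.argsort(kind='stable')

print(order)             # positions that sort the array
print(values[order])     # same as np.sort(values)
print(labels[order])     # second array reordered the same way

# argmin / argmax return the position of the first extreme value
print(values.argmin(), values.argmax())
```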
We can also index with boolean masks: if the index array is a NumPy array with data type bool, then each element is selected (True) or not (False) depending on the value of the index mask at the position of that element:
x = np.arange(0, 10, 0.5)
x
# & is a bitwise and, which, for booleans, is equivalent to np.logical_and(a, b)
mask = (x >= 5) & (x < 7.5)
mask
x[mask]
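Boolean masks are also useful for counting and for assignment. In the sketch below (names are illustrative) note the parentheses around each comparison: & binds more tightly than >= and <, so leaving them out would not give the intended mask.

```python
import numpy as np

samples = np.arange(0, 10, 0.5)
in_range = (samples >= 5) & (samples < 7.5)

# True counts as 1, so summing the mask counts the selected elements
print(in_range.sum())
print(samples[in_range])

# Masks can be used on the left-hand side of an assignment
clipped = samples.copy()
clipped[~in_range] = 0      # zero everything outside the range
print(clipped.sum())
```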
NumPy also contains a special array class which carries a mask and ignores the masked values for all arithmetic operations:
mx = np.ma.MaskedArray(x, mask=~((5 < x) & (x < 7.5)))
mx
x.sum(), mx.sum()
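A typical use of masked arrays is hiding invalid measurements. This is a small sketch in which -999.0 is a hypothetical bad-value marker chosen for illustration.

```python
import numpy as np

readings = np.array([1.0, 2.0, -999.0, 4.0])   # -999.0 marks a bad value (assumption)
masked = np.ma.masked_array(readings, mask=(readings == -999.0))

print(masked.count())        # number of unmasked elements
print(masked.sum())          # the masked value is ignored
print(masked.mean())         # the mean is taken over unmasked elements only
print(masked.filled(0.0))    # back to a plain array, with masked entries replaced
```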
The index mask can be converted to position indices using the where function:
indices = np.where((x > 5) & (x < 7.5))
indices
It can also be used to conditionally select from two options:
a = np.ones_like(x)
np.where((x > 5) & (x < 7.5), a, 5) # note the broadcasting of the second option
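Both forms of where can be compared in a short sketch (variable names are illustrative). With a single argument it returns a tuple containing one index array per dimension; with three arguments it picks elements from the second or third argument depending on the condition.

```python
import numpy as np

x = np.arange(0, 10, 0.5)
cond = (x > 5) & (x < 7.5)

# One-argument form: a tuple with one index array per dimension
(positions,) = np.where(cond)
print(positions)

# Indexing with the positions gives the same result as indexing with the mask
print(np.array_equal(x[positions], x[cond]))

# Three-argument form: take x where cond is True, otherwise the scalar -1.0
chosen = np.where(cond, x, -1.0)
print(chosen)
```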