Working with Numpy#

From the NumPy documentation:

NumPy is the fundamental package for scientific computing in Python. It is a Python library that provides a multidimensional array object, various derived objects (such as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear algebra, basic statistical operations, random simulation, and much more.

NumPy’s position at the center of the scientific Python ecosystem means that all users should start here in their learning journey through the core scientific packages.

Additional Tutorials:

What is the difference between regular Python and Numpy Python?#

Data Types#

Python has specific data types, but not as many as traditional C or FORTRAN

print(type(1))  # Python integer - this is a 64 byte integer
print(type(1.))  # Python float - this is a 64 byte float
print(type('Hello World'))  # Python string which is different than a character array

Numpy can be thought of as a space within the Python language with more data types like C or FORTRAN and a richer set of mathematical/computational enivironment built on top of the base Python environment. All the regular Python functions still exsits just more options are available.

import numpy as np  # Convention is to import and rename. Best to stick with convention

print('type(np.array(1)):', type(np.array(1)))
print()

print('np.int8:', np.array(1, dtype=np.int8).dtype)    # Numpy integer of size 8 bytes
print('np.int16:', np.array(1, dtype=np.int16).dtype)  # Numpy integer of size 16 bytes
print('np.int32:', np.array(1, dtype=np.int32).dtype)  # Numpy integer of size 32 bytes
print('np.int64:', np.array(1, dtype=np.int64).dtype)  # Numpy integer of size 64 bytes
print('np.int:', np.array(1, dtype=int).dtype)         # Default Numpy integer of size 64 bytes
print()

print('np.float16:', np.array(1, dtype=np.float16).dtype)  # Numpy float of size 16 bytes
print('np.float32:', np.array(1, dtype=np.float32).dtype)   # Numpy float of size 32 bytes
print('np.float64:', np.array(1, dtype=np.float64).dtype)   # Numpy float of size 64 bytes
print('np.float:', np.array(1, dtype=float).dtype)          # Numpy float of size 64 bytes
print()

print('np.bool:', np.array(1, dtype=np.bool_).dtype)    # Numpy boolean of size 8 bytes
print('np.int:', np.array(1, dtype=np.int_).dtype)      # Numpy integer of size 64 bytes
print('np.foat:', np.array(1, dtype=np.float_).dtype)   # Numpy float of size 64 bytes
print('np.uint:', np.array(1, dtype=np.uint).dtype)     # Numpy unsigned integer of size 64 bytes
print('np.complex:', np.array(1, dtype=complex).dtype)  # Numpy complex data object of size 128 bytes
print('Hello World:', np.array('Hello World').dtype)    # <U11 means 11 charater Unicode String.

Numpy is based on arrays of data, with all values having the same data type in the array. This greatly increases computational performance. Can convert between basic Python and Numpy data space easily and quickly.

a = [1, 2, 3]  # Create a Python list
b = np.array([1, 2, 3])  # Create a Numpy 1-D array
print('type(a):', type(a))  # Get type of a
print('type(b):', type(b))  # Get type of b
print('type(b[0]):', type(b[0]))  # Get type of index 0 of b
print('b.dtype:', b.dtype)

Basic Python vs Numpy performance#

Basic Python can perform computations relativly fast.

%%timeit
num = 100_000
a = list(range(0, num))
for ii in a:
    a[ii] = a[ii] + 1

The computation can be faster for some applications using list comprehensions. This is performing the same computation just with a diffrent syntax performing the computations in a more optimized way.

%%timeit
num = 100_000
a = np.arange(num, dtype=np.int32) 
a = a + 1

Switching between Basic Python space and Numpy space#

Create some data using Python lists

num = 10000
a = list(range(0, num))  # Defaults to integer values
print('type(a):', type(a))
print('len(a):', len(a))

Convert the python list into numpy array

b = np.array(a)
print('type(b):', type(b))
print('b.size:', b.size)
print('b.shape:', b.shape)

Convert the numpy array back into python list

c = list(b)
print('type(c):', type(c))
print('len(c):', len(c))

Making Numpy Arrays#

Numpy also comes with a large number of methods to create data arrays so you don’t have to. Chances are Numpy has a method to do what you want, so search there first.

a = np.array(range(10))   # Converts the itterator range function to numpy array
b = np.arange(10)         # Creates the numpy array directly and faster
c = np.arange(1, 10, 2)   # Creates array counting by two
d = np.arange(10, 1, -1)  # Creates array decending by one
e = np.flip(a)            # Reverses the array a from increasing to decreasing
print('a:', a)
print('b:', b)
print('c:', c)
print('d:', d)
print('e:', e)

Indexing, Slicing, Broadcasging#

How to get the specific data from the Numpy array. Numpy uses the same slicing as Python lists. Start number to but not including end number.

print('a:', a)
print('a[0:5]:', a[0:5])  # selects upto but not including index 5
print('a[3:]:', a[3:])  # selects everthing from 3 to end of array
print('a[:5]:', a[:5])  # selects to upto but not including index 5
print('a[3:5]:', a[3:5])  # selects from 3 upto but not including index 5
print('a[0:-1]:', a[:-1])  # selects upto but not including index 9
print('a[0:100]:', a[0:100])  # index is past end of array?!?
c = np.arange(10)  # Create a 1-D array
print('c.shape:', c.shape)
c = c.reshape((2, 5))  # Change to a 2-D array
print('c:\n', c)
print('c.shape:', c.shape)
print('c.size:', c.size)
print()

a = np.zeros((2, 2))    # Create an array of all zeros. Notice defaults to type float
print('a:\n', a)

b = np.ones((2, 2), dtype=int)    # Create an array of all ones
print('\nb:\n', b)

c = np.full((2, 2), 7, dtype=np.int16)   # Create a constant array
print('\nc:\n', c)

d = np.eye(3)           # Create a 3x3 identity matrix
print('\nd:\n', d)

e = np.random.random((2, 4))  # Create a float array filled with random values between 0 and 1
print('\ne:', e)

One of the most important features of Numpy is Broadcasting, where a single operation is performed on all values of the Numpy array without the need for a loop. This creates more simple and readable code, and is significantly faster. General rule is if it’s possble to remove a loop by Broadcasting, do it!

Create an array of all zeros

a = np.zeros(20, dtype=int)
a

Add a value of 1 to every value in the array. Also notice that the initial array was of type integer. But we are adding a float so all vales are first upconverted to type float and then a value of 1.0 is added to every value in the array.

a = a + 1.0  # Note how it upconverted from int to float
a

Here we change the data type of the entire array from float to integer.

a = a.astype(int)
a

Here we add a value to a subset of the array by adding 10 to values matching index from 3 to 7. The rest of the values are unchanged.

a[3:8] = a[3:8] + 10
a

Here we use shorthand notiation to add a constant of 1000 to every value in the array.

a[3:8] += 1000
a

The IEEE Not A Number value#

There is a special Numpy value called NaN which is used in calculations to represent a value that is not a number. Think of it as missing data, or a bad value that should not be propigated through the data. One of the most important things to remember about NaN is that it taints anything it touches!

print(np.nan)
print(type(np.nan))   # Python type says is of type float
print(11 + np.nan, 12 - np.nan, 13 * np.nan, 14 / np.nan) # Anything that uses NaN becomes a NaN

Because NaN is special, it acts funny. Looking for NaN requires some specific logic.

1 == 1
np.nan == np.nan  # What is going on here?

Will need to use the Numpy methods of searching the arrays if you want to do comparisons for find where NaNs are located

np.isnan(np.nan)

To use NaNs properly requries a litte bit of extra thought, but is greatly worth it!

a = np.arange(10, dtype=float)
print('a:', a)
print('np.min(a):', np.min(a))  # This is numpy min function

Now we will assign a value in the Numpy array to NaN

a[0] = np.nan
a

Calculate the mean. Notice how it returns NaN. The NaNs are tainting all the operations.

np.mean(a)

We need to use a different method that understands how to correctly handle NaNs

np.nanmean(a)

Working with time in Numpy#

If you deal with time series data you will need to work with time data types. Oh that’s right, most atmospheric data is time series data…

Python has a library called datetime that is great for working with dates and times. It is timezone unaware and works on one value at a time. This is a native data type within Python and you will see it all over the place. Here are some quick examples.

Get the datetime now.

import datetime

datetime.datetime.now()

Or get the datetime in UTC timezone.

datetime.datetime.utcnow()

Or from a timestamp of number of seconds from Epoch, Seconds from 1970-01-01 00:00:00: UTC

datetime.datetime.fromtimestamp(1326244364)

But when working with a lot of time samples we need to use the Numpy time data type. It can take in a string and convert to a Numpy datetime64 value. Notice when it prints out, it will print out in ISO format.

To just get the date right now

np.datetime64('today')

To get the date and time right now. Notice the time is in UTC. Most likely this is what you want, if not understand how to get what you want.

np.datetime64('now')
np.datetime64('2005-02-25 03:30:55')

The default is store the precision at the level of input. But we can update the precison by setting precision when setting the value or after the value is created to the desired precision.

np.datetime64('2012-03')
np.datetime64('2012-03', 's')
np.datetime64('2012-03').astype('datetime64[ns]')

OK so what you can set precisoin. Why do I care? Because you can use the precision to indicate what step size to use. Notice the following range did not provide a starting or ending day. This helps with not needing to know length of months.

np.arange('2005-02', '2005-03', dtype='datetime64[D]')

There is also a a timedelta64 data type that is the result of differencing times.

np.datetime64('2009-01-01') - np.datetime64('2008-01-01')

Time Deltas are important for adding or subtracting times.

np.datetime64('2011-06-15T22:00') + np.timedelta64(12, 'h')

Converting between Python datetime and Numpy datetime64#

At some point you will need to convert a time value from one space to another. Don’t memorize this, just remember it exists and where you can find the code.

dt = datetime.datetime.utcnow()  # Get current date and time with python datetime
print(dt)
# Convert from Python datetime to Numpy datetime64
dt64 = np.datetime64(dt)
dt64

It is a bit more complicated to convert from Numpy datetime64 to Python datetime

# Set precision to datetime64 seconds and convert to Numpy integer.
# This will give number of seconds since epoch, or timestamp.
dt64 = np.datetime64('now')
print(dt64)

dt64 = dt64.astype('datetime64[s]')
ts = dt64.astype(int)

# Then use that integer number of seconds into from time stamp method.
datetime.datetime.utcfromtimestamp(ts)