Starting with Scientific Libraries in Python#

If you want to perform larger scientific computations in Python, some great libraries to use are (a short sketch follows the list):

  • Numpy = the base building block for large array-based computing

  • Pandas = an extension of the Numpy data model that structures and organizes data into Dataframes. A Dataframe resembles a spreadsheet, but is much more powerful.

  • Xarray = an extension of the Numpy and Pandas data models that structures and organizes data into Datasets. A Dataset resembles the netCDF3 data model, but is much more powerful.
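
As a quick sketch of how the three relate (each object is covered in detail below; the values here are made up):

import numpy as np
import pandas as pd
import xarray as xr

values = np.array([1.0, 2.0, 3.0])        # Numpy: a plain n-dimensional array
series = pd.Series(values, name='temp')   # Pandas: labeled 1-D data
da = xr.DataArray(values, dims=['time'])  # Xarray: labeled n-D data with metadata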

Working with Pandas#

import pandas as pd  # Convention suggests importing as pd

Create Pandas Dataframe#

Create some data in a Python dictionary

data = {'apples': [3, 2, 0, 1],
        'oranges': [0, 3, 7, 2]}
data

Use the dictionary to populate the new Pandas Dataframe. Jupyter plays very well with Pandas: when asked to display a Dataframe or Series, it renders the output in a visually appealing way. Notice the columns take their names from the dictionary keys we used to populate the Dataframe, and each row has an index set to an incremented integer by default.

purchases = pd.DataFrame(data)
purchases

A Dataframe uses an index to label the rows and column names to label the columns. The column names came from the dictionary keys used to initially create the Dataframe. We can also set the index labels when the Dataframe is created.

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases

Printing a column shows additional information, including the index and data type. The column is returned as a Pandas Series.

purchases.apples

Extracting a row uses the .loc[] indexer on the Dataframe. .loc stands for location and searches the index for a match.

purchases.loc['June']

We can also extract a row by integer position with the .iloc[] indexer.

purchases.iloc[0]
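
As with Numpy indexing, negative positions count back from the end:

purchases.iloc[-1]  # The last row ('David')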

Read in some data#

Using the path to a specific file, we read the data into a Pandas Dataframe. You will use this very often if you have data in ASCII column files, so get to know this method well. The method has keywords to describe how to read the column data, including the delimiter, the number of header rows, and which column holds the time stamp. If it can parse the time stamp, it will convert it to the Pandas native time type.

from pathlib import Path  # Importing the Path class from the pathlib library
import numpy as np  # Convention is to import numpy as np
filename = Path('..', 'data', 'sgpmetE13.00.20191105.150801.raw.dat')
df = pd.read_csv(filename, delimiter=',', skiprows=[0, 2, 3], header=0, parse_dates=[0])

df

Similar to Numpy, we can use the .shape and .size attributes of the Dataframe to return metadata about the size of the Dataframe.

print(df.size)
print(df.shape)  # 60 rows by 33 columns

Print the first five rows using Numpy syntax

df[:5]  # Prints from row 0 to 4
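
Pandas also provides the .head() and .tail() methods for the same purpose:

df.head()  # First five rows (default); df.tail(3) would show the last three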

We can get the names of the columns with the .columns attribute.

df.columns

Pandas has a few methods to inspect the data in a Dataframe or Series and present the typical statistical values. Because Pandas plays so well with Jupyter, the output is very easy to visualize.

df.describe()

What type is the TIMESTAMP data? Notice that Pandas has its own time data type. This is similar to the standard library's datetime.datetime and Numpy's datetime64, but is technically different.

print(type(df['TIMESTAMP'][0]))
print(type(df['PTemp'][0]))
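
A Pandas Timestamp can be converted to the other two types when another library requires them:

ts = df['TIMESTAMP'][0]
print(ts.to_pydatetime())   # Standard library datetime.datetime
print(ts.to_datetime64())   # Numpy datetime64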

Get the pressure Series from the Dataframe, sum it to a single value, and calculate the mean.

print(df['Pressure_kPa'].sum())
print(df['Pressure_kPa'].mean())

Extract the RH Series from the Dataframe and convert it to a Numpy array. np.array() makes a copy of the data, so changing values in the Numpy array will not change the values in the Dataframe.

rh = df['RH_Avg']
print(type(rh))
rh_np = np.array(rh)
print(type(rh_np))
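
A quick check shows the Numpy copy is independent of the Dataframe (the -9999 here is just an arbitrary sentinel value):

rh_np[0] = -9999.0
print(rh_np[0], df['RH_Avg'][0])  # The Dataframe value is unchanged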

Calculate a rolling mean over the Series using a 10-point window. By default, a window must contain the full 10 values to calculate a mean; windows with fewer values are set to NaN, so the first nine values would be NaN. By setting min_periods=2 we require only two values in the rolling window, so only the first value is set to NaN. What happens when you change min_periods to a larger number?

rh_rolling_mean = df['RH_Avg'].rolling(10, min_periods=2).mean()
rh_rolling_mean
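
For example, raising min_periods means more of the leading windows lack enough points and are set to NaN:

df['RH_Avg'].rolling(10, min_periods=5).mean()  # First four values are now NaN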

Working with Xarray#

Xarray provides a set of classes added to the regular Python system that allow storing data in a more organized way. The format closely follows the netCDF classic data model (netCDF3). Xarray can read netCDF files efficiently and handle some issues associated with incorrectly designed netCDF files.

Xarray also has extensions to use the Numpy, Pandas, Dask, and SciPy libraries directly. Think of Xarray as a tool for organizing data in a way that lets other libraries be used on the data efficiently.

A primary difference between Xarray and Pandas is that Pandas is designed to handle 1-D data while Xarray can handle n-D data and metadata about the data.

The one downside is that Xarray's very powerful functions are not always matched by great documentation. You may need to dig a bit to find the best way to perform a task.

import xarray as xr  # Convention is to import xarray as xr

DataArray#

Here we create some data with Numpy and put it into an Xarray DataArray. Notice that a concept of dimensionality is built into the DataArray: “xarray.DataArray (dim_0: 10000)”. Because we didn’t define a dimension name, a generic one was created for us. Also notice how nice the printing looks. Xarray plays very well with Jupyter.

data = np.arange(10000)  # This is a numpy array
da = xr.DataArray(data)  # Create the Xarray DataArray using Numpy array.
da  # Convention suggests using the variable name da for a DataArray when feasible.

This time create a time array to match the data array shape, with one-minute time steps. The time array will become a coordinate variable that describes the values along the dimension we defined and named “time”. The coordinate is set to the time array and the dimension is set to the “time” string.

time = np.array('2019-11-01T00:00:00', dtype='datetime64[m]') + np.arange(data.size)
time = time.astype('datetime64[ns]')  # Only done to stop a warning appearing
da = xr.DataArray(data, dims=['time'], coords=[time])
da

We can add attributes describing metadata about the data.

da.attrs['long_name'] = 'Amazing data that will win me a Nobel prize.'
da.attrs['units'] = 'degK'
da

Same as above but all in one step while creating the DataArray.

da = xr.DataArray(
    data, dims=['time'],
    coords=[time],
    attrs={'long_name': 'Amazing data that will win me a Nobel prize.',
           'units': 'degK'})
da

To extract the data values only we use the .values attribute on the DataArray

da.values

To extract the attributes as a dictionary we use the .attrs attribute.

da.attrs

The .attrs dictionary can also be indexed with the name of a specific attribute.

da.attrs['long_name']
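
As noted earlier, a DataArray is not limited to 1-D data. A minimal 2-D sketch with a second, made-up "station" dimension:

data_2d = np.arange(12).reshape(4, 3)
time_2d = np.array('2019-11-01T00:00:00', dtype='datetime64[m]') + np.arange(4)
da_2d = xr.DataArray(data_2d, dims=['time', 'station'],
                     coords=[time_2d.astype('datetime64[ns]'), ['A', 'B', 'C']])
da_2d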

Dataset#

The full power of Xarray comes from using Datasets. A Dataset is a collection of DataArrays. The beauty of a Dataset is that it holds all the corresponding data together and can perform functions on multiple DataArrays at once. This becomes very powerful and extremely fast!

Create some data arrays, and a time array to match, with one-minute time steps.

data1 = np.arange(10000, dtype=float)
data2 = np.arange(10000, dtype=float) + 123.456
time = np.array('2019-11-01T00:00:00', dtype='datetime64[m]') + np.arange(data1.size)
time = time.astype('datetime64[ns]')  # Only done to stop a warning appearing
ds = xr.Dataset(
    # This is the data section.
    # Notice all data is wrapped in a dictionary. In that dictionary the key
    # is the variable name, followed by a tuple. The first value of the tuple
    # is the dimension(s) name, followed by the data values, followed by an
    # optional dictionary of attributes as key:value pairs.
    data_vars={'data1': ('time', data1, {'long_name': 'Data 1 values', 'units': 'degC'}),
               'data2': ('time', data2, {'long_name': 'Data 2 values', 'units': 'degF'})
               },
    # This is the coordinate section following the same format. Since this
    # comes next it could be passed positionally, but we are using keywords
    # to make it easier to understand.
    coords={'time': ('time', time, {'long_name': 'Time in UTC'})},
    # Lastly we have the global attributes.
    attrs={'the_best_animals': 'sharks'}
)

ds

Print out one DataArray from the Dataset

ds['data1']
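
To see the "all at once" point in action, a reduction applied to the Dataset is applied to every DataArray it holds:

ds.mean()  # Returns a new Dataset with the mean of data1 and data2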

Read in data#

Let’s read in a single netCDF data file. Notice we are using a relative path to go up one directory and into the data directory. This method will also read multiple files when provided a list of file names.

filename = Path('..', 'data', 'sgpmetE13.b1.20191101.000000.cdf')
ds = xr.open_mfdataset(filename)
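
For example, a list of paths or a wildcard pattern would concatenate multiple files along time (the paths here are hypothetical):

# ds = xr.open_mfdataset([filename1, filename2])
# ds = xr.open_mfdataset(str(Path('..', 'data', 'sgpmetE13.b1.*.cdf')))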

To resolve issues with incorrectly formatted variables, or to reduce memory use, we can exclude some variables from being read.

drop_vars = [
    'base_time', 'time_offset', 'vapor_pressure_std', 'wspd_arith_mean',
    'qc_wspd_arith_mean', 'wspd_vec_mean', 'qc_wspd_vec_mean',
    'wdir_vec_mean', 'qc_wdir_vec_mean', 'wdir_vec_std', 'tbrg_precip_total',
    'qc_tbrg_precip_total', 'tbrg_precip_total_corr', 'qc_tbrg_precip_total_corr',
    'org_precip_rate_mean', 'qc_org_precip_rate_mean', 'pwd_err_code',
    'pwd_mean_vis_1min', 'qc_pwd_mean_vis_1min', 'pwd_mean_vis_10min',
    'qc_pwd_mean_vis_10min', 'pwd_pw_code_inst', 'qc_pwd_pw_code_inst',
    'pwd_pw_code_15min', 'qc_pwd_pw_code_15min', 'pwd_pw_code_1hr',
    'qc_pwd_pw_code_1hr', 'pwd_precip_rate_mean_1min',
    'qc_pwd_precip_rate_mean_1min', 'pwd_cumul_rain', 'qc_pwd_cumul_rain',
    'pwd_cumul_snow', 'qc_pwd_cumul_snow', 'logger_volt', 'qc_logger_volt',
    'logger_temp', 'qc_logger_temp', 'temp_std', 'rh_std', 'vapor_pressure_mean',
    'qc_vapor_pressure_mean', 'a_very_long_name_that_is_not_in_the_data_file',
    'qc_atmos_pressure', 'qc_temp_mean', 'qc_rh_mean']


ds = xr.open_mfdataset(filename, drop_variables=drop_vars)

ds

We can perform the same sum and mean operations on the DataArray as we did with the Pandas Series. Notice the returned value is a zero-dimensional Numpy array holding a single value.

print(ds['atmos_pressure'].sum().values)
print(ds['atmos_pressure'].mean().values)

Sub-selecting data#

Now let’s calculate the mean for each hour by grouping the data. This works for all time-based groups: day, hour, minute, year, month, … The method returns a new Dataset and leaves the original untouched. Notice that the data was originally type integer, but because we are calculating a mean, the type is upconverted to float.

new_ds = ds.groupby('time.hour').mean()
new_ds

We can also define the grouping size by using the .resample() method and pass the function to use to the .reduce() method. If there is no data in an interval, a new time step is still added, with a NaN data value.

new_ds = ds.resample(time='30min').reduce(np.nanmean)
new_ds
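
Because Xarray reductions skip NaNs by default, the same result can typically be obtained with the shortcut method:

new_ds = ds.resample(time='30min').mean()
new_ds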

The .sel() method selects data based on data or coordinate values. We can extract a range of data by filtering on the time coordinate using the builtin slice() function. This looks familiar to the Pandas example because Xarray is using Pandas under the hood.

new_ds = ds.sel(time=slice('2019-11-01 06:00', '2019-11-01 20:59:59'))
new_ds

For selecting by integer index use the .isel() method.

new_ds = ds.isel({'time': range(200, 832)})
new_ds

What if we want to find the closest value in time, but not match values outside a tolerable range? We can use the .reindex() method with a tolerance to indicate which values should be matched. Values that don’t have a match are set to NaN. We need to use timedelta64() to set the tolerance, which includes a time unit. Anywhere the time is outside the tolerance, the data value is set to NaN; to allow for setting NaN, the data type is upconverted to float.

# To show how the tolerance works, add some random seconds to the times used in matching
subset_time = ds['time'].values
random_seconds = np.random.randint(-10, 10, size=subset_time.size).astype('timedelta64[s]')
subset_time = subset_time + random_seconds

ds_result = ds.reindex(time=subset_time, method='nearest',
                       tolerance=np.timedelta64(5, 's'))
print(ds['atmos_pressure'].values[:10])
print(ds_result['atmos_pressure'].values[:10])