Working with Xarray#

Xarray is a class of objects added to the regular python system that allows storing data in a more organized method. The format is very similar to netCDF classic model (netCDF3). It can read netCDF files efficiently and handle some issues associated with incorrectly designed netCDF files.

Xarray also has extetions to use the Numpy, Pandas, Dask and SciPy libaries directly. Think of Xarray as a tool for organizing data in a way that other libaries can be used on the data efficiently.

The primary difference between Xarray and Pandas is that Pandas is designed to handle 1-D data while Xarray can handle n-D data and metadata about the data.

The one downside is that Xarray has very powerful functions with less great documentation. May need to dig a bit to get the best way to perform a task.

import numpy as np
import xarray as xr  # Convention is to import xarray as xr

DataArray#

Here we create some data with Numpy and put into an Xarray DataArray. Notice how there is a concept of dimentionality built into DataArray. “xarray.DataArray (dim_0: 10000)”. But because we didn’t define the dimention name a generic one was created for us.

data = np.arange(10_000)  # This is a numpy array
xr_da = xr.DataArray(data)  # Create the Xarray DataArray using Numpy array.
xr_da

This time create a time array to match the data array shape. Time will be one minute time steps. The time array will become a coordinate variable to describe the values along the dimention we defined and named “time”. The coordinate is set to the time array and the dimention is set to “time” string.

time = np.array('2019-11-01T00:00:00', dtype='datetime64[m]') + np.arange(data.size)
xr_da = xr.DataArray(data, dims=['time'], coords=[time])
xr_da

We can add attributes describing metadata about the data.

xr_da.attrs['long_name'] = 'Amazing data that will win me a Nobel prize.'
xr_da.attrs['units'] = 'degK'
xr_da.attrs['valid_min'] = 0.
xr_da.attrs['valid_max'] = 10000.
xr_da

Same as above but all in one step while creating the DataArray.

xr_da = xr.DataArray(
    data, dims=['time'],
    coords=[time],
    attrs={'long_name': 'Amazing data that will win me a Nobel prize.',
           'units': 'degK',
           'valid_min': 0.,
           'valid_max': 10000.})
xr_da

To extract the data values only we use the .values attribute on the DataArray

xr_da.values

To extract the attributes as a dictionary we use the .attrs attribute.

xr_da.attrs

Or the attrs decorator can also accept a name for a specific attribute

xr_da.attrs['long_name']
type(xr_da)
type(xr_da.values)
type(xr_da.attrs)

Dataset#

The full power of Xarray comes from using Datasets. A Dataset is a collection of DataArrays. The beauty of Datasets is holding all the corresoponding data together and performing functions on multiple DataArrays in the Datasets all at once. This becomes very powerful and very fast!

Create some data and a time data array to match the data we created with minute time steps.

data1 = np.arange(10000, dtype=float)
data2 = np.arange(10000, dtype=float) + 123.456
time = np.array('2019-11-01T00:00:00', dtype='datetime64[m]') + np.arange(data1.size)
xr_ds = xr.Dataset(
    # This is the data section.
    # Notice all data is wrappted in a dictionary. In that dictionary the key
    # is the variable name followed by a tuple. The first value of the tuple
    # is the dimension(s) name, folloed by the data values, followed by optional
    # dictionary of attributes as key:value pairs.
    data_vars={'data1': ('time', data1, {'long_name': 'Data 1 values', 'units': 'degC'}),
               'data2': ('time', data2, {'long_name': 'Data 2 values', 'units': 'degF'})
               },
    # This is the coordinate section following the same format. Since this
    # comes next it could be interpredted as positional as coordinates.
    # But we are using keywords to make it easier to understand.
    coords={'time': ('time', time, {'long_name': 'Time in UTC'})},
    # Lastly we have the global attributes.
    attrs={'the_best_animals': 'sharks'}
)

Print out the full Dataset

xr_ds

Print out one DataArray from the Dataset

xr_ds['data1']

Print out values from the one variable in the Dataset

xr_ds['data1'].values

Print out one attribute from one DataArray

xr_ds['data1'].attrs['units']

Read in data#

Let’s read in a single netCDF data file.

from pathlib import Path
filename = Path('data', 'sgpmetE13.b1', 'sgpmetE13.b1.20191101.000000.cdf')
met_ds = xr.open_dataset(filename)
met_ds

We can also read in multiple netCDF data files using a differnet method. All the kewords accecpted by open_dataset() are accepted by open_mfdataset().

The filename glob is understood by open_mfdataset() and correctly grabs all the files that match the file glob. Using parallel=True allows it to use multiple cores for reading the data. This may depend on your machine and number of cores available, and may be faster or may not. We can also reduce the amount of memory required by excluding some variables from being read.

filename = str(Path('data', 'sgpmetE13.b1', 'sgpmetE13.b1.*.cdf'))

To resolve issues with incorrectly formatted variables or reduce the memory we can exclude some variable from being read.

drop_vars = [
    'base_time', 'time_offset', 'vapor_pressure_std', 'wspd_arith_mean',
    'qc_wspd_arith_mean', 'wspd_vec_mean', 'qc_wspd_vec_mean',
    'wdir_vec_mean', 'qc_wdir_vec_mean', 'wdir_vec_std', 'tbrg_precip_total',
    'qc_tbrg_precip_total', 'tbrg_precip_total_corr', 'qc_tbrg_precip_total_corr',
    'org_precip_rate_mean', 'qc_org_precip_rate_mean', 'pwd_err_code',
    'pwd_mean_vis_1min', 'qc_pwd_mean_vis_1min', 'pwd_mean_vis_10min',
    'qc_pwd_mean_vis_10min', 'pwd_pw_code_inst', 'qc_pwd_pw_code_inst',
    'pwd_pw_code_15min', 'qc_pwd_pw_code_15min', 'pwd_pw_code_1hr',
    'qc_pwd_pw_code_1hr', 'pwd_precip_rate_mean_1min',
    'qc_pwd_precip_rate_mean_1min', 'pwd_cumul_rain', 'qc_pwd_cumul_rain',
    'pwd_cumul_snow', 'qc_pwd_cumul_snow', 'logger_volt', 'qc_logger_volt',
    'logger_temp', 'qc_logger_temp', 'temp_std', 'rh_std', 'vapor_pressure_mean',
    'qc_vapor_pressure_mean', 'a_very_long_name_that_is_not_in_the_data_file']


met_ds = xr.open_mfdataset(filename, drop_variables=drop_vars,
                           parallel=True)

met_ds

Once we have the data read, Xarray has a wrapper around matplotlib to generate plots directly from the Dataset.

%matplotlib inline

met_ds['temp_mean'].plot()

Here we make two plots. Syntax is slightly different than calling matplotlib directly.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2)
met_ds['temp_mean'].plot(ax=axes[0])
met_ds['rh_mean'].plot(ax=axes[1])

Xarray playing very well with other libraries#

We can use Xarray with Pandas natively for even easier actions. Here we use pandas to make a time array with a 6 hour time step for four years.

import pandas as pd

pd_time = pd.date_range('2000-01-01', freq='6H', periods=365 * 4)
pd_time

We will create a new Xarray Dataset with a range of numbers matching the number of time samples we created. The Pandas time is used to initialize the xarray Dataset. Because Pandas and Xarray play well together it just works.

xr_ds = xr.Dataset({'data': ('time', np.arange(len(pd_time))), 'time': pd_time})
xr_ds

Sub-selecting data#

Now let’s calculate the mean for a day by grouping the data. This works for all time worded groups: hour, minute, year, month, … The method returns a new Dataset and leaves the orginal untouched. Notice how data was orginanally type integer, but because we are calcualting a mean, the type is upconverted to float.

ds_mean = xr_ds.groupby('time.day').mean()
ds_mean

We can also define the grouping size by using the .resample() method and pass in the function to use using the .reduce() method. If there is no data to perform the operation, a new time step is added with a NaN data value.

ds_mean = xr_ds.resample(time='30min').reduce(np.nanmean)
ds_mean

The .sel() method select data based on data or coordinate values. We can extract a range of data by filtering on the time coordinate and using the builtin slice() function. This looks familar to the Pandas example because Xarray is using Pandas.

ds_subset = xr_ds.sel(time=slice('2000-06-01 06:00', '2000-08-03 23:59:59'))
ds_subset

For selecting at specific index use .isel().

ds_subset = xr_ds.isel({'time': range(200, 832)})
ds_subset

What if we want to find the closes value in time but not match to values outside a tolerable range. We can use .reindex() method with a toleracne to indicate which values should be matched. The values that don’t have a match are set to NaN. We need to use timedelta64() to set the tolerance value which includes a time unit. Anywhere the time is outside the tolerance the data values is set to NaN. To allow for setting NaN, the data type is upconverted to float.

# To show how the tolerance work add some random seconds to the time used in matching
subset_time = xr_ds['time'].values 
random_seconds = np.random.randint(-10, 10, size=subset_time.size).astype('timedelta64[s]')
subset_time = subset_time + random_seconds

ds_result = xr_ds.reindex(time=subset_time, method='nearest',
                          tolerance=np.timedelta64(5, 's'))
print(xr_ds['data'].values[:10])
print(ds_result['data'].values[:10])

To not have a Dataset full of missing values where the time was outside the tolerance we can drop where all values are set to NaN and return a new Dataset.

print('Number of time values:', ds_result.dims['time'])
ds_result = ds_result.dropna('time', how='all')
print('Number of time values:', ds_result.dims['time'])