Working with Pandas#

Pandas is an extension of the Numpy data model to structure and organize data into DataFrames. A DataFrame resembles a spreadsheet in looks, but is much more powerful.

import pandas as pd  # Convention suggest to import as pd

from pathlib import Path
import numpy as np

Create Pandas DataFrame#

Create some data in a python dictionary

data = {'apples': [3, 2, 0, 1],
        'oranges': [0, 3, 7, 2]}
data

Use the dictionary to populate the new pandas data frame. Python and Jupyter play very well with Pandas and when requested to print the DataFrame or Series knows how to print in a visually appealing way. Notice the columns have names taken from the Dictionary keys we used to populate the DataFrame. Each row has an index set to an incremented integer by default.

purchases = pd.DataFrame(data)
purchases

Pandas DataFrame consisting of Series

type(purchases)

Pandas Series

type(purchases.apples)

A DataFrame has a concept of “index” to label the rows of the DataFrame and column names to label the Columns. We created column names from the Dictionary keys used to initially create the DataFrame.

purchases = pd.DataFrame(data, index=['June', 'Robert', 'Lily', 'David'])
purchases

Printing the column shows additional informatin including the index and data type.

purchases.apples

Extracting a row uses the .loc() method on the DataFrame. .loc() standar for location and searches the index for a match.

purchases.loc['June']

Can also extract using index location .iloc() method.

purchases.iloc[0]

Read in some data#

Using the path to a specific file we read the data into a Pandas Data Frame. This is something you will use very often if you have data in ASCII column files. Get to know this method well. The method has keywords to help describe how to read the column data including the delimter, number of header rows, and which column is the time stamp. If it can parse the time stamp it will convert to Pandas native time type.

filename = Path('data', 'sgpmetE13.b1', 'sgpmetE13.00.20191105.150801.raw.dat')
data = pd.read_csv(filename, delimiter=',', header=0, skiprows=[0, 2, 3], parse_dates=[0])

data

How many data samples do we have? Can use the Python len() function. Notice it returns the number of rows only.

len(data)

Similar to Numpy, can use the .shape and .size methods on the DataFrame to return metadata about size of the DataFrame.

print(data.size)
print(data.shape)

Print the first five rows

data[0:5]

We can get the names of the columns with .columns method

data.columns

Pandas has a few methods used to inspect the data in the Dataset or Series and present the typical statistical values. Because Pandas plays so well with Jupyter the output is very easy to visualize.

data.describe()

Calculate the correlation coefficients of the DataSet on each Series

data.corr()

Calculate the covariance coefficients of the DataSet on each Series

data.cov()

What type is the time column data?

print(type(data['TIMESTAMP'][0]))
print(type(data['PTemp'][0]))

Get the pressure Series from the DataFrame and sum it up to one value or calcualte the mean.

print(data['Pressure_kPa'].sum())
print(data['Pressure_kPa'].mean())

Get the pressure Series from the DataFrame and use a special method to summarize the values in the Series using what Pandas things is most helpful.

data['RH_Avg'].describe()

Extract the RH series from the DataFrame. This is a copy of the Series in the DataFrame so changing the values will not change the values in the DataFrame.

rh = data['RH_Avg']
type(rh)

Warning, off on a tangent#

Pandas has two ways of extracting a Series from a DataFrame: copy and view. A copy is separate from the DataFrame and changes to the copied Series will not affect the DataFrame. Changes to a view will impact the DataFrame. In pandas, whether you get a view or a copy depends on the structure of the DataFrame and, if you are trying to modify a slice, the nature of the modification. Pandas will often warn when using a view vs. a copy, but that may not alwasy be the case. There is discussion to resovle this issue but the fix is not going to happen soon. The best solution is to use .copy() when you want a copy to be explicit.

rh = data['RH_Avg'].copy()
type(rh)

Calculate a rolling mean over the Series using 10 points. Notice the first 8 values are NaN. There is a default number of values to use to calculate a value. Else it is set to NaN.

rh_rolling_mean = rh.rolling(10).mean()
rh_rolling_mean

By specifically stating the minimum number of points to use when calculating the mean we force it to not fill in so many NaNs. There is at least one value to use in the rollig window so not NaNs. What happens when you change min_periods to a larger number?

rh_rolling_mean = rh.rolling(10, min_periods=1).mean()
rh_rolling_mean

Set a range of values in our extracted data to NaN to represent missing data.

rh[20:30] = np.nan
rh

Calculate a rolling max on the entire DataFrame.

rh_rolling_mean = data.rolling(10, min_periods=2).max()
rh_rolling_mean