All data analysis should start with understanding your data (hopefully already cleaned) through exploratory data analysis. This involves looking at descriptive statistics and visualizations to understand the distribution, spread, outliers, etc., of the different features (variables) of your data.
It’s common practice to start with univariate graphs that visualize the distributions of various features in your dataset. Univariate analysis can reveal odd distributions or interesting values that warrant further exploration in bivariate and multivariate analysis.
The main plots I’m going to go over are bar charts and histograms, which are for categorical and quantitative data, respectively. I’ll also touch on KDEs and pie charts, two less common univariate plots.
All of the example plots in this blog post are derived from the Prosper loan data provided to me by Udacity while I was taking their data analyst course. You can view the full project on GitHub.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Import cleaned dataset
df = pd.read_csv('../data/prosperLoanData_clean_v1.csv')
df['LoanOriginationDate'] = pd.to_datetime(df['LoanOriginationDate'])
Bar Charts
Bar charts depict the distribution of a categorical variable. For nominal categories, the bars are commonly re-ordered by frequency; ordinal categories are left in their inherent order rather than re-sorted. Common methods for creation are sns.countplot(), sns.barplot(), and plt.bar().
Create a vertical bar chart using Seaborn
# Set the primary color to use while plotting
color = 'tab:blue'
# Alternatively, `color_palette()` returns the given palette as a list of RGB tuples.
# Each tuple holds three floats between 0 and 1 for the red, green, and blue channels.
# Below I choose the fifth tuple (index 4) from the 'Blues' palette.
color = sns.color_palette("Blues")[4]
# Use the color in your plot with 'color='
sns.countplot(data=df, x="LoanStatus", color=color);
The ; at the end of plotting calls like sns.countplot() suppresses the object text that a Jupyter notebook prints by default. It looks like this: <Axes: xlabel='LoanStatus', ylabel='count'>. I typically append the ; because it produces cleaner output. It must be appended to the last plotting line of your code block in order to hide the object text.
Create a vertical bar chart using Matplotlib
# Frequency count for each unique value, sorted from most to least common
counts = df['LoanStatus'].value_counts()
# Take labels and heights from the same Series so each bar matches its label
plt.bar(counts.index, counts.values)
#Label the axes
plt.xlabel('Loan Status')
plt.ylabel('count')
#Display the plot
plt.show()
Notice that in the Matplotlib solution, the bars are sorted by frequency. That's fine here because this is a nominal variable. If we had wanted to keep the categories in their own order, we would have had to do so when computing the counts, like this: df['LoanStatus'].value_counts(sort=False). The .value_counts() method sorts by frequency by default; we have to pass sort=False to prevent it from reordering the bars. For a column stored as an ordered categorical, the counts then come back in category order.
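To make the effect of sort=False concrete, here is a minimal sketch on a toy ordered categorical column. The data and the category names are hypothetical and stand in for something like a credit rating; they are not taken from the Prosper dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for an ordinal column
order = ['low', 'medium', 'high']
ratings = pd.Series(
    ['high', 'low', 'high', 'medium', 'high', 'low'],
    dtype=pd.CategoricalDtype(categories=order, ordered=True),
)

# Default behavior: most frequent value first
by_freq = ratings.value_counts()

# sort=False on a categorical column: counts follow the category order
by_category = ratings.value_counts(sort=False)

print(list(by_freq.index))
print(list(by_category.index))
```

Passing by_category.index and by_category.values to plt.bar() would then draw the bars from 'low' to 'high' regardless of how common each level is.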
Basic plot design
- Rotate the category labels: plt.xticks(rotation=90)
- Set axis labels: plt.xlabel('str') & plt.ylabel('str')
- Set the title of the current axes (the individual plot): plt.title('str')
- Set the figure title: plt.suptitle('str')
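As a rough illustration of how these calls combine, the sketch below applies all of them to one small bar chart. The category names, counts, and title strings are made up for the example.

```python
import matplotlib.pyplot as plt

# Hypothetical category counts, used only for illustration
labels = ['Current', 'Completed', 'Chargedoff', 'Defaulted']
counts = [560, 380, 120, 50]

fig, ax = plt.subplots()
plt.bar(labels, counts)

# Rotate the category labels so long names do not overlap
plt.xticks(rotation=90)
# Label both axes
plt.xlabel('Loan Status')
plt.ylabel('Count')
# Title for this plot (the current axes)
plt.title('Loans by status')
# Title for the whole figure, which matters most when there are several subplots
plt.suptitle('Toy example')
```

With a single plot, plt.title() and plt.suptitle() look redundant; the distinction becomes useful once a figure holds more than one set of axes.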
Plot a bar chart using proportions instead of counts
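Since we are plotting proportions, we cannot rely on the default behavior of sns.countplot, which tallies the rows itself and plots raw counts. Instead, we compute the proportions first and pass them to sns.barplot.
To normalize the counts, you can use the normalize=True argument of the pandas .value_counts() method.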
# Convert CreditRating dtype to ordered category
CR_cat = pd.CategoricalDtype(categories=['HR', 'E', 'D', 'C', 'B', 'A', 'AA'], ordered=True)
df.CreditRating = df.CreditRating.astype(CR_cat)
# Proportion of total for each rating, kept in category order
cr_pct = df.CreditRating.value_counts(normalize=True, sort=False)
# Turn the pd.Series into a DataFrame with explicitly named columns
cr_df = cr_pct.rename_axis('CreditRating').reset_index(name='Distribution')
# Plot bar chart of proportions for credit ratings
sns.barplot(data=cr_df, x='CreditRating', y='Distribution', color=color);
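The step from a value_counts() Series to a plotting-ready DataFrame is easy to get wrong, because the default column names that reset_index() produces have, to my understanding, differed between pandas versions. The sketch below names both columns explicitly so the result does not depend on those defaults. The toy data and the names 'rating' and 'share' are hypothetical.

```python
import pandas as pd

# Hypothetical toy column, used only for illustration
ratings = pd.Series(['B', 'A', 'B', 'C', 'B', 'A'], name='rating')

# Proportion of rows for each unique value
shares = ratings.value_counts(normalize=True, sort=False)

# Name the index and the values explicitly when building the DataFrame
table = shares.rename_axis('rating').reset_index(name='share')

print(table)
```

A table like this can be handed to sns.barplot(data=table, x='rating', y='share').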
Print the text (proportion) on the bars
Rather than plotting the data on a relative frequency scale, you might use text annotations to label the frequencies on bars instead.
# Counts for ratings
cr_cnt = df.CreditRating.value_counts(sort=False)
# % of total for ratings
cr_pct = df.CreditRating.value_counts(normalize=True, sort=False)*100
cr_pct = cr_pct.round(2)
# Plot bar chart for credit ratings
sns.countplot(data=df, y="CreditRating", color=color)
# Loop to append % values as text to bars
for i, count in enumerate(cr_cnt):
    # Convert percentage into string (.iloc selects by position)
    pct_str = '{}%'.format(cr_pct.iloc[i])
    plt.text(count-400, i, pct_str, va='center', ha='right', size='x-small', color='white');
Notice that this time we plotted credit rating on the y-axis in sns.countplot(), which rotated the orientation of our plot.
To add the labels, we loop over the bar positions and counts, adding one text element for each bar. plt.text() takes the (x, y) coordinates at which the text is anchored; ha= and va= then set how the text is aligned horizontally and vertically relative to that point. We can also specify text formatting. The image below from Matplotlib explains how the alignment works.
Histograms
A histogram plots the distribution of a numeric variable and can be considered a quantitative counterpart to the bar chart. Instead of displaying a bar for each unique numeric value, it groups values into continuous intervals referred to as bins. Each bin is represented by a bar, indicating the frequency of values within that range.
Plot a default histogram
To plot a simple histogram, use plt.hist(data=df, x='x_variable', bins=n), where 'x_variable' is the column whose distribution you want to plot and n is the number of equal-width bins the range of the data is split into.
# Get zoomed subset where at least one friend was lender and amount was less than $1001
friends_less_1100 = df.query('(InvestmentFromFriendsCount > 0) & (InvestmentFromFriendsAmount < 1001)')
# Plot histogram of friend investment amounts
plt.hist(data=friends_less_1100, x='InvestmentFromFriendsAmount', bins=15, color=color)
# Label axes
plt.ylabel('Count')
plt.xlabel('InvestmentFromFriendsAmount');
Dynamic Bin Specification
The np.arange function's first parameter sets the leftmost bin edge, the second parameter determines the upper limit, and the third specifies the bin width. Note that np.arange returns values strictly less than the upper limit. Therefore, to ensure the inclusion of the maximum data value in the histogram, I've added "+25" (equal to the bin width) to the upper limit. This adjustment guarantees that the rightmost bin edge accommodates all data points.
bins = np.arange(0, friends_less_1100['InvestmentFromFriendsAmount'].max()+25, 25)
plt.hist(data=friends_less_1100, x='InvestmentFromFriendsAmount', bins=bins, color=color)
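A quick way to check that the bin edges really cover the data is to compare the last edge with the maximum value. The sketch below uses made-up amounts whose maximum (990) is deliberately not a multiple of the bin width, and contrasts an offset smaller than the bin width with an offset of one full bin width.

```python
import numpy as np

# Hypothetical amounts; the maximum is not a multiple of the bin width
amounts = np.array([50, 120, 400, 990])
bin_width = 25

# Offset smaller than the bin width: the last edge can fall short of the maximum
too_short = np.arange(0, amounts.max() + 0.25, bin_width)

# Offset equal to the bin width: the last edge always reaches the maximum
full = np.arange(0, amounts.max() + bin_width, bin_width)

print(too_short[-1])  # 975.0
print(full[-1])       # 1000

# Values beyond the last edge are silently dropped from the histogram
counts_short, _ = np.histogram(amounts, bins=too_short)
counts_full, _ = np.histogram(amounts, bins=full)
print(counts_short.sum(), counts_full.sum())  # 3 4
```

The small offset only works when the maximum happens to sit exactly on a bin edge, which is why adding the full bin width is the safer habit.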
Using Seaborn and KDE
Seaborn's sns.histplot works much like plotting a histogram with Matplotlib, but it offers additional functionality. For instance, we can rescale the bar heights to percentages using the stat='percent' parameter. We can also overlay a Gaussian Kernel Density Estimate (KDE) line with the parameter kde=True.
sns.histplot(data=friends_less_1100, x='InvestmentFromFriendsAmount', bins=bins, color=color, stat='percent', kde=True)
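To see what the percent scale means, the same rescaling can be reproduced by hand with NumPy: each bar height becomes that bin's share of all observations, so the heights sum to 100. This is a sketch on made-up amounts, not a reproduction of Seaborn's internals.

```python
import numpy as np

# Hypothetical amounts, used only for illustration
amounts = np.array([25, 40, 60, 110, 130, 300, 320, 900])
bins = np.arange(0, amounts.max() + 100, 100)

# Raw counts per bin
counts, edges = np.histogram(amounts, bins=bins)

# Percent scale: share of all observations in each bin
percent = 100 * counts / counts.sum()

print(counts)
print(percent)
```

Because only the y-axis scale changes, the shape of the histogram is identical to the count version.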
Kernel Density Estimation
Kernel Density Estimation (KDE) is a method used to estimate the probability density function of a variable. Imagine replacing each observation with a small 'lump' of area centered on its value. Adding these lumps together forms the resulting density curve. By default, KDE plots use a normal (Gaussian) kernel, meaning each lump is a small Gaussian curve, and it is the sum of these curves that is plotted.
A word of caution: KDEs are complex tools with inherent assumptions and should be used and interpreted cautiously. They sometimes exhibit implausible behaviors, such as indicating non-zero density in areas of zero probability (for example, suggesting some friends lent negative funds)!
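The "lump" picture can be written out directly. The sketch below builds a Gaussian KDE by hand with NumPy, assuming a hand-picked bandwidth of 50 rather than the automatic bandwidth Seaborn would choose; the observations and the helper name gaussian_kde_by_hand are hypothetical.

```python
import numpy as np

def gaussian_kde_by_hand(observations, grid, bandwidth):
    """Average one Gaussian 'lump' per observation, evaluated on grid."""
    observations = np.asarray(observations, dtype=float)
    # z[i, j]: distance from grid point j to observation i, in bandwidth units
    z = (grid[None, :] - observations[:, None]) / bandwidth
    lumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return lumps.mean(axis=0)

# Hypothetical non-negative amounts
observations = [100, 150, 160, 400]
grid = np.linspace(-500, 1000, 3001)
density = gaussian_kde_by_hand(observations, grid, bandwidth=50)

# The curve encloses a total area of about 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 4))

# The estimate is positive below zero even though no amount is negative
density_below_zero = density[grid < 0].max()
print(density_below_zero > 0)
```

The second check shows the caveat above in numbers: the lump centered on the smallest observation spills past zero, so the curve assigns some density to impossible negative amounts.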
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
data = friends_less_1100.InvestmentFromFriendsAmount
# left plot: kde with rugplot
sns.kdeplot(data, ax=ax1)
sns.rugplot(data, ax=ax1, color='r')
# right plot: kde with narrow bandwidth to show individual probability lumps
sns.kdeplot(data, ax=ax2, bw_adjust=0.2)
sns.rugplot(data, ax=ax2, color='r')
Understanding proportions from KDE (Kernel Density Estimation) plots requires a different approach compared to standard histograms. In a KDE plot, the vertical axis represents data density rather than direct proportions. The area under the curve sums to 1, so to determine the probability of an outcome within a specific range, you compute the area under the curve within that range. However, assessing this area without computational tools can be challenging and prone to inaccuracies.
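For a Gaussian kernel, the area within a range does not have to be estimated by eye: it equals the average, over the observations, of the area of each Gaussian lump inside that range, which the normal CDF gives exactly. The sketch below assumes a fixed, hand-picked bandwidth and made-up observations; the helper names are hypothetical.

```python
from math import erf, sqrt

def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def kde_probability(observations, low, high, bandwidth):
    """Area under a Gaussian KDE between low and high."""
    per_lump = [
        normal_cdf((high - x) / bandwidth) - normal_cdf((low - x) / bandwidth)
        for x in observations
    ]
    return sum(per_lump) / len(per_lump)

# Hypothetical non-negative amounts and a hand-picked bandwidth
observations = [100, 150, 160, 400]

p_mid = kde_probability(observations, 0, 250, bandwidth=50)
p_negative = kde_probability(observations, float('-inf'), 0, bandwidth=50)
p_all = kde_probability(observations, float('-inf'), float('inf'), bandwidth=50)

print(round(p_mid, 3), round(p_negative, 4), p_all)
```

Here three of the four observations lie between 0 and 250, and the KDE assigns that range a probability a little below 0.75 because part of each lump leaks outside it, including a small amount below zero.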
While KDE plots may not offer the intuitive probability judgments of histograms, they can be valuable in certain instances. For example, when data points are sparse, KDE provides a smooth overview of the data distribution, which might be less apparent in histograms. Histograms, with their discrete jumps, can sometimes give misleading impressions, especially with limited data.
An important aspect of KDE is the bandwidth parameter, which determines the width of the density lumps. This parameter plays a similar role to bin width in histograms. Selecting an appropriate bandwidth matters: too narrow a bandwidth exaggerates the noise in the data, while too broad a bandwidth obscures meaningful detail. If the default bandwidth set by your visualization tool seems unsuitable, or you want a closer look at the data, adjust it (in Seaborn, with bw_adjust) and compare the resulting curves.
Pie Charts
Pie charts tend to be less common due to their limited use cases. Best practices for them are fairly restrictive in order to make them useful. The main restrictions include:
- The variable of interest should be configured as relative frequencies to display parts of a whole.
- Limit the chart to 2–5 slices, ensuring that each slice is visibly distinct. If you have many slices or very thin slices, then you'll have to group them differently.
- The highest frequency should start at the top of your pie, and then slices should be placed clockwise in order of relative frequency.
# Create sorted list of frequencies
sorted_terms = df.Term.value_counts()
# Plot pie chart using sorted frequencies
plt.pie(sorted_terms, labels = sorted_terms.index, startangle = 90, counterclock = False,
colors=sns.color_palette('Blues_r'))
# Add title to plot
plt.title('Term Frequency (mo)');
To follow the guidelines in the bullets above, I specify “startangle = 90” and “counterclock = False” to start the first slice at the top of the pie and plot the sorted counts clockwise around the pie.
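When a variable has more categories than the 2–5 slices recommended above, one option is to keep the largest categories and fold the rest into a single slice. The helper below is a hypothetical sketch of that idea on made-up counts; the name group_small_slices and the 'Other' label are my own choices, not part of pandas or Matplotlib.

```python
import pandas as pd

def group_small_slices(counts, max_slices=5, other_label='Other'):
    """Keep the largest categories and fold the remainder into one slice."""
    counts = counts.sort_values(ascending=False)
    if len(counts) <= max_slices:
        return counts
    top = counts.iloc[:max_slices - 1]
    other = pd.Series({other_label: counts.iloc[max_slices - 1:].sum()})
    # The combined slice is placed last, whatever its size
    return pd.concat([top, other])

# Hypothetical category counts, used only for illustration
counts = pd.Series({'A': 520, 'B': 310, 'C': 90, 'D': 45, 'E': 20, 'F': 15})
grouped = group_small_slices(counts, max_slices=4)

print(grouped)
```

The grouped Series can be passed to plt.pie(grouped, labels=grouped.index, startangle=90, counterclock=False) in the same way as the sorted counts above.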
Conclusion
Univariate data exploration is a crucial first step in understanding your dataset. By using bar charts, histograms, KDE plots, and pie charts, you can visualize the distribution, central tendency, and variability of individual features. These tools help identify patterns, trends, and outliers, providing a solid foundation for deeper analysis.
Bar charts are ideal for categorical data, while histograms and KDE plots offer insights into numerical distributions. Pie charts, though less common, can effectively show parts of a whole when used with a limited number of categories.
Each visualization type serves a distinct purpose, and selecting the right one depends on your data and the insights you seek. This initial exploration sets the stage for more advanced analyses, ensuring a thorough understanding of your data from the outset.