Data Wrangling 101: Preparing Your Data in Jupyter

January 3, 2024

Data wrangling, often considered a preliminary but essential step in data analysis, involves converting and mapping raw data into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes. This post aims to unravel the layers of this process, providing a comprehensive guide from the initial stages of defining research questions to the final steps of data visualization. It’s a journey through the best practices and tools essential for any data scientist, ensuring that your data is prepared for analysis.

Define Research Questions

Ensuring that a research question is clearly stated at the beginning of any data analysis is crucial for several reasons:

  1. Guidance: The research question guides your entire data analysis process. It determines what data is needed, what kind of analyses to run, and how to interpret the results.

  2. Clarity: Clearly stating the research question at the beginning helps anyone reading the notebook understand the purpose of the analysis. It sets the context and allows the reader to follow the logic of the study.

  3. Focus: The research question keeps the analysis focused. It’s easy to get lost in the data and go off on tangents. A well-defined research question prevents this by reminding you of the main goal of your analysis.


Import Packages

Based on the PEP8 style guide, imports should be grouped in the following order:

  1. Standard library imports.
  2. Related third-party imports.
  3. Local application/library-specific imports.
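For example, the first code cell of a notebook might group its imports like this (the specific modules are just illustrative):

# Standard library imports
import os
import json

# Related third-party imports
import numpy as np
import pandas as pd

# Local application/library-specific imports
# from my_project import helpers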

Common packages for data wrangling:

  • NumPy
  • pandas
  • requests
  • BeautifulSoup
  • Pillow (PIL)
  • SQLAlchemy
  • scikit-learn
  • openpyxl

Updating packages like pandas within a Jupyter Notebook is not a good practice. Here’s why:

  • Reproducibility: Updating packages in the notebook makes it hard to reproduce the environment. It’s better to state the required versions of all packages at the beginning of your notebook.
  • Clarity: Notebooks should tell a clear story. Including code to update packages can distract from the main narrative.
  • Potential Errors: Updating a package in the notebook could break the code in earlier cells. It’s safer to update packages outside the notebook (e.g., with pip or conda).

Best Practices:

  • Always manage package versions outside the notebook, in a separate environment setup process. This could be done using virtual environments, docker containers, or conda environments.
  • At the beginning of the notebook, state the versions of all major packages used. This can be done in a markdown cell for human readers, and optionally also in a code cell that prints out the actual versions in use when the notebook is run. This can be achieved using the watermark extension, for example:
%load_ext watermark
%watermark -p pandas,numpy,matplotlib

Structure + Documentation

Best Practices for Documentation:

  • Use a consistent style and format throughout your documentation.
  • Use comments to explain what your code is doing.
  • Write documentation for your data, including the data sources, the data types, and the meaning of each variable.
  • Write documentation for your analysis, including the methods used and the rationale behind the choices you made.
  • Write documentation for your results, including the key findings and the implications for the business.
  • Use diagrams and visualizations to help explain your findings.
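As a minimal sketch, some of this documentation can live directly in the notebook, for example as a small data dictionary (the column names below are hypothetical):

# Hypothetical data dictionary documenting each variable
data_dictionary = {
    'user_id': 'Unique identifier for each user (int)',
    'signup_date': 'Date the account was created (YYYY-MM-DD)',
    'plan': 'Subscription tier: free, basic, or premium (str)',
}

for column, description in data_dictionary.items():
    print(f'{column}: {description}')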

Consider using pre-built templates to keep your documentation consistent.


1. Gather Data

Data gathering is an important step in any data science project as it involves collecting the data you need from various sources to use in further analysis. In Python, there are a number of ways to gather data:

  1. Download Data Manually:
import pandas as pd

df = pd.read_csv('file.csv')
print(df.head())
  2. Programmatically Download Files:
import requests

url = 'https://example.com/file.csv'
response = requests.get(url)

with open('file.csv', 'wb') as f:
    f.write(response.content)
  3. Accessing APIs:
import requests
import json

url = 'https://api.github.com/users/octocat'
response = requests.get(url)
data = response.json()

print(json.dumps(data, indent=4))
  4. Web Scraping with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.prettify())
  5. Extracting Data from a SQL Database:
import sqlite3

conn = sqlite3.connect('example.db')
cursor = conn.cursor()
cursor.execute("SELECT * FROM table_name")
rows = cursor.fetchall()

for row in rows:
    print(row)

conn.close()


2. Assess Data

Data assessment involves inspecting your datasets for two things:

  • Data quality issues, such as missing values, duplicates, or incorrect data. This is called dirty data.
  • Data structural issues, such as the same data being stored in different formats or layouts. This is called messy data.

You can search for these issues in two ways:

  • Visually by scrolling
  • Programmatically using code

Dimensions of Data Quality:

Completeness is a metric that helps you understand whether your data is sufficient to answer interesting questions or solve your problem.

# Check for missing values
missing_values = df.isnull().sum()
print(missing_values)

# Drop or fill missing values
df = df.dropna()
# Or fill with a chosen value ('value' is a placeholder, e.g., 0 or a column mean)
df = df.fillna(value)

Validity is a metric that helps you understand how well your data conforms to a defined set of rules for data, also known as a schema.

# Convert string to datetime
df['date_column'] = pd.to_datetime(df['date_column'], format='%d-%m-%Y')

Accuracy is a metric that helps you understand whether your data accurately represents the reality it aims to depict.

# Here, df_truth is the source of truth
accuracy = np.mean(df['column_name'] == df_truth['column_name'])
print(accuracy)

Consistency is a metric that helps you understand two things: whether your data follows a standard format and whether your data’s info matches with information from other data sources.

# Here, df1 and df2 are two different datasets
consistency = np.mean(df1['column_name'] == df2['column_name'])
print(consistency)

Uniqueness is a metric that helps you understand whether there are duplicate or overlapping values in your data.

# Check for duplicates
duplicates = df.duplicated().sum()
print(duplicates)
# Drop duplicates
df = df.drop_duplicates()

Removing columns or variables that are unnecessary for your analysis is called Dimensionality Reduction. Transforming variables, for example by changing their units or deriving ratios, to improve your analysis is called Feature Engineering. Both are common data preprocessing steps that can improve the quality of tidy data, but they don’t count as quality or tidiness issues.
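As a rough illustration (the column names are hypothetical), dropping unneeded columns and deriving a ratio variable might look like this:

# Dimensionality reduction: drop columns not needed for the analysis
df = df.drop(columns=['internal_id', 'notes'])

# Feature engineering: derive a ratio from existing variables
df['price_per_unit'] = df['total_price'] / df['quantity']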

Requirements for Tidiness:

  • Each variable forms a column
  • Each observation forms a row
  • Each type of observational unit forms a table
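A common way to fix a structural (messy data) issue is reshaping a wide table so that each variable forms a column, for example with pandas' melt (the columns below are made up for illustration):

# Wide format: one column per year violates "each variable forms a column"
wide_df = pd.DataFrame({
    'country': ['A', 'B'],
    '2022': [100, 150],
    '2023': [110, 160],
})

# Melt into tidy format: one row per (country, year) observation
tidy_df = wide_df.melt(id_vars='country', var_name='year', value_name='sales')
print(tidy_df)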


3. Clean Data

This is the stage where we solve the problems established during data assessment.

The Data Cleaning Process

Programmatic data cleaning has three steps:

  1. Define: write a data cleaning plan by converting your assessments into cleaning tasks.
  2. Code: translate the data cleaning plan to code and run it to clean the data.
  3. Test: test the dataset, often using code, to ensure the cleaning code works and revisit some of the elements from the assessment phase.
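As a minimal sketch of this define-code-test pattern, a single cleaning task might look like the following (assuming a date column stored as strings, as in the validity example above):

# Define: convert 'date_column' from string to datetime

# Code
clean_df['date_column'] = pd.to_datetime(clean_df['date_column'], format='%d-%m-%Y')

# Test
assert clean_df['date_column'].dtype == 'datetime64[ns]'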

Dealing with Outliers

  • If you already know the valid range, filter it manually using pandas indexing.
  • Identify outliers automatically using the standard deviation method, based on the output of pandas’ df.describe().
  • Drop outliers using df.drop(index=...)
  • Check the impact on summary statistics after dealing with outliers.
  • Instead of dropping outliers, you can choose to keep them separate:
# If we know the valid range of x (known_min and known_max are placeholders)
below_range = df.loc[df['x'] < known_min]
above_range = df.loc[df['x'] > known_max]

# If we don't: use the standard deviation (σ) method
summaries = df.describe().loc[['mean', 'std']]
low_bound = summaries['x']['mean'] - summaries['x']['std']
upper_bound = summaries['x']['mean'] + summaries['x']['std']

# Separate the rows that fall outside the bounds
violating_rows = df[(df['x'] < low_bound) | (df['x'] > upper_bound)]

Dealing with Duplicates

  1. Review duplicates
duplicate_rows = clean_df.loc[clean_df.duplicated(['x','y'])]
  2. Remove rows or convert to NaN
# Keep 1st occurrence
remove_dups = clean_df.drop_duplicates(subset=['x','y'])

# Keep last occurrence
remove_dups = clean_df.drop_duplicates(subset=['x','y'], keep='last')

# Remove specific occurrences by row label
clean_df = clean_df.drop(index=labels_to_drop)  # labels_to_drop is a placeholder list of row labels

# Convert duplicates to NaN
duplicate_index = clean_df[clean_df.duplicated(['x', 'y'])].index
clean_df.loc[duplicate_index, 'x'] = np.nan

Dealing with Missing Data

  1. Replace non-null placeholder values found during assessment (e.g., '#')
# Inspect the rows containing the placeholder
clean_df.loc[clean_df['x'].isin(['#'])]

# Then replace the placeholder with NaN
clean_df['x'] = clean_df['x'].replace({'#': np.nan})
  2. Check missing total
df.isna().sum()
  3. Drop values or fill values
# Each of these returns a new DataFrame; assign the result to keep it

# Drop rows with missing values
df = df.dropna()

# Or drop a column entirely
df = df.drop('x', axis=1)

# Or fill values:
## Impute a column with its mean
df['x'] = df['x'].fillna(df['x'].mean())

## Forward fill
df = df.ffill()

## Back fill
df = df.bfill()

Testing Methods

Visual:

  • Heatmaps to visualize missing data
  • Histograms to visualize the range of data
  • Box plots to visualize outliers

Programmatic:

  • Use Python assert statements to check the data types, number of NA values, etc.
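For example, a few assert statements at the end of the cleaning notebook can verify the cleaned dataset (the exact checks depend on your data):

# No missing values remain
assert clean_df.isna().sum().sum() == 0

# No duplicate rows remain
assert clean_df.duplicated().sum() == 0

# A key column has the expected data type
assert clean_df['date_column'].dtype == 'datetime64[ns]'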


4. Storing Data

Maintaining different versions of your data, particularly the raw data and the cleaned data, has several advantages:

  • It allows you to trace back to the original data if something goes wrong during the data-cleaning process.
  • It provides a clear understanding of the transformations applied to the raw data.
  • It facilitates the reproducibility of your analysis, which is a key aspect of good data science practices.

Save data to a CSV file using Pandas:

df.to_csv('demo/2023_ad_cleaned.csv', index=False, encoding='utf-8')

Save data to a SQL database using SQLAlchemy:

df.to_sql('clean_data', con=connection, if_exists='append', index_label='ID')
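The connection object above comes from SQLAlchemy; a minimal sketch of setting it up (the SQLite URL is a placeholder for your own connection string):

from sqlalchemy import create_engine

# Placeholder database URL; replace with your own connection string
engine = create_engine('sqlite:///example.db')

with engine.connect() as connection:
    df.to_sql('clean_data', con=connection, if_exists='append', index_label='ID')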

It is important to have a structure when storing and publishing your data. A sample structure is:

  • A doc folder for documentation
  • A data folder for raw data
  • A src folder for source code
  • A result folder for cleaned data and analysis

This is particularly helpful for reproducible workflows, where downstream stakeholders can understand the process and replicate your work on their setups.

Save data to a pickle file:

Sometimes during cleaning, you may have adjusted the dtypes of variables to resolve validity issues. However, saving the cleaned data to .csv files does not preserve those data types (everything is written as plain text), which means they’ll be different when you next import the data.

If you need to preserve your data types, you can save your dataframe as a pickle file using:

df.to_pickle("./FILE_NAME.pkl")
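When you need the data again, pd.read_pickle restores the dataframe with its dtypes intact:

df = pd.read_pickle("./FILE_NAME.pkl")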


5. Visualization

[Figure: chart suggestion guide for choosing an appropriate visualization]
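With the data cleaned and stored, a quick plot helps confirm it looks as expected. As a minimal sketch with matplotlib (assuming a numeric column 'x' in the cleaned dataframe):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Histogram to visualize the range of the data
axes[0].hist(clean_df['x'], bins=30)
axes[0].set_title('Distribution of x')

# Box plot to visualize remaining outliers
axes[1].boxplot(clean_df['x'].dropna())
axes[1].set_title('Box plot of x')

plt.tight_layout()
plt.show()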


Conclusion

As we reach the conclusion of “Data Wrangling 101: Preparing Your Data in Jupyter,” it’s clear that the process of transforming raw data into a clean, structured format is no small feat. This guide has walked you through the essential steps of that process, emphasizing the importance of a well-defined research question, the selection of appropriate tools and packages, and the implementation of best practices in data cleaning and storage.

Remember, data wrangling is more than just a pre-processing step; it is the foundation upon which insightful analysis is built. With the knowledge and tools provided in this guide, you’re now better equipped to tackle the complexities of data science projects, transforming raw data into powerful stories and actionable insights. Keep exploring, keep analyzing, and let your data tell its story.