---
title: Merge Data
keywords: fastai
sidebar: home_sidebar
summary: "This notebook demonstrates how to merge datasets by matching the values of a single column shared by two datasets. We add columns of data from a foreign dataset to the ACS data we downloaded in our last tutorial."
description: "This notebook demonstrates how to merge datasets by matching the values of a single column shared by two datasets. We add columns of data from a foreign dataset to the ACS data we downloaded in our last tutorial."
---
This coding notebook is the second in a series.
An interactive version can be found here.
This Colab and more can be found on our webpage.
Content covered in previous tutorials will be used in later tutorials.
New code and information should have explanations or descriptions attached.
Concepts or code covered in previous tutorials will be used without being explained in their entirety.
The Dataplay Handbook uses development techniques covered in the Datalabs Guidebook.
If content cannot be found in the current tutorial and is not covered in previous tutorials, please let me know.
This notebook has been optimized for Google Colab running in a Chrome browser.
Statements found on the index page regarding views expressed, responsibility, errors and omissions, use at your own risk, and licensing apply throughout the tutorial.
Install these libraries onto the virtual environment.
# @title Run: Install Modules
!pip install geopandas
!pip install dataplay
Nothing we haven't already seen.
Our example will merge two simple datasets, pulling CSA names using tract IDs.
The first dataset will be obtained from the Census' ACS 5-year surveys.
The functions used to obtain this data come from Tutorial 0) ACS: Explore and Download.
The second dataset will be obtained from a CSV at a publicly accessible link.
We will use the function we created in our last tutorial to download the data!
# Our download function will use Baltimore City's tract, county and state as internal parameters.
# Changing these values to different geographic reference codes will change those parameters.
tract = '*'
county = '510'
state = '24'
# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
Spatial data can be obtained by using the 2010 Census Tract Shapefile Picking Tool or by searching the Census Bureau's website for TIGER/Line Shapefiles.
"The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau’s demographic data, available on data.census.gov." (Source: census.gov)
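To make the GEOID link concrete, here is a small sketch of how a census tract GEOID is put together: the 2-digit state FIPS code, the 3-digit county FIPS code and the 6-digit tract code are concatenated into an 11-digit string. The helper name `build_tract_geoid` and the sample tract code are illustrative assumptions, not part of the dataplay library.

```python
def build_tract_geoid(state, county, tract):
    """Hypothetical helper: join FIPS codes into an 11-digit census tract GEOID."""
    # zfill restores leading zeros that are lost when codes are stored as integers.
    return str(state).zfill(2) + str(county).zfill(3) + str(tract).zfill(6)

# Baltimore City is state 24, county 510; the tract code here is only an illustrative value.
geoid = build_tract_geoid('24', '510', '10100')
print(geoid)
```

Keeping GEOIDs as zero-padded strings on both sides of a merge avoids silent mismatches between values such as `010100` and `10100`.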
print('Boundaries Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv')
# Get the second dataset.
# Our example dataset contains polygon geometry information.
# We want to merge this over to our principal dataset.
# We will grab it by matching on either CSA or tract.
# The URL listed below is public.
print('Tract 2 CSA Crosswalk : https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv')
import pandas as pd  # needed for read_csv and merge below
inFile = input("\n Please enter the location of your file : \n" )
crosswalk = pd.read_csv( inFile )
crosswalk.head()
The following picture does nothing important but serves as a friendly reminder of the 4 basic join types.
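As a stand-in for that reminder, the four join types can be compared on two tiny tables. The column names and values below are made up for illustration; only `pd.merge` and its `how` argument come from pandas itself.

```python
import pandas as pd

# Two tiny illustrative tables that share only some key values.
left = pd.DataFrame({'tract': [1, 2, 3], 'households': [100, 200, 300]})
right = pd.DataFrame({'tract': [2, 3, 4], 'csa': ['A', 'B', 'C']})

row_counts = {}
for how in ['left', 'right', 'inner', 'outer']:
    merged = pd.merge(left, right, on='tract', how=how)
    row_counts[how] = len(merged)
    # Show which key values survive each kind of join.
    print(how, sorted(merged['tract'].tolist()))
```

A left join keeps every row of the first table, a right join keeps every row of the second, an inner join keeps only keys found in both, and an outer join keeps keys found in either.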
Get columns from both datasets to match on
You can find these column names in the outputs displayed above.
Our examples will work with the values suggested in the prompts.
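Before choosing the two columns, it can help to check that their values are stored the same way. The sketch below assumes one table holds tract codes as zero-padded text and the other as integers; the frames and their values are invented for illustration.

```python
import pandas as pd

# Illustrative data: the same tract codes stored as text on one side and as integers on the other.
principal = pd.DataFrame({'tract': ['010100', '010200'], 'total': [10, 20]})
crosswalk_demo = pd.DataFrame({'TRACT2010': [10100, 10200], 'CSA2010': ['A', 'B']})

# Cast both key columns to the same type so equal codes actually compare equal.
principal['tract'] = principal['tract'].astype('int64')
crosswalk_demo['TRACT2010'] = crosswalk_demo['TRACT2010'].astype('int64')

# Count how many principal keys have a partner in the crosswalk.
shared = set(principal['tract']) & set(crosswalk_demo['TRACT2010'])
print(len(shared), 'of', len(principal), 'principal keys have a match')
```

Casting to integers is only one option; casting both sides to zero-padded strings works just as well, as long as both columns end up in the same form.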
print( 'Principal Columns ' + str(df.columns))
left_on = input("Left on principal column: ('tract') \n" )
print(' \n ')
print( 'Crosswalk Columns ' + str(crosswalk.columns))
right_on = input("Right on crosswalk column: ('TRACT2010') \n" )
Specify how the merge will be performed
We will perform a left merge in this example.
It will return our Principal dataset with columns from the second dataset appended to records where their specified columns match.
how = input("How: ('left', 'right', 'outer', 'inner') " )
Actually perform the merge
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()
As you can see, our Census data will now have a CSA appended to it.
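One way to confirm that every record really received a CSA is the `indicator` argument of `pd.merge`, which adds a `_merge` column describing where each row came from. The data below is invented for illustration, including a tract code that has no match on purpose.

```python
import pandas as pd

# Illustrative data; tract 999999 deliberately has no entry in the crosswalk.
acs_demo = pd.DataFrame({'tract': [10100, 10200, 999999], 'households': [50, 75, 20]})
crosswalk_demo = pd.DataFrame({'TRACT2010': [10100, 10200], 'CSA2010': ['CSA A', 'CSA B']})

checked = pd.merge(acs_demo, crosswalk_demo, left_on='tract', right_on='TRACT2010',
                   how='left', indicator=True)

# Rows flagged 'left_only' found no partner in the crosswalk, so their CSA is missing.
unmatched = checked[checked['_merge'] == 'left_only']
print(checked['_merge'].value_counts())
print(unmatched[['tract', 'households']])
```

If `unmatched` is not empty, it is worth checking the key columns before dropping either of them.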
# Save Data to User Specified File
import csv  # provides the QUOTE_ALL constant used below
outFile = input("Please enter the new filename to save the data to ('acs_csa_merge_test'): " )
merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
flag = input("Enter a URL? If not ACS data will be used. (Y/N): " )
if (flag == 'y' or flag == 'Y'):
    df = pd.read_csv( input("Please enter the location of your Principal file: " ) )
else:
    tract = input("Please enter tract id (*): " )
    county = input("Please enter county id (510): " )
    state = input("Please enter state id (24): " )
    tableId = input("Please enter acs table id (B19001): " )
    year = input("Please enter acs year (18): " )
    # Convert the Y/N answer to a boolean; a non-empty string such as 'N' would otherwise be truthy
    saveAcs = input("Save ACS? (Y/N): " ).strip().lower() == 'y'
    df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
print( 'Principal Columns ' + str(df.columns))
print('Crosswalk Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv')
crosswalk = pd.read_csv( input("Please enter the location of your crosswalk file: " ) )
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '\n')
left_on = input("Left on: " )
right_on = input("Right on: " )
how = input("How: ('left', 'right', 'outer', 'inner') " )
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
# Save the data
saveFile = input("Save File ('Y' or 'N'): ")
if saveFile == 'Y' or saveFile == 'y':
    outFile = input("Saved Filename (Do not include the file extension ): ")
    merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL)
merged_df
Intro
The following Python function is a fleshed-out version of the previous notes.
Description: Adds columns of data from a foreign dataset into a primary dataset according to set parameters.
Purpose: Makes merging datasets simple.
Services
merged_df
Input(s):
Output: File
How it works:
Perform Merge
If the 'how' parameter is equal to ['left', 'right', 'outer', 'inner']
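The check on the 'how' parameter can be sketched with a small stand-alone helper. This is not the dataplay implementation of `mergeDatasets()`. The name `merge_on_columns` is hypothetical, and this sketch simply rejects any value outside the four join types, which may differ from what the real function does.

```python
import pandas as pd

VALID_HOW = ('left', 'right', 'outer', 'inner')

def merge_on_columns(left_df, right_df, left_col, right_col, how='left'):
    """Hypothetical helper: validate the inputs, then merge on one column from each table."""
    if how not in VALID_HOW:
        raise ValueError(f"how must be one of {VALID_HOW}, got {how!r}")
    for name, frame, col in (('left', left_df, left_col), ('right', right_df, right_col)):
        if col not in frame.columns:
            raise KeyError(f"{col!r} is not a column of the {name} dataset")
    return pd.merge(left_df, right_df, left_on=left_col, right_on=right_col, how=how)

# Illustrative data only.
banks = pd.DataFrame({'Census Tract': [10100, 10200], 'Bank Name': ['Bank X', 'Bank Y']})
xwalk = pd.DataFrame({'TRACT2010': [10100], 'CSA2010': ['CSA A']})
result = merge_on_columns(banks, xwalk, 'Census Tract', 'TRACT2010', how='inner')
print(result)
```

Validating the arguments before calling `pd.merge` gives a clearer error message than letting pandas fail partway through.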
Class Diagram mergeDatasets()
%%html
<img src="https://charleskarpati.com/images/class_diagram_merge_datasets.png">
mergeDatasets Flow Chart
%%html
<img src="https://charleskarpati.com/images/flow_chart_merge_datasets.png">
Gantt Chart mergeDatasets()
%%html
<img src="https://charleskarpati.com/images/gannt_chart_merge_datasets.png">
Sequence Diagram mergeDatasets()
%%html
<img src="https://charleskarpati.com/images/sequence_diagram_merge_datasets.png">
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTViIZu-hbvhM3L7dIRAG95ISa7TNhUwdzlYxYzc1ygJoaYc3_scaXHe8Rtj5iwNA/pub?gid=1078028768&single=true&output=csv'
left_col = 'Census Tract'
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
right_col='TRACT2010'
merge_how = 'outer'
interactive = True
use_crosswalk = False
merged_df = mergeDatasets( left_ds=left_ds, left_col=left_col,
                           right_ds=right_ds, right_col=right_col,
                           merge_how='left', interactive =True, use_crosswalk=use_crosswalk )
merged_df.head()
left_col = 'GEOID2010'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv'
right_col ='GEOID10'
# The crosswalk column arguments are omitted here: they are not defined at this point
# and are not used when use_crosswalk=False.
merged_df_geom = mergeDatasets( left_ds=merged_df, left_col=left_col,
                                use_crosswalk=False, crosswalk_ds=False,
                                right_ds=right_ds, right_col=right_col,
                                merge_how='outer', interactive = True )
merged_df_geom.head()
# Primary Table
# Description: I created a public dataset from a google xlsx sheet 'Bank Addresses and Census Tract' from a workbook of the same name.
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTViIZu-hbvhM3L7dIRAG95ISa7TNhUwdzlYxYzc1ygJoaYc3_scaXHe8Rtj5iwNA/pub?gid=1078028768&single=true&output=csv'
left_col = 'Census Tract'
# Alternate Primary Table
# Description: Same workbook, different Sheet: 'Branches per tract'
# Columns: Census Tract, Number branches per tract
# left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSHFrRSHva1f82ZQ7Uxwf3A1phqljj1oa2duGlZDM1vLtrm1GI5yHmpVX2ilTfMHQ/pub?gid=1698745725&single=true&output=csv'
# left_col = 'Number branches per tract'
# Crosswalk Table
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
use_crosswalk = True
crosswalk_left_col = 'TRACT2010'
crosswalk_right_col = 'GEOID2010'
# Secondary Table
# Table: Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv'
right_col ='GEOID10'
merge_how = 'geometry'
interactive = True
merge_how = 'outer'
merged_df_geom = mergeDatasets( left_ds=left_ds, left_col=left_col,
                                use_crosswalk=use_crosswalk, crosswalk_ds=crosswalk_ds,
                                crosswalk_left_col = crosswalk_left_col, crosswalk_right_col = crosswalk_right_col,
                                right_ds=right_ds, right_col=right_col,
                                merge_how=merge_how, interactive = interactive )
merged_df_geom.head()
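The idea behind a crosswalk can be shown with plain pandas as two merges in a row: the first borrows a shared key from the crosswalk, the second uses that key to reach the secondary table. This is a sketch of the concept with invented rows, not a description of what `mergeDatasets()` does internally. The column names mirror the ones used above.

```python
import pandas as pd

# Illustrative rows only; the column names follow the tables used in this tutorial.
banks = pd.DataFrame({'Census Tract': [10100, 10200], 'Bank Name': ['Bank X', 'Bank Y']})
xwalk = pd.DataFrame({'TRACT2010': [10100, 10200],
                      'GEOID2010': [24510010100, 24510010200]})
bounds = pd.DataFrame({'GEOID10': [24510010100, 24510010200],
                       'geometry': ['POLYGON A', 'POLYGON B']})

# Step 1: use the crosswalk to attach the shared key (GEOID2010) to the primary table.
with_geoid = pd.merge(banks, xwalk[['TRACT2010', 'GEOID2010']],
                      left_on='Census Tract', right_on='TRACT2010', how='left')

# Step 2: merge the secondary table on that shared key.
with_geom = pd.merge(with_geoid, bounds, left_on='GEOID2010', right_on='GEOID10', how='outer')
print(with_geom[['Bank Name', 'GEOID2010', 'geometry']])
```

The crosswalk is needed because the primary table only knows the tract code while the boundaries table is keyed by GEOID.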
Here we can save the data so that it may be used in later tutorials.
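import csv  # provides the QUOTE_ALL constant used below
string = 'test_save_data_with_geom_and_csa'
# Save the merged result that carries the geometry column, as the filename indicates
merged_df_geom.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)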
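To see what the saved file preserves, here is a small round trip through an in-memory buffer instead of a file on disk. The single row of data is invented, and reading everything back with `dtype=str` is an assumption about how you may want to reload codes that have leading zeros.

```python
import csv
import io
import pandas as pd

# Illustrative row: a zero-padded tract code and a geometry string that contains commas.
demo = pd.DataFrame({'TRACTCE10': ['010100'], 'CSA': ['CSA A'],
                     'geometry': ['POLYGON ((0 0, 1 0, 1 1, 0 0))']})

# QUOTE_ALL wraps every field in quotes, so commas inside the geometry stay in one cell.
buffer = io.StringIO()
demo.to_csv(buffer, index=False, quoting=csv.QUOTE_ALL)

# Read the text back; dtype=str keeps codes such as '010100' from being parsed as numbers.
buffer.seek(0)
restored = pd.read_csv(buffer, dtype=str)
print(restored.loc[0, 'TRACTCE10'], restored.loc[0, 'geometry'])
```

Without `dtype=str`, pandas would read the tract code as the integer 10100 and the leading zero would be lost.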
Download data by:
In the next tutorial you will learn how to load this data as a geospatial dataset so that it may be mapped and mapping functionalities may be applied to it.
You can upload this data into the next tutorial in one of two ways.
1)
OR.
2)
# When the prompts come up input the values not included from Interactive Example 1 and you will get the same output.
# This is to demonstrate that not all parameters must be known prior to executing the function.
mergeDatasets( left_ds=left_ds, left_col=left_col, right_ds=right_ds, interactive =True )
mergedDataset = mergeDatasets( left_ds=left_ds, left_col=left_col, use_crosswalk=use_crosswalk, right_ds=right_ds, right_col=right_col, merge_how = merge_how, interactive = interactive )
mergedDataset.dtypes
mergeDatasets()
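The pattern of asking only for the parameters that were not supplied can be sketched as follows. The helper `resolve_params` is hypothetical and is not how `mergeDatasets()` is implemented. A canned prompt function replaces keyboard input so the example runs unattended.

```python
def resolve_params(prompt=input, **params):
    """Hypothetical helper: ask for any parameter whose value is None."""
    resolved = {}
    for name, value in params.items():
        if value is None:
            value = prompt(f"Please enter {name}: ")
        resolved[name] = value
    return resolved

# Canned answers stand in for what a user would type at each prompt.
answers = {'right_col': 'TRACT2010', 'merge_how': 'left'}

def fake_prompt(message):
    # Recover the parameter name from the prompt text built in resolve_params.
    name = message[len('Please enter '):-len(': ')]
    return answers[name]

settings = resolve_params(prompt=fake_prompt, left_col='Census Tract',
                          right_col=None, merge_how=None)
print(settings)
```

Passing the prompt function as an argument keeps the helper usable both interactively, with the built-in `input`, and in automated runs.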
# Census Crosswalk
# 'TRACT2010', 'GEOID2010', 'CSA2010'
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
# Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv'
# The Left DS Cols will map to the first three Right DS Cols listed
left_col = 'GEOID2010'
right_col = 'GEOID10'
merge_how = 'outer'
interactive = True
mergeDatasets( left_ds=left_ds, left_col=left_col, right_ds=right_ds, right_col=right_col, merge_how = merge_how, interactive = interactive )