--- title: Merge Data keywords: fastai sidebar: home_sidebar summary: "This notebook demonstrates how to merge datasets by matching a single column's values across two datasets. We add columns of data from a foreign dataset into the ACS data we downloaded in our last tutorial." description: "This notebook demonstrates how to merge datasets by matching a single column's values across two datasets. We add columns of data from a foreign dataset into the ACS data we downloaded in our last tutorial." ---

This Coding Notebook is the second in a series.

An interactive version can be found here: Open In Colab

This colab and more can be found at https://github.com/BNIA/colabs

  • Content covered in previous tutorials will be used in later tutorials.

  • New code and/or information should have explanations and/or descriptions attached.

  • Concepts or code covered in previous tutorials will be used without being explained in their entirety.

  • If content cannot be found in the current tutorial and is not covered in previous tutorials, please let me know.

  • This notebook has been optimized for Google Colab run in a Chrome browser.

  • Statements found in the index page on views expressed, responsibility, errors and omissions, use at risk, and licensing extend throughout the tutorial.

About this Tutorial:

What's Inside?

The Tutorial

In this notebook, the basics of how to perform a merge are introduced.

  • We will merge two datasets
  • We will merge two datasets using a crosswalk

Objectives

By the end of this tutorial users should have an understanding of:

  • How dataset merges are performed
  • The different types of join a merge can take
  • The 'mergeDatasets' function, and how to use it in the future

Guided Walkthrough

SETUP

Install these libraries onto the virtual environment.

{% raw %}
!pip install geopandas
!pip install dataplay
{% endraw %} {% raw %}
# @title Run: Install Modules
{% endraw %} {% raw %}
{% endraw %}

(Optional) Local File Access

Nothing we haven't already seen.

Retrieve Datasets

Our example will merge two simple datasets, pulling CSA names using tract IDs.

The first dataset will be obtained from the Census' ACS 5-year surveys.

Functions used to obtain this data come from Tutorial 0) ACS: Explore and Download.

The second dataset will be obtained using a CSV from a publicly accessible link.

Get the Principal dataset.

We will use the function we created in our last tutorial to download the data!

{% raw %}
# Our download function will use Baltimore City's tract, county and state as internal parameters
# Changing these values in the cell below to different geographic reference codes will change those parameters
tract = '*'
county = '510'
state = '24'

# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False
{% endraw %} {% raw %}
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()
{% endraw %}
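The internals of `retrieve_acs_data` live in the previous tutorial and are not shown here. As a rough sketch only, a download like this typically assembles a Census API request from those same parameters; the helper name and the exact endpoint format below are assumptions for illustration, not the function's actual code.

```python
# Hypothetical helper sketching the kind of URL an ACS download builds.
# The endpoint format is an assumption; see the previous tutorial for
# retrieve_acs_data's real implementation.
def build_acs_url(state, county, tract, table_id, year):
    return (f'https://api.census.gov/data/20{year}/acs/acs5'
            f'?get=group({table_id})'
            f'&for=tract:{tract}&in=state:{state}%20county:{county}')

print(build_acs_url('24', '510', '*', 'B19001', '17'))
```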

Get the Secondary Dataset

Spatial data can be obtained using the 2010 Census Tract Shapefile Picking Tool, or by searching the Census website for TIGER/Line Shapefiles.

The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau's demographic data, available on data.census.gov. (census.gov)

{% raw %}
# Get the second dataset. 
# Our example dataset contains Polygon geometry information. 
# We want to merge this over to our principal dataset. 
# We will grab it by matching on either CSA or Tract.

# The url listed below is public.

print('Tract 2 CSA Crosswalk : https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv')

inFile = input("\n Please enter the location of your file : \n" )

crosswalk = pd.read_csv( inFile )
crosswalk.head()
{% endraw %}

Perform Merge & Save

The following serves as a friendly reminder of the 4 basic join types.

  • Left - Returns all left records; only includes the right record if it has a match
  • Right - Returns all right records; only includes the left record if it has a match
  • Full (Outer) - Returns all records regardless of whether the keys match
  • Inner - Returns only records where the keys match
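A minimal sketch of how those four join types differ, using two toy frames with hypothetical, partially overlapping keys:

```python
import pandas as pd

# Toy frames (hypothetical values): tracts 2 and 3 appear in both.
left = pd.DataFrame({'tract': [1, 2, 3], 'income': [100, 200, 300]})
right = pd.DataFrame({'tract': [2, 3, 4], 'csa': ['A', 'B', 'C']})

for how in ['left', 'right', 'outer', 'inner']:
    merged = pd.merge(left, right, on='tract', how=how)
    print(how, len(merged))  # left and right keep 3 rows, outer 4, inner 2
```

Unmatched rows survive a left, right, or outer merge with NaN filling the missing side's columns; an inner merge drops them entirely.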
{% raw %}
print('Boundaries Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv')
Boundaries Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vQ8xXdUaT17jkdK0MWTJpg3GOy6jMWeaXTlguXNjCSb8Vr_FanSZQRaTU-m811fQz4kyMFK5wcahMNY/pub?gid=886223646&single=true&output=csv
{% endraw %}

Get Columns from both datasets to match on

You can get these values from the column values above.

Our examples will work with the prompted values.

{% raw %}
print( 'Principal Columns ' + str(df.columns) + '')
left_on = input("Left on principal column: ('tract') \n" )

print( 'Crosswalk Columns ' + str(crosswalk.columns) + '')
right_on = input("Right on crosswalk column: ('TRACT2010', or, 'TRACTCE10') \n" )
Principal Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10_000',
       'B19001_003E_Total_$10_000_to_$14_999',
       'B19001_004E_Total_$15_000_to_$19_999',
       'B19001_005E_Total_$20_000_to_$24_999',
       'B19001_006E_Total_$25_000_to_$29_999',
       'B19001_007E_Total_$30_000_to_$34_999',
       'B19001_008E_Total_$35_000_to_$39_999',
       'B19001_009E_Total_$40_000_to_$44_999',
       'B19001_010E_Total_$45_000_to_$49_999',
       'B19001_011E_Total_$50_000_to_$59_999',
       'B19001_012E_Total_$60_000_to_$74_999',
       'B19001_013E_Total_$75_000_to_$99_999',
       'B19001_014E_Total_$100_000_to_$124_999',
       'B19001_015E_Total_$125_000_to_$149_999',
       'B19001_016E_Total_$150_000_to_$199_999',
       'B19001_017E_Total_$200_000_or_more', 'state', 'county', 'tract'],
      dtype='object')
Left on principal column: ('tract') 
tract
Crosswalk Columns Index(['TRACT2010', 'GEOID2010', 'CSA2010'], dtype='object')
Right on crosswalk column: ('TRACT2010', or, 'TRACTCE10') 
TRACT2010
{% endraw %}

Specify how the merge will be performed

We will perform a left merge in this example.

It will return our Principal dataset with columns from the second dataset appended to records where their specified columns match.
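A small sketch of that behavior, with hypothetical toy stand-ins for the principal ACS table and the crosswalk: every principal record survives the left merge, and a tract with no crosswalk match gets NaN in the appended columns.

```python
import pandas as pd

# Toy stand-ins (hypothetical rows); tract 99999 has no crosswalk match.
principal = pd.DataFrame({'tract': [10100, 10200, 99999],
                          'total': [796, 695, 10]})
crosswalk = pd.DataFrame({'TRACT2010': [10100, 10200],
                          'CSA2010': ['Canton', 'Fells Point']})

# Left merge: keep all principal rows, append crosswalk columns on match.
merged = pd.merge(principal, crosswalk,
                  left_on='tract', right_on='TRACT2010', how='left')
```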

{% raw %}
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " )
How: (‘left’, ‘right’, ‘outer’, ‘inner’) inner
{% endraw %}

Actually perform the merge

{% raw %}
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()
B19001_001E_Total B19001_002E_Total_Less_than_$10_000 B19001_003E_Total_$10_000_to_$14_999 B19001_004E_Total_$15_000_to_$19_999 B19001_005E_Total_$20_000_to_$24_999 B19001_006E_Total_$25_000_to_$29_999 B19001_007E_Total_$30_000_to_$34_999 B19001_008E_Total_$35_000_to_$39_999 B19001_009E_Total_$40_000_to_$44_999 B19001_010E_Total_$45_000_to_$49_999 B19001_011E_Total_$50_000_to_$59_999 B19001_012E_Total_$60_000_to_$74_999 B19001_013E_Total_$75_000_to_$99_999 B19001_014E_Total_$100_000_to_$124_999 B19001_015E_Total_$125_000_to_$149_999 B19001_016E_Total_$150_000_to_$199_999 B19001_017E_Total_$200_000_or_more state county TRACT2010 GEOID2010 CSA2010
0 796 237 76 85 38 79 43 36 35 15 43 45 39 5 0 6 14 24 510 190100 24510190100 Southwest Baltimore
1 695 63 87 93 6 58 30 14 29 23 38 113 70 6 32 11 22 24 510 190200 24510190200 Southwest Baltimore
2 2208 137 229 124 52 78 87 50 80 13 217 66 159 205 167 146 398 24 510 220100 24510220100 Inner Harbor/Fed...
3 632 3 20 0 39 7 0 29 8 9 44 29 98 111 63 94 78 24 510 230300 24510230300 South Baltimore
4 836 102 28 101 64 104 76 41 40 47 72 28 60 19 27 15 12 24 510 250207 24510250207 Cherry Hill
{% endraw %}

As you can see, our Census data will now have a CSA appended to it.

{% raw %}
# Save Data to User Specified File
outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test'): " )
merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL) 
Please enter the new Filename to save the data to ('acs_csa_merge_test': asdfsaf
{% endraw %}

Final Result

{% raw %}
flag = input("Enter a URL? If not ACS data will be used. (Y/N):  " )
if (flag == 'y' or flag == 'Y'):
  df = pd.read_csv( input("Please enter the location of your Principal file: " ) )
else:
  tract = input("Please enter tract id (*): " )
  county = input("Please enter county id (510): " )
  state = input("Please enter state id (24): " )
  tableId = input("Please enter acs table id (B19001): " ) 
  year = input("Please enter acs year (18): " )
  saveAcs = input("Save ACS? (True/False): " )
  df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)

print( 'Principal Columns ' + str(df.columns))

print('Crosswalk Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv')

crosswalk = pd.read_csv( input("Please enter the location of your crosswalk file: " ) )
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '\n')

left_on = input("Left on: " )
right_on = input("Right on: " )
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " )

merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)

# Save the data
saveFile = input("Save File ('Y' or 'N'): ")
if saveFile == 'Y' or saveFile == 'y':
  outFile = input("Saved Filename (Do not include the file extension ): ")
  merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL);
Enter a URL? If not ACS data will be used. (Y/N):  y
Please enter the location of your Principal file: https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv
Principal Columns Index(['TRACT2010', 'GEOID2010', 'CSA2010'], dtype='object')
Crosswalk Example: https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv
Please enter the location of your crosswalk file: https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv
Crosswalk Columns Index(['TRACT2010', 'GEOID2010', 'CSA2010'], dtype='object')

Left on: TRACT2010
Right on: TRACT2010
How: (‘left’, ‘right’, ‘outer’, ‘inner’) inner
Save File ('Y' or 'N'): n
{% endraw %} {% raw %}
merged_df
GEOID2010_x CSA2010_x GEOID2010_y CSA2010_y
0 24510010100 Canton 24510010100 Canton
1 24510010200 Patterson Park N... 24510010200 Patterson Park N...
2 24510010300 Canton 24510010300 Canton
3 24510010400 Canton 24510010400 Canton
4 24510010500 Fells Point 24510010500 Fells Point
... ... ... ... ...
196 24510280402 Edmondson Village 24510280402 Edmondson Village
197 24510280403 Beechfield/Ten H... 24510280403 Beechfield/Ten H...
198 24510280404 Allendale/Irving... 24510280404 Allendale/Irving...
199 24510280500 Oldtown/Middle East 24510280500 Oldtown/Middle East
200 0 Baltimore City 0 Baltimore City

201 rows × 4 columns

{% endraw %} {% raw %}
# Change some of those parameters
tract = '*'
county = '510'
state = '24'

# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False
{% endraw %}

Advanced

Intro

The following Python function is a fleshed-out version of the previous notes.

  • It contains everything from the tutorial plus more.
  • It can be imported and used in future projects or stand alone.

Description: Add columns of data from a foreign dataset into a primary dataset along set parameters.

Purpose: Makes merging datasets simple.

Services

  • Merge two datasets without a crosswalk
  • Merge two datasets with a crosswalk
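A minimal sketch of the crosswalk service, with hypothetical toy frames: when the left and right datasets share no common column, the merge is chained through the intermediate crosswalk table in two steps.

```python
import pandas as pd

# Toy frames (hypothetical values): left matches the crosswalk on tract,
# and the crosswalk's GEOID matches the right dataset.
left = pd.DataFrame({'tract': [190100, 190200], 'households': [796, 695]})
crosswalk = pd.DataFrame({'TRACT2010': [190100, 190200],
                          'GEOID2010': [24510190100, 24510190200]})
right = pd.DataFrame({'GEOID10': [24510190100, 24510190200],
                      'CSA': ['Southwest Baltimore', 'Southwest Baltimore']})

# Step 1: attach the crosswalk's GEOID to each left record.
step1 = left.merge(crosswalk, left_on='tract', right_on='TRACT2010', how='left')
# Step 2: use that GEOID to pull columns from the right dataset.
final = step1.merge(right, left_on='GEOID2010', right_on='GEOID10', how='left')
```

Without a crosswalk, only the second-style merge is needed, matching left_col directly to right_col.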
{% raw %}
{% endraw %} {% raw %}

mergeDatasets[source]

mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False, use_crosswalk=True, left_col=False, right_col=False, crosswalk_left_col=False, crosswalk_right_col=False, merge_how=False, interactive=True)

{% endraw %}

Function Explanation

Input(s):

  • Dataset url
  • Crosswalk Url
  • Right On
  • Left On
  • How
  • New Filename

Output: File

How it works:

  • Read in datasets
  • Perform merge

    • If the 'how' parameter is one of ['left', 'right', 'outer', 'inner'], then a merge will be performed.
    • If a column name is provided in the 'how' parameter, then that single column will be pulled from the right dataset as a new column in the left_ds.
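The branching described above can be sketched in plain pandas; this is an illustration of the logic, not the function's actual implementation, and the toy frames are hypothetical.

```python
import pandas as pd

# Toy frames (hypothetical values).
left = pd.DataFrame({'tract': [1, 2], 'income': [100, 200]})
right = pd.DataFrame({'tract': [1, 2],
                      'CSA2010': ['Canton', 'Fells Point'],
                      'GEOID2010': [101, 102]})

how = 'CSA2010'  # a column name rather than a join type
if how in ['left', 'right', 'outer', 'inner']:
    # Normal merge: 'how' is a join type.
    out = pd.merge(left, right, on='tract', how=how)
else:
    # 'how' names a column: pull only that column from the right dataset.
    out = pd.merge(left, right[['tract', how]], on='tract', how='left')
```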

Function Diagrams

{% raw %}
#@title Run: Diagram the mergeDatasets()

%%html
<img src="https://charleskarpati.com/images/class_diagram_merge_datasets.png">
{% endraw %} {% raw %}
#@title Run: mergeDatasets Flow Chart

%%html
<img src="https://charleskarpati.com/images/flow_chart_merge_datasets.png">
{% endraw %} {% raw %}
#@title Run: Gannt Chart  mergeDatasets()

%%html
<img src="https://charleskarpati.com/images/gannt_chart_merge_datasets.png">
{% endraw %} {% raw %}
#@title Run: Sequence Diagram  mergeDatasets()

%%html
<img src="https://charleskarpati.com/images/sequence_diagram_merge_datasets.png">
{% endraw %}

Function Examples

Interactive Example 1

{% raw %}
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSHFrRSHva1f82ZQ7Uxwf3A1phqljj1oa2duGlZDM1vLtrm1GI5yHmpVX2ilTfMHQ/pub?gid=601362340&single=true&output=csv'
left_col = 'Census Tract'

# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
right_col='TRACT2010'

merge_how = 'outer'
interactive = True
use_crosswalk = True

merged_df = mergeDatasets( left_ds=left_ds, left_col=left_col, 
              right_ds=right_ds, right_col=right_col, 
              merge_how='left', interactive =True, use_crosswalk=use_crosswalk )
{% endraw %} {% raw %}
merged_df.head()
TRACT2010 GEOID2010_x CSA2010_x GEOID2010_y CSA2010_y
0 10100 24510010100 Canton 24510010100 Canton
1 10200 24510010200 Patterson Park North & East 24510010200 Patterson Park North & East
2 10300 24510010300 Canton 24510010300 Canton
3 10400 24510010400 Canton 24510010400 Canton
4 10500 24510010500 Fells Point 24510010500 Fells Point
{% endraw %}

Example 1.5 ) Get CSA and Geometry with a Crosswalk.

{% raw %}
# Primary Table
# Description: I created a public dataset from a google xlsx sheet 'Bank Addresses and Census Tract' from a workbook of the same name.
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSHFrRSHva1f82ZQ7Uxwf3A1phqljj1oa2duGlZDM1vLtrm1GI5yHmpVX2ilTfMHQ/pub?gid=601362340&single=true&output=csv'
left_col = 'Census Tract'

# Alternate Primary Table
# Description: Same workbook, different Sheet: 'Branches per tract' 
# Columns: Census Tract, Number branches per tract
# left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSHFrRSHva1f82ZQ7Uxwf3A1phqljj1oa2duGlZDM1vLtrm1GI5yHmpVX2ilTfMHQ/pub?gid=1698745725&single=true&output=csv'
# left_col = 'Number branches per tract'

# Crosswalk Table
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'
use_crosswalk = True
crosswalk_left_col = 'TRACT2010'
crosswalk_right_col = 'GEOID2010'

# Secondary Table
# Table: Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTPKW6YOHPFvkw3FM3m5y67-Aa5ZlrM0Ee1Fb57wlGuldr99sEvVWnkej30FXhSb3j8o9gr8izq2ZRP/pub?output=csv'
right_col ='GEOID10'

# Note: merge_how can also be a column name (e.g. 'geometry') to pull
# only that single column; here we perform a full outer merge instead.
merge_how = 'outer'
interactive = True

merged_df = mergeDatasets( left_ds=left_ds, left_col=left_col, 
              use_crosswalk=use_crosswalk, crosswalk_ds=crosswalk_ds,
              crosswalk_left_col = crosswalk_left_col, crosswalk_right_col = crosswalk_right_col,
              right_ds=right_ds, right_col=right_col, 
              merge_how=merge_how, interactive = interactive )

merged_df.head()
{% endraw %}

Here we can save the data so that it may be used in later tutorials.

{% raw %}
string = 'test_save_data_with_geom_and_csa'
merged_df.to_csv(string+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)
{% endraw %}

Download data by:

  • Clicking the 'Files' tab in the left hand menu of this screen.
  • Locating your file within the file explorer that appears directly under the 'Files' tab button once clicked.
  • Right clicking the file in the file explorer and selecting the 'download' option from the dropdown.

In the next tutorial you will learn how to load this data as a geospatial dataset so that it may be mapped and mapping functionalities may be applied to it.

You can upload this data into the next tutorial in one of two ways.

1) Uploading the saved file to Google Drive and connecting to your drive path.

OR

2) First downloading the dataset as directed above, then navigating to the next tutorial and uploading the data using the file 'upload' button accessible within the 'Files' tab in the left hand menu of that screen. The next tutorial will teach you how to load this data so that it may be mapped.

Interactive Example 2

{% raw %}
# When the prompts come up, input the values not included from Interactive Example 1 and you will get the same output.
# This is to demonstrate that not all parameters must be known prior to executing the function.

mergeDatasets( left_ds=left_ds, left_col=left_col, right_ds=right_ds, interactive =True )
{% endraw %} {% raw %}
mergedDataset = mergeDatasets( left_ds=left_ds, left_col=left_col, use_crosswalk=use_crosswalk, right_ds=right_ds, right_col=right_col, merge_how = merge_how, interactive = interactive )
{% endraw %} {% raw %}
mergedDataset.dtypes
{% endraw %}

Interactive Run Alone

{% raw %}
mergeDatasets()
{% endraw %}

Preconfigured Example 1

{% raw %}
# Census Crosswalk
# 'TRACT2010', 'GEOID2010', 'CSA2010'
left_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vREwwa_s8Ix39OYGnnS_wA8flOoEkU7reIV4o3ZhlwYhLXhpNEvnOia_uHUDBvnFptkLLHHlaQNvsQE/pub?output=csv'

# Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
right_ds = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vTPKW6YOHPFvkw3FM3m5y67-Aa5ZlrM0Ee1Fb57wlGuldr99sEvVWnkej30FXhSb3j8o9gr8izq2ZRP/pub?output=csv'
# The Left DS Cols will map to the first three Right DS Cols listed
left_col = 'GEOID2010'
right_col = 'GEOID10'
merge_how = 'outer'
interactive = True
{% endraw %} {% raw %}
mergeDatasets( left_ds=left_ds, left_col=left_col, right_ds=right_ds, right_col=right_col, merge_how = merge_how, interactive = interactive )
{% endraw %}