There are four main ways to combine DataFrames in pandas: concatenating, appending, merging and joining. Each has its own use cases and best practices. In Tutorial 11A we discussed both concatenating and appending; in this tutorial, we discuss merge and join.
Concat and append have limited capabilities for joining tables through keys and for taking the inter-relationships between tables into account. Merge and join, on the other hand, combine DataFrames based on a key. Following the concepts of relational databases such as SQL, there are three types of relationships between tables:
One-to-one: When each primary key value relates to only one (or no) record in the related table (a short sketch follows this list).
One-to-many: When the primary key table contains only one record that relates to none, one, or many records in the related table.
Many-to-many: When each record in both tables can relate to any number of records (or no records) in the other table.
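Before the one-to-many and many-to-many examples below, here is a minimal sketch of the one-to-one case; the loyalty table and its values are made up purely for illustration:
import pandas as pd
# Hypothetical one-to-one data: each customer has at most one loyalty record
customers_1to1 = pd.DataFrame({'Customer_ID': ['1', '2', '3'],
                               'First_Name': ['A1', 'A2', 'A3']})
loyalty = pd.DataFrame({'Customer_ID': ['1', '3'],
                        'Points': [120, 40]})
# The default inner merge on the shared key keeps only customers 1 and 3,
# which have a matching loyalty record
print(pd.merge(customers_1to1, loyalty, on='Customer_ID'))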
The rest of this section shows how merge handles these relationships. The following example uses merge for a one-to-many relationship between a table of customer details and a table of each customer's shopping history.
import pandas as pd
customers= pd.DataFrame({'Customer_ID': ['1', '2', '3', '4'],
'First_Name': ['A1','A2','A3','A4'],
'Last_Name': ['B1', 'B2', 'B3', 'B4']})
shopping_history= pd.DataFrame({'Customer_ID': ['1', '1', '1', '4','5'],
'Product_ID':['100','200','300','400','500'],
'product': ['Oil','Sugar','Tea','Milk','Eggs']})
merged_df= pd.merge(customers,shopping_history)
print(customers)
print(shopping_history)
print(merged_df)
merged_df= pd.merge(customers,shopping_history, on="Customer_ID")
print (merged_df)
Note that if the key is not specified, merge uses the overlapping column names as the keys.
When the DataFrames have no common column names, you need to specify the keys to merge on explicitly. Use the on, left_on and right_on parameters to define the key columns in each DataFrame.
customers2= pd.DataFrame({'CID': ['1', '2', '3', '4'],
'First_Name': ['A1','A2','A3','A4'],
'Last_Name': ['B1', 'B2', 'B3', 'B4']})
merged_onkey= pd.merge(customers2,shopping_history,left_on='CID',right_on='Customer_ID')
print (customers2)
print (shopping_history)
print (merged_onkey)
In the merged table, some records are dropped because they have no corresponding record in the other DataFrame (customers 2 and 3, who have no purchases, and the Eggs purchase by customer 5). This happens because the default merge method is the inner join. To avoid losing information, you can use an outer join instead. The outer join can be full outer (keeping all keys from both DataFrames), left (keeping all keys from the left DataFrame) or right (keeping all keys from the right DataFrame), selected with the how parameter.
print (customers)
print (shopping_history)
merged_outer= pd.merge(customers,shopping_history, how='outer')
print (merged_outer)
merged_left= pd.merge(customers,shopping_history, how='left')
print (merged_left)
merged_right= pd.merge(shopping_history,customers, how='right')
print (merged_right)
A more complicated relationship to manage with merge is many-to-many. Consider the example of customers and products: the customers DataFrame contains information about customers, while products holds details about different grocery items. One customer can buy none or many products, and any product can be purchased by none or many customers. Pandas manages this relationship with the merge method. Like SQL's JOIN clause, pandas.merge allows two DataFrames to be joined on one or more keys. The function provides a series of parameters (on, left_on, right_on, left_index, right_index) that let you specify the columns or indexes on which to join. In version 0.17.0, pandas added the indicator argument: if set to True, a Categorical column called _merge is added to the output, taking the values 'left_only', 'right_only' or 'both'. Here is a simple example.
df1 = pd.DataFrame({'key': ['a', 'a', 'b', 'c', 'a', 'b','c'], 'd1': range(7)})
df2 = pd.DataFrame({'key': ['d', 'd', 'b', 'b', 'b', 'a'], 'd1': range(6)})
print (df1)
print (df2)
merge= pd.merge(df1,df2, how='outer', indicator=True)
print (merge)
print (df1)
print (df2)
merge= pd.merge(df1,df2, how='outer', indicator=True, on='d1')
print (merge)
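merge can also match rows on more than one key by passing a list of column names to on. The following is a small sketch with made-up frames:
# Hypothetical frames that share two key columns
left = pd.DataFrame({'key1': ['a', 'a', 'b'],
                     'key2': ['x', 'y', 'x'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['a', 'b', 'b'],
                      'key2': ['x', 'x', 'y'],
                      'rval': [4, 5, 6]})
# Rows match only when both key1 and key2 agree; unmatched rows are kept by the outer join
print(pd.merge(left, right, on=['key1', 'key2'], how='outer'))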
When one of the columns holds a meaningful index, it can replace the default DataFrame index. By default, set_index returns a new DataFrame, so you have to specify inplace=True if you want the change to happen in place.
customers.set_index('Customer_ID', inplace=True)
shopping_history.set_index('Customer_ID', inplace=True)
print (customers)
print (shopping_history)
The DataFrame also has a convenient join method for merging on the index. This is used when you have objects with similar row labels, but different columns.
joined= customers.join(shopping_history,how="outer")
print (customers)
print (shopping_history)
print (joined)
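If the two DataFrames share column names that are not part of the index, join needs suffixes to disambiguate them. A minimal sketch with hypothetical frames:
# Hypothetical frames with an overlapping column name 'Name'
left = pd.DataFrame({'Name': ['A1', 'A2']}, index=['1', '2'])
right = pd.DataFrame({'Name': ['Oil', 'Tea']}, index=['1', '3'])
# Without lsuffix/rsuffix this would raise an error because the 'Name' columns clash
print(left.join(right, how='outer', lsuffix='_cust', rsuffix='_prod'))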
The combine_first method is used when we want to "patch" missing values in one object with values for matching indices in the other object. Note that it only takes a value from the right DataFrame when that value is missing in the left DataFrame. Consider the following example:
import numpy as np
data1= pd.DataFrame([[1,3,np.nan],[np.nan,10,np.nan],[np.nan,5,3]])
data2= pd.DataFrame([[10,np.nan,4],[np.nan,5,3],[2,4,np.nan]])
print (data1)
print (data2)
data= data1.combine_first(data2)
print (data)
Hierarchical indexing provides a more structured way of presenting tabular data. There are two main methods for pivoting data with hierarchical indexing: stack and unstack.
import pandas as pd
import numpy as np
tuples = list(zip(*[['L1_A', 'L1_A', 'L1_B', 'L1_B','L1_C', 'L1_C', 'L1_D', 'L1_D'], ['L2_1', 'L2_2', 'L2_1','L2_2','L2_1', 'L2_2', 'L2_1','L2_2']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
data = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
print (data)
result=data.stack()
print (result)
data.unstack()
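Both methods can also target a specific index level by position or name; continuing with the data defined above:
# Move the outer ('first') index level into the columns instead of the inner one
print(data.unstack(level='first'))
# Unstacking the stacked result restores the original shape
print(result.unstack())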
Exercise: Try to implement the same code using pivot (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html)
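For reference, pivot reshapes long-format data by using one column for the new index, one for the new columns and one for the values. The small frame below is made up for illustration:
# Hypothetical long-format data
long_df = pd.DataFrame({'first': ['L1_A', 'L1_A', 'L1_B', 'L1_B'],
                        'second': ['L2_1', 'L2_2', 'L2_1', 'L2_2'],
                        'value': [1, 2, 3, 4]})
# One row per 'first' value, one column per 'second' value
print(long_df.pivot(index='first', columns='second', values='value'))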
The DataFrame method duplicated returns a Boolean Series indicating whether each row is a duplicate of an earlier row:
data = pd.DataFrame({'k1': ['Milk'] * 3 + ['Cheese'] * 4, 'k2': [1, 1, 2, 3, 3, 4, 4]})
print (data)
data.duplicated()
cleandata= data.drop_duplicates()
print (cleandata)
As you may have noticed, applying drop_duplicates on the whole DataFrame considers all the columns together. We can alternatively specify the column(s) on which to detect duplication.
v1= data.drop_duplicates(['k1'])
print (v1)
v2= data.drop_duplicates(['k2'])
print (v2)
Pandas lets you modify the index and column labels using the map or rename methods. An example follows:
idf= pd.DataFrame(np.arange(12).reshape((3, 4)),
index=['nsw', 'vic', 'tasmania'],
columns=['day1', 'day2', 'day3', 'day4'])
print (idf)
idf.index.map(str.upper)
idf.rename(index=str.upper, columns=str.upper)
idf
Please note that rename returns a new object and keeps the original data unchanged. To change the original DataFrame, you need to do the renaming in place. You can also rename only a subset of labels, as shown after the next example.
#idf.index=idf.index.map(str.upper)
#OR
idf.rename(index=str.upper, columns=str.upper, inplace=True)
idf
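To rename only a subset of labels, pass a dict mapping old names to new ones (using the upper-cased labels produced by the in-place rename above):
# Only the listed columns are renamed; every other label is left untouched
idf.rename(columns={'DAY1': 'd1', 'DAY2': 'd2'}, inplace=True)
idf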
More details and examples can be found at http://pandas.pydata.org/pandas-docs/stable/merging.html, in "Python for Data Analysis" (pages 177-193), and at https://nikolaygrozev.wordpress.com/2015/07/01/reshaping-in-pandas-pivot-pivot-table-stack-and-unstack-explained-with-pictures/