Tutorial 10A: Merging DataFrames

This part discusses the basic and essential practical steps for integrating data from various sources.

Dealing with data from different sources requires essentially the integration between these sources to have a full information view. For example, if you collect data about house pricing separately in each suburb, you might need to "concatenate" data together to get a view of the house prices all over Victoria and even Australia. However, each data collection might have different attributes. How to merge them together to get a complete yet precise (with no duplication) presentation of the whole collection? Consider another example for sales department, when you have customer details in one table and product details in another table. How can we use Python (pandas library) to execute the join between the two DataFrame and manage the many to many relationship. This is exactly about implementing database techniques for merging tables using python on a general data.

Panda library has offered methods to manage the data integration task. In the following, we will discuss each method with examples.

There are four ways to merge/combine between different DataFrames in pandas:

  • concatenating,
  • appending,
  • merging, and
  • joining.

Each has its own use cases and best practice. We present concatenating and appending in this tutorial, the other two methods will be discussed in the next tutorial (week 11B). In all of these methods, we assume data is fetched or scrapped from the web (as explained in week 10) or it is stored locally on the machine. We also assume that the data from each source is stored in a DataFrame structure. We start first with the easiest way for merging DataFrame with concatenating.

Methods for integrating data with Pandas:

1. Concatenating:

This refers to gluing together data from different DataFrames by stacking them either vertically or side by side. Consider the following example.

Create a dataframe

In [ ]:
#Import module
import pandas as pd

# Create a dataframe
df1= pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                      'First_Name': ['A1','A2','A3','A4'],
                    'Last_Name': ['B1', 'B2', 'B3', 'B4']},
                  index=[1,2,3,4])
df1

Create a second dataframe

In [ ]:
df2= pd.DataFrame({'Student_ID': ['4', '5', '6', '7','8'],
                      'First_Name': ['A4','A5','A6','A7','A8'],
                    'Last_Name': ['B4', 'B5', 'B6', 'B7','b8']},
                 index= [4,5,6,7,8])
df2

Create a third dataframe

In [ ]:
df3= pd.DataFrame({'Student_ID': ['9', '10', '11', '12'],
                    'Last_Name': ['B9', 'B10', 'B11', 'B12'],
                  'address':['AD9','AD10','AD11','AD12']})
df3

Join the two dataframes along rows

In [ ]:
concat_1_2= pd.concat([df1,df2])
concat_1_2

Join the two dataframes along columns

In [ ]:
concat_1_2= pd.concat([df1,df2], axis=1)
concat_1_2
In [ ]:
concat_1_2_3= pd.concat([df1,df2,df3])
concat_1_2_3

As explained in this example, concatenation glues different DataFrames together without considering the index of any or duplication that might cause. When concatenating, you need to make sure that all DataFrames have the same headers. Otherwise, the values of any column that does not exist in the union of headers of concatenated DataFrame will be replaced with NAN values, as shown when concatenated df3 with df1 and df2.

Note also, concatenation assigns the default index when the index was not provided in the original DataFrame (as in df3).
"ignore_index" method is used in case you want to concatenate the DataFrames and ignore the indexes (because they are not meaningful)

In [ ]:
result = pd.concat([df1, df3], ignore_index=True)
print(df1)
print(df3)
print(result)

The concatenation is very useful when you have data (with the same attributes) coming from different sources i.e., house prices collected from each suburb. So we glue them together to get a view for house prices all over victoria. We can also add another key column to indicate the source of each chunk of data in a hierarchical way. The example below explains this as follows:

In [ ]:
concat_1_2= pd.concat([df1,df2], keys=['source1','source2'])
print(concat_1_2)

An important use case for concatenating is when different perspectives of the same data are collected, so we need to concatenate all the information together to get all the details in one DataFrame. The following example illustrates the concatenation of DataFrame represents landlord profile with another DataFrame for the property details. The concatenation in this case has to be implemented along the horizontal axis (axis=1) rather than the vertical one, while the default concatenation axis is 0 (vertical).

In [ ]:
df4=pd.DataFrame({'Student_ID': ['1', '2', '3', '4'],
                      'Addres': ['AD1','AD2','AD3','AD4'],
                    'year': ['Y1', 'Y2', 'Y3', 'Y4']},
                 index={1,2,3,4}
                 )
concat_1_4_horizontal= pd.concat([df1,df4],axis=1)
print(concat_1_4_horizontal)
concat_2_4_horizontal= pd.concat([df2,df4],axis=1)
print(concat_2_4_horizontal)

Concatenating df2 and df4 results in many NAN values because of the non-overlapping in index. Therefore, joining the DataFrames is very useful in this case. Concatenation gives you the three options to handle the other axes too (apart from the one we concatenate on) using outer or inner options or using a specific index. The outer option is the default gets the union of data and grantees zero loss. The previous example shows the outer join of df2 and df4. On the other hand, inner gets the intersection between the two DataFrames. Last but not least, the “join_axes” argument is used for joining with a specific index.

In [ ]:
joint_result= pd.concat([df2, df4], axis=1, join='inner')
print(joint_result)
joint_result2= pd.concat([df2,df4],axis=1,join_axes= [df2.index])
print(df2)
print(df4)
print(joint_result)
print(joint_result2)

2. Append:

Append method in Series and DataFrames is a shortcut of concatenating. It is easy to use but not efficient in terms of performance. When appending a DataFrame, the original DataFrame remains in memory and a new appended one is created. In the appending process, the indexes must be disjoint but the columns do not need to be.

In [ ]:
appended_df= df1.append([df2,df3])
print(appended_df)

You can mix/concatenate both Series and DataFrames. Concat (and therefore append) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. In the following section, we will introduce merge and join as a more efficient way for merging DataFrames in Tutorial 6.

In [ ]: