{inner, outer, left, right}
In 2008, Wes Mckinney was at the Hedge Fund AQR and developed a small piece of software which became the pre-cursor to Pandas, developing and finalising it later on. Since then, Pandas has become one of the most important Python libraries that most Data Scientists (if not all) use on a daily basis.
It allows users to manage and manipulate data with levels of efficiency not seen before. From loading data, managing missing data, pivoting and reworking data: Panda’s doesn’t (really) achieve anything new, but, existing tools were inefficient, whereas Panda’s makes light work of these complicated tasks.
Now regardless of whether you use SQL or Pandas, you need to know how to join tables. A table join is a process by which you combine two separate ‘tables’ (or in Pandas land, DataFrames) together.

Let’s play with some actual data and say we have the following DataFrame (or in SQL land, Table):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'],'C': ['C0', 'C1', 'C2', 'C3'],'D': ['D0', 'D1', 'D2', 'D3']},index=[0, 1, 2, 3])
which looks something like this:

Note that the index values have been set to [0,1,2,3]
. Remember that an index
to a table is an identifier to the location of a particular row in a table. For example, your office may be on the 3rd floor
in an building. But in some other buildings, the may call the 3rd floor
something arbitrary “Floor C
” for example. In this case, Floor C
and Floor 3
are equivalent, their just identifiers of a location, like an index.
Say we want to combine df1
with the following DataFrame, df4
:
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'], 'D': ['D2', 'D3', 'D6', 'D7'],'F': ['F2', 'F3', 'F6', 'F7']}, index=[2, 3, 6, 7])
which looks like this:

Now depending on how we want to combine the data depends on what type of a join we use. Note that we generally ‘join’ tables (or DataFrames) based on their index.
If we want to combine the data so we only have index rows that are in both A and B, then we would use an ‘inner’ join.
Inner Join
So as you can see, here we simply use the pd.concat
function to bring the data together, setting the join
setting to 'inner’
:
result = pd.concat([df1, df4], axis=1, join='inner')
which as can be seen, only has index rows [2,3]
. That’s because index values of 2
and 3
exist in both DataFrames 1
and 4
.

Now all the other rows (or index values) in each DataFrame that are not in the other DataFrame (e.g. df1
contains index row values of [0, 1]
which are not in df4,
these are thrown away).
Now if we want to retain all the index values from both DataFrame, we can use an Outer Join.
Outer Join
So when we change the join
setting to outer
, we can see that now, all index values are present ({0,1,2,3}
from df1
, and {2,3,6,7}
from df2
).

Let’s now move onto something which some of you may have deduced:
Left Join
Say we want to stick to the index values in df1
, for that, we would then do a left join:
df1.merge(df4, how=’left’)
This would give us the corresponding DataFrame:

and likewise on the other side:
Right Join
Say we want to stick to the index values in df4
, for that, we would then do a right join:
df1.merge(df4, how=’right’)
This would give us the corresponding DataFrame:

So there you go! Definitely take this away and have a go yourself. Joins are a bit confusing and it takes a few different tries to fully get your head around it but try to remember the following for if you have two tables A
and B
:
- Inner joins only take index rows for what’s in
A
andB
- Outer joins make a new table that contain index values for both
A
orB
, and set the missing segments tonan
. - Left joins make a new table that contain index values for both only the ‘left’ table, which in our case would be
df1
, orA
. - Right joins make a new table that contain index values for both only the ‘right’ table, which in our case would be
df4
, or B.
Thanks for reading again!! Let me know if you have any questions and I’ll be happy to help.
Keep up to date with my latest work here!
Leave a Reply