Pandas Basics


Honestly, I don’t want to use pandas DataFrames at all. They always feel difficult to use; maybe my skill level just isn’t there yet. But I often have to use pandas to read and write the lab’s Excel files and to output results to Excel, and the seaborn I used before is built on pandas too, so let me review the basic operations.

Reading, Writing and Creating

Data Reading and Writing

SQL

  • Reading
import pandas as pd
from sqlalchemy import create_engine
db = create_engine("mysql+pymysql://username:password@localhost:3306/database_name?charset=utf8") # MySQL's port defaults to 3306
sql = "select * from text"
df = pd.read_sql(sql, db, index_col="index") # index_col sets index column, auto-generated by default
  • Writing
df.to_sql(name='test', con=db,
          if_exists="append", # If table exists: append, replace (delete and recreate), fail (don't insert)
          index=False # Don't write the df index to the database
          )

Excel

  • Reading
df = pd.read_excel(r'file_path',
                   sheet_name='specify sheet, default is the first',
                   index_col=0, # Set column 0 as index; omit to auto-generate a new index
                   header=0, # Set row n as columns, default is 0, can be None (auto-generate 0-n columns)
                   usecols=[0, 2] # Only import columns 0 and 2
                   )
  • Writing
'''
Write to different sheets
'''
# Create workbook
excelWriter = pd.ExcelWriter('file_path/test.xlsx')
# Write to workbook
df.to_excel(
    excelWriter,
    sheet_name='',
    index=False, # Don't write df index to excel
    encoding='utf-8',
    columns=['a', 'b'], # Specify columns to write
    na_rep=0, # Missing value handling (fill with 0)
    inf_rep=0, # Infinity handling (fill with 0)
)
# Save (won't take effect without saving)
excelWriter.save()
'''
Direct write
'''
df.to_excel('file_path/test.xlsx') # Parameters: index, encoding, columns, na_rep, inf_rep
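
On newer pandas versions, writing several sheets is usually done with a context manager, which saves the file automatically when the block exits; a minimal sketch (sheet names are illustrative):

with pd.ExcelWriter('file_path/test.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    df.to_excel(writer, sheet_name='Sheet2', index=False)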

CSV

  • Reading
df = pd.read_csv(
    r'file_path/test.csv',
    sep=",", # Specify delimiter, default is comma
    nrows=2, # Specify number of rows to read
    encoding='utf-8',
    engine='python', # Add this when path contains Chinese characters
    usecols=[0, 2], # Only import columns 0, 2
    index_col=0, # Set column 0 as index
    header=0 # Set row n as columns, default is 0, can be None
)
  • Writing
df.to_csv(
    r'file_path/test.csv',
    index=False, # Don't write index column
    columns=['a', 'b'], # Specify columns to write
    sep=',', # Set delimiter (default is comma)
    na_rep=0, # Fill missing values with 0
    encoding='utf-8',
)

TXT

  • Reading
pd.read_table(r'file_path/test.txt', sep='\t') # Default delimiter is tab; can also read csv files

Source: https://www.modb.pro/db/26894

Creating

What annoys me most is that DataFrame doesn’t have functions like np.zeros, np.ones to initialize an empty DataFrame based on an existing one.

df_empty = pd.DataFrame(columns=['A', 'B', 'C', 'D'])

So one approach is to extract column names from an existing DataFrame and then create a new one.
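
A minimal sketch of both ideas, assuming df is an existing DataFrame and numpy is imported as np:

df_empty = pd.DataFrame(columns=df.columns) # Empty frame with the same columns
# An np.zeros-style equivalent: same shape, index and columns, all zeros
df_zeros = pd.DataFrame(np.zeros(df.shape), columns=df.columns, index=df.index)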

Indexing

Pandas data indexing is like a book’s table of contents, allowing us to quickly find the chapter we want. For large amounts of data, creating reasonable and business-meaningful indexes is crucial for data analysis.

Understanding Indexes

Below is an example of indexes in a simple DataFrame:

[Figure: pandas index — row and column indexes in a simple DataFrame]

Where:

  • The row index is the data’s index; each column label points to a Series
  • The DataFrame’s index is also the index of every Series that makes up its columns
  • Building indexes makes data more intuitive, e.g., each row is for a specific country
  • Building indexes facilitates data processing
  • Indexes allow duplicates, but typically shouldn’t in business scenarios

Sometimes data with many row and column levels will have multi-level indexes.

Building Indexes

Previously we learned that we can specify indexes when loading data to generate a DataFrame:

data = 'https://www.gairuo.com/file/data/dataset/team.xlsx'
df = pd.read_excel(data, index_col='name') # Set index to name
df
'''
      team  Q1  Q2  Q3  Q4
name
Liver    E  89  21  24  64
Arry     C  36  37  37  57
Ack      A  57  60  18  84
Eorge    C  93  96  71  78
Oah      D  65  49  61  86
'''

If we didn’t specify an index when loading, we can use df.set_index() to specify one:

df.set_index('month') # Set month as index
df.set_index(['month', 'year']) # Set month and year as multi-level index
'''
            sale
month year
1     2012    55
4     2014    40
      2013    84
10    2014    31
'''

s = pd.Series([1, 2, 3, 4])
df.set_index(s) # Specify an index
df.set_index([s, 'year']) # Specify both a custom index and existing field
df.set_index([s, s**2]) # Computed index

# Other parameters
df.set_index('month', drop=False) # Keep original column
df.set_index('month', append=True) # Keep original index
df.set_index('month', inplace=True) # Build index and overwrite df

Resetting Indexes

Sometimes we want to cancel the existing index and start over; for that we can use df.reset_index():

df.reset_index() # Clear index
df.set_index('month').reset_index() # Equivalent to doing nothing
# Delete original index, month column is gone
df.set_index('month').reset_index(drop=True)
df2.reset_index(inplace=True) # Overwrite to take effect
# Cancel year level index
df.set_index(['month', 'year']).reset_index(level=1)
df2.reset_index(level='class') # Same as above, using the level name
df.reset_index(level='class', col_level=1) # With multi-level columns, insert the labels into column level 1
# Fill the other column levels with the specified value
df.reset_index(level='class', col_level=1, col_fill='species')

Index Types

To accommodate various business data processing needs, indexes come in different types for different kinds of data:

Numeric Index

There are several types:

  • RangeIndex: Immutable index over a monotonic integer range
  • Int64Index: int64 type, an immutable ndarray forming an ordered, sliceable set
  • UInt64Index: For unsigned integer labels
  • Float64Index: float64 type
pd.RangeIndex(1,100,2)
# RangeIndex(start=1, stop=100, step=2)
pd.Int64Index([1,2,3,-4], name='num')
# Int64Index([1, 2, 3, -4], dtype='int64', name='num')
pd.UInt64Index([1,2,3,4])
# UInt64Index([1, 2, 3, 4], dtype='uint64')
pd.Float64Index([1.2,2.3,3,4])
# Float64Index([1.2, 2.3, 3.0, 4.0], dtype='float64')
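
Note that pandas 2.0 removed the specialized Int64Index, UInt64Index and Float64Index classes; a plain pd.Index now carries the numeric dtype. The modern equivalent of the first example:

pd.Index([1, 2, 3, -4], dtype='int64', name='num')
# Index([1, 2, 3, -4], dtype='int64', name='num')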

Categorical Index

A categorical can only take a limited number of (usually fixed) possible values (categories). Think of it as an enum: gender, for example, has only male and female, yet appears in every row of the data; processing it as plain text would be inefficient. The underlying implementation is pandas.Categorical.

pd.CategoricalIndex(['a', 'b', 'a', 'b'])
# CategoricalIndex(['a', 'b', 'a', 'b'], categories=['a', 'b'], ordered=False, dtype='category')

Categories will be explained in detail later; their advantages only show on very large datasets, as the sketch below illustrates.
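
A rough sketch of that advantage on a toy gender column (exact numbers depend on your environment):

s = pd.Series(['male', 'female'] * 100000)
s.memory_usage(deep=True) # object dtype: stores a Python string per row
s.astype('category').memory_usage(deep=True) # category: small integer codes plus two category values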

Interval Index

pd.interval_range(start=0, end=5)
'''
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              closed='right',
              dtype='interval[int64]')
'''

Multi Index

Will be explained in detail later in the tutorial.

arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
'''
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])
'''

DateTime Index

# From one date continuously to another date
pd.date_range(start='1/1/2018', end='1/08/2018')
# Specify start time and periods
pd.date_range(start='1/1/2018', periods=8)
# Monthly periods
pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
# Nested periods
pd.period_range(start=pd.Period('2017Q1', freq='Q'),
                end=pd.Period('2017Q2', freq='Q'), freq='M')

Timedelta Index

import datetime
import numpy as np

pd.TimedeltaIndex(data=['06:05:01.000030', '+23:59:59.999999',
                        '22 day 2 min 3us 10ns', '+23:29:59.999999',
                        '+12:19:59.999999'])
# Using datetime and numpy timedelta objects
pd.TimedeltaIndex(['1 days', '1 days, 00:00:05',
                   np.timedelta64(2, 'D'),
                   datetime.timedelta(days=2, seconds=2)])

Period Index

t = pd.period_range('2020-5-1 10:00:05', periods=8, freq='S')
pd.PeriodIndex(t,freq='S')

Index Object

Row and column indexes in Pandas are actually Index objects. Here’s how to create an index object:

Creating Objects

pd.Index([1, 2, 3])
# Int64Index([1, 2, 3], dtype='int64')
pd.Index(list('abc'))
# Index(['a', 'b', 'c'], dtype='object')
# Can define a name
pd.Index(['e', 'd', 'a', 'b'], name='something')

Viewing

df.index
# RangeIndex(start=0, stop=4, step=1)
df.columns
# Index(['month', 'year', 'sale'], dtype='object')

Properties

The following methods also apply to df.columns, since they are all index objects:

# Properties
df.index.name # Name
df.index.array # array
df.index.dtype # Data type
df.index.shape # Shape
df.index.size # Number of elements
df.index.values # array
# Others, less commonly used
df.index.empty # Is empty
df.index.is_unique # Has no duplicates
df.index.names # List of names
df.index.is_all_dates # All datetime
df.index.has_duplicates # Has duplicates
df.index.values # Index values array

Operations

The following methods also apply to df.columns, since they are all index objects:

# Methods
df.index.astype('int64') # Convert type
df.index.isin() # Check existence, see example below
df.index.rename('number') # Rename index
df.index.nunique() # Number of unique values
df.index.sort_values(ascending=False,) # Sort, descending
df.index.map(lambda x:x+'_') # Map function processing
df.index.str.replace('_', '') # String replace
df.index.str.split('_') # Split
df.index.to_list() # Convert to list
df.index.to_frame(index=False, name='a') # Convert to DataFrame
df.index.to_series() # Convert to series
df.index.to_numpy() # Convert to numpy
df.index.unique() # Deduplicate
df.index.value_counts() # Deduplicate with counts
df.index.where(df.index=='a') # Filter
df.index.rename('grade', inplace=False) # Rename index name
df.index.rename(['species', 'year']) # Multi-level, rename index names
df.index.max() # Maximum value
df.index.argmax() # Index of maximum value
df.index.any()
df.index.all()
df.index.T # Transpose, useful for multi-level indexes

Index Renaming

Modify row and column index names.

# Rename column indexes one by one
df.rename(columns={"A": "a", "B": "c"})
df.rename(str.lower, axis='columns')
# Rename row indexes
df.rename(index={0: "x", 1: "y", 2: "z"})
df.rename({1: 2, 2: 4}, axis='index')
# Modify data type
df.rename(index=str)
# Re-modify indexes
replacements = {l1:l2 for l1, l2 in zip(list1, list2)}
df.rename(replacements)

Index Name Renaming

Note: this modifies the name of the axis (index or columns), not the index values or column labels themselves:

s.rename_axis("animal") # Rename index
df.rename_axis(["dow", "hr"]) # Modify multi-level index names
df.rename_axis('info', axis="columns") # Modify column index name
# Modify multi-level row index names
df.rename_axis(index={'a': 'A', 'b': 'B'})
# Modify multi-level column index names
df.rename_axis(columns={'name': 's_name', 'b': 'B'})
df.rename_axis(columns=str.upper) # Column index names to uppercase

Query and Modify

Data Inspection

When we first get data, we need to do a spot check. On one hand to understand the data structure, on the other hand to randomly check data quality issues. Common methods:

Syntax        Operation                         Return Type
df.head(n)    View first n rows of DataFrame    DataFrame
df.tail(n)    View last n rows of DataFrame     DataFrame
df.sample(n)  View n random samples             DataFrame

All of the above select entire rows.

View Head df.head()

After loading data, we usually need to look at the head data:

df.head()
out:
    name team  Q1  Q2  Q3  Q4
0  Liver    E  89  21  24  64
1   Arry    C  36  37  37  57
2    Ack    A  57  60  18  84
3  Eorge    C  93  96  71  78
4    Oah    D  65  49  61  86

# Can specify quantity
df.head(15)

View Tail df.tail()

View the last tail data.

df.tail()
out:
        name team  Q1  Q2  Q3  Q4
95   Gabriel    C  48  59  87  74
96   Austin7    C  21  31  30  43
97  Lincoln4    C  98  93   1  20
98       Eli    E  11  74  58  91
99       Ben    E  21  43  41  74

# Can specify quantity
df.tail(15)

View Sample df.sample()

df.sample() randomly returns one sample row.

df.sample()
out:
     name team  Q1  Q2  Q3  Q4
79  Tyler    A  75  16  44  63

# Can specify quantity
df.sample(15)

Data truncation:

# Remove data before and after index
df.truncate(before=2, after=4) # Only keep index 2-4
s.truncate(before="60", after="66")
df.truncate(before="A", after="B", axis="columns") # Select columns

Column Operations

Either of the following two ways can reference a column:

df['name'] # Returns the column as a Series
df.name
df.Q1
# df.1Q doesn't work even if the column is named 1Q (not a valid identifier)
# df.my name doesn't work because of the space; use an underscore instead

Note that when the column name is a valid Python variable, it can be used directly as an attribute.

Selecting Rows and Columns

Sometimes we need to select partial columns or rows by condition. Common methods:

Operation                  Syntax         Return Type
Select column              df[col]        Series
Select row by index        df.loc[label]  Series
Select row by position     df.iloc[loc]   Series
Select rows by slice       df[5:10]       DataFrame
Filter rows by expression  df[bool_vec]   DataFrame

Next we will focus on these query methods.

Slicing []

We can use slicing like Python lists to select rows, but a single index is not supported (df[0] would be treated as a column lookup):

df[:2] # First two rows
df[4:10]
df[:] # All data, rarely used
df[:10:2] # By step
s[::-1] # Reverse order

Can also select columns:

df['name'] # Only one column, Series
df[['Q1', 'Q2']] # Select two columns
df[['name']] # Select one column, returns DataFrame, note difference from above

By Label .loc

The format of df.loc is df.loc[<row part>, <column part>]; the row part supports the following forms:

Single label:

# Represents index, strings need quotes
df.loc[0] # Select row with index 0
df.loc[8]

Single list of labels:

df.loc[[0,5,10]] # Rows with index 0, 5, 10
df.loc[['Eli', 'Ben']] # If index is name
# Boolean selection, length must match index
df.loc[[False, True]*50] # Show rows where True, every other one

Slice with labels (includes start and stop):

df.loc[0:5] # Index slice, rows 0-5, including 5
df.loc['2010':'2014'] # If index is time, can query with string
df.loc[:] # All
# This method supports Series

To filter columns, the row part must also be supplied:

df.loc[:, ['Q1', 'Q2']] # All rows, Q1 and Q2 columns
df.loc[:10, 'Q1':] # Rows 0-10, Q1 and all columns after it

By Position .iloc

df.iloc is similar to df.loc, but only uses natural index (0-n index for rows and columns), not labels.

df.iloc[:3]
df.iloc[:]
df.iloc[2:20:3]
s.iloc[:3]
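
Like .loc, .iloc accepts a column part after the comma; note that, unlike .loc, an .iloc slice excludes its stop position:

df.iloc[:3, [0, 2]] # First 3 rows, columns at positions 0 and 2
df.iloc[:3, 1:3] # First 3 rows, columns at positions 1 and 2 (3 is excluded)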

Get Specific Value .at

Similar to loc, but only gets a single specific value; the structure is df.at[<row label>, <column label>]:

# Note: string index needs quotes
df.at[4, 'Q1'] # 65
df.at['lily', 'Q1'] # 65 assuming index is name
df.at[0, 'name'] # 'Liver'
df.loc[0].at['name'] # 'Liver'
# Get value of other column corresponding to specified column value
df.set_index('name').at['Eorge', 'team'] # 'C'
df.set_index('name').team.at['Eorge'] # 'C'
# Get value at specified index of column
df.team.at[3] # 'C'

Similarly, iat like iloc only supports numeric index:

df.iat[4, 2] # 65
df.loc[0].iat[1] # 'E'

.get can do dictionary-like operations, returns default value if no value (0 in example):

df.get('name', 0) # Is name column
df.get('nameXXX', 0) # 0, returns default value
s.get(3, 0) # 93, Series passes index returns specific value
df.name.get(99, 0) # 'Ben'

Expression Filtering

[] slicing can use expressions for filtering:

df[df['Q1'] == 8] # Q1 equals 8
df[~(df['Q1'] == 8)] # Not equal to 8
df[df.name == 'Ben'] # Name is Ben
df.loc[df['Q1'] > 90, 'Q1':] # Q1 greater than 90, show columns from Q1 onward
df.loc[(df.Q1 > 80) & (df.Q2 < 15)] # and relationship
df.loc[(df.Q1 > 90) | (df.Q2 < 90)] # or relationship
df[df.Q1 > df.Q2]

The index part in df.loc can use expressions for data filtering.

df.loc[df['Q1'] == 8] # Equals 8
df.loc[df.Q1 == 8] # Equals 8
df.loc[df['Q1'] > 90, 'Q1':] # Q1 greater than 90, show columns from Q1 onward
# Other expressions same as slicing

df.loc[:, lambda df: df.columns.str.len()==4] # Boolean sequence
df.loc[:, lambda df: [i for i in df.columns if 'Q' in i]] # Column name list
df.iloc[:3, lambda df: df.columns.str.len()==2] # Boolean sequence

Logical comparison functions:

df.eq() # Equal ==
df.ne() # Not equal !=
df.le() # Less than or equal <=
df.lt() # Less than <
df.ge() # Greater than or equal >=
df.gt() # Greater than >
# All support axis{0 or 'index', 1 or 'columns'}, default 'columns'
df[df.Q1.ne(89)] # Q1 not equal to 89
df.loc[df.Q1.gt(90) & df.Q2.lt(90)] # and relationship Q1>90 Q2<90

Other functions:

# isin
df[df.team.isin(['A','B'])] # Contains groups A and B
df[df.isin({'team': ['C', 'D'], 'Q1':[36,93]})] # Complex query, other values are NaN

Function Filtering

df[lambda df: df['Q1'] == 8] # Q1 is 8
df.loc[lambda df: df.Q1 == 8, 'Q1':'Q2'] # Q1 is 8, show Q1 Q2

where and mask

s.where(s > 90) # Non-matching values become NaN
s.where(s > 90, 0) # Non-matching values become 0
# np.where, greater than 80 is True otherwise False
np.where(s>80, True, False)
np.where(df.num>=60, 'pass', 'fail')

s.mask(s > 90) # Matching values become NaN
s.mask(s > 90, 0) # Matching values become 0

# Example: keep values divisible by 3, show the negated value for the rest
m = df.loc[:,'Q1':'Q4'] % 3 == 0
df.loc[:,'Q1':'Q4'].where(m, -df.loc[:,'Q1':'Q4'])

# Takes equal-length lists of row and column labels, returns an array
df.lookup([1,3,4], ['Q1','Q2','Q3']) # array([36, 96, 61])
df.lookup([1], ['Q1']) # array([36])
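
Note that df.lookup() was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent using Index.get_indexer, valid for the default integer index used here:

rows, cols = [1, 3, 4], ['Q1', 'Q2', 'Q3']
df.to_numpy()[df.index.get_indexer(rows), df.columns.get_indexer(cols)]
# array([36, 96, 61])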

query

df.query('Q1 > Q2 > 90') # Write SQL-like where statement directly
df.query('Q1 + Q2 > 180')
df.query('Q1 == Q2')
df.query('(Q1<50) & (Q2>40) and (Q3>90)')
df.query('Q1 > Q2 > Q3 > Q4')
df.query('team != "C"')
df.query('team not in ("E","A","B")')
# For column names with spaces, use backticks
df.query('B == `team name`')

# Supports passing variables, e.g., 40 points above average
a = df.Q1.mean()
df.query('Q1 > @a+40')
df.query('Q1 > `Q2`+@a')

# df.eval() usage similar to df.query
df[df.eval("Q1 > 90 > Q3 > 10")]
df[df.eval("Q1 > `Q2`+@a")]

filter

Use filter to select rows and columns by their labels.

df.filter(items=['Q1', 'Q2']) # Select two columns
df.filter(regex='Q', axis=1) # Column names containing Q
df.filter(regex='e$', axis=1) # Ending with e
df.filter(regex='1$', axis=0) # Regex, index names ending with 1
df.filter(like='2', axis=0) # Index contains 2
# Index starting with 2, column names containing Q
df.filter(regex='^2', axis=0).filter(like='Q', axis=1)

Merge and Add Rows/Columns

Adding Columns

Assume original data is as follows:

import pandas as pd
import numpy as np

df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
                  index=['dog', 'hawk'])
slen = len(df['num_legs'])
  1. Direct assignment
df['a'] = pd.Series(np.random.randn(slen), index=df.index) # Remember to add index
df['b'] = None # Add a column with None values
df['c'] = [2, 4] # Add list data

# c1 and c3 columns have same order, c2 is opposite, see below
df['c1'] = ['no', 'yes']
df.index = [1, 0]
df['c2'] = pd.Series(['no', 'yes'])
df['c3'] = pd.Series(['no', 'yes'], index=df.index)
  2. loc method
df.loc[:,'d'] = pd.Series(np.random.randn(slen), index=df.index)
df.loc[:, 'd'] = [2, 4]
  3. insert method

The insert method’s column name cannot duplicate an existing one; even updating a just-inserted column this way raises an error

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(slen), index=df.index))
df.insert(len(df.columns), 'ee', [1,2])
  4. assign method

assign method parameters can be Series, scalar, list, and can add multiple columns at once

df = df.assign(f=df.num_wings.mean())  # Use num_wings column mean as new column f
df = df.assign(A=df.num_wings.sum(), B=[1,2])  # Add columns A and B
  5. concat method
pd.concat([df, pd.Series(['yes', 'yes']).rename('t')], axis=1) # Add column t

Notes:

  • Each method’s parameters can be Series, scalar, or list
  • insert method’s new column name cannot be same as existing ones, even updating just-added columns will error
  • df['c2'] = pd.Series(['no', 'yes']) aligns on the Series’ own index; if the DataFrame’s index order has been modified, pass index=df.index to follow the DataFrame’s current index order (see c2 vs. c3 above)

Adding Rows

import pandas as pd
import numpy as np

# Create empty DataFrame
df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
  1. Using loc
for i in range(4):
    df.loc[i] = [np.random.randint(-1, 1) for n in range(3)]
    # df.loc[i] = 5 add a record with all values as 5
  2. Using append
df.append({'lib': 2, 'qty1': 3, 'qty2': 4}, ignore_index=True)

# append can also directly add DataFrame
df2 = pd.DataFrame([[1,2,3], [2,3,4]], columns=['lib', 'qty1', 'qty2'])
df.append(df2, ignore_index=True)  # ignore_index=True will ignore df2's index
  3. Regenerate DataFrame

Inside a loop, collect the rows to be added as dictionaries in a list, then create the DataFrame from that list.

row_list = []
input_rows = [[1,2,3], [2,3,4]] # Data to insert
for row in input_rows:
    dict1 = dict(lib=row[0], qty1=row[1], qty2=row[2]) # Convert data to dict
    row_list.append(dict1) # Save to list
df = pd.DataFrame(row_list)

Merge

append() Vs. concat()

When concatenating or merging DataFrames, there are generally two directions: vertical and horizontal. Vertical means stacking two or more DataFrames top to bottom into one, with no alignment at all, even when indexes or columns overlap; it is plain, crude concatenation. Horizontal merging looks for shared indexes and merges all column data with the same index onto one row, much like using Excel’s VLOOKUP to combine two tables that share an id but hold different data.

Common functions for concatenating and merging DataFrames are append(), concat() and merge(). append() can only concatenate vertically, while concat() and merge() can do both. concat() defaults to vertical concatenation (same as append()); for a horizontal merge you must set the axis argument: concat(axis=0) is vertical (the default), concat(axis=1) is horizontal.
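
A minimal sketch with toy frames. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in 2.0, so new code should prefer concat():

df1 = pd.DataFrame({'id': [1, 2], 'val': ['a', 'b']})
df2 = pd.DataFrame({'id': [3, 4], 'val': ['c', 'd']})
pd.concat([df1, df2], ignore_index=True) # Vertical stack, same as the old df1.append(df2, ignore_index=True)
pd.concat([df1, df2], axis=1) # Horizontal, aligned on index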

concat() Vs. merge()

After discussing concat() and append() horizontal and vertical concatenation above, let’s talk about the difference between concat() and merge(). Both can do horizontal and vertical merging, but there are some differences in usage and output.

Here we introduce the concept of a join. concat() defaults to an outer join, while merge() defaults to an inner join. Another important difference is that concat() merges by index, while merge() merges by column label; if the merge column has been set as the index, reset it before merging, otherwise a KeyError occurs.

# concat with outer join (default)
pd.concat([population, unemployment], axis=1)

# concat with inner join
pd.concat([population, unemployment], axis=1, join='inner')

For merge operations:

# merge() default is inner join, parameter is 'how' not 'join'
pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip')

# If column is set as index, need to reset_index() first
population = population.reset_index()
unemployment = unemployment.reset_index()
pd.merge(population, unemployment, left_on='ZipCode', right_on='Zip', how='outer')

# Using same column name with on parameter
population.rename(columns={'ZipCode':'Zip'}, inplace=True)
pd.merge(population, unemployment, on=['Zip'], how='outer')

join() Vs. concat()

join has four merge methods: how='left', how='right', how='inner' and how='outer'. merge() also has all these methods. merge() is the most powerful of the four functions, but in practice you don’t always need to use merge() - sometimes append() and concat() may be more convenient.

# join default merge method is how='left'
population.join(unemployment)

# join with right
population.join(unemployment, how='right')

df1.join(df2, how='left') merges based on the left DataFrame. join and concat both merge by index, so the corresponding indexes must exist. concat lacks join’s left and right methods, but its outer and inner results are identical to join’s:

population.join(unemployment, how='outer')
pd.concat([population, unemployment], join='outer', axis=1)
# Above two results are the same

population.join(unemployment, how='inner')
pd.concat([population, unemployment], join='inner', axis=1)
# Above two results are the same

Summary of append(), concat(), join() and merge()

append()

Syntax: df1.append(df2)

Description: append() simply stacks two DataFrames vertically, no index needed.

concat()

Syntax: pd.concat([df1, df2])

Description: concat() can merge multiple DataFrames horizontally or vertically, supports the inner and outer methods, and aligns on the index.

join()

Syntax: df1.join(df2)

Description: join() supports several merge methods: besides inner and outer it also offers left and right; these operations also align on the index.

merge()

Syntax: pd.merge(df1, df2)

Description: The most versatile merge function; it merges on columns, so no index is needed.
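
To make the differences concrete, a side-by-side sketch on two toy frames (the data and names are illustrative):

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'y': [3, 4]}, index=['b', 'c'])
pd.concat([left, right]) # Vertical stack; non-shared columns filled with NaN
pd.concat([left, right], axis=1) # Horizontal outer join on the index
left.join(right, how='inner') # Index join: only 'b' survives
pd.merge(left, right, left_index=True, right_index=True, how='outer') # merge can also use the index explicitly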

merge_ordered() Function

The merge_ordered() function performs two operations in one call: merge() and sort_values().

pd.merge_ordered(hardware, software, on=['', ''], suffixes=['', ''], fill_method='ffill')
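
A runnable sketch with hypothetical data (the hardware/software tables and the Date key are made up for illustration):

hardware = pd.DataFrame({'Date': ['2020-01', '2020-03'], 'hw_sales': [100, 120]})
software = pd.DataFrame({'Date': ['2020-02', '2020-03'], 'sw_sales': [80, 90]})
pd.merge_ordered(hardware, software, on='Date', fill_method='ffill')
# Rows come out sorted by Date, with gaps forward-filled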