Download data aggregation in python and more Summaries Programming Languages in PDF only on Docsity!
Python | Pandas dataframe.aggregate()
Python is a great language for doing data analysis, primarily because of the fantastic
ecosystem of data-centric Python packages. Pandas is one of those packages and makes
importing and analyzing data much easier.
Dataframe.aggregate() function is used to apply some aggregation across one or more
column. Aggregate using callable, string, dict, or list of string/callables. Most frequently used
aggregations are:
sum: Return the sum of the values for the requested axis
min: Return the minimum of the values for the requested axis
max: Return the maximum of the values for the requested axis
Syntax: DataFrame.aggregate(func, axis=0, *args, **kwargs)
Parameters:
func : callable, string, dictionary, or list of string/callables. Function to use for aggregating
the data. If a function, must either work when passed a DataFrame or when passed to
DataFrame.apply. For a DataFrame, can pass a dict, if the keys are DataFrame column
names.
axis : (default 0) {0 or ‘index’, 1 or ‘columns’} 0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
Returns: Aggregated DataFrame
For link to CSV file Used in Code, click here
Example #1: Aggregate ‘sum’ and ‘min’ function across all the columns in data frame.
importing pandas package
import pandas as pd
making data frame from csv file
df = pd.read_csv("nba.csv")
printing the first 10 rows of the dataframe
df[:10]
Aggregation works with only numeric type columns.
Applying aggregation across all the columns
sum and min will be found for each
numeric type column in df dataframe
df.aggregate(['sum', 'min'])
Output:
For each column which are having numeric values, minimum and sum of all values has been
found. For dataframe df , we have four such columns Number, Age, Weight, Salary.
Example #2:
In Pandas, we can also apply different aggregation functions across different columns. For
that, we need to pass a dictionary with key containing the column names and values
containing the list of aggregation functions for any specific column.
importing pandas package
import pandas as pd
making data frame from csv file
df = pd.read_csv("nba.csv")
We are going to find aggregation for these columns
df.aggregate({"Number":['sum', 'min'], "Age":['max', 'min'], "Weight":['min', 'sum'], "Salary":['sum']})
df = pd.read_csv("nba.csv")
Print the dataframe
df
Now apply the groupby() function.
applying groupby() function to
group the data on team value.
gk = df.groupby('Team')
Let's print the first entries
in all the groups formed.
gk.first()
Output :
Let’s print the value contained any one of group. For that use the name of the team. We use
the function get_group() to find the entries contained in any of the groups.
Finding the values contained in the "Boston Celtics" group
gk.get_group('Boston Celtics')
Output :
Example #2: Use groupby() function to form groups based on more than one category (i.e.
Use more than one column to perform the splitting).
importing pandas as pd
import pandas as pd
Creating the dataframe
df = pd.read_csv("nba.csv")
First grouping based on "Team"
Within each team we are grouping based on "Position"
gkk = df.groupby(['Team', 'Position'])
Print the first value in each group
gkk.first()
import pandas as pd import numpy as np
Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"
Calculate mean data of gender groups
df.groupby('gender').mean().add_prefix('mean_')
Output:
Example 2: Performing multiple aggregate operations using the aggregate function
( DataFrameGroupBy.agg ) which accepts a string, function or a list of functions.
Import required libraries
import pandas as pd import numpy as np
Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)})
df.index.name = "emp_id"
Calculate min, max, mean and count of salaries
in different departments for males and females
df.groupby(['dept', 'gender'])['salary'].agg(["min", "max", "mean", "count"])
Output:
Example 3: Specifying multiple columns and their corresponding aggregate operations as
follows.
Import required libraries
import pandas as pd import numpy as np
Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"
Calculate mean salaries and min-max age of employees
in different departments for gender groups
Create bins or groups and apply operations
The cut method of Pandas sorts values into bin intervals creating groups or categories.
Aggregation or other functions can then be performed on these groups. Implementation of
this is shown below:
Example : Age is divided into age ranges and the count of observations in the sample data is
calculated.
Import required libraries
import pandas as pd import numpy as np
Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"
Create bin intervals
bins = [20, 30, 45, 60]
Segregate ages into bins of age groups
df['categories'] = pd.cut(df['age'], bins, labels=['Young', 'Middle', 'Old'])
Calculate number of observations in each age category
df['age'].groupby(df['categories']).count()
Output:
Transformation
Transformation is performing a group-specific operation where the individual values are
changed while the shape of the data remains same. We use the transform() function to do so.
Example :
Import required libraries
import pandas as pd import numpy as np
Create a sample dataframe
df = pd.DataFrame({"dept": np.random.choice(["IT", "HR", "Sales", "Production"], size=50), "gender": np.random.choice(["F", "M"], size=50), "age": np.random.randint(22, 60, size=50), "salary": np.random.randint(20000, 90000, size=50)}) df.index.name = "emp_id"
Calculate mean difference by transforming each salary value
df['mean_sal_diff'] = df['salary'].groupby( df['dept']).transform(lambda x: x - x.mean()) df.head()
Output: