F M A

Data Wrangling

with pandas

Cheat Sheet

http://pandas.pydata.org

Syntax –Creating DataFrames

Tidy Data –A foundation for wrangling in pandas

In a tidy

data set:

F M A

Each variable is saved

in its own column

Each observation is

saved in its own row

Tidy data complements pandas’s vectorized

operations. pandas will automatically preserve

observations as you manipulate variables. No

other format works as intuitively with pandas.

Reshaping Data –Change the layout of a data set

M A F

M A

pd.melt(df)

Gather columns into rows. df.pivot(columns='var', values='val')

Spread rows into columns.

pd.concat([df1,df2])

Append rows of DataFrames

pd.concat([df1,df2], axis=1)

Append columns of DataFrames

df=df.assign(a=[1:3], b=[4:6])

Add new columns to DataFrame.

df=df.sort_values('mpg')

Order rows by values of a column (low to high).

df=df.sort_values('mpg',

ascending=False)

Order rows by values of a column (high to low).

df=df.rename(columns=

{'y':'year'})

Rename the columns of a DataFrame

df=df.sort_index()

Sort the index of a DataFrame

Subset Observations (Rows) Subset Variables (Columns)

df = pd.DataFrame(

{"a" : [4 ,5, 6],

"b" : [7, 8, 9],

"c" : [10, 11, 12]},

index = [1, 2, 3])

Specify values for each column.

df = pd.DataFrame(

[[4, 7, 10],

[5, 8, 11],

[6, 9, 12]],

index=[1, 2, 3],

columns=['a', 'b', 'c'])

Specify values for each row.

df = pd.DataFrame(

{"a" : [4 ,5, 6],

"b" : [7, 8, 9],

"c" : [10, 11, 12]},

index = pd.MultiIndex.from_tuples(

[('d',1),('d',2),('e',2)],

names=['n','v'])))

Create DataFrame with a MultiIndex

Method Chaining

Most pandas methods return a DataFrame so that

another pandas method can be applied to the

result. This improves readability of code.

df = (pd.melt(df)

.rename(columns={

'variable' : 'var',

'value' : 'val'})

.query('val >= 200')

)

df[df.Length > 7]

Extract rows that meet logical criteria.

df.drop_duplicates()

Remove duplicate rows (only considers columns).

df.sample(frac=0.5)

Randomly select fraction of rows.

df.sample(n=10)

Randomly select n rows.

df.iloc[10:20]

Select rows by position.

df.nlargest(10, 'value')

Select and order top n entries.

Logic in Python (and pandas)

Less than

Not equal to

Greater than

df.column.isin

(values)

Group membership

Equals

pd.isnull

(obj)

NaN

Less than or equals

Pd.notnull

(obj)

Is not

NaN

Greater than or equals

&,|,~,^,

df.any(),df.all()

Logical

and, or, not, xor, any, all

http://pandas.pydata.org/ This cheat sheet inspired by RstudioData Wrangling Cheatsheet (https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

df[['width','length','species']]

Select multiple columns with specific names.

df['width'] or df.width

Select single column with specific name.

df.filter(regex='regex')

Select columns whose name matches regular expression regex.

df.loc[:,'x2':'x4']

Select all columns between x2 and x4 (inclusive).

df.iloc[:,[1,2,5]]

Select columns in positions 1, 2 and 5 (first column is 0).

df.loc[df['a'] > 10, ['a','c']]

Select rows meeting logical condition, and only the specific columns .

regex (Regular Expressions) Examples

\.'

Matches strings containing

a period '.'

'Length$'

Matches strings ending with word 'Length'

'^Sepal'

Matches strings beginning with the word 'Sepal'

'^x[1

-5]$'

Matches strings beginning with 'x' and ending with 1,2,3,4,5

''^(?!Species$).*'

Matches strings except

the string 'Species'

Data Wrangling with Pandas Cheat Sheet, Cheat Sheet of Computer Programming

Related documents

Partial preview of the text

Download Data Wrangling with Pandas Cheat Sheet and more Cheat Sheet Computer Programming in PDF only on Docsity!

Data Wrangling

with pandas

Cheat Sheet

http://pandas.pydata.org

Syntax – Creating DataFrames

Tidy Data – A foundation for wrangling in pandas

In a tidy

data set:

Each variable is saved

in its own column

Each observation is

saved in its own row

Tidy data complements pandas’s vectorized

operations. pandas will automatically preserve

observations as you manipulate variables. No

other format works as intuitively with pandas.

Reshaping Data – Change the layout of a data set

M A

Subset Observations (Rows) Subset Variables (Columns)

Method Chaining

Summarize Data Make New Variables Combine Data Sets

of rows in DataFrame.

of distinct values in a column.

adf bdf

ydf zdf