Theory and Practice of Data Cleaning: Regular Expressions, Cheat Sheet of Mathematics

A comprehensive introduction to regular expressions, covering both theoretical foundations and practical applications. It explores the theory of regular expressions, including their base elements, regular languages, and regular grammars. The document then delves into the practical aspects of regular expressions, demonstrating their use in data extraction and transformation. It highlights the importance of avoiding false negatives and false positives, emphasizing the need for careful regex construction. The document also discusses the concept of regex golf, highlighting the challenges and importance of finding the right regex for specific tasks. Finally, it emphasizes the division of labor between regex for syntax and code for semantics, advocating for a balanced approach to data cleaning.

Typology: Cheat Sheet

2022/2023

Uploaded on 09/25/2024

julie-cheonsa

Theory and Practice
of Data Cleaning
Regular Expressions: From Theory to Practice

Theory of Regular Expressions

  • Base elements:
    • ∅ (empty set), ε (empty string), and any character a from the alphabet Σ
  • For regular expressions R, S, the following are regular expressions:
    • R | S alternation
    • R S concatenation
    • R* Kleene star
    • (R) parentheses (can be omitted with precedence rules)
  • Regular languages
    • generated by regular (Type-3) grammars
    • recognized (accepted) by a finite automaton
    • expressed by regular expressions
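
As a small illustration of how alternation, concatenation, and Kleene star combine, here is a minimal Python sketch. The pattern `(a|b)*abb` and the helper name `in_language` are chosen for this example only and do not come from the slides; `re.fullmatch` is used so that the whole string must belong to the language.

```python
import re

# (a|b)*abb uses all three operators:
#   a|b     alternation
#   (...)*  Kleene star
#   ...abb  concatenation
PATTERN = re.compile(r"(a|b)*abb")

def in_language(s: str) -> bool:
    """Return True if the entire string s is matched by PATTERN."""
    return PATTERN.fullmatch(s) is not None

samples = ["abb", "aabb", "babb", "ab", ""]
results = {s: in_language(s) for s in samples}
for s, ok in results.items():
    print(repr(s), ok)
```

Strings over {a, b} that end in `abb` are accepted; everything else is rejected.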

Introduction to Regular Expressions (Regex)

Theory & Practice

  • Theory of regular expressions:
    • Brief introduction to where regular expressions come from …
  • Practice of regular expressions:
    • What you need to know to get started with regex in practice!
  • Demonstration of regular expressions

Practice of Regular Expressions

  • Use case: Extract (then transform) data from text
    • pi = - 0.314159265e+
    • e = 0.2718281828E+
  • This regex will do the trick: [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?
    • Character set [ … ] matches any single character in the set
    • Optional element ...? matches 0 or 1 occurrence
    • Range [0-9] matches any single character in this range
    • (Kleene) star ...* matches 0 or more occurrences
    • Dot . matches any character (except line breaks)
    • Escape character \... takes the next character literally (no special meaning)
    • Capturing group (...) groups multiple tokens; captures the group for backreferences
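
The extract-then-transform use case can be sketched in Python roughly as follows. The pattern is the common floating-point regex `[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?`; the sample text, and in particular the exponent digits in it, are assumptions made up for this example.

```python
import re

# Floating-point number: optional sign, optional integer part,
# optional decimal point, digits, optional exponent.
FLOAT_RE = re.compile(r"[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?")

# Hypothetical input; the exponent values are assumed for illustration.
text = "pi = -0.314159265e+1 and e = 0.2718281828E+1"

# Extract: finditer + group(0) returns the full match, not just the
# exponent capture group (which is what findall would return here).
matches = [m.group(0) for m in FLOAT_RE.finditer(text)]

# Transform: convert the extracted strings to Python floats.
values = [float(s) for s in matches]

print(matches)
print(values)
```

Note that `re.findall` would return only the contents of the capture group here, which is why the sketch uses `finditer` with `group(0)`.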

RegEx Matching as a Sport: RegEx Golf

https://xkcd.com/1313/

Division of Labor: RegEx for Syntax; Code for Semantics

  • Getting “the right” regex can be quite a balancing act
    • … making RegEx Golf a real sport
  • Even if there is a (near) exact regex solution, it might be really difficult to get right, debug, maintain, etc.
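
One way to picture this division of labor is date validation: a simple regex checks the shape of the input, and ordinary code checks whether the date actually exists. The following is a minimal sketch under that assumption; the function name `parse_mmddyyyy` is hypothetical, and the MM/DD/YYYY layout is picked only as an example.

```python
import re
from datetime import date

# Syntax only: two digits, slash, two digits, slash, four digits.
DATE_SHAPE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")

def parse_mmddyyyy(text):
    """Return a date for a valid MM/DD/YYYY string, else None."""
    m = DATE_SHAPE.fullmatch(text)
    if m is None:
        return None  # wrong shape: rejected by the regex
    month, day, year = (int(g) for g in m.groups())
    try:
        # Semantics: month range, days per month, leap years.
        return date(year, month, day)
    except ValueError:
        return None  # right shape, but not a real calendar date

print(parse_mmddyyyy("02/29/2024"))
print(parse_mmddyyyy("02/30/2024"))
print(parse_mmddyyyy("2/3/2024"))
```

Encoding the leap-year and days-per-month rules in the regex itself is possible in principle, but the result would be much harder to read and maintain than the two-step version.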

Character Classes

  • . match any character except newline
  • \w \d \s match a word, digit, whitespace character, respectively
  • \W \D \S match a non-word, non-digit, non-whitespace character
  • [abc] any of a, b, or c
  • [^abc] match a character other than a, b, or c
  • [a-g] match any character in the range a through g
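
The character classes listed above can be tried out with Python's `re.findall`. The sample strings are made up for this sketch.

```python
import re

sample = "Room 42b, floor_3!"

digits = re.findall(r"\d", sample)     # digit characters
words = re.findall(r"\w+", sample)     # runs of word characters
non_word = re.findall(r"\W", sample)   # non-word characters
a_to_g = re.findall(r"[a-g]", sample)  # lowercase letters a through g

# Negated set: every character of "cabbage" except a, b, or c.
not_abc = re.findall(r"[^abc]", "cabbage")

# Dot matches any character except a newline.
dots = re.findall(r".", "a\nb")

for name, value in [("digits", digits), ("words", words),
                    ("non_word", non_word), ("a_to_g", a_to_g),
                    ("not_abc", not_abc), ("dots", dots)]:
    print(name, value)
```

In Python, `\w` also matches the underscore, which is why `floor_3` comes back as a single word.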

Anchors

  • ^abc match abc at the start of the string
  • abc$ match abc at the end of the string
  • xyz\b match xyz at a word boundary
  • xyz\B match xyz if not at a word boundary
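
A short Python sketch of the anchors above, using `re.search` on made-up strings. Each entry records whether the pattern was found.

```python
import re

def found(pattern: str, text: str) -> bool:
    """Return True if pattern matches somewhere in text."""
    return re.search(pattern, text) is not None

checks = {
    "start_hit": found(r"^abc", "abcdef"),    # abc at the start
    "start_miss": found(r"^abc", "xabc"),     # abc not at the start
    "end_hit": found(r"abc$", "xyzabc"),      # abc at the end
    "end_miss": found(r"abc$", "abcx"),       # abc not at the end
    "b_hit": found(r"xyz\b", "xyz abc"),      # xyz followed by a space
    "b_miss": found(r"xyz\b", "xyzabc"),      # xyz followed by a letter
    "B_hit": found(r"xyz\B", "xyzabc"),       # no boundary after xyz
    "B_miss": found(r"xyz\B", "xyz abc"),     # boundary after xyz
}

for name, ok in checks.items():
    print(name, ok)
```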

Groups

  • ([0-9]+)\s*([a-z]+) two capture groups
  • \1 backreference to group #1
  • \2\1 first group #2, then #1 (simple palindrome)
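
Capture groups and backreferences can be explored in Python as follows. The inputs, and the repeated-word pattern, are illustrative assumptions rather than examples from the slides.

```python
import re

# Two capture groups: a number, optional whitespace, a lowercase word.
m = re.search(r"([0-9]+)\s*([a-z]+)", "order: 12 apples")
quantity, item = m.group(1), m.group(2)

# \1 refers back to whatever group 1 matched: find a doubled word.
doubled = re.search(r"\b(\w+)\s+\1\b", "fix the the typo")
doubled_text = doubled.group(0)

# \2\1 after two groups: a four-character palindrome such as "abba".
is_abba = re.search(r"(\w)(\w)\2\1", "abba") is not None
is_abab = re.search(r"(\w)(\w)\2\1", "abab") is not None

print(quantity, item)
print(doubled_text)
print(is_abba, is_abab)
```

A backreference matches the same text the group captured, not the same pattern, which is what makes the palindrome check work.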

Using Groups for Transformations

  • Groups and backreferences are often used in transformations
  • (\d{2})/(\d{2})/(\d{4}) three capture groups for MM/DD/YYYY
  • $3-$1-$2 insert captured results as: YYYY-MM-DD
  • Use for example in Python, OpenRefine, …
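
In Python, this transformation could look like the sketch below. Python's `re.sub` writes backreferences in the replacement as `\3` (or `\g<3>`) rather than `$3`; the helper name `to_iso` and the sample sentence are assumptions for this example.

```python
import re

# Three capture groups: month, day, year.
DATE_RE = re.compile(r"(\d{2})/(\d{2})/(\d{4})")

def to_iso(text: str) -> str:
    """Rewrite every MM/DD/YYYY date in text as YYYY-MM-DD."""
    # \3-\1-\2 is Python's spelling of $3-$1-$2.
    return DATE_RE.sub(r"\3-\1-\2", text)

before = "Uploaded on 09/25/2024, revised 10/01/2024"
after = to_iso(before)
print(after)
```

This only rearranges the digits; it does not check that the result is a real calendar date.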