CSCI 443: Introduction to Data Science and Numpy, Slides of Computer Science

This lecture introduces the field of data science, outlining its key components and applications. It distinguishes between data scientists, data engineers, and data analysts, highlighting their roles and responsibilities. The lecture also covers the tools used in data science, including Databricks, Apache Spark, and Python. It provides an overview of Databricks notebooks and their functionality, emphasizing the use of Python for data analysis and visualization. The lecture concludes with a homework assignment that involves setting up a Databricks account, creating a notebook, and exploring data frames and visualization techniques.

Typology: Slides

2023/2024

Uploaded on 02/20/2025

ryan-hoffman-3

CSCI 443: LECTURE 1
INTRODUCTION, AND
NUMPY
Professor David Harrison

TODAY: INTRODUCTION

  • Who am I?
  • Syllabus
  • What is data science?
  • What is data engineering?
  • What does this course cover?
  • What tools will we use?
  • Some review of practical statistics
  • Chapter 1 Practical Statistics
    • Up to page 18 in chapter 1.

SYLLABUS

2024 CSCI 443 4

WHAT IS DATA SCIENCE?

Data science encompasses the entire lifecycle of data processing and analysis, including data collection, cleaning, exploration, modeling, and interpretation. Its focus is on extracting insights and knowledge from data and involves developing methods of recording, storing, and analyzing data to effectively extract useful information.
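The lifecycle stages named above can be illustrated with a very small sketch in plain Python. The data set here (hours studied versus exam score) is made up for illustration and is not from the lecture; the variable names are likewise hypothetical, and a real project would typically use NumPy or data frames rather than lists.

```python
from statistics import mean, median

# Collection: hypothetical (hours studied, exam score) records.
# None marks a missing value in the raw data.
raw = [(1, 52), (2, 58), (3, None), (4, 71), (5, 75), (None, 80)]

# Cleaning: drop any record with a missing field.
clean = [(h, s) for h, s in raw if h is not None and s is not None]

# Exploration: simple summary statistics of the cleaned data.
hours = [h for h, _ in clean]
scores = [s for _, s in clean]
summary = {
    "n": len(clean),
    "mean_score": mean(scores),
    "median_score": median(scores),
}

# Modeling: ordinary least-squares line, score = intercept + slope * hours.
h_bar, s_bar = mean(hours), mean(scores)
slope = sum((h - h_bar) * (s - s_bar) for h, s in clean) / sum(
    (h - h_bar) ** 2 for h in hours
)
intercept = s_bar - slope * h_bar

# Interpretation: state the result in terms a stakeholder can use.
print(summary)
print(f"Each extra hour is associated with about {slope:.1f} more points.")
```

With only four clean records this is a toy, but each comment maps to one stage of the lifecycle: collection, cleaning, exploration, modeling, and interpretation.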

WHAT IS DATA SCIENCE?

[Diagram relating the fields: AI, Machine Learning, Deep Learning, Data Science, Big Data, Data Analytics, Visualization]

This class focuses on data science.

DATA SCIENTISTS VS DATA ENGINEERS

  • Typically 2-5 data engineers to each data scientist.
  • A data scientist is often both internally and externally facing.
  • A data scientist interfaces with key stakeholders to:
    • Define a hypothesis, problem, question, …
    • Design metrics
    • Design the experiments to answer the question.
    • Work with data engineers to understand, clean, and analyze data.
  • Data engineers build it:
    • Data collection
    • Data Warehousing / Data Lakes
    • Cloud computing
    • Data pipelines
    • Dashboards and automated reporting
    • Data governance and security.

DATA SCIENTISTS VS DATA ANALYSTS

  • Data scientists are both engineers and statisticians.
    • Work on more complex and abstract tasks, such as developing new analytics methods, predictive models, and machine learning algorithms.
    • Often involved in research and development.
  • Data analysts are skilled in data manipulation and visualization.
    • Often use processes put in place by a data scientist.
    • Often use tools implemented by data engineers.
    • Often support executives and sales.
    • May be externally facing.
    • Often not well versed in engineering.

TOOLS


GITHUB

Example files I create during class will be put on GitHub. The project is at https://github.com/dosirrah/CSCI443_25S_AdvancedDataScience

You will need to create a GitHub account independent of your olemiss accounts. GitHub is free for our purposes. I highly recommend committing any code you create to GitHub.

NO GITLAB…

Last year I used the department GitLab. This semester I will only use GitHub.

DATABRICKS

Community edition is free. It offers a single instance with limited capabilities, but should be adequate for teaching.

DATABRICKS

Community edition is free. You do not need an AWS or Azure account. You do not need to sign up for the 14-day trial.

DATABRICKS

https://community.cloud.databricks.com

Once logged in, you should see options to start a notebook and to import data. Ignore “Upgrade now”.

WHY DATABRICKS?

Databricks provides cluster management and a notebook interface (akin to Jupyter) to Apache Spark. Spark unifies:

  • Batch processing
  • Real-time data processing
  • Stream analytics (trending, dashboards, etc.)
  • Machine learning
  • Interactive SQL
  • Successor and extension to what was traditionally done with Hadoop or other map-reduce systems.
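
For readers who have not seen the map-reduce model that the last bullet refers to, the following is a minimal single-machine sketch in plain Python. It is not Spark or Hadoop code: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample sentences are hypothetical and chosen only to show the three steps (map, group by key, reduce) that such systems distribute across a cluster.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's list of values into a single count."""
    return {
        key: reduce(lambda a, b: a + b, values)
        for key, values in groups.items()
    }


# Hypothetical input; in a real cluster each line could live on a different node.
lines = [
    "spark unifies batch and streaming",
    "hadoop popularized batch map reduce",
    "spark extends map reduce",
]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

The map and reduce steps are independent per line and per key, which is what lets a cluster run them in parallel.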