

Some info, notes and slides for csci 443
Due: Jan 30 at 11:00 pm.
The purpose of this homework is to get used to using the tools.
Set up an account with github.com (https://github.com) and check out the class repository. The repository contains an exported Databricks Notebook.
The class repository is at
https://github.com/dosirrah/CSCI443_25S_AdvancedDataScience
Figure out how to clone a repository from the command line, from a git client, or using PyCharm.
If you like to develop Python locally rather than fully in a notebook, I suggest using PyCharm from JetBrains. If you have used IntelliJ then PyCharm should be familiar, although PyCharm is optimized for use with Python. If you do not already have PyCharm installed, go to https://www.jetbrains.com/community/education/#students and apply for a “Free Education License.” Do not download the trial. You can get a free full license for educational purposes if you apply using a .edu email address to set up your account. On that page, scroll down to “Apply Now” and fill out the form. You will get an email almost immediately with instructions on what to do.
You can link PyCharm to your GitHub account allowing you to clone repositories to your local system. Once you have cloned the repository, locate the file hw1/Hello World Notebook.dbc
There is nothing to turn in for this problem. It is just a step toward completion of the next problems.
Set up a Databricks account at https://community.cloud.databricks.com/login.html
As with PyCharm, avoid using the trial of the full version and instead use the free community edition. There are some slides outlining how to sign up in the Lecture 1 slides.
Once you have an account, upload the “Hello World Notebook.dbc” obtained from GitHub to Databricks. From the notebook interface, select “Run all”. This will ask you to attach to a cluster. You will have to create a new cluster.
Create an account with Kaggle. Kaggle is a great source for small datasets used in data science competitions. Many of the datasets have associated forums with discussion on how to work with the data.
Download the Titanic training dataset train.csv from Kaggle at https://www.kaggle.com/competitions/titanic/data#
From the Databricks Notebook, select “File -> Upload data to DBFS…” DBFS stands for Databricks File System and is an abstraction on top of other file systems.
When I perform this upload, by default the files are placed in
/FileStore/shared_uploads/harrison@cs.olemiss.edu/
You can confirm that an upload is successful from within the Notebook by creating a Python cell and running the following:
display(dbutils.fs.ls("/FileStore/shared_uploads/harrison@cs.olemiss.edu/"))
Replace harrison@cs.olemiss.edu with the directory from your own upload path. In my Databricks Notebook I see
dbfs:/FileStore/shared_uploads/harrison@cs.olemiss.edu/train.csv train.csv 61194 1706306094000
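The same check can be made in code rather than by reading the listing. The sketch below assumes each entry returned by dbutils.fs.ls exposes path, name, size, and modificationTime attributes, matching the four columns shown above. The helper name find_upload and the stand-in FileInfo tuple are hypothetical and exist only so the example runs outside Databricks.

```python
from collections import namedtuple


def find_upload(entries, filename):
    """Return the first listing entry named `filename`, or None if absent."""
    return next((entry for entry in entries if entry.name == filename), None)


# Hypothetical stand-in for one dbutils.fs.ls entry, filled with the values
# from the listing shown in the handout.
FileInfo = namedtuple("FileInfo", "path name size modificationTime")
sample = [
    FileInfo(
        "dbfs:/FileStore/shared_uploads/harrison@cs.olemiss.edu/train.csv",
        "train.csv",
        61194,
        1706306094000,
    )
]

match = find_upload(sample, "train.csv")
print(match.size if match else "train.csv not found")
```

Inside a Databricks notebook, the sample list would be replaced by dbutils.fs.ls("/FileStore/shared_uploads/<your directory>/").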
Extend the “Hello World” notebook from within Databricks to load train.csv into a DataFrame. Output the first 10 rows of the DataFrame.
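One possible starting point is sketched below, not a required solution. It assumes the file was uploaded to the example directory shown earlier (substitute your own) and relies on the spark session and display function that Databricks notebooks predefine. The helper name load_csv is hypothetical.

```python
# Assumption: this is the example upload directory from the handout;
# replace it with the directory your own upload landed in.
UPLOAD_DIR = "/FileStore/shared_uploads/harrison@cs.olemiss.edu/"


def load_csv(spark, path):
    """Read a CSV file with a header row into a Spark DataFrame.

    inferSchema=True asks Spark to guess column types (numeric vs. string)
    instead of reading every column as a string.
    """
    return spark.read.csv(path, header=True, inferSchema=True)


# `spark` and `display` are predefined only inside a Databricks notebook,
# so this part is skipped when the code runs anywhere else.
if "spark" in globals() and "display" in globals():
    df = load_csv(spark, UPLOAD_DIR + "train.csv")
    display(df.limit(10))  # render the first 10 rows as a table
```

In a notebook, df.show(10) is a plain-text alternative to display(df.limit(10)).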
Starting from the “Hello World” notebook from within Databricks, use matplotlib to plot a histogram of the ages of passengers on the Titanic, with bins each spanning 5 years of age.
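The 5-year bins can be expressed as explicit bin edges passed to matplotlib, as in the rough sketch below. It assumes the ages have already been collected into a Python list with missing values dropped. The function names five_year_bin_edges and plot_age_histogram are hypothetical, and this is one way to approach the problem rather than the expected answer.

```python
def five_year_bin_edges(ages, width=5):
    """Return bin edges 0, 5, 10, ... that cover every age in `ages`."""
    # Round the largest age up to the next multiple of `width` so the
    # oldest passenger falls inside the last bin.
    upper = (int(max(ages)) // width + 1) * width
    return list(range(0, upper + 1, width))


def plot_age_histogram(ages, width=5):
    """Plot a histogram of `ages` with bins that each span `width` years."""
    # Imported here so the bin helper above can be used without matplotlib.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.hist(ages, bins=five_year_bin_edges(ages, width), edgecolor="black")
    ax.set_xlabel("Age (years)")
    ax.set_ylabel("Number of passengers")
    ax.set_title("Ages of Titanic passengers")
    return fig


print(five_year_bin_edges([22, 38, 0.42, 80]))
```

With a Spark DataFrame df loaded from train.csv, the list of ages could be built with [row["Age"] for row in df.select("Age").dropna().collect()], assuming the column is named Age as in the Kaggle file. Depending on the Databricks runtime, the figure may render automatically or may need display(fig).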