- Lab 0: Tools
- Lab 1: Tidy Data
- Lab 2: Data Engineering
- Lab 3: Geo-Visualisation
- Lab 4: Networks and Spatial Weights
- Lab 5: Linear Regression
- Lab 6: Clustering
- Lab 7: Points
DOWNLOAD ALL THE LABS:
You can download all the labs, associated datasets, html files describing the output of the labs, and homework notebooks using the links below.
Lab 0 - Tools
IMPORTANT: This is a supplementary notebook that covers the basics of the tools we will use in the course; it does not explain anything directly related to Urban Data Science.
Students are encouraged to read it once before getting started with the other notebooks and then keep it as a reference throughout the rest of the course. There are some basic Python operations in there that act as a refresher, practice or learning material.
If you want to explore the contents of this tutorial further on your own, the following pointers are good places to start:
- [Video] “Python as Super Glue for the Modern Scientific Workflow”, keynote speech by Prof. Joshua Bloom from UC Berkeley about how Python is used in Astronomy research.
- Gallery of interesting notebooks: a wealth of examples of Jupyter (formerly called IPython) notebooks.
- (Downey, 2012): very good general introduction to Python as a programming language and to the algorithmic way of thinking. The book is freely available in HTML and PDF.
- Downey, A. (2012). Think Python - How to Think Like a Computer Scientist. Green Tea Press.
Lab 1 - Tidy Data
This session uses the “Census socio-demographics” dataset of Liverpool, United Kingdom in two parts. The dataset for this lab is provided in the zipped lab files above.
- Table of LSOA areas in Liverpool with population counts by World region. The table is derived from the CDRC Census data pack.
- Collection of socio-demographic characteristics from the 2011 Census for the city of Liverpool.
- A good extension of this session is (Wickham, 2014). The paper is published under an Open Access license, so it is freely available on the journal’s site, and the author has also made available a public repository with the data and code used in the paper. Keep in mind that the paper, and the code that comes with it, are based on R, not Python.
- [Visualization] Python library
- (McKinney, 2012): An excellent introduction to Python for data analysis, with plenty of examples and code snippets (Publisher’s page link).
- NY Times article about the importance of cleaning data.
- Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10).
- McKinney, W. (2012). Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O’Reilly Media, Inc.
Lab 2 - Data Engineering
This session uses two datasets which are provided in the zipped lab files above.
- A dataset about wines from different countries, downloaded from Kaggle.
- A dataset scraped and collected from Goodreads.
Lab 3 - Geo-Visualisation
The homework exercises are embedded within the lab files themselves. Complete the exercises as you work through and understand the rest of the code. There are two files for this lab,
eda. Since this geocomputational lab is not as straightforward as other Python code, a solution set is also provided for the questions indicated in the lab.
This session uses multiple datasets which are provided in the zipped lab files above.
- A “Census socio-demographics” dataset as well as the Ordnance Survey (OS) Geodata Pack.
- An “Index of Multiple Deprivation” dataset as well as the Ordnance Survey (OS) Geodata Pack. Scores, ranks, and components of the 2015 Index of Multiple Deprivation (IMD). Source: CDRC’s English Indices of Deprivation 2015 Geodata Pack for the city of Liverpool (UK).
- Additionally, you will need the raster file for the basemap of Liverpool. This has been assembled by Dani Arribas-Bel from the OS VectorMap District (Backdrop Raster), and it is licensed as OpenData.
- Simple datasets on mystery are also provided.
- A good introduction to the geopandas project is provided by Kelsey Jordahl, the project’s founder, in this set of slides from a 2015 talk and the companion repository.
- An additional great resource is this 4-hour workshop by Carson Farmer.
Lab 4 - Networks and Spatial Weights
This session uses multiple datasets which are all provided in the zipped lab files above.
- An “Index of Multiple Deprivation” dataset used in previous labs.
- A Brexit dataset.
This dataset contains the results of the 2016 referendum vote to leave the EU, at the local authority level. All the necessary data have been assembled for convenience in a single file that contains geographic information about each local authority in England, Wales and Scotland, as well as the vote attributes. The file is in the modern geospatial format GeoPackage, which has several advantages over the more traditional shapefile (chief among them, that it needs only a single file instead of several).
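The single-file point can be made concrete: a GeoPackage is an SQLite database, and the layers it holds are registered in a table called gpkg_contents. The sketch below is not part of the lab materials; it uses only Python's standard library to list the layers in such a file, and the function name list_gpkg_layers is a hypothetical helper introduced for illustration.

```python
import sqlite3


def list_gpkg_layers(path):
    """Return (table_name, data_type) pairs registered in a GeoPackage.

    Hypothetical helper for illustration: it relies only on the fact that
    a GeoPackage is an SQLite file with a gpkg_contents table.
    """
    conn = sqlite3.connect(path)
    try:
        rows = conn.execute(
            "SELECT table_name, data_type FROM gpkg_contents "
            "ORDER BY table_name"
        ).fetchall()
    finally:
        conn.close()
    return rows
```

Calling list_gpkg_layers on the path of the downloaded file would return one entry per layer. For actual analysis, a library such as geopandas is the more usual way to load the geometries and attributes together.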
The source data used to compile the file linked above include:
- Electoral Commission data on the EU referendum results
- Local Authority District boundaries
Watch the section on spatial weights of the SciPy'16 tutorial on Geographic Data Science with PySAL.
[YouTube - Min 1:02:55 to 1:25:40]
Watch the section on ESDA of the SciPy'16 tutorial on Geographic Data Science with PySAL.
[YouTube - Min 1:25:40 to 1:49:20] [Online materials]
Lab 5 - Linear Regression
Since this lab is not as straightforward as other Python code, a solution set is provided for the questions indicated in the lab and homework. It is better to attempt them yourself and get feedback from your peers before looking at the solutions.
This session uses multiple data files.
- A dataset of Premier League statistics, downloaded from Kaggle.
- A Boston housing dataset and its training set companion.
- An IMDB cast dataset.
- A car dataset.
- A cab dataset.
These datasets are not particularly relevant to global urban issues, but they are simple to work with for small regression exercises.
Lab 6 - Clustering
This session uses the “AirBnb listing for Inner London - MSOA level” dataset.
This dataset contains information for AirBnb properties for the area of Inner London aggregated at the MSOA level. It has been prepared by Dani Arribas-Bel using as the original source the full listing of AirBnb locations for London provided by Inside AirBnb. Same as the source, the dataset is released under a CC0 1.0 Universal License.
For every polygon, the following variables are provided:
id: MSOA unique identifier.
accommodates: average property capacity in the MSOA.
bathrooms: average number of bathrooms in the properties within the MSOA.
bedrooms: average number of bedrooms in the properties within the MSOA.
beds: average number of beds in the properties within the MSOA.
number_of_reviews: average number of reviews received by the properties within the MSOA.
reviews_per_month: average number of reviews per month received by the properties within the MSOA.
review_scores_ratings: average rating score received by the properties within the MSOA.
review_scores_accuracy: average accuracy score received by the properties within the MSOA.
review_scores_cleanliness: average cleanliness score received by the properties within the MSOA.
review_scores_checkin: average checkin score received by the properties within the MSOA.
review_scores_communication: average communication score received by the properties within the MSOA.
review_scores_location: average location score received by the properties within the MSOA.
review_scores_value: average value score received by the properties within the MSOA.
property_count: total number of AirBnb properties listed within the MSOA.
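To illustrate what "aggregated at the MSOA level" means for the variables above, here is a minimal sketch of how per-listing records could be reduced to per-area averages plus a property count. It is not the procedure used to prepare the dataset: the function aggregate_by_area, the msoa_id key and the sample records are all made up for illustration, and only the standard library is used.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_area(listings, area_key="msoa_id", fields=("bedrooms", "beds")):
    """Average the chosen fields per area and count the listings in each."""
    # Group the individual listings by the area they fall in.
    groups = defaultdict(list)
    for row in listings:
        groups[row[area_key]].append(row)

    # Reduce each group to averages plus a total count.
    summary = {}
    for area, rows in groups.items():
        stats = {field: mean(r[field] for r in rows) for field in fields}
        stats["property_count"] = len(rows)
        summary[area] = stats
    return summary


# Made-up listings, for illustration only.
listings = [
    {"msoa_id": "A", "bedrooms": 1, "beds": 1},
    {"msoa_id": "A", "bedrooms": 3, "beds": 4},
    {"msoa_id": "B", "bedrooms": 2, "beds": 2},
]
summary = aggregate_by_area(listings)
```

With a tabular library the same idea is usually expressed as a group-by followed by a mean and a count.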
The lab also uses an additional file that contains the boundary lines of the London boroughs provided in the data folder as well.
- Watch the section on spatial clustering of the SciPy'16 tutorial on Geographic Data Science with PySAL.
[YouTube - Min 2:30:00 to 3:02:00]
- Although a bit more advanced, the documentation for scikit-learn, a world-class Python library for machine learning, is excellent and includes many examples that cover the entire functionality set of the library.
Lab 7 - Points
This lab uses a sample of geo-referenced locations of photographs taken in Tokyo.
- Watch the section on points of the SciPy'16 tutorial on Geographic Data Science with PySAL.
[YouTube - Min 1:50:00 to 2:30:00]
- A very good resource for kernel density estimation in Python is provided in this blog post by Jake Vanderplas.
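As a rough idea of what kernel density estimation does with point data such as photograph locations, the sketch below places an isotropic Gaussian "bump" on every point and averages them. It is a bare-bones illustration in pure Python, not the approach taken in the lab or in the blog post; the function gaussian_kde_2d, the bandwidth value and the coordinates are assumptions made up for the example.

```python
import math


def gaussian_kde_2d(points, bandwidth):
    """Return a function that estimates the density at (x, y).

    Uses an isotropic Gaussian kernel with the given bandwidth:
    f(x, y) = 1 / (n * 2 * pi * h^2) * sum_i exp(-d_i^2 / (2 * h^2)),
    where d_i is the distance from (x, y) to the i-th point.
    """
    n = len(points)
    norm = 1.0 / (n * 2.0 * math.pi * bandwidth ** 2)

    def density(x, y):
        total = 0.0
        for px, py in points:
            d2 = (x - px) ** 2 + (y - py) ** 2
            total += math.exp(-d2 / (2.0 * bandwidth ** 2))
        return norm * total

    return density


# Made-up coordinates: three points close together and one far away.
photos = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
density = gaussian_kde_2d(photos, bandwidth=0.5)
```

Evaluating density on a regular grid of (x, y) values gives the surface that is typically drawn as a heat map; the estimate is higher near the cluster of three points than near the isolated one, and the bandwidth controls how smooth the surface is.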