I get asked this a lot by students, so I decided to make a little list here that I can add to and point people towards.
https://github.com/awesomedata/awesome-public-datasets
https://www.kaggle.com/datasets
https://datasetsearch.research.google.com/
https://cloud.google.com/bigquery/public-data
https://cloud.google.com/public-datasets
https://registry.opendata.aws/
https://data.world/data
I find myself having to refer to this often enough that I decided to make a little infographic I can more easily link to 🙂
The Setup: This is a little one I was surprised to see. Recently I had a need to do some pretty basic feature engineering on a pandas DataFrame prior to training some models. Basically I needed to take differences of each column, apply some smoothing, and then add a number of lagged columns for each … Continue reading Numpy Feature Engineering – 2x Speed Up Over Pandas!
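For anyone who wants a feel for the three steps mentioned above (difference, smooth, lag) before clicking through, here is a rough NumPy-only sketch. It is not the code from the post: the function name `make_features`, the parameters `smooth_n` and `n_lags`, and the choice of a trailing mean as the smoother are all assumptions made for illustration.

```python
import numpy as np


def make_features(data: np.ndarray, smooth_n: int = 3, n_lags: int = 2) -> np.ndarray:
    """Difference each column, smooth it, then append lagged copies.

    Hypothetical helper, not taken from the post. `data` is a 2-D array of
    shape (rows, columns). Rows without a full history are dropped, so the
    output has fewer rows than the input.
    """
    # 1. First differences of each column: shape (rows - 1, columns).
    diffs = np.diff(data, axis=0)

    # 2. Trailing mean over smooth_n rows (an assumed smoother), computed
    #    from a cumulative sum with a leading row of zeros.
    zeros = np.zeros((1, diffs.shape[1]))
    csum = np.cumsum(np.vstack([zeros, diffs]), axis=0)
    smoothed = (csum[smooth_n:] - csum[:-smooth_n]) / smooth_n

    # 3. Place the current values and lags 1..n_lags side by side.
    n_out = smoothed.shape[0] - n_lags
    blocks = [
        smoothed[n_lags - lag : n_lags - lag + n_out]
        for lag in range(n_lags + 1)
    ]
    return np.hstack(blocks)


# Tiny usage example on a single column of squares.
example = np.array([[0.0], [1.0], [4.0], [9.0], [16.0], [25.0]])
features = make_features(example, smooth_n=2, n_lags=1)
```

The output columns are ordered as all lag-0 columns first, then all lag-1 columns, and so on. Whether this layout matches what the post benchmarks is not something this excerpt can confirm.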
An actual market basket I found in my Google Photos. tl;dr: if you find yourself doing some association rule mining using mlxtend but finding it a bit slow, then check out PyFIM. Here is a Colab I made to get you started. I have recently been looking to do some market basket analysis ("Association … Continue reading Market basket analysis in Python
Don't mind if I do. Here is a thing I helped build at work that I'm fairly happy with: https://www.linkedin.com/posts/andrewm4894_netdata-introducing-our-first-netdata-cloud-activity-6712008465574887424-SlIr Now, onto the next thing!
I've recently been playing around with some time series clustering tasks and came across the tslearn library. I was interested in seeing how easy it would be to get up and running with some of the clustering functionality that is already built into tslearn. It turns out it was quite easy and straightforward, perfect blog post … Continue reading Time series clustering with tslearn
I've been doing some work that necessitated using the same statistical test from scipy lots of times on a fairly wide pandas DataFrame with lots of columns. I spent a bit too much time googling around for the most efficient ways to do this, and even more time rewriting things in various ways before realizing I … Continue reading Premature Optimization
Terraform is magic. I may be a little late to the game on this one, and I'm sure it has its fair share of haters (I've seen some have a love-hate relationship with it; maybe I'm still in my honeymoon period). But from my point of view as a Data Scientist/ML Engineer playing around … Continue reading Terraform is Magic + r/MachineLearning Links
I was looking around a bit and could not really find any datasets behind the daily updates from the Irish government that get posted here. In particular, I was thinking the breakout tables of numbers by different dimensions might be of use for anyone looking to analyse the data. So here is a Python … Continue reading Ireland Covid19 Data
This took me waaay too long to work out today, and I was thinking it could make a nice little interview coding type question (which I'd probably fail). Suppose you have 10,000 rows of data and need to continually train and retrain a model, training on at most 1,000 rows at a time and retraining … Continue reading A little brainteaser (or i’m an idiot)
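To make the shape of the brainteaser a bit more concrete, here is one way the index bookkeeping could be sketched. The excerpt cuts off before the retraining rule, so `retrain_every` (and the value 500 used in the example) is purely an assumption, as is the choice to wait for a full window before training the first model. The function name `training_windows` is hypothetical and this is not the solution from the post.

```python
from typing import Iterator, Tuple


def training_windows(
    n_rows: int, max_train: int, retrain_every: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (train_start, train_end, predict_end) row indices, end-exclusive.

    Each model trains on rows [train_start, train_end), capped at max_train
    rows, and is then used for rows [train_end, predict_end) until the next
    retrain. `retrain_every` is an assumed parameter; the excerpt does not
    say how often retraining happens.
    """
    if max_train <= 0 or retrain_every <= 0:
        raise ValueError("max_train and retrain_every must be positive")

    # Assumption: the first model is trained once max_train rows exist.
    train_end = max_train
    while train_end < n_rows:
        train_start = max(0, train_end - max_train)
        predict_end = min(n_rows, train_end + retrain_every)
        yield train_start, train_end, predict_end
        train_end = predict_end


# 10,000 rows and a 1,000-row cap come from the excerpt; 500 is illustrative.
windows = list(training_windows(n_rows=10_000, max_train=1_000, retrain_every=500))
```

Each tuple says which rows a model trains on and which rows it covers before being replaced, which is usually the fiddly part to get right.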