07-11, 12:00–12:30 (Europe/Amsterdam), Else (1.3)
Many of us know scikit-learn for its ability to construct pipelines that can do .fit().predict(). It's an amazing feature for sure. But once you dive into the codebase ... you realise that there is just so much more.
This talk is an attempt to demonstrate some features of scikit-learn, and its ecosystem, that are less common but deserve to be in the spotlight.
In particular, I hope to discuss these things that scikit-learn can do:
- sparse datasets and models
- larger than memory datasets
- sample weight techniques
- image classification via embeddings
- tabular embeddings/vectorisation
- data deduplication
- pipeline caching (a small sketch of this one follows the list)
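As a taste of one of the topics above, here is a minimal sketch of pipeline caching using the standard `memory` argument of scikit-learn's `Pipeline`. The dataset, step choices, and cache directory are purely illustrative assumptions, not necessarily what the talk itself will use.

```python
from tempfile import mkdtemp

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Illustrative toy dataset, just so there is something to fit on.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# The `memory` argument tells the pipeline to cache fitted transformers
# (here in a temporary directory), so repeated fits with an unchanged
# PCA step can skip recomputation, which pays off during grid search.
cache_dir = mkdtemp()
pipe = Pipeline(
    steps=[("pca", PCA(n_components=10)), ("clf", LogisticRegression())],
    memory=cache_dir,
)
pipe.fit(X, y)
print(pipe.predict(X[:5]))
```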
If time allows I may also touch on extra topics.
There may be an opportunity to live-code some of these examples; if live coding is not possible, it would be preferable to know this ahead of time.
It would really help to be somewhat familiar with scikit-learn.
Vincent is a senior data professional, and recovering consultant, who has worked as an engineer, researcher, team lead, and educator in the past. He is especially interested in understanding algorithmic systems so that one may prevent failure. As such, he prefers simpler solutions that scale and worries more about data quality than the number of tensors we throw at a problem. He's also well known for creating calmcode as well as a dozen or so open-source packages.
He's currently employed at probabl where he works together with scikit-learn core maintainers to improve the ecosystem of tooling.