nibble.ai weekly - Issue #25: Continuous Delivery for ML, Moonshots vs boring, Reproducibility crisis...

nibble.ai dispatch

September 19 · Issue #25
Curated essays about the future of Data Science. Production Data Science and learning resources for continuous learning. Covers Data Science, Data Engineering, MLOps & DataOps. Curated by people at https://nibble.ai/

This week’s top pick is Martin Fowler’s Continuous Delivery for Machine Learning about automating the end-to-end lifecycle of Machine Learning applications.
Machine Learning applications are becoming popular in our industry; however, the process for developing, deploying, and continuously improving them is more complex than for more traditional software, such as a web service or a mobile application. They are subject to change along three axes: the code itself, the model, and the data. Their behaviour is often complex and hard to predict, and they are harder to test, harder to explain, and harder to improve. Continuous Delivery for Machine Learning (CD4ML) is the discipline of bringing Continuous Delivery principles and practices to Machine Learning applications.
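To make the three-way change concrete, here is a minimal sketch (not taken from Fowler's article) that derives a release identifier from code, model, and data together, so a change to any one of them produces a new release. The names `fingerprint` and `release_id`, and the byte strings standing in for real artifacts, are hypothetical.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """Return a short SHA-256 digest of one artifact's bytes."""
    return hashlib.sha256(payload).hexdigest()[:12]


def release_id(code: bytes, model: bytes, data: bytes) -> str:
    """Combine the three per-artifact fingerprints into one release identifier."""
    parts = [fingerprint(code), fingerprint(model), fingerprint(data)]
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:12]


# Hypothetical stand-ins for a source tree, a serialized model, and a dataset.
baseline = release_id(b"train.py v1", b"weights v1", b"rows v1")
new_data = release_id(b"train.py v1", b"weights v1", b"rows v2")

# Only the data changed, yet the release identifier is different.
print(baseline != new_data)
```

In a real pipeline the inputs would more likely be a commit hash, a model registry version, and a dataset snapshot ID, but the idea is the same: a release is pinned by all three at once.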

Forget moonshots and think boring?
If enterprises ever want to see the benefits of AI, they must embrace the mundane.
I’m sure the moonshots are possible if you’re a tech giant and you have billions of dollars to spend on experimenting. However, even Jeff Bezos admitted the bulk of their AI investments are ‘quietly but meaningfully improving core operations.’
Artificial Intelligence Confronts a ‘Reproducibility’ Crisis
Machine-learning systems are black boxes even to the researchers that build them. That makes it hard for others to assess the results.
Pineau is trying to change the standards. She’s the reproducibility chair for NeurIPS, a premier artificial intelligence conference. Under her watch, the conference now asks researchers to submit a “reproducibility checklist” including items often omitted from papers, like the number of models trained before the “best” one was selected, the computing power used, and links to code and datasets.
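As a rough illustration, the items mentioned above could be tracked alongside an experiment like this. It is a sketch only, not the actual NeurIPS checklist format: the class `ReproChecklist`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReproChecklist:
    """Hypothetical record of the items the blurb above lists."""
    models_trained: Optional[int] = None   # models tried before picking the "best"
    compute_used: Optional[str] = None     # free-text description of compute
    code_url: Optional[str] = None         # link to the code
    dataset_url: Optional[str] = None      # link to the dataset


def missing_items(checklist: ReproChecklist) -> List[str]:
    """List the checklist fields that have not been filled in yet."""
    return [f.name for f in fields(checklist) if getattr(checklist, f.name) is None]


# Made-up values for a draft submission with no links yet.
draft = ReproChecklist(models_trained=40, compute_used="8 GPU-days")
print(missing_items(draft))  # ['code_url', 'dataset_url']
```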
The rise of the term MLOps
Properly Operationalized Machine Learning is the New Holy Grail
While the end-to-end ML lifecycle has always been pitched as an actual “cycle”, to date there has been limited success in actually managing this end-to-end process at enterprise level scale.
Libraries
sk-dist
Distributed scikit-learn meta-estimators in PySpark.
Learning resources
Python Patterns
A series of high-quality blog posts about Python patterns.
Your Guide to the CPython Source Code
CPython, the most popular Python runtime, is written in human-readable C and Python code. This tutorial will walk you through the CPython source code.
Apache Spark: core concepts, architecture and internals
This post covers core concepts of Apache Spark such as RDDs, the DAG, the execution workflow, how tasks are grouped into stages, and the shuffle implementation. It also describes the architecture and main components of the Spark Driver.
ML Tutorial: Gaussian Processes
From the Basics to the State-of-the-Art
A Machine Learning Tutorial by Richard Turner at Imperial College London.
Data Engineering Principles
Build frameworks not pipelines
An interesting talk about applied data engineering principles that can be used to build robust, easily manageable data pipelines and data products.
We hope you find something interesting here! Feel free to reach out to me with suggestions (or just to say hi): [email protected]
Florent