
Breaking the Jargons - Issue #8

Parul Pandey
A whirlwind tour of some open-source libraries for managing data, building pipelines, visualizing high-dimensional data, and more…

Hi there,
Glad to bring you the eighth edition of the ‘Breaking the Jargons’ newsletter. This month’s content is a bit different, yet still valuable. There are no tutorials, research paper summaries, or tip of the month. For a change, I’m presenting a collection of four useful open-source libraries that could be a great addition to your machine learning stack.
1. Fugue
Fugue is a unified interface for distributed computing that ports Python, Pandas, and SQL code to Spark or Dask. The simplest way to use Fugue is through its transform() function. As an example, say we have an ordinary Python or Pandas function that removes stop words from sentences.
Fugue brings such a function to Spark (or Dask) with one function call. Even on local Spark/Dask, it will parallelize across all available cores.
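Here is a minimal sketch of what the stop-word example could look like. The stop-word list, the `text` column name, and the helper `run_with_fugue` are my own illustrative assumptions, not taken from the Fugue docs. The cleaning function is plain Python and runs without Fugue; the Fugue call itself needs `fugue` and `pandas` installed.

```python
from typing import Any, Dict, List

# A tiny illustrative stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "and"}


def remove_stop_words(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with stop words dropped from the 'text' field."""
    cleaned = []
    for row in rows:
        kept = [w for w in str(row["text"]).split() if w.lower() not in STOP_WORDS]
        cleaned.append({**row, "text": " ".join(kept)})
    return cleaned


def run_with_fugue(rows: List[Dict[str, Any]], engine: Any = None):
    """Hypothetical helper: apply remove_stop_words through Fugue's transform().

    Requires `fugue` and `pandas`. With engine=None Fugue runs locally;
    passing e.g. engine="dask" assumes that backend is installed.
    """
    import pandas as pd
    from fugue import transform

    df = pd.DataFrame(rows)
    # schema="*" declares that the output keeps the same columns as the input.
    return transform(df, remove_stop_words, schema="*", engine=engine)


rows = [
    {"id": 1, "text": "The cat is on the mat"},
    {"id": 2, "text": "A tour of open-source libraries"},
]

# The function is ordinary Python, so it can be checked locally first.
print(remove_stop_words(rows))
```

The point of the design is that `remove_stop_words` knows nothing about Spark or Dask; only the `engine` argument passed to `transform()` decides where it runs.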
2. Ploomber
Ploomber is an open-source framework that helps data scientists build data pipelines fast and deploy them anywhere. Ploomber eliminates the need for time-consuming refactoring when moving code into production. It integrates seamlessly with Jupyter and popular IDEs such as VS Code, Spyder, and PyCharm.
Developing Maintainable Data Pipelines With Jupyter and Ploomber I PyData Chicago I September Meetup
3. Clustergrammer
Vizgen's MERFISH Mouse Brain Receptor Map Colab Notebook Tutorial
Clustergrammer is a web-based tool for visualizing high-dimensional data (e.g., a matrix) as an interactive and shareable hierarchically clustered heatmap. It is also available as a JupyterLab widget.
Here is an interactive Colab notebook that explores the Mouse Brain Receptor Atlas MERFISH data, using Clustergrammer2 to visualize high-dimensional gene data.
4. Crowd-Kit
Crowd-Kit is a Python library that implements commonly used aggregation methods for crowdsourced annotation and offers relevant metrics and datasets. The team behind it also offers a free online course on Coursera, which teaches efficient and scalable data labeling for ML and various business processes.
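To give a feel for what "aggregation" means here, below is a plain-Python sketch of majority voting, probably the simplest way to combine labels from several annotators. It does not use Crowd-Kit's API; the function name, the (task, worker, label) tuple layout, and the alphabetical tie-breaking rule are my own assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def majority_vote(annotations: Iterable[Tuple[str, str, str]]) -> Dict[str, str]:
    """Pick the most frequent label per task from (task, worker, label) tuples.

    Ties are broken alphabetically, which is an arbitrary choice for this sketch.
    """
    votes = defaultdict(Counter)
    for task, _worker, label in annotations:
        votes[task][label] += 1
    # sorted() puts labels in alphabetical order, and max() returns the first
    # label with the highest count, so ties resolve to the smallest label.
    return {task: max(sorted(counts), key=counts.get) for task, counts in votes.items()}


annotations = [
    ("img1", "w1", "cat"),
    ("img1", "w2", "cat"),
    ("img1", "w3", "dog"),
    ("img2", "w1", "dog"),
    ("img2", "w2", "dog"),
]

print(majority_vote(annotations))
```

Majority voting treats every worker as equally reliable, which is exactly the limitation that more elaborate aggregation methods try to address.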
That is all for this edition. See you with another roundup next month. You can subscribe to receive the newsletter directly in your inbox every month or share it with someone who might find it helpful.
Until next month,
Parul
Parul Pandey @pandeyparul

Breaking down data science jargon, one article at a time.

Created with Revue by Twitter.