A Case for Isolated Virtual Environments with PySpark

2021-02-10T09:09:39+00:00

This blog post motivates the use of virtual environments with Python and then shows how they can be a handy tool when deploying PySpark jobs to managed clusters.

Working efficiently with Jupyter Notebooks

2021-02-10T09:10:21+00:00

Having worked in data science for quite some years, I have seen good Jupyter notebooks but also a lot of ugly ones. Follow these best practices to work more efficiently with your notebooks and strike the perfect balance between text, code and visualisations.

If you have ever done something analytical or anything closely related to data science in Python, there is just no way you have not heard of IPython or Jupyter not

24/7 Spark Streaming on YARN in Production

2021-02-10T09:10:35+00:00

We have been running Spark Streaming on Apache Hadoop™ YARN in production for close to a year now. This is what we learned.

At a large client in the German food retailing industry, we have been running Spark Streaming on Apache Hadoop™ YARN in production for close to a year now. Overall, S

HyperLogLog on Spark Streaming – Estimating Cardinalities Within a Data Stream

2021-02-10T09:10:52+00:00

An investigation of the implementation and practical suitability of HyperLogLog on Apache Spark Streaming, using a simple prototype.

As part of a research project, the implementation and practical suitability of HyperLogLog on Apache Spark Streaming were investigated using a simple prototype
