Python: Cleanup all Conda environments

Conda is a popular open-source package and environment manager, widely used with Python by data scientists and machine learning engineers to manage packages and dependencies. Occasionally you may need to remove all of your Conda environments, for example to start from a clean slate or before switching to a different version of Conda. This post explains how to do that.

Before we begin, note that removing all environments in Conda deletes every existing environment together with its installed packages. This is irreversible, so back up your environment and package information first, for example by exporting each environment with conda env export --name <env> > <env>.yml. Then follow the steps below.

Step 1: Open the Anaconda prompt or terminal

The first step is to open a shell in which Conda is available. On Windows, search for “Anaconda Prompt” in the search bar; on macOS or Linux, open your usual terminal.

Step 2: Deactivate the active environment

Conda refuses to remove the environment you are currently working in, so deactivate it first by running the following command in the Anaconda prompt or terminal:

conda deactivate

This deactivates the current environment, and your prompt no longer displays its name. If you activated several environments one after another, repeat the command until the prompt shows only (base) or no environment name at all.

Step 3: Remove all environments

To remove all environments in Conda, use the conda env list command to list the existing environments, and then use a loop to remove them one by one. First, run the following command in the Anaconda prompt or terminal:

conda env list

This command lists all the environments that Conda knows about. Make sure that no environment other than base is active before running the loop below. The loop uses Bash syntax, so it will not run as written in the Windows Anaconda Prompt (cmd.exe):

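# Skip comment lines (starting with '#') and blank lines, and match "base" exactly
# so that names such as "database" are not skipped by accident.
for env in $(conda env list | awk '!/^#/ && NF && $1 != "base" {print $1}'); do
    conda remove --name "$env" --all -y
done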

This loop will remove all environments except for the base environment, which is the default environment created by Conda.
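The filtering step is the part that is easiest to get wrong, so here is a minimal Python sketch of the same selection logic. It only parses text that looks like the output of conda env list; the sample listing, its paths and the function name removable_envs are made up for illustration.

```python
# Hypothetical sample of `conda env list` output (names and paths are invented).
SAMPLE_LISTING = """\
# conda environments:
#
base                  *  /home/user/miniconda3
py310                    /home/user/miniconda3/envs/py310
database                 /home/user/miniconda3/envs/database
                         /opt/projects/demo/env
"""


def removable_envs(listing, keep=("base",)):
    """Return the named environments in the listing, except those in `keep`."""
    names = []
    for line in listing.splitlines():
        # Skip blank lines and the '#' header lines.
        if not line.strip() or line.startswith("#"):
            continue
        # A line that starts with whitespace holds only a path: the environment
        # has no name, so `--name` cannot be used to remove it.
        if line[0].isspace():
            continue
        name = line.split()[0]
        if name not in keep:  # exact comparison, so "database" is not treated as "base"
            names.append(name)
    return names


print(removable_envs(SAMPLE_LISTING))
```

Each returned name could then be passed to conda remove --name <name> --all -y. Environments listed by path only would have to be removed by path instead.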

Step 4: Confirm removal

After running the loop, you can confirm that the environments have been removed by running the conda env list command again. It should now display only the base environment.

In conclusion, removing all environments in Conda is a simple process that involves deactivating all active environments and then using a loop to remove each environment one by one. It is important to note that this process is irreversible, so it is recommended that you create a backup of your environment and package information before proceeding with the removal process.

Docker:  Stop all containers now!

Docker has changed our world. It is a very useful component in the toolbox of any Software Engineer or, perhaps, any forward-thinking company. What is Docker? Why Docker? Answers to these and other questions on the topic start here: Docker

Sometimes you are just too annoyed with the state of your machine and need to stop all running containers and then remove them. Here are two simple commands:

# stop all running docker containers (docker ps -q lists running containers only)
docker stop $(docker ps -q)
# remove all docker containers, running or stopped (-a includes stopped ones)
docker rm $(docker ps -a -q)
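One caveat with the command substitution: when there are no containers, docker stop and docker rm receive no arguments and exit with an error. A small Python sketch of how a script might guard against that is shown here. The helper name docker_command and the container IDs are hypothetical, and the function only builds the argument list rather than calling Docker.

```python
def docker_command(action, ids_output):
    """Build a `docker <action>` argument list from the output of `docker ps -q`.

    Returns None when there are no container IDs, so the caller can skip the call.
    """
    ids = ids_output.split()
    if not ids:
        return None  # nothing to stop or remove
    return ["docker", action, *ids]


# Example with invented container IDs, one per line as `docker ps -q` prints them.
print(docker_command("stop", "3f2a9c1b7d10\n9b8e7d6c5a41\n"))
print(docker_command("rm", ""))
```

The resulting list could be handed to subprocess.run when it is not None.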

That is still not enough for the lazy engineer. Simply add these as aliases in your bash/zsh profile and you can then call the commands in a consistent and easy manner.

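# I am either .bashrc or .zshrc or .bash_profile or whatever profile you use
# (profiles are sourced by the shell, so no shebang line is needed)
# Single quotes delay $(...) until the alias is actually used.
alias docker_stop_all='docker stop $(docker ps -q)'
alias docker_remove_all='docker rm $(docker ps -a -q)'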

Hope this helps.

Bash:  Delete History in Terminal

Sometimes, while working in the terminal (or command line, as you may know it), the history becomes clogged with simple, common commands like ls -al, and you want to remove X lines to keep your history compact and relevant. Here’s a simple command that works:

for i in {1..total_count}; do history -d start_from; done

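This deletes total_count entries from your history, starting at line number start_from. The same line number is used on every pass because, after each deletion, the entries that follow move up by one. Replace both placeholders with numbers, as in the usage example below. Make your Shell clean again!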

Usage:

for i in {1..40}; do history -d 1000; done
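To see why deleting the same line number again and again removes a consecutive run of entries, here is a small Python model of the history list. It is only an illustration of the index shifting, not how Bash is implemented; the function name and the sample commands are invented.

```python
def delete_history(entries, start_from, total_count):
    """Model `history -d start_from` repeated total_count times (1-based numbering)."""
    remaining = list(entries)
    for _ in range(total_count):
        if start_from > len(remaining):
            break  # nothing left at that position
        # Deleting position start_from shifts every later entry up by one,
        # so the next pass removes what used to be the following entry.
        del remaining[start_from - 1]
    return remaining


history = ["ls -al", "cd src", "ls -al", "ls -al", "git status", "make test"]
print(delete_history(history, start_from=3, total_count=2))
```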

Enjoy.

Jest: Ignore CSS files while running tests

To ensure Jest ignores CSS files when running tests, simply follow the steps below:

  1. Install the `identity-obj-proxy` module:

npm install -D identity-obj-proxy

  2. Update your Jest config (shown here as the "jest" key in package.json; in a standalone jest.config.json, drop the outer "jest" wrapper):

{
  "jest": {
    "moduleNameMapper": {
      "\\.(jpg|jpeg|png|gif|eot|otf|webp|svg|ttf|woff|woff2|mp4|webm|wav|mp3|m4a|aac|oga)$": "<rootDir>/__mocks__/fileMock.js",
      "\\.(css|less|scss|sass)$": "identity-obj-proxy"
    }
  }
}

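Now run your tests again. You should no longer see failures caused by CSS files. Note that the first mapping expects a stub file to exist at __mocks__/fileMock.js in your project.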
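If you are unsure which imports the style pattern catches, you can try the regular expression on a few paths. The sketch below uses Python's re module purely for illustration (Jest evaluates the pattern as a JavaScript regular expression, but this particular pattern behaves the same way in both); the sample paths are made up.

```python
import re

# Same pattern as the moduleNameMapper key, without the JSON escaping of the backslash.
STYLE_PATTERN = re.compile(r"\.(css|less|scss|sass)$")


def is_style_import(path):
    """Return True when the import path ends in one of the mapped style extensions."""
    return STYLE_PATTERN.search(path) is not None


for path in ["./Button.css", "../theme/main.scss", "./styles.css.js", "./Button.jsx"]:
    print(path, is_style_import(path))
```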

Compare Millions of messages using Spark(PySpark)

Using the Power of Apache Spark(PySpark) to Verify Loads of Messages

Imagine a scenario where you have a system that costs about £30K to run and maintain, and that is also a necessary component in your company’s audit trail. It cannot simply be deprecated just like that, as this would be an expensive business decision.

Overview of Legacy System

All events in the system emit messages for audit purposes. The messages are immediately available, initially for 7 days and then for 30 days. Once the 30-day period has elapsed, the messages are archived. The archive storage location is simply AWS S3.

Stored messages can be retrieved from S3 as and when needed.

Various teams in the organisation need access to these audit messages. The messages are all in JSON format and their keys appear in no fixed order, for example:

Message 1:

{ "username": "bonzo", "location": "London", "subject": "mathematics", "activityTimestamp": "2017-02-01:21:32:00Z"}

Message 2:

{ "username": "iyabo", "subject": "english literature", "activityTimestamp": "2017-05-09:06:03:09Z", "location": "Abeokuta"}

A project to replace this legacy system has been initiated and will use Apache Kafka as the platform to produce and consume these audit events. Kafka is open source, and if you stick with the free version rather than subscribe to the services provided by Confluent, the company founded by Kafka’s original creators, then you have to do a lot of hands-on work with this beauty called Kafka. If you want to know more about Kafka, simply visit Confluent.io (don’t click away now!)

The legacy system produces audit messages in JSON format, so the new system will use the same message format. The real challenge is to ensure that the new system produces exactly the same messages as the legacy system, so that no audit message is missing when the legacy system is switched off. In short, a like-for-like system.

The key question that must be answered is this:

“Do we have all messages?”

If yes, then let’s pick some messages at random from the legacy system and confirm that our new system has exactly the same messages.

To verify massive data (BigData), you use either Stir and Compare (sampling) or Minus queries. In simple terms, Stir and Compare takes a sample from the source data and checks whether that chunk of data is present in the target data. Simple? This strategy has limitations because we are talking about big data: it won’t fit in an Excel spreadsheet (over 1 million rows!).
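As a rough illustration of Stir and Compare, here is a minimal Python sketch that draws a random sample from the source messages and reports the sampled messages that have no counterpart in the target. It assumes both sides fit in memory as lists of JSON strings, which is exactly the limitation mentioned above; the function names are invented for this example.

```python
import json
import random


def canonical(message):
    """Re-serialise a JSON message with sorted keys so that key order does not matter."""
    return json.dumps(json.loads(message), sort_keys=True)


def sample_and_compare(source, target, sample_size, seed=None):
    """Return the sampled source messages that are absent from the target."""
    rng = random.Random(seed)
    target_set = {canonical(m) for m in target}
    sample = rng.sample(source, min(sample_size, len(source)))
    return [m for m in sample if canonical(m) not in target_set]


legacy = [
    '{"username": "bonzo", "location": "London"}',
    '{"username": "iyabo", "location": "Abeokuta"}',
]
new = ['{"location": "London", "username": "bonzo"}']  # same message, different key order

print(sample_and_compare(legacy, new, sample_size=2))
```

An empty result only says that the sampled messages were found; it does not prove that every message is present.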

The Minus Query strategy simply uses machines to do the job. Run a query on the source data and on the target data, then subtract one result set from the other; the outcome is the set of differences between the two data sources. Please note that the two queries may be in different query languages (SQL/HQL; data people, you understand).

Back to the task.

You can go about setting up complex tools and running scripts and any other thing that makes you feel you’re using technology to solve the problem.

Hold on, there’s a simple solution: Apache Spark. You can check out a quick overview and quick start on Spark here, and the PySpark flavour can be found here.

Let’s assume you’re up and running with PySpark. Next, load the data you need to compare. Let’s call the two sets of data primary and secondary: primary is the source data (legacy) and secondary is the new source of data.

In Spark, we use RDDs (Resilient Distributed Datasets) to hold the data, and in this case I will use rdd1 and rdd2.
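The post stops before showing the comparison itself, so what follows is only a sketch of how a minus query over rdd1 and rdd2 might look, not the author's actual code. It assumes each RDD element is one JSON message as a string, and it relies on the RDD methods map and subtract. The helper names canonical and missing_messages are invented.

```python
import json


def canonical(line):
    """Parse one JSON message and re-serialise it with sorted keys.

    Two messages with the same content but a different key order
    become identical strings, so they can be compared directly.
    """
    return json.dumps(json.loads(line), sort_keys=True)


def missing_messages(rdd1, rdd2):
    """Minus query: messages in rdd1 (primary) that do not appear in rdd2 (secondary)."""
    return rdd1.map(canonical).subtract(rdd2.map(canonical))
```

With a SparkContext available, rdd1 and rdd2 could be loaded with sc.textFile from wherever the two exports live, and missing_messages(rdd1, rdd2).count() returning 0 would suggest the new system has every legacy message. Swapping the arguments shows messages that exist only in the new system.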

Elasticsearch:  Bulk ingest data

Often you think of a solution to a simple problem and once you come up with that solution you realise you need to apply this to a large dataset.   In this post, I will explain how I deployed a simple solution to a larger dataset while preparing the system for future growth.  Here’s the state of play before changes in my client’s eCommerce system:

Existing System:

  1. Log in to the aggregator’s portal to retrieve the datafeed URI
  2. Log in to the customer admin interface to create or update merchant details
  3. Create a cron job to pull data from the partner URI after initial setup
  4. Cron job dumps the data in MySQL
  5. Client shopping UI presents a search field and filters for customers to use
  6. Search is extremely slow (homepage: 2.03 s, search results page: 35.73 s, product page: 28.68 s). The load times for the search results and product pages are completely unacceptable

Proposed System:

Phase 1:

  1. Follow steps 1-4 of the existing system
  2. Export MySQL data as CSV
  3. Create an instance of Elasticsearch with an index to store product data
  4. Create a script that bulk inserts the exported data into Elasticsearch
  5. On the command line, search the Elasticsearch instance using various product attributes (product name, type, category, size etc.) and check the response time of the search results
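To make the bulk insert step more concrete, here is a minimal Python sketch that turns CSV text into a request body for the Elasticsearch _bulk endpoint, which expects newline-delimited JSON with an action line before each document. This is not the Elasticsearch-CSV library mentioned below. The index name "products" and the sample columns are assumptions, and older Elasticsearch versions also expect a document type in the action line or URL.

```python
import csv
import io
import json


def bulk_body(csv_text, index_name):
    """Turn CSV rows into an Elasticsearch _bulk request body (newline-delimited JSON)."""
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text, newline="")):
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(row))                                # document line
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical export with invented columns.
SAMPLE_CSV = "name,category,size\nRoad bike,bikes,M\nCity bike,bikes,L\n"
print(bulk_body(SAMPLE_CSV, "products"), end="")
```

The resulting text would be sent in a POST request to the _bulk endpoint. For a large export it is usually sent in batches rather than as one request.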

Phase 2:

  1. Build a search interface that uses Elasticsearch
  2. Display search results with pagination
  3. Add filters to search results
  4. A/B test the existing search interface against the Elasticsearch-based one and compare conversion (actual sales)
  5. Switch on the best solution – Elasticsearch

A few libraries and tools already exist that can help with some of these challenges, e.g.

  • Elasticsearch-CSV – https://www.npmjs.com/package/elasticsearch-csv
  • SearchKit – https://github.com/searchkit/searchkit
  • PingDom – https://tools.pingdom.com

In the next post, I will dive deep into how I used Elasticsearch-CSV to quickly ingest merchant data and the response I got.

Elasticsearch:  Bulk ingest data

Elasticsearch Cleansing

When you run a custom-built Elasticsearch on AWS, you have to do everything on the console. AWS has its own Elasticsearch offering, but this project was handed over to me and it runs an old instance of Elasticsearch that predates the AWS service.
Data pollution is a common problem, and you have to know exactly what to do to cleanse such data effectively when it happens. I had a case of polluted data that, if left untreated, would have put my client in a very bad position, to the point where customers could sue my client. First and foremost, the data pollution was not my fault. With that out of the way, I had to trace the journey of the data to identify the source of the pollution.
Let me describe the system a bit, so you get the picture. The infrastructure has 4 main components. The first component is a system that generates CSV files based on user searches. The second component inserts each user search field value into a database (sort of). The third component picks up the generated CSV files, populates an instance of Elasticsearch and deletes each CSV file after 3 hours, by which time 2 new files have been added to the CSV repository.
Commands to do the job
# Elasticsearch Monitoring
# Cluster Health
# Green: excellent
# Yellow: one replica is missing
# Red: at least one primary shard is down
curl -X GET http://localhost:9200/_cluster/health | python -m json.tool
curl -X GET http://${ip_address}:9200/_cluster/health | python -m json.tool

# Cluster health broken down per index (URLs containing '?' are quoted so the shell leaves them alone)
curl -XGET 'http://localhost:9200/_cluster/health?level=indices' | python -m json.tool

# Wait (until the request times out) for the cluster to reach a status - green, yellow, red
curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green' | python -m json.tool

# Shard level
curl -XGET 'http://localhost:9200/_cluster/health?level=shards' | python -m json.tool

curl -XGET http://localhost:9200/_all/_stats | python -m json.tool

# Bikes
curl -XGET http://localhost:9200/bike_deals/_stats | python -m json.tool

# Cars
curl -XGET http://localhost:9200/car_deals/_stats | python -m json.tool

# Multiple indices check
curl -XGET http://localhost:9200/bike_deals,car_deals/_stats | python -m json.tool

# Check Nodes (the node statistics endpoint is _nodes/stats, without the second underscore)
curl -XGET http://localhost:9200/_nodes/stats | python -m json.tool

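# DELETE a specific index on Elastic: this removes the index itself along with all its deals
curl -XDELETE 'http://localhost:9200/bike_deals/?pretty=true' | python -m json.tool #powerful! Be careful!!!!

# DELETE all documents but keep the index (delete-by-query as in old Elasticsearch 1.x;
# later versions use POST /<index>/_delete_by_query instead)
curl -XDELETE 'http://localhost:9200/bike_deals/_query' -d '{ "query" : { "match_all" : {} } }' | python -m json.tool
curl -XDELETE 'http://localhost:9200/car_deals/_query' -d '{ "query" : { "match_all" : {} } }'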
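If you want to act on the health check in a script rather than read the JSON by eye, the status field can be interpreted along these lines. This is a minimal Python sketch that works on the text of a _cluster/health response; the sample response is invented and shortened, and the function name is hypothetical.

```python
import json

# Meaning of each status colour, as summarised in the comments above.
MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primary shards are allocated, but at least one replica is not",
    "red": "at least one primary shard is not allocated",
}


def describe_health(response_text):
    """Summarise the text of a _cluster/health response in one line."""
    health = json.loads(response_text)
    status = health["status"]
    return f"{health['cluster_name']}: {status} ({MEANING[status]})"


# Invented, shortened sample response.
SAMPLE = '{"cluster_name": "deals", "status": "yellow", "unassigned_shards": 5}'
print(describe_health(SAMPLE))
```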

What to Improve in 2016

I read an article on Oreilly.com about Software Engineers and it really resonated with me; feel free to read the article yourself. As a Software Engineer, I have discovered the miracle of self-development. No company or manager can stop you if you choose the path of “continuous” self-development. The world of programming used to be a mystery to me; I only discovered computers after my first degree! The question then was, how can text and machines do so much? I remember a colleague at Shazam encouraging me to pick up a programming language. Don’t get me wrong, I was a Manual Test Engineer back then and I loved finding bugs. It gave me great pleasure and my colleagues were aware of it. It was all manual testing until I found a bug and had to check the server logs for proper bug reporting. Then everything changed: I seemed to be lost in the world of “Why is this happening?”

I took up the challenge and found a book on C (that was in early 2005!) – I was carefully advised to ignore it and go for an Object Oriented language.  Java was the king then and I started studying and writing code – for real!  I became excited about the fact that I could write some lines of code – hence having a better understanding of what’s happening in the backend.  This greatly improved my communication of bugs to the developers.  So rather than just say “the server did not return any message” it became “there was an exception thrown due to null entry on the IVR (then paste the stack trace of the exception)”.  That was the start.

Since that nudged-start in 2005, I have added to my arsenal as a Software Engineer, new languages, test frameworks, web frameworks to mention but a few.  So what do I need to improve in 2016?  

Managing people and technology. More Cloud technologies, maybe an AWS certification (to make me more serious :)), building and deploying SaaS apps with TDD, BigData plus Analytics, Leadership for engineers, and I might peep into the Software Architect world. I think having a bird’s eye view of software systems might be a good thing. I will review this at the end of the year. Have a great year engineering software that makes a difference!

 

BigData at the Commandline

BigData and Agile did not seem to be friends in the past, but that is no longer the case. One of the important points in processing data is data integrity. Assume you are pulling data from an API (Application Programming Interface) and performing some processing on the result before dumping it as UTF-8 gzipped CSV files on Amazon’s S3. The task is to confirm that the files are properly encoded (UTF-8), that each file has the appropriate headers and that no row in any file has missing data, and finally to produce a report with filenames, column count, record count and encoding type. There are many languages today and we could use any of them, BUT speed is of great importance. Also, we want to have a Jenkins (Continuous Integration server) job running these checks.

I have decided to use Bash to perform these checks and will do it twice! First, I will use basic Bash commands, and then I will use csvkit (http://csvkit.readthedocs.org/). The other tool in the mix is the AWS command-line tool (aws-cli).
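The post goes on to use Bash, but purely to pin down what the checks involve, here is a rough Python sketch of them for a single file. It assumes the gzipped CSV has already been downloaded and is held in memory as bytes; the function name, the report fields and the sample data are all invented for illustration.

```python
import csv
import gzip
import io


def check_csv_gz(filename, data, expected_header):
    """Check one gzipped CSV (given as bytes) and return a report row as a dict."""
    report = {"filename": filename, "encoding": "utf-8",
              "columns": 0, "records": 0, "problems": []}
    try:
        text = gzip.decompress(data).decode("utf-8")
    except UnicodeDecodeError:
        report["encoding"] = "not utf-8"
        report["problems"].append("file is not valid UTF-8")
        return report
    rows = list(csv.reader(io.StringIO(text, newline="")))
    header, records = (rows[0], rows[1:]) if rows else ([], [])
    report["columns"] = len(header)
    report["records"] = len(records)
    if header != list(expected_header):
        report["problems"].append("unexpected header")
    for number, row in enumerate(records, start=2):  # row 1 is the header
        # A short row or an empty cell both count as missing data.
        if len(row) != len(header) or "" in row:
            report["problems"].append(f"row {number}: missing data")
    return report


# Invented sample: the second record has an empty "name" cell.
sample = gzip.compress("id,name\n1,Ada\n2,\n".encode("utf-8"))
print(check_csv_gz("users.csv.gz", sample, ["id", "name"]))
```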

Bash: List all files but ignore some

On the command line (terminal) in the *nix world, you sometimes need to list all the files in a directory but ignore some based on file extension, e.g. pdf, sh, tsv etc. The command below is quite appropriate for that. Remember to update the list of extensions you want to ignore, i.e. “sh|tsv|rb|properties”.

ls -l | grep -Ev '\.(sh|tsv|rb|properties)$' | column
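The same filter can be expressed in a script. Here is a minimal Python sketch of the idea, applied to a list of file names rather than to ls -l output; the function name and the sample file names are made up.

```python
import re


def visible_files(names, ignored_extensions):
    """Return the names that do not end in one of the ignored extensions."""
    # Builds the same kind of pattern as the grep command: \.(sh|tsv|rb|properties)$
    alternatives = "|".join(re.escape(ext) for ext in ignored_extensions)
    pattern = re.compile(r"\.(" + alternatives + r")$")
    return [name for name in names if not pattern.search(name)]


files = ["report.pdf", "deploy.sh", "data.tsv", "app.rb", "build.properties", "notes.txt", "sh"]
print(visible_files(files, ["sh", "tsv", "rb", "properties"]))
```

A file simply called sh is kept, because the pattern requires a dot before the extension.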

Let me know if this is useful.