Elasticsearch Cleansing

When you run a custom-built Elasticsearch on AWS, you have to do everything from the console. AWS now has its own Elasticsearch offering, but this project was handed over to me, and it runs an old instance of Elasticsearch set up before AWS had one.

Data pollution is a common problem, and you need to know exactly what to do to cleanse the data effectively when it happens. I had a case of polluted data that, left untreated, would have put my client in a very bad position: their customers could sue. First and foremost, the data pollution was not my fault. With that out of the way, I had to trace the journey of the data to identify the source of the pollution.

Let me describe the system a bit, so you get the picture. The infrastructure has 4 main components. The first component generates CSV files based on user searches. The second inserts each user search field value into a database (sort of). The third picks up the generated CSV files, populates an instance of Elasticsearch, and deletes each CSV file after 3 hours, by which time 2 new files have been added to the CSV repository.
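To make the 3-hour window concrete, here is a minimal sketch of how stale CSV files could be listed. It assumes the age is judged by file modification time, which the system above may or may not do; `stale_csv_files` and the directory argument are hypothetical names, not part of the real system.

```shell
# Hypothetical helper: list CSV files in a directory that were last
# modified more than 180 minutes (3 hours) ago.
# Assumption: "age" means modification time; the real component may differ.
stale_csv_files() {
  find "$1" -maxdepth 1 -type f -name '*.csv' -mmin +180
}

# Usage (the path is an example only):
#   stale_csv_files /path/to/csv_repository
```

Listing the files first, rather than deleting them straight away, makes it easy to check what would be removed.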

## Elasticsearch Monitoring

# Cluster Health

# Green: excellent, all primary and replica shards are allocated

# Yellow: all primary shards are allocated, but at least one replica is missing

# Red: at least one primary shard is down

curl -X GET http://localhost:9200/_cluster/health | python -m json.tool

curl -X GET "http://${ip_address}:9200/_cluster/health" | python -m json.tool
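If you want to use the colour in a script (a cron check, for instance), one option is to turn it into an exit code. The sketch below is only an illustration: `health_status` and `health_exit_code` are hypothetical helper names, and the JSON is picked apart with grep and sed, which is fragile but needs no extra tools.

```shell
# Hypothetical helper: print the first "status" value found in the
# cluster health JSON read from stdin (green, yellow or red).
health_status() {
  grep -o '"status" *: *"[a-z]*"' | head -n 1 | sed 's/.*"\([a-z]*\)"$/\1/'
}

# Hypothetical helper: map the colour to an exit code.
# 0 = green, 1 = yellow, 2 = red, 3 = unknown or unparsable response.
health_exit_code() {
  case "$(health_status)" in
    green)  return 0 ;;
    yellow) return 1 ;;
    red)    return 2 ;;
    *)      return 3 ;;
  esac
}

# Usage (assumes a node on localhost:9200):
#   curl -s 'http://localhost:9200/_cluster/health' | health_exit_code; echo $?
```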

# Cluster Health per index

curl -XGET 'http://localhost:9200/_cluster/health?level=indices' | python -m json.tool

# Wait for a status colour - green, yellow, red (the call blocks until the cluster reaches that status or the request times out)

curl -XGET 'http://localhost:9200/_cluster/health?wait_for_status=green' | python -m json.tool
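The wait can also be given an explicit `timeout`, and the response carries a `timed_out` field that tells you whether the wait gave up before the requested colour was reached. A minimal sketch, with `health_wait_url` and `wait_timed_out` as hypothetical helper names and port 9200 assumed:

```shell
# Hypothetical helper: build the health URL for a host, a target colour
# and a timeout such as 50s.
health_wait_url() {
  printf 'http://%s:9200/_cluster/health?wait_for_status=%s&timeout=%s\n' "$1" "$2" "$3"
}

# Hypothetical helper: succeed if the health JSON on stdin reports
# that the wait timed out.
wait_timed_out() {
  grep -q '"timed_out" *: *true'
}

# Usage (assumes a node on localhost:9200):
#   if curl -s "$(health_wait_url localhost green 50s)" | wait_timed_out; then
#     echo 'cluster did not reach green in time'
#   fi
```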

# Shard level

curl -XGET 'http://localhost:9200/_cluster/health?level=shards' | python -m json.tool

# Stats for all indices

curl -XGET http://localhost:9200/_all/_stats | python -m json.tool

# Bikes

curl -XGET http://localhost:9200/bike_deals/_stats | python -m json.tool

# Cars

curl -XGET http://localhost:9200/car_deals/_stats | python -m json.tool

# Multiple indices check

curl -XGET http://localhost:9200/bike_deals,car_deals/_stats | python -m json.tool
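As the last command shows, several indices are checked at once by joining their names with commas. If the list of indices changes often, a small helper can build the URL; `stats_url` is a hypothetical name and port 9200 is assumed.

```shell
# Hypothetical helper: build a _stats URL for a host and one or more
# index names, joined with commas.
# The function body runs in a subshell, so the IFS change stays local.
stats_url() (
  host=$1
  shift
  IFS=,
  printf 'http://%s:9200/%s/_stats\n' "$host" "$*"
)

# Usage (assumes a node on localhost:9200):
#   curl -s "$(stats_url localhost bike_deals car_deals)" | python -m json.tool
```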

# Check Nodes

curl -XGET http://localhost:9200/_nodes/stats | python -m json.tool

# DELETE a specific index on Elastic (this removes the index itself, along with every deal in it). Powerful! Be careful!

curl -XDELETE 'http://localhost:9200/bike_deals/?pretty=true' | python -m json.tool
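Because that command is so destructive, it can help to put a small guard in front of it. The following is a sketch of one possible guard, not something the original setup had: `confirm_index_delete` is a hypothetical helper that succeeds only if you type the index name back.

```shell
# Hypothetical helper: ask the operator to retype the index name.
# Returns 0 only when the typed answer matches exactly.
confirm_index_delete() {
  index="$1"
  # The prompt goes to stderr so it never mixes with piped output.
  printf 'Type the index name (%s) to confirm deletion: ' "$index" >&2
  IFS= read -r answer
  [ "$answer" = "$index" ]
}

# Usage (assumes a node on localhost:9200):
#   confirm_index_delete bike_deals &&
#     curl -XDELETE 'http://localhost:9200/bike_deals/?pretty=true'
```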

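# DELETE all deals on a specific index but keep the index itself (delete by query; this _query endpoint belongs to old Elasticsearch releases like the one running here)

curl -XDELETE 'http://localhost:9200/bike_deals/_query' -d '{ "query" : { "match_all" : {} } }' | python -m json.tool

curl -XDELETE 'http://localhost:9200/car_deals/_query' -d '{ "query" : { "match_all" : {} } }' | python -m json.tool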
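After a cleanup like this, it is worth checking that the polluted documents are really gone. One way is the `_count` endpoint, which returns the number of documents in an index. The helpers below are a hypothetical sketch (`doc_count` and `index_is_empty` are made-up names) and parse the JSON with grep, which is fragile but dependency-free.

```shell
# Hypothetical helper: print the first "count" value found in a
# _count response read from stdin.
doc_count() {
  grep -o '"count" *: *[0-9]*' | head -n 1 | grep -o '[0-9][0-9]*$'
}

# Hypothetical helper: succeed only if the _count response on stdin
# reports zero documents.
index_is_empty() {
  [ "$(doc_count)" = "0" ]
}

# Usage (assumes a node on localhost:9200):
#   curl -s 'http://localhost:9200/bike_deals/_count' | doc_count
#   curl -s 'http://localhost:9200/car_deals/_count' | index_is_empty && echo 'car_deals is clean'
```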