Elasticsearch:  Bulk ingest data

Elasticsearch: Bulk ingest data

Often you think of a solution to a simple problem and once you come up with that solution you realise you need to apply this to a large dataset.   In this post, I will explain how I deployed a simple solution to a larger dataset while preparing the system for future growth.  Here’s the state of play before changes in my client’s eCommerce system:

Existing System:

  1. Login to aggregator’s portal to retrieve datafeed URI
  2. Login to customer admin interface to create or update merchant details
  3. Create cron job to pull data from partner URI after inital setup
  4. Cron job dumps data in MySQL
  5. Client shopping UI presents search field and filters to customers to search and use
  6. Search result is extremely slow (homepage: 2.03 s, search results page:35.73 s, product page:28.68 s ).  Notice the search results and product pages are completely unacceptable

Proposed System:

Phase 1:

  1. Follow steps 1-4 of existing system
  2. Export MySQL data as csv
  3. Create an instance of Elasticsearch with an index to store product data
  4.   Export MySQL data as CSV
  5. Create script that bulk insert the exported data into Elasticsearch
  6.   On command line, search Elasticsearch instance using various product attributes (product name, type, category, size etc.).  Check the time speed of search results.

Phase 2:

  1. Build a search interface that uses Elasticsearch
  2. Display search results with pagination
  3. Add filters to search results
  4. AB Test existing search interface and Elasticsearch based and compare conversion (actual sales)
  5. Switch on the best solution – Elasticsearch

A few libraries already exists that can solve some of these challenges e.g

  • Elasticsearch-CSV – https://www.npmjs.com/package/elasticsearch-csv
  • SearchKit – https://github.com/searchkit/searchkit
  • PingDom – https://tools.pingdom.com

In the next post, I will dive deep into how I used Elasticsearch-CSV to quickly ingest merchant data and the response I got

Elasticsearch:  Bulk ingest data

Elasticsearch Cleansing

By running a custom built Elasticsearch on AWS, you have to do everything on the console.   AWS has it’s Elasticsearch offering but I had this project handed over to me and it’s running an old instance of Elasticsearch before AWS has its own.
Data pollution is a common problem and you have to know exactly what to do to ensure effective cleansing of such data when it happens.  So, I had a case of polluted data that if not treated will put my client on a very bad state – such that the customers can sue my client.   First and foremost, the data pollution was not my fault.  With that out of the way, I had to trace the journey of the data to identify the source of the pollution.  Let me describe the system a bit, so you get the picture.  The infrastructure has 4 main components.  The first component is a system that generates CSV files based on user searches.  The second component, inserts each user search field value in an database(sort of).   The third component picks up the generated CSV files, populates an instance of Elasticsearch and deletes the CSV file after 3 hours in which case 2 other new files have been added to the CSV repository.
Commands to do the job
# # Elasticsearch Monitoring
# Cluster Health
# Green: excellent
# Yellow: one replica is missing
# Red: at least one primary shard is down
curl -X GET http://localhost:9200/_cluster/health | python -m json.tool
curl -X GET http://${ip_address}:9200/_cluster/health | python -m json.tool

# Specific Cluster Health
curl -XGET http://localhost:9200/_cluster/health?level=indices | python -m json.tool

# Check Status via colours - green, yellow, red
curl -XGET http://localhost:9200/_cluster/health?wait_for_status=green | python -m json.tool

# Shard level
curl -XGET http://localhost:9200/_cluster/health?level=shards | python -m json.tool

curl -XGET http://localhost:9200/_all/_stats | python -m json.tool

# Bikes
curl -XGET http://localhost:9200/bike_deals/_stats | python -m json.tool

# Cars
curl -XGET http://localhost:9200/car_deals/_stats | python -m json.tool

# Multiple indices check
curl -XGET http://localhost:9200/bike_deals,car_deals/_stats | python -m json.tool

# Check Nodes
curl -XGET http://localhost:9200/_nodes/_stats | python -m json.tool

# DELETE all deals on specific index on Elastic
curl -XDELETE 'http://localhost:9200/bike_deals/?pretty=true' | python -m json.tool #powerful! Be careful!!!!

curl -XDELETE 'http://localhost:9200/bike_deals/_query' -d '{ "query" : { "match_all" : {} } }' | python -m json.tool
curl -XDELETE 'http://localhost:9200/car_deals/_query' -d '{ "query" : { "match_all" : {} } }'