Swathi V


Monitoring S3 uploads for real-time data

If you are working on Big Data and bleeding-edge technologies such as Hadoop, the first thing you need is a “dataset” to work on. This data can be reviews, blogs, news, social media data (Twitter, Facebook, etc.), domain-specific data, research data, forums, groups, feeds, firehose data and so on. Generally, companies approach data vendors to obtain this kind of data.

Normally, these data vendors dump the data into a shared server environment. To process this data with MapReduce and similar tools, we first move it to S3 for storage and then process it from there. Assuming the data comes from social media such as Twitter or Facebook, it is typically dumped into directories named by date. In the majority of cases, that is the practice.

Now assume a stream of data of 140-150 GB/day being dumped in a hierarchy like 2013/04/15, i.e. yyyy/mm/dd format. How do you

  • upload the files to S3 in the same hierarchy, to a given bucket?
  • monitor the new incoming files and upload them?
  • use the disk space effectively?
  • ensure the reliability of uploads to S3?
  • clean up the logs, if logging is enabled for tracking?
  • retry the failed uploads?

These were some of the questions running at the back of my mind when I wanted to automate the uploads to S3. I also wanted zero human intervention, or at least as little as possible!

So, I came up with

  • s3sync.
  • Watcher. A big thanks! This helped me with the monitoring part and it works great!
  • a few of my own scripts.

What are the ingredients?

  • Installation of s3sync. In practice I use only one script from this package, s3cmd.rb, and not s3sync itself. I may use s3sync in the future, so it is installed as well.
  • Install Ruby from the repository
    $ sudo apt-get install ruby libopenssl-ruby
    Confirm with the version
    $ ruby -v
     
    Download and unzip s3sync
    $ wget http://s3.amazonaws.com/ServEdge_pub/s3sync/s3sync.tar.gz
    $ tar -xvzf s3sync.tar.gz
     
    Install the certificates.
    $ sudo apt-get install ca-certificates
     
    Add the credentials to s3config.yml so that s3sync can connect to S3.
    $ cd s3sync/
    $ sudo vi s3config.yml
    aws_access_key_id: ABCDEFGHIJKLMNOPQRST
    aws_secret_access_key: hkajhsg/knscscns19mksnmcns
    ssl_cert_dir: /etc/ssl/certs
     
    Replace aws_access_key_id and aws_secret_access_key with your own credentials.
  • Installation of Watcher.
  • Go to https://github.com/greggoryhz/Watcher
    Copy https://github.com/greggoryhz/Watcher.git to your clipboard
    Install git if you have not already
    $ sudo apt-get install git
     
    Clone Watcher
    $ git clone https://github.com/greggoryhz/Watcher.git
    $ cd Watcher/
  • My own wrapper scripts.
  • cron
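
Before moving on, it can help to confirm that the tools listed above are actually on the PATH. The following is only a sketch: the helper name missing_tools is made up for illustration, and the list of tools (ruby, git, python, crontab) is an assumption based on the steps above.

```shell
#!/bin/bash
# Print the name of every given tool that cannot be found on PATH.
# missing_tools is a hypothetical helper, not part of s3sync or Watcher.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed tool list, taken from the installation steps above.
missing=$(missing_tools ruby git python crontab)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "[WARN] Not found on PATH:" $missing
else
    echo "[INFO] All required tools are installed."
fi
```

If anything is reported as missing, install it before configuring Watcher.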

Next, with the environment set up, let's make some common “assumptions”.

  • Data being dumped will be at /home/ubuntu/data/, and below that it could be 2013/04/15, for example.
  • s3sync is located at /home/ubuntu
  • Watcher repository is at /home/ubuntu
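
With these assumptions, keeping the yyyy/mm/dd hierarchy on S3 comes down to stripping the watched directory from the local path and using the remainder as the key. Here is a minimal sketch of that mapping; the function name s3_destination, the bucket name bucket-data and the file name tweets.json are placeholders for illustration.

```shell
#!/bin/bash
# Map a local file path under the watched directory to an s3cmd.rb
# destination of the form bucket:key, preserving the yyyy/mm/dd hierarchy.
# s3_destination is a hypothetical helper; "bucket-data" is a placeholder.
s3_destination() {
    local watch_path="$1" bucket="$2" file="$3" key
    key=${file#"$watch_path"}   # strip the watched-directory prefix
    key=${key#/}                # strip the leading slash
    printf '%s:%s\n' "$bucket" "$key"
}

s3_destination /home/ubuntu/data bucket-data /home/ubuntu/data/2013/04/15/tweets.json
# prints bucket-data:2013/04/15/tweets.json
```

The same two prefix-stripping expansions can be used wherever a script needs the S3 key for a local file.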

Getting our hands dirty…

  • Go to Watcher and set the directory to be watched and the corresponding action to be taken.
  • $ cd Watcher/
    Start the script,
    $ sudo python watcher.py start
    This will create a .watcher directory at /home/ubuntu
    Now,
    $ sudo python watcher.py stop
    
    Go to the .watcher directory that was created and
    set the directory to be watched and the action to be taken
    in jobs.yml, i.e. watch: and command:
    
    # Copyright (c) 2010 Greggory Hernandez
    
    # Permission is hereby granted, free of charge, to any person obtaining a copy
    # of this software and associated documentation files (the "Software"), to deal
    # in the Software without restriction, including without limitation the rights
    # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    # copies of the Software, and to permit persons to whom the Software is
    # furnished to do so, subject to the following conditions:
    
    # The above copyright notice and this permission notice shall be included in
    # all copies or substantial portions of the Software.
    
    # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
    # THE SOFTWARE.
    
    # ---------------------------END COPYRIGHT--------------------------------------
    
    # This is a sample jobs file. Yours should go in ~/.watcher/jobs.yml
    # if you run watcher.py start, this file and folder will be created
    
    job1:
      # a generic label for a job.  Currently not used make it whatever you want
      label: Watch /home/ubuntu/data for added or removed files
    
      # directory or file to watch.  Probably should be abs path.
      watch: /home/ubuntu/data
    
      # list of events to watch for.
      # supported events:
      # 'access' - File was accessed (read) (*)
      # 'atrribute_change' - Metadata changed (permissions, timestamps, extended attributes, etc.) (*)
      # 'write_close' - File opened for writing was closed (*)
      # 'nowrite_close' - File not opened for writing was closed (*)
      # 'create' - File/directory created in watched directory (*)
      # 'delete' - File/directory deleted from watched directory (*)
      # 'self_delete' - Watched file/directory was itself deleted
      # 'modify' - File was modified (*)
      # 'self_move' - Watched file/directory was itself moved
      # 'move_from' - File moved out of watched directory (*)
      # 'move_to' - File moved into watched directory (*)
      # 'open' - File was opened (*)
      # 'all' - Any of the above events are fired
      # 'move' - A combination of 'move_from' and 'move_to'
      # 'close' - A combination of 'write_close' and 'nowrite_close'
      #
      # When monitoring a directory, the events marked with an asterisk (*) above
      # can occur for files in the directory, in which case the name field in the
      # returned event data identifies the name of the file within the directory.
      events: ['create', 'move_from', 'move_to']
    
      # TODO:
      # this currently isn't implemented, but this is where support will be added for:
      # IN_DONT_FOLLOW, IN_ONESHOT, IN_ONLYDIR and IN_NO_LOOP
      # There will be further documentation on these once they are implmented
      options: []
    
      # if true, watcher will monitor directories recursively for changes
      recursive: true
      
      # the command to run. Can be any command. It's run as whatever user started watcher.
      # The following wildards may be used inside command specification:
      # $$ dollar sign
      # $watched watched filesystem path (see above)
      # $filename event-related file name
      # $tflags event flags (textually)
      # $nflags event flags (numerically)
      # $dest_file this will manage recursion better if included as the dest (especially when copying or similar)
      #     if $dest_file was left out of the command below, Watcher won't properly
      #     handle newly created directories when watching recursively. It's fine
      #     to leave out when recursive is false or you won't be creating new
      #     directories.
      # $src_path is only used in move_to and is the corresponding path from move_from
      # $src_rel_path [needs doc]
      command: sudo sh /home/ubuntu/s3sync/monitor.sh $filename
  • Create a script called monitor.sh in the s3sync directory to upload to S3, as below.
    • The variable you may want to change is the S3 bucket path, “s3path”, in monitor.sh
    • This script uploads each new incoming file detected by the watcher script using the Reduced Redundancy Storage (RRS) class. (You can remove the header if you do not want to store in RRS.)
    • The script calls the s3cmd Ruby script to upload each file under a key that mirrors its local path, and thus maintains the hierarchy, i.e. yyyy/mm/dd format with files *.*
    • It deletes each file successfully uploaded to S3 from the local path, to save disk space.
    • The script does not delete the directory; that is taken care of by another script, re-upload.sh, which acts as a backup and uploads the failed uploads to S3 again.
    • Go to the s3sync directory
      $ cd ~/s3sync
      $ sudo vim monitor.sh
      
      #!/bin/bash
      ##...........................................................##
      ## script to upload to S3BUCKET, once the change is detected ##
      ##...........................................................##
      
      
      ## AWS Credentials required for s3sync ##
      export AWS_ACCESS_KEY_ID=ABCDEFGHSGJBKHKDAKS
      export AWS_SECRET_ACCESS_KEY=jhhvftGFHVgs/bagFVAdbsga+vtpmefLOd
      export SSL_CERT_DIR=/etc/ssl/certs
      
      #echo "Running monitor.sh!"
      echo "[INFO] File or directory modified = $1"
      
      ## Read arguments
      PASSED=$1
      
      # Declare the watch path and S3 destination path
      watchPath='/home/ubuntu/data'
      s3path='bucket-data:'
      
      # Trim the watch path and the leading slash from PASSED
      out=${PASSED#"$watchPath"}
      outPath=${out#/}
      
      # Directories are never uploaded or deleted here; re-upload.sh cleans them up
      if   [ -d "${PASSED}" ]
      then echo "[SAFEMODE ON] ${PASSED} is a directory; it is not uploaded or deleted!"
           exit 0
      elif [ ! -f "${PASSED}" ]
      then echo "[ERROR] ${PASSED} is not a valid type!!"
           exit 1
      fi
      
      echo "[INFO] ${PASSED} will be uploaded to the S3PATH : $s3path$outPath"
      
      # Upload the file and capture the exit status of s3cmd right away
      ruby /home/ubuntu/s3sync/s3cmd.rb --ssl put "$s3path$outPath" "${PASSED}" x-amz-storage-class:REDUCED_REDUNDANCY #USE s3cmd : File
      RETVAL=$?
      
      if   [ $RETVAL -ne 0 ]
      then echo "[ERROR] Synchronization failed!!"
           exit $RETVAL
      fi
      
      # Delete the local copy only after a successful upload, to save disk space
      echo "[SUCCESS] Upload successful!"
      sudo rm -f "${PASSED}" && echo "[SUCCESS] Sync and Deletion successful!"
  • Create a script called re-upload.sh which will re-upload the files whose uploads failed.
    • This script ensures that the files left over from monitor.sh (failed uploads, which are rare, maybe 2-4 files/day, for various reasons) are uploaded to S3 again with the same hierarchy in RRS format.
    • After a successful upload, it deletes the file, and then the directory if it is empty.
    • Go to the s3sync directory.
      $ cd ~/s3sync
      $ sudo vim re-upload.sh
       
      #!/bin/bash
      ##.........................................................##
      ## script to detect failed uploads of other date directories
      ## and re-try ##
      ##.........................................................##
       
      ## AWS Credentials required for s3sync ##
      export AWS_ACCESS_KEY_ID=ABHJGDVABU5236DVBJD
      export AWS_SECRET_ACCESS_KEY=hgvgvjhgGYTfs/I5sdn+fsbfsgLKjs
      export SSL_CERT_DIR=/etc/ssl/certs
       
      # Get the previous day's date as yyyy/mm/dd
      datePath=$(date -d "1 day ago" +%Y/%m/%d)
       
      # Set the path of data
      basePath="/home/ubuntu/data"
      fullPath="$basePath/$datePath"
      echo "Path checked for: $fullPath"
       
      # Declare the watch path and S3 destination path
      watchPath='/home/ubuntu/data'
      s3path='bucket-data:'
       
      # Nothing to do if the previous day's directory does not exist
      if [ ! -d "$fullPath" ]; then
        echo "$fullPath does not exist"
        exit 0
      fi
       
      # check for left files (failed uploads)
      for i in "$fullPath"/*
      do
        # skip the unexpanded pattern (empty directory) and anything that is not a regular file
        [ -f "$i" ] || continue
        echo "Left over file: $i"
        out=${i#"$watchPath"}
        outPath=${out#/}
        echo "Uploading to $s3path$outPath"
        ruby /home/ubuntu/s3sync/s3cmd.rb --ssl put "$s3path$outPath" "$i" x-amz-storage-class:REDUCED_REDUNDANCY #USE s3cmd : File
        RETVAL=$?
        if [ $RETVAL -eq 0 ]; then
          echo "[SUCCESS] Upload successful!"
          sudo rm -f "$i" && echo "[SUCCESS] Deletion successful!"
        else
          echo "[ERROR] Upload failed!!"
        fi
      done
       
      # post failed uploads -- delete empty dirs
      if [ "$(ls -A "$fullPath")" ]; then
        echo "Man!! Somethingz FISHY! All (failed)uploaded files will be deleted. Are there files yet!??"
        echo "Man!! I cannot delete it then! Please go check $fullPath"
      else
        echo "$fullPath is empty after uploads"
        sudo rmdir "$fullPath" && echo "Successfully deleted $fullPath"
      fi
  • Now, the dirtiest work: logging and cleaning logs.
    • All the “echo” output created in monitor.sh can be found in ~/.watcher/watcher.log while watcher.py is running.
    • This log helps us initially, and maybe later too, to backtrack errors.
    • Call of duty: a janitor for cleaning logs. To do this, we can use cron to run a script at a set time. I chose to run it every Saturday at 8:00 AM.
    • Create a script to clean the log, named “clean_log.sh”, in /home/ubuntu/s3sync
  • Time for cron
    $ crontab -e
     
    Add the following lines at the end and save.
     
    # EVERY SATURDAY 8:00AM clean watcher log
    0 8 * * 6 sudo sh /home/ubuntu/s3sync/clean_log.sh
    # EVERYDAY at 10:00AM check failed uploads of previous day
    0 10 * * * sudo sh /home/ubuntu/s3sync/re-upload.sh
  • All set! Log cleaning happens every Saturday at 8:00 AM, and the re-upload script runs every day at 10:00 AM for the previous day, checks whether files remain and cleans up accordingly.
  • Let’s start the script
  • Go to the Watcher repository
    $ cd ~/Watcher
    $ sudo python watcher.py start
     
    This creates the ~/.watcher directory (if it does not already exist)
    and writes watcher.log in it when started.
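
The contents of clean_log.sh are not shown in this post. A minimal sketch of what such a script could look like follows, assuming the log lives at /home/ubuntu/.watcher/watcher.log, that watcher.py appends to it, and that keeping only the most recent lines is acceptable. The names LOG_FILE, KEEP_LINES and trim_log are made up for illustration.

```shell
#!/bin/bash
# Hypothetical clean_log.sh: keep only the most recent lines of the Watcher log.
# LOG_FILE and KEEP_LINES are assumptions; adjust them to your setup.
LOG_FILE=${LOG_FILE:-/home/ubuntu/.watcher/watcher.log}
KEEP_LINES=${KEEP_LINES:-1000}

trim_log() {
    local file="$1" keep="$2" tmp
    if [ ! -f "$file" ]; then
        echo "[INFO] $file not found, nothing to clean"
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Write the tail back into the same file instead of replacing the file,
    # so a running watcher.py keeps appending to the file it already has open.
    if tail -n "$keep" "$file" > "$tmp" && cat "$tmp" > "$file"; then
        echo "[INFO] Trimmed $file to the last $keep lines"
        rm -f "$tmp"
    else
        echo "[ERROR] Could not trim $file"
        rm -f "$tmp"
        return 1
    fi
}

trim_log "$LOG_FILE" "$KEEP_LINES"
```

Saved as /home/ubuntu/s3sync/clean_log.sh, this would be picked up by the Saturday cron entry above. If you would rather keep the old logs, a tool such as logrotate may be a better fit.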

So, this setup makes sure that uploads to S3 succeed, with failed uploads retried the next day.
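
The setup above retries a failed upload once, on the following day. If you would like a few immediate retries as well, a small wrapper function is one option. This is an optional addition and not part of the scripts above; the function name retry, the attempt count and the delay are assumptions.

```shell
#!/bin/bash
# Run a command up to MAX_ATTEMPTS times, sleeping DELAY_SECONDS between tries.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# retry is a hypothetical helper, not part of s3sync or Watcher.
retry() {
    local max_attempts="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "[ERROR] Giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "[WARN] Attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Example (commented out): wrap the upload call from monitor.sh
# retry 3 30 ruby /home/ubuntu/s3sync/s3cmd.rb --ssl put "$s3path$outPath" "$PASSED" x-amz-storage-class:REDUCED_REDUNDANCY
```

The function returns 0 as soon as the command succeeds and 1 once the attempts are used up, so the existing success and failure handling in monitor.sh could stay as it is.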
 

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.