How To: VirtualBox Shared Folders

When using virtual machines, you will likely want to set up a mapping from a local folder on your computer to a folder on the virtual machine. This is a good way to move files from your machine onto the VM and vice versa. Here are the steps to set that up with VirtualBox using a CentOS image (in this case, the Cloudera Sandbox VM).

  1. From the VM, go to the VirtualBox menu and choose Devices -> Insert Guest Additions CD Image…
  2. If the CD image does not start automatically, select the drive from the file browser and run “autorun.sh”. This will install the add-ons needed.
  3. Then go to Devices -> Shared Folders -> Shared Folders Settings and set up your folder. For this example we’ll use a local folder called “installs”.
  4. Restart the virtual machine.
  5. You can now find your folder under /media/sf_<foldername>, and you’ll probably need elevated permissions. For my example, the command “sudo ls -l /media/sf_installs” lists the files, and “sudo cp /media/sf_installs/<filename> ~/” copies a file to a folder local to the VM.
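
If you would rather script the copy in step 5 than type the commands each time, here is a minimal Python sketch. It assumes the shared folder is mounted at /media/sf_<foldername> as described above. The function names (shared_folder_path, copy_from_share) and the mount_root parameter are made up for this illustration and are not part of VirtualBox.

```python
import os
import shutil

# Assumption: shared folders appear under /media with an "sf_" prefix,
# as described in step 5.
MOUNT_ROOT = "/media"


def shared_folder_path(folder_name, mount_root=MOUNT_ROOT):
    """Build the guest-side path for a shared folder (hypothetical helper)."""
    return os.path.join(mount_root, "sf_" + folder_name)


def copy_from_share(folder_name, file_name, dest_dir=None, mount_root=MOUNT_ROOT):
    """Copy one file out of the shared folder; defaults to the home directory."""
    source = os.path.join(shared_folder_path(folder_name, mount_root), file_name)
    if dest_dir is None:
        dest_dir = os.path.expanduser("~")
    target = os.path.join(dest_dir, file_name)
    shutil.copy2(source, target)  # copies the file contents and timestamps
    return target


if __name__ == "__main__":
    share = shared_folder_path("installs")
    try:
        # List the shared folder, like "ls /media/sf_installs"
        for name in sorted(os.listdir(share)):
            print(name)
    except OSError as err:
        # Raised when the folder is missing or the user lacks permission
        print("Could not read %s: %s" % (share, err))
```

The script does not elevate its own permissions, so if reading the folder fails you would probably need to run it with sudo, the same as the commands in step 5.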

Bonus info: Once Guest Additions are installed, you can also set up clipboard sharing to copy and paste between your machine and the VM. To do this, go to Devices -> Shared Clipboard and choose your option (such as Bidirectional).



SQL on Hadoop: Getting Started

This is my presentation from SoCal Code Camp – San Diego. Hopefully the slides are helpful. The commands probably won’t copy and paste perfectly into the terminal, but please reach out with any questions.

Here is plain text of the commands I used:

Data Development Environment – Windows

A first step to developing with modern technologies such as Big Data systems and NoSQL systems is getting your development environment set up. I like to have many of the tools available locally on my laptop so I can feel free to experiment without breaking a shared server or running up a large bill on the cloud platform used to host the machine. Originally I started on a Windows machine and, being somewhat new to these technologies, tried a few paths that didn’t go very well. Here is some guidance if you are trying to get going with Python and Hadoop or other open source data platforms using a Windows laptop.

Python 2.7

Python is a very popular programming language for processing data, and much of the ETL we write uses Python for flexibility (compared to SSIS, which relies heavily on knowing the data model). It is simpler than Java and much easier to read, so if maintainability is important (which it should be) then it is a great option. You can use Python 2 or Python 3, but some third-party modules are not compatible with Python 3. Python can be installed directly on Windows, Mac, or Linux. I prefer using it with Linux or Mac because of the other command line features and the popularity in the developer community.

Sublime Text 2

This is the text editor I use to write Python modules and edit JSON files, as well as any other type of text file. This is what others on our data team use, and when I saw it I was impressed with how easy it makes reading the code. It is not an IDE; it is an awesome text editor. There are other options for text editors (gedit, emacs, vim), but Sublime Text 2 works well. If you want an IDE instead, PyCharm is one that was recommended in the Python Fundamentals course on Pluralsight.

Oracle VM VirtualBox

After trying to get things working properly with Cygwin Terminal and setting up a Linux VM with VMware Player, I saw VirtualBox was a highly ranked virtual machine option and had the easiest time setting it up. One big requirement I had was to be able to copy and paste from my local machine to the virtual machine, and I had trouble getting that capability set up on my VMware machine.

Linux CentOS 7

Linux works well for developing and running Python code, plus you can install many open source projects on it such as Apache Hadoop. I chose CentOS because of its similarity to Red Hat, which is supported by most databases and open source projects. I found many examples for installing Python modules and client libraries on Linux, as well as plenty of information on installing Hadoop as a single node instance. I did not face as many barriers as Cygwin presented, so once I made the jump to Linux I was finally able to focus on the programming instead of the system setup.
So try out this setup, check out my Resources page for ideas of what to learn once the environment is ready, and hit me up with questions if you get stuck.