It took the weekend to initialize the cluster. After deciding to go from 500 groups of folders to 1000, I ended up having to break the process up into 3 jobs (for reasons I won’t get into right now). Here is the top level of the cluster as it sits today.
As mentioned previously, the structure is /user/cfleshner/folder/block, where folder is folderx (x = 1 to 1000) and block is blockx (x = 1 to 25), for a total allocation of 2.5TB.
As presumed, each block is a single 1GB file. At this point the contents of each block file are random (or “sparse”) data, which has me a bit worried about how I will organize things in the future, especially given how slowly I/O requests (via hadoop) are serviced in the cluster when using the native hdfs functions embedded in the script listed below.
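To make the naming scheme concrete, here is a minimal Python sketch (not the PHP script from this post) that enumerates the paths in that layout. The function name and its parameters are hypothetical; only the path pattern and the 1 to 1000 and 1 to 25 ranges come from the description above.

```python
def block_paths(user="cfleshner", folders=1000, blocks_per_folder=25):
    """Yield every path in the /user/<user>/folderX/blockY layout."""
    for f in range(1, folders + 1):
        for b in range(1, blocks_per_folder + 1):
            yield f"/user/{user}/folder{f}/block{b}"


paths = list(block_paths())
print(paths[0])   # first path in the layout
print(paths[-1])  # last path in the layout
```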
[ More ideas on this later ]
To avert potential problems and to assist in development, I want to work with clean platters of data (5TB over 6 drives, including dual replication), so I will be running an additional job today. This job will write binary zeroes over each and every block file in the cluster.
Here is a link to the currently allocated data nodes. Keep in mind that each Linux box allocated to the cluster (the dataNodes) has a 2TB hard drive, so we are stopping allocations for the test environment at about 50% of the resources available on each machine.
Since I am most experienced with PHP, I wrote the script that allocates the cluster and initializes the blocks in PHP. It integrates command scripts (run with the ksh interpreter) to optimize the checks for top-level folders and to run jobs asynchronously.
I create the Linux jobs using a template method: the template is actually a string with Linux code in it, and the script subs in the appropriate folder and block names. Since my only interface to the cluster at this juncture is hdfs, I needed a way to automate creating the folder structure, and my script below fit the bill nicely.
As mentioned in a previous post, my servers handled 25 concurrent hdfs put requests without much trouble. If you decide to allocate a similar data structure and use my script, this will depend on the hardware you use and on what else is going on within the server (especially when checking for existing running jobs)… You may need to tweak it a bit.
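As a rough check on that "about 50%" figure, the arithmetic can be sketched in Python. This assumes a replication factor of 2 (my reading of "dual replication") and six data nodes with one 2TB drive each; the variable names are just for illustration.

```python
logical_tb = 2.5          # total allocation stated in the post
replication_factor = 2    # assumed from "dual replication"
data_nodes = 6            # assumed: one 2TB drive per node, 6 drives
drive_tb = 2.0

raw_tb = logical_tb * replication_factor   # space consumed across the cluster
capacity_tb = data_nodes * drive_tb        # total raw disk in the cluster
fraction_used = raw_tb / capacity_tb

print(f"{raw_tb:.1f} TB of {capacity_tb:.1f} TB ({fraction_used:.0%})")
```

Under those assumptions the test data occupies a bit under half of the raw disk, which is in line with stopping at roughly 50%.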
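The template idea looks roughly like this in Python. This is a sketch of the technique, not the PHP script itself: the template text, the `render_job` helper, and the choice of `hdfs dfs -mkdir -p` and `hdfs dfs -put -f` as the commands are my assumptions for illustration.

```python
from string import Template

# Hypothetical job template: a string of shell code with placeholders
# for the folder and block names.
JOB_TEMPLATE = Template(
    "hdfs dfs -mkdir -p /user/$user/$folder && "
    "hdfs dfs -put -f $source /user/$user/$folder/$block"
)


def render_job(folder, block, source="1Gb.dat", user="cfleshner"):
    """Substitute the folder and block names into the job template."""
    return JOB_TEMPLATE.substitute(
        user=user, folder=folder, block=block, source=source
    )


print(render_job("folder1", "block1"))
```

Each rendered string is a complete shell command, so it can be written out as a job file or handed straight to an interpreter.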
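One simple way to cap the number of jobs in flight is a fixed-size worker pool. The Python sketch below is an assumption about how that could be done, not how my PHP script does it (which checks for existing running jobs); `run_job`, `run_throttled`, and `MAX_CONCURRENT_PUTS` are hypothetical names, and the limit of 25 is simply the figure quoted above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_PUTS = 25  # figure from the post; tune for your hardware


def run_job(command):
    """Run one shell command and return its exit code."""
    return subprocess.run(command, shell=True).returncode


def run_throttled(commands, runner=run_job, limit=MAX_CONCURRENT_PUTS):
    """Run commands with at most `limit` in flight; return exit codes in order."""
    with ThreadPoolExecutor(max_workers=limit) as pool:
        return list(pool.map(runner, commands))
```

Calling `run_throttled(list_of_put_commands)` returns one exit code per command, so failed puts can be retried afterwards. Lower `limit` if the servers start to struggle.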
The entire script is listed here.
In summary: next, I will initialize all 1000 blocks of cluster data to binary zeroes, replacing the sparse data from 1Gb.dat. The new file, binary_zeros.txt (verified with the HxD editor), will be substituted in. Because the folders do not need to be created a second time, I’ll comment out the portions of the code that generate and check for folders and sub folders, keeping the asynchronous benefits intact.
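For anyone who wants to produce a similar zero-filled file without a hex editor, here is a minimal Python sketch. It is not how binary_zeros.txt was actually made; the helper names are hypothetical, and taking 1GB to mean 1024**3 bytes is an assumption.

```python
def write_zero_file(path, size_bytes, chunk_size=1024 * 1024):
    """Write size_bytes of binary zeroes to path, one chunk at a time."""
    chunk = bytes(chunk_size)  # bytes(n) is n zero bytes
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(chunk_size, remaining)
            f.write(chunk[:n])
            remaining -= n


def is_all_zeros(path, chunk_size=1024 * 1024):
    """Return True if every byte in the file is zero."""
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                return True
            if data.count(0) != len(data):
                return False
```

A call such as `write_zero_file("binary_zeros.txt", 1024**3)` followed by `is_all_zeros("binary_zeros.txt")` would create the file and confirm its contents before it is pushed to the cluster.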
I’ll be back tomorrow to describe how that went.