So Hadoop was a bit of a learning curve to set up. I used the linode.com tutorial on setting up a 3-node cluster, and it was kind of a pain in the *** to get going.
People on the web warned that it might be better just to use an Oracle VirtualBox VM that comes with Hadoop pre-installed, but I decided I wanted to know what was going on with it, so I did a native install on my laptop, which runs Linux. The laptop has 6GB of RAM, which is adequate for it to function as the NameNode.
The laptop (master node) is an older AMD machine, connected to 6 additional Linux servers, all running Ubuntu Server 16.0x.
After I finally got start-dfs.sh to run with the DataNodes, things went much faster.
The trick, though, was writing the script that would allocate the necessary KWB file system (on Hadoop).
I wanted to stress-test the cluster to see how many concurrent jobs could run while writing to Hadoop. It turned out the magic number for my hardware was 25 jobs.
The idea was to run no more than 25 jobs at a time (each writing a 1GB file). The tricky part was making sure Linux wasn't overrun with too many jobs. Initially I was greedy and thought 1000 concurrent jobs might be fine on the master node, but I quickly found out that was far too many: Hadoop came to a screeching halt as Java choked and corrupted my cluster.
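The throttling itself is simple enough in bash. Here's a sketch of the approach; the writer function is a stand-in, since the real job generates a 1GB file and pushes it into Hadoop:

```shell
#!/usr/bin/env bash
# Cap the number of concurrent writer jobs. MAX_JOBS=25 was the sweet
# spot on my hardware. write_one_file is a placeholder for the real
# job that writes a 1GB file into HDFS.
MAX_JOBS=25
TRACK="$(mktemp)"
write_one_file() {
  sleep 0.05               # stand-in for the real 1GB write
  echo "$1" >> "$TRACK"    # record completion so finished jobs can be counted
}
for i in $(seq 1 100); do
  # If MAX_JOBS children are already running, wait for any one to exit
  while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
    wait -n
  done
  write_one_file "$i" &
done
wait    # let the last batch finish
echo "completed $(wc -l < "$TRACK") jobs"
```

Note that `wait -n` needs bash 4.3 or newer, which the Ubuntu 16.0x boxes have.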
So I needed to restore the cluster to a working condition, and this was not an easy task. After Googling around, I discovered that you can delete the folders containing the data (the NameNode and DataNode areas). In continuing with this, I found that it is best to also delete the log files on the NameNode and the DataNodes (on all 7 machines, in my case). So, using a for loop on the command line, with keyless SSH already in place, I was able to write a single script that resets the cluster.
I had tried renaming the top-level data folder (where the name/data nodes live), but that left stale folders lying around, not only on the master node but on the DataNodes too. Once I addressed the log file issue (clearing them), everything worked fine; that is to say, I could get Hadoop up again, but I lost all of the data. That was no problem, because right now I'm thinking more about optimizing fixed-length cluster segments than worrying about data recovery.
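The reset script ended up looking roughly like the sketch below. The hostnames and paths are placeholders for my own layout, and by default it only previews the commands (DRY_RUN=1), since one typo here wipes the whole cluster:

```shell
#!/usr/bin/env bash
# Reset the cluster: wipe the HDFS data areas and the logs on the master
# and every datanode, then reformat. Hostnames and paths below are
# placeholders for my layout; DRY_RUN=1 (the default here) just prints
# each command instead of running it.
NODES="node1 node2 node3 node4 node5 node6"   # the 6 datanodes
DATA_DIR="/usr/local/hadoop/hadoop_data"      # nameNode/dataNode folders
LOG_DIR="/usr/local/hadoop/logs"
DRY_RUN="${DRY_RUN:-1}"
RUN_COUNT=0
run() {
  RUN_COUNT=$((RUN_COUNT + 1))
  if [ "$DRY_RUN" = "1" ]; then echo "would run: $*"; else "$@"; fi
}
run rm -rf "$DATA_DIR" "$LOG_DIR"              # master node first
for node in $NODES; do                         # then each datanode,
  run ssh "$node" "rm -rf $DATA_DIR $LOG_DIR"  # over keyless SSH
done
run hdfs namenode -format -force               # reformat before restarting
echo "issued $RUN_COUNT commands"
```

Running it with DRY_RUN=0 does the actual wipe; after that, start-dfs.sh brings the cluster back clean.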
Hadoop has been up for a day or so now, and I'm 25% of the way toward the goal of 5TB of addressable storage for my web application. As it continues to initialize, I've discovered that the hdfs commands are really slow; however, I'm hoping that the 1GB sweet spot will work nicely with the KWB application as I bundle up URLs.
The idea will be to accumulate them on the main server until a 1GB section is full of URLs, and THEN port them over to Hadoop.
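A sketch of that accumulate-then-port step is below. The file names and the HDFS target path are placeholders, and HDFS_PUT can be overridden so the logic can be tried without touching the cluster:

```shell
#!/usr/bin/env bash
# Accumulate URLs locally until the section hits 1GB, then push the full
# file to HDFS and start a fresh one. File names and the HDFS path are
# placeholders for KWB's real layout; HDFS_PUT is overridable so this
# can be exercised without a live cluster.
LIMIT=$((1024 * 1024 * 1024))            # the 1GB sweet spot
SECTION="urls_current.csv"
HDFS_PUT="${HDFS_PUT:-hdfs dfs -put}"    # the real command on the cluster
append_url() {
  echo "$1" >> "$SECTION"
  # Once the section is full, ship it to Hadoop and roll a new file
  if [ "$(stat -c %s "$SECTION")" -ge "$LIMIT" ]; then
    stamp="urls_$(date +%s).csv"
    mv "$SECTION" "$stamp"
    $HDFS_PUT "$stamp" "/kwb/urls/$stamp"
    : > "$SECTION"                       # start the next 1GB section
  fi
}
```

The KWB code would call something like append_url for each incoming URL, and the 1GB boundary takes care of itself.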
The replication factor for the cluster was set to 2, so my storage requirement is 5TB x 2 (10TB), which is my hardware limitation at this juncture.
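For reference, that replication factor is the standard dfs.replication property in hdfs-site.xml; mine looks like this fragment:

```xml
<!-- hdfs-site.xml: every block is stored on 2 datanodes, which is -->
<!-- why 5TB of usable space costs 10TB of raw disk across the cluster -->
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```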
[post the math for how many urls I’ll be able to include, conservatively]
So what I'm doing is rolling my own sort of file system out of a series of folders allocated on Hadoop.
I'll be using the command-line HDFS tools to write to the filesystem, and interfacing with the data via PHP, since KWB is written entirely in PHP. The precedent for this was the allocation of the Hadoop cluster, which I coded in PHP.
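The allocation itself boils down to a series of hdfs dfs -mkdir calls. Here's a sketch of the kind of folder carving I mean, shown in preview mode; the section names are illustrative, not KWB's real layout:

```shell
#!/usr/bin/env bash
# Allocate the KWB "file system": a fixed series of section folders on
# HDFS. Names are illustrative; MKDIR defaults to a preview (echo) so
# the script can be run without a live cluster. Drop the "echo" to
# actually create the folders.
MKDIR="${MKDIR:-echo hdfs dfs -mkdir -p}"
ROOT="/kwb"
ALLOCATED=0
for section in $(seq 1 5); do
  $MKDIR "$ROOT/urls/section_$section"   # 1GB url sections live here
  $MKDIR "$ROOT/meta/section_$section"   # per-user META lives here
  ALLOCATED=$((ALLOCATED + 2))
done
echo "allocated $ALLOCATED folders under $ROOT"
```

PHP can drive the same thing through exec() or shell_exec(), which is how the cluster allocation code I already wrote works.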
The trick will be to figure out the best place in the existing KWB application code to write the full CSV of URLs.
Further analysis of the data construct will be necessary as I continue with development; suffice it to say that the construct will contain two discrete areas: the full URL of the resource, and the new <META> info that will be associated with the URL and the end user.
Since multiple users can potentially reference the same URL, a pointer system is best employed here, with the pointers referring to unique META constructs for each end user.
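One way to sketch that pointer layout, modeled here on a local temp dir rather than HDFS: the URL is stored once under a hash of itself (the hashing is my assumption, not settled design), and each referencing user gets a small pointer file pairing that hash with the user's own META.

```shell
#!/usr/bin/env bash
# Sketch of the pointer scheme on a local temp dir instead of HDFS:
# each unique url is stored once under its hash, and every user who
# references it gets a pointer file pairing that hash with the user's
# own META. All names here are illustrative, not KWB's real ones.
ROOT="$(mktemp -d)"
mkdir -p "$ROOT/urls" "$ROOT/meta"
add_reference() {  # usage: add_reference <user> <url> <meta>
  local user="$1" url="$2" meta="$3" hash
  hash="$(printf '%s' "$url" | sha256sum | cut -c1-16)"
  # store the url itself only once, however many users reference it
  [ -f "$ROOT/urls/$hash" ] || printf '%s\n' "$url" > "$ROOT/urls/$hash"
  # the per-user pointer: the shared url's hash plus this user's META
  mkdir -p "$ROOT/meta/$user"
  printf '%s\t%s\n' "$hash" "$meta" > "$ROOT/meta/$user/$hash"
}
add_reference alice "http://example.com/page" "tag=news"
add_reference bob   "http://example.com/page" "tag=research"
echo "stored $(ls "$ROOT/urls" | wc -l) url(s) for $(ls "$ROOT/meta" | wc -l) users"
```

Two users referencing the same URL end up sharing one stored copy of it, with two separate META pointer files, which is exactly the dedup the pointer system is after.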