Cluster Setup Lessons Learned

So Hadoop was a bit of a learning curve to set up. I used the linode.com tutorial on setting up a 3-node cluster, and it was kind of a pain in the *** to get working.

People on the web warned that it might be better just to use a VM (on Oracle VirtualBox) with Hadoop pre-installed, but I decided I wanted to know what was going on under the hood, so I did a native install on my laptop, which runs Linux. The laptop has 6GB of RAM, so it is adequate to function as the NameNode.

The laptop (the master node) is an older AMD machine, connected to six additional Linux servers, all running the 16.0x release of Ubuntu Server.

After I was finally able to get start-dfs.sh running with the DataNodes, things went much faster.

The trick, though, was writing the script that would allocate the necessary KWB file system on Hadoop.

I wanted to stress-test the cluster to see how many concurrent jobs could run while writing to Hadoop. It turned out the magic number for my hardware was 25 jobs.

The idea was to run no more than 25 jobs at a time (each writing one of the 1GB files). The tricky part was making sure that Linux wasn't overrun with too many jobs. Initially I was greedy and thought 1,000 concurrent jobs might be fine on the master node, but I quickly found out that was far too many: Hadoop came to a screeching halt as Java choked, and my cluster ended up corrupted.
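
For what it's worth, the throttling boils down to something like the sketch below, written in PHP since that is what the rest of KWB uses. The template file, the HDFS paths, and the batch handling are placeholder assumptions, not my exact stress script:

```php
<?php
// Sketch: write a pile of 1GB files into HDFS, never running more than
// $maxJobs puts at once, so the master node is not overrun.
$total   = 200;   // how many 1GB blocks to write in this run (placeholder)
$maxJobs = 25;    // the "magic number" my hardware tolerated
$running = [];

shell_exec('hdfs dfs -mkdir -p /user/cfleshner/stress');

for ($i = 0; $i < $total; $i++) {
    $cmd = sprintf(
        'hdfs dfs -put -f /tmp/one_gig_template.dat /user/cfleshner/stress/block%04d',
        $i
    );
    $running[] = proc_open($cmd, [], $pipes);   // launch the put in the background

    // Once the batch is full, wait for every job in it to finish
    // before launching any more.
    if (count($running) >= $maxJobs) {
        foreach ($running as $proc) {
            proc_close($proc);                  // proc_close() blocks until the job exits
        }
        $running = [];
    }
}
foreach ($running as $proc) {
    proc_close($proc);                          // drain the final partial batch
}
```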

So I needed to restore the cluster to working condition, and this was not an easy task. After Googling around I discovered that you can delete the folders containing the HDFS data (the NameNode and DataNode storage areas). In doing so, I also found it is best to delete the log files on the NameNode and the DataNodes (all 7 machines, in my case). With passwordless SSH in place, a for loop on the command line let me write a single script that resets the whole cluster.
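
The reset itself is just a loop over the machines. Here is a rough sketch of the idea in PHP (a plain shell for loop does the same job); the host names and the data/log directories are assumptions standing in for my actual layout:

```php
<?php
// Sketch: wipe the HDFS storage and log directories on every node, then
// re-format the NameNode. Requires passwordless SSH from the master.
// WARNING: this throws away all data in the cluster.
$nodes   = ['master-node', 'node1', 'node2', 'node3', 'node4', 'node5', 'node6'];
$dataDir = '/usr/local/hadoop/hadoop_data';   // assumed name/data node storage parent
$logDir  = '/usr/local/hadoop/logs';          // assumed HADOOP_LOG_DIR

shell_exec('stop-yarn.sh; stop-dfs.sh');       // bring everything down first

foreach ($nodes as $node) {
    // Clear the NameNode/DataNode storage and the logs on each machine.
    shell_exec("ssh $node 'rm -rf $dataDir/* $logDir/*'");
}

shell_exec('hdfs namenode -format -force');    // rebuild an empty namespace
shell_exec('start-dfs.sh; start-yarn.sh');     // and bring the cluster back up
```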

I had first tried just renaming the top-level data folder (where the name/data node directories lived), but that left stale folders lying around, not only on the master node but on the DataNodes too. Once I also addressed the log file issue (clearing them), everything worked fine; that is to say, I could get Hadoop up again, though I lost all of the data. That was no problem, because right now I'm thinking more about optimizing fixed-length cluster segments than about data recovery.

Hadoop has been up for a day or so now, and I'm about 25% of the way toward the goal of 5TB of addressable storage for my web application. As it continues to initialize, I've discovered that the hdfs commands are really slow; still, I'm hoping the 1GB sweet spot will work nicely with the KWB application as I bundle up URLs.

The idea will be to accumulate them on the main server until a 1GB section is full of URLs, and THEN port them over to Hadoop.
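
Roughly, that logic looks like the sketch below (in PHP; the staging path, helper names, and block naming are hypothetical placeholders, not the final KWB code):

```php
<?php
// Sketch: append each harvested URL to a local staging CSV, and once the
// file reaches roughly 1GB, push it into the next block on HDFS.
const STAGING_FILE = '/var/kwb/staging/urls.csv';   // assumed local staging area
const ONE_GB       = 1073741824;

function kwb_stage_url(string $url, string $meta): void
{
    // One CSV row per harvested URL plus its META reference.
    $fh = fopen(STAGING_FILE, 'a');
    fputcsv($fh, [$url, $meta]);
    fclose($fh);

    clearstatcache(true, STAGING_FILE);
    if (filesize(STAGING_FILE) >= ONE_GB) {
        kwb_flush_to_hdfs();
    }
}

function kwb_flush_to_hdfs(): void
{
    // The block name below is just a timestamp; a real allocator would walk
    // the numbered folder/block layout described in the earlier post.
    shell_exec('hdfs dfs -mkdir -p /user/cfleshner/folder_current');
    $target = '/user/cfleshner/folder_current/block_' . time();
    shell_exec('hdfs dfs -put ' . escapeshellarg(STAGING_FILE) . ' ' . escapeshellarg($target));
    unlink(STAGING_FILE);   // start the next 1GB section empty
}
```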

The replication factor for the cluster is set at 2, so my storage requirement is 5TB x 2 (10TB of raw disk), which is my hardware limit at this juncture.

[post the math for how many urls I’ll be able to include, conservatively]

So what I'm doing is rolling my own sort of file system out of a series of folders allocated on Hadoop.

I'll be using the command-line HDFS client to write to the filesystem, and interfacing with the data via PHP, since KWB is written entirely in PHP. The precedent for this was the allocation of the Hadoop cluster, which I also coded in PHP.
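
In practice that interface is little more than a thin wrapper around the hdfs client, something like the sketch below; the wrapper name hdfs_cmd() is my own shorthand for illustration, not part of Hadoop or the existing KWB code:

```php
<?php
// Sketch: a thin PHP wrapper around the command-line hdfs client.
function hdfs_cmd(string $args): string
{
    // Everything goes through the same binary the shell uses.
    return (string) shell_exec('hdfs dfs ' . $args . ' 2>&1');
}

// The kinds of calls the application layer would make:
echo hdfs_cmd('-ls /user/cfleshner');              // list the allocated folders
echo hdfs_cmd('-du -h /user/cfleshner/folder1');   // check how full a folder is
hdfs_cmd('-mkdir -p /user/cfleshner/folder2');     // allocate a new folder
```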

The trick will be figuring out the best place within the existing KWB application code to write the full CSV of URLs.

Further analysis of the data construct will be necessary as I continue development; suffice it to say that the construct will contain two discrete areas: the full URL of the resource, and the new <META> info that will be associated with the URL and the end user.

Since each end user can potentially reference the same URL, a pointer system is best employed here, with the pointers referring to unique META constructs for each end user.
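
One way to picture the pointer idea (the array layout below is illustrative only, not the actual KWB schema): the URL is stored once, and each end user's entry pairs a pointer to that URL with that user's own META construct.

```php
<?php
// Sketch: one shared URL table, with per-user entries that point into it.
$urls = [
    101 => 'https://example.org/articles/deep-learning-intro.pdf',
    102 => 'https://example.org/images/neuron.png',
];

$userMeta = [
    // user => list of (pointer to a url, that user's own META)
    'alice' => [
        ['url_id' => 101, 'meta' => ['keywords' => ['deep learning', 'tutorial']]],
    ],
    'bob'   => [
        // Bob references the same resource, but with his own META.
        ['url_id' => 101, 'meta' => ['keywords' => ['PDF', 'course notes']]],
    ],
];

// Resolving a user's list back to full URLs:
foreach ($userMeta['alice'] as $entry) {
    echo $urls[$entry['url_id']] . ' => ' . implode(', ', $entry['meta']['keywords']) . PHP_EOL;
}
```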

Onward – 5TB

So, rather than expounding on history, I've decided to dive right into what I've been working on recently: Hadoop.

It was a bit of a learning curve to install on my COTS hardware, but in the end I'm happy to have pulled off the install. As we speak, I am initializing Hadoop for use with my Keyword Buttons application.

It has been six hours so far, writing out folder after folder of 1GB sections of data. This is the big data area that I will be writing to.

These servers are old hardware with a 2TB hard drive per Linux box, so it isn't costing me an arm and a leg to allocate a large amount of storage. Hadoop is known to handle large files much better than small ones, so even though 1GB isn't that large by today's standards, I decided on that size.

I set the replication factor to two, since I want to make the most of the 10TB I have available; that effectively gives me RAID 1 on 5TB of usable storage.

I won't have to use the AWS APIs; instead I'll use native Hadoop functions to access the data. The goal is to integrate Keyword Buttons I/O with Hadoop, so as to be able to write "Big Data".

My goal, then, is to optimize the Hadoop cluster for use with Keyword Buttons, hereafter KWB.

The cluster system I am laying out has the format /user/cfleshner/folder/block, where folder is a numbered folder and block is a numbered block. Each folder contains 25 blocks, and each block is a single 1GB file.
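
Pre-allocating that layout from PHP looks roughly like the sketch below; the folder count and the local 1GB template file are placeholders, and only the path format matches the description above:

```php
<?php
// Sketch: lay out /user/cfleshner/folderN/blockM on HDFS, 25 blocks per
// folder, each block seeded from an assumed local 1GB template file.
$folders      = 10;                            // placeholder folder count
$blocksPerDir = 25;
$template     = '/tmp/one_gig_template.dat';   // assumed pre-built 1GB file

for ($f = 1; $f <= $folders; $f++) {
    $dir = "/user/cfleshner/folder$f";
    shell_exec('hdfs dfs -mkdir -p ' . escapeshellarg($dir));

    for ($b = 1; $b <= $blocksPerDir; $b++) {
        // One 1GB block file per slot.
        shell_exec('hdfs dfs -put ' . escapeshellarg($template) . ' ' .
                   escapeshellarg("$dir/block$b"));
    }
}
```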

The idea is to use the entire cluster (5TB) as a bucket for the URLs, which will be appended with metadata. It is cheaper to do it this way than to use AWS or other cloud-based services.

Introduction

My name is Chris L. Fleshner and I live in Omaha, Nebraska. I am 56 years old at the time of this writing. I am “enamored” with technology, and for good reason. It’s cool!

This blog is devoted to explaining the concept of "Keyword Buttons", a project I have been working on as an employee, a hobbyist, an innovator, a Patent Draftsman, and the author of an abandoned patent application.

Interestingly, the actual birth of this project only occurred to me in retrospect, yesterday, when I decided I'd write this blog.

It all began back in 1985 or so, and until now I hadn't seen the relationship between the work I did in this area at the Principal Financial Group, formerly The Bankers Life, in Des Moines, Iowa, and what has now become my primary interest. Talk about beating a dead horse!

But the fact is, I’ve enjoyed the ride and want to share my story nevertheless, because the horse isn’t dead yet…

So saunter down and saddle up, as I describe the systems I’ve developed using modern era tools, and my legacy programming experience.

In particular, I'll describe my experience coding a conventional (non-OOP) web application, written primarily in PHP and running on an in-house Apache web server hosted at home. Most recently, that includes my new Hadoop server cluster with 10TB of available storage, where I intend to store end-user-initiated URLs harvested from the web (using Keyword Buttons) and encode them with metadata key references, which is incidentally critical to understanding the premise of Keyword Buttons itself.

Ultimately, the data harvested will be useful to those researching topic areas where the end user (the one using the metadata) can make a cognitive connection between the intended scope at the resource layer and the keywords (and associated topic), in order to discover something interesting and useful. More on this later.

The connection between 1985 and the present will become more apparent as I wander down memory lane, talking about my life experiences programming in COBOL on MVS/XA, making use of in-house 370 assembly language utilities to allocate huge amounts of RAM (back then, around 2GB) in order to "key", or identify, macro tables; basically, to automate a job that was previously done manually. I may even touch on my experience with Atari BASIC back in the 1970s.

Nearly four decades of technological experience will be discussed in this blog, and I’m truly looking forward to the experience of sharing it with the public.

I enjoy sharing technical information that can help the readers of this blog discover and learn more, not only about current open-source technologies and other tools, but also about lessons learned with "old school" hardware and software that are still relevant today.

This blog will also serve as a personal memoir of notable technology-related experiences spanning several decades, from when I first touched the keyboard of a TRS-80 at the mall in Sioux City, Iowa, to my first PC running Windows 95, through today, using my bare-metal, non-cloud, 6-node, 12TB cluster running Ubuntu Server (you can see Hadoop running here).

And perhaps most importantly, this blog will work toward describing the present-day tech I've used while developing the Keyword Buttons system layers: things that went well, and things that cost a lot of time without much payoff.

I like to think my situation is unique, because my end-game isn’t defined. There is a Japanese meme for that, but I’ll have to google it and plug it in later here.

By leaving the definition of what counts as useful data up to the end user, the libraries of associated content will be unique for each end user (e.g., some may use images for deep learning, some may use PDF files for analyzing articles, and so on).

Which META data gets applied and generated within a particular discipline or area of research will depend on what the end user, as the expert, determines is adequate to answer their particular questions. That is to say, their input data will depend on what they are trying to determine, and Keyword Buttons can play a role in assembling the lists of data used in those processes.

I encourage you to contact me and ask questions, or otherwise engage me and the others who participate on this blog, because I am all about sharing knowledge, not hiding or hoarding it.

Thank you for your interest in Keyword Buttons, and please read on as I tell my story.

To get started, please watch this 8-minute video. It is a very simple introduction that demonstrates lists of public content URLs being assembled using a list of keywords. Enjoy!

Chris L. Fleshner, Developer