Rethinking the Cluster Config and File System operations…
So after a bit of thought and experimentation, I have decided on 100 folders, each containing 25 files. That is 2.5TB (with 2 replication factor).
Each file in the cluster is 1Gib, and is initialized with actual data (in particular binary zeros from dev/zero from Linux).
The idea is to use silver search (Ag), as an intermediary, to the actual locations of the data on the dataNode platters. By having allocated the files using actual data (as opposed to sparse), I will be able to examine the disk (outside of the hadoop environment – within Linux), and from there determine if it is possible to populate the data areas with data, independently from hadoops fs functions (which are slow) from my Keyword Buttons Application.
This is highly experimental, and likely I will drop the idea of using an alternative to writing to the hadoop cluster in this way, but I did want to at least try the idea out, to see how it would perform.
Right now (8:04am), the cluster is being re-formatted to fit the above criteria. Having previously created 1000 folders, and 25 files within those, it took a few commands to clear out those folders, (-SkipTrash) and also to re-write the first 100 folders with new data (that is 4 blocks of 256MB each).
The blocks play an important roll in how hadoop writes the information, and it is observable in real time, since I’m piling 25 writes into the OS asynchronously, on the laptop (header node). So as each file copies over (the new 1Gib zero-filled initialized file), you can see the status of the copies if you do a refresh in http://18.104.22.168:50070, and clicking on the Utilities Section where the file system is defined.
The cluster then should be ready to be tested for data using Ag, in a few hours. It appears that 25 instances of 1GB chunks takes about 3 minutes to write using the PHP script I’ve authored.
Instead, this time the script is updating the first 100 folders and the new file size is 1GB (as opposed to the smaller size that was there b4). If memory serves correctly, I was initializing each folder with a total of 1GB of storage (ie. 1GB / 25) per file. But now each folder contains 25 1GB files, and it appears haddop is doing 4 i/o operations (256MB), per file, when writing out a given single file.
So lets wait it out, and see if Ag can find binary data, and how it is organized. In retrospect, I’m glad it ended up the way it did, because Haddop will have had to allocate a subsequent “write” to the disk, in order to accommodate the new size (incresed size) on the folder entry. In this way, I won’t assume something about how storage is allocated, in terms of the how the files are oriented on the platter, in a way that may have caused problems later (falsifying the assumption of a fixed continuity in the data on the volumes).
However, even still, it may have been beneficial had I not had to rewrite the data, because likely it was contiguous previously, and would have simplified Ag searches due to that reason. But I’m not going to start over at this point, but it is worth noting that if *no changes* to the allocated storage had resulted in a static environment for updates, this could have been a potentially very powerful feature to update the data, and bypass hadoop fs entirely.
Nevertheless, we’ll see in a minute what Ag does with locating constant/contiguous binary values within the hadoop environment, on each Linux Box, using the Ag search functions under linux.