Loss of WP Content

We had a technical glitch, and a number of my posts have been lost. No worries though. We’ll be back blogging very soon with some new ideas on how KWB will be progressing.

Some of the things I’ll be writing about are topics related to Distributed File Systems. In particular, I have been working with Hadoop, CephFS, and now Moose.

I will try to describe the major takeaways from installing and using these systems, as they relate to KWB.

Thanks for checking back on my blog! We’ll be back to it very soon.

SEO and Keywords

Being relatively new to SEO, I realized that finding keywords isn’t new.

A critical difference between finding keywords and finding lists of keywords comes down to “advertising”, or more specifically, to what people have decided constitutes a “list of things”, not to whether that list is complete in itself.

For example, there are X fruit species in the world, but not everyone is going to be interested in advertising fruit #3522 (some arcane tastiness that isn’t really interesting to anyone other than those studying fruit).

Anyway it all boils down to outsourcing data and using sites like DataForSeo.

They have a good service, especially the API, so I may end up using them to run through my test cases. I’ve signed up for a test account and am beginning to play around with integrating this with my scripts, but it is expensive, and I can’t really afford it.

Rolling my own “Keyword Associations” is not necessarily a good place to spend my time, so instead of pursuing that, in the coming posts I will be focusing more on the Big Data aspects of my solution.

I’m coming back around to researching Silver Searcher (Ag), and how rolling my own pseudo block-chain of URLs can optimize the TBs I have on my cluster.

Next post will be regarding that!

Identifying Lists based on Definitions

Lists, by KWB standards, are defined as a group of related nouns, pronouns or items having a relational connection, as in a “Topic”.

Also, these lists must have a clear definition for each “member” of the list.

To explore this idea, I have taken several examples and run them through a script I wrote over the weekend.

I’ve found the best way to play around with these ideas is to scrape dictionary.com for the definitions. Parsing the definitions and putting each definition in a separate file helps to break the process down logically.

Since ambiguity is the number one problem in assembling keyword lists, a separate folder will be allocated for each end user, primarily because of the diverse interests that can exist, and also to assist in debugging and development.

The definitions will live in a _definitions folder for each end user, and each filename will take the form $topic.$keyword.$x, where $x is a number from 1 to N and N is the number of definitions for a given word.
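A minimal sketch of this naming scheme, written in Python rather than PHP purely for illustration. The function names and the example user folder are hypothetical; only the _definitions folder and the $topic.$keyword.$x pattern come from the description above.

```python
from pathlib import Path


def definition_path(user_dir, topic, keyword, index):
    """Build <user_dir>/_definitions/<topic>.<keyword>.<index>."""
    if index < 1:
        raise ValueError("definition numbers start at 1")
    return Path(user_dir) / "_definitions" / f"{topic}.{keyword}.{index}"


def parse_definition_name(name):
    """Split 'topic.keyword.x' back into (topic, keyword, x).

    The number is taken from the right and the topic from the left,
    so this assumes the topic itself contains no dots.
    """
    stem, index = name.rsplit(".", 1)
    topic, keyword = stem.split(".", 1)
    return topic, keyword, int(index)


# Hypothetical end user folder "users/alice"
print(definition_path("users/alice", "fruit", "lime", 2).as_posix())
print(parse_definition_name("fruit.lime.2"))
```

Keeping the definition number as the last dotted component means a directory listing sorts all definitions of one keyword next to each other.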

This way, we can write a script to cross-reference definitions and look for common terms among them. With this logic we can determine if one or more definitions fit into a “group” of definitions.

For example, for all fruit defined, it is likely the definitions will mention the word “fruit”, or some other common term that logically categorizes those terms.

My exploratory script (which I will publish later) is intended to discover these relationships using real definitions scraped from http://dictionary.com, in the hope that the subset of terms garnered from the site is large enough to make the automation of list data worthwhile for as many use cases as possible.

I will publish the results in the next few days.

More on KWB List Identification

Identifying lists using a dictionary may initially work only for a subset of lists. The first step is to identify the subset of lists these ideas will work for.

Key to the identification process is the definition, and the language used for the definitions for each “component word”. A component word is a word that is a member of a list.

One thing worth noting is that in nearly all lists, there will be one or more words that are “ambiguous”. For example, in a list of fruit we have the word lime, but definition #1 concerns lime as in limestone.

One area I will explore is associating the later definitions of an ambiguous word with the definitions of words in the group that do NOT have ambiguity, or at least whose first definition is the one that describes the desired keyword.

Take the word “Apple” for example. In most dictionaries, we usually find the “bulbous fruit” sense as definition 1.

The idea here is that if we find a common word in the definitions of the non-ambiguous words (in this case “fruit”), then searching the definitions beyond ordinal 1 will likely, eventually, yield the fruit sense of lime.
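The disambiguation idea can be sketched roughly as follows, in Python for illustration. The definitions are short hand-written stand-ins, not text from dictionary.com, and the stopword list and function names are assumptions made for the example.

```python
import re

# Small, hand-picked stopword list (an assumption for this sketch)
STOPWORDS = {"a", "an", "the", "of", "with", "by", "and", "or", "in"}


def terms(text):
    """Lower-case a definition and return its words minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def pick_definition(definitions, anchor_terms):
    """Return (ordinal, text) of the first definition that shares a term
    with anchor_terms, or None if no definition does."""
    for ordinal, text in enumerate(definitions, start=1):
        if terms(text) & anchor_terms:
            return ordinal, text
    return None


# Hypothetical first definitions of words assumed to be unambiguous
unambiguous_first_defs = {
    "apple": "the round fruit of a tree of the rose family",
    "banana": "a long curved fruit with a yellow skin",
}

# Hypothetical definitions of the ambiguous word, in dictionary order
lime_defs = [
    "a white substance obtained by heating limestone",
    "a small green citrus fruit with acid juice",
]

# Terms shared by every unambiguous definition act as the anchor
anchor_terms = set.intersection(*(terms(d) for d in unambiguous_first_defs.values()))
print(anchor_terms)
print(pick_definition(lime_defs, anchor_terms))
```

With these stand-in definitions the anchor reduces to the single word "fruit", and the limestone sense at ordinal 1 is skipped in favour of ordinal 2. Real dictionary text would need a much larger stopword list.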

The Intersection of Words…

So while brainstorming on the potential to assemble lists, pre·lim·i·nar·i·ly, based on keywords, I’ve been able to at least break down the areas of research that need to be addressed.

The “Topic” name, as described by the end user will need to be accurate when using an approach of duality.

That is to say, if a list of Dogs exists (pardon the canine references) but the end user decides to call the list something like “Animals”, or some other less specific word that doesn’t describe the list accurately using a common vocabulary word, then the needle-in-a-haystack approach of using keyword intersections isn’t going to work.

But what do I mean by keyword intersections?

Well, specifically, PHP provides an intersection function, array_intersect, to which you pass the lists of keywords as arrays.

As in:

$result = array_intersect($array1, $array2);

Using this approach we can represent the keyword definitions as arrays of words. In the case of Dogs, these are the definitions for each of the dogs (with prepositions and other unrelated grammar removed).

So we would have the Airedale array, the Puli Array, the Yorkshire Terrier Array, etc.

The idea then is that by taking the intersection of the definitions of the dogs, and also coupling the Topic name to each array, a decent Topic name could be derived. The common word (from the Topic Name) is the game changer. However, this doesn’t always work.
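Here is a rough Python analogue of intersecting the definition arrays, using set intersection in place of PHP's array_intersect. The definitions are hand-written stand-ins rather than real dictionary text, the stopword list is an assumption, and the Topic name is left out to keep the example small.

```python
import re

# Hand-picked stopwords standing in for "prepositions and other grammar"
STOPWORDS = {"a", "an", "the", "of", "with", "and"}


def terms(text):
    """Return the set of non-stopword words in a definition."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def common_terms(definitions):
    """Intersect the term sets of every definition in the mapping."""
    return set.intersection(*(terms(text) for text in definitions.values()))


# Hypothetical definitions that all happen to mention "dog"
consistent = {
    "airedale": "a large terrier dog with a wiry coat",
    "puli": "a herding dog with a long corded coat",
    "yorkshire terrier": "a small dog with long silky hair",
}

# Same list, but one definition never uses the word "dog"
inconsistent = dict(consistent)
inconsistent["yorkshire terrier"] = "a small terrier with long silky hair"

print(common_terms(consistent))
print(common_terms(inconsistent))
```

The second call comes back empty, which illustrates why a strict intersection is fragile: a single definition that omits the shared word removes it from the result.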

Say we generalize a list as “Animals” and use the same logic. For the intersection to work, the definitions of the various animals (via the online dictionary) would need to follow a consistent criterion when describing an animal, and for most dictionaries, unfortunately, this is not the case.

So what might be proposed is to control the dictionary and pre-modify it to yield key constructs that give us the intended intersection best describing the Keyword Topic (independent of the end user having described the list using a Topic Name).

Then, when the intersection function is run, the MOST COMMON words across the arrays are selected, and these identify in some respect the type or subject of the list, because inherent in the definitions are common vocabulary words that can help to identify the topic.
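One way to read "most common words" is as a count of how many definitions contain each word, which tolerates a definition that omits the shared term. The sketch below takes that reading; it is an interpretation, not the published KWB script, and the definitions and stopword list are invented for the example.

```python
import re
from collections import Counter

# Hand-picked stopwords (an assumption for this sketch)
STOPWORDS = {"a", "an", "the", "of", "with", "and", "as"}


def terms(text):
    """Return the set of non-stopword words in a definition."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def most_common_terms(definitions, top=3):
    """Count, for each term, how many definitions mention it at least once,
    and return the `top` most frequent (term, count) pairs."""
    counts = Counter()
    for text in definitions.values():
        counts.update(terms(text))  # a set, so each definition counts once
    return counts.most_common(top)


# Hypothetical definitions; one of them never says "dog"
dog_defs = {
    "airedale": "a large terrier dog with a wiry coat",
    "puli": "a herding dog with a long corded coat",
    "yorkshire terrier": "a small terrier with long silky hair",
    "beagle": "a small dog used as a hound",
}

print(most_common_terms(dog_defs, top=1))
```

Here "dog" appears in three of the four definitions and still ranks first, even though a strict intersection of the same four definitions would be empty.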

On KWB Theory and Practice

The theory behind keyword buttons, and a critical aspect of its potential to succeed as a Big Data Content Management System model, depends on whether we will be able to programmatically assemble large lists of keywords that fall within a given topic, and on whether the subordinate keywords can also be of the “long tail variety”.

As described earlier, this is the main hurdle facing Keyword Buttons today.

Several ideas have come to mind since the inception of KWB, and I wanted to share the latest one in this post.

If we can assert that Keyword Information (long tail or otherwise) is scattered throughout the web, then certainly many “lists” of things that people are interested in also exist in these domains.

With the popularity of deep learning algorithms, and the libraries that enable the common man to explore deep learning (at least at a base level, using Python libs for example), we can envision the possibility of finally solving the issue of finding these hidden lists on the web, at least insofar as Keyword Buttons development is concerned.

Many dictionary sites do a good job of finding content that is related to the definitions of the words, but they do not work at the more abstract level of a “connected list”, that is, a list of nouns, proper nouns, or even concepts categorized under a given “list title”.

So we turn to search on the web, where various domains have many lists, presumably of “identical or near identical composition”. That is, if a person is interested in “Dogs”, for example, it is likely the Dog lover will have a list of Dogs on their site (or maybe not), and other Dog lovers may also have lists or links to lists of dogs.

What is asserted here is that assuming there are X+1 lists on the web where the list is duplicated on one or more sites, the strength of the list being a legitimate list is increased by a factor N, where N represents the similarities between these collective lists and other sites having identical or near identical lists.

Also key to this theory is the “proximity” of the list within the site code (or in terms of its rendered location(s)), which helps to authenticate code as being potentially of the “list flavor”. It follows that related websites will have duplicate or near duplicate list entities, with some variation in keywords, in identifiable areas of the site.

As a caveat, the entire list may or may not be exposed via HTML, depending on the technology used by the particular site hosting the list. The hope is that there will be enough sites out there that expose their list via the underlying HTML, and it really only takes two. The exposed list may also be partial, but that is something to concern ourselves with at a later point.

So it is proposed that by identifying keywords in areas of a website and comparing those keywords to other websites, a percentage of “similarity” can be asserted. This yields a probability that the lists on the disparate sites are in fact lists, and approximately one and the same, or at least that they are likely to be relevant in terms of what they represent (lists of dogs, for example).
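As one possible way to put a number on that "similarity", the sketch below uses the Jaccard index (shared items divided by all distinct items). The post does not name a measure, so Jaccard is a swapped-in assumption, and the two site lists are invented.

```python
def normalize(items):
    """Trim and lower-case list entries so 'Puli' and ' puli ' compare equal."""
    return {item.strip().lower() for item in items if item.strip()}


def jaccard(list_a, list_b):
    """Jaccard similarity of two keyword lists, from 0.0 to 1.0."""
    a, b = normalize(list_a), normalize(list_b)
    if not a and not b:
        return 0.0  # two empty lists tell us nothing
    return len(a & b) / len(a | b)


# Hypothetical lists harvested from two different sites
site_a = ["Airedale", "Puli", "Yorkshire Terrier", "Beagle"]
site_b = ["puli", "beagle", "Airedale", "Poodle"]

print(f"similarity: {jaccard(site_a, site_b):.0%}")
```

Three of the five distinct breeds appear on both sites, so these two lists score 60%. A threshold on this score could then decide whether two lists count as near identical.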

Dictionary References

I’ve contacted a Professor at BYU, in the linguistics department in order to throw around a few ideas about how to best integrate a good dictionary in KWB.

This has been a thorn in my side for development of KWB for a long time. Many dictionaries are available, but they are either expensive or carry copyright restrictions that frankly don’t make much sense to me.

When I first started KWB many years ago, I used a dictionary that was basically 26 files of tagged HTML, where each word had HTML tags that delineated the definitions. The problem with this is that many word endings, other forms of words, and more modern words were not included, not to mention “long tail keywords”…

Then I opted to look up definitions on Dictionary.com, but this is not really an option because I want a local storage area to contain the word definitions.

This all maps back to the need to generate lists from what I call “Topics”. Topics in my nomenclature are really just lists of things: lists of dog breeds (the AKC registered Dog List), the list of National Football Teams, and so on.

The idea here is that search can be targeted around “interests”, as opposed to just a single pointed term.

Finding content to fit the bill here is ongoing, and I’m trying to engage a few people on the web to come up with ideas for this.

One such person is in the SEO area, and the other, the BYU professor, seems somewhat interested but likely will be too busy to help. So if anybody out there knows a good way to generate List content of the type I’m describing, please contact me: cfleshner@fleshner[dot]com.

Taking care of the network…

Well I’ve had to put the searching of Hadoop storage areas (under the guise of Ag), on hold for the last day or two.

In particular my website http://keywordbuttons.com has been the main task for the last couple of days.

I’ve decided I’d better set up a test server permanently, so I’m not bringing the live site up and down all the time. I do want to get some traction with these ideas, and with the site not being consistently up, I thought it was time.

So I devoted one of my six nodes of Hadoop to run Apache and the PHP scripting environment. http://keywordbuttons.com/phpinfo.php

I soon discovered that the second node (for the test environment) was not running PHP 7.3, so I had to go through the task of getting both environments the same, even though it was only one minor version behind (7.2).

Nevertheless, the test server is up, and I can access it as if it were the live site using the regular DNS name http://keywordbuttons.com, because I tweaked the hosts file on my Windows test machine so that KWB resolves to the non-routable IP on my internal LAN.

Keeping things identical is critical for me, because when I want to move my changes from the test server to the live server, I can just copy the entire content folder over, and everything should work identically, assuming my software requirements are the same on each node.

When I initially set up Hadoop, I cloned the six 2TB hard drives from the same source, so the environments are identical, except for things I changed on the first node afterward (hence the PHP differential).

Anyway, now I’m feeling much better about moving forward, and seeking out ways to test Ag.

More on Ag tomorrow, but before I go down that road, I’m going to rewrite the Hadoop files with tagged initialization files. In this way I’ll have a more systematic way of searching the nodes for space when I want to insert harvested URLs.

Also, during this time, I will be playing around with standard approaches to writing the harvested URLs.

Rethinking the Cluster Config and File System operations…

So after a bit of thought and experimentation, I have decided on 100 folders, each containing 25 files. That is 2.5TB (with a replication factor of 2).

Each file in the cluster is 1GiB, and is initialized with actual data (in particular, binary zeros from /dev/zero on Linux).
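For illustration, a zero-filled (non-sparse) file of this kind could also be produced from Python instead of reading /dev/zero. This is only a sketch: the function name, chunk size and output filename are assumptions, not part of the actual setup.

```python
def write_zero_file(path, size_bytes, chunk_bytes=1024 * 1024):
    """Write size_bytes of binary zeros to path, one chunk at a time.

    The zeros are actually written out, rather than seeking past the end,
    so the file is not created as a sparse file.
    """
    chunk = bytes(chunk_bytes)  # a buffer of chunk_bytes zero bytes
    remaining = size_bytes
    with open(path, "wb") as fh:
        while remaining > 0:
            n = min(chunk_bytes, remaining)
            fh.write(chunk[:n])
            remaining -= n
    return size_bytes


# Example (hypothetical filename): a 1 GiB seed file
# write_zero_file("zeros_1GiB.dat", 1024 ** 3)
```

Writing in 1 MiB chunks keeps memory use small regardless of the target size.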

The idea is to use Silver Searcher (Ag) as an intermediary to the actual locations of the data on the dataNode platters. By having allocated the files using actual data (as opposed to sparse), I will be able to examine the disk (outside of the Hadoop environment, within Linux), and from there determine if it is possible for my Keyword Buttons application to populate the data areas independently of Hadoop’s fs functions (which are slow).

This is highly experimental, and likely I will drop the idea of using an alternative to writing to the hadoop cluster in this way, but I did want to at least try the idea out, to see how it would perform.

Right now (8:04am), the cluster is being re-formatted to fit the above criteria. Having previously created 1000 folders with 25 files in each, it took a few commands to clear out those folders (with -skipTrash) and to re-write the first 100 folders with new data (each file being 4 blocks of 256MB).
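The arithmetic behind those figures can be checked with a few lines of Python. The inputs are the numbers quoted in this post (100 folders, 25 files per folder, 1GiB per file, 256MB blocks, replication factor 2); the function and key names are just for the example.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def cluster_footprint(folders, files_per_folder, file_size, block_size, replication):
    """Return file count, block counts and byte totals for the layout."""
    total_files = folders * files_per_folder
    blocks_per_file = -(-file_size // block_size)  # ceiling division
    logical_bytes = total_files * file_size
    return {
        "total_files": total_files,
        "blocks_per_file": blocks_per_file,
        "total_blocks": total_files * blocks_per_file,
        "logical_gib": logical_bytes // GIB,
        "raw_gib": logical_bytes * replication // GIB,
    }


layout = cluster_footprint(
    folders=100,
    files_per_folder=25,
    file_size=1 * GIB,
    block_size=256 * MIB,
    replication=2,
)
for name, value in layout.items():
    print(f"{name:16} {value}")
```

That gives 2500 files of 4 blocks each, 2500 GiB of data before replication (the roughly 2.5TB mentioned above), and 5000 GiB on disk once each block is stored twice.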

The blocks play an important role in how Hadoop writes the information, and it is observable in real time, since I’m piling 25 writes into the OS asynchronously on the laptop (header node). So as each file copies over (the new 1GiB zero-filled initialized file), you can see the status of the copies if you refresh the browser and click on the Utilities section, where the file system is defined.

The cluster then should be ready to be tested for data using Ag, in a few hours. It appears that 25 instances of 1GB chunks takes about 3 minutes to write using the PHP script I’ve authored.

This time, the script is updating the first 100 folders, and the new file size is 1GB (as opposed to the smaller size that was there before). If memory serves, I was previously initializing each folder with a total of 1GB of storage (i.e. 1GB / 25 per file). But now each folder contains 25 1GB files, and it appears Hadoop is doing 4 I/O operations (of 256MB each) per file when writing out a single file.

So let’s wait it out, and see if Ag can find binary data, and how it is organized. In retrospect, I’m glad it ended up the way it did, because Hadoop will have had to allocate a subsequent “write” to the disk in order to accommodate the new (increased) size on the folder entry. In this way, I won’t assume something about how storage is allocated, in terms of how the files are oriented on the platter, that may have caused problems later (the assumption of a fixed continuity in the data on the volumes would have been false).

Even so, it might have been beneficial not to have had to rewrite the data, because it was likely contiguous previously, which would have simplified Ag searches. I’m not going to start over at this point, but it is worth noting that if *no changes* to the allocated storage had resulted in a static environment for updates, this could have been a very powerful way to update the data and bypass hadoop fs entirely.

Nevertheless, we’ll see in a minute what Ag does with locating constant/contiguous binary values within the hadoop environment, on each Linux Box, using the Ag search functions under linux.

2500 (1GB) Blocks in 6 DataNodes

So it took the weekend to initialize the cluster. I ended up having to break the process up into 3 jobs (for reasons I won’t get into right now), after I decided to go from 500 groups of folders to 1000. Here is the top level for the cluster, as it sits today.

The structure, as mentioned previously, is /user/cfleshner/folder/block, where folder is folderx (x = 1 to 1000) and block is blockx (x = 1 to 25), for a total allocation of 2.5TB.
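To make the layout concrete, here is a small Python generator that enumerates paths in this /user/cfleshner/folderx/blockx form. It only builds strings and does not touch HDFS; the function name and its parameters are assumptions for the sketch.

```python
def block_paths(user="cfleshner", folders=1000, blocks=25):
    """Yield /user/<user>/folder<f>/block<b> for f in 1..folders, b in 1..blocks."""
    for f in range(1, folders + 1):
        for b in range(1, blocks + 1):
            yield f"/user/{user}/folder{f}/block{b}"


# Show a tiny 2 x 2 layout
for path in block_paths(folders=2, blocks=2):
    print(path)
```

Generating the names from two counters keeps the folder and block numbering in one place, so the same function can drive both the creation job and any later overwrite job.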

As presumed, each block is a single file that is 1GB. The contents of each block file at this point are random (or “sparse” data), which has me a bit worried about how I will organize things in the future, especially given how slowly the I/O requests (via hadoop) are serviced in the cluster using native hdfs functions, as embedded in the script listed below.

[ More ideas on this later ]

In order to avert potential problems, and to assist in development, I want to work with clean platters of data (5TB over 6 drives, including dual replication), and therefore I will be running an additional job today. This job will write binary zeroes over each and every block file in the cluster.

Here is a link to the currently allocated data nodes. Keep in mind, each Linux box that has been allocated to the cluster (dataNodes) has a 2TB hard drive in it, so we are stopping allocations of the test environment at about 50% of resource availability on each machine.

Since I am most experienced with PHP, I wrote the script in PHP. It allocates the cluster and initializes the blocks by integrating command scripts (using the ksh interpreter) to optimize the checks of top level folders and to run jobs asynchronously.

I created Linux-based jobs using a template method (the template is actually a string with Linux code in it) to sub in the appropriate folder and block names. Since my interface to the cluster at this juncture is just hdfs, I needed a way to automate the process of creating the folder structure, and my script below fit the bill nicely.
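The template idea can be sketched in Python as follows. This is not the PHP/ksh script referred to in the post: the template strings, function name and parameters are illustrative assumptions, and the sketch only builds the command strings (using the standard hdfs dfs -mkdir -p and hdfs dfs -put -f commands) without running them.

```python
import shlex
from string import Template

# Command templates with placeholders for the names that vary per job
MKDIR_TEMPLATE = Template("hdfs dfs -mkdir -p $folder")
PUT_TEMPLATE = Template("hdfs dfs -put -f $local $remote")


def build_jobs(local_file, user, folder_no, blocks=25):
    """Return the shell commands that create one folder and fill its blocks."""
    folder = f"/user/{user}/folder{folder_no}"
    jobs = [MKDIR_TEMPLATE.substitute(folder=shlex.quote(folder))]
    for b in range(1, blocks + 1):
        jobs.append(
            PUT_TEMPLATE.substitute(
                local=shlex.quote(local_file),
                remote=shlex.quote(f"{folder}/block{b}"),
            )
        )
    return jobs


# 1Gb.dat is the seed file named in this post; blocks=2 keeps the output short
for command in build_jobs("1Gb.dat", "cfleshner", 1, blocks=2):
    print(command)
```

Quoting each substituted value with shlex.quote keeps a stray space or shell character in a name from changing the command that eventually runs.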

As mentioned in a previous post, my servers handled 25 concurrent hdfs put requests without much trouble, but if you decide to allocate a similar data structure and use my script, this will depend on what hardware you use and what else is going on within the server (especially when checking for existing running jobs). You may need to tweak it a bit.
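One way to cap the number of concurrent puts, sketched in Python with a thread pool. This differs from the approach described above, which checks for existing running jobs from the script; the function names and the default limit of 25 are assumptions taken from the figure quoted here.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_command(command):
    """Run one shell-style command string and return its exit code."""
    return subprocess.run(shlex.split(command)).returncode


def run_all(commands, max_concurrent=25, runner=run_command):
    """Run commands with at most max_concurrent in flight at once.

    Results come back in the same order as the input commands.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(runner, commands))


# Example (would actually invoke hdfs, so it is left commented out):
# codes = run_all(["hdfs dfs -put -f 1Gb.dat /user/cfleshner/folder1/block1"])
```

Lowering max_concurrent is the knob to turn on weaker hardware; the runner argument lets the scheduling be tried out with a stand-in function before pointing it at a real cluster.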

The entire script is listed here.

So in summary, next I will initialize all 1000 blocks of cluster data to binary zeroes, instead of the sparse data from 1Gb.dat. The new file, binary_zeros.txt (verified with HxD Editor), will be substituted in. Since we do not need to create folders the second time around, I’ll comment out the portions of code that generated and checked for folders and subfolders, keeping the asynchronous benefits intact.

I’ll be back tomorrow to describe how that went.