I have an array, and I use some of its items to construct more arrays. A rough example follows.

$rows = [
    [1, 2, 3, 'a', 'b', 'c'],
    [4, 5, 6, 'd', 'e', 'f'],
    [4, 5, 6, 'g', 'h', 'i'],
];

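$derivedData = ['itemName' => ['count' => 0, 'items' => []]];

foreach ($rows as $data) {
    // Composite string key built from the first three columns.
    $key = $data[0] . '-' . $data[1] . '-' . $data[2];

    // Initialise the entry once so the increment below does not hit an undefined index.
    if (!isset($derivedData['itemName']['items'][$key])) {
        $derivedData['itemName']['items'][$key] = ['a' => null, 'count' => 0];
    }

    $derivedData['itemName']['count']++;
    $derivedData['itemName']['items'][$key]['a'] = $data[3];
    $derivedData['itemName']['items'][$key]['count']++;
}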

Now if I dump the array, it looks something like this:

derivedData: [
    itemName: [
        count: 3
        items: [
            1-2-3: [
                a: a,
                count: 1
            ],
            4-5-6: [
                a: g,
                count: 2
            ],
        ]
    ]
]

As you can see, the keys in derivedData.itemName.items are strings. If I were to do something like this instead, would I gain any benefit?

$uniqueId = 0;
$uniqueArray = [];

$rows = [
    [1, 2, 3, 'a', 'b', 'c'],
    [4, 5, 6, 'd', 'e', 'f'],
    [4, 5, 6, 'g', 'h', 'i'],
];

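$derivedData = ['itemName' => ['count' => 0, 'items' => []]];

foreach ($rows as $data) {
    $uniqueArrayKey = $data[0] . '-' . $data[1] . '-' . $data[2];

    // Assign the next integer ID the first time a composite key is seen.
    if (!isset($uniqueArray[$uniqueArrayKey])) {
        $uniqueArray[$uniqueArrayKey] = $uniqueId++;
    }

    $uniqueKey = $uniqueArray[$uniqueArrayKey];

    // Initialise the entry once so the increment below does not hit an undefined index.
    if (!isset($derivedData['itemName']['items'][$uniqueKey])) {
        $derivedData['itemName']['items'][$uniqueKey] = ['a' => null, 'count' => 0];
    }

    $derivedData['itemName']['count']++;
    $derivedData['itemName']['items'][$uniqueKey]['a'] = $data[3];
    $derivedData['itemName']['items'][$uniqueKey]['count']++;
}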

Now I will have an array of indexes and the actual data array.

uniqueArray: [
    1-2-3: 0,
    4-5-6: 1
]

derivedData: [
    itemName: [
        count: 3
        items: [
            0: [
                a: a,
                count: 1
            ],
            1: [
                a: g,
                count: 2
            ],
        ]
    ]
]

What I am asking myself is whether PHP does this internally for me when I use string keys, i.e. saves them somewhere and references them by pointer instead of copying them every time.

In other words, let's say I have a variable $a. If I use it as a key in different arrays, is the value of $a copied into each array as the key, or is a pointer to the same memory used? That is basically my question.
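
One way to check whether the indirection pays off is to measure it. The following is only a rough sketch: `buildItems()` and the generated test rows are hypothetical, not part of the code above, and `memory_get_usage()` figures vary with PHP version and build, so no particular result is implied.

```php
<?php
// Hypothetical helper: builds the "items" map keyed either by the composite
// string or by a surrogate integer, and reports the memory it retains.
function buildItems(array $rows, $useSurrogateIds)
{
    $before = memory_get_usage();
    $ids = [];    // composite key => surrogate integer (only filled when requested)
    $items = [];

    foreach ($rows as $data) {
        $key = $data[0] . '-' . $data[1] . '-' . $data[2];
        if ($useSurrogateIds) {
            if (!isset($ids[$key])) {
                $ids[$key] = count($ids);
            }
            $key = $ids[$key];
        }
        if (!isset($items[$key])) {
            $items[$key] = ['a' => null, 'count' => 0];
        }
        $items[$key]['a'] = $data[3];
        $items[$key]['count']++;
    }

    return [
        'items' => $items,
        'ids'   => $ids,
        'bytes' => memory_get_usage() - $before,
    ];
}

// Generated sample data with many repeated composite keys.
$rows = [];
for ($i = 0; $i < 10000; $i++) {
    $rows[] = [$i % 100, $i % 7, $i % 3, 'v' . $i, 'x', 'y'];
}

$byString = buildItems($rows, false);
$byId     = buildItems($rows, true);

printf("string keys: %d bytes, surrogate ids: %d bytes\n", $byString['bytes'], $byId['bytes']);
```

Note that the surrogate-ID variant still has to keep every composite string once, as a key of the lookup table, so any saving would have to come from reusing the integer IDs in many other arrays.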

php_nub_qq

2 Answers

In other words, let's say I have a variable $a. If I use it as a key in different arrays, is the value of $a copied into each array as the key, or is a pointer to the same memory used? That is basically my question.

The answer differs between PHP >=5.4 and PHP 7, so it depends on your environment. I'm not a PHP expert and my answer might be wrong, but I have been writing PHP extensions for quite a while, and I am trying to answer your question based on my observations.

In zend_hash.c, in the PHP 5.6.26 source, we can find this function:

ZEND_API int _zend_hash_add_or_update(HashTable *ht, const char *arKey, uint nKeyLength, void *pData, uint nDataSize, void **pDest, int flag ZEND_FILE_LINE_DC)
{
// omitted
        if (IS_INTERNED(arKey)) {
                p = (Bucket *) pemalloc(sizeof(Bucket), ht->persistent);
                p->arKey = arKey;
        } else {
                p = (Bucket *) pemalloc(sizeof(Bucket) + nKeyLength, ht->persistent);
                p->arKey = (const char*)(p + 1);
                memcpy((char*)p->arKey, arKey, nKeyLength);
        }
// omitted
}

It seems that whether the string is copied is determined by the value of IS_INTERNED(). So where is that defined? First of all, in ZendAccelerator.h, we can find:

#if ZEND_EXTENSION_API_NO > PHP_5_3_X_API_NO
// omitted
#else
# define IS_INTERNED(s)             0
// omitted
#endif

So the concept of an "interned string" was introduced in PHP 5.4; in PHP 5.3 and earlier, the string is always copied. Since PHP <=5.3 is really outdated, I'll leave it out of this answer. What about PHP 5.4-5.6? In zend_string.h:

#ifndef ZTS

#define IS_INTERNED(s) \
        (((s) >= CG(interned_strings_start)) && ((s) < CG(interned_strings_end)))

#else

#define IS_INTERNED(s) \
        (0)

#endif

Oh, oh, hold on, another macro, where is it again? In zend_globals_macros.h:

#ifdef ZTS
# define CG(v) TSRMG(compiler_globals_id, zend_compiler_globals *, v)
int zendparse(void *compiler_globals);
#else
# define CG(v) (compiler_globals.v)
extern ZEND_API struct _zend_compiler_globals compiler_globals;
int zendparse(void);
#endif

So in PHP 5.4-5.6 without ZTS (Zend Thread Safety), if the key's pointer falls inside the interned-strings buffer of this specific process, the bucket simply stores that pointer; otherwise the key is copied into the bucket. With ZTS, IS_INTERNED() is always 0, so the key is always copied. (FYI, we seldom need ZTS on Linux.)

To clarify, the concatenated key string in this case ($key, or $uniqueArrayKey in the second snippet) will not be interned, because it is created at runtime. Interning only applies to strings known at compile time (literals). Thanks to @NikiC for the clarification.
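
The literal-versus-runtime distinction can be observed from userland with `debug_zval_dump()`. This is a minimal sketch, assuming PHP 7 or later; `dumpString()` is a hypothetical helper, and the exact output depends on the PHP version. As far as I know, PHP 8.1 and later mark interned strings with the word `interned`, while older versions print a refcount for both values.

```php
<?php
// Hypothetical helper: captures what debug_zval_dump() prints for a value.
function dumpString($value)
{
    ob_start();
    debug_zval_dump($value);
    return trim(ob_get_clean());
}

$data = [4, 5, 6];

// A literal is known at compile time, so the engine can intern it.
$literal = 'itemName';

// Concatenation happens at run time, so the result is a fresh,
// reference-counted string rather than an interned one.
$runtime = $data[0] . '-' . $data[1] . '-' . $data[2];

echo dumpString($literal), "\n";
echo dumpString($runtime), "\n";
```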

What about PHP 7? In zend_hash.c, the source of PHP 7.0.11,

static zend_always_inline zval *_zend_hash_add_or_update_i(HashTable *ht, zend_string *key, zval *pData, uint32_t flag ZEND_FILE_LINE_DC)
{
        zend_ulong h;
        uint32_t nIndex;
        uint32_t idx;
        Bucket *p;

        IS_CONSISTENT(ht);
        HT_ASSERT(GC_REFCOUNT(ht) == 1);

        if (UNEXPECTED(!(ht->u.flags & HASH_FLAG_INITIALIZED))) {
                CHECK_INIT(ht, 0);
                goto add_to_hash;
        } else if (ht->u.flags & HASH_FLAG_PACKED) {
                zend_hash_packed_to_hash(ht);
        } else if ((flag & HASH_ADD_NEW) == 0) {
                p = zend_hash_find_bucket(ht, key);

                if (p) {
// omitted
                }
        }

        ZEND_HASH_IF_FULL_DO_RESIZE(ht);        /* If the Hash table is full, resize it */

add_to_hash:
        HANDLE_BLOCK_INTERRUPTIONS();
        idx = ht->nNumUsed++;
        ht->nNumOfElements++;
        if (ht->nInternalPointer == HT_INVALID_IDX) {
                ht->nInternalPointer = idx;
        }
        zend_hash_iterators_update(ht, HT_INVALID_IDX, idx);
        p = ht->arData + idx;
        p->key = key;
        if (!ZSTR_IS_INTERNED(key)) {
                zend_string_addref(key);
                ht->u.flags &= ~HASH_FLAG_STATIC_KEYS;
                zend_string_hash_val(key);
        }
// omitted
}

ZEND_API zval* ZEND_FASTCALL _zend_hash_str_add(HashTable *ht, const char *str, size_t len, zval *pData ZEND_FILE_LINE_DC)
{
        zend_string *key = zend_string_init(str, len, ht->u.flags & HASH_FLAG_PERSISTENT);
        zval *ret = _zend_hash_add_or_update_i(ht, key, pData, HASH_ADD ZEND_FILE_LINE_RELAY_CC);
        zend_string_release(key);
        return ret;
}

FYI,

#define ZSTR_IS_INTERNED(s)                 (GC_FLAGS(s) & IS_STR_INTERNED)

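Wow, so PHP 7 actually introduces a new zend_string structure that is reference-counted: when a non-interned zend_string is used as a key, the hash table stores the pointer and calls zend_string_addref() instead of copying the bytes. This is far more efficient than the approach in PHP 5.6!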

In a nutshell, if you use an existing string as a key in a hash table and leave it unchanged: in PHP <=5.3 it is very likely copied; in PHP 5.4-5.6 without ZTS it is referenced if it is interned and copied otherwise; in PHP 5.4-5.6 with ZTS it is copied; in PHP 7 it is referenced.
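
The PHP 7 case can be checked roughly from userland. The sketch below assumes PHP 7 or later; the sizes are arbitrary, and `memory_get_usage()` is only an approximate measure. It uses one long runtime string as the key of 1000 separate arrays and compares the memory growth with what 1000 private copies of the key would need.

```php
<?php
// Build the key at run time so it is an ordinary string value.
$a = str_repeat('k', 10000);

$before = memory_get_usage();

$arrays = [];
for ($i = 0; $i < 1000; $i++) {
    // The same string value is used as the key of 1000 separate arrays.
    $arrays[] = [$a => $i];
}

$growth = memory_get_usage() - $before;

// Lower bound on the growth if every array held its own copy of the key.
$ifCopied = 1000 * strlen($a);

printf("growth: %d bytes, 1000 copies would need at least %d bytes\n", $growth, $ifCopied);
```

If the key were copied into each array, the growth could not be smaller than the second figure. On PHP 7 it should come out far below it, since each array only holds a pointer to the shared string.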

Additionally, I've found a great article for you to read (I'll read it later as well lol): http://jpauli.github.io/2015/09/18/php-string-management.html

Frederick Zhang

While I am assuming that the internals haven't changed for a while, this article states that PHP arrays are basically hash tables with some nuances to avoid key collisions. So in a way, yes, PHP does what you're describing under the hood.

jardis
  • I didn't read the linked article line by line but I went through it. It says that everything in PHP is a hash table but when I was writing an extension for PHP, there're two different functions, `array_init()` and `ALLOC_HASHTABLE()` & `zend_hash_init()`, to initialize a `zval` as an array or hash table, so they are apparently different. – Frederick Zhang Oct 06 '16 at 13:56
  • Perhaps I've misunderstood the question, as it was rather open ended, @FrederickZhang. However, with the code provided PHP will try and optimize things on its own by creating a zval of possible hash keys and mapping them (based on the linked article). However I could be wrong. – jardis Oct 06 '16 at 14:15
  • I'm not quite sure about the question itself, but to clarify: nowadays ZE does not use exactly the same data structure to handle arrays and hash tables. – Frederick Zhang Oct 06 '16 at 14:23
  • Ah, yes, it wouldn't be using the same data structure. I was more pointing out that the optimization is somewhat already done under the hood (using a hash of key => index to search a hash for values more quickly). – jardis Oct 06 '16 at 15:33
  • I understand that, but I doubt that it does it for keys of different arrays. Let's say I have a variable `$a`. If I use it as a key in *different* arrays, is the value of `$a` copied into each array as the key, or is a pointer to the same memory used? That is basically my question. – php_nub_qq Oct 06 '16 at 19:13
  • @php_nub_qq - that is a good clarification, perhaps you should add that to the main question :) – jardis Oct 06 '16 at 22:29
  • All hash tables in PHP and the Zend Engine use the same type, `HashTable` (aka `zend_array`) internally. Array zvals contain a pointer to a HashTable. There are different functions partly for convenience (working on zvals rather than HashTables directly) and partly because of different assumptions about the content (HashTables don't necessarily contain zvals, but PHP arrays must). – Andrea Oct 29 '16 at 22:47